Why we are publishing this
In April 2026 AppRocket sunset Casetrack, the production case-management software product we had been building and operating for 18 months. Sunset is not failure — Casetrack served the firms that paid for it well, and the product taught us things no consulting engagement ever could. We sunset it because the lessons it taught us, when transferred across firms as a services engagement, deliver more value to more attorneys than a single SaaS product could ever reach.
We are publishing this retrospective for two reasons. The first is that the operational record we are about to share is uniquely uncopyable: most boutique AI firms in 2026 talk about legal AI without ever shipping a system that handles real matters in front of real attorneys. We did. The second is that mid-market law firms evaluating AI vendors deserve a primary-source account of what actually breaks in production, written by the people who broke it and recovered from it. This document is that account.
Operational specifics where we have firm permission to publish them are inline; figures we cannot publish without prejudicing former clients are marked [USER PROVIDE] and will be filled in or removed before the gated PDF version of this report ships. The strategic conclusions do not depend on the redacted numbers.
1. What we built and why we sunset it
Casetrack was a case management product built on top of a legal-vertical AI substrate. Attorneys at small and mid-market firms used it for intake routing, conflict-check resolution, doc-review assist, and matter-lifecycle tracking. The unifying thesis was that the case management layer was the right place to do AI in legal — not a standalone agent bolted on, not a foundation-model wrapper sitting next to existing systems, but the system of record itself, made AI-native from the substrate up.
We were right about the substrate thesis. We were wrong about the form factor. Three things became clear over 18 months in market:
First, the real value we delivered to attorneys was not the AI surfaces in isolation; it was the eval discipline behind them. The same five firms that paid us for Casetrack would have paid more for the eval discipline applied to their existing case management system. We were selling a product when the market wanted the methodology.
Second, the integration tax of replacing a firm's case management system is enormous. Even with Casetrack working better than the incumbents on every dimension we measured, the migration cost (data import, attorney retraining, billing reconciliation, conflicts-database normalization) dominated the perceived total cost of ownership. Every $50K Casetrack subscription was in practice a $250K decision once integration cost was honestly accounted for.
Third, the services revenue we generated alongside Casetrack — the implementation work, the eval framework consulting, the integration engineering — was higher-margin and faster-compounding than the SaaS subscription revenue. We were optimizing the lower-leverage business.
We sunset Casetrack to refocus on the higher-leverage business: a vertical AI implementation studio for mid-market law firms, applying the eval discipline and the agent templates we developed inside Casetrack to whatever case management system the firm already runs. The same agents (intake, conflict-check, doc-review, billing-recon) deploy faster and cheaper inside an existing system than as a replacement product, and the firm keeps its incumbent vendor relationships intact.
Casetrack customers were transitioned either to a migration path back to their prior system with our eval framework deployed on top, or to a custom services engagement in which we applied the Casetrack-developed surfaces to a system they already owned. Sunset comms, data export, and migration ran on a 12-month tail; no customer was forced to migrate before they had a working alternative.
2. What broke first in production
Every legal-vertical AI failure mode shows up eventually. The order in which they broke in our production system tells you something about what to design for first.
Eval drift was the silent killer
Within the first six months, our intake-triage agent's eval-passing routing accuracy dropped from [USER PROVIDE]% at launch to [USER PROVIDE]% six months later — without any change to the underlying model, prompts, or retrieval pipeline. The drift came from the firms themselves: as their matter mix shifted (one firm acquired a corporate practice group, another deprecated a litigation specialty), the agent's training-time distribution stopped matching the production-time distribution. We caught it because we ran eval continuously against a refreshed regression set; we would have shipped degraded performance for months otherwise.
The lesson: in legal AI, your evaluation framework is more important than your model selection. Model selection is a one-time architectural decision; eval drift is a daily operational concern. Most firms we now audit have neither. They have a foundation-model contract and no regression set. That is a system waiting to fail in front of a paying client.
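To make the operational point concrete, the sketch below shows the shape of the continuously run regression check that catches this kind of drift: a regression set drawn from the firm's own historical matters, re-scored on a schedule, with an alert when accuracy slips below the launch baseline. The names here (RegressionCase, route_matter, the alert margin) are illustrative assumptions, not Casetrack internals.

```python
# Minimal sketch of a continuous regression check for an intake-triage agent.
# RegressionCase, route_matter, and the alert margin are illustrative names,
# not Casetrack internals.
from dataclasses import dataclass

@dataclass
class RegressionCase:
    matter_summary: str   # anonymized intake text drawn from historical matters
    expected_queue: str   # the routing decision attorneys signed off on

def route_matter(summary: str) -> str:
    """Stand-in for the production routing agent."""
    raise NotImplementedError

def pass_rate(cases: list[RegressionCase]) -> float:
    """Fraction of regression cases the agent still routes correctly."""
    correct = sum(1 for c in cases if route_matter(c.matter_summary) == c.expected_queue)
    return correct / len(cases)

def check_for_drift(cases: list[RegressionCase], baseline: float, margin: float = 0.05) -> float:
    """Run on a schedule against a regression set that is refreshed as the
    firm's matter mix shifts; alert when accuracy falls meaningfully below
    the rate recorded at launch."""
    rate = pass_rate(cases)
    if rate < baseline - margin:
        print(f"ALERT: routing accuracy {rate:.1%} vs. launch baseline {baseline:.1%}")
    return rate
```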
Conflict-check edge cases destroy attorney trust
The conflict-check agent was the surface attorneys evaluated us most harshly on, and the surface where a single false negative could cost the firm the engagement. We learned three things the hard way:
- Entity normalization fails silently on non-Latin scripts. Cross-border matters involving parties with Pakistani, Chinese, or Arabic names produced false negatives on conflict checks because our entity-matching pipeline collapsed legitimate spelling variants. We added a script-aware matching layer in week 14; we should have had it on day one.
- Time-decay is critical and underappreciated. A conflict from 2018 is not weighted the same as a conflict from this quarter. We initially used uniform weighting; attorneys correctly rejected this as legally incoherent. The fix was a time-decay term in the matching score with attorney-tunable parameters per practice group; a minimal sketch of the normalization and decay terms follows this list.
- Show your work, every time. Attorneys do not trust a binary "no conflict" output. They want the reasoning trace — which parties were checked, what matched, why we ignored a near-match. The provenance UI we shipped in month nine is non-negotiable; without it, attorneys override the agent and stop using it.
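A minimal sketch of the first two fixes, under simplifying assumptions: Unicode-aware name normalization so spelling variants are not treated as distinct parties, and an exponential time-decay term with an attorney-tunable half-life per practice group. A production matcher also needs transliteration, phonetic matching, and alias tables for non-Latin scripts; nothing below is the Casetrack implementation.

```python
# Illustrative conflict-check scoring: Unicode-aware name normalization plus a
# time-decay term. A production matcher also needs transliteration, phonetic
# matching, and alias tables for non-Latin scripts; this shows only the shape.
import unicodedata
from datetime import date

def normalize_name(name: str) -> str:
    """Casefold and strip combining marks so spelling variants of the same
    party are not silently treated as distinct entities."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())

def time_decay(conflict_date: date, today: date, half_life_days: float = 730.0) -> float:
    """Exponential decay: a conflict from this quarter scores near 1.0, one
    from 2018 scores far lower. half_life_days is the attorney-tunable
    parameter, set per practice group."""
    age_days = (today - conflict_date).days
    return 0.5 ** (age_days / half_life_days)

def conflict_score(candidate: str, prior_party: str, conflict_date: date, today: date) -> float:
    """Name-match confidence weighted by recency. Exact normalized matching is
    shown here; real matching would be fuzzier."""
    name_match = 1.0 if normalize_name(candidate) == normalize_name(prior_party) else 0.0
    return name_match * time_decay(conflict_date, today)
```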
Billing-recon hallucinations were catastrophic but bounded
We had two clearly wrong billing-reconciliation outputs in our first year. Both were caught by the human-in-the-loop checkpoint we had designed in from day one. Neither reached a client invoice. The takeaway is not "the model hallucinated"; the takeaway is "the human-in-the-loop checkpoint did its job." We never operated billing-recon without a checkpoint, and we recommend no firm ever should. Billing is the surface where attorney-client trust meets economic reality, and it is not a place to learn the limits of your eval framework.
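The checkpoint itself is not exotic. Below is a minimal sketch of the structural property that mattered, with hypothetical names (ReconProposal, apply_to_invoice): agent output is only ever a proposal, and the one code path to an invoice refuses unreviewed work.

```python
# Sketch of the structural property that mattered: billing-recon output is a
# proposal, and there is no code path from proposal to invoice that skips a
# human reviewer. ReconProposal and apply_to_invoice are hypothetical names.
from dataclasses import dataclass, field

@dataclass
class ReconProposal:
    matter_id: str
    adjustments: list[dict] = field(default_factory=list)
    approved: bool = False
    reviewer: str | None = None

def submit_for_review(proposal: ReconProposal, review_queue: list[ReconProposal]) -> None:
    """Every agent-generated reconciliation lands in a review queue."""
    review_queue.append(proposal)

def apply_to_invoice(proposal: ReconProposal) -> None:
    """The only path to a client invoice, and it refuses unreviewed work."""
    if not proposal.approved or proposal.reviewer is None:
        raise PermissionError("unreviewed reconciliation cannot reach an invoice")
    # ...write approved adjustments to the billing system here...
```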
Doc-review citation was a slow burn
The doc-review-assist surface had no catastrophic failures — the breakdown was subtler. Over time, attorneys reported lower trust in the AI summaries, even though our eval scores said performance was holding steady. The cause was citation hover-card UX: when our citation-highlighting was slightly off (highlighting a paragraph when the relevant sentence was within it), attorneys read this as the AI "making things up" even when the underlying claim was correct. We rebuilt citation provenance to clause-level granularity; trust scores recovered within four weeks.
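For readers who want the mechanical version of that fix, here is a toy illustration of the granularity change: return the single sentence that best supports a summary claim rather than the whole paragraph. The word-overlap scoring is deliberately simplistic and is not the Casetrack implementation; it only shows why clause-level spans change what the attorney sees highlighted.

```python
# Toy illustration of clause-level citation provenance: instead of highlighting
# the whole supporting paragraph, surface the single sentence that best
# supports the summary claim. The scoring here is deliberately simplistic.
import re

def split_sentences(paragraph: str) -> list[str]:
    """Naive sentence splitter; real documents need a proper tokenizer."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

def best_supporting_span(claim: str, paragraph: str) -> str:
    """Return the sentence with the greatest word overlap with the claim, so
    the hover-card highlight lands on the text the attorney expects."""
    claim_words = set(claim.lower().split())
    sentences = split_sentences(paragraph)
    return max(sentences, key=lambda s: len(claim_words & set(s.lower().split())))
```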
The meta-lesson across all four surfaces: in legal AI, perceived correctness is a product of underlying model performance and the UX that surfaces the work. You can have a model that is technically right and a user experience that makes attorneys think it is wrong. Both must be production-quality.
3. What patterns transferred across firms
If we had to summarize what 18 months of operating production legal AI taught us in one sentence, it would be: there are four agent templates that generalize across mid-market law firms, and almost everything else does not. The four are intake triage, conflict-check resolution, doc-review assist, and billing reconciliation. Every firm we worked with deployed at least three of the four; the patterns transferred even though the specific implementations did not.
What that means in practice: the implementation work for firm number two of any of these four surfaces was 40–60% faster than for firm number one, because the substrate decisions (matter taxonomy structure, eval framework architecture, human-in-the-loop checkpoint design) were already validated. The firm-specific work was integration with their DMS, their billing system, their conflicts database — real engineering, but not novel engineering.
What did not transfer: every firm's matter mix, practice-area weighting, and risk tolerance are unique. Eval threshold tuning, regression set composition, and override-rate budgets had to be calibrated per firm. The substrate generalized; the calibration did not.
This is the single most actionable insight from the Casetrack experience for any firm now evaluating AI: the implementation patterns are reusable, the calibration is not. A vendor who tells you their off-the-shelf product will work without per-firm calibration is a vendor who has not operated production legal AI long enough to know better.
4. What we would do differently in 2026
We are now applying the Casetrack-developed patterns as a services firm rather than a SaaS vendor. The list below is what we would do differently if we were starting fresh today.
- Start with the eval framework, not the model. The first deliverable in any new engagement is the regression test set drawn from the firm's own historical matter data. Without this, no model selection or prompt architecture is defensible.
- Pick three surfaces, not ten. We tried to cover the full case-management lifecycle inside Casetrack. In a services engagement, we now scope a maximum of three AI surfaces in the first 12 months. Fewer surfaces, deeper eval discipline, faster trust accumulation.
- Build inside the firm's existing system. The integration tax of replacing a case management system is too high. Implementation work today goes into the firm's existing Clio, Centerbase, or whatever they run, not a parallel system. A vendor asking to replace the incumbent has to demonstrate 5x the value, and almost none do.
- Make human-in-the-loop a feature, not a fallback. The HITL checkpoint is what makes legal AI deployable at all. Surface it as a positive — "your malpractice exposure is bounded because every billing decision is reviewed" — not as an apologetic limitation.
- Publish your eval data with your customer's permission. The single biggest trust accelerator we got from Casetrack was being able to show prospective firms a redacted eval scorecard from a current customer. In a services context, we now structure customer agreements to permit public, anonymized eval reporting from day one. Firms that participate get pricing concessions; the marketing leverage is enormous.
- Sequence brand-and-discoverability before AI. The AppRocket / ABS & Co. engagement validated this emphatically. The firms that get the best AI deployment outcomes are the ones whose digital presence and matter-management hygiene were already production-quality. Skipping the foundations to chase AI ambition is the modal failure mode.
5. What this means if you are a mid-market firm evaluating AI vendors
If you are a managing partner, GC, or director-of-legal-innovation reading this to decide how to evaluate AI vendors in 2026, here is the buyer-side checklist drawn from our 18 months of production experience.
- Ask for the eval scorecard. Any vendor who claims production legal AI experience but cannot produce a redacted eval scorecard from a current customer is a vendor selling a demo, not a system. Walk.
- Ask which surfaces they will not build for you. A vendor who says yes to everything (open-ended legal research chat, autonomous matter management, no-checkpoint billing automation) is a vendor whose risk model does not match yours. The right answer to "will you build us X?" is sometimes "no, and here is why."
- Ask about override rates. Healthy override rates on AI surfaces in production are between 5% and 25% depending on surface — too low means attorneys are not engaging with the AI's output; too high means the AI is useless. A vendor who cannot speak fluently about override rates per surface has not operated long enough to have an opinion.
- Ask about non-Latin script handling. This is the canary test for entity matching in cross-border firms. If the vendor has not thought about this, they have not handled real cross-border matters.
- Ask about time-decay weighting in conflict checks. Same canary test for whether the vendor has actually shipped conflict-check at scale.
- Ask what they sunset and why. A vendor who has only shipped, never sunset, has not operated long enough to know what to retire. Sunsetting is a signal of maturity, not failure.
- Insist on integration into your existing system. Vendors who require system replacement are vendors who do not have integration engineering competence or are not willing to do the work. Either is a deal-breaker for a mid-market firm.
- Ask for the implementation team you will actually work with. Boutique vendors who deliver work via a tiered team of recent hires (the Big-4 model in disguise) will not produce mid-market-quality outcomes. Insist on the senior-engineer / senior-attorney delivery team in writing.
If you want a vendor evaluation conversation that uses this checklist directly, that is exactly what our AI Readiness Audit is structured around. Two weeks, $15,000, founder-led, vendor-neutral output. The audit will tell you whether AppRocket is the right partner, or whether someone else is — honestly.
Methodology and data integrity
This retrospective draws on operational data from Casetrack's production deployment between October 2024 and April 2026, covering [USER PROVIDE] firms and [USER PROVIDE] matters. Where specific quantitative metrics are referenced, they are aggregated across firms, anonymized, and verified against the underlying audit logs. Customer-attributable specifics are omitted; methodology specifics that do not prejudice individual customers are included.
The eval framework, regression set methodology, and human-in-the-loop checkpoint architecture described above are the same ones AppRocket now deploys in active services engagements. They are not aspirational; they are the system as it ran daily for 18 months.
The downloadable PDF version of this retrospective includes additional architectural diagrams, the full agent-template specifications for intake triage, conflict-check, doc-review, and billing-recon, and a vendor-evaluation worksheet derived from the checklist in section 5. The PDF is gated by email so we can update subscribers when revisions ship.