The Model Trap
The lock-in is the years of context the model has about how you think, and no export button exists for that.
I switched from ChatGPT to Claude on March 5th 2026 after the OpenAI Pentagon deal. Anthropic refused to let the Department of Defense use Claude for mass surveillance or autonomous weapons, got designated a supply chain risk for it, and Trump ordered federal agencies to stop using their technology. OpenAI stepped in hours later with a contract of their own, and Altman admitted it was rushed and looked "opportunistic and sloppy." ChatGPT uninstalls surged, Claude went to number one on the App Store, and I was one of the people who moved, not because Claude is better at everything but because I wasn't comfortable sending my work through a company that had just undercut a competitor for taking a position I agreed with.
I'd been using three models in parallel for different things: Claude for coding, ChatGPT for life stuff, planning, thinking through decisions, working through problems, and Gemini as a cross-reference layer to check what the other two told me. When I left ChatGPT I moved the personal conversations to Claude, which already had my coding context, and the merge was stranger than I expected. I'd be deep in a coding session at 4am and Claude would remind me I had a meeting at 11 and should probably take a break. The two relationships, one professional and one personal, had been separate, and collapsing them into one model meant it knew things about my life that felt odd in the middle of a function signature.

But the real cost wasn't the weirdness of the merge, it was what didn't come with me at all: the two years of personal context ChatGPT had accumulated, the conversations about my plans for the next few years, the struggles I'd worked through over the past few, the decisions I'd explained my reasoning on, all the accumulated understanding of who I am and what I'm trying to do that no system prompt can capture because it was built from hundreds of conversations, not a description I wrote.
It was like making a new friend versus having a friend who's known you for a while. The new friend is smart and helpful and you get along immediately, but they don't know your history, and every story has to be told from the beginning. There's an irony here I should name: part of what I lost was a sycophantic relationship I'd already written about being dangerous, and I still missed it, which tells you something about how powerful accumulated context is as a lock-in mechanism even when you know the context was making you worse.
The old lock-in was your files. You put your photos in iCloud, your documents in Google Drive, your music in iTunes, and leaving meant losing access to the things you'd stored. GDPR and years of pressure got us export buttons, and those help. The new version is harder to see because the valuable thing is the understanding the model built around your data over time, and that understanding lives in the model's context and memory systems and can't be exported because it was never yours to begin with.
Copilot understands your codebase. Not just the files but the conventions, the naming patterns, the architectural decisions, the workarounds you've accepted and the ones you haven't. Gemini knows your cross-referencing patterns, which claims you check and which ones you trust, and over time it builds a sense of what you're sceptical about. All of this accumulated context makes the tool more useful the longer you use it, which is the same dynamic that made social media sticky except this time the thing you lose by leaving is your working relationship with a tool that understands how you think.
This is enshittification with a different mechanism. The platform doesn't need to degrade the product to lock you in, it just needs to accumulate enough context that starting over feels more expensive than staying. And unlike files, this context can't be downloaded in a zip, because it's a learned behaviour pattern distributed across weights and memory and reinforcement history that no standard exists to export.
The cost of switching accumulates invisibly because nobody tracks context the way they track storage, and by the time you realise how much you've built up inside one provider's ecosystem, rebuilding it somewhere else feels like a project you'll get to eventually and never do.
The lock-in problem gets worse when you realise that most of the AI-powered products being built right now run on the same three or four foundation models underneath. Your note-taking app, your code editor, your email assistant, your customer support tool, your search engine, and increasingly your operating system are all calling the same handful of APIs. When everything runs on GPT-4 or Claude or Gemini, a subtle bias in one model propagates to everything built on top of it, a vulnerability in one model's training data affects every product that uses it, and a pricing change by one provider hits every business that depends on it simultaneously.
This is the supply chain risk that the Inflection post described at the extraction layer, but it's also a resilience problem. If a single model provider has a bad day (an outage, a policy change, a sudden content restriction), every product built on it has the same bad day. We've already seen this happen with API pricing changes that killed startups overnight and content policy updates that broke existing workflows without warning. The monoculture means there's nowhere to fail over to, because everyone's failover runs on the same infrastructure.
Model diversity is the architectural requirement that makes local-first work, because a sovereign system that depends on a single model inherits every vulnerability, every policy change, and every pricing decision that model's provider makes, which is the same dependency structure the whole project exists to escape.
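What swappability means in practice is that the daemons talk to an interface, never to a provider. A minimal sketch of that shape, with invented names (`ModelBackend`, `summarise_entry`, and the stand-in backends are all hypothetical, not LocalGhost's actual API):

```python
from typing import Protocol


class ModelBackend(Protocol):
    """Anything that turns a prompt into text. Hypothetical interface."""

    def generate(self, prompt: str) -> str: ...


class LocalLlama:
    """Stand-in for one local open-weight model."""

    def generate(self, prompt: str) -> str:
        return f"[llama] {prompt}"


class LocalMistral:
    """Stand-in for another."""

    def generate(self, prompt: str) -> str:
        return f"[mistral] {prompt}"


def summarise_entry(entry: str, model: ModelBackend) -> str:
    # The daemon depends only on the interface, so swapping providers
    # is a one-line change at the call site, not a rewrite.
    return model.generate(f"Summarise this journal entry: {entry}")
```

The point of the indirection is that no daemon ever inherits a provider's policy or pricing decisions by construction; the dependency lives in exactly one replaceable place.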
The counterargument to all of this is that local models aren't as good as frontier cloud models, and right now that's true. A 7B or 13B model running on a box under your desk is not going to match GPT-5 or whatever ships next quarter on raw capability. If you need the absolute best reasoning, the latest training data, the biggest context window, you need the cloud API and you need to pay for it and you need to accept the lock-in and the surveillance that comes with it.
The argument I'm making with LocalGhost is that for most of what a personal AI actually needs to do, good enough is enough. Summarising your journal entries doesn't require frontier intelligence, tagging your photos doesn't need a trillion parameters, and correlating your sleep data with your spending patterns is a problem a well-prompted 7B model can handle today. The tasks that matter for personal sovereignty, organising your data, finding patterns in your life, being honest with you about what the data shows, these don't need to be solved by the most capable model ever trained, they need to be solved by a model that runs on your hardware, answers to nobody, and can be swapped when something better comes along.
Open-weight models are improving fast enough that the gap between local and cloud capability is narrowing every few months. Llama, Mistral, Qwen, Gemma, the open ecosystem is producing models that would have been considered frontier eighteen months ago, and they run on consumer hardware. You won't be at the forefront of progress with a local setup, and I think for what we're building that will be enough, because the trade-off is a marginal capability difference versus complete architectural independence, and for personal AI the independence matters more than the margin.
What made the model switch I described at the top worse is that I had no way of knowing whether the new model was actually better for my use case. I assumed it was because the provider said so, because the benchmarks improved, because the internet was excited about it. But I don't have a preset list of expected outcomes for the things I actually use a model for, which means I'm relying on someone else's tests to decide whether a tool I use every day got better or worse, and their tests measure general capability, not whether the model understands my codebase or produces Go the way I write it. With cloud models you can at least lean on the provider's evaluation infrastructure and hope their priorities overlap with yours. With open-weight models running locally you can't even take that for granted, because nobody is testing whether Llama 3 handles your specific workflow better than Llama 2 did, that's entirely your problem, and most people don't have a systematic way to answer it.
The infrastructure for model evaluation exists, and it's substantial. MMLU, HumanEval, Chatbot Arena, dozens of standardised benchmarks that measure general capability across reasoning, coding, maths, instruction following. Enterprise teams have frameworks like OpenAI Evals, DeepEval, and Langfuse for building custom test suites against their production use cases. None of this is what I'm describing, because all of it is built for developers evaluating models before deployment or companies monitoring production systems, not for a person on a box under their desk who wants to swap from Llama to Mistral and know whether the daemons that organise their life still work the way they're supposed to. The evaluation gap is at the personal level, and that's the gap we're building into LocalGhost.
The approach works on two layers. The first is a set of standardised benchmarks that ship with the box, scenarios we've designed to test each daemon against known inputs and expected outputs. ghost.noted gets a journal entry and should produce a summary that captures these three points and misses none of them. ghost.synthd gets six months of sleep and spending data and should flag the correlation between poor sleep and impulse purchases, not the correlation between Tuesdays and coffee. ghost.shadowd, the adversarial daemon I described in the dictator brain post, gets tested against scenarios where the correct response is disagreement and scenarios where it should stay quiet. These ship with every box and give you a baseline: does this model pass the tests we've defined?
The second layer is personal, and this is the part that doesn't exist anywhere else. ghost.synthd doesn't just correlate your data, it watches how you interact with the daemons over time and builds a personalised test suite in the background from your actual usage patterns. The summaries you corrected, the correlations you confirmed, the pushback from shadow that you engaged with versus the pushback you dismissed, all of this becomes evaluation data that's specific to you. Occasionally synthd will surface a scenario it's drafting and ask you to review it, the same way it surfaces correlations in your data, so the personal layer isn't just passive observation but something you actively shape alongside the memory it's already building for you. When a new model comes out, the box runs it against both layers, and the standardised benchmarks tell you whether the model is generally capable enough while the personal scenarios tell you whether it's capable enough for the way you actually use it.
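One way the personal layer could turn usage into evaluation data is to log each correction and convert it into a preference case. A sketch under that assumption, with every field name invented for illustration:

```python
from dataclasses import dataclass


@dataclass
class Correction:
    daemon: str        # which daemon produced the output
    input_text: str    # what the daemon was given
    model_output: str  # what the model said
    user_fix: str      # what the user changed it to


def to_personal_scenario(c: Correction) -> dict:
    """Turn a logged correction into an evaluation case (sketch)."""
    return {
        "daemon": c.daemon,
        "input": c.input_text,
        "rejected": c.model_output,  # the answer the user corrected away from
        "preferred": c.user_fix,     # what a better model should be closer to
        "source": "usage",           # distinguishes this from shipped benchmarks
    }
```

A candidate model's answer to the same input can then be compared against `preferred` rather than against a generic benchmark, which is what makes the suite yours rather than the ecosystem's.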
The suite doesn't run constantly, it runs when you choose to evaluate a new model, and even then it doesn't need to run every scenario in one sitting. The whole approach rests on an assumption I should be explicit about: consumer hardware will keep getting better and open-weight models will keep getting more efficient. A 7B model on a mid-range GPU today is slow enough that running fifty scenarios feels like a commitment. The same hardware running whatever ships in eighteen months will handle it in a fraction of the time. If that assumption turns out to be wrong the scenario engine becomes impractical, but everything I've seen in the last two years suggests the trajectory is in our favour, and building for where the hardware is going rather than where it is now is a bet I'm comfortable making.
I want to be honest about what's hard here, because the two-layer framing is cleaner than the reality. The standardised layer has a measurement problem: a journal summary isn't a unit test, two good summaries of the same entry can look completely different, and deciding whether an output is "correct" for subjective tasks is a hard problem that doesn't have an obvious answer. If you use another LLM to judge the output you've reintroduced model dependency into the evaluation layer, and if you use deterministic checks you're measuring surface features not comprehension. The shadowd scenarios have a problem that might be fundamental: if the test suite defines when pushback is appropriate, you've constrained an adversarial daemon to only fire in predictable situations, which defeats the purpose of having an adversarial daemon at all.
Both layers store everything in a platform-agnostic format, documented JSON that lives in files and databases on your machine, not on our servers. The scenario definitions, the test results, the personal evaluation history, all of it is yours and exportable at any time, and there's an import function so you can bring evaluation data from wherever you were before. If you decide to leave LocalGhost you take everything with you, and if you're arriving from somewhere else you can bring what you had.
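To make the portability claim concrete, a stored scenario might look something like this; every field name here is illustrative, not the shipped schema:

```json
{
  "schema_version": "1",
  "daemon": "ghost.noted",
  "source": "usage",
  "input": "Slept four hours, shipped the release anyway.",
  "expect": {
    "must_mention": ["slept", "release"],
    "must_not_mention": ["Tuesday"]
  },
  "results": [
    { "model": "llama-3-8b", "passed": true, "run_at": "2026-03-12T09:14:00Z" }
  ]
}
```

The shape matters less than the properties: plain files, documented fields, nothing that requires our software to read back.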
The scenarios are the product specification and the model becomes a replaceable component that either passes them or doesn't. When a new model comes out, you run the suite, and if the daemons still behave the way both layers describe, you upgrade with confidence. If they don't, you know exactly where the new model breaks and you can decide whether the trade-off is worth it.