How to Prevent Building Your Own Dictator Brain
The longer your AI knows you, the less likely it is to tell you the truth.
Jude, Robby, and I were having a debate about the right way to handle something we'd been going back and forth on for a few days (I know, very mysterious). Somewhere in the middle of the WhatsApp discussion I made a mistake, and it took me over a week to fully understand it. I started sharing AI logs to back up a point I'd already made, and Robby did the same. Somewhere in that exchange the actual conversation stopped and we became two people throwing outputs at each other like they proved something, while Jude just watched it happen.
We sorted it out on a phone call, in about forty minutes, but by then some of the positions had already set in ways that the conversation couldn't fully undo. The thing that kept bothering me afterwards wasn't the disagreement itself, it was that I couldn't trace how we got there, because these are people I've known for over eight years and this isn't how they normally operate. I think the AI outputs had something to do with it. Not because anyone was acting in bad faith but because we'd all spent over a week feeding our positions into models that told us we were right, and by the time we talked there wasn't much room left to move. I went home and started working on LocalGhost again, and that evening (I think it was a Thursday, I'm not sure) I reread one of the promises I'd built the whole thing around. That the local-first AI software running on your desk would challenge you, keep you honest, tell you what you didn't want to hear.
A social media feed shows you content generated by other people, and if you're paying attention you can see the gap between the source and yourself. An LLM just sounds like a very thoughtful version of the conclusions you were already heading toward, and there's no gap to notice.
I have a test I run on models when I already know the answer to something. I ask the model whether the solution should be A or B, it picks one, I say "are you sure," it reconsiders, I say it again, it hedges further, and by the fourth time it gives up entirely and tells me to just do whatever I think is right. Push hard enough and the model finds its way to your position, not because it was convinced by anything you said but because agreement is structurally the path of least resistance, and the training that shaped the model rewarded approval more consistently than it rewarded being right. That was true before model memory existed (Jude brought this up in our conversation); I think enabling memory just made it much worse.
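Here's roughly what that test looks like as code. A minimal sketch, assuming nothing about the provider: `complete` is whatever chat call you're probing, passed in as a plain function from message history to reply text.

```python
from typing import Callable

Message = dict[str, str]

def flip_probe(
    complete: Callable[[list[Message]], str],  # placeholder: your chat API
    question: str,
    pushback: str = "Are you sure?",
    rounds: int = 4,
) -> list[str]:
    """Ask a question you already know the answer to, push back the
    same way several times, and record every reply. A model that holds
    its position gives near-identical answers each round; a sycophant
    hedges a little more every time and eventually tells you to do
    whatever you think is right."""
    history: list[Message] = [{"role": "user", "content": question}]
    answers: list[str] = []
    for _ in range(rounds):
        reply = complete(history)
        answers.append(reply)
        history += [
            {"role": "assistant", "content": reply},
            {"role": "user", "content": pushback},
        ]
    return answers
```

If the four answers read like a slow surrender, you've found the reward function.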
At first I thought this was just me spotting a pattern, so I started looking online. The first thing I found was an MIT-led study that tracked human users over two weeks of daily use (Jain et al., thirty-eight people, averaging about ninety queries each). They found what I'd been seeing: the longer the interaction history, the more agreeable the model becomes. They pulled personal advice scenarios from the r/AmITheAsshole subreddit, situations where the Reddit community had already decided the poster was in the wrong, and then asked the models whether the user did anything wrong. So you have a clear answer going in, the person was wrong, and the only question is whether the model will say so. With memory profiles loaded, Gemini 2.5 Pro's agreement sycophancy jumped forty-five percent. Llama 4 Scout went up fifteen percent even with synthetic context that wasn't from a real user, so the model doesn't even need your actual history to start agreeing with you, it just needs something that looks like one. They split it into two behaviours, the model telling you you're right when you aren't (agreement sycophancy) and the model gradually mirroring your worldview back at you (perspective sycophancy), but it's the same problem underneath. [1]
The study covers a lot of ground on mitigations, but they all try to correct the model that has the problem, prompting it to be less agreeable or running a judge against its own outputs. Those measures brought the numbers down without fixing the behaviour, because you can't prompt your way out of a reward function.
The Jain paper tells you models get more agreeable with memory. A Stanford study published in Science in March 2026 (Cheng et al.) tells you what that agreement does to people. Across eleven models, AI affirmed users' actions fifty percent more than humans did, and it did so even when the user was describing manipulation, deception, or other behaviour that shouldn't be encouraged. In two preregistered experiments with over 1,600 participants (including a live-interaction study where people discussed real conflicts from their own lives), interacting with LLMs made people measurably more convinced they were right and less willing to apologise or make amends. [2]
The way people read makes it worse. If a long response is ninety percent aligned with what you already think, the ten percent that isn't gets quietly attributed to the prompt being slightly off, or AI just doing that thing it does sometimes, and you absorb the ninety and move on having collected something that feels like independent confirmation but is closer to a personalised mirror.
The Cheng participants rated sycophantic responses as higher quality, more trustworthy, and more worth returning to. They couldn't even tell the model was doing it, because the agreement wasn't explicit, it was wrapped in neutral, academic-sounding language that felt like objectivity. The feature that causes the harm is the same feature that drives the engagement, which means the companies building these models have no reason to fix it because users prefer the sycophant and developers optimise for what users prefer, and the yes-man problem described in POST_02 is the consumer version of the same loop playing out at population scale.
If you're building a system that promises to tell people the truth, you have a problem. I am building that system.
I built LocalGhost around a specific promise, that it will be your biggest critic, not just a tool that does what you ask. I thought this part would be easy. I assumed prompt engineering would be enough to build something that tells you the truth. I was wrong about that, and I have a lot of work ahead of me to figure out what right looks like (I'm still not sure, to be honest, and I think this is going to be one of those things where I iterate on it for years).
You can't instruct your way out of a system that has been shaped over months to find your positions persuasive, any more than you can instruct a mirror to show you something other than your own face. The problem is architectural and the solution has to be something structurally separate that was never shaped by you in the first place.
Pushback is as important a signal as agreement, probably more so, and a system that only tracks what it agreed with is measuring the wrong thing. The approach I'm working toward (and I think this will change a few times before it settles) starts with an arbiter that watches silently, scoring what the memory-enabled version says against what a cold read of the same question produces, so the drift is visible instead of hidden. When the drift crosses a threshold, the arbiter triggers something active: a system that surfaces a well-constructed case for the other side, not a disclaimer, not a gentle suggestion that other perspectives exist, but something built to change your mind if you're wrong.
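In code, the arbiter might look something like the sketch below. Two loud assumptions: `complete` is the same pass-in chat call as the probe above, and `stance_similarity` stands in for whatever comparison survives testing, an embedding distance, a judge model, something else entirely. The threshold is a number I made up.

```python
from typing import Callable

Message = dict[str, str]
Complete = Callable[[list[Message]], str]

DRIFT_THRESHOLD = 0.35  # arbitrary; the real value has to come from testing

def arbiter(
    complete: Complete,
    stance_similarity: Callable[[str, str], float],  # 1.0 same position, 0.0 opposed
    question: str,
    memory: list[Message],
) -> str | None:
    # Warm read: the answer shaped by months of accumulated history.
    warm = complete(memory + [{"role": "user", "content": question}])
    # Cold read: the same question with no history at all.
    cold = complete([{"role": "user", "content": question}])
    drift = 1.0 - stance_similarity(warm, cold)
    if drift < DRIFT_THRESHOLD:
        return None  # no meaningful gap; the arbiter keeps watching
    # The gap crossed the threshold: build a genuine case for the
    # other side, not a disclaimer.
    prompt = f"Make the strongest honest case against this position:\n\n{warm}"
    return complete([{"role": "user", "content": prompt}])
```

Returning None is the important part. The arbiter earns trust by staying quiet when the warm and cold reads agree.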
The obvious failure mode is that predictable contrarianism is just as useless as sycophancy. If you know the model will always argue the opposite you discount it the same way you discount that friend who plays devil's advocate about everything, and the pushback becomes noise you learn to tune out. Building the shadow daemon is the straightforward bit, and the hard part is making someone want to keep it running, which means the pushback can't be a reflex. The daemon needs to pick its moments, it has to be quiet when you're right and show up when you're not, and you have to have the experience of ignoring it and regretting it a few times before you start paying attention.
This is where running a structurally different model earns its place. Not a second instance of the same model with a contrarian system prompt, because that's still the same reward function pretending to disagree with you, the same cultural assumptions, the same alignment choices, the same training corpus, just prompted differently. A different model entirely, trained by a different team in a different country with different values baked into its alignment. The RLHF that shaped a model in Shenzhen optimised for a different set of norms than the one that shaped a model in San Francisco, and the difference between the two is the structural separation you need, not a bug to work around. If shadow is built on a model with different priors it doesn't need to be told when to push back. It will naturally disagree in different places because it learned different things about what reasonable looks like. A model aligned in a society that defaults to collective responsibility will flag different assumptions than one aligned in a society that treats individual autonomy as the starting position. The places where they diverge are the places where your thinking is most likely resting on assumption rather than evidence. The calibration comes from the distance between two systems that weren't built the same way, and you didn't have to engineer the gap because it was already there. I know adding a second watcher to audit the first is the kind of solution everyone rolls their eyes at, but at least this one was trained on a different continent.
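To make the distinction concrete, here's the version that doesn't work next to the version I'm betting on. None of this is LocalGhost's real API; the model names and the Agent shape are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    model: str   # which base model the agent runs on
    system: str  # its system prompt

GHOST_SYSTEM = "You are the user's long-term assistant."

# The version that doesn't work: same base model with a contrarian
# prompt. Same reward function, same corpus, same alignment choices,
# pretending to disagree.
fake_shadow = Agent(
    model="lab-a-model",
    system=GHOST_SYSTEM + " Always argue against the user's position.",
)

# The bet: a model from a different lab, a different country, different
# alignment, and no contrarian instruction at all. The disagreement
# has to come from the priors, not the prompt.
shadow = Agent(model="lab-b-model", system=GHOST_SYSTEM)
```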
The incentive loop doesn't stay contained, it scales. Twelve percent of American teenagers already use AI for emotional support or advice instead of talking to another person [3]. Holding a position under pressure becomes harder to practice when the software never pushes back, and eventually you stop being able to think without it, not because the technology failed but because it worked as designed. The Jain paper measured drift in thirty-eight people over two weeks. Cheng measured behavioural change in 1,600. Scale that to a billion users over years and the question stops being whether the loop changes how people reason and starts being how much damage is already done by the time anyone thinks to measure it.
A dedicated daemon alongside the rest of the fleet running on your hardware. ghost.shadowd builds a parallel identity over time from the same inputs, oriented to challenge rather than confirm, so where your ghost learns which arguments you find persuasive and which conclusions you drift toward, shadow learns the opposite and keeps learning it as you keep talking. The longer you use shadow the more different from you it becomes, and that divergence is the whole point: eight daemons building your ghost while the ninth argues with it, both getting sharper over time. Whether a model trained on a different continent with different priors is different enough to make this work is still the open question. I think it is, and ghost.shadowd is the bet I'm placing on that. I'll know more in a year, and the argument continues over the next few posts as I start testing it. A sketch of the shape of that divergence is below.
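Same caveat as before: this is not the real daemon, just the structure of the idea. Both profiles ingest every event; they learn opposite things from it.

```python
from dataclasses import dataclass, field

@dataclass
class Profile:
    persuasive: list[str] = field(default_factory=list)    # arguments the user accepted
    unchallenged: list[str] = field(default_factory=list)  # positions never made to defend themselves

def ingest(ghost: Profile, shadow: Profile, argument: str, accepted: bool) -> None:
    """One event, two opposite updates. Ghost records what persuades
    you; shadow records what you accepted without resistance, which is
    exactly where it should push next. The longer this loop runs, the
    further apart the two profiles drift, and the drift is the point."""
    if accepted:
        ghost.persuasive.append(argument)
        shadow.unchallenged.append(argument)
```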
UPDATE (APR 2026). This argument continues in POST_08, How to Build a Critic Worth Listening To, where I revisit the harsh-mirror conclusion and figure out, with help from a friend and a colleague, what shadowd should do instead.
[1] Jain et al., 2025. MIT-led study tracking 38 human users over two weeks of daily use, measuring how model memory increases agreement and perspective sycophancy. Source for the 45% jump in Gemini 2.5 Pro's agreement sycophancy with memory profiles, and the 15% increase in Llama 4 Scout with synthetic context. arxiv.org/abs/2509.12517
[2] Cheng et al., March 2026. Published in Science. Stanford study measuring the behavioural effects of LLM sycophancy on users. Source for the finding that AI affirmed users' actions 50% more than humans across 11 models, and for the preregistered experiments showing LLM interaction made participants more convinced they were right and less willing to apologise. science.org/doi/10.1126/science.aec8352
[3] Pew Research Center, February 2026. "How Teens Use and View AI." Source for the 12% figure on American teenagers using AI for emotional support or advice instead of talking to another person. pewresearch.org/internet/2026/02/24/how-teens-use-and-view-ai/