Before You Ask
A ghost worth building has to offer the right memory at the right moment without being asked.
Nick, a good friend of mine, is taking his PPL (private pilot licence) exam next week. He has been training for eight months, so the aircraft handling is there, the checks are there, but he still sometimes feels uncomfortable on the radio. The problem is not remembering the callsign, he knows it, he can write it down, he can recite it on the ground. The problem is that talking to the control tower is stressful, and under that kind of stress even the easiest thing, the callsign, becomes complicated. Knowing the callsign and being able to say it to a stranger on an open channel are not the same skill.
After talking to Nick I called my pilot friend Ionuț. I was curious how he handled talking to the tower. We grew up together on a military compound in Romania, our parents both worked for the air force, and he has been flying for a major carrier for about ten years. He laughed and said everyone deals with that, and that it is worse on commercial aircraft, because the radio work is a carefully choreographed exchange of phrases, codes, and sequenced questions between the flight deck and the controller, and you have to get each piece right, in order. You get used to it eventually, but it takes time and practice. On his first commercial flights the First Officer sitting next to him would handle the radio or cue him when he stalled. Not just on the callsign, on the whole sequence. He had practised it many times, he knew it, but knowing it in a simulator or on the ground lands differently when you are in charge of a plane with hundreds of people on board. The First Officer was the thing that closed the gap.
I have spent the last month designing the retrieval layer of LocalGhost. ghost.noted (extracts journal entries) takes whatever text arrives and publishes entries to the fleet. ghost.synthd (clusters entries into memories) consumes those entries the way POST_09 described, rolls memories into episodes, handles the queue of questions about which ones matter, and exposes an API the app layer queries. Vector embeddings in pgvector, a reranker where latency allows, a conversational layer over the top that LocalGhost's app talks to. The industry-standard pattern for this is RAG, and at small scale, with a well-curated index and a few hundred retrieved chunks, the pattern works. You can build a system that answers questions about your life reasonably well with the tools that exist today.
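Stripped to its core, that retrieval step is a similarity search. The toy below is not ghost.synthd, just the shape of the pattern: each entry carries a vector, the query carries a vector, and top-k by cosine similarity decides what comes back. The entries and their three-number "embeddings" are made up for illustration; a real index holds model-produced vectors in pgvector.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """Rank every entry against the query and keep the k most similar."""
    ranked = sorted(index, key=lambda e: cosine(query_vec, e["vec"]), reverse=True)
    return [e["text"] for e in ranked[:k]]

# Hand-made toy "embeddings"; a real index would hold model vectors.
index = [
    {"text": "callsign is G-ABCD",   "vec": [0.9, 0.1, 0.0]},
    {"text": "lunch with Ionut",     "vec": [0.1, 0.9, 0.2]},
    {"text": "radio check complete", "vec": [0.8, 0.2, 0.1]},
]

# A "radio-shaped" query vector pulls the two radio-adjacent entries to the top.
print(top_k([1.0, 0.0, 0.0], index))
```

At small scale this exhaustive scan is exactly what pgvector does without an index; the trouble described below only starts when the scan has to become approximate.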
The problem is that the question "what is my callsign" does not get asked at the moment Nick needs the callsign. At the moment he needs it, he is already too stressed to open a chat window and type his question. Even if he could type, the time to form the question is time he does not have. If the only way he has to reach his own memory is to ask it, the answer arrives after the moment has passed, which is the same as the answer not arriving at all.
Ionuț's First Officer did not wait to be asked. He watched the flight, he knew which phase they were in, he knew the radio call was due, and he cued the next line. The cueing was not a better retrieval, it was a retrieval triggered by something other than a query. The environment changed, the phase of flight advanced, and the cue surfaced the next thing Ionuț needed before Ionuț knew he needed it. No search engine sitting at the end of a text box gets to that.
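That difference fits in a few lines. The sketch below is a toy of the First Officer's behaviour, with an invented phase-to-call table: the function takes no question at all, only the old and new state of the world, and the phase transition itself is the trigger.

```python
# Hypothetical phase-of-flight to radio-call table; phases and calls are invented.
PHASE_CALLS = {
    "taxi":      "request taxi, callsign G-ABCD",
    "holding":   "ready for departure, G-ABCD",
    "departure": "G-ABCD, passing 1000 feet",
}

def cue_on_phase_change(old_phase, new_phase):
    """Surface the next call only when the environment advances; no query needed."""
    if new_phase != old_phase and new_phase in PHASE_CALLS:
        return PHASE_CALLS[new_phase]
    return None  # nothing changed, so nothing surfaces: silence is the default

print(cue_on_phase_change("taxi", "holding"))  # the transition itself is the query
```

Nothing in the interface accepts a question, which is the point: the retrieval key is the state change, not a string someone typed.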
There is a second reason retrieval alone is not enough, and it shows up once the index stops being small. RAG looks like a single pipeline, but it has two stages that can fail independently, and both get worse at scale. The first stage is the retriever. Anything beyond a few million vectors is running approximate nearest neighbour search rather than an exhaustive scan, because exhaustive does not fit the latency budget. HNSW is the dominant algorithm in this category and the one I am using under pgvector. By design it trades recall for speed, and the tradeoff gets harder as the index grows. More vectors means more candidates per query and more distance comparisons skipped, so the top-k that comes out of the graph includes fewer of the true nearest neighbours. Inserts and deletes over time also leave the graph in a worse state than a freshly built one. Latency stays fine, the API still responds, the top-k still looks plausible, but the set of candidates the ranker sees is less relevant than it was at 10k vectors [1]. Tuning ef_search and M buys some of this back, at the cost of memory and query time, but past a certain scale the tradeoff is structural, not configurable.
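The failure is easy to reproduce in miniature. The sketch below is not HNSW, just a crude approximation that fails in the same direction: the "approximate" search only ever compares a fixed budget of candidates, and as the index grows past that budget, recall@k against an exhaustive scan falls while latency stays flat.

```python
import random

random.seed(7)  # deterministic toy run

def dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exact_top_k(q, index, k):
    """Ground truth: exhaustive scan of the whole index."""
    return set(sorted(range(len(index)), key=lambda i: dist(q, index[i]))[:k])

def approx_top_k(q, index, k, budget):
    """Crude ANN stand-in: only `budget` candidates are ever compared."""
    cand = random.sample(range(len(index)), min(budget, len(index)))
    return set(sorted(cand, key=lambda i: dist(q, index[i]))[:k])

def recall_at_k(n, k=10, budget=200, trials=20):
    """Fraction of the true top-k the budget-limited search recovers, on average."""
    index = [[random.random() for _ in range(8)] for _ in range(n)]
    hits = 0.0
    for _ in range(trials):
        q = [random.random() for _ in range(8)]
        truth = exact_top_k(q, index, k)
        hits += len(truth & approx_top_k(q, index, k, budget)) / k
    return hits / trials

# Same query budget, ten times the vectors: recall drops hard.
print(recall_at_k(500), recall_at_k(5000))
```

Raising the budget is the toy equivalent of raising ef_search: it buys recall back, at the cost of query time, and the tradeoff only gets steeper as n grows.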
The second stage is the model that consumes whatever the retriever returned. Liu et al. (2023) showed that language models handed long retrieved contexts follow a U-shaped performance curve, strongest when the relevant passage is at the start or end of the context, weakest when it sits in the middle, even on models explicitly built for long contexts [2]. Chroma's 2025 "Context Rot" report tested 18 frontier models (GPT-4.1, Claude Opus 4, Gemini 2.5 Pro, Qwen3, and others) and found every one degraded as input length grew, on tasks as simple as retrieving a single fact, with the degradation worse when the query did not lexically match the target or when distractors were present [3]. A personal memory layer meant to store a life cannot stay small forever. Five years of notes and photos and messages is not a few hundred chunks, it is hundreds of thousands. Both failure modes compound at that scale. The retriever returns a looser shortlist, and the model attends to the wrong parts of the looser shortlist. The ghost has to start from the assumption that pure Q&A retrieval cannot hold a lifetime.
If RAG is not how the right memory reaches the right moment, I wanted to understand how the brain does it, because the brain clearly does it well. You do not search your memory for your partner's name when they walk into the room. The name arrives because they walked in. The context did the work. The brain does this constantly, for faces, for smells, for rooms, for the phase of a conversation, and the memories it surfaces are the ones that match the current state of the world, with the situation acting as the query and the retrieval following from it. Four things have to be true for that to work, and there is a real neural mechanism behind each one.
The first is a running description of where you are. The brain builds this representation in the hippocampus. Hippocampal place cells, first identified by O'Keefe and Dostrovsky (1971), fire in spatially tuned patterns that together encode the environment as a population code, and the population "remaps" to a distinct pattern in each distinct context [4]. The same machinery encodes non-spatial features of the context, which is why a conversation, a room, a phase of the day all count as locations in the same sense. Tarcsay and colleagues (2025) showed that this representation sharpens when context is behaviorally relevant and flattens when it is not [5]. The brain does not carry around a high-resolution model of the moment all the time. It carries the model it needs, and the current moment is the query that reaches into the rest of memory.
The second is priming, the quiet raising of candidacy for memories that match the moment. Collins and Loftus (1975) formalised this as spreading activation, where processing a cue partially activates semantically related concepts across an associative network, so that related memories require less energy to reach consciousness [6]. The neural signature is repetition suppression, a measurable reduction in activity in fusiform and prefrontal cortex when a previously-seen stimulus is re-encountered, because the pathways are already warm [7]. Priming is implicit memory in action, which is why amnesic patients who cannot form new explicit memories still show normal priming effects, and why the dissociation between the two systems was established in the first place [8]. Priming is cheap and broad. It happens at every moment, it touches everything the current context could activate, and it commits to nothing.
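Spreading activation is simple enough to sketch directly. The network and weights below are invented, but the mechanism is the Collins and Loftus one: processing a cue partially activates its neighbours, with activation fading as it spreads, so related memories sit closer to the surface without any of them actually being retrieved.

```python
# Toy associative network; edge weights are invented association strengths in [0, 1].
EDGES = {
    "doctor":   {"nurse": 0.8, "hospital": 0.7, "bread": 0.05},
    "nurse":    {"hospital": 0.6},
    "hospital": {},
    "bread":    {"butter": 0.8},
    "butter":   {},
}

def spread(cue, decay=0.5, depth=2):
    """Partially activate concepts reachable from the cue, fading with distance."""
    activation = {cue: 1.0}
    frontier = [cue]
    for _ in range(depth):
        nxt = []
        for node in frontier:
            for neigh, strength in EDGES.get(node, {}).items():
                a = activation[node] * strength * decay
                if a > activation.get(neigh, 0.0):
                    activation[neigh] = a
                    nxt.append(neigh)
        frontier = nxt
    return activation

acts = spread("doctor")
print(acts)  # "nurse" ends up far warmer than "bread": less energy to surface it
```

Note that the function surfaces nothing; it only changes how reachable everything is, which is all priming does.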
The third is a threshold, because the brain primes far more than it surfaces. You walk through the world and the hippocampus is constantly priming memories that match where you are, but almost none of them break through. Conway and Pleydell-Pearce (2000) described the executive suppression that holds back this flood, part of what they called the self-memory system, with the suppression threshold rising during focused attention and falling during diffuse attention [9]. Two things raise a primed memory over threshold. The first is distinctiveness, the degree to which the cue is unique rather than shared across many memories, formalised as the cue overload effect by Watkins and Watkins (1975) [10]. A familiar environment primes too many things at once and surfaces nothing cleanly. A distinctive cue surfaces the right memory sharply. The second is relevance to the current task, which involuntary memory research traces to bottom-up overlap between cue and memory content combined with top-down compatibility with what the person is trying to do [11]. Almost every moment passes without a surfaced cue, and the rare ones that produce one are the exception the threshold exists to find.
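A minimal version of that threshold, with invented numbers throughout: priming sets the base, cue overload divides it by how many memories share the cue, relevance scales it, and only a product that crosses the threshold surfaces anything.

```python
def surfacing_score(priming, fan_out, relevance):
    """fan_out is how many memories share the cue: the cue overload effect
    divides a cue's power among everything it points at."""
    distinctiveness = 1.0 / fan_out
    return priming * distinctiveness * relevance

THRESHOLD = 0.2  # would rise with focused attention, fall with diffuse attention

# Same priming, same relevance; only how widely the cue is shared differs.
distinct = surfacing_score(priming=0.9, fan_out=1, relevance=0.8)   # ~0.72
familiar = surfacing_score(priming=0.9, fan_out=20, relevance=0.8)  # ~0.036
print(distinct > THRESHOLD, familiar > THRESHOLD)  # → True False
```

The familiar cue is not weaker at priming, it is spread too thin, which is exactly why a well-known environment surfaces nothing cleanly.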
The fourth is the learning curve, the fact that a memory worth surfacing the first time is not worth surfacing the fiftieth. Fitts and Posner (1967) described the three phases of skill acquisition, the cognitive phase where the learner is effortfully constructing each action, the associative phase where errors decrease and fluency rises, and the autonomous phase where the skill runs without conscious attention [12]. The neural substrate shifts as the phases progress. Early on, hippocampus and prefrontal cortex do the heavy lifting. As the skill consolidates, activity transfers to the basal ganglia and cerebellum, and the explicit retrieval the skill originally needed becomes a procedural memory that does not require a cue [13]. This is what Ionuț described when I called him. The First Officer cued him on the first flights, fewer cues by the tenth, stopped cueing by the fiftieth. He told me he does not need the help anymore. The cueing did its job by making itself unnecessary. A good mentor tracks the curve and pulls back. A cue that arrives when the learner no longer needs it is interference, not help.
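The pull-back can be sketched as a decay over unaided successes. The half-life and floor below are invented, but the behaviour matches what Ionuț described: full cueing at first, fading fast, silent long before the fiftieth flight.

```python
def cue_strength(unaided_successes, half_life=3):
    """Every `half_life` clean executions halve the estimated need for a cue."""
    return 0.5 ** (unaided_successes / half_life)

def should_cue(unaided_successes, floor=0.1):
    """Go quiet once the skill looks autonomous; cueing past that is interference."""
    return cue_strength(unaided_successes) > floor

# Cues early, fades through the associative phase, silent in the autonomous one.
print([should_cue(n) for n in [0, 3, 9, 10, 50]])  # → [True, True, True, False, False]
```

The only state this needs is a counter of unaided successes per skill, which is also the only signal the First Officer was tracking.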
ghost.cued is the daemon that reads the user's environment and asks ghost.synthd to surface the right memory for the moment. That is the whole job. The four mechanisms behind it are the ones the brain has been running for hundreds of thousands of years, implemented in tech instead of tissue. The tech version can be better than the biological one in specific ways. The store is more organised, the ranker is more predictable, the curve is inspectable, and the tip-of-the-tongue state does not exist, because the answer is either in the archive and surfaced or it is not. ghost.cued does not need to be smarter than the brain to be useful. It needs to be cleaner.
All of this is very much work in progress and me thinking ahead rather than writing the code for what I can build now (procrastinating through blog posts). ghost.cued is not near-term. ghost.synthd has to work first, and most of what ships in the next few months is still the indexing side. The four mechanisms are the shape I think the daemon has to take when it eventually gets built, not the plan for next week.
[1] The HNSW algorithm is described in Malkov and Yashunin, 2018. "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs." IEEE Transactions on Pattern Analysis and Machine Intelligence, arXiv:1603.09320. Source for the algorithmic basis of most production vector indexes including pgvector, Weaviate, Qdrant, and Milvus. On recall degradation at scale with fixed parameters and the effects of dynamic inserts and deletions on graph quality, see the practitioner literature that has emerged from production deployments, including Bhat, Jan 2026, "HNSW at Scale: Why Your RAG System Gets Worse as the Vector Database Grows" (Towards Data Science), which shows controlled experiments of recall dropping as index size grows at fixed ef_search, and the Marqo 2025 research note "Understanding Recall in HNSW Search," which measured NDCG@10 drops of up to 18% from under-configured parameters and up to 17% from insertion order effects. arxiv.org/abs/1603.09320. Practitioner coverage at towardsdatascience.com/hnsw-at-scale-why-your-rag-system-gets-worse-as-the-vector-database-grows/ and marqo.ai/blog/understanding-recall-in-hnsw-search.
[2] Liu, Lin, Hewitt, Paranjape, Bevilacqua, Petroni, and Liang, 2023. "Lost in the Middle: How Language Models Use Long Contexts." Published in the Transactions of the Association for Computational Linguistics, vol. 12 (2024), pages 157-173. Source for the finding that language model performance follows a U-shaped curve over the position of relevant information in long contexts, with performance highest at the start and end and significantly degraded in the middle, and that performance drops as input length grows even for explicitly long-context models. The study tested multi-document question answering and key-value retrieval on GPT-3.5-Turbo, Claude-1.3, MPT, and others. arxiv.org/abs/2307.03172
[3] Hong, Troynikov, and Huber, July 2025. "Context Rot: How Increasing Input Tokens Impacts LLM Performance." Chroma technical report. Source for the finding that all 18 frontier models tested (including GPT-4.1, Claude Opus 4, Claude Sonnet 4, Gemini 2.5 Pro, Gemini 2.5 Flash, and Qwen3 variants) showed degraded performance as input length increased, even on simple retrieval tasks. Also the source for the finding that degradation is more pronounced when needle-question semantic similarity is low and when distractors are present in the context, both of which are standard conditions for a real personal memory index. The report challenges the assumption that models with large context windows process long inputs uniformly. research.trychroma.com/context-rot
[4] The foundational identification of hippocampal place cells is O'Keefe and Dostrovsky, 1971, "The hippocampus as a spatial map. Preliminary evidence from unit activity in the freely-moving rat," Brain Research, 34(1), 171-175. For the direct demonstration that hippocampal remapping predicts contextual memory retrieval, see Miranda, Silva, Morici, Coletti, Belluscio, Bekinschtein, and Schnell, 2024, "Retrieval of contextual memory can be predicted by CA3 remapping and is differentially influenced by NMDAR activity in rat hippocampus subregions," PLOS Biology. Miranda et al. show a direct link between external cues, hippocampal remapping, and episodic memory retrieval, and dissociate retrieval (CA3 NMDAR-dependent) from memory differentiation (dentate gyrus NMDAR-dependent). pmc.ncbi.nlm.nih.gov/articles/PMC11244845/
[5] Tarcsay, Masala et al., 2025. "The relevance of context in memory tasks influences the magnitude of hippocampal remapping." Cell Reports. Calcium imaging study in mice showing that hippocampal context representation sharpens when context is behaviorally required to solve a discrimination task, and flattens when context is irrelevant and generalization is advantageous. The effect is primarily driven by place cells near reward locations. Source for the claim that the brain's contextual representation is demand-adaptive rather than uniformly high-resolution. pmc.ncbi.nlm.nih.gov/articles/PMC12840010/
[6] Collins and Loftus, 1975. "A spreading-activation theory of semantic processing." Psychological Review, 82(6), 407-428. The foundational model in which memory is represented as an interconnected network of concept nodes, and processing any one concept partially activates related concepts along the network, reducing the activation threshold needed for them to reach consciousness. The mechanism that makes "doctor" prime "nurse" faster than "bread," and the theoretical basis for most modern accounts of how context biases retrieval.
[7] Schott, Henson, Richardson-Klavehn, Becker, Thoma, Heinze, and Düzel, 2005. "Redefining implicit and explicit memory: The functional neuroanatomy of priming, remembering, and control of retrieval." PNAS, 102(4), 1257-1262. fMRI evidence that priming is associated with reduced activity in fusiform, prefrontal, and occipital cortex (repetition suppression), while explicit memory is associated with increased activity in medial temporal, parietal, and prefrontal regions. Source for the claim that priming and conscious retrieval have distinct neural signatures. pnas.org/doi/10.1073/pnas.0409070102
[8] The neural dissociation of implicit and explicit memory was established through research on amnesic patients with medial temporal lobe damage, notably Henry Molaison (H.M.), who could not form new declarative memories but showed normal performance on priming and procedural tasks. See Squire and Zola-Morgan, 1991, "The medial temporal lobe memory system," Science, 253(5026), 1380-1386, and Schacter, 1987, "Implicit memory: History and current status," Journal of Experimental Psychology: Learning, Memory, and Cognition, 13(3), 501-518.
[9] Conway and Pleydell-Pearce, 2000. "The construction of autobiographical memories in the self-memory system." Psychological Review, 107(2), 261-288. Source for the model in which environmental cues constantly trigger partial activation of autobiographical memories, and an executive control system suppresses most of them before they reach conscious awareness, with the suppression threshold rising during focused attention and falling during diffuse attention. The theoretical basis for the claim that silence is the default and threshold-crossing is the exception.
[10] Watkins and Watkins, 1975. "Buildup of proactive inhibition as a cue-overload effect." Journal of Experimental Psychology: Human Learning and Memory, 1(4), 442-452. The cue overload effect, in which the probability of recalling an item declines as the number of items subsumed by its retrieval cue grows. The formal account of why a distinctive cue surfaces a memory sharply and a familiar, over-associated cue surfaces nothing cleanly. Directly informs the threshold logic in ghost.cued.
[11] Berntsen, 2021. "Involuntary autobiographical memories and their relation to other forms of spontaneous thoughts." Philosophical Transactions of the Royal Society B: Biological Sciences, 376(1817), 20190693. Review of the evidence that involuntary autobiographical memory retrieval is driven by bottom-up cue-content overlap (encoding specificity, per Tulving and Thomson 1973) combined with top-down relevance to the person's current goals. External, distinctive cues produce sharper involuntary memories than internal, familiar cues. royalsocietypublishing.org/doi/10.1098/rstb.2019.0693
[12] Fitts and Posner, 1967. Human Performance. Brooks/Cole. The three-phase model of skill acquisition, the cognitive phase (effortful, explicit, attention-heavy), the associative phase (errors decrease, fluency rises, conscious effort diminishes), and the autonomous phase (skill runs without conscious attention). The foundational description of the learning curve that a cueing system has to track in order not to become noise.
[13] Ashby, Turner, and Horvitz, 2010. "Cortical and basal ganglia contributions to habit learning and automaticity." Trends in Cognitive Sciences, 14(5), 208-215. Review of the evidence that the associative striatum is selectively active during early learning and the sensorimotor striatum takes over as automaticity develops, with the prefrontal cortex progressively disengaging. The neural substrate for the transition from cued, effortful execution to automatic performance without conscious attention. pmc.ncbi.nlm.nih.gov/articles/PMC2862890/