An Index Is Not a Person
They measure you, they reduce the measurements, they sort you into a bucket, and they sell the bucket.
I bought a Samsung Galaxy Watch in December and I've been reading my sleep numbers every morning since. Hours in bed, time to sleep onset, REM percentage, deep sleep minutes, awake events. Those numbers are legible. I don't check the Energy Score. Samsung's stated inputs are physical activity, sleep quality, sleeping heart rate, and sleeping heart rate variability, adjusted for age and gender, with weightings the company will not publish (the model has opinions about me I don't get to read) [1]. What Samsung calls my Energy Score is a comparison to other thirty-nine-year-old men, with the reference class baked in. If I sleep six hours and the model's population of men my age sleeps six and a half, my number falls. The individual at the centre of the score is a statistical fiction assembled from cohort data, and the person wearing the watch is being nudged toward that fiction's midpoint every morning.
Oura calls it Readiness [3], Whoop calls it Recovery [4], Garmin calls it Body Battery [5], Apple has a suite. Each one a single number, each one presented as a personal reading, each one running the same machinery. A 2025 systematic review by Doherty et al., covering fourteen of these scores from ten manufacturers, found that none of the companies disclose their algorithms, though 86% of the scores rely on heart rate and heart rate variability, 79% on resting heart rate, and 71% on physical activity and sleep duration [2]. The score is a vehicle for something else.
I spent most of the last decade building indices for a living (which is a strange way to spend a decade and probably the only thing that lets me write this post with any authority). The work I led for most of that time was CryptoCompare's aggregated price index, CCCAGG [6], which constructs a single digital-asset price (Bitcoin, Ethereum, etc.) by volume-weighting trades across roughly three hundred exchanges, with constituents reviewed monthly against published thresholds and the whole thing regulated under the UK Benchmark Regulation [7]. The methodology is published: an auditor can take the raw tick data, apply the documented rules, and reproduce the index. That is the shape of an honest reduction.
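The core of that kind of aggregation is small enough to sketch. This is a deliberately simplified illustration, not the CCCAGG methodology itself: the real index adds monthly constituent reviews, published exclusion thresholds, and outlier handling that are all omitted here, and the `Trade` type and exchange names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Trade:
    exchange: str
    price: float
    volume: float

def vwap(trades: list[Trade]) -> float:
    """Volume-weighted average price across constituent exchanges.

    Each trade's price contributes in proportion to its volume, so a
    thin venue quoting an outlier price cannot move the aggregate much.
    """
    total_volume = sum(t.volume for t in trades)
    if total_volume == 0:
        raise ValueError("no volume to weight against")
    return sum(t.price * t.volume for t in trades) / total_volume

trades = [
    Trade("exchange_a", 50_000.0, 2.0),
    Trade("exchange_b", 50_100.0, 1.0),
    Trade("exchange_c", 49_900.0, 1.0),
]
print(round(vwap(trades), 2))  # 50000.0
```

The point of the sketch is the auditability: given the same trades and the same documented rule, anyone reproduces the same number, which is exactly the property the wellness scores lack.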
A published index with audited constituents and a reproducible methodology does honest work because the aggregate is the referent. A Bitcoin price is genuinely a distributed quantity, trading on many venues at many prices at any given moment, and there is no single correct price outside the one an aggregator constructs. The reduction loses nothing essential because the essential thing, a consensus price across a fragmented market, did not exist before the index constructed it. Reductions can be correct, audited, and useful when the referent is real and the construction is open.
There is no distributed quantity called readiness the way there is a distributed quantity called a Bitcoin price. Readiness is something the model invents, not something it measures, and the number on the wrist is pointing at nothing outside the model that produced it. The weightings are not published, the methodology is not reproducible, and no regulator is auditing the construction.
The wellness score borrows the authority of an index without accepting any of the obligations that let one be useful.
Bucketing is not a corporate invention. Humans bucket each other constantly, the moment we see someone, before they have spoken. Tall, short, old, young, well-dressed, scruffy, threatening, safe. The classifier fires before we have time to consent to it firing, and most of the time it is wrong about the specific person in front of us, and most of the time we never find out, because we never speak to them long enough for the data to override the prior. The bucketing is instinctive and load-bearing for navigating public space. I am not going to pretend I do not run it, I run it, and everyone reading this does too.
What rescues the human version is that the bucket dissolves on contact. In 2014, a colleague of mine, Gordan, came round when I was recovering from a bike accident that broke my jaw and a few teeth. Gordan is a big man with a bushy beard, full leather gear, more skulls on him than I can count, and a helmet with the word "Hell" on it. While he was at the flat he saw our cat Melly, then a small kitten, picked her up, and was the gentlest, most patient person with her I have ever watched. He sat on the sofa with this small kitten on his chest until he had to leave. The bucket my housemates at the time ran on Gordan when they saw him dissolved in about ninety seconds (and I think about this every time I catch my classifier firing on someone). The version of him that exists in their head now is the version with my kitten on his chest.
The human bucketing is imperfect, sometimes badly so, but it has a self-correcting loop. The loop is the conversation, the time spent, the cat on the chest. The bucket stays open to revision because everyone is regularly mis-bucketed and everyone has had a bucket dissolve when someone showed up.
The wellness score is bucketing with the self-correcting loop removed. There is no conversation, no time spent, no version of the user that gets to dissolve the bucket by showing up. The score updates when the company decides to update its model, on the company's schedule, against the company's incentives, and the user has no handle. What the user gets handed is the bucket placement, dressed up as a personal reading.
The score is the user-facing artefact, the bucket is the commercial artefact, and the two are connected by design.
Samsung's published privacy policy for Samsung Ads states that the service may show personalised ads based on "demographic characteristics, preferences, choices, and interests", by collecting information about online activities across Samsung and third-party platforms [8]. The general Samsung privacy policy says the company uses the information it collects for personalised and generalised advertising and direct marketing on behalf of Samsung and its partners [9]. The Customization Service combines health data with account data, app usage, browsing history, and inferred interests to send customised marketing across a user's devices [10]. The health data feeds the targeting, and the score is where the targeting becomes legible.
This pattern is not limited to Samsung. Every wellness platform, every fitness app, every calorie tracker, every sleep product runs some version of the same machinery. The bucket you have been placed in determines which in-app upsell you see, which notification arrives, which supplement gets recommended, which premium tier you are nudged toward. The score is the interface and the bucket is the product.
The same shape repeats across the rest of the consumer economy with sharper teeth. FICO's credit-based insurance scores are used by roughly 95 percent of US auto insurers and 85 percent of homeowners insurers to bucket policyholders by claim-likelihood, with a correlation between credit data and insurance loss that FICO itself admits is empirical rather than causal [11]. Your premium goes up not because of anything you did, but because of which bucket the model placed you in. The same pattern shows up wherever a company has enough data about enough people to run a classifier, from Spotify's mood clusters [12] to Netflix's taste profiles [13] to banks' wealth segments. The bucket is the commercial product, the number shown to the user is the interface.
The score on your wrist tells you which bucket you are in. It does not tell you about you.
The harm is not in the bucketing instinct, the instinct is human and not the enemy. The harm is in the industrial scale-up of the instinct without the self-correcting loop the human version always had. There is no equivalent of Gordan walking into the room with the cat. The score does not update because you protested, the bucket does not dissolve because you showed up, the classifier keeps firing on its prior, and the user keeps being treated as a member of a category they were placed in by a model they did not see, optimised for outcomes that were never theirs.
Leaderboards are the variant with social comparison grafted on. Apple Watch competitions, Strava segment rankings, every fitness app's weekly challenges. The ranked presentation produces more engagement than a static number, and more engagement is what the platform is optimising for. The user is both the measured and the measuring stick.
A ghost that runs on your hardware, reasoning only from your data, has no cohort and no population to compare you against. It cannot tell you how your sleep compares to other people your age because it has no data on other people your age, and it is not going to get any. The architectural commitment that makes LocalGhost private also makes it non-comparative. There is no Energy Score the ghost can produce, no cohort segmentation it can hand to an advertiser, no bucket.
The ghost's refusal to hold a cohort is the refusal of the whole arrangement.
What the ghost does instead is describe you in terms of yourself. Your sleep this week is worse than three weeks ago by measurements you already trust. Your commit activity dropped around the same time your sleep did. Those are answers from your own data to questions about your own life, and the bucket the ghost places you in is yours. This is the architectural version of the bucket dissolving on contact, except the contact is with your own data instead of with a kitten on a sofa.
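A self-referential description of that kind can be sketched in a few lines. Everything below is a hypothetical illustration of the idea, not LocalGhost code: the function names and the metric label are mine, and the only structural point being made is that both the baseline and the comparison come from the user's own history, with no cohort anywhere.

```python
import statistics

def personal_baseline_report(history: list[float], recent: list[float],
                             metric: str = "sleep_minutes") -> str:
    """Compare recent values to the user's own past values -- no cohort.

    history: daily values from the user's own previous weeks.
    recent:  daily values from the current week.
    Flags a change only when it exceeds the user's own day-to-day spread.
    """
    baseline = statistics.mean(history)
    spread = statistics.stdev(history)   # the user's own variability
    current = statistics.mean(recent)
    delta = current - baseline
    if spread and abs(delta) > spread:
        direction = "above" if delta > 0 else "below"
        return f"{metric} is {abs(delta):.0f} {direction} your own baseline"
    return f"{metric} is within your normal range"

# e.g. personal_baseline_report(three_weeks_of_minutes, this_week)
```

Note what the function cannot say: it has no way to express "worse than other men your age", because no such reference exists in its inputs. The reference class is the same person, earlier.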
The ghost still reduces, it has to. The question is not whether reductions happen but whether they are legible, editable, and under the user's control. Every summary the ghost produces is a view, not a fact, and the raw context stays. If the ghost says you've been off this week, you can ask why, and the answer is the reasoning and the data points the ghost leaned on, not a reconstituted score. If the reasoning is wrong, you can correct it, and the category the ghost has placed you in is editable, and the model the ghost holds of you is something you can argue with. The loop the commercial version stripped out is the loop the ghost puts back.
Cohort comparison is sometimes the only way to catch what a personal baseline cannot. Apple Watch's atrial fibrillation detection works because it knows what a normal sinus rhythm looks like across millions of hearts, and the same goes for illness-onset signals, sleep apnoea screening, and the early signature of overtraining syndrome in HRV. The medical utility is real, and three weeks of your own baseline is not enough data to know what is normal for a human body versus what is normal for your body in trouble. The wellness industry's mistake is pairing the statistical tool with an ad pipeline by default, as if the cohort cannot be delivered without the user attached. A consensual cohort, locally applied and severable from any commercial system, is a different architecture and a future post.
There is one kind of comparison the ghost can legitimately perform. The ghost can compare itself to itself. Run the same question through the ghost twice, once with memory and once without, and the divergence between the two answers tells you how much the answer is leaning on context you supplied. Context can be wrong, misremembered, or stale, and the divergence is the signal that it might be. Run the same question against a different base model with the same inputs, and the divergence tells you how much the answer is specific to the model's priors rather than to your data. When the main ghost says one thing and shadowd, the adversarial sibling, says another, the gap is information about where the ghost's confidence is coming from.
That is the shadowd architecture. The comparison happens inside the ghost, with no external baseline, no cohort, and no population involved. The user is the reference, the ghost is the subject, and the comparison is between different ways the ghost can reason about the user.
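The memory-on versus memory-off comparison can be sketched concretely. This is a toy under loud assumptions: the `ask` callable, the threshold, and the token-overlap distance are all placeholders of mine, and a real shadowd comparison would presumably use a richer measure of disagreement than word overlap. The shape is the point: the same question runs twice, and the gap between the two answers is the signal.

```python
def divergence_signal(with_memory: str, without_memory: str) -> float:
    """Crude divergence between two answers via word-set overlap.

    0.0 means the answers use identical vocabulary, 1.0 means fully
    disjoint. High divergence suggests the answer is load-bearing on
    the context the user supplied.
    """
    a = set(with_memory.lower().split())
    b = set(without_memory.lower().split())
    if not a and not b:
        return 0.0
    jaccard = len(a & b) / len(a | b)   # similarity in [0, 1]
    return 1.0 - jaccard                # divergence in [0, 1]

def flag_if_context_heavy(question, ask, threshold=0.5):
    """Run the same question with and without memory and compare.

    `ask` is a hypothetical callable standing in for the ghost:
    ask(question, use_memory=bool) -> answer string.
    """
    answer_with = ask(question, use_memory=True)
    answer_without = ask(question, use_memory=False)
    d = divergence_signal(answer_with, answer_without)
    return {"divergence": d, "context_heavy": d > threshold}
```

The useful property is directional honesty: the signal says "this answer depends heavily on stored context", and nothing more. Deciding whether that context is right remains the user's job, which is the limitation the next paragraph is about.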
I am not entirely sure this is enough on its own (if anyone has a better answer here please tell me, I am not married to mine). Detecting that the answer is load-bearing on context tells you the answer might be wrong, but it does not tell you which way it is wrong, and the user still has to do the work of looking at the raw data and deciding. The shadowd architecture as currently designed flags the divergence and surfaces the reasoning that produced it, and whether that is enough or whether the ghost needs a richer set of disagreement signals is a thing I expect to keep iterating on for a while.
The architecture refuses the bucket. Not as a temporary state, not as something that gets added once the project finds its footing. There is no bucket sold to advertisers, no baseline imported from people you did not consent to share with, no comparative reading dressed up as a personal one. The ghost works for one person because the one person is the reference, and any version of this that sorts you against other people without your consent is the thing the rest of this post is arguing against.
A cluster of ghosts that ranks users against each other is every commercial product I am refusing to build.
Single-user fragility is the correct state, full stop. A system that has not earned the right to describe one person honestly has definitely not earned the right to rank millions of them against each other. The commercial platforms skipped the first step and went straight to the segmentation economy, which has been going about as well as you would expect for everyone except the people selling the segmentation.
The individuality argument is the argument under all the others. The manifesto argued for sovereignty at the data layer, the Model Trap argued for sovereignty at the model layer, How Memory Gets Made argued for sovereignty at the memory layer, and this post argues for sovereignty at the interpretation layer, the last layer the consumer platforms still own and the layer where the most damage gets done.
The reductions any system performs about a person should be inspectable and rejectable, the buckets it places someone in should be visible and editable, and the comparisons it draws should be to the person and not to a cohort imported from somewhere they did not consent to. Most consumer products do not work this way, and the ones that do not are not failing at the design, they are succeeding at a different design, the one that turns the user into a sortable record. A ghost that respects the user has to refuse that design at the architecture level, because anything else replicates what already exists.
The architectural commitments are clear, the implementation is in progress, and the harder questions of what a useful description looks like when it cannot lean on a cohort are questions I expect to be working on for a long time.
The next post is about a different way the same architecture can fail. A ghost good enough to be useful is good enough to compete with the messy human contact a person needs, and if the ghost wins that competition every time, the user ends up alone with a daemon. The work is teaching shadowd to recognise low-friction spirals from tallyd's heart rate, screen time, and social pointers, and to sometimes refuse to help when a phone call or a game of squash would be better. Efficiency is often a polite word for social avoidance, and a ghost that respects the user has to know the difference.
[1] Samsung Energy Score is calculated based on physical activity levels, sleep quality, sleeping heart rate, and sleeping heart rate variability, with weightings adjusted for age and gender. Samsung Research developed the feature in collaboration with Professor Patrick O'Connor (University of Georgia, Mary Frances Early College of Education's Department of Kinesiology). The relative weightings of each input and the precise algorithm used to combine them are not publicly disclosed. Samsung Newsroom announcement of the methodology and University of Georgia collaboration at news.samsung.com/samsung-collaborates-with-university-of-georgia. Samsung Health product page describing Energy Score requirements and inputs at samsung.com/us/apps/samsung-health.
[2] Doherty, C., Baldwin, M., Lambe, R., Burke, D. & Altini, M. (2025). "Readiness, recovery, and strain: an evaluation of composite health scores in consumer wearables." Translational Exercise Biomedicine, 2(2), 128-144. Peer-reviewed open-access systematic evaluation of 14 composite health scores across 10 major wearable manufacturers (Coros, Fitbit/Google, Garmin, Oura, Polar, Samsung, Suunto, Ultrahuman, WHOOP, Withings). The paper concludes that "none of the manufacturers disclosed their exact algorithmic formulas, and few provided empirical validation or peer-reviewed evidence supporting the accuracy or clinical relevance of their scores," and identifies that 86% of the scores rely on heart rate and heart rate variability, 79% on resting heart rate, and 71% on physical activity and sleep quantity, with substantial variation in calculation methodologies despite shared inputs. DOI 10.1515/teb-2025-0001. Article HTML at degruyterbrill.com/teb-2025-0001.
[3] Oura Readiness Score is calculated from nine contributors including resting heart rate, HRV balance, body temperature, recovery index, sleep, sleep balance, sleep regularity, previous day activity, and activity balance, evaluated against personal baselines accumulated over up to two months. The score ranges from 0-100 with interpretive bands (Optimal, Good, Fair, Pay Attention). Oura's documentation describes the contributor weightings as proprietary and notes that long-term metrics are evaluated against "general recommendations for people of your age" alongside personal averages. Oura support documentation at support.ouraring.com/Readiness-Score. Contributor breakdown at support.ouraring.com/Readiness-Contributors.
[4] WHOOP Recovery is a daily 0-100 score nominally indicating how prepared the body is to adapt to physical and mental stressors. Inputs include heart rate variability (weighted approximately 85%), resting heart rate, sleep performance, and respiratory rate, with WHOOP 4.0 also incorporating blood oxygen levels and skin temperature. The proprietary algorithm integrates these inputs against personalised baselines using methods that WHOOP does not publicly disclose in detail. WHOOP also publishes Strain (logarithmic 0-21 scale combining cardiovascular and muscular load) and Stress Monitor (0-3 scale) using overlapping but distinct methodologies. WHOOP Recovery documentation at whoop.com/how-does-whoop-recovery-work. WHOOP Strain documentation at whoop.com/how-does-whoop-strain-work.
[5] Garmin Body Battery is a 0-100 energy estimate combining stress (derived from heart rate, resting heart rate, and HRV), activity levels, and sleep, computed by Firstbeat Analytics under licence to Garmin. Recent data is weighted more heavily than long-term data, and the specific weightings of each input are not publicly disclosed. Garmin also publishes Training Readiness, which integrates sleep score, recovery time, acute training load, HRV status, sleep history, and stress history into a separate 1-100 score with interpretive bands (Prime, High, Moderate, Low, Poor). Garmin Body Battery documentation at garmin.com/garmin-technology/body-battery. Training Readiness documentation at garmin.com/garmin-technology/training-readiness.
[6] CCCAGG (CryptoCompare Aggregated Price Index) was the flagship index produced by CryptoCompare/CCData during the author's tenure as co-founder and CTO. The index aggregates trade data across roughly three hundred exchanges using a twenty-four-hour volume-weighted average for every currency pair, with monthly constituent reviews that drop exchanges whose thirty-day average price difference or price impact exceeds published thresholds. The index sits alongside the Exchange Benchmark, which rates exchanges AA through F based on liquidity, quality, and trustworthiness of data. The methodology, the constituents, and the inclusion and exclusion thresholds are all published, the index is regulated under the UK Benchmark Regulation, and an auditor can take the raw tick data, apply the documented rules, and reproduce the index. CCData (formerly CryptoCompare) was acquired by CoinDesk in October 2024 and the index has since been renamed CCIX. Methodology document at downloads.coindesk.com/CCIX-Methodology.pdf.
[7] The UK Benchmarks Regulation (UK BMR) is the regulatory regime governing the provision of, contribution to, and use of financial benchmarks in the UK. The onshored BMR came into effect at 23:00 on 31 December 2020, regulating any publicly available figure that is regularly determined by formula or other calculation method, where that figure is used to determine amounts payable under financial instruments or to measure investment fund performance. Benchmark administrators must be FCA-authorised, must publish methodologies, must maintain governance and oversight functions, and must address conflicts of interest. The FCA Benchmarks Register lists 36 UK benchmark administrators and 10 third-country administrators as of November 2023. FCA Benchmarks Regulation page at fca.org.uk/markets/benchmarks/regulation.
[8] Samsung Ads Privacy Policy, source for the language describing how the Samsung Ads service uses "demographic characteristics, preferences, choices, and interests" to show personalised ads across Samsung and third-party platforms. Available at policy.samsungrs.com/policy_en_in.html.
[9] Samsung U.S. Privacy Policy, source for the company's stated use of collected information for personalised and generalised advertising and direct marketing on behalf of Samsung and its partners. Available at samsung.com/us/account/privacy-policy.
[10] Samsung Customization Service U.S. Privacy Notice, source for the description of how the Customization Service combines account data, app usage, search terms, and browsing history with health data to send customised marketing communications, including the explicit consent flow for "Improve personalized ads with Samsung account data". Available at terms.samsungconsent.com/CZSVC.
[11] FICO Credit-Based Insurance Scores were introduced by Fair Isaac Corporation in the early 1990s and are used by approximately 95 percent of US auto insurers and 85 percent of US homeowners insurers in states where the practice is permitted, according to FICO's own estimates. The scores use credit report data weighted as payment history (40%), outstanding debt (30%), credit history length (15%), pursuit of new credit (10%), and credit mix (5%) to predict the relative likelihood of an insurance claim, allowing insurers to "group consumers by risk and then adjust premiums up or down based on the risk category assigned." The correlation between credit data and insurance loss is empirical rather than causal, and FICO's own documentation acknowledges that "no one has ever investigated whether a causal relationship exists." FICO insurance scores product page at fico.com/insurance-scores. NAIC explainer of credit-based insurance scoring at content.naic.org/credit-based-insurance-scores. FICO methodology page at insurancescores.fico.com/HowScoringWorks.
[12] Spotify Ads gives advertisers audience targeting based on listening behaviours, interests, demographics, and "real-time context" derived from playlist activity. Spotify's Ad Studio documentation describes how the platform "identif[ies] user-curated and Spotify-curated playlists aligned with popular activities and moments" and "deliver[s] targeted messages to users who are listening to playlists in each category," allowing brands to align ad delivery with listener mood, activity, and setting. Spotify's contextual advertising materials describe its first-party streaming intelligence as data that "reveals moods, mindsets, habits, and tastes in the moment." Spotify Ads audience targeting overview at ads.spotify.com/audience-targeting. Ad Studio targeting options documentation at ads.spotify.com/help-center/targeting-ad-studio. Contextual advertising solutions page at ads.spotify.com/contextual-advertising.
[13] Netflix divides its global subscriber base into approximately 2,000 "taste communities" or "taste clusters" that group viewers based on observed viewing patterns rather than declared demographics. Netflix's product vice president Todd Yellin described these as "taste doppelganger" profiles that group people with shared content preferences across genre boundaries, with the company explicitly preferring behavioural clustering to demographic segmentation. Taste clusters inform recommendation surfaces, artwork personalisation, and content production decisions. Netflix originals product launch strategy team member Olivia De Carlo characterised one such cluster as connecting Lost, Black Mirror, and Groundhog Day through "this through-line of supernatural or extreme worlds." BuzzFeed News interview with Netflix product VP Todd Yellin describing taste community methodology at buzzfeednews.com/netflix-recommendation-algorithm. Quartz coverage of Netflix's 1,300 taste communities and the move away from geographic segmentation at qz.com/netflix-divides-users-into-1300-taste-communities.