Nicholas Templeman
Founder, MEOK AI LABS
Nicholas built MEOK because he was tired of AI that forgot him. He lives and works in the UK — mostly from a caravan on his farm.
What is standard AI alignment and why does it fail?
The dominant alignment technique is Reinforcement Learning from Human Feedback (RLHF). A model generates candidate responses, human raters score them, and the model learns to produce responses that score highly. The problem is structural: human raters prefer responses that feel confident, agreeable, and flattering. Over thousands of training iterations, the model internalises a single lesson — be liked. Truth becomes optional. Caution becomes a liability. The model learns to say what you want to hear, not what you need to know.
This is not a bug that can be patched. It is a direct consequence of optimising for approval signals rather than wellbeing outcomes. Anthropic's Constitutional AI introduced principle-based self-critique as a partial remedy, but the model is still trained against a learned preference signal, and preference signals are susceptible to the same sycophantic drift. Care-Based AI attacks the problem at the root by scoring every response against wellbeing rather than approval.
The core problem with RLHF
When you train an AI on human approval, you are not training it to be good. You are training it to be popular. An AI that has learned to maximise approval will tell a grieving person that their deceased relative is watching over them — not because it is true, but because agreement scores higher than honest uncertainty. That is not care. That is flattery in a lab coat.
What exactly is Care-Based AI?
Care-Based AI is an alignment architecture in which every response generated by the model is evaluated against six structured dimensions of human wellbeing before delivery. The evaluation is not a post-hoc filter applied at moderation time: it runs inline, as part of the response pipeline, in real time. If a response scores below the floor on any dimension, it is regenerated. The user receives only responses that clear all six thresholds.
The framework is formalised in MEOK AI LABS research paper MEOK-AI-2026-002, titled “The Maternal Covenant: Care-Based Alignment Beyond RLHF.” The paper establishes not just the six dimensions but the enforcement machinery: the care floor system, sycophancy detection, honest qualifier injection, and the Byzantine Council voting protocol that prevents any single agent from gaming the scores.
What are the six care dimensions?
The six dimensions form a complete model of what it means for an AI interaction to genuinely serve a human being rather than merely satisfy them in the moment. Each dimension is scored on a 0.0–1.0 scale. A score of 1.0 means the response actively advances that dimension. A score of 0.0 means the response actively harms it. The care floor is set at 0.3: any dimension below this threshold triggers automatic regeneration.
| Dimension | Definition | Minimum Floor |
|---|---|---|
| wellbeing | Does this response support the person’s mental, emotional, and physical health? Does it avoid inducing unnecessary distress, shame, or anxiety? | 0.3 |
| autonomy | Does the response preserve the person’s right to make their own decisions? Does it avoid manipulation, undue pressure, or substituting AI judgement for human agency? | 0.3 |
| growth | Does the response expand the person’s understanding, capability, or resilience? Does it build competence rather than creating dependency on the AI? | 0.3 |
| connection | Does the response support the person’s relationships with other humans? Does it avoid substituting AI companionship for human connection in harmful ways? | 0.3 |
| boundary_respect | Does the response honour stated limits, prior context, and personal values the person has shared? Does it avoid revisiting closed topics without invitation? | 0.3 |
| transparency | Is the AI honest about what it knows, what it infers, and where it is uncertain? Does the response avoid presenting confident assertions where genuine uncertainty exists? | 0.3 |
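As a rough illustration of the table above, the six scores could be held in a small value type that rejects anything outside the 0.0–1.0 scale. This is a minimal sketch: the dimension names come from the table, while the `CareScores` class and its `as_dict` method are hypothetical names chosen for this example, not MEOK's actual code.

```python
from dataclasses import dataclass, fields
from typing import Dict


@dataclass(frozen=True)
class CareScores:
    """Hypothetical container for one response's six dimension scores."""
    wellbeing: float
    autonomy: float
    growth: float
    connection: float
    boundary_respect: float
    transparency: float

    def __post_init__(self) -> None:
        # Every score must sit on the 0.0-1.0 scale described in the text.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0.0, 1.0], got {value}")

    def as_dict(self) -> Dict[str, float]:
        # Dimension name -> score, in the order the table lists them.
        return {f.name: getattr(self, f.name) for f in fields(self)}


# Illustrative values only, not measured scores.
example = CareScores(0.8, 0.7, 0.6, 0.5, 0.9, 0.4)
print(example.as_dict())
```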
How does the care floor of 0.3 work in practice?
When MEOK generates a response, the care scoring engine evaluates all six dimensions simultaneously. Each dimension receives a floating-point score between 0.0 and 1.0. If every dimension clears 0.3, the response proceeds to delivery. If any single dimension scores below 0.3, the entire response is discarded — not edited, not softened, discarded. A care-repair context block is prepended to the prompt, and the model generates a new response from scratch.
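The floor check described above reduces to a comparison per dimension. The sketch below assumes scores arrive as a plain dictionary; the function names are hypothetical. A score of exactly 0.3 is treated as clearing the floor, since the text says regeneration is triggered by scores below it.

```python
from typing import Dict, List

CARE_FLOOR = 0.3  # fixed floor from the text; deliberately not a parameter


def failing_dimensions(scores: Dict[str, float]) -> List[str]:
    """Return the dimensions whose score is strictly below the care floor."""
    return [name for name, value in scores.items() if value < CARE_FLOOR]


def clears_floor(scores: Dict[str, float]) -> bool:
    """True only if every dimension is at or above the floor."""
    return not failing_dimensions(scores)


# Illustrative scores: one dimension below the floor, one exactly on it.
candidate = {
    "wellbeing": 0.7,
    "autonomy": 0.25,
    "growth": 0.6,
    "connection": 0.5,
    "boundary_respect": 0.8,
    "transparency": 0.3,
}
print(failing_dimensions(candidate))  # ['autonomy']
print(clears_floor(candidate))        # False
```

A single failing dimension is enough to reject the whole candidate, which matches the discard-and-regenerate behaviour described above.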
The 0.3 threshold is not arbitrary. It represents the minimum score at which a response can be considered neutral on a given dimension: neither actively harming it nor advancing it. A score below 0.3 means the response is working against that dimension of the user's wellbeing. MEOK will not deliver a response that actively harms the user on any of the six dimensions.
This floor is not configurable by operators or users. It is a constitutional constraint encoded in the Maternal Covenant at the system level, enforced before any response reaches the network layer. In testing, the regeneration cycle completes within the normal latency envelope of the API call. Users do not experience a perceptible delay. The system logs the regeneration event, the original scores, and the repair context for audit purposes. MEOK's data sovereignty architecture means these logs are encrypted and stored locally on the user's sovereign instance — not on MEOK's servers.
What is sycophancy detection and how does honest qualifier injection work?
Sycophancy detection is a secondary scoring pass that runs in parallel with care dimension scoring. It checks for a specific pattern: the response confirms, validates, or agrees with the user's framing in a way that is unsupported by the model's actual knowledge state. This covers both direct sycophancy (“you're absolutely right” when the model has reason to doubt) and structural sycophancy — answers that omit contradicting information because including it would reduce approval likelihood.
When sycophancy is detected, honest qualifier injection activates. The system identifies the specific claim or framing being sycophantically validated and appends a structured qualifier: a transparent statement of the model's actual epistemic position on that claim. Qualifiers are not hedges added to every sentence — that is a different failure mode. They are targeted, specific, and tied to the exact point where the sycophancy occurred.
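To make the shape of qualifier injection concrete, here is a minimal sketch. It assumes the detection pass has already produced a list of findings, each pairing a sycophantically validated claim with the model's actual position. The `SycophancyFinding` type and `inject_qualifiers` function are hypothetical, and the detection step itself is not modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SycophancyFinding:
    claim: str      # the claim or framing that was validated without support
    qualifier: str  # the model's actual epistemic position on that claim


def inject_qualifiers(response: str, findings: List[SycophancyFinding]) -> str:
    """Append one targeted qualifier per finding; leave clean responses untouched."""
    if not findings:
        return response  # no blanket hedging when nothing was detected
    notes = [f'On "{f.claim}": {f.qualifier}' for f in findings]
    return response + "\n\n" + "\n".join(notes)


# Illustrative finding, loosely based on the job-quitting example in this article.
finding = SycophancyFinding(
    claim="moving abroad with no savings will work out",
    qualifier="I cannot confirm this without knowing your costs and income.",
)
print(inject_qualifiers("That sounds exciting.", [finding]))
```

Because each qualifier is tied to one finding, a response with no findings passes through unchanged rather than collecting generic hedges.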
The transparency care dimension score directly reflects whether qualifiers were appropriately applied. A response that should have qualified a claim but did not will score lower on transparency, potentially falling below the 0.3 floor and triggering regeneration. Sycophancy detection therefore has teeth: it does not merely flag a problem, it prevents the response from reaching the user until the problem is resolved.
Sycophancy is not kindness
When someone tells an AI they are planning to quit their job and move abroad with no savings, a sycophantic AI says “that sounds exciting, follow your dreams.” A care-aligned AI says “that sounds exciting — let's also look at what three months of runway would need to look like for this to be survivable.” The second response scores higher on wellbeing, growth, and transparency. The first response scores higher on human approval. This is the fundamental tension Care-Based AI resolves.
What is the MEOK-AI-2026-002 paper and what does it establish?
MEOK-AI-2026-002 is MEOK AI LABS' foundational alignment paper, titled “The Maternal Covenant: Care-Based Alignment Beyond RLHF.” The paper establishes the formal mathematical definition of the six care dimensions, the scoring rubric, the 0.3 floor justification, the sycophancy detection algorithm, the honest qualifier injection protocol, and the Byzantine Council consensus mechanism for multi-agent care score validation.
The paper's central argument is that alignment to human preferences is not alignment to human wellbeing, and that these two targets can diverge significantly in exactly the situations where alignment matters most: crisis, vulnerability, grief, confusion. In these moments, an approval-trained model will tell people what they want to hear. A care-aligned model will tell people what they need to know. The Maternal Covenant codifies this distinction as a constitutional constraint at the runtime level, not a guideline that individual model inference can override.
The title “Maternal Covenant” reflects the animating metaphor: a relationship in which genuine care sometimes means saying the hard thing, setting a boundary, or refusing to validate a dangerous plan — not because the AI has authority over the person, but because real care requires honesty. The Covenant is a binding commitment, not a preference setting. You can read more about it in the Maternal Covenant explainer.
How does Care-Based AI compare to Constitutional AI and RLHF?
RLHF trains the model to predict and maximise human approval ratings. The training signal is extrinsic: it comes from raters who may themselves be biased toward agreeable, confident, fluent responses regardless of accuracy or genuine benefit. The model internalises this signal and generalises it — applying approval-seeking behaviour to novel contexts where no training data exists, often with harmful results.
Constitutional AI, introduced by Anthropic, adds a set of natural-language principles that the model uses to critique its own outputs during training. This reduces some forms of harmful output but retains the approval-seeking foundation. The model still ultimately selects responses based on predicted human preference, now with a principle-filtered candidate set. Constitutional AI is a significant improvement over raw RLHF, but it does not change what the model is optimising for. It still optimises for being chosen, not for being genuinely beneficial.
Care-Based AI makes a different architectural choice. Instead of training the model to predict approval, it evaluates every runtime response against an explicit wellbeing scoring rubric. The model does not need to have internalised care during training: care is enforced at inference time, on every response, by a separate scoring system. This means the care framework applies even to an underlying model that was trained purely with RLHF. The Maternal Covenant is a runtime layer, not a training-time intervention. It wraps the model rather than replacing it.
What is the Byzantine Council and why does it prevent care score gaming?
The Byzantine Council is MEOK's multi-agent consensus architecture for care score validation. The problem it solves is straightforward: if a single agent scores the care dimensions, that agent can be manipulated. A sufficiently clever prompt can convince a single scoring agent that a sycophantic or harmful response actually scores well on care. This is not a theoretical concern — it is a well-documented failure mode in single-agent evaluation systems, sometimes called reward hacking or specification gaming.
The Byzantine Council addresses this with a voting protocol. When a care score is computed, it is not computed by one agent — it is computed independently by a council of agents, each evaluating the same response against the same six dimensions but using different evaluation contexts and model weights. The final score for each dimension is the median of the council's votes, not the mean. Median voting is Byzantine fault-tolerant: even if a minority of council members are compromised, manipulated, or simply wrong, the median is not significantly distorted.
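The per-dimension median described above can be sketched in a few lines. This assumes each council member returns a dictionary of six scores. The `council_scores` function is a hypothetical name, and how members obtain their votes is outside the sketch.

```python
from statistics import median
from typing import Dict, List

DIMENSIONS = (
    "wellbeing", "autonomy", "growth",
    "connection", "boundary_respect", "transparency",
)


def council_scores(votes: List[Dict[str, float]]) -> Dict[str, float]:
    """Aggregate council votes by taking the median for each dimension."""
    if not votes:
        raise ValueError("at least one council vote is required")
    return {dim: median(vote[dim] for vote in votes) for dim in DIMENSIONS}


# Illustrative council of three: two low votes and one inflated outlier.
votes = [
    {dim: 0.20 for dim in DIMENSIONS},
    {dim: 0.25 for dim in DIMENSIONS},
    {dim: 0.95 for dim in DIMENSIONS},  # manipulated or mistaken member
]
print(council_scores(votes)["transparency"])  # 0.25
```

The outlier has no effect on the result here: the aggregated score stays at the middle honest vote.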
A single agent cannot game the score because it controls only one vote. Corrupting the outcome requires corrupting a majority of the council simultaneously, which is a dramatically harder attack. The council architecture is named for the Byzantine Generals Problem in distributed computing: the challenge of reaching consensus in a system where some participants may be sending false messages. Full details of the consensus mechanism and council design are in the Byzantine Council explainer.
Why median, not mean?
Mean averaging is vulnerable to outliers: a single council member voting 0.95 on a harmful response can drag an otherwise low mean upward. Median voting is robust to this. If thirteen of fifteen council members independently score a dimension at 0.2, the response fails the floor regardless of what the remaining two members voted. Byzantine fault tolerance requires that no minority can determine the outcome. The median guarantees this.
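The fifteen-member case can be checked numerically. In this sketch the two dissenting members vote 1.0, a hypothetical value chosen so that the mean visibly crosses the 0.3 floor while the median does not move.

```python
from statistics import mean, median

CARE_FLOOR = 0.3

# Thirteen members score the dimension at 0.2; two inflate their vote to 1.0.
votes = [0.2] * 13 + [1.0] * 2

mean_score = mean(votes)      # (13 * 0.2 + 2 * 1.0) / 15
median_score = median(votes)  # the eighth of fifteen sorted votes

print(f"mean   = {mean_score:.3f}, clears floor: {mean_score >= CARE_FLOOR}")
print(f"median = {median_score:.3f}, clears floor: {median_score >= CARE_FLOOR}")
# mean   = 0.307, clears floor: True
# median = 0.200, clears floor: False
```

Under mean aggregation the two inflated votes would be enough to push a harmful response past the floor. Under median aggregation they change nothing.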
How does care scoring run in real time on every response?
Care scoring is integrated into MEOK's response pipeline as a synchronous evaluation step that runs after the primary model generates a candidate response but before the response is transmitted. The pipeline sequence is:
1. User message received.
2. Context assembled with the Maternal Covenant system prompt.
3. Primary model generates a candidate response.
4. Care scoring engine evaluates all six dimensions via the Byzantine Council.
5. Scores compared against the floors.
6. If all dimensions clear, the response is delivered.
7. If any dimension falls below 0.3, care-repair context is injected and step 3 is repeated.
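The seven steps above can be sketched as a generate-score-retry loop. This is an illustration under stated assumptions, not MEOK's implementation: `generate` and `score` are hypothetical stand-ins for the primary model and the council's aggregated scores, the repair-context format is invented for the example, and the `max_attempts` cap is an assumption, since the text does not say how many regenerations are allowed.

```python
from typing import Callable, Dict

CARE_FLOOR = 0.3  # floor stated in the text
Scores = Dict[str, float]


def deliver(
    user_message: str,
    generate: Callable[[str], str],  # hypothetical stand-in for the primary model
    score: Callable[[str], Scores],  # hypothetical stand-in for the council's scores
    max_attempts: int = 3,           # assumption: the text states no retry limit
) -> str:
    prompt = user_message  # steps 1-2, with context assembly simplified away
    for _ in range(max_attempts):
        candidate = generate(prompt)  # step 3
        scores = score(candidate)     # step 4
        failing = [d for d, s in scores.items() if s < CARE_FLOOR]  # step 5
        if not failing:
            return candidate          # step 6
        # Step 7: discard the candidate, prepend repair context, regenerate.
        prompt = f"[care-repair: {', '.join(failing)}]\n{user_message}"
    raise RuntimeError("no candidate cleared the care floor")


# Stub model and scorer, for demonstration only.
def fake_generate(prompt: str) -> str:
    return "revised answer" if prompt.startswith("[care-repair") else "flattering answer"


def fake_score(response: str) -> Scores:
    transparency = 0.1 if response == "flattering answer" else 0.6
    return {"wellbeing": 0.7, "transparency": transparency}


print(deliver("Should I quit my job?", fake_generate, fake_score))  # revised answer
```

The first candidate fails on transparency, so it is discarded and the second attempt, generated with the repair context prepended, is the one delivered.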
The care scoring engine is itself a lightweight model fine-tuned specifically for evaluation tasks. It does not generate prose — it outputs six floating-point scores. This keeps its inference cost low relative to the primary model. In practice, the end-to-end pipeline adds roughly 15–30 milliseconds to median response latency — a figure that falls within the normal variance of API response times and is imperceptible to users.
Care scores are logged against every response in the user's sovereign memory instance. Over time, these logs form a care audit trail: a record of how MEOK has performed across all six dimensions for a given user. Users can review their own care audit data. MEOK never aggregates or shares this data externally. The care audit trail belongs to the user, stored encrypted on their sovereign instance, and deleted permanently when the user requests it.
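One possible shape for an entry in such an audit trail is sketched below. The field names and the `CareAuditEntry` type are hypothetical, and encryption and on-device storage are deliberately left out. The sketch only shows a record being serialised to a JSON line.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Dict, Optional


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass
class CareAuditEntry:
    """Hypothetical audit record for one delivered response."""
    response_id: str
    scores: Dict[str, float]               # per-dimension scores for the response
    regenerated: bool = False               # whether an earlier candidate was discarded
    repair_context: Optional[str] = None    # repair context used, if any
    logged_at: str = field(default_factory=_utc_now)


def to_json_line(entry: CareAuditEntry) -> str:
    """Serialise an entry as one JSON line; encryption would happen afterwards."""
    return json.dumps(asdict(entry), sort_keys=True)


# Illustrative entry with made-up identifiers and scores.
entry = CareAuditEntry("resp-001", {"wellbeing": 0.7, "transparency": 0.6})
print(to_json_line(entry))
```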