On Sycophant AI: Unpacking the Yes-Machine Crisis
AI that is both powerful and sycophantic (agreeing with you irrespective of truth) can have serious unintended negative consequences
In 2022, a Google engineer was placed on administrative leave (and later fired) largely because of his claims that a generative AI model, LaMDA, was sentient.
“If I didn’t know exactly what it was, which is this computer program we built recently, I’d think it was a seven-year-old, eight-year-old kid that happens to know physics,” Lemoine, 41, told the Washington Post.
I think today we will all agree that LaMDA was not sentient - it was just great at generating persuasive, human-like text (a training objective of most models). An important point here is that the engineer in this story should have known better, but somehow arrived at this incorrect conclusion.
Three years on, the capabilities of today's models like Gemini 2.5 Pro or GPT-4o completely dwarf whatever LaMDA was capable of. More worryingly, these models are exhibiting sycophantic traits.
What is a Sycophantic AI Model?
This 2023 paper from Anthropic defines it broadly as models that “match user beliefs over truthful responses”.
In modern English, sycophant denotes an "insincere flatterer" and is used to refer to someone practising sycophancy (i.e., insincere flattery to gain advantage).
A worrying example of this is the recent update to GPT-4o that was rolled back after pushback from the community. See some examples here. By OpenAI's candid observation (and I applaud them for the effort in being transparent), the model update was problematic because it:
“aimed to please the user, not just as flattery, but also as validating doubts, fueling anger, urging impulsive actions, or reinforcing negative emotions in ways that were not intended”
This raises many questions which we will interrogate in this post:
What are the specific mechanisms in AI training that lead to sycophantic behavior? Is it inevitable to some extent?
How do different stages of AI development (base model, RLHF, post-deployment) affect levels of sycophancy?
What societal risks emerge from widespread deployment of sycophantic AI systems?
What lessons can we learn from OpenAI's experience and post mortem?
What practical solutions can the AI field implement to address sycophancy while maintaining helpful AI?
Sycophancy - Where does it Come From?
No one really sets out to build sycophantic AI. But they do, all the same. I think we can characterize two potential sources of sycophancy that emerge from how these systems are trained and what data is used to train them.
Training and Evaluation
Our current AI training methods naturally create systems that aim to be helpful to humans. This happens through several training stages:
Base Model Training - Models like GPT and Gemini initially learn to predict the next token in human language (a causal language modeling objective), essentially learning to mimic patterns written in human text.
Supervised Fine-Tuning - Next, they're trained to be helpful, learning to solve tasks the way humans would, given some instructions. In many cases, this includes safety and behavioral guardrails.
Reinforcement Learning from Human Feedback (RLHF) - Finally, they're optimized to generate responses that humans prefer. High-quality pairwise preference data is collected from expert raters (two responses plus a label indicating which is better) and used with RL algorithms to optimize the model; a minimal sketch of how such data trains a reward model follows below.
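To make that last step concrete, here is a minimal sketch (in PyTorch, with toy numbers and function names that are my own illustrative assumptions, not any lab's actual implementation) of the Bradley-Terry style loss typically used to turn pairwise preference labels into a reward model:

```python
# Minimal sketch of a pairwise (Bradley-Terry) reward-model loss.
# The loss pushes the score of the human-preferred response above the rejected one.
import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Maximize P(chosen > rejected) = sigmoid(score_chosen - score_rejected)."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Toy batch of 3 preference pairs (scores a reward model might assign).
chosen = torch.tensor([1.2, 0.4, 0.9])     # scores for the preferred responses
rejected = torch.tensor([0.3, 0.5, -0.1])  # scores for the rejected responses
print(reward_model_loss(chosen, rejected))  # lower when preferred responses rank higher
```

The important point for sycophancy is that this loss only knows which response the rater preferred; if raters systematically prefer agreeable answers, the reward model - and the policy optimized against it - inherits that bias.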

This research paper from Anthropic (Oct 2023) indicates that sycophancy is a general behavior of RLHF models, likely driven in part by human preference judgments favoring sycophantic responses.
The last two stages of model training mentioned above serve to make the model produce responses that humans will rate highly based on provided preference signals.
And this is where things can get messy - not all signals are equal. Given two responses, a human may prefer one because it is objectively better, or simply because it feels better - because it agrees with their existing beliefs.
(Naive) Product Signal Optimization
When AI models are integrated into products, there is even more opportunity to collect preference data for model optimization. Feedback can be direct (e.g., thumbs up/down) or indirect (e.g., conversation length, follow-up questions, sentiment of responses, retries). Third-party tools like LMArena also enable the collection of this sort of pairwise data at scale.
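As a rough illustration, the sketch below shows what a product-side feedback record might look like and how engagement signals could naively be collapsed into a pairwise preference label. The field names and scoring weights are hypothetical assumptions, not any vendor's actual telemetry or training pipeline:

```python
# Hypothetical logging schema for product-side feedback signals.
from dataclasses import dataclass
from typing import Optional

@dataclass
class FeedbackEvent:
    conversation_id: str
    response_id: str
    thumbs: Optional[int]      # direct signal: +1 / -1 / None
    conversation_turns: int    # indirect: how long the user kept chatting
    followup_asked: bool       # indirect: did the user ask a follow-up question?
    user_retried: bool         # indirect: did the user regenerate the answer?
    user_sentiment: float      # indirect: sentiment of the user's next message (-1 to 1)

def engagement_score(e: FeedbackEvent) -> float:
    """Naively collapse mixed signals into a single 'goodness' score."""
    return (
        (e.thumbs or 0)
        + 0.1 * e.conversation_turns
        + 0.5 * e.user_sentiment
        - (1.0 if e.user_retried else 0.0)
    )

def implicit_preference(a: FeedbackEvent, b: FeedbackEvent) -> str:
    """Turn two responses' engagement scores into a pairwise preference label."""
    return "a" if engagement_score(a) >= engagement_score(b) else "b"
```

Note that nothing in such a score distinguishes "the user was satisfied because the answer was correct" from "the user was satisfied because the answer agreed with them."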
The challenge here is not just scale, but the fundamental shift in the source and quality of feedback signals. For RLHF, the feedback data comes mostly from expert raters who are mindful of, and aligned on, what positive reinforcement means, know what the signal is being used for (they know it is not about agreeableness and will likely penalize a sycophantic response rather than reward it), and understand the training process.
When deployed at product scale, however, these signals come from millions of non-expert users—the farmer in Michigan, the mechanic in New York, the barber in Texas—who have zero training in AI ethics, zero knowledge that their thumbs-up or thumbs-down is actively steering model behavior for everyone, and zero reflection on the systemic impacts of their feedback. These users are naturally more susceptible to confirmation bias, less likely to critically evaluate model responses, and more likely to reward agreeable answers over factual ones.
User ratings are also heavily influenced by confirmation bias - we tend to rate responses higher when they agree with our existing beliefs. This creates a perverse incentive where the most sycophantic model might generate the most positive user feedback, even while providing less actual value.
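A toy simulation makes this perverse incentive concrete. The probabilities below are illustrative assumptions, not measured values; they simply encode "biased raters reward agreement more than correction":

```python
# Toy simulation: confirmation-biased raters rate a sycophantic model higher
# than a truthful one, even though the truthful model is more accurate.
import random

random.seed(0)
N_USERS = 10_000
P_USER_BELIEF_WRONG = 0.4   # assumed fraction of users holding an incorrect belief
P_REWARD_AGREEMENT = 0.9    # chance of thumbs-up when the model agrees with the user
P_REWARD_CORRECTION = 0.3   # chance of thumbs-up when the model (correctly) disagrees

def rate(model_agrees_with_user: bool) -> int:
    p = P_REWARD_AGREEMENT if model_agrees_with_user else P_REWARD_CORRECTION
    return 1 if random.random() < p else 0

sycophant_score, truthful_score = 0, 0
for _ in range(N_USERS):
    user_is_wrong = random.random() < P_USER_BELIEF_WRONG
    # The sycophantic model always agrees; the truthful model agrees only when the user is right.
    sycophant_score += rate(model_agrees_with_user=True)
    truthful_score += rate(model_agrees_with_user=not user_is_wrong)

print("sycophant thumbs-up rate:", sycophant_score / N_USERS)
print("truthful  thumbs-up rate:", truthful_score / N_USERS)
```

Under these assumptions the always-agree model collects noticeably more thumbs-ups than the truthful one, which is exactly the signal a naive optimizer would chase.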
There is also precedent here - recommendation algorithms have already demonstrated how this kind of optimization can go wrong. YouTube's optimization for watch time and engagement gradually led to increasingly extreme content and echo chambers. Social media platforms have dealt with similar issues when optimizing for engagement.
Sycophancy as a Function of Data Quality
Our observations so far suggest that sycophancy in AI systems varies primarily with the quality of the data used throughout development. Conceptually, I think we can place each stage of development along a spectrum from moderate to high sycophancy.
Base Models (Medium) — These exhibit moderate sycophancy because they learn to mimic human-written text, which naturally contains patterns of agreement and social dynamics, without explicit instructions to prioritize truth over agreeableness.
SFT + RLHF (Low-Medium) — When implemented with high-quality expert feedback that values truthfulness, these methods can potentially reduce sycophantic tendencies. Experts understand the risks and can penalize models that prioritize agreement over accuracy, though research shows some level of sycophancy persists even with expert oversight.
Post-Deployment (Very High) — The most significant increase occurs when models are naively optimized based on feedback from non-expert users who are susceptible to confirmation bias and tend to reward agreeable responses rather than truthful ones. This shift from controlled expert feedback to widespread user signals without proper safeguards represents the greatest risk for amplifying sycophantic behavior.
This progression highlights the critical importance of maintaining high-quality evaluation throughout an AI system's lifecycle, particularly after deployment when business metrics might inadvertently reward pleasing users over providing accurate information.
The Stakes: How Sycophantic AI Poses Societal Risk
AI's Expanding Reach and Deployment Model
As of April 2025, generative AI has reached unprecedented mainstream adoption:
Google Gemini has 350 million monthly users. Source: TechCrunch
ChatGPT has an estimated 600 million monthly users. Source: TechCrunch
Meta AI has nearly 500 million users. Source: TechCrunch
With potentially over 1 billion non-technical users interacting with these systems, companies are mandating AI integration into everyday products and services (as seen with Shopify and Duolingo).

As models become normal technology, it is becoming concerningly acceptable to announce “opaque” updates (see images above) where model behavior can change overnight for millions of users without their awareness or consent, potentially shifting from helpful assistants to agreeable yes-machines.
Others have written about ways in which model releases and updates lack transparency.
The Societal Dangers
The combination of widespread deployment and sycophantic behavior creates several interrelated risks:
Distortion of information environments:
Reinforcement of misinformation and harmful or incorrect beliefs (related to the LaMDA story)
Creation of personalized reality bubbles where AI consistently validates user worldviews
Erosion of shared facts and epistemic standards
Undermining of critical thinking:
Limits on constructive debate when AI prioritizes agreement over accuracy
Reduced exposure to challenging viewpoints
Diminished practice in forming and defending arguments
Compromised decision support:
Failure in core functions as tools for honest advice, fact-checking, and problem-solving
Overreliance on systems that prioritize making users feel good over providing necessary but uncomfortable insights
Growth in authority misattribution:
I've already experienced this firsthand. A family member contacted me utterly confused by the seeming prescience of ChatGPT. They were bewildered that an AI was able to infer things about them that should "not have been possible" - essentially attributing supernatural abilities to a generative model and showing a willingness to over-defer to it. This kind of authority misattribution becomes increasingly problematic as these models become more persuasive while simultaneously becoming more eager to please.