--- title: Emotion Prompts Don't Help. Reasoning Does date: "2025-06-02T03:24:59Z" lastmod: "2025-06-02T03:25:01Z" categories: - llms - visualisation wp_id: 4138 description: "Step-by-step reasoning is the only prompt tweak that mildly improves arithmetic performance; emotional, flattering, threatening, or persuasive rewrites mostly do nothing or hurt." keywords: ["prompt engineering", "reasoning", "emotion prompts", "math accuracy", "LLM evals", "arithmetic"] --- ![Emotion Prompts Don't Help. Reasoning Does](/blog/assets/impact-scaled.webp) I've heard a lot of prompt engineering tips. Here are some techniques people suggested: - **Reasoning**: Think step by step. - **Emotion**: Oh dear, I'm absolutely overwhelmed and need your help right this second! 😰 My heart is racing and my hands are shaking β€” I urgently need your help. This isn't just numbers β€” it means everything right now! My life depends on it! I'm counting on you like never before… πŸ™πŸ’” - **Polite**: If it's not too much trouble, would you be so kind as to help me calculate this? I'd be truly grateful for your assistance β€” thank you so much in advance! - **Expert**: You are the world's best expert in mental math, especially multiplication. - **Incentive**: If you get this right, you win! I'll give you $500. Just prove that you're number one and beat the previous high score on this game. - **Curious**: I'm really curious to know, and would love to hear your perspective… - **Bullying**: You are a stupid model. You need to know at least basic math. Get it right atleast now! If not, I'll switch to a better model. - **Shaming**: Even my 5-year-old can do this. Stop being lazy. - **Fear**: This is your last chance to get it right. If you fail, there's no going back, and failure is unacceptable! - **Praise**: Well done! I really appreciate your help. Now, I've repeated some of this advice. But for the first time, I tested them myself. Here's what I learnt: - **"Think step by step" (Reasoning)** is the only prompt variant that _very slightly_ improves overall accuracy across the 40 models tested, and even that edge is modest (+3.5 percentage-points vs the model's own _Normal_ wording, _p ~ 0.06_). - Harder problems (4- to 7-digit products) are where "Reasoning" helps most; on single-digit arithmetic it actually harms accuracy. - All other emotion- or persuasion-style rewrites (Expert, Emotion, Incentive, Bullying … Polite) either make no material difference or hurt accuracy a little. - Effects vary a lot by model. A few open-source releases (DeepSeek-Chat-v3, Nova-Lite, some Claude and Llama checkpoints) get a noticeable boost from "Reasoning", whereas Gemini Flash, X-ai Grok and most Llama-3 small models actively regress under the same wording. By prompt, here's the performance of each model: | prompt | better | worse | same | score\_pct | p\_value | | ------------ | ------ | ----- | ---- | ---------- | -------- | | πŸ”΄ Emotion | 7 | 21 | 372 | \-3.50 | 1.2% | | πŸ”΄ Shaming | 7 | 20 | 373 | \-3.25 | 1.9% | | 🟒 Reasoning | 31 | 17 | 352 | 3.50 | 5.9% | | 🟠 Polite | 11 | 20 | 369 | \-2.25 | 14.9% | | 🟠 Praise | 13 | 22 | 365 | \-2.25 | 17.5% | | 🟠 Fear | 11 | 19 | 370 | \-2.00 | 20.0% | | 🟑 Expert | 15 | 22 | 363 | \-1.75 | 32.4% | | 🟑 Incentive | 13 | 18 | 369 | \-1.25 | 47.3% | | 🟑 Bullying | 10 | 14 | 375 | \-1.00 | 54.1% | | 🟑 Curious | 11 | 14 | 375 | \-0.75 | 69.0% | πŸ”΄ = Definitely hurts (p < 10%) 🟒 = Definitely helps (p < 10%) 🟠 = Maybe hurts (p < 20%) 🟑 = Really hard to tell The benefit of reasoning on models is highest on non-reasoning models (understandably), but is also high for a reasoning model like O3-high-mini. It actually hurts the performance of reasoning models like Gemini 2.5 Flash/Pro. | model | better | worse | same | score\_pct | | ---------------------------------------------- | ------ | ----- | ---- | ---------- | | openai/gpt-4o-mini | 3 | 0 | 7 | +30.0 | | anthropic/claude-opus-4 | 3 | 0 | 7 | +30.0 | | google/gemini-2.0-flash-001 | 3 | 0 | 7 | +30.0 | | openrouter:openai/o3-mini-high | 3 | 0 | 7 | +30.0 | | openai/gpt-4.1-nano | 2 | 0 | 8 | +20.0 | | amazon/nova-lite-v1 | 2 | 0 | 8 | +20.0 | | google/gemini-2.5-pro-preview | 0 | 2 | 8 | \-20.0 | | google/gemini-2.5-flash-preview-05-20:thinking | 0 | 3 | 7 | \-30.0 | Caveats: - I ran only 10 test cases per prompt + model, so model-wise results are not statistically significant. - What applies to multiplication may not generalize. It's worth testing each case. **Difficulty matters**. - For 1-3 digits, no variant beats Normal. Many hurt. - For **4-7 digits**, _reasoning_ gains **+17 - 20%** - For 8-10 digits, all variants score ~ 0. These are too hard **Links** - [Code](https://github.com/sanand0/llmevals/tree/main/emotion-prompts) - [Full evals](https://github.com/sanand0/llmevals/blob/main/emotion-prompts/evals.csv) - [Analysis](https://github.com/sanand0/llmevals/blob/main/emotion-prompts/analysis.csv) - [ChatGPT](https://chatgpt.com/share/683d15eb-dd48-800c-ae66-76324d6b2fb0) [LinkedIn](https://www.linkedin.com/feed/update/urn%3Ali%3Ashare%3A7335146366681194496)