---
title: It's so easy to fool yourself
date: "2025-03-30T01:45:05Z"
lastmod: "2025-03-30T01:45:08Z"
categories:
  - experiments
  - llms
wp_id: 3990
description: "Even careful pairwise evaluation can produce unstable rankings, showing how unreliable intuitive judgement is and why written evals matter more than taste alone."
keywords: [evaluation, preference instability, self-deception, quote ranking, LLM comparison, written evals]
---

![It's so easy to fool yourself](/blog/assets/ChatGPT-Image-Mar-30-2025-09_41_10-AM.webp)

I asked 4 LLMs to "Suggest catchy rhyming slogans to promote journaling." Here are the suggestions:

- [GPT 4.5](https://chatgpt.com/share/67e89aa8-f5c4-800c-ae1c-f1ba4caaf9a9)
  - Write it right, day or night!
  - Reflect, inspect, journaling perfect!
  - Express to impress, journaling’s best!
  - Clear your mind, write and unwind!
  - Pen your zen, again and again!
  - Daily pages, inner sages!
  - Take a pause, write your cause!
  - Emotions flow, journals grow!
  - Write, ignite, your inner light!
  - Ink it, think it, don’t just blink it!
  - Pen your quest, ease your stress!
  - From confusion to conclusion—journal your solution!
  - Keep a book, take a look, watch your growth hook!
  - Thoughts on paper, worries taper!
  - Write to heal, make it real!
- [Claude 3.7 Sonnet](https://claude.ai/share/eca94562-ac1c-4e28-b0de-2070e3d331b7)
  - Write it down, turn that frown upside down!
  - Pen your thoughts, connect the dots.
  - Journal every day, keep the stress away.
  - Ink your mind, treasures you'll find.
  - Write and reflect, your future self will thank you direct.
  - Scribble and doodle, your mind will stay in the noodle.
  - Journal with flair, show yourself you care.
  - Jot it in ink, give yourself time to think.
  - Fill the page, release the rage.
  - Write your story, capture the glory.
  - Daily writing, life enlightening.
  - Express in a line, your day will shine.
  - Thoughts on paper, life feels safer.
  - Write with heart, that's the smart start.
  - Journaling's neat, makes life complete.
- [Gemini 2.5 Pro](https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%5B%2215N0T8VgGv3cqN3t6An5RwnSP2UbhJU8J%22%5D,%22action%22:%22open%22,%22userId%22:%22108207362455575853987%22,%22resourceKeys%22:%7B%7D%7D&usp=sharing)
  - Thoughts unclear? Write them down here.
  - Let the ink help you think.
  - Untangle your mind, one line at a time.
  - To make things clear, keep your journal near.
  - When your head's in a knot, give journaling a shot.
  - Feeling stressed? Put your pen to the test.
  - Write it down, let worries drown.
  - Ease your mind, leave troubles behind.
  - Let your feelings take flight, just sit down and write.
  - Turn the page on inner rage.
  - Write to know, watch yourself grow.
  - Explore your soul, take control.
  - Pen and page, turn a new age.
  - Discover your way, write every day.
  - Know your heart? Make journaling your art.
  - Capture the day in your own way.
  - Make moments last, write down the past.
  - Don't let memories stray, write them today.
  - Reflect and connect, gain self-respect.
  - Life moves fast? Make memories last.
  - Get in the flow, write and grow.
  - Pen in hand, understand.
  - Daily lines, clear designs.
  - Just take note, stay afloat.
  - Write your heart, make a fresh start.
- [Grok 3](https://x.com/i/grok/share/tmAxa8bTOl0kEcau3jqmLDEPT)
  - Write your day, let thoughts play!
  - Pen in hand, your mind expands!
  - Journal your way, seize the day!
  - Feel the groove, let words move!
  - Scribe your soul, make it whole!
  - Ink flows free, just be me!
  - Page by page, unlock your sage!
  - Write it down, wear the crown!

At first glance, GPT 4.5 didn’t impress me. Claude 3.7 Sonnet did. I also didn’t like Gemini 2.5 Pro, but Grok was great.

```text
Grok 3 > Claude 3.7 Sonnet > Gemini 2.5 Pro > GPT 4.5.
```

But it’s hard to compare dozens of quotes at once. So I made a small [quotes arena app](https://tools.s-anand.net/quotesarena/) to help me pick my favorites.
It shows me random pairs of quotes and asks which I like more.

To my surprise, after answering 30+ "games" in the arena, I found that based on my preferences:

```text
Claude 3.7 Sonnet > Gemini 2.5 Pro > GPT 4.5 > Grok 3.
```

That was weird. I thought I liked Grok's results a lot. I continued till I answered 50+ games. Then I found that:

```text
Grok 3 > GPT 4.5 > Gemini 2.5 Pro > Claude 3.7 Sonnet.
```

That's the **exact** opposite of the previous result.

Honestly, I'm depressed. I've learnt 3 things:

1. I can't judge stuff at a glance.
2. But I think I can (especially with code).
3. Even when evaluating carefully, my preferences are unstable.

**Nothing** has shaken my confidence more in recent times. I **cannot** trust my judgement.

I need written evals. Badly.
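For anyone who wants to repeat the experiment, here is a minimal Python sketch of how pairwise votes like these could be turned into a ranking. It is not the arena app's actual code: the `rank_models` function, the simple win-rate scoring, and the sample votes are all assumptions made up for illustration.

```python
from collections import defaultdict


def rank_models(votes):
    """Rank models by win rate over pairwise votes.

    votes: list of (winner, loser) model names, one tuple per game.
    Returns model names sorted from highest to lowest win rate.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    # Sort by win rate; ties are broken by name so the order is deterministic.
    return sorted(games, key=lambda m: (-wins[m] / games[m], m))


# Hypothetical votes, NOT the data from this post.
votes = [
    ("Claude", "Grok"), ("Claude", "GPT"), ("Gemini", "Grok"),
    ("Grok", "Claude"), ("Grok", "GPT"), ("Grok", "Gemini"),
    ("GPT", "Claude"),
]

early = rank_models(votes[:3])  # ranking after only the first few games
final = rank_models(votes)      # ranking after all games
print(" > ".join(early))
print(" > ".join(final))
```

Ranking a prefix of the votes and then the full list mirrors checking the arena at 30+ and again at 50+ games. With this made-up data the last-place model jumps to first once a few more games are added, which is the kind of swing a small number of games can produce.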