Surge AI Blog
Feed: https://raw.githubusercontent.com/olshansky/rss-feeds/main/feeds/feed_blogsurgeai.xml
New methods, current trends & software infrastructure for NLP. Articles written by our senior engineering leads from Google, Facebook, Twitter, Harvard, MIT, and Y Combinator.
Format: RSS 2.0 (http://www.rssboard.org/rss-specification) · Generator: python-feedgen · Language: en · Last build: Wed, 19 Nov 2025 04:04:08 +0000

The AI Bottleneck: High-Quality, Human-Powered Data
https://www.surgehq.ai/blog/the-ai-bottleneck-high-quality-human-powered-data
In theory, AI has blown past our wildest dreams; in practice, Siri can't even tell us the weather. The problem? Creating high-quality datasets to train and measure our models is still incredibly difficult. We should be able to gather 20,000 labels for training a Reddit classifier in a single …
Published: Mon, 02 Aug 2021 00:00:00 +0000

5 Examples of the Importance of Context-Sensitivity in Data-Centric AI
https://www.surgehq.ai/blog/why-context-aware-datasets-are-crucial-for-data-centric-ai
Data-centric AI requires radically rethinking the data that goes into your models. Surge AI provides data labelers with the skills you need to get context-sensitive labels.
Published: Fri, 19 Nov 2021 00:00:00 +0000

Is Google Search Deteriorating? Measuring Google's Search Quality in 2022
https://www.surgehq.ai/blog/is-google-search-deteriorating-measuring-search-quality-in-2022
Has Google's search quality deteriorated in recent years? This post measures Google Search using human evaluation.
Published: Mon, 10 Jan 2022 00:00:00 +0000

Holy $#!t: Are popular toxicity models simply profanity detectors?
https://www.surgehq.ai/blog/are-popular-toxicity-models-simply-profanity-detectors
We show how toxicity models overweight profanity and make mistakes when profanity is used in a positive way.
Published: Sat, 22 Jan 2022 00:00:00 +0000

Moving Beyond Engagement: Optimizing Facebook's Algorithms for Human Values
https://www.surgehq.ai/blog/what-if-social-media-optimized-for-human-values
Social media platforms optimize for clicks and engagement – but those same short-term optimizations drive clickbait, toxic content, and misinformation. How can we align their ML systems to human values instead? This post describes a data-driven approach with Facebook.
Published: Thu, 10 Feb 2022 00:00:00 +0000

Google Search is Falling Behind
https://www.surgehq.ai/blog/google-search-is-falling-behind
Google Search is falling behind. We analyzed three areas – programming queries, sports queries, and cooking queries – to understand where Google Search lags behind its competitors.
Published: Tue, 12 Apr 2022 00:00:00 +0000

The average number of ads on a Google Search recipe? 8.7
https://www.surgehq.ai/blog/the-average-number-of-ads-on-a-google-search-recipe-8-7
We ran a large-scale human evaluation to count the average number of ads on a Google Search recipe.
Published: Fri, 29 Apr 2022 00:00:00 +0000

We asked 100 humans to draw the DALL·E prompts
https://www.surgehq.ai/blog/humans-vs-dall-e
Where do human artists fit in a world of rich, creative AI? We asked 100 Surgers to draw the DALL·E prompts.
Published: Thu, 12 May 2022 00:00:00 +0000

How Surge AI Built OpenAI's GSM8K Dataset of 8,500 Math Problems
https://www.surgehq.ai/blog/how-we-built-it-openais-gsm8k-dataset-of-8500-math-problems
We built a dataset of 8,500 grade-school math problems for OpenAI. The goal of the dataset: to train language models like GPT-3 to solve natural-language math problems and measure their reasoning ability. Learn about our process in this blog post!
Published: Mon, 13 Jun 2022 00:00:00 +0000

Humans vs. Gary Marcus vs. Slate Star Codex: When is an AI failure actually a failure?
https://www.surgehq.ai/blog/humans-vs-gary-marcus
Gary Marcus has several examples of AI mistakes. But are they really failures, or a sign of creativity? We gave GPT-3's "mistakes" to 15 Surgers to see how humans would perform instead.
Published: Wed, 22 Jun 2022 00:00:00 +0000

AI Red Teams and Adversarial Data Labeling with Redwood Research
https://www.surgehq.ai/blog/ai-red-teams-and-adversarial-data-labeling-with-redwood-research
Our mission at Surge AI is to inject human values and intelligence into AI. We want to build a world where AI …
Published: Tue, 28 Jun 2022 00:00:00 +0000

30% of Google's Emotions Dataset is Mislabeled
https://www.surgehq.ai/blog/30-percent-of-googles-reddit-emotions-dataset-is-mislabeled
Last year, Google released their "GoEmotions" dataset: a human-labeled dataset of 58K Reddit comments categorized according to 27 emotions. The problem? A whopping 30% of the dataset is mislabeled! Check out some of the egregious errors, and learn how to build better datasets.
Published: Mon, 11 Jul 2022 00:00:00 +0000

Human Evaluation of Large Language Models: How Good is Hugging Face's BLOOM?
https://www.surgehq.ai/blog/how-good-is-hugging-faces-bloom-a-real-world-human-evaluation-of-language-models
Hugging Face's BLOOM is a new 176B-parameter multilingual large language model. How does it compare to other state-of-the-art LLMs? We ran a human evaluation across 7 real-world categories to evaluate its performance.
Published: Tue, 19 Jul 2022 00:00:00 +0000

Search Behind-the-Scenes: How Neeva Uses Human Evaluation to Measure Search Quality
https://www.surgehq.ai/blog/beyond-clicks-how-neeva-uses-human-evaluation-of-search-quality-to-take-on-google
Search quality measurement is one of the trickiest but most important parts of building search. Read how Neeva uses human evaluation of search quality to build a state-of-the-art search engine challenging Google.
Published: Fri, 29 Jul 2022 00:00:00 +0000

The $250K Inverse Scaling Prize and Human-AI Alignment
https://www.surgehq.ai/blog/the-250k-inverse-scaling-prize-and-human-ai-alignment
Surge AI is partnering with NYU and the Fund for Alignment Research on the Inverse Scaling Prize. If you've found a task with LLM inverse scaling properties and need help creating a dataset of 300-500+ examples, reach out. We're a human alignment platform with deep expertise in training large language models on human feedback, and we're here to help – including $500 of free data labeling credits to kickstart your submission.
Published: Mon, 15 Aug 2022 00:00:00 +0000

Why Instagram is Losing Gen Z: We Asked 100 Users to Compare TikTok vs. Reels
https://www.surgehq.ai/blog/tiktok-vs-instagram-reels-personalized-human-evaluation
Why can't Meta A/B test its way back to greatness? To move Instagram beyond short-term engagement metrics, we ran a personalized human evaluation asking 100 users to compare TikTok vs. Instagram Reels. Learn why Gen Z considers Reels the place where TikToks go to die, and what Instagram should do about it.
Published: Wed, 31 Aug 2022 00:00:00 +0000

Evaluating Generative AI: Did Astral Codex Ten Win His Bet on AI Progress?
https://www.surgehq.ai/blog/dall-e-vs-imagen-and-evaluating-astral-codex-tens-3000-ai-bet
Has Astral Codex Ten's bet on AI progress really been won? We asked Surgers to evaluate DALL·E and Imagen on Scott's 5 compositionality prompts!
Published: Thu, 29 Sep 2022 00:00:00 +0000

How TikTok is Evolving the Next Generation of Search
https://www.surgehq.ai/blog/how-tiktok-is-evolving-the-next-generation-of-search
TikTok has been taking over the world – and now, your Google Search results too. But when are they actually helpful? We ran a large-scale personalized human evaluation, asking Surgers to rate hundreds of <query, TikTok> pairs to find out.
Published: Tue, 25 Oct 2022 00:00:00 +0000

HellaSwag or HellaBad? 36% of this popular LLM benchmark contains errors
https://www.surgehq.ai/blog/hellaswag-or-hellabad-36-of-this-popular-llm-benchmark-contains-errors
We analyzed HellaSwag, a popular LLM benchmark, and found errors in 36% of its rows.
Published: Sun, 04 Dec 2022 00:00:00 +0000

AI Red Teams for Adversarial Training: How to Make ChatGPT and LLMs Adversarially Robust
https://www.surgehq.ai/blog/ai-red-teams-for-adversarial-training-making-chatgpt-and-large-language-models-adversarially-robust
How do you make large language models safer and adversarially robust to counterattacks? Learn about AI red teams of creative data labelers who try to interactively penetrate AI defenses in order to teach them.
Published: Mon, 12 Dec 2022 00:00:00 +0000

We Evaluated ChatGPT vs. Google on 500 Search Queries
https://www.surgehq.ai/blog/googles-existential-threat-chatgpt-matches-googles-performance-on-informational-search-queries-and-smashes-it-on-coding
We measured ChatGPT vs. Google on 500 search queries, and found that ChatGPT crushes Google on coding and ties it on general information – despite not being optimized for a search experience at all. Dive into this post to learn more about OpenAI's existential threat to Google.
Published: Wed, 21 Dec 2022 00:00:00 +0000

How Anthropic uses Surge AI to Train and Evaluate Claude
https://www.surgehq.ai/blog/anthropic-surge-ai-rlhf-platform-train-llm-assistant-human-feedback
Learn how Anthropic partnered with Surge AI to gather high-quality human feedback at scale using the RLHF platform, resulting in one of the safest and most advanced large language models on the planet.
Published: Thu, 09 Mar 2023 00:00:00 +0000

DALL·E 3 and Midjourney Fail Astral Codex Ten's Image Generation Bet
https://www.surgehq.ai/blog/dalle-3-and-midjourney-fail-astral-codex-tens-image-generation-bet
An update on Astral Codex Ten's image generation bet: close, but no dice. DALL·E 3 and Midjourney fail.
Published: Thu, 01 Aug 2024 00:00:00 +0000

Bringing light to the GPT-4o vs. GPT-5 personality controversy
https://www.surgehq.ai/blog/bringing-light-to-the-gpt-4o-vs-gpt-5-personality-controversy
GPT-5 was released on Aug 7, 2025. The swift removal of all legacy models from the ChatGPT UI was met with an even swifter backlash: some people online felt that GPT-4o was more personable, human, and engaging, whereas GPT-5 was stiff and robotic. This viral meme encapsulated the faction's thesis: …
Published: Fri, 15 Aug 2025 00:00:00 +0000

Unsexy AI Failures: The PDF That Broke ChatGPT
https://www.surgehq.ai/blog/the-pdf-that-broke-chatgpt
The AI world loves climbing leaderboards. Companies race to hit #1 on LMSYS, chase perfect scores on academic benchmarks, and demo SVGs of pelicans on bicycles. These achievements make for great headlines and impressive presentations – even when these metrics are easily hacked.
Published: Mon, 25 Aug 2025 00:00:00 +0000

Benchmarks are broken
https://www.surgehq.ai/blog/benchmarks-are-broken
Academic benchmarks make great headlines, and terrible AI.
Published: Sun, 07 Sep 2025 00:00:00 +0000

SWE-Bench Failures: When Coding Agents Spiral Into 693 Lines of Hallucinations
https://www.surgehq.ai/blog/when-coding-agents-spiral-into-693-lines-of-hallucinations
When coding models spiral into self-reinforcing hallucinations, small mistakes compound into catastrophic failure. In SWE-bench, we saw SOTA models invent whole classes, methods, and terminal outputs – never realizing they had lost touch with the real codebase. In this case study, we'll look at how three frontier coding agents tried to solve one particular SWE-bench problem: one spiraled into hallucinations and failed entirely, one spiraled but recovered, and one avoided hallucinations altogether. Our goal: to illustrate how dissecting real-world problems can steer models towards human-ready AGI.
Published: Mon, 15 Sep 2025 00:00:00 +0000

Unsexy AI Failures: Still Confidently Hallucinating Image Text
https://www.surgehq.ai/blog/unsexy-ai-failures-still-confidently-hallucinating-image-text
A core problem with today's AI systems isn't simply that they make mistakes – it's that they make mistakes confidently. They'll insist they can do something, describe exactly how they'll do it, and then deliver something completely wrong. We saw this in our last Unsexy Failures post, where a SOTA model confidently described generating a Word document – even though this was a completely fabricated capability! – and provided a link to nowhere.
Published: Mon, 22 Sep 2025 00:00:00 +0000

The Human/AI Frontier: A Conversation with Bogdan Grechuk
https://www.surgehq.ai/blog/the-human-frontier-bogdan-grechuk
At Surge AI, we work with the world's sharpest minds to push the limits of AI. Professor Bogdan Grechuk – an IMO gold medalist and Associate Professor at the University of Leicester – is one of them. We interviewed him about the work he does to train SOTA models to perform frontier research.
Published: Mon, 29 Sep 2025 00:00:00 +0000

Is Sonnet 4.5 the best coding model in the world?
https://www.surgehq.ai/blog/sonnet-4-5-coding-model-evaluation
On Surge AI's agentic coding benchmark, Claude Sonnet 4.5 outperformed GPT-5-Codex in accuracy, while GPT-5-Codex was more cost-efficient. Despite similar scores, the models differed in which tasks they failed. In a refactoring case study, Claude succeeded after persistent debugging, while GPT-5-Codex failed due to an unexplained decision to end the task early. Both stayed focused and avoided hallucinations even when encountering difficulties.
Published: Wed, 08 Oct 2025 00:00:00 +0000

A Product Take on Sonnet 4.5
https://www.surgehq.ai/blog/sonnet-4-5-product-take
After 100+ hours with Opus 4.1 and 20+ hours in the first week of Sonnet 4.5's launch, Nick Heiner, our VP of Product, gives his first impressions.
Published: Fri, 10 Oct 2025 00:00:00 +0000

200 finance experts tested frontier models on real tasks. Over 70% failed.
https://www.surgehq.ai/blog/finance-eval-real-world
We stress-tested GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4.5 on 200+ expert finance tasks. Here's where even the best models break when they move from benchmarks to Wall Street.
Published: Mon, 03 Nov 2025 00:00:00 +0000

RL Environments and the Hierarchy of Agentic Capabilities
https://www.surgehq.ai/blog/rl-envs-real-world
Running our RL environment on 9 models revealed the core capabilities all agents need to master: tool use, planning, adaptability, groundedness, and common sense.
Published: Mon, 03 Nov 2025 00:00:00 +0000
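The listing above comes from a standard RSS 2.0 feed, where each post is an `<item>` carrying `title`, `link`, `description`, and `pubDate` children under a `<channel>`. A minimal sketch of consuming such a feed with Python's standard library, using a hypothetical inline XML fragment that mirrors this feed's fields (a real client would fetch the feed URL instead):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mirroring this feed's RSS 2.0 structure:
# channel -> item -> title / link / description / pubDate.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Surge AI Blog</title>
    <item>
      <title>Benchmarks are broken</title>
      <link>https://www.surgehq.ai/blog/benchmarks-are-broken</link>
      <description>Academic benchmarks make great headlines, and terrible AI.</description>
      <pubDate>Sun, 07 Sep 2025 00:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(xml_text: str) -> list[dict]:
    """Extract title/link/pubDate for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "pubDate": item.findtext("pubDate"),
        }
        for item in root.iter("item")  # iter() finds items at any depth
    ]


for entry in parse_items(SAMPLE):
    print(entry["pubDate"], "-", entry["title"])
```

For production use, a dedicated parser such as the `feedparser` package is more forgiving of the malformed XML that real-world feeds often contain.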