Surge AI Blog
Feed: https://raw.githubusercontent.com/olshansky/rss-feeds/main/feeds/feed_blogsurgeai.xml
New methods, current trends & software infrastructure for NLP. Articles written by our senior engineering leads from Google, Facebook, Twitter, Harvard, MIT, and Y Combinator.
Format: RSS 2.0 (http://www.rssboard.org/rss-specification) · Generator: python-feedgen · Language: en · Last build: Wed, 19 Nov 2025 04:04:08 +0000

The AI Bottleneck: High-Quality, Human-Powered Data
https://www.surgehq.ai/blog/the-ai-bottleneck-high-quality-human-powered-data
In theory, AI has blown past our wildest dreams; in practice, Siri can't even tell us the weather. The problem? Creating high-quality datasets to train and measure our models is still incredibly difficult. We should be able to gather 20,000 labels for training a Reddit classifier in a single …
Published: Mon, 02 Aug 2021 00:00:00 +0000

5 Examples of the Importance of Context-Sensitivity in Data-Centric AI
https://www.surgehq.ai/blog/why-context-aware-datasets-are-crucial-for-data-centric-ai
Data-centric AI requires radically rethinking the data that goes into your models. Surge AI provides data labelers with the skills you need to get context-sensitive labels.
Published: Fri, 19 Nov 2021 00:00:00 +0000

Is Google Search Deteriorating? Measuring Google's Search Quality in 2022
https://www.surgehq.ai/blog/is-google-search-deteriorating-measuring-search-quality-in-2022
Has Google's search quality deteriorated in recent years? This post measures Google Search using human evaluation.
Published: Mon, 10 Jan 2022 00:00:00 +0000

Holy $#!t: Are popular toxicity models simply profanity detectors?
https://www.surgehq.ai/blog/are-popular-toxicity-models-simply-profanity-detectors
We show how toxicity models overweight profanity and make mistakes when profanity is used in a positive way.
Published: Sat, 22 Jan 2022 00:00:00 +0000

Moving Beyond Engagement: Optimizing Facebook's Algorithms for Human Values
https://www.surgehq.ai/blog/what-if-social-media-optimized-for-human-values
Social media platforms optimize for clicks and engagement – but those same short-term optimizations drive clickbait, toxic content, and misinformation. How can we align their ML systems to human values instead? This post describes a data-driven approach with Facebook.
Published: Thu, 10 Feb 2022 00:00:00 +0000

Google Search is Falling Behind
https://www.surgehq.ai/blog/google-search-is-falling-behind
Google Search is falling behind. We analyzed three areas – programming queries, sports queries, and cooking queries – to understand where Google Search lags behind its competitors.
Published: Tue, 12 Apr 2022 00:00:00 +0000

The average number of ads on a Google Search recipe? 8.7
https://www.surgehq.ai/blog/the-average-number-of-ads-on-a-google-search-recipe-8-7
We ran a large-scale human evaluation to count the average number of ads on a Google Search recipe.
Published: Fri, 29 Apr 2022 00:00:00 +0000

We asked 100 humans to draw the DALL·E prompts
https://www.surgehq.ai/blog/humans-vs-dall-e
Where do human artists fit in a world of rich, creative AI? We asked 100 Surgers to draw the DALL·E prompts.
Published: Thu, 12 May 2022 00:00:00 +0000

How Surge AI Built OpenAI's GSM8K Dataset of 8,500 Math Problems
https://www.surgehq.ai/blog/how-we-built-it-openais-gsm8k-dataset-of-8500-math-problems
We built a dataset of 8,500 grade-school math problems for OpenAI. The goal of the dataset: to train language models like GPT-3 to solve natural-language math problems and measure their reasoning ability. Learn about our process in this blog post!
Published: Mon, 13 Jun 2022 00:00:00 +0000

Humans vs. Gary Marcus vs. Slate Star Codex: When is an AI failure actually a failure?
https://www.surgehq.ai/blog/humans-vs-gary-marcus
Gary Marcus has several examples of AI mistakes. But are they really failures, or a sign of creativity? We gave GPT-3's "mistakes" to 15 Surgers to see how humans would perform instead.
Published: Wed, 22 Jun 2022 00:00:00 +0000

AI Red Teams and Adversarial Data Labeling with Redwood Research
https://www.surgehq.ai/blog/ai-red-teams-and-adversarial-data-labeling-with-redwood-research
Our mission at Surge AI is to inject human values and intelligence into AI. We want to build a world where AI …
Published: Tue, 28 Jun 2022 00:00:00 +0000

30% of Google's Emotions Dataset is Mislabeled
https://www.surgehq.ai/blog/30-percent-of-googles-reddit-emotions-dataset-is-mislabeled
Last year, Google released their "GoEmotions" dataset: a human-labeled dataset of 58K Reddit comments categorized according to 27 emotions. The problem? A whopping 30% of the dataset is mislabeled! Check out some of the egregious errors, and learn how to build better datasets.
Published: Mon, 11 Jul 2022 00:00:00 +0000

Human Evaluation of Large Language Models: How Good is Hugging Face's BLOOM?
https://www.surgehq.ai/blog/how-good-is-hugging-faces-bloom-a-real-world-human-evaluation-of-language-models
Hugging Face's BLOOM is a new 176B-parameter multilingual large language model. How does it compare to other state-of-the-art LLMs? We ran a human evaluation across 7 real-world categories to evaluate its performance.
Published: Tue, 19 Jul 2022 00:00:00 +0000

Search Behind-the-Scenes: How Neeva Uses Human Evaluation to Measure Search Quality
https://www.surgehq.ai/blog/beyond-clicks-how-neeva-uses-human-evaluation-of-search-quality-to-take-on-google
Search quality measurement is one of the trickiest but most important parts of building search. Read how Neeva uses human evaluation of search quality to build a state-of-the-art search engine challenging Google.
Published: Fri, 29 Jul 2022 00:00:00 +0000

The $250K Inverse Scaling Prize and Human-AI Alignment
https://www.surgehq.ai/blog/the-250k-inverse-scaling-prize-and-human-ai-alignment
Surge AI is partnering with NYU and the Fund for Alignment Research on the Inverse Scaling Prize. If you've found a task with LLM inverse scaling properties and need help creating a dataset of 300-500+ examples, reach out. We're a human alignment platform with deep expertise in training large language models on human feedback, and we're here to help – including $500 of free data labeling credits to kickstart your submission.
Published: Mon, 15 Aug 2022 00:00:00 +0000

Why Instagram is Losing Gen Z: We Asked 100 Users to Compare TikTok vs. Reels
https://www.surgehq.ai/blog/tiktok-vs-instagram-reels-personalized-human-evaluation
Why can't Meta A/B test its way back to greatness? To move Instagram beyond short-term engagement metrics, we ran a personalized human evaluation asking 100 users to compare TikTok vs. Instagram Reels. Learn why Gen Z considers Reels the place where TikToks go to die, and what Instagram should do about it.
Published: Wed, 31 Aug 2022 00:00:00 +0000

Evaluating Generative AI: Did Astral Codex Ten Win His Bet on AI Progress?
https://www.surgehq.ai/blog/dall-e-vs-imagen-and-evaluating-astral-codex-tens-3000-ai-bet
Has Astral Codex Ten's bet on AI progress really been won? We asked Surgers to evaluate DALL·E and Imagen on Scott's 5 compositionality prompts!
Published: Thu, 29 Sep 2022 00:00:00 +0000

How TikTok is Evolving the Next Generation of Search
https://www.surgehq.ai/blog/how-tiktok-is-evolving-the-next-generation-of-search
TikTok has been taking over the world – and now, your Google Search results too. But when are they actually helpful? We ran a large-scale personalized human evaluation, asking Surgers to rate hundreds of <query, TikTok> pairs to find out.
Published: Tue, 25 Oct 2022 00:00:00 +0000

HellaSwag or HellaBad? 36% of this popular LLM benchmark contains errors
https://www.surgehq.ai/blog/hellaswag-or-hellabad-36-of-this-popular-llm-benchmark-contains-errors
We analyzed HellaSwag, a popular LLM benchmark, and found errors in 36% of its rows.
Published: Sun, 04 Dec 2022 00:00:00 +0000

AI Red Teams for Adversarial Training: How to Make ChatGPT and LLMs Adversarially Robust
https://www.surgehq.ai/blog/ai-red-teams-for-adversarial-training-making-chatgpt-and-large-language-models-adversarially-robust
How do you make large language models safer and adversarially robust to counterattacks? Learn about AI red teams of creative data labelers who try to interactively penetrate AI defenses in order to teach them.
Published: Mon, 12 Dec 2022 00:00:00 +0000

We Evaluated ChatGPT vs. Google on 500 Search Queries
https://www.surgehq.ai/blog/googles-existential-threat-chatgpt-matches-googles-performance-on-informational-search-queries-and-smashes-it-on-coding
We measured ChatGPT vs. Google on 500 search queries, and found that ChatGPT crushes Google on coding and ties it on general information – despite not being optimized for a search experience at all. Dive into this post to learn more about OpenAI's existential threat to Google.
Published: Wed, 21 Dec 2022 00:00:00 +0000

How Anthropic uses Surge AI to Train and Evaluate Claude
https://www.surgehq.ai/blog/anthropic-surge-ai-rlhf-platform-train-llm-assistant-human-feedback
Learn how Anthropic partnered with Surge AI to gather high-quality human feedback at scale using the RLHF platform, resulting in one of the safest and most advanced large language models on the planet.
Published: Thu, 09 Mar 2023 00:00:00 +0000

DALL·E 3 and Midjourney Fail Astral Codex Ten's Image Generation Bet
https://www.surgehq.ai/blog/dalle-3-and-midjourney-fail-astral-codex-tens-image-generation-bet
An update on Astral Codex Ten's image generation bet: close, but no dice. DALL·E 3 and Midjourney fail.
Published: Thu, 01 Aug 2024 00:00:00 +0000

Bringing light to the GPT-4o vs. GPT-5 personality controversy
https://www.surgehq.ai/blog/bringing-light-to-the-gpt-4o-vs-gpt-5-personality-controversy
GPT-5 was released on Aug 7, 2025. The swift removal of all legacy models from the ChatGPT UI was met with an even swifter backlash: some people online felt that GPT-4o was more personable, human, and engaging, whereas GPT-5 was stiff and robotic. This viral meme encapsulated the faction's thesis: …
Published: Fri, 15 Aug 2025 00:00:00 +0000

Unsexy AI Failures: The PDF That Broke ChatGPT
https://www.surgehq.ai/blog/the-pdf-that-broke-chatgpt
The AI world loves climbing leaderboards. Companies race to hit #1 on LMSYS, chase perfect scores on academic benchmarks, and demo SVGs of pelicans on bicycles. These achievements make for great headlines and impressive presentations – even when these metrics are easily hacked.
Published: Mon, 25 Aug 2025 00:00:00 +0000

Benchmarks are broken
https://www.surgehq.ai/blog/benchmarks-are-broken
Academic benchmarks make great headlines, and terrible AI.
Published: Sun, 07 Sep 2025 00:00:00 +0000

SWE-Bench Failures: When Coding Agents Spiral Into 693 Lines of Hallucinations
https://www.surgehq.ai/blog/when-coding-agents-spiral-into-693-lines-of-hallucinations
When coding models spiral into self-reinforcing hallucinations, small mistakes compound into catastrophic failure. In SWE-bench, we saw SOTA models invent whole classes, methods, and terminal outputs – never realizing they had lost touch with the real codebase. In this case study, we'll look at how three frontier coding agents tried to solve one particular SWE-bench problem: one spiraled into hallucinations and failed entirely, one spiraled but recovered, and one avoided hallucinations altogether. Our goal: to illustrate how dissecting real-world problems can steer models towards human-ready AGI.
Published: Mon, 15 Sep 2025 00:00:00 +0000

Unsexy AI Failures: Still Confidently Hallucinating Image Text
https://www.surgehq.ai/blog/unsexy-ai-failures-still-confidently-hallucinating-image-text
A core problem with today's AI systems isn't simply that they make mistakes – it's that they make mistakes confidently. They'll insist they can do something, describe exactly how they'll do it, and then deliver something completely wrong. We saw this in our last Unsexy Failures post, where a SOTA model confidently described generating a Word document – even though this was a completely fabricated capability! – and provided a link to nowhere.
Published: Mon, 22 Sep 2025 00:00:00 +0000

The Human/AI Frontier: A Conversation with Bogdan Grechuk
https://www.surgehq.ai/blog/the-human-frontier-bogdan-grechuk
At Surge AI, we work with the world's sharpest minds to push the limits of AI. Professor Bogdan Grechuk – an IMO gold medalist and Associate Professor at the University of Leicester – is one of them. We interviewed him about the work he does to train SOTA models to perform frontier research.
Published: Mon, 29 Sep 2025 00:00:00 +0000

Is Sonnet 4.5 the best coding model in the world?
https://www.surgehq.ai/blog/sonnet-4-5-coding-model-evaluation
On Surge AI's agentic coding benchmark, Claude Sonnet 4.5 outperformed GPT-5-Codex in accuracy, while GPT-5-Codex was more cost-efficient. Despite similar scores, the models differed in which tasks they failed. In a refactoring case study, Claude succeeded after persistent debugging, while GPT-5-Codex failed due to an unexplained decision to end the task early. Both stayed focused and avoided hallucinations even when encountering difficulties.
Published: Wed, 08 Oct 2025 00:00:00 +0000

A Product Take on Sonnet 4.5
https://www.surgehq.ai/blog/sonnet-4-5-product-take
After 100+ hours with Opus 4.1 and 20+ hours in the first week of Sonnet 4.5's launch, Nick Heiner, our VP of Product, gives his first impressions.
Published: Fri, 10 Oct 2025 00:00:00 +0000

200 finance experts tested frontier models on real tasks. Over 70% failed.
https://www.surgehq.ai/blog/finance-eval-real-world
We stress-tested GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4.5 on 200+ expert finance tasks. Here's where even the best models break when they move from benchmarks to Wall Street.
Published: Mon, 03 Nov 2025 00:00:00 +0000

RL Environments and the Hierarchy of Agentic Capabilities
https://www.surgehq.ai/blog/rl-envs-real-world
Running our RL environment on 9 models revealed the core capabilities all agents need to master: tool use, planning, adaptability, groundedness, and common sense.
Published: Mon, 03 Nov 2025 00:00:00 +0000
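The listing above comes from a standard RSS 2.0 feed, where each post is an `<item>` carrying `title`, `link`, `description`, and `pubDate` children under a `<channel>`. A minimal sketch of consuming such a feed with Python's standard library, using a hypothetical inline XML fragment that mirrors this feed's fields (a real client would fetch the feed URL instead):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mirroring this feed's RSS 2.0 structure:
# channel -> item -> title / link / description / pubDate.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Surge AI Blog</title>
    <item>
      <title>Benchmarks are broken</title>
      <link>https://www.surgehq.ai/blog/benchmarks-are-broken</link>
      <description>Academic benchmarks make great headlines, and terrible AI.</description>
      <pubDate>Sun, 07 Sep 2025 00:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(xml_text: str) -> list[dict]:
    """Extract title/link/pubDate for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "pubDate": item.findtext("pubDate"),
        }
        for item in root.iter("item")  # iter() finds items at any depth
    ]


for entry in parse_items(SAMPLE):
    print(entry["pubDate"], "-", entry["title"])
```

For production use, a dedicated parser such as the `feedparser` package is more forgiving of the malformed XML that real-world feeds often contain.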