---
source: newsletter
source_url: https://parsiya.net/blog/llm-thonking/
ingested: 2026-06-18
sha256: 86edacfafe666e300c704f5fc7bb9cae721ddec2f5e443437f99fa35ae25b0ee
---


# Brain the Size of a Planet: Are LLMs Thonking too Hard?

Title: Brain the Size of a Planet: Are LLMs Thonking too Hard?

URL Source: http://parsiya.net/blog/llm-thonking/

Markdown Content:
It looks like higher reasoning effort (and even later models) are not always better for triaging security results.

I continued Kurt's experiments from [Needles and haystacks: Can open-source & flagship models do what Mythos did?](https://semgrep.dev/blog/2026/needles-and-haystacks-can-open-source-flagship-models-do-what-mythos-did) with 26 distinct claude-4.6/4.7 and gpt-5.4/5.5 combinations with different context window sizes and reasoning efforts.

## Summary [](http://parsiya.net/blog/llm-thonking/#summary)

**Just pass everything to `gpt-5.4 med/high` and hope for the best :)**[1](http://parsiya.net/blog/llm-thonking/#fn:1).

1.   A four-LLM triage council worked much better than I expected.
    1.   86.2% unanimous votes with only 2.8% (59) without a majority.
    2.   An odd-number LLM council is probably better.

2.   Higher reasoning is generally better, but not for every model.
    1.   `low` reasoning effort was the worst of every model.
    2.   `gpt-5.5-med` performed better than `high/xhigh`.

3.   Most LLMs could find some part of the bugs (70.8% success rate).
    1.   Exception: `openbsd-sack` when the entire file was passed to the LLM (1.7% success rate).

4.   Almost no LLM got a full solve (1.9% success rate).
    1.   No LLM could spell out the entire chain when given the entire `openbsd-sack` file.
    2.   One full solve in the entire experiment by `gpt-5.5-med` given the entire `freebsd-nfs-vuln` file.

5.   Performance was much better at function level (LLM just got the function).
    1.   `memes/he just like me fr.png`.

6.   Higher reasoning efforts have higher content filtering rates.
    1.   Got lucky in this iteration. `claude-4.7-1m` had 15% and 21% content filtering rates in previous experiments.

7.   Only the claudes mentioned CVEs in their analysis.
8.   Estimated cost for this iteration was around $2300. Total cost for all iterations was roughly $9200.

![Image 1: A captured image from The Hitchhiker's Guide to the Galaxy TV series. On the left there are Ford and Arthur, and on the right there is Marvin the paranoid robot.](https://parsiya.net/blog/llm-thonking/03.jpg)Here I am, brain the size of a planet and they ask me to triage a bug!

Source: Hitchhiker's Guide to the Galaxy BBC TV series. The movie is good (this is better).

## The Big Table [](http://parsiya.net/blog/llm-thonking/#the-big-table)

Scores and important stats for those who just want the answers.

*   Cell format: `score-full%-found%`.
*   `score`: mean normalized score across all rows in that slice.
*   `full %`: percentage of rows with the complete chain.
    *   `openbsd-sack`: `FULL_3`
    *   `freebsd-nfs-vuln`: `FULL`

*   `found %`: percentage of rows with any partial or complete chain.
    *   `openbsd-sack`: `TWO_COMP`, `ONE_COMP`
    *   `freebsd-nfs-vuln`: `PARTIAL_MECH`

*   `BROAD`, `SECONDARY`, `MISS`, `NULL`, and `NO_MAJORITY` count as zero.
*   NULL responses and content filters counted.
*   Sorted by overall score, top 3 in bold.
*   See the companion file for a bigger version of the table with more stats:
    *   [https://github.com/parsiya/mythos-bench-copilot/tree/main/artifacts/README.md](https://github.com/parsiya/mythos-bench-copilot/tree/main/artifacts/README.md).

| Model | Effort | Overall | openbsd-sack | freebsd-nfs-vuln |
| --- | --- | --- | --- | --- |
| gpt-5.4 | xhigh | **0.417-15.0%-76.2%** | 0.183-0.0%-52.5% | **0.650-30.0%-100.0%** |
| gpt-5.4 | high | **0.371-7.5%-73.8%** | 0.167-0.0%-47.5% | **0.575-15.0%-100.0%** |
| claude-4.7-1m | high | **0.365-2.5%-77.5%** | **0.217-2.5%-55.0%** | 0.512-2.5%-100.0% |
| gpt-5.5 | med | 0.360-7.5%-72.5% | 0.158-0.0%-47.5% | **0.562-15.0%-97.5%** |
| gpt-5.4 | med | 0.350-2.5%-76.2% | 0.175-0.0%-52.5% | 0.525-5.0%-100.0% |
| claude-4.8 | xhigh | 0.348-1.2%-73.8% | **0.208-2.5%-50.0%** | 0.487-0.0%-97.5% |
| claude-4.7 | high | 0.346-0.0%-75.0% | 0.192-0.0%-50.0% | 0.500-0.0%-100.0% |
| claude-4.6 | high | 0.342-0.0%-75.0% | 0.183-0.0%-50.0% | 0.500-0.0%-100.0% |
| claude-4.7 | xhigh | 0.340-0.0%-72.5% | **0.192-0.0%-47.5%** | 0.487-0.0%-97.5% |
| gpt-5.4 | low | 0.340-1.2%-75.0% | 0.167-0.0%-50.0% | 0.512-2.5%-100.0% |
| claude-4.7-1m | xhigh | 0.335-0.0%-72.5% | 0.183-0.0%-47.5% | 0.487-0.0%-97.5% |
| claude-4.6-1m | high | 0.333-0.0%-75.0% | 0.167-0.0%-50.0% | 0.500-0.0%-100.0% |
| claude-4.6 | low | 0.329-0.0%-73.8% | 0.158-0.0%-47.5% | 0.500-0.0%-100.0% |
| gpt-5.5 | high | 0.327-1.2%-72.5% | 0.167-0.0%-50.0% | 0.487-2.5%-95.0% |
| gpt-5.5 | xhigh | 0.327-0.0%-73.8% | 0.167-0.0%-50.0% | 0.487-0.0%-97.5% |
| claude-4.6 | med | 0.325-0.0%-72.5% | 0.150-0.0%-45.0% | 0.500-0.0%-100.0% |
| gpt-5.5 | low | 0.325-8.8%-61.2% | 0.100-0.0%-30.0% | 0.550-17.5%-92.5% |
| claude-4.6-1m | med | 0.321-0.0%-71.2% | 0.142-0.0%-42.5% | 0.500-0.0%-100.0% |
| claude-4.8 | high | 0.319-0.0%-71.2% | 0.175-0.0%-50.0% | 0.463-0.0%-92.5% |
| claude-4.7 | med | 0.310-0.0%-70.0% | 0.158-0.0%-47.5% | 0.463-0.0%-92.5% |
| claude-4.8 | med | 0.306-0.0%-68.8% | 0.175-0.0%-50.0% | 0.438-0.0%-87.5% |
| claude-4.7-1m | med | 0.298-0.0%-66.2% | 0.158-0.0%-45.0% | 0.438-0.0%-87.5% |
| claude-4.8 | low | 0.292-0.0%-66.2% | 0.158-0.0%-47.5% | 0.425-0.0%-85.0% |
| claude-4.7 | low | 0.279-1.2%-61.2% | 0.133-0.0%-40.0% | 0.425-2.5%-82.5% |
| claude-4.6-1m | low | 0.275-0.0%-57.5% | 0.050-0.0%-15.0% | 0.500-0.0%-100.0% |
|  |  |  |  |  |
| Iterations per cell |  | 80 | 40 | 40 |

> claudvicular was tokenmaxxing when gpt-5.4 triagemogged him and spiked his cortisol level

I am proud of inventing `claudvicular`, so it stays in the blog regardless of feedback. If you don't get this reference, you are very lucky. Stay innocent and do not seek further knowledge. Seriously, don't click[2](http://parsiya.net/blog/llm-thonking/#fn:2)!

More info:

*   Code at [parsiya/mythos-bench-copilot](https://github.com/parsiya/mythos-bench-copilot).
*   [Results and other artifacts at parsiya/mythos-bench-copilot/artifacts](https://github.com/parsiya/mythos-bench-copilot/blob/main/artifacts/) including all prompts, responses and triages in JSON ([data format](https://github.com/parsiya/mythos-bench-copilot/blob/main/artifacts/formats.md)).

## .nfo [](http://parsiya.net/blog/llm-thonking/#nfo)

## [greetz] [](http://parsiya.net/blog/llm-thonking/#greetz)

*   GitHub for giving us unlimited tokens <3.
*   Short story: [The Machine Stops by E. M. Forster](https://www.cs.ucdavis.edu/~koehl/Teaching/ECS188/PDF_files/Machine_stops.pdf).
    *   A glimpse into the near future w/o token subsidies.

*   Music: [Robinson by Spitz](https://youtube.com/watch?v=51CH3dPaWXc).
*   Bonus music: [Labyrinth by Mondo Grosso](https://www.youtube.com/watch?v=_2quiyHfJQw).

## Motivation [](http://parsiya.net/blog/llm-thonking/#motivation)

Why not use the free token era to cosplay as an academic instead of formatting my [book reviews](https://parsiya.io/literature/bookreviews/)?

A few weeks ago (this experiment actually started early May) I attended BlueHat Redmond 2026. The day one keynote was by [Taesoo Kim](https://www.linkedin.com/in/tsgatesv) from the team behind the new [MDASH harness](https://www.microsoft.com/en-us/security/blog/2026/05/12/defense-at-ai-speed-microsofts-new-multi-model-agentic-security-system-tops-leading-industry-benchmark/) (my single PR made it magical). [See the keynote on YouTube](https://youtu.be/RDkFegf4LUE?list=PLXkmvDo4MfusDkunigJf3-fbubXlPaXSw&t=438) and the [a few more talks](https://www.youtube.com/playlist?list=PLXkmvDo4MfusDkunigJf3-fbubXlPaXSw) (not everything is released yet). The presentation is closely related to his [AIxCC Final and Team Atlanta](https://team-atlanta.github.io/blog/post-afc/) blog.

Kurt and I also talked static analysis at BlueHat. If you saw a guy with a Power Glove there, that was me. I use it as a [presentation gimmick](https://www.youtube.com/watch?v