---
title: "stochastic parrot deep mystery llms"
source: rss
source_url: https://stochasticparrot.substack.com/p/on-the-deep-mystery-of-language-models
tags: [newsletter, llm]
ingested: 2026-05-09
sha256: 0f4ff1638c4c51199de429b34345fc0fc62045a1c0355e3fd01eee7b4babc174
feed_name: Stochastic Parrot
source_published: 2025-09-25
type: raw
created: 2026-05-10
updated: 2026-05-10
---
# On the deep mystery of language models
A while back on [X](<https://x.com/emollick/status/1960919256452796440>), Ethan Mollick posed this challenge: 
> We really have not made a lot of progress on explaining the deep mystery of LLMs: 
> How does a model using matrix multiplication to predict the next word manage to simulate human thought well enough to do all the very human-like things it does? And what does that mean about us?
Hundreds replied. Some said "stochastic parrot," others said "what mystery", but most engaged the question and offered diverse, conflicting explanations with little consensus. 
Like Mollick, I'm mystified by a language model's ability to simulate human thought based solely on its ability to predict the next word in a sequence of words. I don't have an explanation; however, the mystery fades if the focus shifts from the technical details of language models to the massive corpus of text used to train these models.
This suggestion should not be surprising. Language models are based on the transformer architecture, an artificial neural network, like CNNs and RNNs, trained to detect patterns in data. What distinguishes transformers is the self-attention mechanism, which lets them capture long-range dependencies in sequences, making them especially effective for text. There is no mystery here: transformers find patterns in text data. If you are mystified by a language model's capabilities, the mystery lies not with the nature of the language model but with the nature of the training corpus that lies behind these models. 
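To make the self-attention mechanism a little more concrete, here is a minimal sketch in plain Python. It is an illustration under simplifying assumptions, not how any production model is implemented: the learned query, key and value projections are omitted, there is a single attention head, and the names `softmax`, `self_attention` and the toy vectors are hypothetical choices made for this example.

```python
import math

def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Causal scaled dot-product self-attention over a list of token vectors.

    Assumption: learned query/key/value projections are omitted, so each
    vector serves as its own query, key and value.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for t, query in enumerate(vectors):
        visible = vectors[: t + 1]  # a position sees only itself and earlier positions
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        weights = softmax(scores)
        # Each output is a weighted average of the visible vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, visible)) for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy input: positions 0 and 4 are similar, positions 1-3 are not.
tokens = [[4.0, 0.0], [0.0, 4.0], [0.0, 4.0], [0.0, 4.0], [4.0, 0.0]]
outputs, weights = self_attention(tokens)
print([round(w, 3) for w in weights[4]])
```

In this toy input the last position puts far more weight on the distant first position than on its immediate neighbor, because the weights depend on similarity rather than on distance. That is the sense in which self-attention can capture long-range dependencies.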
The training corpus is vast, sampling a wide swath of recorded language: literature, foundational works in science, philosophy, and history, many textbooks, journals and newspapers, and a substantial portion of the public web. While it's not exhaustive or perfectly representative, it's broad enough to act as a proxy for the written record. People are aware of the training corpus's breadth and see it as the explanation not only for a language model's comprehensive fluency in language, but also for its encyclopedic knowledge of the world. This impresses but does not mystify. What amazes is a language model's ability to infer, analyze and explain, and more generally to hold an extended conversation in which intricate and nuanced lines of thought intertwine, and to respond to those intricacies and nuances. Here, people lose their footing on the training corpus and find a language model's ability magical. 
We need to keep the focus on the training corpus by noting that being a proxy for the written record means more than containing knowledge. The corpus contains not only facts but derivations and inferences, arguments and reasons, analyses and explanations that justify and illuminate these facts. It also includes ruminations, meditations and reflections. And all of these arguments, explanations and reflections are organized into higher-level structures: dialogues, dialectics and narratives. These forms of discourse contain the content that we see replicated in language models. Why, then, do people not connect these forms of discourse with a language model's abilities? 
First, we note that as we move from knowledge to increasingly complex discourses, we require increasingly longer text to express these forms. You can state a fact in a single sentence and a simple argument in a short paragraph. But an extended argument can go on for pages, and other narrative forms can go on for hundreds of pages. Humans and language models differ in their sensitivity to and awareness of these longer forms of discourse.
Length does not pose a serious challenge to modern language models because of the context length used to train them. Context length is usually discussed in connection with prompting: it is the maximum length of text a language model can consider at once (the prompt plus the expected response must fit within the context length). But context length also plays a critical role in training, where it sets the maximum length of text that is sampled from the corpus and then processed for next-word prediction. Context length is in fact a scaling factor, like model size, training time and size of training corpus. And it is a scaling factor that is just as essential to a language model's performance as these other, better-known factors. The original GPT had a training context of only 512 tokens, or about 350 words, while GPT-4 and GPT-5 had training contexts of 8,192 and 256,000, respectively. The original GPT was limited in its capabilities no matter how long it was trained or on how much data, because none of its training segments could span enough text to allow the model to discover long-range patterns.
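The role of context length in training can be sketched in a few lines of Python. This is a simplified illustration, not a description of any lab's actual pipeline: it assumes the corpus is one list of word-level tokens cut into consecutive segments, and the function names `training_windows` and `dependency_is_visible` and the example sentence are hypothetical.

```python
def training_windows(tokens, context_length):
    """Cut a token sequence into consecutive training segments.

    Each segment yields (inputs, targets), where targets are the inputs
    shifted by one position: the model predicts token i+1 from tokens 0..i.
    """
    windows = []
    for start in range(0, len(tokens) - 1, context_length):
        chunk = tokens[start : start + context_length + 1]
        windows.append((chunk[:-1], chunk[1:]))
    return windows

def dependency_is_visible(windows, early_token, late_token):
    """True if some segment asks the model to predict late_token while
    early_token is among the inputs it can see at that point."""
    for inputs, targets in windows:
        for i, target in enumerate(targets):
            if target == late_token and early_token in inputs[: i + 1]:
                return True
    return False

# Hypothetical example: the verb "is" depends on the distant subject "key".
sentence = "the key that opens the old oak door at the end of the hall is brass"
tokens = sentence.split()

short = training_windows(tokens, context_length=4)
long = training_windows(tokens, context_length=16)

print(dependency_is_visible(short, "key", "is"))
print(dependency_is_visible(long, "key", "is"))
```

With a context of 4 tokens, no segment contains both "key" and the later "is", so the link between them never appears in a training example. With a context of 16 tokens the whole sentence fits in one segment and the dependency becomes a pattern the model can pick up.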
In contrast, once we move past the sentence, humans' sense of structure is less explicit, merging with more amorphous notions of meaning and content. We have a clear notion of syntax and of the structure of factual claims, and these exemplify patterns that can be detected by a language model. But as we consider larger forms of discourse we simply think of that as content. The discourse is not structureless, but the structure is semantic in character and we don't see how that can be readily detected. We don't see how a language model can just read off the content. This causes a problem for the explanation we would like to give. The training corpus contains forms of discourse that represent the full range of human thought and reasoning. These discourses can be quite lengthy, but language models are trained with context lengths that can span most of these discourses. A language model's ability to process these discourses during training provides the basis for its remarkable abilities. But you can't simply say that during training the language model reads off the content, because that's the very ability we are trying to explain. 
Here we need to be a bit more detailed and formal in characterizing the training corpus. The training corpus contains content, but this content is expressed in written language, and language has explicit structure, what I'll call here _linguistic form_. Linguistic form is the observable shape of language. Linguistic form is not just syntax and word distribution, but all higher forms of observable structure. All linguistic form is reducible to sequences of words. When a language model detects patterns in the training corpus, it is revealing aspects of its linguistic form. The question is how the training corpus exhibits the kind of linguistic form that allows for a rich determination of content. For a given body of text, to what extent can its content be determined or recovered from its linguistic form? This is not a question about language models but a question about the nature of language and thought. 
This question is not well understood, and most people are not even aware of it. But there are a few things we can say. First, the degree to which content can be recovered from form provides an upper limit to the capabilities of a raw language model. A language model's grasp of content is dependent on its grasp of linguistic form. If there is content that cannot be determined from linguistic form, then it is beyond the ken of language models. Conversely, the observed capabilities of language models are evidence that a rich set of content is recoverable from form. 
More tentatively, we can say that the [thought experiment](<https://stochasticparrot.substack.com/p/a-thought-experiment>) from a previous post indicates that content is never fully recoverable. We were asked to imagine a person learning an alien language using exactly the same methods used to train a language model. The result was a complete fluency in reading and responding to that language, but no actual understanding of what was being said. We can now say that what the subject learned was the linguistic form of the alien language, and what the thought experiment shows is that linguistic form alone is not sufficient to provide actual understanding of the content or any reference to the world. There is a great deal to be discussed regarding exactly what this means. Nonetheless, we can accept that this indicates some kind of limitation regarding what can be derived from linguistic form alone. 
Finally, we can say that despite its philosophical character the question of the relation between text content and linguistic form remains empirical. And this resonates with a remark made by Stephen Wolfram regarding language models that Mollick quotes. After emphasizing the remarkable abilities of language models Wolfram states:
> . . . it's amazing how human-like the results are. And as I've discussed, this suggests something that's at least scientifically very important: that human language (and the patterns of thinking behind it) are somehow simpler and more "law like" in their structure than we thought. ChatGPT has implicitly discovered it.
The notion of linguistic form exemplifies Wolfram's point: because the patterns of human language and thought are more law-like than we realize, they are represented in linguistic form and therefore accessible to a language model. Here we need to take to heart Wolfram's observation that this is a discovery made by ChatGPT. This suggests that we should not be using our understanding of language and thought to shed light on the capabilities of language models; rather, we should be viewing the newly found capabilities of language models as discoveries regarding the relation between language and thought.
Language models aren’t mysterious engines of thought; they are powerful mirrors of the structured record of human thinking embedded in their training corpora. As context scales, transformers learn long-range token regularities that track arguments, explanations, and narratives already present in text. So the real question isn’t how models “think,” but how linguistic form encodes content well enough to be recoverable. That is a problem about language (form vs. content), not about neural nets.
This is the question I'll continue to investigate in future posts. 