---
title: "Model Size Scaling in 2023-2031"
source_url: "https://www.lesswrong.com/posts/yLHiQGCPdvzL9fBn3/model-size-scaling-in-2023-2031"
ingested: "2026-06-23"
type: article
tags: [article]
---

# Model Size Scaling in 2023-2031

Published Time: 2026-06-22T23:07:11.323Z

Token generation speed is constrained by the speed at which the relevant HBM can be read, which is mostly the weights and KV-cache. Suppose a model is large, so that more than half of HBM is read when making a single pass over the weights, it's being read in parallel within a scale-up system, and N such systems are used in a pipeline. Then the time it takes to generate a token (without speculative decoding) is at least the time of reading more than half of an HBM stack times N. If we target a particular speed of token generation, this puts a constraint on the number of pipeline stages, which puts a constraint on the total params of the model. But if there isn't enough pretraining compute, models will remain smaller than this constraint (lower sparsity at a given number of active params buys a higher speed of token generation), so both should be taken into account.

Working through these considerations gives model sizes feasible for each year between 2023 and 2031. The total params go from 10T in 2026 (at 8x sparsity, still constrained by Oberon racks, trained for 1.3e27 FLOPs) to 240T in 2028 (at 30x sparsity, with Kyber racks more than sufficient for the available pretraining compute) and then to 1.4 quadrillion in 2031 (30x sparsity, served with 8x Kyber Feynman scale-up systems, trained for 2.2e29 FLOPs). Starting in 2027, model sizes are further inflated by the lack of sufficient pretraining data, with models of 2031 having to be 4x bigger than unlimited training data would predict.
There are many assumptions that go into the estimates, which I will state as they come up.

## Time to Fully Read an HBM Stack

H100 has 5 stacks of 8-Hi HBM3 per compute die (2 GB per DRAM die, 0.8 TB/s per stack), 20 ms to fully read. H200 has 6 stacks of 12-Hi HBM3[[1]](http://www.lesswrong.com/posts/yLHiQGCPdvzL9fBn3/model-size-scaling-in-2023-2031#fn-iN4rjTQjQfPTCYb5o-1) per compute die, 30 ms to read.

B200/GB200 has 4 stacks of 8-Hi HBM3E per compute die (3 GB per DRAM die, 1.0 TB/s per stack), 24 ms to read. GB300 has 4 stacks of 12-Hi HBM3E, 36 ms to read.

Non-Ultra Rubin chips of 2026 have 4 stacks of 12-Hi HBM4 per compute die (3 GB per DRAM die, 2.75 TB/s per stack[[2]](http://www.lesswrong.com/posts/yLHiQGCPdvzL9fBn3/model-size-scaling-in-2023-2031#fn-iN4rjTQjQfPTCYb5o-2)), 13 ms to read. Rubin Ultra [probably uses 12-Hi stacks](https://www.lesswrong.com/posts/fdCaCDfstHxyPmB9h/vladimir_nesov-s-shortform?commentId=h9LzsJDpmAJvrriBK) of HBM4E[[3]](http://www.lesswrong.com/posts/yLHiQGCPdvzL9fBn3/model-size-scaling-in-2023-2031#fn-iN4rjTQjQfPTCYb5o-3) (4 GB per DRAM die, 2048 pins, 14-16 Gbps per pin, which is 3.6-4.0 TB/s per stack[[4]](http://www.lesswrong.com/posts/yLHiQGCPdvzL9fBn3/model-size-scaling-in-2023-2031#fn-iN4rjTQjQfPTCYb5o-4)), 12-13 ms to read.

If HBM4E stays at 16 Gbps per pin (now reliably achieving it), but the stacks become 16-Hi, [first-year Feynman in 2028 might use 2 stacks](https://www.lesswrong.com/posts/fdCaCDfstHxyPmB9h/vladimir_nesov-s-shortform?commentId=h9LzsJDpmAJvrriBK) per compute die, which takes 16 ms to fully read. HBM5 of 2029 (that probably goes into the second year of Feynman) is expected to use 4096 pins and might start at 8 Gbps per pin, though the situation with HBM4 for non-Ultra Rubin suggests anchoring to 11 Gbps instead. This gives 5.6 TB/s per stack. With 5 GB per DRAM die, a 16-Hi stack takes 14 ms to read.
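
The read times above all come from dividing the capacity of a stack by its bandwidth. A minimal Python sketch of that arithmetic, using only the figures quoted in this section. The function and variable names are illustrative rather than taken from any library, and the Rubin Ultra entry assumes the lower 14 Gbps per pin figure:

```python
def stack_bandwidth_tbps(pins: int, gbps_per_pin: float) -> float:
    """Stack bandwidth in TB/s from pin count and per-pin data rate."""
    return pins * gbps_per_pin / 8 / 1000  # Gbit/s -> GB/s -> TB/s


def read_time_ms(dies_per_stack: int, gb_per_die: float, tbps: float) -> float:
    """Time to read one full HBM stack once, in milliseconds."""
    capacity_gb = dies_per_stack * gb_per_die
    return capacity_gb / tbps  # GB divided by TB/s comes out in ms


# name: (DRAM dies per stack, GB per DRAM die, TB/s per stack)
STACKS = {
    "H100": (8, 2, 0.8),
    "H200": (12, 2, 0.8),
    "GB200": (8, 3, 1.0),
    "GB300": (12, 3, 1.0),
    "Rubin": (12, 3, 2.75),
    "Rubin Ultra": (12, 4, stack_bandwidth_tbps(2048, 14)),       # assumed 14 Gbps
    "Feynman (year 1)": (16, 4, stack_bandwidth_tbps(2048, 16)),
    "Feynman (year 2)": (16, 5, stack_bandwidth_tbps(4096, 11)),
}

READ_MS = {name: read_time_ms(*spec) for name, spec in STACKS.items()}

for name, ms in READ_MS.items():
    print(f"{name}: {ms:.1f} ms")
```

Only the ratio of capacity to bandwidth per stack matters here, so the number of stacks per compute die drops out as long as all stacks are read in parallel.
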
Thus we have **20 ms and 30 ms for H100 and H200, 24 ms and 36 ms for GB200 and GB300, 13 ms for both Rubin and Rubin Ultra, 16 ms and 14 ms for first- and second-year Feynman**.

## Maximal Pipelines Below 80 Tokens/s

It takes almost 3x more time to fully read an HBM stack of GB300 than that of non-Ultra Rubin. Such differences suggest that the rule of thumb for the number of scale-up systems with the capacity to hold the largest models should be sensitive to their HBM.

For concreteness, let's target [token generation speed](https://artificialanalysis.ai/#speed) constrained at 80 tokens/s per request (just from reading HBM, forcing even lower speed in practice, maybe 60 tokens/s), assuming speculative decoding (or [multi-token prediction, MTP](https://arxiv.org/abs/2606.12370)) [that speeds up generation](https://inferencex.semianalysis.com/) by 3x. Then we need to generate/accept some tokens for any given request every 37.5 ms.

Let's say we are using 3 racks of GB200 Oberon in a pipeline, and half of all HBM is hosting weights. The KV cache in the other half of a given rack won't actually be fully read on each pass through the weights, because only 1 out of 3 requests will be active during the current phase of the pipeline (the other 2 out of 3 requests are being processed at the other racks). Thus only 67% of HBM is read by a pipeline stage in this setup (50% for weights, and 17% for KV cache). This ideally takes only 67% of the time of reading a full HBM stack of GB200, 16 ms instead of 24 ms. Going through the whole pipeline of 3 racks would only take a request 48 ms, rather than 72 ms (on reading HBM alone).

As a request traverses a pipeline of N racks, it waits for all weights at all the racks to be read (taking half as much time as reading N full HBM stacks), and it witnesses the reading of 1/N of the KV cache in each of the racks (taking half as much time as reading a single HBM stack in total).
So if we want the pipeline to complete in 37.5 ms with GB200, 12 ms will be spent reading the KV cache regardless of the length of the pipeline, and the remaining 25.5 ms can be used to read the weights, which takes 12 ms per pipeline stage, meaning at most 2 stages. That is, 75 ms divided by 24 ms gives an upper bound on N+1.

This means a pipeline with the above assumptions can use **at most 2 H100 servers, 1 H200 server, 2 GB200 Oberon racks, 1 GB300 Oberon rack, 4 Rubin or Rubin Ultra racks (either Oberon or Kyber), 3 systems of 8x Kyber with first-year Feynman, 4 systems of 8x Kyber with second-year Feynman**.

Assuming [12-Hi HBM stacks for Rubin Ultra, 2 HBM stacks per compute die and 576 compute dies per rack for Feynman](https://www.lesswrong.com/posts/fdCaCDfstHxyPmB9h/vladimir_nesov-s-shortform?commentId=h9LzsJDpmAJvrriBK), total **HBM per scale-up system is 640 GB for a H100 server, 1.1 TB for H200, 14 TB for GB200 Oberon, 20 TB for GB300 and Rubin Oberon racks, 110 TB for Rubin Ultra Kyber, 590 TB for 8x Kyber systems with first-year Feynman chips, 740 TB for 8x Kyber systems with second-year Feynman chips**.

If half of the HBM capacity is spent on FP4 weights, and the buildout sufficiently completes a year after the system is released, we get the following upper bounds on total params of the largest models that can be served with at most 80 tokens/s per request (with 3x speedup via MTP). The **constraint from pipelining on total FP4 params is 1.3T for H100 in 2023, 1.1T for H200 in 2024, 27T for GB200 Oberon in 2025, 20T for GB300 Oberon in 2026, 83T for Rubin Oberon racks in 2027, 442T for Rubin Ultra Kyber racks in 2028, 1.7 quadrillion for 8x Kyber systems with first-year Feynman chips in 2029, 2.9 quadrillion for 8x Kyber systems with second-year Feynman chips in 2030**. If serving in FP8 becomes important for very large models, the number of total params enabled by the scale-up systems is 2x lower.
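
The pipeline bound and the resulting parameter ceilings can be sketched in a few lines of Python. This is an illustration under the stated assumptions (half of HBM for weights, FP4 at half a byte per param, 3x MTP, 80 tokens/s), not a serving model. The names are hypothetical, and the inputs are the rounded read times and HBM totals quoted above, so the parameter ceilings come out a few percent off the bolded figures:

```python
import math

TOKENS_PER_S = 80       # target per-request speed (HBM-read ceiling)
MTP_SPEEDUP = 3         # tokens accepted per lap of the pipeline
WEIGHT_FRACTION = 0.5   # share of HBM holding weights; the rest is KV cache
BYTES_PER_PARAM = 0.5   # FP4

BUDGET_MS = 1000 * MTP_SPEEDUP / TOKENS_PER_S  # 37.5 ms per lap


def lap_time_ms(stages: int, stack_read_ms: float) -> float:
    """HBM read time seen by one request traversing the whole pipeline."""
    weights = stages * WEIGHT_FRACTION * stack_read_ms
    kv_cache = (1 - WEIGHT_FRACTION) * stack_read_ms  # 1/N per stage, N stages
    return weights + kv_cache


def max_stages(stack_read_ms: float) -> int:
    """Longest pipeline whose lap still fits in the time budget."""
    kv_cache = (1 - WEIGHT_FRACTION) * stack_read_ms
    return math.floor((BUDGET_MS - kv_cache) / (WEIGHT_FRACTION * stack_read_ms))


def max_total_params_t(stages: int, hbm_tb: float) -> float:
    """Total params, in trillions, fitting in the weight share of the pipeline."""
    return stages * hbm_tb * WEIGHT_FRACTION / BYTES_PER_PARAM


# name: (stack read time in ms, HBM per scale-up system in TB)
SYSTEMS = {
    "H100 server": (20, 0.64),
    "H200 server": (30, 1.1),
    "GB200 Oberon": (24, 14),
    "GB300 Oberon": (36, 20),
    "Rubin Oberon": (13, 20),
    "Rubin Ultra Kyber": (13, 110),
    "8x Kyber, Feynman year 1": (16, 590),
    "8x Kyber, Feynman year 2": (14, 740),
}

for name, (read_ms, hbm_tb) in SYSTEMS.items():
    n = max_stages(read_ms)
    print(f"{name}: {n} stages, {max_total_params_t(n, hbm_tb):.1f}T FP4 params")
```

With half of HBM holding weights, `max_stages` reduces to the rule in the text: N+1 is at most 75 ms divided by the stack read time.
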
And of course this is an overestimate for model sizes that can actually be served at 80 tokens/s, since I'm ignoring the networking and compute overhead that couldn't be masked. The main way in which the network isn't ignored is that expert parallelism must happen only within scale-up systems rather than across different scale-up systems. This is something that for example doesn't apply to [DeepSeek-V3](https://arxiv.org/abs/2412.19437) (see [Section 3.4.2](https://arxiv.org/pdf/2412.19437v2#page=19)), but the resulting design constraints likely can't be sustained for the largest models.

## Pretraining Compute

Based on Nvidia's apparent bet on FP8 in Rubin, where the FP8 to BF16 performance ratio is 4.4, which used to be 2 in the previous chips, I'm assuming that even the largest models will be pretrained in FP8. And I'm going to assume 3 months of pretraining at 40% FLOP/s utilization.

For 2023, the anchor is [GPT-4 rumors of pretraining on 20K-25K A100s](https://newsletter.semianalysis.com/p/100000-h100-clusters-power-network) (20 MW of IT power), which at 0.3e15 BF16 FLOP/s (no FP8) per compute die gives 2.1e25 FLOPs under the above assumptions. For models of 2024 (trained in late 2023 or early 2024), that's 32K H100s (one building at the [Microsoft Goodyear site](https://newsletter.semianalysis.com/i/178649945/microsoft-and-openai-training-clusters-from-a-fraction-of-a-building-to-the-worlds-largest-facility)), 50 MW of IT power, 2e15 FP8 FLOP/s per compute die, 2e26 FLOPs. For 2025, [100K H100s, 150 MW](https://newsletter.semianalysis.com/p/100000-h100-clusters-power-network), 6e26 FLOPs. For 2026 models, [200K H100s or 300K Trainium 2 Ultra, 300 MW in both cases](https://www.lesswrong.com/posts/WjaGAA4xCAXeFpyWm/my-picture-of-the-present-in-ai?commentId=4f3hAJ4GghabQLqNy), 1.3e27 FLOPs.
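
The FLOPs figures for 2023-2026 follow from multiplying the number of compute dies, the peak FLOP/s per die, the 40% utilization, and the training time. A small Python sketch of that product; treating "3 months" as a quarter of a 365-day year and taking 22.5K as the midpoint of the 20K-25K A100 range are my assumptions, and the names are illustrative:

```python
UTILIZATION = 0.40
# Assumption: "3 months" is taken as a quarter of a 365-day year.
TRAINING_SECONDS = 365 / 4 * 24 * 3600


def pretraining_flops(compute_dies: float, flops_per_die: float) -> float:
    """Total pretraining FLOPs for a cluster under the stated assumptions."""
    return compute_dies * flops_per_die * UTILIZATION * TRAINING_SECONDS


# year of the model: (compute dies, peak FLOP/s per die)
CLUSTERS = {
    2023: (22_500, 0.3e15),   # A100, BF16; midpoint of 20K-25K (assumption)
    2024: (32_000, 2e15),     # H100, FP8
    2025: (100_000, 2e15),    # H100, FP8
    2026: (200_000, 2e15),    # H100, FP8
}

RUNS = {year: pretraining_flops(*spec) for year, spec in CLUSTERS.items()}

for year, flops in RUNS.items():
    print(f"{year}: {flops:.2e} FLOPs")
```

The results agree with the quoted figures to within rounding; the exact values shift by a percent or two depending on how many days "3 months" is taken to mean.
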
The 300 MW figure for late 2025 pretraining compute anchors to 1-2 GW of total first-party compute per AI company at the end of 2025, which becomes 3-4 GW at the end of 2026, [and 10 GW by the end of 2027](https://www.youtube.com/watch?v=mDG_Hx3BSUE&t=4330s). So for 2027 models, I'm guessing 1 GW of pretraining compute, which can't be H100s but could be B200/GB200 or Trainium 2 Ultra.

IT power of a GB200/GB300 Oberon rack (72 packages per rack, 2 compute dies per package) might be [140-180 kW](https://newsletter.semianalysis.com/i/174558674/nvidia-oberon-rack-architecture-upgrade-vr-nvl144-cpx-vr-nvl144-vr-cpx), possibly with another 10% of networking equipment overhead on top. This gives 730K-930K compute dies per 1 GW of IT power, and a concrete example is the OpenAI Abilene site with 800K compute dies. Blackwell dies produce 2.5e15 FP8 FLOP/s, which gives 6e27 FLOPs for a 2027 model under the above assumptions. A 2028 model might use 3 GW of pretraining compute (out of 10 GW of total first-party compute an AI company has), which can still be Blackwell, so 2e28 FLOPs.

For 2028-2030 compute, I'm going to [anchor to the SemiAnalysis estimate of global AI compute](https://www.lesswrong.com/posts/fdCaCDfstHxyPmB9h/vladimir_nesov-s-shortform?commentId=xERmNJsHGfMCoFnBN), obtaining an estimate for the total first-party compute available to an AI company that starts with 10 GW in late 2027, then goes to 17 GW in 2028, to 28 GW in 2029, and 40 GW in 2030. This suggests 5 GW of pretraining compute for a 2029 model (about a third of the 17 GW available in late 2028), which probably can no longer be Blackwell and must be Rubin (up to 13 GW out of the 17 GW is Rubin compute, though probably less, and maybe only 5-6 GW is in Oberon racks). At [225 kW per Oberon rack](https://newsletter.semianalysis.com/i/174558674/nvidia-oberon-rack-architecture-upgrade-vr-nvl144-cpx-vr-nvl144-vr-cpx) (possibly plus 10% in networking overhead), this is 580K compute dies per GW.
At 8.75e15 FP8 FLOP/s per compute die, 5 GW give 8e28 FLOPs for a 2029 model.

At the end of 2029, there's 7 GW of Rubin Ultra Kyber racks from 2028, the 11 GW of the newer first-year Feynman 8x Kyber systems from 2029 can serve the older largest models, and there's 28 GW of first-party compute in total, so maybe 7 GW of Rubin could go into pretraining, which is 1e29 FLOPs.

At the end of 2030, there's 40 GW of first-party compute and after the second-year Feynman buildout the large 8x Kyber scale-up systems are no longer scarce, so maybe the 11 GW of first-year Feynman compute might go into pretraining. To estimate the pretraining FLOPs, I need estimates of total IT power and FP8 FLOP/s per compute die for Feynman. Assuming a 30% higher draw than Rubin, and considering the TSMC N3P to A16 jump, [I'm guessing 14e15 FP8 FLOP/s per compute die](https://www.lesswrong.com/posts/fdCaCDfstHxyPmB9h/vladimir_nesov-s-shortform?commentId=xERmNJsHGfMCoFnBN). At 30% more power, there would be only 450K Feynman compute dies per GW. A 2031 model could then be pretrained for 2.2e29 FLOPs.

## Active Params from Pretraining Compute

I'm assuming there are up to 200T tokens of unique pretraining data (maybe half of it text data). Based on [this Jan 2025 paper](https://arxiv.org/abs/2501.12370), the compute optimal ratio of tokens per active param is 3x higher for an MoE model with 8x sparsity compared to a dense model, and 6x higher for an MoE model with 30x sparsity, see [Figure 11 and Figure 12, left](https://arxiv.org/pdf/2501.12370v3#page=28). Based on the [Jul 2024 Llama 3 405B report](https://arxiv.org/abs/2407.21783), the compute optimal ratio for a dense model is about 40 tokens/param at 4e25 FLOPs, see [Figure 2 and Figure 3](https://arxiv.org/pdf/2407.21783v3#page=8). Putting these anchors together, we get 120 and 240 tokens/param respectively.
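
Given a tokens/param ratio, a compute budget pins down the compute optimal number of active params. The sketch below assumes the standard approximation that training FLOPs equal 6 times active params times tokens. The text does not state this formula, so treat it as my assumption; it gives, for example, about 170B active params and 20T tokens for 2.1e25 FLOPs at 8x sparsity. The function name is hypothetical:

```python
import math

DENSE_TOKENS_PER_PARAM = 40
# sparsity -> multiplier on the dense tokens/param ratio (from the anchors above)
RATIO_MULTIPLIER = {8: 3, 30: 6}


def compute_optimal(flops: float, sparsity: int) -> tuple[float, float, float]:
    """Return (active params, training tokens, total params).

    Assumes FLOPs = 6 * active_params * tokens and
    tokens = ratio * active_params, hence active = sqrt(FLOPs / (6 * ratio)).
    """
    ratio = DENSE_TOKENS_PER_PARAM * RATIO_MULTIPLIER[sparsity]
    active = math.sqrt(flops / (6 * ratio))
    return active, ratio * active, sparsity * active


for flops, sparsity in [(2.1e25, 8), (2e26, 8), (2e28, 30)]:
    active, tokens, total = compute_optimal(flops, sparsity)
    print(f"{flops:.1e} FLOPs at {sparsity}x: "
          f"{active / 1e12:.2f}T active, {tokens / 1e12:.0f}T tokens, "
          f"{total / 1e12:.1f}T total")
```

This ignores any limit on unique data; it only gives the size a model would want if enough tokens were available.
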
A [May 2026 paper](https://arxiv.org/abs/2605.01640) [can be taken to vaguely suggest](https://www.lesswrong.com/posts/fdCaCDfstHxyPmB9h/vladimir_nesov-s-shortform?commentId=gcPjZRwivx4o3fXCH) that the shortfall in unique data should be distributed equally between epochs of repetition and an increase in active params over the compute optimal number.

For 2023 models, 2.1e25 FLOPs ask for 170B active params at about 8x sparsity (with 20T training tokens). The constraint from the contemporary scale-up systems of 1.3T total params with 2 H100 servers in FP4 fits right at the 8x sparsity. The GPT-4 rumors say 1.8T total params, but RLVR and long reasoning weren't a concern then.

For 2024 models, 2e26 FLOPs ask for 530B active params at about 8x sparsity (with 64T training tokens), which asks for 4.2T total params, which even in FP4 is way above the bound of 1.3T total params from 2 H100 servers or 1.1T total params from 1 H200 server. Thus 2024 models are significantly constrained by the available scale-up systems and should be much smaller than compute optimal, or else they must be slow and expensive.

A 2025 model trained for 6e26 FLOPs would want 900B active params at about 8x sparsity (with 110T training tokens), which asks for 7.2T total params, well within the bound of 27T total NVFP4 params from 2 racks of GB200 Oberon. GPT-4.5 was probably at that scale, but it was released before there were GB200 Oberon racks to serve it. With B200 NVL8, which might've been available in sufficient numbers, it would require 5 servers to fit, meaning 68 ms of reading HBM in one lap of the pipeline (47% of the 5 servers of 1.53 TB each is weights, 1/5 of the KV cache in the remaining 53% is read on one pass of a pipeline), or at most 44 tokens/s even with a 3x boost from MTP. If GPT-4.5 was actually faster (before GB200 NVL72 were plausibly available), maybe it was actually smaller.
For some reason it wasn't released with RLVR after there were more GB200 Oberon racks to serve it cheaply.

A 2026 model trained for 1.3e27 FLOPs wants 1.3T active params at about 8x sparsity (with 160T training tokens), which is 10T total params, fitting well under the constraint from pipelining with the contemporary scale-up systems of either 27T FP4 params from 2 GB200 Oberon racks, or 20T FP4 params from 1 GB300 Oberon rack. This even fits in 1 GB200 Oberon rack in NVFP4, and could use FP8 with 2 GB200 Oberon racks. The largest models of 2026 are not constrained by scale-up systems if they are content with 8x sparsity.

For a 2027 model at around 8x sparsity, the 6e27 FLOPs of pretraining would want 2.9T active params with 350T tokens of training data, which is 1.75x more than the assumed 200T tokens of unique data that is actually available. [Splitting the shortfall equally](https://www.lesswrong.com/posts/fdCaCDfstHxyPmB9h/vladimir_nesov-s-shortform?commentId=gcPjZRwivx4o3fXCH) between more active params and more epochs of repetition, we need 1.32x more active params, which is 3.8T (trained for 1.32 epochs on the 200T unique tokens), with 30T total params. This is significantly below the constraint of 83T total params from a pipeline of 4 Rubin Oberon racks. Unclear if the AI companies would elect to go with more sparsity, increasing total params, or much higher speed from shorter pipelines. But in 2027, 30x sparsity remains out of reach[[5]](http://www.lesswrong.com/posts/yLHiQGCPdvzL9fBn3/model-size-scaling-in-2023-2031#fn-iN4rjTQjQfPTCYb5o-5).

A 2028 model is constrained at a staggering 442T NVFP4 params if served with a pipeline of 4 Rubin Ultra Kyber racks. So targeting 30x sparsity, 2e28 FLOPs want 3.7T active params and 900T training tokens (from the assumption of 240 tokens/param being compute optimal at this sparsity). This is a shortfall of 4.5x in unique data, which needs a model with 2.1x more active params trained for 2.1 epochs of repetition.
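
The shortfall-splitting step used for the 2027 and 2028 models can be sketched as follows. Reading "equally" as giving each side the square root of the shortfall is my interpretation; it is the reading that reproduces the 1.32x and 2.1x factors, and it keeps the product of params and tokens seen (and so the compute) unchanged. The function name is hypothetical:

```python
import math

UNIQUE_TOKENS = 200e12  # assumed stock of unique pretraining tokens


def split_shortfall(active: float, tokens: float,
                    unique: float = UNIQUE_TOKENS) -> tuple[float, float]:
    """Return (adjusted active params, epochs over the unique data).

    The shortfall tokens / unique is split geometrically: active params
    and epochs each grow by sqrt(shortfall), so that
    adjusted_active * (unique * epochs) == active * tokens.
    """
    shortfall = tokens / unique
    if shortfall <= 1:
        return active, 1.0  # enough unique data, no adjustment needed
    factor = math.sqrt(shortfall)
    return active * factor, factor


# (year, compute optimal active params, compute optimal tokens, sparsity)
for year, active, tokens, sparsity in [
    (2027, 2.9e12, 350e12, 8),
    (2028, 3.7e12, 900e12, 30),
]:
    adj, epochs = split_shortfall(active, tokens)
    print(f"{year}: {adj / 1e12:.1f}T active, {epochs:.2f} epochs, "
          f"{sparsity * adj / 1e12:.0f}T total")
```
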
Thus a model of 2028 might have 7.9T active params and 240T total (at 30x sparsity), with a lot of room to spare in 4 Rubin Ultra Kyber racks, meaning it's going to use fewer racks and generate tokens faster. The active constraint for 2028 models is not enough pretraining compute, rather than scale-up systems that are too small.

A 2029 model pretrains for 8e28 FLOPs, which at 30x sparsity asks for 7.4T active params and 1,800T training tokens. The 200T tokens of unique data are a shortfall of 8.5x, so the model instead needs 22T active params and 650T total, trained for 2.9 epochs with the 200T unique tokens. This model can no longer be served with a pipeline of 4 Rubin Ultra Kyber racks, but the constraint from a pipeline of 3 systems of 8x Kyber with first-year Feynman chips sets a much higher constraint at 1.7 quadrillion NVFP4 params. Thus the 650T param model of 2029 could be served even in FP8 on 2 systems of 8x Kyber (instead of 3 systems), or faster/cheaper with 1-2 systems of 8x Kyber when using NVFP4.

A 2030 model pretrains for 1e29 FLOPs, which at 30x sparsity and with infinite data would want 8.3T active params and 2,000T training tokens, a shortfall of 10x. Thus the model would instead need 26T active params, meaning 790T total params at 30x sparsity, pretrained for 3.1 epochs of repeating the 200T unique tokens. This is only slightly more onerous than the 2029 model even for the first-year Feynman systems (perhaps one first-year Feynman system is no longer sufficient in NVFP4, as the model takes up 67% of its HBM), while with second-year Feynman systems (that put a constraint of 2.9 quadrillion NVFP4 params from a pipeline of 4 systems) we need just 1 of them in NVFP4, or 2 of them in FP8.

Finally, a 2031 model that pretrains for 2.2e29 FLOPs would at 30x sparsity and with infinite data ask for 12T active params and 3,000T training tokens, a shortfall of 15x.
So the model instead needs 48T active params, meaning 1.4 quadrillion total params at 30x sparsity, trained for 3.9 epochs of repeating the 200T unique tokens. I didn't make estimates for post-Feynman systems, but even a pipeline of 3 systems of 8x Kyber with first-year Feynman chips suffices to serve this model in NVFP4 (the constraint is 1.7 quadrillion params), and a pipeline of 4 second-year Feynman systems can serve this model in FP8, so this model is not constrained at 30x sparsity even when served on hardware that is a year old.

## Starting in 2028, the Constraint is Pretraining Compute

Overall, 2024 was the year when models were most constrained by the scale-up systems, with a compute optimally trained model infeasible to serve even at 8x sparsity. In 2023 and 2025-2026, compute optimally pretrained models fit in pipelines of scale-up systems if they have 8x sparsity. The situation marginally improves in 2027, when faster HBM4 in Rubin makes longer pipelines practical, while shorter pipelines get faster. And then once the buildout of Rubin Ultra Kyber racks sufficiently completes in 2028, compute optimally pretrained models even with 30x sparsity become feasible to serve. This remains the case through 2031, despite the shortage of pretraining data that might require models to get 4x bigger than they would've needed to be with unlimited data.

1. H200 was released in 2023, before HBM3E of 2024 could grow into its full specs. So [what Nvidia claims to be HBM3E](https://www.nvidia.com/en-us/data-center/h200/) in H200 has the specs of HBM3 also used in H100, the main difference is that the stacks are 12-Hi and there are 6 of them. [↩︎](http://www.lesswrong.com/posts/yLHiQGCPdvzL9fBn3/model-size-scaling-in-2023-2031#fnref-iN4rjTQjQfPTCYb5o-1)
2. The bandwidth of HBM4 in Rubin is much higher than the 2.0 TB/s required by the JEDEC standard. It's about 10.8 Gbps per pin instead of 8 Gbps per pin, with 2048 pins.
[↩︎](http://www.lesswrong.com/posts/yLHiQGCPdvzL9fBn3/model-size-scaling-in-2023-2031#fnref-iN4rjTQjQfPTCYb5o-2)
3. Unlike the situation with H200, this is actual HBM4E in the sense of 4 GB per DRAM die rather than the 3 GB per DRAM die of HBM4, even though it's a year early. This is because 1024 GB per package (with 16 stacks of HBM) was announced by Nvidia when it was expected to be 16-Hi, while HBM4 would've been 768 GB per package with 16-Hi stacks. [↩︎](http://www.lesswrong.com/posts/yLHiQGCPdvzL9fBn3/model-size-scaling-in-2023-2031#fnref-iN4rjTQjQfPTCYb5o-3)
4. [Samsung and SK Hynix might be able to achieve 3.6-4.0 TB/s](https://www.ersaelectronics.com/blog/hbm4-hbm4e) in 12-Hi stacks, with 16-Hi stacks still in the works for HBM4E (meaning this bandwidth is probably relevant for Rubin Ultra of 2027, not just for the high-end HBM4E of 2028). Though [news about samples](https://news.skhynix.com/12-layer-hbm4e-sample/) is not a lot of evidence, since what matters is the performance achievable with high yield (ready for ramp), while the availability of samples only weakly depends on yield. So I'm giving some credence to the lower 3.6 TB/s (14 Gbps per pin). [↩︎](http://www.lesswrong.com/posts/yLHiQGCPdvzL9fBn3/model-size-scaling-in-2023-2031#fnref-iN4rjTQjQfPTCYb5o-4)
5. With 240 tokens/param compute optimal at 30x sparsity, we have 2T active params and 490T training tokens, a shortfall in unique data of 2.45x. So the model would need 1.56x more active params, which is 3.2T, meaning 96T total params at 30x sparsity, more than the constraint of 83T. [↩︎](http://www.lesswrong.com/posts/yLHiQGCPdvzL9fBn3/model-size-scaling-in-2023-2031#fnref-iN4rjTQjQfPTCYb5o-5)