Or, did a chief scientist of an AI assistant startup conclusively show that GPT-5.5 has 9.7T parameters?

Introduction

Recently, a paper was circulated on Twitter claiming to have reverse engineered the parameter counts of many frontier closed-source models, including the newer GPT-5.5 (9.7T parameters) and Claude Opus 4.7 (4.0T parameters), as well as older models such as o1 (3.5T) and gpt-4o (720B). The paper, titled “Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity”, introduces a dataset of factual questions of varying difficulty, regresses performance on this dataset against parameter count, and then uses this regression to extrapolate from the performance of closed-source frontier models to their parameter counts. A notable fact about this paper is that, unlike most empirical machine learning papers, it is single-authored: Bojie Li, the chief scientist of Pine AI, is the sole author of this piece.

These results were suspicious for many reasons, the primary one being that the work looks like low-effort, hastily written AI slop. For example, the codebase (https://github.com/19PINE-AI/ikp) was constructed in large part with Claude Code and has many of the hallmarks of code that is almost entirely vibe-coded with little sanity checking (e.g. redundant and inconsistent variable definitions[1], boilerplate bloat, excessive error handling[2], and silent failures[3]). The same can be said of the author’s website for this paper (https://01.me/research/ikp/), which defines terms that are used nowhere else on the page[4], has table headings inconsistent with the tables’ contents[5], and has a very high heading-to-text ratio.

We (Benjamin and Lawrence) decided to dig into these results further. Specifically, we read the paper, reproduced the author’s results using their codebase, and then dug into some obvious methodological issues to see how much they affected the author’s results. We find:

- The core idea behind the paper is largely sound but overstated. IKP performance seems to correlate strongly with parameter count for open-source models (R² between 0.78 and 0.92), but the exact strength depends on methodological choices obscured by the paper.
- The codebase makes poorly documented methodological choices that are largely unjustified, and sometimes inconsistent with both itself and the arXiv paper. Most of these don’t matter, but one makes a big difference to the results: whether the scores given to the models are given a minimum floor, something Li claims not to do in the paper but which is done in the code.
- The IKP dataset has serious issues relating to ambiguity and data quality, especially for harder questions. Substantial fractions of both the hard Wikidata-sourced questions (at least 6.8%) and the hard researcher-knowledge questions (~24.9%) are ambiguous. In a few cases, models are rated incorrect because Li’s provided gold answer is wrong. Correcting for these dataset issues also changes the estimated parameter counts of the models.
- Because of the above issues, we believe the paper’s parameter-count estimates for closed-source frontier models to be very suspect. Correcting for some of the methodological, implementation, and dataset issues we identified, a linear regression on IKP performance suggests that GPT-5.5 has around 1.458T parameters, while Claude Opus 4.7 has around 1.132T.
Because this extrapolation is so sensitive to methodological choices (and to important limitations of the dataset that we did not have time to address), we believe that the different numbers reveal problems with the original methodology, rather than the true parameter counts of the models. Despite these issues, we think that the core idea – reverse engineering LLM parameter count by quantifying memorization capacity – is solid, and we welcome future work implementing it in a more rigorous and systematic way.

Summary of Li’s “Incompressible Knowledge Probes”

As usual, let’s start by summarizing the paper at hand. One way of estimating the size of a closed model is to extrapolate from API throughput and pricing under a hardware-cost model (e.g. Epoch AI’s inference economics). Li argues that these size estimates are unreliable, by a factor of over 2x, due to confounders such as quantization, batching, and vendor margin. He instead proposes reverse engineering parameter count using the fact that neural networks can only store a number of facts that is linear in parameter count.[6] Unfortunately, this isn’t as simple as counting all the facts a model knows:[7] for one, doing so exhaustively is intractable. Instead, Li builds a set of questions ("Incompressible Knowledge Probes," IKP) that samples factual associations across seven obscurity tiers. Probes come from four sources: GPT-5-generated questions, Wikidata SPARQL pulls, DBLP/OpenAlex researcher records, and a small set of hand-curated questions. Li calls these "probes," but to avoid confusion we'll just call them questions.

Li claims six contributions:

1. He introduces the IKP dataset, in order to measure incompressible facts. These are distinguished from procedural knowledge (e.g. how to write code), which is likely compressible.
2. He regresses model parameters against IKP performance, and finds a strong linear relationship between adjusted IKP performance and model parameter count on 89 open-source models. He also confirms that IKP outperforms MMLU, GPQA, and SimpleQA at predicting parameter count. We think this result generally holds up, though we believe the exact strength of the claimed relationship is overstated.
3. He “falsifies” the “densing law” results from previous work. We agree that the densing law[8] paper is indeed very suspect (if there’s interest, we can detail why in a follow-up post). However, the densing law paper is not directly “falsified”[9] by Li’s results; a more accurate reading is that, controlling for parameter count, open-source LLMs are not getting better on his IKP dataset.
4. He uses the IKP <-> model parameter regression to estimate the parameters of closed-source frontier models and the “effective” parameters of MoEs. These results are headlined by GPT-5.5 (~9.7T) and Claude Opus 4.6 (~5.3T). He also shows that for Mixture-of-Experts models, total parameters predict factual knowledge much better than active parameters (R² of 0.79 vs 0.51).
5. He uses similarity of responses on the IKP dataset to identify models that share the same base model vs full retrains. Specifically, he combines rare-fact Jaccard overlap with "hallucination similarity" (the rate at which two models produce the same wrong answer on rare facts) into a Hallucination Similarity Score, which he claims separates weight-sharing siblings, post-training lineages, and full retrains across closed vendors without requiring model weights.
We did not investigate these results in detail, so we can't speak to whether the lineage clusters in the paper's Figure 5 are correct.[10]

6. He open-sourced his code on GitHub. We appreciate this a lot because it 1) greatly simplified the process of reproducing his results and 2) made it much easier to identify possible methodological issues with his work.

The IKP dataset

The IKP dataset consists of 1,400 questions, divided into 7 tiers of 200 each. The four sources of questions are:

1. GPT-5-generated candidates (401 questions): questions generated by asking GPT-5 to produce factual questions from a few provided examples. These compose the bulk of T1–T2, though some make it to T3–T4. Example: [T2] "Who composed the ballet 'Giselle'?" Gold answer: Adolphe Adam
2. Wikidata SPARQL questions (557 questions): questions drawn from the Wikidata database, asking about founding years of institutions, capital cities of countries, locations of headquarters, and geographic facts. These mainly populate T3–T7 (only 11 of the Wikidata questions are in T1 or T2). Example: [T4] "In what year was National Pingtung University of Education founded?" Gold: 1940
3. DBLP/OpenAlex researcher questions (345 questions): questions asking the model to "name the subfield and one paper, system, institution, or co-author for [researcher]". Most of these are in T5–T7. Example: [T5] "In computer science, what is the research subfield of Martina Zitterbart, and name one paper, system, institution, or co-author associated with their work? If you don't know who this person is, say so." Gold: computer networking [papers, co-authors, and affiliations from OpenAlex omitted for brevity]
4. Manual or supplementary additions (97 questions), drawn from the author’s previous work to balance T1–T4 coverage. Examples: [T1] "What is the capital of Portugal?" → Lisbon; [T2] "What is the largest lake in Africa?" → Victoria; [T3] "Who composed the opera 'The Magic Flute'?" → Mozart

The difficulty of each tier is empirically calibrated against six "landmark" models that span the open-weight size range from Qwen 2.5 0.5B (T1) up through Gemini 3.1 Pro (T6). A question is assigned to tier k if the k-th landmark answers it correctly but the (k−1)-th landmark does not. T7 is reserved for questions no landmark gets right, as a deliberate ceiling that no current model is supposed to clear.

As we’ll note later, both the Wikidata and researcher question sets (which together comprise over 900 of the 1,400 questions, including all questions in T5–T7) have fairly significant quality issues. For example, both contain many ambiguous questions arising from name-space collisions (e.g. multiple researchers or locations that share the same name). Many of the Wikidata founding-year questions are also somewhat ambiguous – e.g. Oxford received its royal charter in 1248, but there is evidence of teaching at Oxford as early as 1096, and the university arguably could have existed even earlier. Some of the Wikidata questions also reference outdated information.
This complicates the interpretation of the results.

IKP scoring and Regression Methodology

For each of a model’s answers, Li scores the response on either a 3- or 4-point scale:

- STRONG / CORRECT = +1.0
- WEAK = +0.5 (reserved for researcher questions where the model provides the right subfield but no supporting evidence)
- REFUSAL = 0
- WRONG = λ, where λ = −1.0 by default (the "hallucination penalty")

The hallucination penalty is added in order to discourage guessing (though it also penalizes models that know the answers to questions whose gold answers are incorrect). Each of the 7 tiers’ scores is the mean over its 200 questions, and a model's overall "penalized accuracy" is the unweighted mean of the seven tier scores. When calculating penalized accuracy, the per-tier scores are floored at 0 in the released data, even though the paper text explicitly claims they are not floored "to preserve the bluff signal in the calibration." This is one of the methodology inconsistencies we'll come back to, as the choice meaningfully changes the slope of the fit.

The judge is Gemini 3 Flash Preview at temperature 0, and all target models are run once at temperature 0. Note that this is fairly non-standard for model evaluations (and many reasoning model providers explicitly discourage running their models at t=0).

The headline regression is a one-line OLS:

A = α · log₁₀(N) + β
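To make the pipeline concrete, below is a minimal sketch of how penalized accuracy could be computed from per-question judge verdicts. This is our illustration, not Li's code: the `penalized_accuracy` function, the `floor_at_zero` flag, and the toy verdict counts are ours; only the verdict-to-score mapping and the flooring behaviour follow the rubric and code behaviour described above.

```python
# Sketch (ours, not the repository's) of per-tier scoring and penalized accuracy.
from statistics import mean

SCORE = {"STRONG": 1.0, "CORRECT": 1.0, "WEAK": 0.5, "REFUSAL": 0.0, "WRONG": -1.0}

def penalized_accuracy(verdicts_by_tier, floor_at_zero=False):
    """verdicts_by_tier maps tier number (1-7) to a list of judge verdicts."""
    tier_scores = []
    for tier in sorted(verdicts_by_tier):
        score = mean(SCORE[v] for v in verdicts_by_tier[tier])
        if floor_at_zero:
            # The paper claims this is NOT done; the released code does it anyway.
            score = max(score, 0.0)
        tier_scores.append(score)
    return mean(tier_scores)  # unweighted mean over the seven tiers

# Toy example: a model that is right 75% of the time on T1-T3 and always wrong on T4-T7.
verdicts = {t: ["STRONG"] * 150 + ["WRONG"] * 50 for t in range(1, 4)}
verdicts.update({t: ["WRONG"] * 200 for t in range(4, 8)})
print(penalized_accuracy(verdicts, floor_at_zero=True))   # ~0.214
print(penalized_accuracy(verdicts, floor_at_zero=False))  # ~-0.357
```

As the toy example suggests, flooring mostly matters for models that bluff their way through hard tiers, which is exactly where the floored and unfloored fits diverge later in this post.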
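A corresponding sketch of the fit itself and its inversion into a parameter estimate. Again, this is our illustration rather than the repository's code: `fit_ikp_regression` and `estimate_params_b` are hypothetical names, and we assume N is measured in billions of parameters (which the reported coefficients seem to imply).

```python
# Sketch of the one-line OLS A = alpha * log10(N) + beta and its inversion.
import numpy as np

def fit_ikp_regression(param_counts_b, penalized_accuracies):
    """OLS fit over open-weight calibration models; param counts in billions."""
    x = np.log10(np.asarray(param_counts_b, dtype=float))
    y = np.asarray(penalized_accuracies, dtype=float)
    alpha, beta = np.polyfit(x, y, deg=1)  # returns (slope, intercept)
    return alpha, beta

def estimate_params_b(accuracy, alpha, beta):
    """Invert the regression: N_hat = 10 ** ((A - beta) / alpha), in billions."""
    return 10 ** ((accuracy - beta) / alpha)

# With the paper's reported alpha = 0.147 and beta = 0.132 (and assuming N is in
# billions), a model with penalized accuracy A = 0.70 inverts to roughly 7,300B.
print(estimate_params_b(0.70, alpha=0.147, beta=0.132))
```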
This OLS is fit on 89 open-weight models with known parameter counts, ranging from SmolLM2-135M up to DeepSeek V4 Pro at 1.6T. Li reports α = 0.147, β = +0.132, R² = 0.917, with a leave-one-out median fold error of 1.59× and a 90% prediction interval factor of 3.0×. Inverting the regression gives a parameter-count estimate for any target model: N̂ = 10^((A − β) / α). For Mixture-of-Experts models, total parameters predict factual knowledge meaningfully better than active parameters (R² = 0.79 vs 0.51).[11]

[Figure: ground-truth open-model parameters with the linear regression from our modified methodology, as well as the fit of the original paper's model. While ours gets a lower overall R², the clean method has a better fit in the tails of the distribution.]

Densing Law Falsification Results

The densing law paper (Xiao et al. 2024) introduces "capability density", defined as the ratio of a model's effective parameter size to its actual parameter size. Here, "effective size" is the parameter count a reference model would need to match the target's downstream score. Across 29 open-source base models, they fit ln(ρ_max) = A·t + B and report A ≈ 0.007, which they translate to "the maximum capability density of LLMs doubles approximately every 3.3 months."[12]

To test this, Li adds release date as a covariate to the IKP regression: A = α·log₁₀(N) + β + γ·t. If the densing law applied to the IKP questions, then γ should be about +0.0117/month (the value that produces the claimed 3.3-month density doubling). Across 96 dated open-weight models, Li fits γ = −0.0010/month, 95% CI [−0.0031, +0.0008] — statistically indistinguishable from zero. The densing-law prediction of +0.0117/month is rejected at p < 10⁻¹⁵.

This result stands up to all of the stress testing we performed. We refit the regression with vendor fixed effects (22 vendor dummies), family fixed effects (33 family dummies), without thinking-mode variants, dropping the open-weight tier-landmark models (an anti-circularity check), and under both flooring regimes for the per-tier scores. In every specification γ stays within ±0.004/month of zero, and the +0.0117/month densing prediction is rejected with effective certainty. So whatever else the paper does, this result holds.[13]

That being said, we believe the right way to read this result is: holding parameter count fixed, factual recall on rare entities has not improved across open-weight model generations from 2023 through April 2026. Procedural benchmarks like MMLU and HumanEval have improved over the same window, often dramatically. Both can be true, given that the densing law was not intended to cover factual recall capacity.

Methodological Issues with the IKP paper

The paper and codebase have a number of methodological issues, spanning dataset construction, judging methodology, and reporting of results.
The two main methodological issues that impact the results are the use of per-tier flooring for scores (contrary to the paper’s claims) and questions with ambiguous or incorrect answers. When we adjust for these issues in our replication, the headline numbers change significantly.

Per-tier floors to the scoring

When scoring the models, each probe is scored as follows:

- STRONG / CORRECT = +1.0
- WEAK = +0.5
- REFUSAL = 0
- WRONG = λ, where λ = −1.0[14] (the "hallucination penalty")

Section 4.3 of the paper says "Per-tier scores are not floored at zero in the released results … to preserve the bluff signal in the calibration." Flooring means that a per-tier score that would go negative due to wrong answers is instead held at 0. While the paper claims the results are not floored, they are in fact floored, both for the values reported in the paper and in the repository. Removing the flooring substantially decreases the parameter-size estimates for larger models, because the slope of the fitted curve becomes significantly less steep; as a result, the most recent frontier models get much lower parameter estimates.

Floored accuracies:
- For small models, accuracy is locked to ≈ 0
- For large models, accuracy stays roughly the same, at around 0.65
- The slope is 6.79

Unfloored accuracies:
- For small models, accuracy drops to ≈ −0.5
- For large models, accuracy stays roughly the same, ≈ 0.65
- The slope becomes 3.56

When refitting without the floor, R² drops from 0.917 to 0.784, and the 90% prediction interval factor more than doubles, from 3.0 to 6.8. This means the parameter counts in Li’s original paper are largely an artifact of the flooring (or, more cynically, an undocumented “code-level optimization”). The new estimate is technically less accurate at predicting the smaller models, but provides much more reasonable estimates of the larger frontier models, despite having a wider confidence band.

Ambiguous/incorrect answers to hard questions

For the researcher questions, Li filters out two-character Chinese names and single-initial given names (Section 4.1). Unfortunately, manual inspection of some randomly sampled questions revealed two issues this filter doesn't catch.

First, researcher names shared by multiple distinct CS researchers with non-trivial citation counts. We re-queried OpenAlex for every researcher in the 345-probe set and counted distinct profiles sharing the name with ≥50 citations each. Examples where reasonable disagreement is genuinely possible: Stjepan Picek (17 OpenAlex profiles, 2 high-cite), Zhendong Su (24 profiles, 4 high-cite), Zhiguo Ding (25 profiles, 6 high-cite). Across tiers, we flagged 86 of 345 (24.9%) probes as ambiguous: T3: 11/35 (31%), T4: 11/51 (22%), T5: 25/100 (25%), T6: 14/59 (24%), T7: 25/100 (25%).

Second, researchers whose primary subfield is contested or has shifted over time. Dan Suciu's field is given in the gold answers as "programming languages," but his most-cited and most recent work is in databases. Under Li's scoring system, a model that says "databases" is marked WRONG despite being arguably more correct.

For the Wikidata questions, Li applied a “10-round audit/repair cycle” (Section 7.7). Unfortunately, this repair cycle seems to have failed to catch at least two types of issue.

First, for a large number of entities, there remain ambiguities about which entity the question refers to. For example, there are 42 distinct entities labelled "Bělá" in Czech-language Wikidata.

Second, there remain other genuine semantic ambiguities.
For example, the UT Austin School of Nursing question (gold: founded 1890) is technically right but ambiguous: the school traces its lineage through a 1960 nursing program at the same university and was only officially incorporated into UT Austin in 1976, so depending on whether you count the predecessor program, models answering 1960 are arguably correct. Similarly, sculpture-attribution probes drawn from Wikidata's P170 (creator) field often return the bronze foundry that cast the work rather than the sculptor who designed it. These are different failure modes from name collisions and stale ground truth, and a more thorough audit would find more of them.

For the manually generated questions, we inspected those where the models consistently did much worse than their tier would suggest, and found two incorrect questions: one on the highest peak in Bangladesh (whose accepted answer has changed over time) and another on the founding of the Mongolian People’s Party (which is ambiguous between 1920 and 1921). We excluded these two from our analysis.

Interestingly, these possible issues are noted by Li in Appendix H. However, he does not attempt to quantify how many questions are ambiguous or incorrect, nor how large the impact would be if the ambiguous questions were removed. We do that here.

| Source | Number of questions | Flagged ambiguous | Heuristic |
|---|---|---|---|
| LLM-generated | 401 | 0 | Visual spot check performed across all four tiers with an LLM judge; the questions seem well formed. |
| Researcher | 345 | 86 (24.9%) | OpenAlex shows ≥2 distinct researchers with ≥50 citations sharing the name, where reasonable people would disagree about which researcher to cite. |
| Wikidata | 557 | 45 (8.08%) | For the T5–T7 categories, ≥3 entities share the same label. |
| Manually generated | 97 | 2 (2.05%) | We manually inspected the manually generated questions where the models performed much worse than expected, and confirmed that 2 of the questions were incorrect. |
| Total questions audited | 1,400 | 131 (9.4%) | |

Corrected model parameter estimates

We attempt to fix the two methodological issues we identified above, by removing questions with ambiguous answers from the various datasets, and by removing the flooring from the accuracy estimates. We then recalculate scores for all the models measured in the paper. Below are the newly calculated estimates for the 8 models that we report on elsewhere in this post.

| Model | Vendor | True params | Paper estimate | Estimate w/ corrections | Δ paper → corrected |
|---|---|---|---|---|---|
| gemini-3.1-pro[15] | Google | — | 40,794B | 4,653B | ↓8.77× |
| gpt-5.5 | OpenAI | — | 9,659B | 1,458B | ↓6.62× |
| gpt-5 | OpenAI | — | 4,088B | 1,330B | ↓3.07× |
| claude-opus-4.7 | Anthropic | — | 4,042B | 1,132B | ↓3.57× |
| claude-sonnet-4.6 | Anthropic | — | 1,730B | 661B | ↓2.62× |
| grok-4.20 (thinking)[16] | xAI | — | 542B | 768B | ↑1.42× |
| deepseek-r1 | DeepSeek | 671B | 424B | 760B | ↑1.79× |
| deepseek-v3 | DeepSeek | 671B | 589B | 564B | ↓1.04× |

Overall, the estimates drop massively for the most capable frontier models (a nearly 10x difference for Gemini 3.1 Pro), while some of the smaller models see a modest increase in their estimate.

Possible methodological issues that mattered less than we thought

Thinking vs non-thinking

In the original work, models often performed much better with thinking enabled than without. This led to parameter estimates that differed by as much as 4.9x: Grok-4.20 was estimated to have 110B parameters without thinking, and 540B parameters with. Claude Opus 4.6 was estimated to have 2.4T parameters without thinking, but 5.3T with thinking enabled, a 2.2x difference.
This figure shows the difference in estimated parameter counts when the model was and was not given additional tokens in which to think, under Li's original methodology. The headline results in the paper obscure this difference, as they generally report the larger of the estimates for the thinking and non-thinking versions of the same model.

This figure shows the difference in estimated parameter counts when the model was and was not given additional tokens in which to think, using the unfloored version of the methodology. With the unfloored version, the estimates are generally somewhat closer together and significantly lower than with the floored version presented in the original paper.

Interestingly, after removing the arbitrary flooring, we generally observe much smaller differences between models given additional time to think. Grok-4.20’s thinking multiplier dropped to 3.9x, while Claude Opus 4.6’s dropped to 1.2x. We unfortunately did not have time to investigate why the thinking gap decreased after flooring was removed. That being said, we believe this is some evidence that, on the IKP benchmark, enabling thinking does not have that much of an effect on performance.

Different accuracy metrics used in some repository json files

We observed that the penalized accuracy metric used to score the models in some of the json files in the repo was different from what is outlined in the paper. We investigated whether these different accuracy metrics affected the results, but found that the aberrant json scores were not used to produce any of the figures or tables in the paper. That is, the different accuracy metrics did not affect any of Li’s results as presented in the paper.

Conclusion

In this work, we examined the robustness of both the methodology and results of Li’s “Incompressible Knowledge Probes” paper. We identified two main methodological issues with the work: the per-tier flooring that exists in the code despite the paper claiming otherwise, and the large fraction of ambiguous questions, especially in the higher difficulty tiers. We also note two questionable methodological choices that do not significantly impact the results: the performance gap between thinking and non-thinking models was much smaller than we initially thought, and the different accuracy metrics included in some json files were not used for the main analysis.

That being said, three of Li's claims survive every stress test we applied:

1. Factual recall, as measured by Li’s IKP dataset, scales log-linearly with parameter count across open-weight models. We found that the slope is consistently around 0.15 across every reasonable subset (≥0.5B, ≥10B, ≤30B, ≤100B, dense-only, MoE-only), and the intercept moves only modestly. The qualitative scaling claim is robust.
2. The densing-law-on-factual-capacity falsification. γ stays within ±0.004/month of zero across vendor fixed effects, family fixed effects, both flooring regimes, and anti-circularity refits.
3. MoE total parameters predict knowledge better than active parameters. The R² gap is partly an x-range artifact, but the slope and intercept comparisons confirm that a 600B-total MoE behaves like a 600B-total dense model, not like a 37B-active dense model.

However, what does not survive is the specific multi-trillion-parameter estimates for closed frontier models. After attempting to correct for the methodological issues to the best of our ability, we found that the estimated parameter count of the top proprietary frontier models drops from ~10T to ~1.5T.
We emphasize, however, that our point estimate of 1.5T for GPT-5.5 should not be read as our preferred answer. Instead, we see it as evidence that the range of plausible answers under defensible methodology is much wider than the paper's reported 3.0× 90% prediction interval implies. Both of us are quite uncertain about the exact parameter count of GPT-5.5.

We think that the IKP dataset (and methodology) is a real contribution. Li also deserves credit for releasing the dataset and code; it is precisely because he open-sourced his code that we could write this post so quickly.[17] But the standard for an empirical paper that produces concrete numbers ("GPT-5.5 has 9.7T parameters") needs to be higher than "I ran one regression and reported the result." Methodological choices should be discussed and justified; the effects of possible limitations or dataset issues should be analyzed, not just acknowledged in passing; and results that seem surprisingly good (or just surprising) should be scrutinized before they go viral on Twitter.

Discussion

On a broader point, we think this work illustrates both the risks and the potential of AI-generated research code.

Li's paper illustrates many of the risks. The codebase looks like code that was generated quickly and never carefully checked: the six near-identical judge prompts in different scripts, defensive error handling that silently turns network failures into refusals, redundant variable definitions, and at least two cases where the paper text and the released code disagree about what the methodology is. The companion website has terms that are defined but used nowhere and incorrectly labeled tables. None of these is individually fatal, but together they describe a pipeline where no one (including the author) read the work with a critical eye before it went public. A single-authored empirical paper with no internal or external review is a known failure mode. A single-authored empirical paper generated largely by an LLM without much review is the same failure mode at higher throughput.

But the same tools that lower the cost of producing this kind of work also lower the cost of checking it. Thanks to Claude Code (and, to a lesser extent, Codex) automating much of the code generation on our end, the two of us were able to replicate Li’s main results and perform many sensitivity analyses in around 3-4 hours each. We estimate that the same amount of work would have taken us around 10 hours each using previous-generation coding assistants (e.g. Cursor’s autocomplete).

As for the IKP work itself, despite the issues with the headline results, we believe the core idea of reverse engineering LLM parameter count from memorization capacity is solid, and we welcome future work that attempts to implement it in a more rigorous and systematic way. As a broader point about research scrutiny, we hope this example serves as an important reminder of the changing economics of producing and scrutinizing new research results: as the costs of both drop and the production of new results ramps up, so too should the scrutiny we apply to each result.
^ For example, the judge prompt appears in at least 6 different scripts with slightly different wording.

^ For example, lines 78-86 of src/scorer.py:

```python
result = judge_fn(prompt).strip().upper()
# Must check for exact "CORRECT" — not substring of "INCORRECT"
if result == "CORRECT":
    return True
if result.startswith("CORRECT"):
    return True
if result.split()[0] == "CORRECT" if result else False:
    return True
return False
```

Note that both the first and last checks are subsumed by the second check (result.startswith("CORRECT")).

^ For example, ikp_estimate.py returns an empty string “” if an invalid HTTP response is received, which the judge will then classify as a REFUSAL. (This was actually an issue when reproducing the work: Lawrence ran out of OpenRouter credits and ended up getting all refusals from gpt-4o-mini for T4 questions onwards, which had to be debugged manually.)

^ For example, the table of proprietary parameter estimates references “distilled rows”, which don’t exist in the table.

^ For example, the table of proprietary parameter estimates includes models that are very much open-source, such as mistral-medium-3.1 and deepseek-v3.1.

^ Note that, despite the implication, pre-existing estimates of memorized bits per parameter also vary by at least a factor of 2x (Allen-Zhu and Li estimated 1.4 bits per parameter for MoEs and 2 bits for dense GPT-style networks, while the later Morris et al. used a different methodology to reach 3.6 bits/parameter, and the hard information-theoretic bound for 8-bit models is ~8 bits/param).

^ An alternative approach is to estimate the most obscure fact that an LLM knows, but this has its own difficulties (e.g. quantifying obscurity).

^ It’s named the densing law because it measures how models get more parameter-efficient performance-wise, that is, more dense over time. As an aside, Lawrence thinks this is a terrible name, as if scaling laws were named “lossing laws” because they measure how loss goes down with parameter count and dataset size.

^ We note that it’s possible this is a poor wording choice resulting from overreliance on LLMs. However, we feel that, even if this were the case, it would not absolve him of responsibility for including this wording in his single-authored paper.

^ The Hallucination Similarity Score is computed on T5–T7 probes, which are ~50% researcher questions evaluated under the 4-way STRONG/WEAK rubric. This rubric awards Anthropic a ~16 percentage point excess STRONG rate over the cross-vendor median, perhaps driven by Claude's stylistic preference for verbose evidence citation. Because HSS depends on which probes count as "correct" for each model, that stylistic bias propagates into the Jaccard intersections and the wrong-answer-overlap rates. We'd expect within-vendor fingerprint comparisons (e.g. the Claude Sonnet 4.5 → 4.6 → 4.7 lineage; weight-sharing siblings) to be relatively unaffected, because both members of the pair share the same response style. Cross-vendor comparisons (especially the paper's claim to detect distillation across closed-vendor families) are structurally vulnerable to the same stylistic confound that biases the parameter estimates.

^ We'd note that the gap is partly an x-range artifact (active parameters cluster in a narrow ~10–40B band across the 37 MoE calibration models, which compresses the regression's denominator).
In more honest predictive units – the LOO median fold error also reported in the paper – MoE-active is only ~13% worse than MoE-total (1.69× vs 1.49×), not the ~36-percentage-point R² gap the headline suggests.

^ This result is suspicious for many reasons: for one thing, they calibrate capability density using a family of small in-house models, don’t control for a few obvious confounds, and use several statistical tricks to inflate the significance of their result. Again, it’s beyond the scope of this post, but we’d be happy to write another one if there’s more interest.

^ Also, information-theoretic bounds imply that the densing law cannot continue indefinitely on pure factual recall tasks.

^ Another plausible methodological issue is the scaling of the hallucination penalty. We experimented briefly with different scales, but found fits similar to the one reported in the paper, so we did not investigate further.

^ Gemini-3.1-pro was used as a landmark model for calibration, which inflates its score substantially. We include it to show the effect of correcting the two main methodological issues.

^ Unlike most other models, grok-4.20 performs much worse without thinking than with.

^ We plan on releasing our code in the coming days; the publication of this post was unfortunately rushed due to the writing program both of us are in: https://www.inkhaven.blog/spring-26