In the breakneck race to develop AI that can comprehend vast amounts of information, a new, comprehensive study has delivered a plot twist that few saw coming. The language that consistently outperforms others when processing exceptionally long documents isn't English or Chinese—the usual heavyweights of the AI world—but Polish.
This surprising finding comes from a groundbreaking multilingual benchmark known as OneRuler, detailed in a paper presented at COLM 2025. The research, which evaluated how large language models (LLMs) handle extensive texts across 26 languages, suggests that our fundamental assumptions about linguistic performance in AI may need a significant revision.
The Benchmark and the Unexpected Winner
The core mission of the OneRuler benchmark was to move beyond simple, short-form tasks and stress-test models on their ability to retrieve and aggregate information from documents stretching up to 128,000 tokens—equivalent to hundreds of pages of text. Researchers tested model accuracy across a wide range of context lengths, from a modest 8,000 tokens to the massive 128,000-token mark.
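Retrieval tests of this kind typically follow a needle-in-a-haystack pattern: a single target fact is buried inside a long run of filler text, and the model must pull it back out. The Python sketch below illustrates that general setup; the filler sentences, question wording, and sizing are illustrative assumptions rather than OneRuler's actual prompts.

```python
import random

def build_niah_prompt(filler_sentences, needle, num_sentences, question):
    """Bury a single 'needle' fact at a random position in a long
    filler document, then ask the model to retrieve it."""
    haystack = random.choices(filler_sentences, k=num_sentences)
    haystack.insert(random.randrange(len(haystack) + 1), needle)
    document = " ".join(haystack)
    return f"{document}\n\nQuestion: {question}\nAnswer:"

# Roughly 1,200 short filler sentences gets close to the study's 8,000-token floor.
filler = ["The grass is green.", "The sky is blue.", "The sun is warm."]
needle = "The special magic number for the experiment is 48213."
question = "What is the special magic number mentioned in the text?"
prompt = build_niah_prompt(filler, needle, num_sentences=1200, question=question)
```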
What they discovered was a dramatic shift in performance once the context windows expanded. According to the results chart published on page 6 of the study, a clear leader emerged: Polish.
The data shows Polish leading all other tested languages with a remarkable average accuracy of 88% at long-context scales. Meanwhile, English, the lingua franca of the internet and a primary training data source for most LLMs, dropped to a disappointing sixth place. Even more striking, Chinese, another AI juggernaut, finished among the bottom four languages.
You can delve into the full details and methodology of this revealing study in the pre-print paper available here: The OneRuler Benchmark: A Multilingual Study on Long-Context LLMs.
The Script and Tokenization Hypothesis
So, why would Polish, a language with a fraction of the digital footprint of English or Chinese, suddenly become the gold standard for long-context understanding? The study’s authors propose that the answer lies not in the volume of training data, but in the fundamental structure of the language itself—specifically, tokenization efficiency and script type.
Tokenization is the process by which an AI breaks text into smaller pieces (tokens) for processing. The research indicates that languages written in Latin-based scripts, such as Polish (with its rich inflectional morphology), French, and Spanish, consistently outperformed those using logographic (like Chinese) or abugida (like Tamil) writing systems.
"The complete 180 of expected rankings is fascinating," the paper notes. "It indicates that once a model's primary task is to search, recall, or summarize information buried deep within a long document, structural aspects of the language itself may take precedence over the sheer prevalence of its dataset."
In simpler terms, the way Polish words are formed and tokenized might create a more efficient "map" for the AI to navigate a massive text, allowing it to find needles in a digital haystack with greater precision.
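One rough way to see the effect is to count how many tokens a tokenizer spends on the same sentence rendered in different languages. The sketch below uses the open cl100k_base encoding from the tiktoken library; the sample translations and the choice of encoding are illustrative assumptions, not the paper's setup.

```python
import tiktoken

# Count the tokens one tokenizer spends on the same sentence in three
# languages. Counts vary by tokenizer, so treat this as a rough probe.
enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "The cat sat quietly on the old wooden windowsill.",
    "Polish": "Kot siedział cicho na starym drewnianym parapecie.",
    "Chinese": "猫静静地坐在旧的木制窗台上。",
}

for language, text in samples.items():
    num_tokens = len(enc.encode(text))
    # Fewer tokens per unit of meaning leaves more of a fixed context
    # window free for the document itself.
    print(f"{language}: {num_tokens} tokens for {len(text)} characters")
```

A language that packs more meaning into fewer tokens effectively fits "more document" inside the same 128,000-token budget, which is one plausible reading of the paper's tokenization-efficiency hypothesis.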
A Widening Gap and Surprising Sensitivities
The OneRuler benchmark uncovered other critical trends that have profound implications for the future of AI development. One of the most significant is the performance gap. As context windows grew, the chasm between the best and worst-performing languages widened dramatically—from an 11% difference at 8,000 tokens to a staggering 34% gap at 128,000 tokens.
Furthermore, the study revealed just how sensitive long-context models can be to minor changes in instruction. In one test, researchers merely added a line permitting the model to answer "none" if the requested information was absent from the text. This single, logical alteration caused accuracy in English to plummet by 32% at the 128k-token level, a stark illustration of the instability that can emerge at scale.
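For a sense of how small the change was, the two instruction variants might look like the pair below; the wording is a hypothetical reconstruction, not the benchmark's actual prompt text.

```python
# Baseline instruction: the model must always report an answer.
STRICT_INSTRUCTION = (
    "Find the special magic number mentioned in the document and report it."
)

# Variant: one added sentence permits a 'none' answer. Per the study,
# this change alone cut English accuracy by 32% at 128k tokens.
NONE_ALLOWED_INSTRUCTION = (
    "Find the special magic number mentioned in the document and report it. "
    "If no such number appears in the document, answer 'none'."
)
```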
For developers and researchers looking to replicate or build upon these findings, the complete framework and data are available here: OneRuler Benchmark on GitHub.
What This Means for the Future of AI
The implications of the OneRuler study are far-reaching. For years, the AI industry has relied heavily on English-centric benchmarks to measure progress and capability. This research powerfully argues that this approach is insufficient and potentially misleading.
The findings suggest that as context windows continue to expand—a key selling point for the next generation of LLMs—linguistic differences will grow more important, not less. English's dominance in standard benchmarks may no longer be representative of true performance when sequence lengths climb into the tens of thousands.
The crown now rests on an unexpected head. As the AI field pushes the boundaries of context, it must also broaden its linguistic horizons, looking beyond the usual suspects to build models that are truly intelligent, no matter the language or the length of the text.