Waking Up in 2025: Are Large Language Models Hitting a Scaling Wall?

Waking up in 2025 means waking up to a new AI model every day, each with “groundbreaking” benchmarks that supposedly outperform all the others. Before diving into recent research, I knew AI models like ChatGPT and Gemini were advancing rapidly, and that scaling model size and computation was the prevailing strategy for improving performance. This led me to wonder how far we are from the sci-fi future of true “artificial” intelligence and reasoning. AI technology is now a daily reality; one can try hard but cannot escape it. As someone interested in computer engineering and AI, I was curious how these breakthroughs will affect technology and jobs. The question therefore felt personal to my own future: how long will these models keep growing, or is their growth reaching its practical limits?

I have experimented with AI tools, including prompt-based models and small coding projects, which helped me understand the potential and limitations of current models. I then began reading high-level tech journalism and discovered that this is an active research question, which eventually led me to a Reuters article.

AI expert Ilya Sutskever, a co-founder of OpenAI, says that “results from scaling up pre-training… have plateaued,” and that “the 2010s were the age of scaling, now we’re in the age of wonder and discovery… Scaling the right thing matters more now than ever” (Hu and Nellis). OpenAI’s internal effort codenamed “Orion” reportedly fell short of the leap expected for GPT-5, and the model was instead released as GPT-4.5.

My research first led me to the foundational idea that justified the entire AI boom: the scaling laws, which hold that more computation, more data, and larger models yield predictably better performance. Kaplan and colleagues showed that increasing model parameters, training tokens, and compute lowers a model’s test loss in a smooth, measurable way (Kaplan et al.). This finding formed the basis for why companies poured resources into ever-bigger training runs over the past few years.
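
To make the claim concrete, Kaplan and colleagues express it as simple power laws, sketched below. The notation follows my reading of the paper, and the exponent range in the comment is approximate; the paper itself is the authoritative source.

```latex
% Approximate power-law form of the Kaplan et al. scaling laws: test loss L
% falls predictably as parameters N, dataset tokens D, or compute C grow,
% each measured while it is the binding constraint.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
% The fitted exponents are small (roughly 0.05--0.1), which is why each fixed
% improvement in loss demands a multiplicative jump in scale.
```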

However, I found counterarguments showing that this strategy is running into three walls: data limits, cost and energy, and architectural limits. First, several analyses warn that the stock of high-quality, human-generated text could be exhausted unless new sources or methods appear (Villalobos et al.). Second, the energy costs of training are already enormous, and the concentration of capability in a few deep-pocketed labs keeps raising questions about access and accountability. Third, researchers argue that current LLM architectures lack mechanisms for genuine reasoning; without better modalities or components, further scaling may yield diminishing returns (LeCun; Marcus; Hu and Tong).

Efforts to train on synthetic, AI-generated data have so far met with limited success. A study published in Nature shows that repeatedly training new models on outputs from earlier ones leads to “model collapse,” where errors compound over generations as models lose diversity and fidelity (Shumailov et al.). This phenomenon is sometimes called “model autophagy,” and it implies that without fresh, high-quality data, further scaling may degrade performance rather than improve it.
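
To see why recursion erodes diversity, here is a toy simulation I put together: a Gaussian stands in for the “model,” each generation is refit only on samples from the previous one, and the sampling step drops the rare tail events. It is a cartoon of the mechanism Shumailov and colleagues describe, not a reproduction of their experiments.

```python
import random
import statistics

# Toy "model collapse" cartoon: each generation fits a Gaussian to data
# sampled from the previous generation's Gaussian, after trimming the tails
# (rare events a generative model under-represents). Watch sigma shrink.
random.seed(0)
mu, sigma = 0.0, 1.0                          # generation 0: the "real" data
for generation in range(1, 11):
    samples = sorted(random.gauss(mu, sigma) for _ in range(1000))
    trimmed = samples[50:-50]                 # lose 5% of each tail
    mu = statistics.mean(trimmed)             # next "model" sees only synthetic data
    sigma = statistics.stdev(trimmed)
    print(f"generation {generation:2d}: sigma = {sigma:.3f}")  # diversity decays
```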

But countervailing evidence pushes back on the “wall” narrative. Each new model still shows gains on tougher benchmarks, and engineering innovations are producing major practical improvements. Allocating more computation at inference time (“thinking longer”) or test-time “deliberation” produces outsized improvements relative to simply adding parameters (Hu and Tong). This points to a practical, plural future in which scaling is not dead, but the axis of scaling has shifted. Retrieval-augmented generation (RAG), for instance, fetches up-to-date, factual information at query time, which reduces reliance on memorized training data and substantially reduces hallucinations (“Retrieval-Augmented Generation Explained”). Multimodal training combines vision, audio, and other sensory data, expanding the informational substrate beyond what text alone can provide, precisely the gap critics have pointed to (LeCun; OpenAI).
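
To make the RAG idea concrete, here is a minimal sketch of the pattern: fetch relevant text first, then condition the model’s answer on it. The tiny document store, the word-overlap retriever, and the call_llm placeholder are illustrative stand-ins of my own, not any particular product’s API.

```python
# Minimal retrieval-augmented generation (RAG) sketch with toy components.

DOCUMENTS = [
    "Retrieval-augmented generation grounds answers in documents fetched at query time.",
    "Scaling laws relate a model's loss to its parameters, data, and compute.",
    "Test-time deliberation lets a model spend more compute on hard questions.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the query (toy retriever)."""
    q_words = set(query.lower().split())
    ranked = sorted(DOCUMENTS,
                    key=lambda doc: len(q_words & set(doc.lower().split())),
                    reverse=True)
    return ranked[:k]

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call; here it just echoes the prompt."""
    return f"[model would answer using]\n{prompt}"

def rag_answer(query: str) -> str:
    # Fetch supporting context first, then condition generation on it,
    # reducing reliance on whatever the model memorized during pre-training.
    context = "\n".join(retrieve(query))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return call_llm(prompt)

print(rag_answer("What does retrieval-augmented generation do?"))
```

Real systems swap the toy retriever for a vector database and the stub for an actual model call, but the query-time grounding step is the same.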

The core question is whether AI models are approaching fundamental limitations that could halt their progress. Overall, it is about more than the technology itself: billions of dollars and enormous research effort are currently invested in AI, whether in startups, big tech, or academia. The economic and infrastructural side matters as much as the technical side. As demand for GPUs, energy, and cooling increases, the concentration of research in a few labs shapes what gets built and which ethical choices are visible to the public (Villalobos et al.; Hu and Tong). Practical progress, in other words, will come from techniques that allocate energy and compute more efficiently.

In short, the direction of the next step matters as much as the step itself. Are LLMs approaching a fundamental growth wall, or are we simply redefining what we expect from them?

This exploration leaves my inquiry at a new beginning, where several research questions will be critical for understanding the future:

  • What are the long-term solutions to the data bottleneck, and what are the ethical considerations on using proprietary datasets to overcome it?
  • How can multimodal models (beyond vision and audio) be structured and designed to address core issues in reasoning that critics like LeCun and Marcus have highlighted?
  • Can retrieval-augmented generation and multimodal training reduce the need for ever-larger datasets, as breakthroughs like DeepSeek’s suggest?

Works Cited

Hu, Krystal, and Stephen Nellis. “OpenAI, Rivals Seek New Path to Smarter AI as Current Methods Hit Limitations.” Reuters, 11 Nov. 2024, www.reuters.com/technology/artificial-intelligence/openai-rivals-seek-new-path-smarter-ai-current-methods-hit-limitations-2024-11-11.

Kaplan, Jared, et al. “Scaling Laws for Neural Language Models.” arXiv, 2020, arxiv.org/abs/2001.08361.

LeCun, Yann. “A Path Towards Autonomous Machine Intelligence.” OpenReview, 2022, openreview.net/forum?id=BZ5a1r-kVsf.

Marcus, Gary. “Deep Learning Is Hitting a Wall.” Substack, 2022, garymarcus.substack.com/p/deep-learning-is-hitting-a-wall.

OpenAI. “GPT-4 Technical Report.” arXiv, 2023, arxiv.org/abs/2303.08774.

“Retrieval-Augmented Generation Explained.” NVIDIA Technical Blog, 2023, developer.nvidia.com/blog/retrieval-augmented-generation-explained/.

Shumailov, Ilia, et al. “AI Models Collapse When Trained on Recursively Generated Data.” Nature, vol. 631, 2024, pp. 755–59.

Villalobos, Pablo, et al. “Will We Run Out of Data? An Analysis of the Limits of Scaling Datasets in Machine Learning.” arXiv, 2022, arxiv.org/abs/2211.04325.