03-25-2025 - Transformers Hit a Scaling Wall
Tuesday, March 25, 2025
[!info] Author: PicoCreator | Published: 2025-03-24 | Source: Twitter
[!abstract]+ TLDR Why it matters: Transformers have reached a scaling wall, making current models like GPT-4.5 unsustainable and costly without clear advancement toward AGI.
What’s happening: Experts like Yann LeCun and Demis Hassabis see the need for new architectures and predict it will be a decade before AGI is feasible.
The solution: A shift towards smaller, more reliable, and personalizable models using innovative designs like RWKV, which focus on memory and efficient scaling, is proposed.
The bottom line: This approach could unlock commercially viable AI agents, paving the way to AGI through software improvements rather than hardware.
🧱 Transformers have hit the scaling wall 🧱
💰 GPT-4.5 cost billions, with no clear path to AGI even at 10x the cost.
📘 Meta’s Yann LeCun says we need new architectures.
🔎 DeepMind CEO Demis Hassabis asserts we need 10 years.
We have another path to AGI in under 4 years. Today’s models are already:
- Capable: of incredible PhD-level tasks and beyond.
- (Un)reliable: succeeding maybe 1 out of 30 times.
What everyone wants is not a smarter model, but a more reliable model doing basic college-level tasks.
Longer write-up:
Our Roadmap to Personalized AI
To do the more boring things in life like:
- Organize emails and receipts
- Fill out forms
- Order groceries
- Be a friend
The things that actually matter… all tasks which a 72B model is more than capable of, if only it were more reliable.
And that’s where our work at Qwerky comes in.
Because the one thing holding these AI models and agents back is simply the lack of reliable memory, and memory is at the heart of recurrent models like RWKV.
❗️ Attention is NOT all you need ❗️
Using only 8 GPUs (not a cluster), we trained Qwerky-72B (and 32B) without any transformer attention, achieving evaluation scores far surpassing GPT-3.5 Turbo and closing in on GPT-4o mini, all with over 100x lower inference cost thanks to RWKV’s linear scaling.
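A rough illustration of that linear-scaling claim (a toy flop count, not RWKV’s actual implementation): softmax attention does work proportional to the number of previous tokens at every step, while a fixed-size recurrent state does constant work per token.

```python
# Toy cost model, illustrative only: per-token work of softmax attention
# vs. a fixed-size recurrent state. Function names are assumptions.

def attention_flops(seq_len: int, d: int) -> int:
    # Each new token attends to all previous tokens: O(T) per token,
    # O(T^2) over the whole sequence.
    return sum(t * d for t in range(1, seq_len + 1))

def recurrent_flops(seq_len: int, d: int) -> int:
    # A recurrent model (RWKV-style) updates a fixed-size state:
    # O(1) per token, O(T) over the sequence.
    return seq_len * d

d = 64
for T in (1_000, 10_000, 100_000):
    ratio = attention_flops(T, d) / recurrent_flops(T, d)
    print(f"T={T:>7}: attention/recurrent cost ratio ~ {ratio:.0f}x")
```

The ratio grows with sequence length, which is why the gap widens for long-context agent workloads.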
Instead of scaling bigger, more expensive models—which cannot provide an ROI to investors—what if we iterate faster at <100B active parameters to make these already capable models more reliable and personalizable, at a size that offers ROI?
This new approach of treating FFN/MLP as a separate reusable building block allows us to iterate and validate changes in the RWKV architecture at larger scales, faster. 🏃
Expect bigger changes in our average 6-month version cycle (even I struggle to keep up! 🤣).
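A minimal sketch of that building-block idea (this is not Qwerky’s code; the names and the simple EMA mixer are illustrative assumptions): the token mixer is a swappable function, while the FFN/channel-mix weights stay untouched whichever mixer is plugged in.

```python
import numpy as np

# Illustrative sketch: treat the token mixer as an interchangeable
# component, so FFN/MLP weights can be reused while attention is
# replaced by a recurrent (RWKV-style) mixer. All names are assumptions.

rng = np.random.default_rng(0)

def ffn(x, w1, w2):
    # Reusable channel-mix / MLP block: expand, nonlinearity, project back.
    h = np.maximum(x @ w1, 0.0)  # ReLU for simplicity
    return h @ w2

def recurrent_mix(x, decay=0.9):
    # Stand-in for an RWKV-style time-mix: an exponential moving average
    # over tokens, carrying only a fixed-size state between steps.
    state = np.zeros(x.shape[-1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        state = decay * state + (1.0 - decay) * x[t]
        out[t] = state
    return out

def block(x, w1, w2, mixer=recurrent_mix):
    # Same residual structure as a transformer block, with the mixer
    # swapped out; the FFN weights (w1, w2) are unchanged either way.
    x = x + mixer(x)
    return x + ffn(x, w1, w2)

d = 16
w1 = rng.normal(size=(d, 4 * d)) * 0.1
w2 = rng.normal(size=(4 * d, d)) * 0.1
y = block(rng.normal(size=(32, d)), w1, w2)
print(y.shape)  # (32, 16)
```

Because the FFN is untouched, architecture experiments only have to retrain the mixer, which is what makes fast iteration at large scale plausible.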
The result is a model that can be “memory-tuned” without catastrophic forgetting, overcoming the barrier that puts fine-tuning out of reach for the vast majority of teams.
Quick and efficient personalization of AI models will unlock reliable commercial AI agents—without compounding errors.
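A back-of-the-envelope on why compounding matters for agents (illustrative numbers, not measured results): if each step of a task succeeds independently with probability p, a 20-step task succeeds with probability p^20, so small reliability gains per step translate into large gains end to end.

```python
# Errors compound multiplicatively across a multi-step agent task.
# The per-step probabilities below are illustrative assumptions.

def task_success(per_step: float, steps: int) -> float:
    # Probability the whole task succeeds, assuming independent steps.
    return per_step ** steps

for p in (0.90, 0.97, 0.999):
    print(f"per-step {p:.3f} -> 20-step task: {task_success(p, 20):.1%}")
```

Under this toy model, a 90%-per-step agent completes a 20-step task only about one time in eight, which is why per-step reliability, not raw capability, is the bottleneck.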
Memories are the secret to AGI.
Once memories for personalized AI are mastered, so that AI engineers can easily and reliably tune them with controlled datasets, the next step is to have the AI model prepare its own continuous-training dataset without compounding loss.
It’s a binary question: Is recurrent memory the path to AGI? If so, this path to AGI is inevitable, as all the critical ingredients are already here, bound only by software, not hardware.
You can read more in detail in our long-form writing:
Our Roadmap to Personalized AI