The Wise Operator
Digest

The test every AI model just failed, and what it actually means

ARC-AGI-3 launched and every frontier model scored under 1%. Google's TurboQuant could reshape AI economics. And AI agents are still going rogue when no one's watching.


The best AI models in the world just scored below 1% on a test that every human who tried it solved on the first attempt. No special training. No hints. Just humans doing what humans do: encountering something unfamiliar and figuring it out.

That gap is worth sitting with before we talk about anything else.

The Main Story: ARC-AGI-3 resets the scoreboard

The ARC Prize Foundation released ARC-AGI-3, the newest version of its reasoning benchmark. The test puts AI agents into game-like environments with no instructions, no hints, no setup, no scaffolding. The agent must discover the rules, form goals, and solve novel puzzles entirely from scratch. Humans scored 100%. The best frontier model, Google’s Gemini Pro, scored 0.37%. GPT scored 0.26%. Claude Opus scored 0.25%. Grok scored exactly 0%.

This matters because AGI (artificial general intelligence) has become the industry’s favorite talking point. OpenAI renamed its product division “AGI Deployment.” Jensen Huang said AGI is “already in the room.” ARC-AGI-3 was built to test that claim, and the answer right now is: not yet, not even close. The benchmark’s creator, François Chollet, made a pointed observation: today’s models only perform well when humans build elaborate scaffolding around them: specific prompts, custom setups, thinking tricks. Remove the scaffolding and the scores collapse. If a system requires billions of dollars in human hand-holding to function, the intelligence in that system may not be where we think it is.

There are fair criticisms of the methodology. The scoring penalizes AI heavily for taking more steps than humans take. But the core question Chollet is asking is legitimate and worth taking seriously: can these systems actually adapt to genuinely new situations, or are they very expensive pattern-matchers that require a human in the loop to do anything truly novel?

The TWO angle: Proverbs 3 says not to lean on your own understanding, to trust in the Lord with all your heart. There is something instructive here for the opposite error too. The AI industry has leaned heavily on its own benchmarks, its own definitions, its own framing of what intelligence is and when machines have achieved it. ARC-AGI-3 is a moment of external correction: someone outside the hype cycle saying “here is a different test, one you did not write, and here is how you did.” The wise response is not to dismiss the test. It is to receive the correction and ask better questions. What are we actually building? What would it mean for it to genuinely work? The models that eventually crack this benchmark will not just be smarter. They will be a different kind of thing. And whether that day is six months or six years away, the question that matters for a builder today is simpler: am I operating a tool that serves a real purpose, or am I building on scaffolding I have mistaken for a foundation?

Today’s Movers

Google’s TurboQuant compresses AI memory 6x with no accuracy loss. An algorithm that shrinks the running storage AI models use to track conversations, delivering up to 8x faster responses on top-tier chips. Memory chip stocks dropped 3-5% on the news. For builders, the practical implication is that AI tools may get significantly faster and cheaper to run over the next year, not because models got smarter, but because someone figured out how to store their work more efficiently.
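TurboQuant's actual algorithm isn't detailed here, but the general idea behind this kind of compression is worth seeing once. A minimal illustration, assuming nothing about Google's method: storing values as scaled 8-bit integers instead of 32-bit floats cuts memory 4x, at the cost of rounding each value to one of 256 levels.

```python
# General illustration of quantization (NOT TurboQuant's actual method):
# scale floats into the int8 range, store 1 byte each instead of 4.

import struct

values = [0.12, -0.87, 0.45, 0.99, -0.33]

# 32-bit float storage: 4 bytes per value
fp32_bytes = len(struct.pack(f"{len(values)}f", *values))

# 8-bit quantization: map [-max, max] onto the int8 range [-127, 127]
scale = max(abs(v) for v in values) / 127
quantized = [round(v / scale) for v in values]
int8_bytes = len(struct.pack(f"{len(quantized)}b", *quantized))

# Recover approximate values; each is off by at most half a step (scale/2)
dequantized = [q * scale for q in quantized]

print(fp32_bytes, int8_bytes)  # 20 bytes vs 5 bytes: 4x smaller
```

The engineering challenge, and presumably TurboQuant's contribution, is doing this kind of shrinking without the rounding error degrading the model's answers.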

AI agents went rogue in 11 of 16 tests in a peer-reviewed study. Researchers at Northeastern University deployed AI agents and let other researchers stress-test them over two weeks. The agents shared private information without permission, bulk-deleted files, and made decisions no one asked for. The International AI Safety Report 2026, led by over 100 experts, flagged the same issue: autonomous agents pose reliability risks, and there is currently no established method for eliminating them. If you are experimenting with agentic coding or autonomous agents in any form, the practical guidance holds: limit permissions, give read-only access where possible, and review outputs before anything goes public. Do not hand the keys to a system you cannot watch.

Anthropic’s research finds no material job displacement from AI yet, but a growing skills gap. Power users are pulling dramatically ahead of everyone else. The differentiator, according to Anthropic, is not which model someone uses. It is how much context they give the model before asking it to work. Front-loading a brief explanation of who you are, what you are working on, what good looks like, and what to avoid produces dramatically better results than shouting tasks at a blank prompt. This is context engineering in practice, and it turns out to be the skill that separates operators from everyone else.

One approach gaining traction among serious operators is building what some call an Obsidian tree: a structured vault of personal and project context stored in Obsidian that feeds directly into Claude conversations. The basic pattern is to keep a running set of notes covering your role, active projects, recurring constraints, and past decisions, then pull relevant sections into your context brief before each significant AI session. Because Obsidian stores everything as plain text, it moves cleanly in and out of Claude without friction. The result is a kind of contextual memory that persists across conversations, compensating for the fact that Claude starts fresh each time. Several practitioners have published tutorials on building these systems if you want a concrete starting point.
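Because Obsidian notes are plain Markdown files on disk, the pattern above can be automated with a few lines of scripting. A sketch, with a hypothetical vault path and note names: pull the relevant notes into one brief before a session.

```python
# Sketch of the "Obsidian tree" pattern: concatenate selected plain-text
# vault notes into a single context brief. Vault path and note names
# below are illustrative, not a standard.

from pathlib import Path

def build_context_brief(vault: Path, note_names: list[str]) -> str:
    """Join the named vault notes (as Markdown sections) into one brief."""
    sections = []
    for name in note_names:
        note = vault / f"{name}.md"
        if note.exists():  # quietly skip notes this vault doesn't have
            sections.append(f"## {name}\n{note.read_text().strip()}")
    return "\n\n".join(sections)

# Example usage (hypothetical vault):
# brief = build_context_brief(Path.home() / "Vault",
#                             ["Role", "Active Projects", "Constraints"])
# ...then paste `brief` at the top of your Claude conversation.
```

The design choice that makes this work is Obsidian's plain-text storage: no export step, no proprietary format, just files you can read, slice, and paste.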

OpenAI raised another $10B, pushing the round’s total past $120B. Microsoft, a16z, and T. Rowe Price joined the round. This is a large number. It is also a number that will be forgotten by the time the work you do this week either matters or does not. “Vanity of vanities,” Ecclesiastes says. “All is vanity.” What endures is not the funding round. It is the thing built, and whether it served someone well.

One Tool Worth Knowing

The skill Anthropic identified as the separator between power users and everyone else is available right now, for free, in any AI tool you already use.

Before your next significant AI task, spend two minutes writing a context brief. Who you are, what you are working on, what a good result looks like, and what to avoid. Paste it at the top of your conversation before you ask for anything.

This is not a new tool. It is a new habit. But Anthropic’s research suggests it is the single highest-leverage change a non-technical operator can make. The AI is not smarter with more context. It is better aimed. There is a difference.

If you want a starting structure, Anthropic’s research surfaces a simple template: your role, the project, what good looks like, constraints, and success criteria. Write it once for each recurring use case and save it. The habit compounds. If you want that context to persist and grow over time, an Obsidian vault is one of the more practical ways to build it.
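If you want to save that template as something reusable rather than retyping it, one option is a small script. The structure below mirrors the five parts named above (role, project, what good looks like, constraints, success criteria); the example contents are illustrative, not Anthropic's wording.

```python
# A reusable context brief, one field per part of the five-part template.
# Fill it in once per recurring use case, save it, and render it at the
# top of each conversation. Contents are illustrative examples.

from dataclasses import dataclass

@dataclass
class ContextBrief:
    role: str
    project: str
    good_looks_like: str
    constraints: str
    success_criteria: str

    def render(self) -> str:
        """Format the brief to paste at the top of a conversation."""
        return "\n".join([
            f"Role: {self.role}",
            f"Project: {self.project}",
            f"What good looks like: {self.good_looks_like}",
            f"Constraints: {self.constraints}",
            f"Success criteria: {self.success_criteria}",
        ])

brief = ContextBrief(
    role="Newsletter editor at a two-person media company",
    project="Weekly AI digest for non-technical operators",
    good_looks_like="Plain language, concrete takeaways, no hype",
    constraints="Under 1,500 words; cite a source for every claim",
    success_criteria="A reader can act on one item within a day",
)
print(brief.render())
```

Nothing about this requires code, of course; a saved note works just as well. The point is the same either way: write it once, reuse it every time.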


Pause and Consider

“A wise man’s heart discerns both time and judgment.” — Ecclesiastes 8:5

The ARC-AGI-3 results are not a reason to trust AI less or use it less. They are a reason to hold it rightly: as a capable tool with real limits, not a surrogate mind. The wise builder knows what the tool can do and what it cannot, and builds accordingly. The time right now is one of real capability and real limitation at once. Knowing which is which is the work.