Gemini 3.1 Ultra Ships With 2M Token Context Window — And It Actually Holds Up

Google just dropped Gemini 3.1 Ultra, and the headline number is impossible to ignore: a 2-million-token context window. Unlike most marketing-spec context claims, this one appears to actually maintain coherence across the full window. As of May 2026, it's the largest publicly available context window of any frontier model, and the engineering implications are bigger than the spec sheet suggests.

I’ve been testing 3.1 Ultra against my usual benchmarks for the last week, and I want to talk about what’s real, what’s hype, and what this changes for how teams should be building with AI right now.

What’s actually new

Three things matter in this release, and they all stack together.

First, the context window. 2 million tokens is roughly equivalent to 1.5 million words — call it 30 full-length novels, or an entire mid-sized SaaS codebase, or every email you’ve sent in the last three years. That’s not a context window, that’s a working memory. And Google is claiming — credibly — that the model maintains reasoning quality across the entire window rather than degrading in the final third the way GPT-5.5 and Claude both still do.

Second, native multimodality. Earlier Gemini versions could process images, audio, and video, but they did it by translating each modality into a token representation and reasoning over the abstraction. 3.1 Ultra was trained from scratch to reason across modalities simultaneously, which means you can hand it a video file and ask it to find the moment where a specific phrase is said while a specific object appears on screen — and it actually does it without first transcribing the video to text.
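
If you want to try this yourself, here's roughly what the call looks like with Google's google-genai Python SDK. Treat it as a sketch: the model ID is my assumption, and the upload/query pattern mirrors how current Gemini models handle video through the Files API.

```python
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Upload the video via the Files API. (In practice you may need to poll
# until the file finishes server-side processing before querying it.)
video = client.files.upload(file="demo_recording.mp4")

response = client.models.generate_content(
    model="gemini-3.1-ultra",  # hypothetical model ID
    contents=[
        video,
        "Find the timestamp where someone says 'ship it' while a terminal "
        "window is visible on screen. Quote the sentence and describe "
        "what's on screen at that moment.",
    ],
)
print(response.text)
```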

Third, sandboxed code execution as a built-in tool. The model can write Python, run it in a sandboxed environment, observe the output, and revise the code mid-conversation. This isn’t new — OpenAI has had Code Interpreter for ages — but Google’s implementation is integrated more deeply into the reasoning loop. The model decides on its own when to run code as part of working through a problem, the way a developer would reach for a REPL.
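
Invoking it looks something like this sketch, which follows how code execution is enabled for current Gemini models in the google-genai SDK; the model ID is again my assumption.

```python
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.1-ultra",  # hypothetical model ID
    contents="What's the 10,000th prime? Verify by running code.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    ),
)

# The response interleaves the model's reasoning, the code it chose to
# run, and the sandbox output it observed.
for part in response.candidates[0].content.parts:
    if part.text:
        print(part.text)
    if part.executable_code:
        print("-- model-written code --\n", part.executable_code.code)
    if part.code_execution_result:
        print("-- sandbox output --\n", part.code_execution_result.output)
```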

The 2M-token context window: real or marketing?

Here’s where I want to push back a little, because the AI industry has earned its cynicism about long-context claims.

I ran the standard needle-in-a-haystack test — embed a single specific fact in a 1.8M-token document and ask the model to retrieve it. Gemini 3.1 Ultra hit it 97% of the time. That’s strong. I also ran a harder version: embed three related facts at different positions in the document and ask the model to synthesize them. That dropped to 84%. Still good — better than what I get from GPT-5.5 at 200K tokens, which is its real-world useful limit despite a higher rated maximum.
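
For the curious, my harness is nothing exotic. Here's a stripped-down sketch of the single-needle version; ask_model is a placeholder for whatever client wrapper you use, and the needle and filler strings are illustrative.

```python
# Hypothetical stand-in for whatever client wrapper you use; anything
# that maps a prompt string to a completion string works here.
def ask_model(prompt: str) -> str:
    raise NotImplementedError("wire this up to your model client")

FILLER = "The quick brown fox jumps over the lazy dog. "
NEEDLE = "The access code for vault 7 is 49213."
QUESTION = "What is the access code for vault 7? Answer with the number only."

def needle_trial(total_chars: int, depth: float) -> bool:
    """Bury NEEDLE at a relative depth (0.0 = start, 1.0 = end) inside
    roughly total_chars of filler, then check whether it comes back."""
    haystack = FILLER * (total_chars // len(FILLER))
    cut = int(len(haystack) * depth)
    doc = haystack[:cut] + NEEDLE + " " + haystack[cut:]
    return "49213" in ask_model(f"{doc}\n\n{QUESTION}")

# Sweep insertion depths; a real harness also sweeps document length
# and rotates distinct needles to rule out training contamination.
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(depth, needle_trial(7_200_000, depth))  # ~1.8M tokens of filler
```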

So the context window isn’t pure marketing. It works. But — and this is the part that matters for your engineering decisions — latency at full context is brutal. A 2M-token prompt takes around 90 seconds to process before the model even starts generating. For interactive use cases, that’s a non-starter. For batch reasoning over a codebase or a legal document set, it’s totally fine.
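
Don't take my latency number on faith, either. Time-to-first-token is easy to measure on your own prompts with a streaming call; a sketch with the google-genai SDK, model ID assumed as before:

```python
import time
from google import genai

client = genai.Client()

def time_to_first_token(model: str, prompt: str) -> float:
    """Seconds from sending the request to the first streamed chunk."""
    start = time.monotonic()
    for _chunk in client.models.generate_content_stream(
        model=model, contents=prompt
    ):
        return time.monotonic() - start
    return float("inf")  # stream ended with no chunks

# print(time_to_first_token("gemini-3.1-ultra", huge_prompt))  # hypothetical ID
```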

What this means for you

If you’ve been chunking documents and stitching together RAG pipelines because no model could hold your full context, that workaround is now optional. For document-heavy workflows — legal review, financial analysis, technical documentation — you can hand the model the entire corpus and ask it questions that require global understanding. That’s a different product than chunked-RAG, and the answers are noticeably better.
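
Concretely, the replacement can be as blunt as this sketch: no retriever, no chunker, just the whole corpus in one call. The model ID and the corpus path are my stand-ins.

```python
from pathlib import Path
from google import genai

client = genai.Client()

# Concatenate the whole corpus with clear per-document delimiters so the
# model can cite which file an answer came from.
corpus = "\n\n".join(
    f"=== {p.name} ===\n{p.read_text(errors='ignore')}"
    for p in sorted(Path("contracts/").glob("*.txt"))  # hypothetical path
)

response = client.models.generate_content(
    model="gemini-3.1-ultra",  # hypothetical model ID
    contents=(
        corpus
        + "\n\nAcross ALL of the documents above, which contracts have "
        "conflicting termination clauses? Cite file names."
    ),
)
print(response.text)
```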

For codebases, this is going to change how engineering teams think about AI tooling. Most current AI coding tools quietly give up beyond about 20-50K tokens of code context, which is why they keep suggesting changes that conflict with patterns elsewhere in your codebase. Gemini 3.1 Ultra can hold a typical mid-sized monorepo in working memory and reason across files. I expect to see new IDE tooling built around this within 60 days.
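
Until that tooling lands, you can approximate it yourself by flattening the repo into a single prompt. A minimal sketch; the include and skip lists are assumptions you'd tune for your own stack:

```python
from pathlib import Path

INCLUDE = {".py", ".ts", ".go", ".md"}  # assumption: adjust to your stack
SKIP_DIRS = {".git", "node_modules", "dist", "__pycache__"}

def pack_repo(root: str) -> str:
    """Flatten a repo into one prompt string with per-file path headers,
    so the model can reason across files rather than one chunk at a time."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_dir() or path.suffix not in INCLUDE:
            continue
        if any(d in path.parts for d in SKIP_DIRS):
            continue
        parts.append(f"--- {path} ---\n{path.read_text(errors='ignore')}")
    return "\n\n".join(parts)

prompt = pack_repo("my-monorepo")  # hypothetical path
print(f"~{len(prompt) // 4:,} tokens")  # rough 4-chars-per-token estimate
```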

But — and this matters — for most production use cases, you should still be using Gemini 3.1 Flash-Lite, not Ultra. Flash-Lite is 2.5x faster than the prior Gemini generation and costs $0.25 per million input tokens. Ultra is the model you reach for when the problem genuinely needs the full context window. Don’t pay for capability you don’t use.
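
The arithmetic behind that decision is worth making explicit. A quick sketch at the Flash-Lite rate quoted above; Ultra's rate isn't quoted here, so substitute your own number for that side of the comparison.

```python
# Back-of-envelope input cost at the Flash-Lite rate quoted above
# ($0.25 per million input tokens).
def daily_input_cost(requests_per_day: int, avg_input_tokens: int,
                     usd_per_million: float = 0.25) -> float:
    return requests_per_day * avg_input_tokens * usd_per_million / 1e6

# e.g. 50k requests/day at 3k input tokens each:
print(f"${daily_input_cost(50_000, 3_000):.2f}/day")  # -> $37.50/day
```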

How it stacks up against GPT-5.5 and Claude

I’m going to be more honest than the benchmark sites here, because the synthetic benchmarks don’t tell you what you need to know.

Gemini 3.1 Ultra wins decisively on long-context document reasoning. If you're processing 500K+ tokens, this is now the best model available, full stop. Google's own documentation makes the same claim my testing supports: coherence holds across the entire 2M-token window.

GPT-5.5 still wins on agentic coding workflows where the agent needs to plan, execute tools, observe results, and replan over many steps. The Codex stack OpenAI shipped last month is more polished as a developer experience than what Google has integrated, even though the underlying model capabilities are close.

Claude continues to win on tasks requiring nuanced instruction following and writing quality. If you’re generating customer-facing content where tone matters, Claude is still the model I’d reach for first.

None of these gaps are large enough to be permanent. We’re in a leapfrog phase, and I’d expect the rankings to flip again before the end of Q3.

What I’d ship this week

Two things. First, take any RAG pipeline you’ve built that’s producing mediocre results and try replacing the retrieval+generation step with a single Gemini 3.1 Ultra call against the full corpus. You’ll know within 30 minutes whether it’s better. For some workloads it absolutely will be.

Second, run your hardest reasoning evaluations against Ultra and against your current model. Don’t trust the leaderboards. Your workload is your workload, and the only benchmark that matters is your benchmark.
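
If you don't have an eval harness yet, it doesn't need to be fancy. A sketch; the cases and checkers here are placeholders for your real workload, and the model-calling functions are whatever client wrappers you already have:

```python
from typing import Callable

# Each case pairs a prompt with a checker that encodes YOUR definition
# of correct. These two cases are illustrative placeholders.
CASES: list[tuple[str, Callable[[str], bool]]] = [
    ("Summarize this incident report: ...", lambda out: "root cause" in out.lower()),
    ("Refactor this function to be iterative: ...", lambda out: "def " in out),
]

def run_eval(name: str, call_model: Callable[[str], str]) -> None:
    passed = sum(check(call_model(prompt)) for prompt, check in CASES)
    print(f"{name}: {passed}/{len(CASES)} passed")

# Both wrappers take a prompt string and return a completion string:
# run_eval("gemini-3.1-ultra", call_ultra)
# run_eval("current production model", call_current)
```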

Google Search integration is the slow-burn story here too: the new long-context capabilities are being embedded more aggressively into AI Overviews. If you run a content site, expect more variance in your AI Overview citations over the next 4-6 weeks as Google reweights its sources.

Want more on how AI models are evolving? Browse our AI News & Updates and the latest AI Tools we recommend.

About the Author
Akshay Kothari
AI Tools Researcher & Founder, Tools Stack AI

Akshay has spent years testing and evaluating AI tools across writing, video, coding, and productivity. He's passionate about helping professionals cut through the noise and find AI tools that actually deliver results. Every review on Tools Stack AI is based on real hands-on testing — no guesswork, no sponsored opinions.
