Most LLMs hit a wall somewhere around 128K tokens. The attention mechanism starts to struggle, context gets muddy, and performance degrades. OpenAI worked around this with clever retrieval. Google pushed context windows wider. But both approaches have trade-offs that become painfully obvious when you’re actually trying to reason over a million-token legal document or codebase.
Alibaba’s Tongyi Lab just released QwenLong-L1.5 on December 15, 2025, and it takes a fundamentally different approach. Instead of brute-forcing context or bolting on retrieval, they built a memory management system that actually thinks about what to remember.
The results? An average 9.9-point improvement across long-context benchmarks, performance matching GPT-5.2 and Gemini 3 Pro, and the ability to reason over 4 million tokens. And here’s what caught my attention: they open-sourced everything—weights, training code, and the complete data synthesis pipeline.
The Architecture That Makes 4M Tokens Possible

QwenLong-L1.5 is built on Qwen3-30B-A3B-Thinking, a Mixture-of-Experts (MoE) transformer with:
| Specification | Value |
|---|---|
| Total Parameters | 30.5 billion |
| Active Parameters | 3.3 billion per token |
| Expert Count | 128 (8 activated per token) |
| Layers | 48 |
| Attention Heads | 32 Q / 4 KV (GQA) |
| Native Context | 256K tokens |
| Extended Context | Up to 4M tokens (memory mode) |
That MoE efficiency matters—it’s what makes processing ultra-long contexts practical without requiring a data center.
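A quick back-of-envelope check makes that efficiency concrete. Using only the figures from the spec table, and assuming all 128 experts are equal-sized with everything else (attention, embeddings, router) lumped into a shared bucket, you can solve for the split:

```python
# Back-of-envelope split of the parameter budget, derived from the published totals.
# Assumes equal-sized experts; "shared" covers attention, embeddings, router, etc.
total_params, active_params = 30.5e9, 3.3e9   # from the spec table
n_experts, k_active = 128, 8

# total  = shared + 128 * per_expert
# active = shared +   8 * per_expert   -> subtract to isolate per_expert
per_expert = (total_params - active_params) / (n_experts - k_active)   # ~0.23B
shared = total_params - n_experts * per_expert                          # ~1.5B

print(f"per expert ~ {per_expert / 1e9:.2f}B, shared ~ {shared / 1e9:.2f}B")
# Each token touches shared + 8 experts, about 3.3B parameters -- roughly 11% of
# the model -- which is why ultra-long-context inference stays tractable.
```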
But the base model only handles 256K tokens natively. So how do you get to 4 million? The answer is a three-part system that represents the most sophisticated long-context approach we’ve seen from any lab.
1. Atomic Facts Data Synthesis Pipeline
Here’s the problem with training long-context models: there’s almost no high-quality data. Most documents that even approach a million tokens are poorly structured, repetitive, or just not suitable for reasoning tasks.
Alibaba’s solution is genuinely clever. The pipeline works in stages:
1. Document Decomposition: Source documents are broken into “atomic facts”—the smallest verifiable units of information
2. Relationship Mapping: The system builds a graph of how these facts connect across the document
3. Question Generation: Multi-hop reasoning questions are programmatically composed, requiring the model to ground answers across globally distributed evidence
4. Difficulty Calibration: Questions are tuned to require genuine long-range reasoning, not simple retrieval
It’s not just “find the needle”—it’s “connect seventeen needles scattered across a haystack and reason about their relationships.” This is what makes the training data fundamentally different from previous approaches.
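To make the idea concrete, here is a minimal Python sketch of what stages 1 through 3 could look like. The fact representation, the shared-entity linking rule, and the prompt wording are illustrative assumptions on my part; the actual pipeline, including difficulty calibration, is specified in Alibaba’s released code and technical report.

```python
import itertools
from dataclasses import dataclass

@dataclass(frozen=True)
class AtomicFact:
    fact_id: int
    text: str
    entities: frozenset   # entities mentioned; used to link facts

def build_fact_graph(facts):
    """Relationship mapping: connect two facts when they share at least one entity."""
    return {
        (a.fact_id, b.fact_id)
        for a, b in itertools.combinations(facts, 2)
        if a.entities & b.entities
    }

def find_chains(facts, edges, length=3):
    """Pick groups of facts that are pairwise linked, i.e. multi-hop evidence sets."""
    chains = []
    for combo in itertools.combinations(facts, length):
        ids = [f.fact_id for f in combo]
        if all((a, b) in edges or (b, a) in edges for a, b in zip(ids, ids[1:])):
            chains.append(combo)
    return chains

def compose_question(chain, llm):
    """Question generation: ask an LLM for a question that needs *all* facts in the chain."""
    evidence = "\n".join(f"- {f.text}" for f in chain)
    prompt = (
        "Write one question whose answer requires combining ALL of the "
        f"following facts, without quoting them directly:\n{evidence}"
    )
    return {"question": llm(prompt), "evidence_ids": [f.fact_id for f in chain]}
```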
2. Stabilized Reinforcement Learning with AEPO
Long-context RL is notoriously unstable. When you’re training on tasks ranging from 256K to 4M tokens, two critical problems emerge:
The Data Distribution Problem: Mixing different task types (numerical reasoning, dialogue memory, document QA) creates massive variance in training batches.
The Credit Assignment Problem: In long-context reasoning, an incorrect answer might still follow 90% of correct steps. The model struggles to identify exactly where the reasoning went wrong.
QwenLong-L1.5 introduces Adaptive Entropy-Controlled Policy Optimization (AEPO), which:
- Dynamically regulates exploration-exploitation trade-offs
- Uses task-balanced sampling with task-specific advantage estimation
- Prevents entropy collapse that typically kills model exploration
- Stabilizes KL divergence during long training runs
Together, these techniques stabilize a training regime that was previously considered intractable. The policy optimization itself builds on Group Relative Policy Optimization (GRPO) with trajectory-level rewards, applied across a multi-stage fusion RL training paradigm.
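The report defines AEPO precisely; as a rough intuition, here is one common way to realize adaptive entropy control on top of GRPO-style trajectory rewards: group-relative advantages shared by every token in a rollout, plus an entropy bonus whose coefficient chases a target entropy. Treat this as a sketch of the concept, not the paper’s exact objective (it also omits GRPO’s ratio clipping and the KL term).

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor) -> torch.Tensor:
    """Trajectory-level, group-relative advantages: each rollout's scalar reward is
    normalized against the other rollouts for the same prompt, then shared by every
    token in that rollout (this is the coarse credit assignment discussed above)."""
    return (group_rewards - group_rewards.mean()) / group_rewards.std().clamp_min(1e-6)

def adaptive_entropy_loss(token_logprobs, token_entropy, advantages,
                          log_alpha, target_entropy):
    """REINFORCE-style surrogate with an adaptive entropy bonus.

    token_logprobs, token_entropy: (rollouts, tokens); advantages: (rollouts,).
    log_alpha is a learnable scalar; raising it strengthens exploration.
    """
    alpha = log_alpha.exp().detach()
    pg_loss = -(advantages.unsqueeze(-1) * token_logprobs + alpha * token_entropy).mean()

    # Adapt the coefficient: if entropy drops below target (collapse), the gradient
    # pushes log_alpha up, restoring exploration; if entropy overshoots, it eases off.
    alpha_loss = (log_alpha * (token_entropy.detach() - target_entropy)).mean()
    return pg_loss, alpha_loss
```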
3. Memory-Augmented Agent Architecture
This is where it gets really interesting. For contexts exceeding the 256K native window, QwenLong-L1.5 transforms into a memory agent with a distinct processing pipeline:
Processing Flow:
```
Query → Core Question + Instructions
            ↓
Document → Chunk 1 → Memory Update
         → Chunk 2 → Memory Update (cumulative)
         → ...
         → Chunk N → Memory Update
            ↓
Accumulated Memory → Final Reasoning → Answer
```
Unlike RAG, which retrieves chunks before reasoning, the memory agent reads sequentially and maintains a persistent, evolving memory state. Each chunk can reference and build upon insights from earlier chunks—enabling genuine multi-hop reasoning that single-shot retrieval struggles to match.
Think of it like this: instead of trying to fit everything into working memory simultaneously, the model reads chunks, extracts relevant information into a persistent memory state, and reasons over that accumulated memory. It’s closer to how humans actually process long documents.
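In code, the loop is simple; it’s the training that makes it work. A minimal sketch, assuming `llm` is any prompt-to-text callable and using a character-based split as a stand-in for proper token-based chunking (the actual prompt templates and memory format come from the released agent framework):

```python
from typing import Callable

CHUNK_CHARS = 500_000   # stand-in for token-based chunking under the 256K native window

class MemoryAgent:
    """Sequential read -> memory update -> final reasoning, following the flow above."""

    def __init__(self, llm: Callable[[str], str]):
        self.llm = llm
        self.memory = ""

    def _update_memory(self, query: str, chunk: str) -> None:
        # The model sees the question, its notes so far, and one new chunk,
        # then rewrites the notes -- memory is cumulative, not retrieved.
        self.memory = self.llm(
            f"Question: {query}\n\nNotes so far:\n{self.memory}\n\n"
            f"New chunk:\n{chunk}\n\n"
            "Update the notes with any evidence relevant to the question."
        )

    def answer(self, query: str, document: str) -> str:
        for start in range(0, len(document), CHUNK_CHARS):
            self._update_memory(query, document[start:start + CHUNK_CHARS])
        return self.llm(
            f"Question: {query}\n\nAccumulated notes:\n{self.memory}\n\n"
            "Reason over the notes and answer."
        )
```

Because later chunks see notes built from earlier ones, evidence from chunk 2 can be combined with evidence from chunk 40, which is exactly the multi-hop behavior retrieval pipelines tend to miss.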
The Benchmarks: Verified Numbers
Let’s talk exact numbers, because precision matters here.
| Benchmark | QwenLong-L1.5 Score | Improvement vs Baseline | What It Tests |
|---|---|---|---|
| LongBench-V2 | 55.27 | +6.16 | Multi-task long-context |
| DocMath | 66.26 | +4.00 | Numerical reasoning in docs |
| Frames | 74.76 | +4.49 | Multi-hop QA |
| MRCR | 82.99 | +31.72 | Reading comprehension |
| CorpusQA | Strong | — | Document QA |
| LongBench-V1-QA | Strong | — | General QA |
| Average (6 benchmarks) | 71.82 | +9.90 | Overall |
The MRCR score is particularly striking—a 31.72-point improvement suggests the memory architecture specifically excels at reading comprehension tasks requiring information integration across long distances. For ultra-long tasks (1M-4M tokens), the memory-agent framework delivered a 9.48-point gain over the baseline.
Head-to-Head: QwenLong-L1.5 vs the Frontier Models
Here’s the comparison everyone’s been asking for—QwenLong-L1.5 against the current flagship models from OpenAI, Anthropic, and Google.
| Feature | QwenLong-L1.5 | GPT-5.2 | Claude Sonnet 4.5 | Gemini 3 Pro |
|---|---|---|---|---|
| Context Window | 256K native, 4M memory | 400K | 200K (1M beta) | 1M |
| Architecture | MoE 30.5B/3.3B active | Dense (undisclosed) | Dense (undisclosed) | MoE |
| GPQA Diamond | — | ~92-93% | 83.4% | — |
| SWE-bench | — | 55.6% Pro | 77.2% | — |
| MRCR | 82.99 | ~100% (at 256K) | — | — |
| LongBench-V2 | 55.27 | — | — | — |
| LiveCodeBench | — | — | — | 91.7% |
| Speed | 100+ tok/s (quantized) | Fast (API) | Fast (API) | 3x faster than 2.5 Pro |
| Multimodal | ✗ Text only | ✓ | ✓ | ✓ Native |
| Open Source | ✓ Full weights + code | ✗ | ✗ | ✗ |
| Pricing | Free (self-host) | API only | API only | $0.50/1M input |
Key takeaways:
- Context capacity: QwenLong-L1.5’s memory-augmented approach scales to 4 million tokens: 10x GPT-5.2’s 400K window, 4x Gemini 3 Pro’s 1M, and 20x Claude Sonnet 4.5’s 200K native window
- MRCR dominance: GPT-5.2 Thinking hits near-100% on MRCR at 256K, but QwenLong-L1.5 achieves 82.99 across the full context range—a different but impressive achievement
- Open-source advantage: China’s open-weight models continue to gain ground. QwenLong-L1.5 is the only model here you can actually run locally and fine-tune for your domain
- Specialization matters: Claude Sonnet 4.5 dominates SWE-bench (77.2%), Gemini 3 Pro excels at multimodal tasks, GPT-5.2 leads general reasoning. QwenLong-L1.5 is purpose-built for long-document reasoning
The honest assessment? For general tasks, the closed models still lead. But for extracting insights from massive document collections, reasoning over entire codebases, or any scenario genuinely requiring multi-million-token context—QwenLong-L1.5 is now the leading option, period. And it’s open-source.
The Limitations Nobody’s Talking About

Let me be direct about what QwenLong-L1.5 can’t do yet.
Text-only: The current pipeline doesn’t support multimodal data. If you need to reason over documents with embedded images, charts, or video, you’re looking elsewhere—Gemini 3 Pro handles native multimodal.
Long input, short output: The model handles massive inputs but struggles with “long input, long output” scenarios. Generating a 50-page report or performing major document revisions isn’t in its wheelhouse yet.
Credit assignment is coarse: The RL training uses trajectory-level rewards, meaning the model gets a single signal for an entire reasoning chain. Future versions will likely implement token-level credit assignment for more precise learning.
Memory adds latency: Unlike the implicit memory in larger context windows, the memory-agent approach requires explicit sequential processing. This adds latency and complexity for simpler, shorter-context use cases where a standard model would be faster.
What This Means for Developers
If you’re working on document analysis, legal research, or codebase understanding, QwenLong-L1.5 deserves serious attention.
The practical use cases:
- Processing entire codebases (4M tokens covers most monorepos)
- Legal document review with reasoning across thousands of pages
- Research synthesis across massive paper collections
- Long-form dialogue systems that maintain context for hours
What you’ll need to deploy:
- ~60GB of GPU memory (single card or multi-GPU) for full-precision inference
- 24GB+ VRAM, or 32GB+ unified memory, for quantized inference
- ~60GB of storage for the full-precision weights
The Mixture-of-Experts architecture means you can actually run this on reasonable hardware. The quantized version hits 100+ tokens/second on an M4 Max—that’s local, on-device inference for a model that reasons over millions of tokens.
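For a transformers-based deployment, loading looks like any other Hugging Face causal LM. The repo id below is a placeholder (check the official model card for the real name and recommended long-context settings), and 4-bit quantization via bitsandbytes is just one way to fit the weights into the hardware budget above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "Tongyi-Zhiwen/QwenLong-L1.5-30B-A3B"  # hypothetical repo id

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=quant_config,
    device_map="auto",   # spread layers across available GPUs / CPU
)

prompt = "Summarize the key obligations in the contract below:\n..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```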
For enterprises already exploring AWS AI Factories or similar infrastructure, QwenLong-L1.5 offers a compelling alternative to closed APIs. You control the deployment, the data stays on-premises, and you can fine-tune for specific domains.
The Bottom Line
QwenLong-L1.5 isn’t just an incremental improvement. It’s a proof-of-concept that memory-augmented reasoning can scale to lengths that seemed impractical a year ago.
Is it better than GPT-5.2 or Claude Sonnet 4.5 for general tasks? No. But for extracting insights from massive document collections, reasoning over entire codebases, or any scenario where you genuinely need multi-million-token context—it’s now the leading open-source option.
The bigger story here is the methodology. Alibaba open-sourced not just weights but the complete training recipe. That means the techniques—AEPO, the atomic facts pipeline, the memory framework—become building blocks for the entire field. We’ll see these innovations incorporated into other models within months.
For anyone tracking the broader AI infrastructure landscape, QwenLong-L1.5 reinforces a pattern: the most interesting long-context work is happening outside the closed-source labs.
FAQ
How many tokens can QwenLong-L1.5 actually process?
The base model handles 256K tokens natively. The memory-augmented agent mode extends this to 4 million tokens by processing documents in chunks and accumulating relevant information into persistent memory. Practical performance remains strong across this entire range.
Is QwenLong-L1.5 truly open-source?
Yes. Alibaba released the complete weights on Hugging Face, the full training code on GitHub, and a detailed technical report on arXiv. This includes the data synthesis pipeline and RL training methodology—everything needed to replicate or extend the research.
How does the memory management work compared to RAG?
Unlike RAG (Retrieval-Augmented Generation), which retrieves relevant chunks before reasoning, QwenLong-L1.5’s memory agent reads sequentially and updates a persistent memory state. This allows for genuine multi-hop reasoning where later chunks can reference earlier accumulated insights—something RAG struggles with.
Can I run QwenLong-L1.5 locally?
Yes, with appropriate hardware. The quantized version runs at 100+ tokens/second on an M4 Max with 32GB of unified memory. The full-precision model needs roughly 60GB of GPU memory, typically split across multiple cards. The Mixture-of-Experts architecture (3.3B active parameters) makes inference surprisingly efficient.
How does it compare to GPT-5.2 for long-context tasks?
GPT-5.2 has a 400K context window and achieves near-100% on MRCR at 256K. QwenLong-L1.5 extends to 4M tokens and scores 82.99 on MRCR across wider context ranges. GPT-5.2 excels at general reasoning (GPQA ~92-93%), while QwenLong-L1.5 is specifically optimized for multi-hop reasoning over ultra-long documents. Choose based on your context length requirements.
