Qwen 3.5: A Complete Model Family from 0.8B to 397B
March 08, 2026 · 9 min read
Alibaba’s Qwen team just finished rolling out Qwen 3.5 — not one model but a family of eight, spanning from a 0.8-billion-parameter model that fits on a phone to a 397-billion-parameter flagship that trades blows with the best closed-source systems. Every model ships under Apache 2.0, supports 201 languages, handles text, images, and video natively, and uses a new hybrid architecture that replaces most of the standard attention mechanism with something fundamentally more efficient. GGUF quantizations are already on Hugging Face. Here is what the full lineup offers, how the architecture works, and what it means for running serious AI locally.
The Full Qwen 3.5 Lineup
The release came in three waves across three weeks:
February 16 — the flagship:
- Qwen3.5-397B-A17B — 397 billion total parameters, 17 billion active per token. Mixture-of-Experts with a 262K native context window, extendable to over 1 million tokens.
February 24 — mid-range models:
- Qwen3.5-122B-A10B — 122B total, 10B active. The sweet spot for self-hosted servers.
- Qwen3.5-35B-A3B — 35B total, only 3B active. Light enough for a MacBook.
- Qwen3.5-27B — Dense 27B model. No MoE routing overhead, straightforward to deploy.
March 2 — small models:
- Qwen3.5-9B — The standout. Matches or beats models 3-10x its size on multiple benchmarks.
- Qwen3.5-4B — Multimodal base for lightweight agents, 262K context.
- Qwen3.5-2B — Fast inference on phones and tablets.
- Qwen3.5-0.8B — Under 2 GB of VRAM at full precision. Built for the most constrained edge hardware.
Every model in the family is natively multimodal — text, image, and video inputs are handled by the same architecture, not bolted on as adapters. Every model supports 201 languages and dialects. And every model uses the same new hybrid attention system.
What Changed: Gated DeltaNet Replaces Most Attention Layers
The architectural headline in Qwen 3.5 is a shift away from standard Transformer attention. Roughly 75% of the model’s layers now use Gated DeltaNet — a linear attention mechanism — while the remaining 25% use conventional full attention. This is a 3:1 ratio of linear to quadratic layers.
Standard Transformer attention scales quadratically with sequence length. Double the context, quadruple the compute. Gated DeltaNet sidesteps this by maintaining a compressed state matrix instead of storing all key-value pairs. The result is near-linear scaling — which is how Qwen 3.5 pushes its context window to 262K tokens natively and over a million with extension, without the memory costs that would make that impractical on consumer hardware.
The “gated” part is what makes it competitive with full attention on quality. Gated DeltaNet combines two memory-control mechanisms: a decay gate that can selectively forget the accumulated state (similar to Mamba2), and a delta rule that can surgically update individual key-value associations without disturbing the rest. Together, these give the model flexible control over what it remembers and what it discards across long sequences — avoiding the retrieval degradation that plagued earlier linear attention designs.
The remaining 25% of full attention layers handle tasks where precise token-level relationships matter — the kind of fine-grained matching that linear approximations still struggle with. The hybrid approach gets most of the efficiency gains while keeping quality where it counts.
For local inference, the practical impact is lower memory pressure during long-context generation. A model using Gated DeltaNet does not need to hold an ever-growing KV cache in RAM the way a pure Transformer does. On a device with 8 GB of unified memory, that difference determines whether a conversation can continue or the model runs out of headroom.
Benchmark Results: The 9B Model Is the Story
The flagship 397B model delivers expected frontier-class numbers — it beats Claude 4.5 Opus on the HMMT math benchmark and competes across the board with GPT-5.3. But the numbers that matter for local AI users come from the small models, and the 9B punches far above its weight:
| Benchmark | Qwen3.5-9B | GPT-OSS-120B | Qwen3-30B | Qwen3-80B |
|---|---|---|---|---|
| GPQA Diamond | 81.7 | 71.5 | — | 77.2 |
| HMMT Feb 2025 | 83.2 | 76.7 | — | — |
| MMMU-Pro | 70.1 | 59.7 | — | — |
| IFEval | 91.5 | — | — | 88.9 |
| LongBench v2 | 55.2 | — | — | 48.0 |
A 9B model beating a 120B model on graduate-level science questions (GPQA Diamond) and outscoring an 80B predecessor on instruction following and long-context tasks is not incremental improvement. It reflects the combined gains from the new architecture, better training data, and scaled reinforcement learning that Alibaba applied across the family.
The vision benchmarks are equally strong. On MMMU (multimodal understanding), the 9B scores 78.4 and the 4B scores 77.6 — both competitive with models requiring far more compute. Document recognition on OmniDocBench hits 87.7 for the 9B.
The smaller models hold up too. The 0.8B model reaches 62.2 on MathVista and 63.8 on VideoMME — respectable numbers for a model that fits in under 2 GB of memory.
Running Qwen 3.5 Locally on iPhone and Mac
GGUF quantizations are available through Unsloth on Hugging Face, with Dynamic 2.0 quantization at multiple bit depths. Here are the practical memory requirements:
| Model | Q3 (3-bit) | Q4 (4-bit) | Q6 (6-bit) | Q8 (8-bit) |
|---|---|---|---|---|
| 0.8B | ~3 GB | ~3.5 GB | ~5 GB | ~7.5 GB |
| 2B | ~3 GB | ~3.5 GB | ~5 GB | ~7.5 GB |
| 4B | ~4.5 GB | ~5.5 GB | ~7 GB | ~10 GB |
| 9B | ~5.5 GB | ~6.5 GB | ~9 GB | ~13 GB |
| 35B-A3B | ~3 GB* | ~4 GB* | ~6 GB* | ~8 GB* |
*MoE active parameter memory — total model file is larger, but only the active experts need fast access during inference.
For iPhone 15 Pro and later (8 GB RAM), the 0.8B, 2B, and 4B models at Q4 quantization are comfortable fits. The 9B at Q3 is possible but tight. The 35B-A3B MoE variant is the interesting edge case — with only 3B parameters active per token, actual inference memory is comparable to a 4B dense model, though the full model file is larger and needs to be stored on device.
On Apple Silicon Macs, the story opens up significantly. An M1 with 16 GB of unified memory can handle the 9B at Q4 or even Q6 without issues. M2/M3/M4 machines with 24+ GB can comfortably run the 27B dense model or the 122B MoE variant.
The models work with llama.cpp, Ollama, and any GGUF-compatible inference engine. For a walkthrough of quantization formats and how to pick the right one for your hardware, see our practical quantization guide for iPhone and Mac.
How Qwen 3.5 Compares to the Competition
The local AI model landscape shifted significantly since Qwen3 launched last April. Here is where Qwen 3.5 sits relative to the current field:
Against Gemma 3n and Gemma 3 (Google): Gemma 3 remains strong at the 4B and 27B sizes, but Qwen 3.5’s 9B model now occupies a tier that Gemma does not directly compete in — and beats Gemma’s larger offerings on several reasoning benchmarks. Qwen 3.5 also adds native video understanding, which Gemma 3n does not support. The 201-language coverage versus Gemma’s 140+ is a meaningful gap for developers targeting less-common languages.
Against Llama (Meta): Llama remains popular for its ecosystem and community tooling, but has not released a model family matching Qwen 3.5’s combination of multimodal capability, language coverage, and efficient MoE variants at this range of sizes. The Qwen 3.5 family offers more deployment options from edge to server.
Against Phi-4 (Microsoft): Microsoft’s Phi-4-Reasoning-Vision, also released this month, targets the 15B sweet spot with strong math and science reasoning. It is a compelling single model, but Qwen 3.5 offers a complete family — if you need a model at 0.8B, 4B, 9B, 35B, or 122B, there is a Qwen 3.5 variant purpose-built for that scale. Phi-4 does not cover that range.
Against closed-source (GPT-5.4, Claude 4.6): The flagship 397B model is competitive on benchmarks, but the real comparison that matters is the 9B model versus cloud APIs. A locally-running 9B model with 81.7 on GPQA Diamond is not far from what cloud models scored a generation ago — except it runs on your hardware, costs nothing per query, and keeps every token private. For many practical use cases, the gap has closed enough that the privacy and cost advantages of local inference outweigh the remaining quality difference.
What Makes This Release Different
Model releases are constant. What makes Qwen 3.5 worth paying attention to is the combination of three things that rarely ship together:
A real architecture change. Gated DeltaNet is not a training trick or a prompting technique — it is a fundamental rethinking of how attention works in a Transformer. The 3:1 hybrid ratio means these models process long contexts with meaningfully less memory than architecturally identical models from six months ago. That matters most at the edge, where every megabyte counts.
Full-range coverage under one architecture. Having the same architectural approach from 0.8B to 397B means findings and optimizations transfer across the family. Fine-tuning a 4B model for a specific task gives you a reasonable prediction of how the 9B or 27B will behave on the same task. Tooling built for one size works with all of them.
Apache 2.0 across the board. No gated access, no usage restrictions, no commercial license fees. Every model in the family is available for any use — commercial products, research, modification, redistribution. For developers and companies building on local AI, this removes the licensing uncertainty that still surrounds some competing model families.
The Bottom Line
Qwen 3.5 is the most complete open-source model family available today. Eight models spanning four orders of magnitude in size, all sharing a new hybrid architecture, native multimodal capabilities, and 201-language support — released over three weeks under a fully permissive license.
For local AI users, the 9B model is the headline. It delivers graduate-level reasoning performance in a package that fits on an iPhone or runs comfortably on any Mac. The MoE variants — especially the 35B-A3B — offer an unusual efficiency profile where a model with 35 billion parameters of knowledge activates only 3 billion per token, keeping inference fast and memory requirements low.
GGUF versions are on Hugging Face now. If you have been running Qwen3 locally, the upgrade path is straightforward — same ecosystem, better architecture, stronger benchmarks at every size. If you are new to local AI, the 4B model at Q4 quantization is a reasonable starting point: multimodal, multilingual, and small enough to run on any recent iPhone without compromise.