Local LLMs in September 2025: What’s New and What Matters
September 06, 2025
Local large language models took another leap this summer. Beyond faster chips, we saw real progress in platform support, in‑browser inference, friendlier tools, and model efficiency. Here’s what changed—and how to take advantage of it on iPhone and Mac.
Platforms and runtimes: local by default
- AMD Gaia (Windows): AMD introduced an open‑source stack for running LLMs locally on mainstream PCs, built on the Lemonade SDK from ONNX TurnkeyML. Gaia ships a retrieval‑augmented generation (RAG) agent and two installers: a standard build for any Windows PC and a hybrid build that taps Ryzen AI hardware. The focus: security, offline use, and low‑latency responses.
- Matured local stacks: The usual suspects—llama.cpp variants and ONNX Runtime—continue to stabilize quantized inference across consumer hardware, with more consistent memory footprints and better long‑context behavior.
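For a sense of what driving one of these stacks looks like in practice, here's a minimal sketch using the llama-cpp-python bindings to load a 4‑bit GGUF and run one chat turn. The model path and settings are placeholders for whatever quantized build you actually have on disk.

```python
# Minimal local-inference sketch with llama-cpp-python (pip install llama-cpp-python).
# The GGUF path below is a placeholder; point it at any 4-bit quantized model you have locally.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.1-8b-instruct-Q4_K_M.gguf",  # placeholder path
    n_ctx=8192,        # context window; larger values use more RAM
    n_gpu_layers=-1,   # offload all layers to Metal/CUDA when available
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize why quantization matters on laptops."},
    ],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
```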
In‑browser inference is real
- WebLLM + WebGPU: Running LLMs entirely in the browser is now practical. WebLLM exposes an OpenAI‑style API and, by targeting WebGPU for GPU compute and WebAssembly for CPU work, retains up to roughly 80% of native performance on the same device. The result is privacy‑preserving apps that run anywhere with nothing to install.
Tooling and UX: easier than ever
- LM Studio 3.0: An all‑in‑one local AI workbench with simultaneous chat, API, and fine‑tuning modes, integrated model discovery, and one‑click deployment.
- Ollama AI Suite: Enterprise‑grade management for local models—deployment, versioning, monitoring, and IDE integrations—plus improved Windows support.
- H2OGPT Pro: A privacy‑first assistant with stronger document understanding, collaborative features, and customizable templates for industry‑specific workflows.
These tools reduce the barrier to entry and make serious local workflows possible without MLOps expertise.
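One reason the workflow feels turnkey: these tools generally expose an OpenAI‑compatible HTTP server, so existing client code just needs a different base URL. A minimal sketch, assuming LM Studio's local server is running on its default port with a model loaded; Ollama's compatible endpoint at http://localhost:11434/v1 works the same way, and the model name below is a placeholder for whatever your server reports.

```python
# Talk to a local OpenAI-compatible server (LM Studio, Ollama, etc.) with the openai client.
# Assumes LM Studio's server is running on its default port with a model loaded;
# swap base_url for http://localhost:11434/v1 to target Ollama instead.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # local endpoint, not api.openai.com
    api_key="not-needed-locally",         # local servers usually ignore the key
)

reply = client.chat.completions.create(
    model="qwen2.5-7b-instruct",  # placeholder: use whatever model name your server reports
    messages=[{"role": "user", "content": "Draft a two-sentence release note."}],
)
print(reply.choices[0].message.content)
```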
Models and architectures: capacity when you need it
- MoE goes mainstream: Mixture‑of‑Experts designs emphasize high total capacity but activate only a small subset of parameters per token, delivering strong quality without the full runtime cost (a toy routing sketch follows this list).
- Reasoning‑tuned small models: Distilled reasoning variants bring step‑by‑step thinking to compact footprints—great for laptops and even phones.
- Strong open‑weight families: Gemma 3, Qwen 3, Llama 3.x/4‑class releases, and Phi‑4 mini/14B continue to raise the quality bar at 1–32B scales.
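To make the MoE idea concrete, here's a toy top‑2 routing layer in PyTorch: a small gate scores the experts for each token, only the two highest‑scoring experts actually run, and their outputs are mixed by the gate weights. This illustrates the routing principle only, not how any particular model implements it; the payoff is that per‑token compute scales with k rather than with the total number of experts.

```python
# Toy Mixture-of-Experts layer: only the top-k experts run for each token.
# Illustrates the routing idea only; real MoE models add load balancing, capacity limits, etc.
import torch
import torch.nn as nn


class ToyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])
        self.gate = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x):                        # x: (tokens, d_model)
        scores = self.gate(x)                    # (tokens, n_experts)
        top_w, top_idx = scores.topk(self.k, dim=-1)
        top_w = torch.softmax(top_w, dim=-1)     # mixing weights for the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):               # run only the selected experts per token
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


tokens = torch.randn(10, 64)
print(ToyMoE()(tokens).shape)  # torch.Size([10, 64])
```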
Compression and fine‑tuning: smaller, smarter, cheaper
- Quantization: 4‑bit (Q4) remains the sweet spot for phones and most Macs, with Q5 improving reasoning/coding if memory allows. Ternary (~1.58‑bit) formats are emerging for select models.
- PEFT: LoRA/QLoRA let you adapt models efficiently without retraining the entire network, ideal for constrained devices or quick vertical customizations.
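Here's a minimal sketch of the QLoRA pattern with Hugging Face transformers and peft: load a base model in 4‑bit, then attach small LoRA adapters so only a tiny fraction of parameters trains. The model id and hyperparameters are illustrative placeholders, and note that bitsandbytes 4‑bit loading generally expects a CUDA GPU; on Apple Silicon the same adapter idea is available through MLX or llama.cpp's LoRA support.

```python
# QLoRA-style setup: 4-bit base model + LoRA adapters (pip install transformers peft bitsandbytes).
# Model id and hyperparameters below are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",         # placeholder model id
    quantization_config=bnb_config,
)

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections are a common choice
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```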
Edge and hybrid: right‑sizing inference
- Edge computing continues to drive offline and low‑latency scenarios across mobile and IoT.
- Local‑cloud offloading research (e.g., LCIO) explores dynamically splitting work between a lightweight local model and a large cloud model to maximize quality while staying within device limits.
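In its simplest form, the hybrid pattern is just a router in front of two OpenAI‑compatible endpoints: keep easy prompts on‑device and escalate long or hard ones to the cloud. The sketch below illustrates that pattern only; it is not the LCIO method, and the thresholds, hint keywords, and model names are placeholders.

```python
# Illustrative local-vs-cloud router, not the LCIO method itself.
# Escalates long or "hard" prompts to a cloud endpoint; everything else stays on-device.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")  # e.g. Ollama
cloud = OpenAI()  # reads OPENAI_API_KEY from the environment

HARD_HINTS = ("prove", "refactor", "multi-step", "analyze this codebase")

def route(prompt: str, local_model="llama3.1:8b", cloud_model="gpt-4o"):
    hard = len(prompt) > 2000 or any(h in prompt.lower() for h in HARD_HINTS)
    client, model = (cloud, cloud_model) if hard else (local, local_model)
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

print(route("Summarize today's notes in three bullets."))
```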
Practical picks for iPhone and Mac (September 2025)
These are pragmatic defaults; always match the model to your task and device memory.
iPhone (and most iPads)
- Best sizes: 0.5B–4B effective parameters
- Great defaults: Gemma 3n (effective ~4B), Phi‑4 mini/flash (~3.8B), Qwen 2.5/3 in 1–3B, Liquid AI LFM‑2 700M for snappy mobile use
- Quantization: Q4 for comfort; Q3 for maximum footprint savings; Q5 only if thermals allow
Mac (Apple Silicon)
- 8–16GB RAM: 7B (Q4/Q5) for everyday chat/writing and lightweight coding
- 16–32GB RAM: 7B–14B (Q4/Q5) for stronger writing/coding and long‑context tasks
- 32–64GB RAM: 14B–32B (Q4/Q5) for reasoning and multi‑file coding
- 128GB+ RAM: 70B class (Q4) for near‑frontier local performance
Tip: For coding/reasoning, try Q5 if your device allows; for drafting and chat, Q4 usually feels great while staying cool.
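A quick way to sanity‑check these pairings is a back‑of‑the‑envelope weight footprint: parameters times bits per weight divided by 8, plus headroom for the KV cache and the runtime. The constants in this sketch are rules of thumb, not exact figures for any specific model or quant format.

```python
# Back-of-the-envelope RAM estimate for a quantized model.
# Rule-of-thumb only: real usage depends on the runtime, context length, and quant format.
def estimate_ram_gb(params_billions: float, bits_per_weight: float, overhead: float = 1.25) -> float:
    weight_gb = params_billions * 1e9 * bits_per_weight / 8 / 1e9
    return weight_gb * overhead  # headroom for KV cache, activations, and the runtime

for name, params, bits in [("7B @ Q4", 7, 4.5), ("14B @ Q4", 14, 4.5), ("70B @ Q4", 70, 4.5)]:
    print(f"{name}: ~{estimate_ram_gb(params, bits):.1f} GB")
# -> roughly 4.9 GB, 9.8 GB, and 49 GB respectively
```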
Using these models in Enclave AI
Enclave AI is built for private, on‑device work on iPhone and Mac. Pick efficient small models on mobile, then scale to 14B/32B (and 70B on high‑RAM Macs) when you need deeper reasoning or bigger contexts—all without sending your data to the cloud. Prefer a mixed approach? Enclave can also connect to cloud endpoints when you explicitly choose them.
Key takeaways
- Local is now table stakes: Windows (Gaia), browser (WebLLM), and Mac/iOS all have strong options.
- Efficiency beats raw size: MoE, distillation, and quantization make small footprints punch above their weight.
- UX caught up: LM Studio, Ollama, and H2OGPT Pro make local deployments turnkey.
- Start small, scale as needed: Pick for the task, quantize for comfort, and only move up when you feel limited.