Running large language models (LLMs) locally has often meant accepting slower speeds and tighter memory limits. Ollama’s latest update, built on Apple’s MLX framework, goes some way toward easing those constraints – especially for developers running AI agents directly on their machines.
In tandem, the release also introduces support for NVIDIA’s NVFP4 format, which targets memory efficiency for larger models.
For context, Ollama is an open-core runtime for LLMs that can be run locally, with a growing catalogue of open-weight models from major AI labs such as Meta, Google, Mistral, and Alibaba, which can be downloaded and run on a developer’s own machine or private infrastructure. It also integrates with coding agents, assistants, and developer tools, allowing those tools to run on locally hosted models instead of relying solely on external APIs.
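In practice, that integration usually happens over Ollama's local HTTP API, which listens on port 11434 by default. A minimal sketch of how a tool might construct such a request is below; the model name is an illustrative placeholder for whatever model the developer has pulled locally.

```python
import json
from urllib import request

# Ollama serves a local HTTP API on port 11434 by default.
# A coding tool can send prompts to a locally hosted model via
# the /api/generate endpoint (model name here is illustrative):
payload = {
    "model": "qwen2.5-coder",  # any locally pulled model works
    "prompt": "Write a function that reverses a string.",
    "stream": False,           # ask for one complete response
}

req = request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Actually sending the request requires a running Ollama server:
#   resp = request.urlopen(req)
#   print(json.loads(resp.read())["response"])
print(req.full_url)
```

Because the endpoint is just local HTTP, any editor plugin or agent framework that can be pointed at a base URL can swap a hosted API for the local server.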
Local speed gains

News emerged in early 2025 that Ollama was developing support for MLX, an open source machine learning framework Apple introduced in 2023 to run models efficiently on Apple Silicon. Its core feature — and that of Apple’s modern hardware — is a shared memory model that allows CPU and GPU workloads to operate on the same data without the usual transfer overhead, reducing latency and improving throughput during inference.
With its latest release, Ollama now plugs directly into that architecture. In its announcement on Monday, the company points to improvements in both responsiveness and generation speed, particularly for coding-focused models.