
Ollama Running Slow? How to Speed Up Local LLM Inference


If Ollama running slow is grinding your local AI workflow to a halt, you are not alone. Many UK businesses deploying large language models on-premises hit the same wall: promising initial tests followed by sluggish real-world performance that makes the setup feel barely usable. This guide walks through every practical fix, from hardware choices to configuration tweaks, so you can get meaningful inference speeds without spending a fortune.

Local LLM inference with Ollama offers real advantages for UK SMEs: data stays on-site, there are no per-token API costs, and you retain full control over sensitive business information. But those benefits disappear quickly if your team is waiting thirty seconds for each response. Understanding why Ollama runs slowly and what you can do about it is the difference between a useful tool and an abandoned experiment.


Why Is Ollama Running Slow? The Core Causes

Ollama inference speed comes down to one fundamental bottleneck in the vast majority of cases: memory bandwidth. Large language models need to load billions of parameters from storage into working memory and then shuttle those parameters through a processor at high speed for every single token generated. When that pipeline is constrained, generation slows to a crawl regardless of how fast your CPU clock speed appears on paper.

The three most common culprits are insufficient RAM causing model layers to spill onto disk, running entirely on CPU when a GPU is available or could be added, and using a model that is simply too large for the hardware you have. Before tuning any settings, it is worth diagnosing which of these is your primary issue, as the fixes are quite different.

  • Model layers offloaded to disk: Dramatically reduces tokens per second, often to single digits
  • CPU-only inference: Typically 5 to 15 tokens per second on a modern workstation; usable but slow
  • Mismatched model size: Running a 70B parameter model on hardware suited to 7B
  • Quantisation not applied: Using full-precision weights when a quantised version would fit in VRAM
  • Thermal throttling: Sustained inference loads can cause CPUs and GPUs to reduce clock speeds
  • Background processes competing for RAM: A Windows or Linux server doing other work while serving Ollama requests
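Before changing anything, it helps to confirm which of these causes applies to you. A quick diagnostic pass on a Linux host with an NVIDIA GPU might look like the following sketch (the `ollama ps` command requires a reasonably recent Ollama version):

```shell
# Check how a loaded model is split between CPU and GPU:
# the PROCESSOR column shows e.g. "100% GPU" or a CPU/GPU split
ollama ps

# Watch GPU utilisation and VRAM usage once per second (NVIDIA only)
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1

# Check whether system RAM is exhausted and swap is being used
free -h
```

If `ollama ps` reports a large CPU percentage, or `free -h` shows swap in use while a model is loaded, you have found your bottleneck before touching a single setting.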

Step 1: Match Your Model Size to Your Available RAM

The single most impactful change most users can make is choosing a model size that fits comfortably within their available memory. Running a model that exceeds your VRAM forces Ollama to offload layers to system RAM, and if that also overflows, layers go to your storage drive. System RAM bandwidth is typically ten to twenty times lower than GPU VRAM bandwidth, and NVMe SSD bandwidth is an order of magnitude lower again. The performance cliff is severe.

As a rule of thumb, the model weights in gigabytes need to fit inside your GPU VRAM for fast inference. A Q4 quantised 7B model requires roughly 4 to 5 GB of VRAM. A Q4 13B model needs approximately 8 to 9 GB. A Q4 34B model wants around 20 GB. Our detailed guide on how much RAM you need to run Ollama models covers this in full, including which model families are most memory-efficient for given tasks.

Model Size | Quantisation      | VRAM Required | Expected Speed
7B         | Q4_K_M            | ~5 GB         | 40 to 80 tokens/sec
13B        | Q4_K_M            | ~8 GB         | 25 to 50 tokens/sec
34B        | Q4_K_M            | ~20 GB        | 10 to 25 tokens/sec
70B        | Q4_K_M            | ~40 GB        | 5 to 15 tokens/sec
7B         | Q4_K_M (CPU only) | ~5 GB RAM     | 5 to 15 tokens/sec
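The figures above can be approximated with simple arithmetic: Q4_K_M stores roughly 4.85 bits per weight, plus some fixed overhead for buffers and context. This hypothetical sizing helper (the function name and the ~0.5 GB overhead figure are illustrative assumptions, not Ollama internals) gives a back-of-envelope VRAM estimate from the parameter count in billions:

```shell
# Rough Q4_K_M footprint in GB: params (billions) * 4.85 bits / 8, plus
# ~0.5 GB assumed overhead. A sketch, not an exact measurement.
q4_gb() { awk -v p="$1" 'BEGIN { printf "%.1f\n", p * 4.85 / 8 + 0.5 }'; }

q4_gb 7     # ~4.7 GB: fits an 8 GB card with room for context
q4_gb 13    # ~8.4 GB: wants a 12 GB card
q4_gb 34    # ~21.1 GB: wants a 24 GB card
```

Leave a gigabyte or two of headroom on top of these estimates for the KV cache, which grows with your context window.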

If you are currently on CPU-only inference, dropping from a 13B to a 7B model and applying Q4 quantisation will often double or triple your tokens per second immediately, with minimal visible quality loss for most business tasks such as summarisation, drafting, or classification.


Step 2: Enable GPU Acceleration (CUDA or ROCm)

If Ollama is running slow and you have a discrete GPU in your system, the first question to ask is whether Ollama is actually using it. Many users assume GPU acceleration is automatic, but it requires the correct drivers and runtime libraries to be installed and recognised. Ollama supports NVIDIA GPUs via CUDA and AMD GPUs via ROCm on Linux.

You can confirm whether Ollama is using your GPU by running ollama run [model] and watching GPU utilisation in a tool such as nvidia-smi or GPU-Z. If the GPU sits at idle while a query is processing, something is misconfigured. Our guide on fixing Ollama GPU not detected errors covers the CUDA and ROCm troubleshooting steps in detail, including driver version mismatches and WSL2 configuration on Windows.
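A quick way to see real token throughput alongside GPU usage is the `--verbose` flag on `ollama run`, which prints timing statistics after each response (the model tag here is illustrative):

```shell
# Run a test prompt with timing output appended after the response
ollama run llama3.1:8b-instruct-q4_K_M --verbose "Summarise local LLM inference in one sentence."

# --verbose reports load duration, prompt eval rate, and eval rate
# (tokens/sec). Single-digit eval rates on a machine with a discrete
# GPU usually mean layers are silently falling back to CPU.
```

Comparing the reported eval rate against the speeds in the table in Step 1 tells you quickly whether your hardware is delivering what it should.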

For UK businesses looking to add GPU acceleration without a full server rebuild, consumer cards such as the NVIDIA RTX 4060 (8 GB VRAM, typically from around £280 to £320) or the RTX 4070 (12 GB VRAM, typically around £500 to £580) offer a substantial step up for local inference. The RTX 4090 with 24 GB VRAM is the current sweet spot for running 34B models fully in VRAM, though pricing sits at around £1,700 to £2,000 new. Workstation cards such as the NVIDIA RTX 4000 Ada (20 GB VRAM) are also worth considering if you need ECC memory or a lower profile form factor.


Step 3: Tune Ollama Configuration for Your Hardware

Ollama exposes several environment variables that can meaningfully improve performance once your hardware is correctly configured. These are often overlooked because the default settings work acceptably in testing but do not reflect real production workloads.

  • OLLAMA_NUM_PARALLEL: Controls how many requests Ollama processes simultaneously. Defaults to 1. Increasing this to 2 or 4 can improve throughput when multiple users hit the server concurrently, though it splits available VRAM between contexts.
  • OLLAMA_MAX_LOADED_MODELS: Sets how many models stay loaded in memory at once. Keeping this at 1 when you only use one model avoids unnecessary VRAM consumption.
  • num_gpu: Specifies how many model layers to offload to the GPU. Unlike the others listed here, this is a model option (set via PARAMETER num_gpu in a Modelfile, or in the API request options) rather than an environment variable. Setting it explicitly rather than relying on auto-detection can prevent partial CPU fallback on some configurations.
  • OLLAMA_FLASH_ATTENTION: Setting this to 1 enables Flash Attention, which reduces memory usage and can speed up inference on supported hardware. Available in recent Ollama versions.
  • OLLAMA_KEEP_ALIVE: Determines how long a model stays loaded between requests. Setting to a high value (e.g., 30m or -1 for always loaded) eliminates the cold-load penalty on the first request after a period of inactivity.

On Linux, set these in your systemd service file or shell profile. On Windows, set them as system environment variables through the Advanced System Settings panel. After any change, restart the Ollama service completely and run a test prompt to verify the effect.
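On a standard Linux install where Ollama runs under systemd, a drop-in override is the cleanest way to set these. A minimal sketch (the specific values shown are examples, not recommendations for every workload):

```shell
# Open an override file for the Ollama service in your default editor
sudo systemctl edit ollama
# In the editor, add lines such as:
#   [Service]
#   Environment="OLLAMA_KEEP_ALIVE=30m"
#   Environment="OLLAMA_FLASH_ATTENTION=1"
#   Environment="OLLAMA_NUM_PARALLEL=2"

# Restart the service so the new environment takes effect
sudo systemctl restart ollama

# Confirm the variables are visible to the running service
systemctl show ollama --property=Environment
```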

The context window size is another lever. Every model request carries a context buffer in VRAM. If you are using a 128K context window but your actual prompts are only a few hundred tokens, reduce num_ctx in your Modelfile to 2048 or 4096. This frees VRAM for the model weights themselves and reduces per-token computation.
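Shrinking the context window is done by building a variant of the model from a Modelfile. A minimal example, assuming the base tag is already pulled (the variant name llama3.1-4k is made up for illustration):

```shell
# Create a Modelfile that inherits the base model but caps the context
cat > Modelfile <<'EOF'
FROM llama3.1:8b-instruct-q4_K_M
PARAMETER num_ctx 4096
EOF

# Build and test the smaller-context variant
ollama create llama3.1-4k -f Modelfile
ollama run llama3.1-4k "Test prompt"
```

The variant shares the underlying weights with the base model, so it costs almost no extra disk space.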


Step 4: Choose the Right Quantisation Level

Quantisation reduces the numerical precision of model weights, shrinking the file size and memory footprint in exchange for a small reduction in output quality. For most business applications, the tradeoff is extremely favourable. Q4_K_M is generally considered the best balance: it cuts a 7B model from around 14 GB (FP16) to roughly 4.5 GB with minimal quality degradation on practical tasks.

Ollama’s model library on ollama.com provides multiple quantisation variants for most popular models. When pulling a model, you can specify the tag directly. For example, ollama pull llama3.1:8b-instruct-q4_K_M pulls the Q4_K_M quantised version. If you previously pulled the default tag without specifying quantisation, you may be running a larger variant than necessary.
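Checking what you actually have on disk takes two commands (the tag is one example from the library; other model families follow the same pattern):

```shell
# Pull an explicitly quantised variant
ollama pull llama3.1:8b-instruct-q4_K_M

# List local models with their on-disk sizes
ollama list
```

If the default tag you pulled earlier shows a markedly larger size in `ollama list`, switching to the quantised variant will both fit in less VRAM and run faster.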

Quantisation | Relative Size | Quality vs FP16 | Best Use Case
FP16         | 100%          | Reference       | Research, maximum quality
Q8_0         | ~53%          | Near-identical  | High-end GPU with spare VRAM
Q4_K_M       | ~30%          | Very good       | Most business workloads
Q4_0         | ~28%          | Good            | Memory-constrained systems
Q2_K         | ~18%          | Acceptable      | Very low VRAM, simple tasks

For teams doing summarisation, email drafting, FAQ generation, or basic data extraction, Q4_K_M on a well-chosen 7B or 8B model will be nearly indistinguishable from FP16 in day-to-day use. Dropping to Q2_K is only advisable when you genuinely cannot fit a Q4 model and cannot upgrade your hardware.


Step 5: Upgrade Your Hardware Strategically

If you have applied all the above optimisations and inference speed is still insufficient for your use case, hardware is the next conversation. The good news for UK SMEs is that meaningful upgrades do not require enterprise-grade investment. A mid-range dedicated inference server built around a consumer GPU can comfortably serve a team of ten to twenty users running 7B to 13B models.

For CPU-only inference, the limiting factor is memory bandwidth rather than core count. AMD Threadripper and EPYC processors have notably higher memory bandwidth than standard desktop Ryzen or Intel Core chips, and this translates directly into faster tokens per second when running entirely in system RAM. Apple Silicon (M2, M3, M4 chips) is particularly efficient for CPU inference due to its unified memory architecture, where the GPU and CPU share the same high-bandwidth memory pool. If your team is already using Macs, running Ollama on a Mac Mini M4 Pro (typically around £1,400) can serve small teams very effectively without any GPU configuration headaches. Our guide on how to run Ollama on a home server covers server build recommendations in more detail.

For Windows or Linux servers, adding a single NVIDIA RTX 4070 or 4070 Ti (12 GB VRAM) to an existing machine with a compatible PCIe slot is often the most cost-effective upgrade available. It shifts most inference off the CPU entirely and delivers a five to ten times improvement in tokens per second for 7B and 8B models. Ensure your server’s power supply can handle the additional load; most RTX 4070 cards require a 650W or higher PSU.


Step 6: Optimise Your Server Environment

Even with good hardware and correct configuration, environmental factors can constrain Ollama performance. Thermal throttling is a common and often overlooked cause of inconsistent speeds: a GPU or CPU running sustained inference at near-maximum load will reduce its clock speed when temperatures climb, leading to speeds that start fast but degrade over minutes. Ensure your server or workstation has adequate airflow, clean fans, and fresh thermal paste if it is more than two or three years old.

On Windows, ensure Ollama is running with adequate process priority and that no background tasks are competing for memory. Windows Update, antivirus scans, and backup jobs are common culprits for intermittent slowdowns during inference. On Linux, use nice and ionice to ensure the Ollama process has scheduling priority, particularly on shared servers running other services alongside LLM inference.

Storage speed matters less than memory bandwidth for inference, but it does affect model load time. If your models are stored on a mechanical hard drive, moving them to an NVMe SSD will dramatically reduce the cold-start delay when Ollama first loads a model into memory. This is especially relevant if you have set OLLAMA_KEEP_ALIVE to a short value and models are frequently unloaded and reloaded between sessions.
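You can measure the cold-load penalty directly. Recent Ollama versions include an `ollama stop` command to unload a model; timing the same prompt twice then shows the difference between a cold and a warm start (the model tag is illustrative):

```shell
# Unload the model to force a cold start
ollama stop llama3.1:8b-instruct-q4_K_M

# First run includes loading the weights from disk
time ollama run llama3.1:8b-instruct-q4_K_M "Say OK"

# Second run hits an already-loaded model
time ollama run llama3.1:8b-instruct-q4_K_M "Say OK"
```

The gap between the two timings is roughly what a short OLLAMA_KEEP_ALIVE costs you on each first request after idle.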


Key Takeaways

  • Ollama running slow is most commonly caused by model layers spilling out of VRAM into system RAM or disk, not CPU speed
  • Always verify GPU acceleration is actually active using monitoring tools before assuming it works
  • Matching model size to available VRAM using Q4_K_M quantisation is the most impactful single change for most users
  • Environment variables such as OLLAMA_KEEP_ALIVE, OLLAMA_FLASH_ATTENTION, and reduced num_ctx offer free performance gains
  • For CPU-only inference, memory bandwidth matters more than core count; Apple Silicon and AMD EPYC offer advantages here
  • A single mid-range NVIDIA RTX GPU upgrade can deliver five to ten times speed improvement for teams currently on CPU inference
  • Thermal management and background process interference are common real-world causes of inconsistent speeds
  • Reducing context window size when large contexts are unnecessary frees VRAM for model layers and reduces computation


Frequently Asked Questions

Why is Ollama so slow on my CPU compared to cloud AI tools?

Cloud AI APIs run on high-end hardware with multiple GPUs and optimised inference stacks, often processing thousands of tokens per second. Consumer CPUs typically manage 5 to 15 tokens per second for a 7B model, which feels slow by comparison. Adding a mid-range GPU and using Q4 quantisation can bring local speeds close enough to cloud performance for most business tasks, while keeping data entirely on-premises.

Does more RAM always mean faster Ollama inference?

Not necessarily. More RAM helps if your model was previously overflowing into swap or disk, which would be causing a severe bottleneck. But once your model fits comfortably in memory, additional RAM does not increase tokens per second. The limiting factor then becomes memory bandwidth and compute throughput, which is why GPU VRAM with its much higher bandwidth is so much more effective than adding system RAM.

What is the fastest free way to speed up Ollama without new hardware?

The fastest free improvements are: switching to a Q4_K_M quantised model variant if you are not already using one, reducing the context window size using num_ctx in your Modelfile to match your actual prompt lengths, setting OLLAMA_KEEP_ALIVE to a high value to eliminate cold-load delays, and enabling Flash Attention via the OLLAMA_FLASH_ATTENTION environment variable if your Ollama version supports it. Together these changes can noticeably improve responsiveness without spending anything.

Can I run Ollama across multiple GPUs to speed things up?

Ollama does support multi-GPU configurations and will automatically distribute model layers across available GPUs when the model is too large for a single card. However, the performance scaling is not always linear due to inter-GPU communication overhead. For most SME scenarios, a single larger GPU (such as an RTX 4090 with 24 GB VRAM) will outperform two smaller GPUs of equivalent total VRAM due to lower latency and simpler memory management.

Is it worth running Ollama on a dedicated server rather than a workstation?

For teams of more than three or four people, a dedicated inference server makes practical sense. It avoids competing with someone’s daily workload, allows you to keep models always loaded in memory, and provides a consistent API endpoint for multiple applications. A dedicated machine does not need to be expensive: a refurbished workstation with a modern GPU added can serve a small team effectively for well under £2,000 total investment, with no ongoing API costs.


