
How Much RAM Do You Need to Run Ollama Models?

If you are trying to work out how much RAM for Ollama you actually need, you are not alone. It is one of the most common questions from UK business owners and IT managers who are exploring running large language models locally, whether on a dedicated server, a workstation, or even a capable desktop machine. Get the RAM wrong and your models will refuse to load, run painfully slowly, or crash mid-conversation.

This guide breaks down Ollama RAM requirements by model size, explains how memory usage works in practice, and gives you clear recommendations whether you are running a small 7B model for testing or a full 70B model for production use. We will keep it practical and focused on what actually matters when you are spending real money on hardware.


How Ollama Uses RAM (and Why It Matters So Much)

Ollama loads AI model weights directly into memory before it can process any prompts. Unlike a traditional application that might use a few hundred megabytes, a large language model needs to hold billions of parameters in memory simultaneously. The more parameters, the more memory required. This is not negotiable. If your system does not have enough RAM or VRAM to hold the model, Ollama will either refuse to run it or fall back to extremely slow disk-based inference that renders it practically unusable for business purposes.

There are two memory types at play here. VRAM is the memory on your GPU, and it is by far the fastest option for inference. System RAM (your standard DDR4 or DDR5) is used when running Ollama in CPU-only mode. Both matter, but they behave differently. A model that fits entirely in VRAM will generate tokens many times faster than one that has to rely on system RAM. For business use where staff are waiting for responses, this difference is very significant.

Quantisation also plays a major role. Most models available through Ollama are distributed in quantised formats such as Q4_K_M or Q8_0, which reduce the memory footprint significantly compared to the full-precision weights. A 7B model at Q4 quantisation uses roughly 4 to 5 GB of memory, whereas the same model at full 16-bit precision would need around 14 GB. Understanding quantisation levels helps you pick the right model for your available hardware.
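That rule of thumb can be sketched as a simple calculation: multiply the parameter count by the bits per weight, then add headroom for runtime buffers. The bits-per-weight figures and the 15% overhead below are rough assumptions, not exact values for any specific model file.

```python
# Rough memory estimate for a quantised model: parameters x bits per weight,
# plus ~15% overhead for activations, buffers, and runtime state.
# The bits-per-weight figures are approximate GGUF averages, not exact.

BITS_PER_WEIGHT = {"Q4_K_M": 4.5, "Q8_0": 8.0, "F16": 16.0}

def approx_model_gb(params_billions: float, quant: str, overhead: float = 0.15) -> float:
    """Approximate in-memory size in GB for a quantised model."""
    bytes_total = params_billions * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return round(bytes_total * (1 + overhead) / 1e9, 1)

print(approx_model_gb(7, "Q4_K_M"))            # -> 4.5 (matches the 4 to 5 GB figure)
print(approx_model_gb(7, "F16", overhead=0))   # -> 14.0 (weights alone at half precision)
```

The same function applied to a 70B model at Q4 lands in the 40 to 45 GB range discussed later in this guide.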


Ollama RAM Requirements by Model Size

The table below gives you a practical reference for the most common Ollama model sizes and their approximate memory requirements. These figures are based on typical Q4 quantisation, which is what Ollama pulls by default for most models. They represent the memory needed to load and run inference, not peak usage including the operating system and other applications.

Model Size | Examples | Approx. RAM (Q4) | Recommended System RAM
3B | Phi-3 Mini, Llama 3.2 3B | 2 to 3 GB | 16 GB
7B | Mistral 7B, Llama 3.1 8B, Gemma 7B | 4 to 5 GB | 16 GB
13B | Llama 2 13B, CodeLlama 13B | 8 to 9 GB | 32 GB
34B | CodeLlama 34B, Yi 34B | 19 to 22 GB | 64 GB
70B | Llama 3.1 70B, Qwen2 72B | 38 to 45 GB | 64 to 128 GB
405B | Llama 3.1 405B | 220 GB+ | 256 GB+ (enterprise only)

These figures assume CPU-only inference using system RAM. If you have a GPU, Ollama will attempt to load as much of the model as possible into VRAM and use system RAM only for the remainder. Partially GPU-accelerated inference is faster than pure CPU but slower than running entirely in VRAM. For most UK SME deployments, the sweet spot is a 7B or 13B model on a machine with 32 GB of RAM and a mid-range GPU with at least 8 GB of VRAM.


CPU-Only vs GPU-Accelerated Ollama Deployments

Running Ollama on CPU alone is perfectly possible and can produce good results for smaller models, but you will notice a significant difference in response speed. On a modern multi-core server CPU, a 7B model at Q4 quantisation might generate 10 to 20 tokens per second. That is readable and usable for asynchronous tasks, but it can feel slow when a member of staff is sitting waiting for a response in real time.

Add a GPU with sufficient VRAM and that figure can jump to 60 to 120 tokens per second for the same 7B model, which feels almost instant for most practical tasks. Consumer GPUs such as the NVIDIA RTX 4070 (12 GB VRAM) or RTX 4090 (24 GB VRAM) are popular choices for on-premise Ollama setups. For a 13B model, you will ideally want 16 GB of VRAM or more. If you are encountering GPU detection issues, our guide on Ollama GPU not detected and how to fix CUDA and ROCm errors covers the most common problems in detail.
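Rather than relying on published benchmarks, you can measure generation speed on your own hardware. Ollama's `--verbose` flag prints timing statistics after each response, including the eval rate (tokens per second during generation). The model name below is just an example of one you have already pulled.

```shell
# Measure real tokens-per-second on your hardware: --verbose prints
# timing statistics after the response, including the eval rate.
ollama run mistral --verbose "Explain RAM vs VRAM in two sentences."
```

Compare the eval rate with and without GPU acceleration to see the difference described above on your own machine.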

For businesses considering a dedicated local AI server, AMD GPUs using ROCm are a viable alternative to NVIDIA CUDA, though driver support can require more configuration. Apple Silicon Macs (M2 Pro, M2 Ultra, M3, M4 series) are an increasingly popular option because they use unified memory shared between CPU and GPU, meaning a Mac Studio with 64 GB of unified memory can run a 34B model with excellent performance. This architecture removes the VRAM bottleneck entirely.


Recommended Hardware Configurations by Use Case

Different business use cases demand different hardware configurations. A solicitor’s office using Ollama for document summarisation has different requirements from a software development firm running code generation models. Below are practical configurations based on common UK SME scenarios.

  • Basic testing and evaluation (16 GB RAM, no GPU): Suitable for running 3B and 7B models. Response times will be moderate on CPU. Good for exploring what Ollama can do before committing to hardware spend.
  • Single-user productivity (32 GB RAM, 8 GB VRAM GPU): Run 7B and 13B models with GPU acceleration. Fast enough for real-time chat, summarisation, and code assistance. This is the minimum recommended for regular daily use.
  • Small team deployment (64 GB RAM, 16 to 24 GB VRAM GPU): Supports 34B models in VRAM or 70B models with RAM overflow. Can handle multiple concurrent users depending on the model and context window size.
  • Departmental or production server (128 GB RAM, dual GPU or 48 GB VRAM GPU): Enables full 70B model inference in VRAM with room for multiple concurrent sessions. Suitable for integrating Ollama into business applications and APIs.
  • Enterprise-level (256 GB+ RAM, multi-GPU): Required for 405B parameter models or high-concurrency production workloads. Typically requires server-grade hardware and is beyond most SME budgets.

If you are planning to run Ollama as a persistent service on your own infrastructure rather than a cloud instance, our guide on how to run Ollama on a home server provides a useful starting point for the server setup process, including OS configuration and networking considerations.


How Quantisation Affects Your RAM Requirements

Quantisation is the process of reducing the numerical precision of a model’s weights in order to shrink its memory footprint. Ollama supports multiple quantisation levels, and understanding these can help you squeeze more performance out of limited hardware or fit a larger model into the RAM you already have.

Quantisation Level | Bits per Weight | Quality vs Full Precision | 7B Model Size
Q2_K | ~2.5 bit | Noticeable quality loss | ~2.7 GB
Q4_K_M | ~4.5 bit | Minimal quality loss (recommended) | ~4.8 GB
Q5_K_M | ~5.5 bit | Very close to original | ~5.7 GB
Q8_0 | 8 bit | Near identical to original | ~8.0 GB
F16 (half precision) | 16 bit | Full quality | ~14 GB

For most business applications, Q4_K_M is the sensible default. It offers a strong balance between output quality and memory efficiency, and it is what Ollama typically pulls when you run a model without specifying a tag. If you have plenty of VRAM to spare, stepping up to Q8_0 can improve output quality for tasks like legal document analysis, technical writing, or anything where precision of language matters.

If you are severely RAM-constrained, you can drop to Q2 or Q3, but be aware that the output quality degrades noticeably at these levels. For professional business use, this is generally not advisable. It is better to run a smaller, well-quantised model than a larger, poorly quantised one.
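In practice, you choose a quantisation level by pulling a specific tag from the Ollama library. The exact tag names below are examples and vary per model, so check the model's page on ollama.com/library before pulling.

```shell
# Pull the same model at different quantisation levels via tags.
# Tag names vary per model; these are illustrative examples.
ollama pull llama3.1:8b                  # default tag, typically a Q4 build
ollama pull llama3.1:8b-instruct-q8_0    # higher-precision Q8_0 build

# Compare the on-disk sizes of everything you have pulled:
ollama list
```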


Context Window Size and Its Impact on RAM Usage

One factor that is often overlooked in Ollama RAM planning is the context window. The context window determines how much text the model can hold in its working memory during a conversation. Longer conversations, longer documents, and longer prompts all consume more memory during inference. This is separate from the model weights themselves and is often referred to as the KV cache.

Ollama defaults to a context window of 2048 tokens for most models. If you increase this, for example to 8192 or 32768 tokens to support longer documents, your RAM usage during inference will increase accordingly. For a 7B model running at Q4 with a 2048 token context, you might use around 5 GB total. Extend the context to 32768 tokens and you could easily need 10 to 12 GB. This is a critical consideration if you are planning to use Ollama for summarising long contracts, processing technical manuals, or handling multi-turn business conversations.
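The KV cache grows linearly with context length, so you can estimate it from the model's architecture. The layer and head counts below are illustrative (roughly Llama-3-8B-like, with grouped-query attention and fp16 cache entries), not exact values for any specific Ollama model.

```python
# KV-cache memory grows linearly with context length. Keys and values are
# stored for every layer and every token; figures below are illustrative.

def kv_cache_gb(ctx_tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size in GB (factor 2 = keys plus values)."""
    total = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem
    return round(total / 1e9, 2)

print(kv_cache_gb(2048))    # -> 0.27 (modest at the default context)
print(kv_cache_gb(32768))   # -> 4.29 (16x larger at a long context)
```

Older models without grouped-query attention cache far more heads per layer, which is why long contexts on those models can add several extra gigabytes beyond this estimate.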

You can control the context window in Ollama using a Modelfile or by setting the num_ctx parameter in your API calls. If you are running into memory errors on otherwise compatible hardware, an unexpectedly large context window is often the culprit. Always account for both the model weights and the expected context window size when planning your RAM budget.
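Both approaches look like this in practice. The model name is an example; the Modelfile syntax (`FROM`, `PARAMETER num_ctx`) and the API's `options` field are standard Ollama features.

```shell
# Option 1: bake a larger context window into a custom model via a Modelfile.
cat > Modelfile <<'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 8192
EOF
ollama create llama3.1-8k -f Modelfile

# Option 2: set it per request through the API's "options" field.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Summarise this contract clause...",
  "options": { "num_ctx": 8192 }
}'
```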


Best RAM Choices for an Ollama Server Build

When specifying RAM for an Ollama server, speed and capacity both matter, but capacity should be your priority. DDR5 offers higher bandwidth than DDR4 and can marginally improve CPU inference speeds, but the difference is less dramatic than simply having more RAM available. For a dedicated Ollama workstation or server, aim to populate all memory channels to maximise bandwidth.

  • Consumer desktop builds: DDR5 at 5600 MHz or faster, 32 GB as a minimum for 7B to 13B models, 64 GB for 34B models. Brands such as Corsair, Kingston, and Crucial are reliable choices available from UK suppliers including Scan, Overclockers, and Ebuyer.
  • Workstation builds (AMD Threadripper or Intel Xeon W): ECC DDR5 at 4800 MHz, configured in quad-channel for maximum bandwidth. 128 GB or 256 GB configurations are practical for 70B models without GPU acceleration.
  • Server builds (1U or 2U rack servers): Registered ECC DDR4 or DDR5, typically purchased as part of a configured system from suppliers such as Dell, HPE, or Lenovo. These systems support larger DIMM configurations and are better suited to high-availability deployments.
  • Apple Silicon (Mac Studio or Mac Pro): Unified memory from 32 GB to 192 GB depending on specification. No separate VRAM to worry about. Particularly well-suited for running Ollama cleanly on mid-to-large models without complex driver configuration.

RAM speed matters less than you might expect for GPU-accelerated inference because the bottleneck in that scenario is the VRAM bandwidth on the GPU, not your system RAM. However, for pure CPU inference, faster memory with more channels does make a measurable difference, particularly for larger models where memory bandwidth becomes the limiting factor.
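The bandwidth effect is easy to reason about: for memory-bound generation, each new token must stream the full set of model weights from memory, so bandwidth divided by model size gives a rough upper bound on tokens per second. The bandwidth figures below are ballpark assumptions, not measured values.

```python
# Rough upper bound for memory-bandwidth-bound inference:
#   tokens/s ~= memory bandwidth (GB/s) / model size in memory (GB)
# Bandwidth figures below are ballpark assumptions, not measurements.

def max_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    return round(bandwidth_gb_s / model_gb, 1)

print(max_tokens_per_sec(80, 4.5))     # -> 17.8  (dual-channel DDR5-class RAM)
print(max_tokens_per_sec(1000, 4.5))   # -> 222.2 (high-end GPU VRAM)
```

This is why a 7B model manages 10 to 20 tokens per second on CPU but far more in VRAM, and why populating all memory channels matters for CPU-only builds.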


Key Takeaways

  • Ollama RAM requirements scale directly with model size and quantisation level. A 7B model at Q4 needs roughly 4 to 5 GB, while a 70B model needs 38 to 45 GB.
  • Always add headroom beyond the model size alone. Your operating system, other applications, and the KV cache for context windows all consume additional memory.
  • GPU VRAM provides dramatically faster inference than system RAM. Aim to fit your chosen model entirely within VRAM where possible.
  • Quantisation lets you run larger models on less memory, but quality degrades at very low quantisation levels. Q4_K_M is the recommended default for business use.
  • Context window size significantly affects RAM usage during inference. Longer documents require more memory beyond the base model footprint.
  • For most UK SME deployments, 32 GB of system RAM with an 8 to 12 GB VRAM GPU covers 7B to 13B models comfortably. Scale to 64 GB RAM and 24 GB VRAM for 34B models.
  • Apple Silicon Macs with unified memory are an increasingly practical and straightforward option for on-premise Ollama deployments without the need for separate GPU configuration.


Frequently Asked Questions

Can I run Ollama with 8 GB of RAM?

Yes, but your options are limited. With 8 GB of system RAM you can run small models such as Phi-3 Mini (3B) or TinyLlama with reasonable performance on CPU. A 7B model at Q4 will technically load on 8 GB but leaves almost no headroom for the operating system and will likely cause slowdowns or swapping to disk. For any serious business use, 16 GB is the practical minimum and 32 GB is strongly recommended.

Does Ollama use RAM or VRAM?

Ollama uses whichever memory is available and prioritises VRAM for GPU-accelerated inference. If your model fits entirely within your GPU’s VRAM, Ollama will load it there and use the GPU for all inference, which is significantly faster. If the model is too large for your VRAM, Ollama will split it across VRAM and system RAM, using partial GPU acceleration. If there is no GPU or insufficient VRAM, Ollama falls back to CPU-only inference using system RAM.
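You can see exactly where a loaded model has ended up using `ollama ps`, which lists loaded models with their memory size and a processor column. The model name below is an example.

```shell
# Load an example model, then check how it was placed in memory.
# The PROCESSOR column shows "100% GPU", "100% CPU", or a CPU/GPU
# split when the model has overflowed VRAM into system RAM.
ollama run mistral "hello" >/dev/null
ollama ps
```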

How much RAM do I need to run Llama 3.1 70B with Ollama?

At Q4_K_M quantisation, Llama 3.1 70B requires approximately 40 to 45 GB of memory to load. You should have at least 64 GB of system RAM to run it comfortably on CPU, with 128 GB recommended for reliable performance and longer context windows. To run it entirely in VRAM, you would need a GPU with 48 GB of VRAM such as the NVIDIA RTX 6000 Ada or A6000, or a multi-GPU setup. This places it firmly in the professional workstation or server category rather than a typical office desktop.

Will adding more RAM make Ollama faster?

It depends on your current setup. If your model is currently spilling onto disk due to insufficient RAM, adding more RAM will make a very large difference. If your model is already fully loaded into system RAM but running on CPU only, adding more RAM will not dramatically improve speed. The most impactful upgrade for inference speed is adding a GPU with sufficient VRAM to hold your model entirely, or upgrading to a GPU with higher memory bandwidth. RAM speed (MHz) has a modest effect on CPU inference performance.

What is the best setup for running Ollama in a small business?

For most UK small businesses, a dedicated workstation or small server with 32 to 64 GB of DDR5 RAM and an NVIDIA RTX 4070 or RTX 4080 GPU provides an excellent balance of cost and capability. This configuration can run 7B to 13B models with full GPU acceleration and handle multiple staff members using the service throughout the day. If budget allows, stepping up to 64 GB RAM and an RTX 4090 (24 GB VRAM) opens up smooth 34B model inference. Always ensure the machine is running continuously and accessible on your local network, and consider security and access control if the Ollama API is exposed internally.


