What is Ollama and what does it do?
Ollama is an open-source tool that lets you download, manage, and run large language models entirely on your own computer. It handles all the complexity of model weights, runtime configuration, and GPU inference behind a simple command-line interface and a local HTTP API. Think of it as a package manager for AI models — you pull a model by name, and Ollama takes care of the rest. Once a model is running, you can chat with it in your terminal, call it from your own applications, or connect it to compatible front-end tools like Open WebUI.
Is Ollama free to use?
Yes. Ollama is completely free and open source, released under the MIT licence. There are no subscriptions, usage limits, or API costs. The models you run through Ollama are also generally free to download, though each has its own licence — some permit commercial use and some do not, so it is worth checking before using a model in a product or business context.
Does Ollama work without an internet connection?
Yes. Once you have pulled a model to your machine, Ollama runs it entirely locally with no internet connection required. The initial download needs internet access, but after that the model files are stored on your device and inference happens completely offline. This makes Ollama a strong choice for privacy-sensitive use cases, air-gapped environments, or situations where you cannot rely on a stable connection.
What operating systems does Ollama support?
Ollama supports macOS (Apple Silicon and Intel), Linux (x86_64 and ARM64), and Windows. On macOS, Apple Silicon Macs get the best performance because Ollama uses Metal to accelerate inference across the unified memory architecture. On Linux, NVIDIA GPUs are supported via CUDA and AMD GPUs via ROCm. On Windows, Ollama runs natively and supports NVIDIA and AMD GPUs as well as CPU-only mode.
How much RAM do I need to run Ollama?
The RAM you need depends on the model. As a rough guide:
- 7B–8B parameter models (e.g. Llama 3.1 8B, Mistral 7B) — around 6–8 GB RAM
- 13B parameter models — typically 10–12 GB RAM
- 30B–34B parameter models — around 20–24 GB RAM
- 70B parameter models — 40 GB or more
Quantised models (4-bit or 8-bit) use significantly less memory than full-precision variants. On Apple Silicon Macs, RAM is shared between CPU and GPU, so the full figure applies directly to the unified memory pool.
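The figures above can be approximated with a simple back-of-the-envelope rule: weight memory is roughly parameters × bits per weight ÷ 8, plus some overhead for the KV cache and runtime buffers. The sketch below uses an assumed 20% overhead factor — it is a rough heuristic, not Ollama's actual memory accounting.

```python
# Rough memory estimate for a quantised model: parameters x bytes per
# weight, plus ~20% assumed overhead for KV cache and runtime buffers.
def estimate_memory_gb(params_billion: float, bits_per_weight: int = 4) -> float:
    weights_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits ~ 1 GB
    return round(weights_gb * 1.2, 1)                  # ~20% runtime overhead

print(estimate_memory_gb(7, 4))    # 7B model, 4-bit quantisation
print(estimate_memory_gb(70, 4))   # 70B model, 4-bit quantisation
```

The 70B estimate lands around 42 GB, in line with the "40 GB or more" figure above; higher-precision quantisations scale the number up proportionally.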
Does Ollama need a GPU?
No. Ollama will fall back to CPU inference automatically if no supported GPU is available. However, GPU inference is dramatically faster — CPU inference on larger models can be slow for interactive use. If you have a supported NVIDIA GPU (CUDA), AMD GPU (ROCm), or an Apple Silicon Mac (Metal), Ollama will use it automatically. For a responsive chat experience, a GPU or Apple Silicon chip is strongly recommended.
Which models can I run with Ollama?
Ollama’s library includes a wide and growing range of popular open-weight models: Meta’s Llama 3 and Llama 3.1 series, Mistral and Mixtral, Google’s Gemma, Microsoft’s Phi-3, Alibaba’s Qwen, Code Llama, DeepSeek Coder, and many others. Browse the full catalogue at ollama.com/library. You can also import GGUF model files from other sources (such as Hugging Face) using a Modelfile.
How do I install Ollama?
On macOS and Windows, download the installer from ollama.com and run it. On Linux, use the official install script:
curl -fsSL https://ollama.com/install.sh | sh
This installs the ollama binary and sets up a systemd service so the server starts automatically. Confirm with ollama --version.
How do I pull and run a model?
Use ollama pull to download a model, then ollama run to start an interactive session:
ollama pull llama3.2
ollama run llama3.2
If you run a model that hasn’t been pulled yet, ollama run will download it automatically first. Specify a variant with a colon: ollama pull llama3.2:3b or ollama pull llama3.2:1b.
How do I update a model to the latest version?
Simply pull the model again:
ollama pull llama3.2
Ollama checks the manifest and downloads only what has changed. There is no separate update command — pulling always fetches the current version of the tag you specify.
Where does Ollama store model files?
Model files are stored locally in:
- macOS and Linux: ~/.ollama/models
- Windows: %USERPROFILE%\.ollama\models
Large models take up several gigabytes each. You can point Ollama to a different directory by setting the OLLAMA_MODELS environment variable.
How do I delete a model I no longer need?
ollama rm llama3.2
This removes the model’s files from disk and frees up storage. See all downloaded models with ollama list. Each tag (e.g. llama3.2:1b and llama3.2:3b) must be removed separately.
Can I use Ollama with a chat UI instead of the terminal?
Yes. Ollama’s built-in terminal interface is functional but minimal. For a more user-friendly experience, Open WebUI provides a ChatGPT-style interface with conversation history, model switching, and file uploads. Other options include Msty, Chatbox, and Hollama. Most connect to Ollama automatically if it is already running on your machine.
What is the Ollama API and how do I use it?
When Ollama is running, it exposes a local HTTP API on http://localhost:11434. Core endpoints:
- POST /api/generate — single-turn text generation
- POST /api/chat — multi-turn conversations with message history
- POST /api/embeddings — generate text embeddings
- GET /api/tags — list locally available models
Official client libraries are available for Python (pip install ollama) and JavaScript.
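As a minimal sketch of the raw API, the standard library alone is enough — no client library required. This assumes the default port 11434 and a pulled llama3.2 model; setting "stream" to false makes the server return a single JSON object instead of a stream of chunks.

```python
import json
import urllib.request

# Build a non-streaming request for Ollama's /api/generate endpoint.
def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    url = "http://localhost:11434/api/generate"
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(url, data=body,
                                  headers={"Content-Type": "application/json"})

# Send the request and return the generated text. Requires a running
# Ollama server with the model available, so it is not called here.
def generate(model: str, prompt: str) -> str:
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]

req = build_generate_request("llama3.2", "Why is the sky blue?")
print(json.loads(req.data)["model"])
```

The official Python client (pip install ollama) wraps the same endpoints with a friendlier interface.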
Is Ollama compatible with the OpenAI API?
Yes. Ollama includes an OpenAI-compatible endpoint at http://localhost:11434/v1. Any application or library that targets the OpenAI API — including the official openai Python package — can be pointed at your local Ollama instance with just a base URL change and a dummy API key. This makes it straightforward to swap cloud-based OpenAI calls for local model inference with minimal code changes.
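A sketch of the swap, using the official openai Python package (the model name and prompt here are illustrative). The api_key value is required by the client but ignored by Ollama, so any placeholder works; network calls are kept inside functions so nothing runs until you call chat() against a live server.

```python
# Point the official OpenAI Python client at a local Ollama server.
OLLAMA_BASE_URL = "http://localhost:11434/v1"

def make_client():
    from openai import OpenAI  # pip install openai
    return OpenAI(base_url=OLLAMA_BASE_URL, api_key="ollama")  # key is a dummy

def chat(prompt: str, model: str = "llama3.2") -> str:
    resp = make_client().chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

Existing code that already uses the OpenAI client typically needs only the base_url change to run against Ollama.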
Can I use Ollama with VS Code or other code editors?
Yes. The Continue VS Code extension is the most widely used option — it provides inline code completion and a chat sidebar, connecting to Ollama’s API with straightforward configuration. Cody from Sourcegraph also supports local Ollama models. In editors like Cursor, you can configure a custom OpenAI-compatible base URL pointing at http://localhost:11434/v1.
How do I run multiple models at the same time?
Ollama can keep more than one model loaded in memory simultaneously. Set the OLLAMA_MAX_LOADED_MODELS environment variable to control how many models can be resident at once:
OLLAMA_MAX_LOADED_MODELS=3 ollama serve
When you call a model that is not currently loaded, Ollama loads it automatically, potentially unloading the least-recently-used model if the limit has been reached. Models stay in memory for 5 minutes after last use by default — controlled by OLLAMA_KEEP_ALIVE.
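Besides the environment variable, keep_alive can also be set per request in the API body: a duration string such as "10m", 0 to unload the model immediately after the response, or -1 to keep it in memory indefinitely. A sketch of such a payload:

```python
import json

# Request body for /api/generate with a per-request keep_alive override:
# this model stays loaded for 10 minutes after the response.
payload = {
    "model": "llama3.2",
    "prompt": "Hello",
    "stream": False,
    "keep_alive": "10m",
}
body = json.dumps(payload)  # POST this to http://localhost:11434/api/generate
print(body)
```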
What is a Modelfile and what can I do with it?
A Modelfile is a plain-text configuration file — similar in concept to a Dockerfile — that defines how Ollama should set up and run a model. You can customise the system prompt, adjust parameters (temperature, context length), import a GGUF file from disk, or apply a LoRA adapter. Build a custom model with:
ollama create my-custom-model -f ./Modelfile
ollama run my-custom-model
This is useful for creating task-specific personas, coding assistants with baked-in instructions, or wrappers around community fine-tunes downloaded from Hugging Face.
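A minimal example Modelfile for a persona like the ones described above — the base model, parameter values, and system prompt here are all illustrative:

```
# Build with: ollama create my-custom-model -f ./Modelfile
FROM llama3.2
PARAMETER temperature 0.2
PARAMETER num_ctx 4096
SYSTEM You are a concise technical assistant. Answer in plain English.
```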
Why is Ollama running slowly?
Slow inference usually comes from one of a few causes: Ollama may be running on the CPU rather than the GPU (check that your GPU is being detected); the model may be too large for your available VRAM, causing layers to spill into slower system RAM; or, on Linux, you may have missing or mismatched GPU drivers. Running a smaller or more aggressively quantised model is the quickest fix. Check Ollama's logs to see which device it is using.
Why isn’t Ollama detecting my GPU?
On Linux with NVIDIA GPUs, the most common cause is missing or mismatched CUDA drivers. Ensure nvidia-smi reports your GPU correctly. For AMD GPUs on Linux, ROCm must be properly installed and your GPU must be in ROCm’s supported hardware list. On Windows, make sure GPU drivers are up to date. Inspect Ollama’s logs (journalctl -u ollama on Linux) to see what happened at startup.
Can I share my Ollama instance with other devices on my network?
By default, Ollama only listens on 127.0.0.1 and is not accessible from other devices. To expose it on your local network, set the OLLAMA_HOST environment variable to 0.0.0.0:11434 before starting the server. Be cautious on untrusted networks — the API has no built-in authentication. For broader access, place a reverse proxy (nginx or Caddy) with authentication in front of it.
Is my data private when using Ollama?
Yes. Because everything runs locally, your prompts, responses, and documents never leave your device. There is no telemetry sent to Ollama’s servers, no cloud logging of your conversations, and no third-party model provider receiving your data. This is one of Ollama’s most important advantages over cloud-based AI services.
How is Ollama different from LM Studio?
Both tools let you run LLMs locally, but they take different approaches. LM Studio is a graphical desktop application with a polished GUI, accessible for users who prefer not to use the command line. Ollama is primarily a CLI and API tool — lighter weight, no GUI of its own, and better suited to developers who want to integrate local inference into applications or scripts. Ollama installs as a background service that other tools can call; LM Studio is more self-contained. Many users run both: Ollama as the always-on backend and a web UI such as Open WebUI for conversational use.
Can I run Ollama on a server headlessly?
Yes. Running ollama serve starts the API server without opening any interactive session. The official Linux install script sets up a systemd service that runs Ollama automatically, so you do not need to keep a terminal open. On a headless server, interact with Ollama via its HTTP API or by SSH-ing in to run ollama commands. This makes it straightforward to host a private inference endpoint on a home server, NAS, or cloud instance.
How do I keep Ollama running in the background automatically?
On Linux, the install script creates a systemd service:
sudo systemctl enable ollama # start at boot
sudo systemctl start ollama # start now
sudo systemctl status ollama # check status
To set environment variables (such as OLLAMA_HOST or OLLAMA_KEEP_ALIVE), add them to a systemd override with sudo systemctl edit ollama, then reload with sudo systemctl daemon-reload && sudo systemctl restart ollama.
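sudo systemctl edit ollama opens a drop-in override file; a sketch of what it might contain (the values shown are illustrative):

```
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=30m"
```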
On macOS, the desktop app registers a login item that keeps Ollama running in the background, visible in the menu bar. On Windows, Ollama similarly runs as a background process after installation.


