Ollama is an open-source tool that lets you download, run, and manage large language models locally on Linux, macOS, and Windows. This cheat sheet covers every CLI command, Modelfile instruction, REST API endpoint, and environment variable you are likely to need — structured as a dense reference you can bookmark and return to rather than read top to bottom.
Installation Quick Reference
Install
| Platform | Method | Command / Action |
|---|---|---|
| Linux | One-liner script | curl -fsSL https://ollama.com/install.sh \| sh |
| macOS | Homebrew | brew install ollama |
| macOS | Manual download | Download .dmg from ollama.com and drag to Applications |
| Windows | Installer | Download OllamaSetup.exe from ollama.com and run it |
Verify and start
| Command | What it does |
|---|---|
| ollama --version | Print installed Ollama version |
| ollama serve | Start the Ollama API server manually (not needed if running as a systemd service or macOS menu bar app) |
Default API port: 11434
API base URL: http://localhost:11434
Model Management
| Command | What it does | Example |
|---|---|---|
| ollama pull <model> | Download the latest tag of a model | ollama pull llama3.2 |
| ollama pull <model>:<tag> | Download a specific version or quantisation | ollama pull llama3.1:8b-instruct-q4_K_M |
| ollama list | List all locally installed models with size and modified date | ollama list |
| ollama ls | Alias for ollama list | ollama ls |
| ollama show <model> | Display model details: parameters, template, and system prompt | ollama show llama3.2 |
| ollama show <model> --modelfile | Print the full Modelfile for a model | ollama show llama3.2 --modelfile |
| ollama rm <model> | Delete a model and free disk space | ollama rm llama3.1:8b |
| ollama cp <source> <dest> | Copy a model under a new name (useful before customising) | ollama cp llama3.2 my-llama |
| ollama push <model> | Push a model to the Ollama registry (requires a free account at ollama.com) | ollama push myuser/my-llama |
Understanding model tag conventions
Model tags typically follow the pattern <family>:<size>-<type>-<quantisation>. For example, llama3.1:8b-instruct-q4_K_M breaks down as:
| Part | Meaning | Notes |
|---|---|---|
| 8b | 8 billion parameters | Larger models are more capable but slower and need more RAM |
| instruct | Fine-tuned for instruction following / chat | Base (non-instruct) models are for continuation tasks only |
| q4_K_M | 4-bit quantisation, K-quant medium | Lower bit = smaller file, slight quality reduction. q4_K_M is a popular balance point. fp16 = unquantised 16-bit weights, largest file. |
If you omit the tag (e.g. ollama pull llama3.2), Ollama pulls the model's latest tag, which points to the default recommended variant for that model.
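The tag pattern described above can be split mechanically. The following is a minimal Python sketch, not part of Ollama itself: `ModelTag` and `parse_model_tag` are hypothetical names, and the code assumes the tag really follows the size-type-quantisation layout, which not every published tag does.

```python
from typing import NamedTuple, Optional


class ModelTag(NamedTuple):
    """Pieces of a model reference such as family:size-type-quantisation."""
    family: str
    size: Optional[str]
    variant: Optional[str]
    quantisation: Optional[str]


def parse_model_tag(reference: str) -> ModelTag:
    """Split a reference on ':' and then on the first two '-' characters."""
    family, _, tag = reference.partition(":")
    if not tag:
        # No tag given: the remaining fields are unknown.
        return ModelTag(family, None, None, None)
    parts = tag.split("-", 2)
    parts += [None] * (3 - len(parts))  # pad any missing trailing fields
    return ModelTag(family, parts[0], parts[1], parts[2])


print(parse_model_tag("llama3.1:8b-instruct-q4_K_M"))
```

A short tag such as `llama3.2:3b` yields only a size, with the other two fields left as `None`.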
Running Models
ollama run flags
| Command | What it does | Example |
|---|---|---|
| ollama run <model> | Start an interactive chat session in the terminal | ollama run llama3.2 |
| ollama run <model> "prompt" | Send a single prompt and exit (non-interactive) | ollama run llama3.2 "Summarise Linux in one paragraph" |
| ollama run <model> --verbose | Show token generation speed and stats after the response | ollama run llama3.2 --verbose "Hello" |
| ollama run <model> --nowordwrap | Disable automatic word wrapping in the terminal | ollama run llama3.2 --nowordwrap |
| ollama run <model> --format json | Force the model to output valid JSON | ollama run llama3.2 --format json "List 3 distros" |
| ollama run <model> --keepalive 0 | Unload the model from memory immediately after the response | ollama run llama3.2 --keepalive 0 "Hello" |
| ollama run <model> --keepalive -1 | Keep the model loaded in memory indefinitely | ollama run llama3.2 --keepalive -1 |
| echo "prompt" \| ollama run <model> | Pipe a prompt from stdin | echo "What is Nginx?" \| ollama run llama3.2 |
| ollama run <model> < file.txt | Redirect a file as the prompt | ollama run llama3.2 < prompt.txt |
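When these commands are driven from a script, it can help to assemble the argument list in one place. The sketch below is one possible way to do that in Python, using only flags that appear in the table above. `build_run_command` is a hypothetical helper, not an Ollama API.

```python
import shlex
from typing import List, Optional


def build_run_command(
    model: str,
    prompt: Optional[str] = None,
    *,
    output_format: Optional[str] = None,
    keepalive: Optional[str] = None,
    verbose: bool = False,
) -> List[str]:
    """Return an argv list for `ollama run` using flags from the table."""
    argv = ["ollama", "run", model]
    if verbose:
        argv.append("--verbose")
    if output_format is not None:
        argv += ["--format", output_format]
    if keepalive is not None:
        argv += ["--keepalive", keepalive]
    if prompt is not None:
        argv.append(prompt)  # a prompt makes the run non-interactive
    return argv


cmd = build_run_command("llama3.2", "List 3 distros", output_format="json")
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, capture_output=True, text=True)` would execute it, provided the ollama binary is on PATH and the model is installed.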
Interactive session slash commands
These commands are available once you are inside an ollama run session.
| Command | What it does |
|---|---|
| /? or /help | Show all available slash commands |
| /set <option> <value> | Change a session setting (e.g. /set parameter temperature 0.5) |
| /show <info> | Display model info, parameters, template, system prompt, or licence |
| /load <model> | Switch to a different model without leaving the session |
| /save <model> | Save the current session as a new model under the given name |
| /clear | Clear the current conversation history (resets context) |
| /bye or Ctrl+D | Exit the interactive session |
Modelfile Reference
A Modelfile is a plain-text configuration file (similar in concept to a Dockerfile) that defines a custom model — its base, system prompt, parameters, and template. You create a custom model by pointing ollama create at a Modelfile.
Modelfile instructions
| Instruction | Required | What it does |
|---|---|---|
| FROM | Yes | Base model to build from (e.g. FROM llama3.2 or a path to a GGUF file) |
| PARAMETER <name> <value> | No | Set a model parameter (see Key Parameters Reference below for all valid names) |
| TEMPLATE | No | Override the prompt template used to format messages sent to the model |
| SYSTEM | No | Set a system prompt that is injected at the start of every conversation |
| ADAPTER | No | Path to a LoRA adapter file (.gguf) to apply to the base model |
| LICENSE | No | Specify the licence under which the model is distributed |
| MESSAGE <role> <content> | No | Pre-populate conversation history (role is user, assistant, or system) |
Minimal working Modelfile example
FROM llama3.2
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM "You are a helpful assistant."
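If Modelfiles are generated rather than written by hand, the example above can be produced from structured data. This is a minimal Python sketch under that assumption. `render_modelfile` is a hypothetical helper and covers only the FROM, PARAMETER, and SYSTEM instructions.

```python
from typing import Mapping, Optional, Union

Number = Union[int, float]


def render_modelfile(
    base: str,
    parameters: Optional[Mapping[str, Number]] = None,
    system: Optional[str] = None,
) -> str:
    """Build Modelfile text from a base model, parameters, and a system prompt."""
    lines = [f"FROM {base}"]
    for name, value in (parameters or {}).items():
        lines.append(f"PARAMETER {name} {value}")
    if system is not None:
        # Triple quotes cover multi-line prompts and prompts containing '"'.
        quote = '"""' if ("\n" in system or '"' in system) else '"'
        lines.append(f"SYSTEM {quote}{system}{quote}")
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama3.2",
    {"temperature": 0.7, "num_ctx": 4096},
    "You are a helpful assistant.",
)
print(text, end="")
```

Writing `text` to a file named Modelfile gives the same four lines as the example above, ready for ollama create.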
Building a model from a Modelfile
| Command | What it does |
|---|---|
| ollama create <name> -f Modelfile | Build a custom model named <name> from a file called Modelfile in the current directory |
| ollama create <name> -f ./path/to/Modelfile | Build from a Modelfile at an explicit path |
Key Parameters Reference
| Parameter | Range / Type | Default | What it controls | Typical use |
|---|---|---|---|---|
| temperature | 0.0 – 2.0 | 0.8 | Randomness and creativity of output | Lower (0.1–0.3) for factual/deterministic; higher (0.8–1.2) for creative writing |
| top_p | 0.0 – 1.0 | 0.9 | Nucleus sampling: limits token pool to top cumulative probability | Reduce to 0.5–0.7 to make output more focused |
| top_k | Integer | 40 | Top-k sampling: limits token pool to the top k candidates | Lower values reduce diversity; 0 disables top-k |
| num_ctx | Integer | 2048 | Context window size in tokens (prompt + response) | Increase for long documents; uses more VRAM/RAM |
| num_predict | Integer | 128 | Maximum number of tokens to generate in a response | Set to -1 for unlimited; set low for short answers |
| repeat_penalty | Float | 1.1 | Penalises repeated tokens to reduce looping/repetition | Increase to 1.3–1.5 if the model repeats itself |
| seed | Integer | 0 (random) | Random seed for reproducible output | Set a fixed value for deterministic, repeatable responses |
| stop | String | None | Stop sequence: generation halts when this string is produced | Useful to terminate responses at a known boundary token |
| num_gpu | Integer | Auto | Number of model layers to offload to GPU | Set to 0 to force CPU-only inference |
| num_thread | Integer | Auto | Number of CPU threads used for inference | Set explicitly if Ollama is competing with other processes |
| mirostat | 0, 1, or 2 | 0 | Mirostat sampling algorithm (0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0) | Alternative to top-p/top-k; can produce more consistent perplexity |
| mirostat_eta | Float | 0.1 | Mirostat learning rate | Lower = slower adaptation to target perplexity |
| mirostat_tau | Float | 5.0 | Mirostat target entropy (perplexity) | Higher = more diverse output |
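The same parameter names can be sent per request in the `options` object of an API call. The sketch below shows one way to check values against the ranges listed in the table before sending them. `build_options` and `_RULES` are hypothetical names, only a subset of parameters is covered, and bounds are enforced only where the table states them.

```python
from typing import Any, Dict, Tuple

# (type, minimum, maximum) for a subset of the parameters in the table.
# None means the table gives no bound on that side.
_RULES: Dict[str, Tuple[type, Any, Any]] = {
    "temperature": (float, 0.0, 2.0),
    "top_p": (float, 0.0, 1.0),
    "top_k": (int, None, None),
    "num_ctx": (int, None, None),
    "num_predict": (int, None, None),
    "repeat_penalty": (float, None, None),
    "seed": (int, None, None),
    "mirostat": (int, 0, 2),
}


def build_options(**params: Any) -> Dict[str, Any]:
    """Validate parameters against the table and return an options dict."""
    options: Dict[str, Any] = {}
    for name, value in params.items():
        if name not in _RULES:
            raise ValueError(f"unknown parameter: {name}")
        kind, low, high = _RULES[name]
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{name} must be numeric")
        if kind is int and not isinstance(value, int):
            raise ValueError(f"{name} must be an integer")
        if low is not None and value < low:
            raise ValueError(f"{name} must be >= {low}")
        if high is not None and value > high:
            raise ValueError(f"{name} must be <= {high}")
        options[name] = kind(value)
    return options


payload = {
    "model": "llama3.2",
    "prompt": "What is Linux?",
    "stream": False,
    "options": build_options(temperature=0.2, num_ctx=4096, seed=42),
}
print(payload["options"])
```

Rejecting out-of-range values on the client side is a design choice of this sketch. The server may accept or clamp values differently.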
REST API Quick Reference
Endpoints
| Method | Endpoint | What it does |
|---|---|---|
| POST | /api/generate | Single-turn completion; set "stream": false to get the full response in one JSON object |
| POST | /api/chat | Multi-turn chat using a messages array with role and content fields |
| POST | /api/embed | Generate vector embeddings for a string or array of strings |
| GET | /api/tags | List all locally installed models (equivalent to ollama list) |
| POST | /api/show | Return model info, parameters, and template for a named model |
| POST | /api/pull | Pull a model from the registry via API |
| POST | /api/push | Push a model to the registry via API |
| POST | /api/create | Create a custom model from a Modelfile string passed in the request body |
| DELETE | /api/delete | Delete a model by name |
| GET | /api/ps | List models currently loaded in memory with their VRAM/RAM usage |
| POST | /api/copy | Copy a model to a new name |
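The method and path pairs in the table map directly onto HTTP requests. As a rough illustration, the Python sketch below builds such requests with the standard library without sending them. `build_request` is a hypothetical helper, and the `source`/`destination` field names for /api/copy are taken from the Ollama API documentation as an assumption to verify against your installed version.

```python
import json
import urllib.request
from typing import Any, Mapping, Optional

BASE_URL = "http://localhost:11434"  # default API base URL


def build_request(
    method: str,
    path: str,
    payload: Optional[Mapping[str, Any]] = None,
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build (but do not send) a JSON request for an Ollama API endpoint."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(
        base_url.rstrip("/") + path, data=data, headers=headers, method=method
    )


list_models = build_request("GET", "/api/tags")
copy_model = build_request(
    "POST", "/api/copy", {"source": "llama3.2", "destination": "my-llama"}
)
print(list_models.get_method(), list_models.full_url)
print(copy_model.get_method(), copy_model.full_url)
```

With a server running, `urllib.request.urlopen(list_models)` would send the request, and `json.load` on the returned response object would decode the reply.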
curl examples
Generate (non-streaming):
curl http://localhost:11434/api/generate \
-d '{"model":"llama3.2","prompt":"What is Linux?","stream":false}'
Chat (non-streaming):
curl http://localhost:11434/api/chat \
-d '{
"model": "llama3.2",
"messages": [{"role": "user", "content": "What is Linux?"}],
"stream": false
}'
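Both examples above set "stream": false. Without it, the server is expected to reply with newline-delimited JSON, one object per line, ending with an object whose `done` field is true. The Python sketch below reassembles a streamed /api/chat reply under that assumption. `collect_chat_stream` is a hypothetical helper and the sample lines are hand-written, not captured output.

```python
import json
from typing import Iterable, Union


def collect_chat_stream(lines: Iterable[Union[str, bytes]]) -> str:
    """Join the message fragments from a streamed /api/chat response."""
    pieces = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        chunk = json.loads(raw)
        pieces.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(pieces)


# Hand-written sample in the shape of a streamed reply (illustrative only).
sample = [
    b'{"message": {"role": "assistant", "content": "Linux is "}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": "a kernel."}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}\n',
]
print(collect_chat_stream(sample))
```

The response object returned by `urllib.request.urlopen` iterates line by line, so it could be passed to this function directly in place of `sample`.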
Environment Variables
| Variable | Default | What it controls |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1:11434 | Address and port the API server binds to. Set to 0.0.0.0:11434 to accept connections from other hosts on the network. |
| OLLAMA_ORIGINS | localhost | CORS allowed origins. Set to * to permit all origins (useful for local web UIs). |
| OLLAMA_MODELS | ~/.ollama/models | Directory where model files are stored. Change to point to a larger disk. |
| OLLAMA_KEEP_ALIVE | 5m | How long a loaded model stays in memory after the last request. Accepts Go duration strings: 10m, 1h, 0 (unload immediately), -1 (keep indefinitely). |
| OLLAMA_NUM_PARALLEL | 1 | Maximum number of requests processed simultaneously per model. |
| OLLAMA_MAX_LOADED_MODELS | 1 | Maximum number of models that can be loaded into memory at the same time. |
| OLLAMA_FLASH_ATTENTION | Disabled | Set to 1 to enable Flash Attention, which reduces VRAM usage for long contexts on supported hardware. |
| OLLAMA_GPU_OVERHEAD | 0 | VRAM (in bytes) to reserve for the OS and other applications before Ollama allocates layers. Useful on shared systems. |
| CUDA_VISIBLE_DEVICES | All GPUs | Restrict Ollama to specific NVIDIA GPUs by index (e.g. 0, 1, or 0,1 for multi-GPU). |
| OLLAMA_DEBUG | Disabled | Set to 1 to enable verbose debug logging, useful for diagnosing GPU detection issues. |
| OLLAMA_NOPRUNE | Disabled | Set to 1 to prevent Ollama from pruning unused model blobs at startup. |
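To make the OLLAMA_KEEP_ALIVE values concrete, the following Python sketch converts the forms listed above into seconds. It is a simplified illustration: `keep_alive_seconds` is a hypothetical helper, it handles only whole-number h/m/s components, and it treats a bare number as seconds, which is an assumption based on the Ollama FAQ. Go's duration syntax also allows fractions and smaller units.

```python
import re
from typing import Optional

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_PART = re.compile(r"(\d+)([hms])")


def keep_alive_seconds(value: str) -> Optional[int]:
    """Return the duration in seconds, or None for 'keep indefinitely'."""
    value = value.strip()
    if re.fullmatch(r"-?\d+", value):
        number = int(value)
        return None if number < 0 else number  # bare number: seconds
    negative = value.startswith("-")
    body = value[1:] if negative else value
    total, position = 0, 0
    for match in _PART.finditer(body):
        if match.start() != position:
            raise ValueError(f"cannot parse duration: {value!r}")
        total += int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        position = match.end()
    if position == 0 or position != len(body):
        raise ValueError(f"cannot parse duration: {value!r}")
    return None if negative else total


for example in ("5m", "1h30m", "0", "-1"):
    print(example, "->", keep_alive_seconds(example))
```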
Service Management
Linux (systemd)
| Command | What it does |
|---|---|
| systemctl status ollama | Show current service status, PID, and recent log lines |
| systemctl start ollama | Start the Ollama service |
| systemctl stop ollama | Stop the Ollama service |
| systemctl restart ollama | Restart the Ollama service (required after changing environment variables) |
| systemctl enable ollama | Enable Ollama to start automatically at boot |
| journalctl -u ollama -f | Follow the Ollama service log in real time |
| journalctl -u ollama --since "1 hour ago" | Show Ollama log entries from the past hour |
Setting environment variables on Linux (systemd)
Add Environment lines under the [Service] section of /etc/systemd/system/ollama.service:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=10m"
Then reload the daemon and restart the service:
systemctl daemon-reload && systemctl restart ollama
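When several hosts need the same settings, the [Service] block can be generated rather than typed. The Python sketch below is one way to do that. `render_service_environment` is a hypothetical helper, and it simply refuses values containing double quotes or newlines instead of attempting systemd's escaping rules.

```python
from typing import Mapping


def render_service_environment(variables: Mapping[str, str]) -> str:
    """Return a [Service] block with one Environment= line per variable."""
    lines = ["[Service]"]
    for name, value in variables.items():
        if '"' in value or "\n" in value:
            raise ValueError(f"unsupported character in value for {name}")
        lines.append(f'Environment="{name}={value}"')
    return "\n".join(lines) + "\n"


block = render_service_environment(
    {"OLLAMA_HOST": "0.0.0.0:11434", "OLLAMA_KEEP_ALIVE": "10m"}
)
print(block, end="")
```

The output matches the three-line example above. Reloading the daemon and restarting the service is still required after the file changes.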
macOS
| Action | Notes |
|---|---|
| ollama serve | Start the server manually in a terminal session |
| Menu bar app | When installed via .dmg, Ollama runs as a menu bar application and starts automatically at login |
| launchctl list \| grep ollama | Check whether Ollama is registered as a launch agent |
Model Storage Locations
| Platform | Default path |
|---|---|
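| Linux | ~/.ollama/models/ (or /usr/share/ollama/.ollama/models/ when Ollama runs as the systemd service created by the install script) |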
| macOS | ~/.ollama/models/ |
| Windows | C:\Users\%username%\.ollama\models\ |
Directory structure
| Subdirectory | Contents |
|---|---|
| blobs/ | Actual model weight files, stored as content-addressed blobs. These are the large files that consume disk space. |
| manifests/ | Lightweight JSON metadata files that map model names and tags to their blob hashes. |
Moving models to a different disk: Set OLLAMA_MODELS to the new path, move the models/ directory to that location, then restart the Ollama service.
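Before moving the directory, it can be useful to see how much space each subdirectory takes. The Python sketch below sums file sizes under blobs/ and manifests/. `directory_size` and `models_usage` are hypothetical helpers, and the default path assumes a per-user installation.

```python
import os
from pathlib import Path
from typing import Dict


def directory_size(path: Path) -> int:
    """Sum the sizes in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = Path(root) / name
            if not file_path.is_symlink():
                total += file_path.stat().st_size
    return total


def models_usage(models_dir: Path) -> Dict[str, int]:
    """Report bytes used by the blobs/ and manifests/ subdirectories."""
    return {
        sub: directory_size(models_dir / sub)
        for sub in ("blobs", "manifests")
    }


# Honour OLLAMA_MODELS if set, otherwise fall back to the per-user default.
default_dir = os.path.expanduser("~/.ollama/models")
models_dir = Path(os.environ.get("OLLAMA_MODELS", default_dir))
for sub, size in models_usage(models_dir).items():
    print(f"{sub}: {size / 1e9:.2f} GB")
```

A missing directory is reported as 0 bytes rather than raising an error, so the script also runs on a machine with no models installed.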
Useful One-Liners
Check total disk space used by models:
du -sh ~/.ollama/models/
List all local models with sizes:
ollama list
Unload a specific model from memory immediately:
curl -X POST http://localhost:11434/api/generate \
-d '{"model":"llama3.2","keep_alive":0}'
Test that the API is responding:
curl http://localhost:11434/api/tags
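Send a prompt with an inline system prompt (ollama run has no --system flag, so pass it through the API's system field):
curl http://localhost:11434/api/generate \
-d '{"model":"llama3.2","system":"You are a Linux expert.","prompt":"Explain systemd","stream":false}'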
Generate embeddings for a string:
curl http://localhost:11434/api/embed \
-d '{"model":"nomic-embed-text","input":"Hello world"}'
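Embeddings are usually compared rather than read directly. As a small illustration, the Python sketch below computes cosine similarity between vectors such as those in the `embeddings` array of an /api/embed response. The numbers are made up and much shorter than real embeddings, and `cosine_similarity` is a hypothetical helper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Stand-in for a decoded /api/embed response (made-up numbers).
response = {"embeddings": [[1.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]}
first, second, third = response["embeddings"]
print(round(cosine_similarity(first, second), 6))
print(round(cosine_similarity(first, third), 6))
```

Values near 1 indicate vectors pointing in the same direction, and values near 0 indicate unrelated directions.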
Pull multiple models in parallel (bash):
ollama pull llama3.2 & ollama pull nomic-embed-text & wait
Show which models are currently loaded in VRAM/RAM:
curl http://localhost:11434/api/ps
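The JSON returned by /api/ps can be condensed into one line per model. The Python sketch below assumes each entry carries `name`, `size`, and `size_vram` fields, as described in the Ollama API documentation. Treat those field names as an assumption to check against your version. `summarise_loaded` is a hypothetical helper and the sample data is invented.

```python
from typing import Any, Dict, List


def summarise_loaded(ps_response: Dict[str, Any]) -> List[str]:
    """Return one summary line per loaded model."""
    lines = []
    for model in ps_response.get("models", []):
        size = model.get("size", 0)
        vram = model.get("size_vram", 0)
        gpu_share = 100 * vram / size if size else 0.0
        lines.append(
            f"{model.get('name', '?')}: {size / 1e9:.1f} GB, "
            f"{gpu_share:.0f}% in VRAM"
        )
    return lines


# Invented response used only to exercise the function.
sample = {
    "models": [
        {"name": "llama3.2:latest", "size": 4_000_000_000, "size_vram": 3_000_000_000}
    ]
}
for line in summarise_loaded(sample):
    print(line)
```

A share below 100% suggests that part of the model is running from system RAM rather than VRAM.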
Further guides available on serverman.co.uk cover related topics in depth: securing Ollama for network exposure (the Ollama security guide), integrating Ollama as an AI backend in automation workflows (how to use Ollama with n8n), connecting local models to agentic toolchains (building local AI agents with Ollama and MCP), and using Ollama as a drop-in replacement for the OpenAI API in existing applications (the Ollama OpenAI API compatibility guide).






