Ollama CLI Cheat Sheet: Every Command You Need

Ollama is an open-source tool that lets you download, run, and manage large language models locally on Linux, macOS, and Windows. This cheat sheet covers every CLI command, Modelfile instruction, REST API endpoint, and environment variable you are likely to need — structured as a dense reference you can bookmark and return to rather than read top to bottom.

Installation Quick Reference

Install

| Platform | Method | Command / Action |
| --- | --- | --- |
| Linux | One-liner script | `curl -fsSL https://ollama.com/install.sh \| sh` |
| macOS | Homebrew | `brew install ollama` |
| macOS | Manual download | Download the .dmg from ollama.com and drag it to Applications |
| Windows | Installer | Download OllamaSetup.exe from ollama.com and run it |

Verify and start

| Command | What it does |
| --- | --- |
| `ollama --version` | Print the installed Ollama version |
| `ollama serve` | Start the Ollama API server manually (not needed when Ollama already runs as a systemd service or as the macOS menu bar app) |

Default API port: 11434
API base URL: http://localhost:11434

Model Management

| Command | What it does | Example |
| --- | --- | --- |
| `ollama pull <model>` | Download the model's default tag (`latest`) | `ollama pull llama3.2` |
| `ollama pull <model>:<tag>` | Download a specific size or quantisation | `ollama pull llama3.1:8b-instruct-q4_K_M` |
| `ollama list` | List all locally installed models with size and modified date | `ollama list` |
| `ollama ls` | Alias for `ollama list` | `ollama ls` |
| `ollama show <model>` | Display model details: parameters, template, and system prompt | `ollama show llama3.2` |
| `ollama show <model> --modelfile` | Print the full Modelfile for a model | `ollama show llama3.2 --modelfile` |
| `ollama rm <model>` | Delete a model and free disk space | `ollama rm llama3.2:3b` |
| `ollama cp <source> <dest>` | Copy a model under a new name (useful before customising) | `ollama cp llama3.2 my-llama` |
| `ollama push <model>` | Push a model to the Ollama registry (requires a free account at ollama.com; the model name must be prefixed with your username) | `ollama push myuser/my-llama` |

Understanding model tag conventions

Model tags generally follow the pattern <family>:<size>-<type>-<quantisation>, although not every model publishes every combination. For example, llama3.1:8b-instruct-q4_K_M breaks down as:

| Part | Meaning | Notes |
| --- | --- | --- |
| `8b` | 8 billion parameters | Larger models are more capable but slower and need more RAM |
| `instruct` | Fine-tuned for instruction following and chat | Base (non-instruct) variants are intended for plain text continuation |
| `q4_K_M` | 4-bit quantisation, K-quant, medium variant | Fewer bits mean a smaller file and a slight quality reduction. `q4_K_M` is a popular balance point; `fp16` is the unquantised 16-bit weights and the largest file. |

If you omit the tag (e.g. ollama pull llama3.2), Ollama pulls the tag named latest, which is the default variant published for that model.
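To make the naming pattern concrete, here is a minimal sketch that splits a model reference into its parts using plain POSIX shell. The function name parse_model_ref is hypothetical and not part of Ollama, and tags that do not follow the three-part pattern (such as latest) simply land in the first field.

```shell
# parse_model_ref REF: split "family:size-type-quant" into labelled fields.
# Hypothetical helper for illustration; Ollama itself does not ship it.
parse_model_ref() {
  ref="$1"
  family="${ref%%:*}"
  # No colon means no explicit tag, which Ollama resolves to "latest".
  case "$ref" in
    *:*) tag="${ref#*:}" ;;
    *)   tag="latest" ;;
  esac
  # Split the tag on hyphens; the last variable keeps any remainder.
  IFS='-' read -r size type quant <<EOF
$tag
EOF
  printf 'family=%s tag=%s size=%s type=%s quant=%s\n' \
    "$family" "$tag" "$size" "$type" "$quant"
}

parse_model_ref "llama3.1:8b-instruct-q4_K_M"
# prints: family=llama3.1 tag=8b-instruct-q4_K_M size=8b type=instruct quant=q4_K_M
```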

Running Models

ollama run flags

| Command | What it does | Example |
| --- | --- | --- |
| `ollama run <model>` | Start an interactive chat session in the terminal | `ollama run llama3.2` |
| `ollama run <model> "prompt"` | Send a single prompt and exit (non-interactive) | `ollama run llama3.2 "Summarise Linux in one paragraph"` |
| `ollama run <model> --verbose` | Show token generation speed and stats after the response | `ollama run llama3.2 --verbose "Hello"` |
| `ollama run <model> --nowordwrap` | Disable automatic word wrapping in the terminal | `ollama run llama3.2 --nowordwrap` |
| `ollama run <model> --format json` | Force the model to output valid JSON | `ollama run llama3.2 --format json "List 3 distros"` |
| `ollama run <model> --keepalive 0` | Unload the model from memory immediately after the response | `ollama run llama3.2 --keepalive 0 "Hello"` |
| `ollama run <model> --keepalive -1` | Keep the model loaded in memory indefinitely | `ollama run llama3.2 --keepalive -1` |
| `echo "prompt" \| ollama run <model>` | Pipe a prompt from stdin | `echo "What is Nginx?" \| ollama run llama3.2` |
| `ollama run <model> < file.txt` | Redirect a file as the prompt | `ollama run llama3.2 < prompt.txt` |
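The stdin forms combine naturally into a batch job. As a rough sketch, the hypothetical helper below (run_prompt_files is an illustrative name, not an Ollama command) feeds each prompt file to a model in turn and labels the output.

```shell
# run_prompt_files MODEL FILE...: send each file to the model as a one-shot prompt.
run_prompt_files() {
  rpf_model="$1"
  shift
  for rpf_file in "$@"; do
    if [ ! -r "$rpf_file" ]; then
      printf 'skipping unreadable file: %s\n' "$rpf_file" >&2
      continue
    fi
    printf '== %s ==\n' "$rpf_file"
    # Redirected stdin makes ollama run non-interactive: it answers and exits.
    ollama run "$rpf_model" < "$rpf_file"
  done
}
```

Usage might look like run_prompt_files llama3.2 prompts/*.txt > answers.txt, where the prompts/ directory is an assumption for the example.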

Interactive session slash commands

These commands are available once you are inside an ollama run session.

| Command | What it does |
| --- | --- |
| `/?` or `/help` | Show all available slash commands |
| `/set <option> <value>` | Change a runtime setting (e.g. `/set parameter temperature 0.5`) |
| `/show <info>` | Display model info, parameters, template, system prompt, or licence |
| `/load <model>` | Load a different model, or a previously saved session, without leaving the prompt |
| `/save <model>` | Save the current session, including its conversation history, as a new model with that name |
| `/clear` | Clear the current conversation history (resets context) |
| `/bye` or Ctrl+D | Exit the interactive session |

Modelfile Reference

A Modelfile is a plain-text configuration file (similar in concept to a Dockerfile) that defines a custom model — its base, system prompt, parameters, and template. You create a custom model by pointing ollama create at a Modelfile.

Modelfile instructions

| Instruction | Required | What it does |
| --- | --- | --- |
| `FROM` | Yes | Base model to build from (e.g. `FROM llama3.2`, or a path to a GGUF file) |
| `PARAMETER <name> <value>` | No | Set a model parameter (see Key Parameters Reference below for valid names) |
| `TEMPLATE` | No | Override the prompt template used to format messages sent to the model |
| `SYSTEM` | No | Set a system prompt that is injected at the start of every conversation |
| `ADAPTER` | No | Path to a LoRA adapter file (.gguf) to apply to the base model |
| `LICENSE` | No | Specify the licence under which the model is distributed |
| `MESSAGE <role> <content>` | No | Pre-populate conversation history (role is user, assistant, or system) |

Minimal working Modelfile example

FROM llama3.2
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM "You are a helpful assistant."

Building a model from a Modelfile

| Command | What it does |
| --- | --- |
| `ollama create <name> -f Modelfile` | Build a custom model named `<name>` from a file called Modelfile in the current directory |
| `ollama create <name> -f ./path/to/Modelfile` | Build from a Modelfile at an explicit path |
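Putting the two steps together, the sketch below writes a Modelfile from a here-document and then builds and inspects the result. The helper names, the model name linux-helper, and the system prompt text are all illustrative assumptions; only the Modelfile instructions and the ollama create and ollama show commands come from the tables above.

```shell
# write_modelfile PATH: write a small Modelfile using the instructions listed above.
write_modelfile() {
  cat > "$1" <<'EOF'
FROM llama3.2
PARAMETER temperature 0.2
PARAMETER num_ctx 4096
SYSTEM """
You are a concise Linux support assistant.
Answer in at most three sentences.
"""
EOF
}

# build_and_check NAME PATH: create the model, then print the Modelfile Ollama stored.
build_and_check() {
  ollama create "$1" -f "$2" && ollama show "$1" --modelfile
}

# Example (needs a running Ollama install):
#   write_modelfile ./Modelfile && build_and_check linux-helper ./Modelfile
```

Triple quotes let the SYSTEM value span several lines, which tends to be easier to maintain than one long quoted string.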

Key Parameters Reference

| Parameter | Range / Type | Default | What it controls | Typical use |
| --- | --- | --- | --- | --- |
| `temperature` | 0.0 – 2.0 | 0.8 | Randomness and creativity of output | Lower (0.1–0.3) for factual, near-deterministic answers; higher (0.8–1.2) for creative writing |
| `top_p` | 0.0 – 1.0 | 0.9 | Nucleus sampling: limits the token pool to the smallest set whose cumulative probability reaches this value | Reduce to 0.5–0.7 to make output more focused |
| `top_k` | Integer | 40 | Top-k sampling: limits the token pool to the k most likely candidates | Lower values reduce diversity; 0 disables top-k |
| `num_ctx` | Integer | 2048 (4096 in recent releases) | Context window size in tokens (prompt + response) | Increase for long documents; uses more VRAM/RAM |
| `num_predict` | Integer | -1 (unlimited) in current documentation; older documentation listed 128 | Maximum number of tokens to generate in a response | Set low for short answers; -1 removes the limit |
| `repeat_penalty` | Float | 1.1 | Penalises repeated tokens to reduce looping and repetition | Increase to 1.3–1.5 if the model repeats itself |
| `seed` | Integer | 0 (random) | Random seed for reproducible output | Set a fixed value for repeatable responses |
| `stop` | String | (none) | Stop sequence: generation halts when this string is produced | Useful to terminate responses at a known boundary token |
| `num_gpu` | Integer | Auto | Number of model layers to offload to GPU | Set to 0 to force CPU-only inference |
| `num_thread` | Integer | Auto | Number of CPU threads used for inference | Set explicitly if Ollama is competing with other processes |
| `mirostat` | 0, 1, or 2 | 0 | Mirostat sampling algorithm (0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0) | Alternative to top-p/top-k; can produce more consistent perplexity |
| `mirostat_eta` | Float | 0.1 | Mirostat learning rate | Lower = slower adaptation to target perplexity |
| `mirostat_tau` | Float | 5.0 | Mirostat target entropy (perplexity) | Higher = more diverse output |
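These parameters do not have to be baked into a Modelfile: the REST API accepts them per request inside an options object. The sketch below builds such a request body. The helper names are hypothetical, and the payload builder assumes the model name and prompt contain no double quotes or backslashes, because it does no JSON escaping.

```shell
# generate_payload MODEL PROMPT TEMPERATURE NUM_CTX: print a /api/generate request body.
# Assumption: MODEL and PROMPT contain no double quotes or backslashes.
generate_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false,"options":{"temperature":%s,"num_ctx":%s}}\n' \
    "$1" "$2" "$3" "$4"
}

# generate_with_options MODEL PROMPT TEMPERATURE NUM_CTX: send the request.
# Needs a running Ollama server on the default port; curl reads the body from stdin.
generate_with_options() {
  generate_payload "$@" | curl -s http://localhost:11434/api/generate -d @-
}

# Example: generate_with_options llama3.2 "What is Linux?" 0.2 4096
```

Per-request options are handy for experimenting with values before committing them to a Modelfile.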

REST API Quick Reference

Endpoints

| Method | Endpoint | What it does |
| --- | --- | --- |
| POST | `/api/generate` | Single-turn completion; set `"stream": false` to get the full response in one JSON object |
| POST | `/api/chat` | Multi-turn chat using a messages array with role and content fields |
| POST | `/api/embed` | Generate vector embeddings for a string or array of strings |
| GET | `/api/tags` | List all locally installed models (equivalent to `ollama list`) |
| POST | `/api/show` | Return model info, parameters, and template for a named model |
| POST | `/api/pull` | Pull a model from the registry via API |
| POST | `/api/push` | Push a model to the registry via API |
| POST | `/api/create` | Create a custom model. Older releases take a Modelfile string in the request body; newer releases take structured fields such as `from` and `system` instead |
| DELETE | `/api/delete` | Delete a model by name |
| GET | `/api/ps` | List models currently loaded in memory with their VRAM/RAM usage |
| POST | `/api/copy` | Copy a model to a new name |
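Since every endpoint shares the same base URL and takes JSON, a thin curl wrapper can save typing. This is only a sketch: ollama_api and the OLLAMA_URL variable are hypothetical names, and the request field names in the usage comments follow the current API documentation as far as I know (older releases used name where newer ones use model).

```shell
# Base URL of the server; OLLAMA_URL is a made-up variable name for this sketch.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# ollama_api METHOD PATH [JSON_BODY]: call an endpoint from the table above.
ollama_api() {
  if [ -n "${3:-}" ]; then
    curl -sS -X "$1" "$OLLAMA_URL$2" -H 'Content-Type: application/json' -d "$3"
  else
    curl -sS -X "$1" "$OLLAMA_URL$2"
  fi
}

# Examples (need a running server):
#   ollama_api GET /api/tags
#   ollama_api POST /api/show '{"model":"llama3.2"}'
#   ollama_api DELETE /api/delete '{"model":"my-llama"}'
```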

curl examples

Generate (non-streaming):

curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.2","prompt":"What is Linux?","stream":false}'

Chat (non-streaming):

curl http://localhost:11434/api/chat \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "What is Linux?"}],
    "stream": false
  }'
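Both examples above set "stream": false. When that field is left out, the server streams newline-delimited JSON objects, each carrying a fragment of the answer. A minimal sketch for printing the text as it arrives follows; it assumes jq is installed, the function names are hypothetical, and the prompt must not contain double quotes or backslashes because no JSON escaping is done.

```shell
# stream_text: read newline-delimited JSON chunks on stdin and print only the text.
# /api/generate chunks carry text in .response, /api/chat chunks in .message.content.
stream_text() {
  jq -j --unbuffered '.response // .message.content // empty'
  printf '\n'
}

# ask MODEL PROMPT: stream a completion to the terminal.
# Assumption: PROMPT contains no double quotes or backslashes.
ask() {
  curl -sN http://localhost:11434/api/generate \
    -d "{\"model\":\"$1\",\"prompt\":\"$2\"}" | stream_text
}

# Example: ask llama3.2 "What is Linux?"
```

The -N flag stops curl from buffering, and --unbuffered does the same for jq, so tokens appear as the model produces them.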

Environment Variables

| Variable | Default | What it controls |
| --- | --- | --- |
| `OLLAMA_HOST` | 127.0.0.1:11434 | Address and port the API server binds to. Set to 0.0.0.0:11434 to accept connections from other hosts on the network; the API has no built-in authentication, so restrict access with a firewall or reverse proxy. |
| `OLLAMA_ORIGINS` | Local origins (localhost, 127.0.0.1, 0.0.0.0) | CORS allowed origins. Set to * to permit all origins (useful for local web UIs). |
| `OLLAMA_MODELS` | ~/.ollama/models (/usr/share/ollama/.ollama/models for the Linux systemd service) | Directory where model files are stored. Change to point to a larger disk. |
| `OLLAMA_KEEP_ALIVE` | 5m | How long a loaded model stays in memory after the last request. Accepts Go duration strings such as 10m or 1h, plus 0 (unload immediately) and -1 (keep indefinitely). |
| `OLLAMA_NUM_PARALLEL` | 1 | Maximum number of requests processed simultaneously per model. |
| `OLLAMA_MAX_LOADED_MODELS` | Auto (3 × number of GPUs, or 3 for CPU inference) | Maximum number of models that can be loaded into memory at the same time, provided they fit. |
| `OLLAMA_FLASH_ATTENTION` | Disabled | Set to 1 to enable Flash Attention, which reduces VRAM usage for long contexts on supported hardware. |
| `OLLAMA_GPU_OVERHEAD` | 0 | VRAM (in bytes) to reserve per GPU for the OS and other applications before Ollama allocates layers. Useful on shared systems. |
| `CUDA_VISIBLE_DEVICES` | All GPUs | Restrict Ollama to specific NVIDIA GPUs by index (e.g. 0 or 1 for a single GPU, 0,1 for both). |
| `OLLAMA_DEBUG` | Disabled | Set to 1 to enable verbose debug logging, which helps when diagnosing GPU detection issues. |
| `OLLAMA_NOPRUNE` | Disabled | Set to 1 to prevent Ollama from pruning unused model blobs at startup. |

Service Management

Linux (systemd)

| Command | What it does |
| --- | --- |
| `systemctl status ollama` | Show current service status, PID, and recent log lines |
| `sudo systemctl start ollama` | Start the Ollama service |
| `sudo systemctl stop ollama` | Stop the Ollama service |
| `sudo systemctl restart ollama` | Restart the Ollama service (required after changing environment variables) |
| `sudo systemctl enable ollama` | Enable Ollama to start automatically at boot |
| `journalctl -u ollama -f` | Follow the Ollama service log in real time |
| `journalctl -u ollama --since "1 hour ago"` | Show Ollama log entries from the past hour |

Setting environment variables on Linux (systemd)

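Run sudo systemctl edit ollama.service to open a drop-in override file, then add Environment lines under a [Service] section. Editing /etc/systemd/system/ollama.service directly also works, but the install script may overwrite that file on upgrade: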

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=10m"

Then reload the daemon and restart the service:

sudo systemctl daemon-reload && sudo systemctl restart ollama
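For scripted setups, the same Environment lines can be written as a systemd drop-in file without opening an editor. The sketch below uses the standard drop-in location /etc/systemd/system/ollama.service.d/override.conf; the function name is hypothetical, and the two values are just the examples from above.

```shell
# write_ollama_override DIR: write override.conf with Environment lines into DIR.
write_ollama_override() {
  mkdir -p "$1" &&
  cat > "$1/override.conf" <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=10m"
EOF
}

# On a real host, run as root:
#   write_ollama_override /etc/systemd/system/ollama.service.d
#   systemctl daemon-reload && systemctl restart ollama
#   systemctl show ollama --property=Environment   # confirm the values were picked up
```

Taking the directory as an argument makes it possible to try the function against a scratch directory before touching /etc.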

macOS

| Action | Notes |
| --- | --- |
| `ollama serve` | Start the server manually in a terminal session |
| Menu bar app | When installed via .dmg, Ollama runs as a menu bar application and starts automatically at login |
| `launchctl list \| grep ollama` | Check whether Ollama is registered as a launch agent |

Model Storage Locations

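| Platform | Default path |
| --- | --- |
| Linux | /usr/share/ollama/.ollama/models/ when installed as a systemd service; ~/.ollama/models/ when you run ollama serve as your own user |
| macOS | ~/.ollama/models/ |
| Windows | C:\Users\%username%\.ollama\models\ |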

Directory structure

| Subdirectory | Contents |
| --- | --- |
| blobs/ | Actual model weight files, stored as content-addressed blobs. These are the large files that consume disk space. |
| manifests/ | Lightweight JSON metadata files that map model names and tags to their blob hashes. |

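Moving models to a different disk: Stop the Ollama service, move the models/ directory to the new location, set OLLAMA_MODELS to that path, then start the service again. On Linux, the ollama service user needs read and write access to the new directory.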
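As a cautious sketch of that move on a Linux systemd install, the hypothetical helper below only prints the commands so they can be reviewed before anything is run. The rsync and chown steps, the ollama:ollama owner, and both paths in the example are assumptions about a standard install-script setup.

```shell
# plan_model_move OLD_DIR NEW_DIR: print (not run) the commands for relocating models.
plan_model_move() {
  cat <<EOF
sudo systemctl stop ollama
sudo mkdir -p "$2"
sudo rsync -a "$1/" "$2/"
sudo chown -R ollama:ollama "$2"
# add Environment="OLLAMA_MODELS=$2" to the service override, then:
sudo systemctl daemon-reload
sudo systemctl start ollama
ollama list
EOF
}

# Example:
#   plan_model_move /usr/share/ollama/.ollama/models /mnt/big/ollama-models
```

Copying first and deleting the old directory only after ollama list shows the models again leaves a fallback if the new path is misconfigured.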

Useful One-Liners

Check total disk space used by models (on a Linux systemd install, substitute /usr/share/ollama/.ollama/models/):

du -sh ~/.ollama/models/

List all local models with sizes:

ollama list

Unload a specific model from memory immediately:

curl -X POST http://localhost:11434/api/generate \
  -d '{"model":"llama3.2","keep_alive":0}'

Test that the API is responding:

curl http://localhost:11434/api/tags
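In scripts that start Ollama and then use it straight away, the same check can be wrapped in a retry loop. This is a minimal sketch; wait_for_ollama is a hypothetical name and the default of 30 tries one second apart is an arbitrary choice.

```shell
# wait_for_ollama [TRIES] [DELAY_SECONDS]: poll /api/tags until the server answers.
# Returns 0 once the API responds, 1 if it never does.
wait_for_ollama() {
  wfo_tries="${1:-30}"
  wfo_delay="${2:-1}"
  while [ "$wfo_tries" -gt 0 ]; do
    # -f makes curl exit non-zero on HTTP errors; the response body is discarded.
    if curl -fsS -o /dev/null http://localhost:11434/api/tags 2>/dev/null; then
      return 0
    fi
    wfo_tries=$((wfo_tries - 1))
    sleep "$wfo_delay"
  done
  return 1
}

# Example: wait_for_ollama 60 2 && ollama run llama3.2 "Hello"
```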

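Run a prompt with an inline system prompt (ollama run has no --system flag, so pass the system field through the API, or use /set system inside an interactive session):

curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.2","system":"You are a Linux expert.","prompt":"Explain systemd","stream":false}'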

Generate embeddings for a string:

curl http://localhost:11434/api/embed \
  -d '{"model":"nomic-embed-text","input":"Hello world"}'

Pull multiple models in parallel (bash):

ollama pull llama3.2 & ollama pull nomic-embed-text & wait

Show which models are currently loaded in VRAM/RAM:

curl http://localhost:11434/api/ps

Further guides available on serverman.co.uk cover related topics in depth: securing Ollama for network exposure (the Ollama security guide), integrating Ollama as an AI backend in automation workflows (how to use Ollama with n8n), connecting local models to agentic toolchains (building local AI agents with Ollama and MCP), and using Ollama as a drop-in replacement for the OpenAI API in existing applications (the Ollama OpenAI API compatibility guide).