Ollama CLI Cheat Sheet: Every Command You Need

Ollama is an open-source tool that lets you download, run, and manage large language models locally on Linux, macOS, and Windows. This cheat sheet covers the CLI commands, Modelfile instructions, REST API endpoints, and environment variables you are most likely to need. It is structured as a dense reference to bookmark and return to, not to read top to bottom.

Installation Quick Reference

Install

Platform	Method	Command / Action
Linux	One-liner script	`curl -fsSL https://ollama.com/install.sh | sh`
macOS	Homebrew	`brew install ollama`
macOS	Manual download	Download `.dmg` from `ollama.com` and drag to Applications
Windows	Installer	Download `OllamaSetup.exe` from `ollama.com` and run it

Verify and start

Command	What it does
`ollama --version`	Print installed Ollama version
`ollama serve`	Start the Ollama API server manually (not needed if running as a systemd service or macOS menu bar app)

Default API port: 11434
API base URL: http://localhost:11434

Model Management

Command	What it does	Example
`ollama pull <model>`	Download the latest tag of a model	`ollama pull llama3.2`
`ollama pull <model>:<tag>`	Download a specific version or quantisation	`ollama pull llama3.1:8b-instruct-q4_K_M`
`ollama list`	List all locally installed models with size and modified date	`ollama list`
`ollama ls`	Alias for `ollama list`	`ollama ls`
`ollama show <model>`	Display model details: parameters, template, and system prompt	`ollama show llama3.2`
`ollama show <model> --modelfile`	Print the full Modelfile for a model	`ollama show llama3.2 --modelfile`
`ollama rm <model>`	Delete a model and free disk space	`ollama rm llama3.2:3b`
`ollama cp <source> <dest>`	Copy a model under a new name (useful before customising)	`ollama cp llama3.2 my-llama`
`ollama push <model>`	Push a model to the Ollama registry (requires a free account at ollama.com)	`ollama push myuser/my-llama`

Understanding model tag conventions

Many model tags follow the pattern <family>:<size>-<type>-<quantisation>. For example, llama3.1:8b-instruct-q4_K_M breaks down as:

Part	Meaning	Notes
`8b`	8 billion parameters	Larger = more capable but slower and more RAM
`instruct`	Fine-tuned for instruction following / chat	Base models (often tagged `text`) are for continuation tasks only
`q4_K_M`	4-bit quantisation, K-quant medium	Fewer bits = smaller file, slight quality reduction. `q4_K_M` is a popular balance point. `fp16` = unquantised 16-bit weights, largest file.

If you omit the tag (e.g. ollama pull llama3.2), Ollama pulls the `latest` tag, which the publisher normally points at a recommended default variant of that model.
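
The naming pattern above can be taken apart with plain shell parameter expansion. This is a minimal sketch: `split_model_tag` is a hypothetical helper name, not part of Ollama, and it assumes a tag follows the three-part pattern exactly. Tags that do not (such as `latest` or a bare size) fall through to the second branch.

```shell
# Split a model reference of the form <family>:<size>-<type>-<quantisation>.
# Hypothetical helper for illustration; Ollama itself does not ship it.
split_model_tag() {
  ref=$1
  family=${ref%%:*}            # everything before the first colon
  case $ref in
    *:*) tag=${ref#*:} ;;      # everything after the first colon
    *)   tag=latest ;;         # no tag given, so the implicit tag applies
  esac
  case $tag in
    *-*-*)
      size=${tag%%-*}          # first hyphen-separated field
      rest=${tag#*-}
      kind=${rest%%-*}         # second field
      quant=${rest#*-}         # remainder, which may itself contain underscores
      printf 'family=%s size=%s type=%s quant=%s\n' "$family" "$size" "$kind" "$quant"
      ;;
    *)
      printf 'family=%s tag=%s\n' "$family" "$tag"
      ;;
  esac
}
```

Calling `split_model_tag llama3.1:8b-instruct-q4_K_M` prints `family=llama3.1 size=8b type=instruct quant=q4_K_M`, while `split_model_tag llama3.2` prints `family=llama3.2 tag=latest`.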

Running Models

ollama run flags

Command	What it does	Example
`ollama run <model>`	Start an interactive chat session in the terminal	`ollama run llama3.2`
`ollama run <model> "prompt"`	Send a single prompt and exit (non-interactive)	`ollama run llama3.2 "Summarise Linux in one paragraph"`
`ollama run <model> --verbose`	Show token generation speed and stats after the response	`ollama run llama3.2 --verbose "Hello"`
`ollama run <model> --nowordwrap`	Disable automatic word wrapping in the terminal	`ollama run llama3.2 --nowordwrap`
`ollama run <model> --format json`	Force the model to output valid JSON	`ollama run llama3.2 --format json "List 3 distros"`
`ollama run <model> --keepalive 0`	Unload the model from memory immediately after the response	`ollama run llama3.2 --keepalive 0 "Hello"`
`ollama run <model> --keepalive -1`	Keep the model loaded in memory indefinitely	`ollama run llama3.2 --keepalive -1`
`echo "prompt" | ollama run <model>`	Pipe a prompt from stdin	`echo "What is Nginx?" | ollama run llama3.2`
`ollama run <model> < file.txt`	Redirect a file as the prompt	`ollama run llama3.2 < prompt.txt`
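
The stdin redirection shown in the table lends itself to batch work. The sketch below is one possible way to run several prompt files through a model and keep each answer. The function name `run_prompts` and the `.answer.txt` naming scheme are assumptions made for this example.

```shell
# Run each prompt file through a model and save the answer next to it.
# Usage: run_prompts <model> <prompt-file>...
# Hypothetical helper; the output naming scheme is an assumption.
run_prompts() {
  model=$1
  shift
  for prompt_file in "$@"; do
    out_file="${prompt_file%.txt}.answer.txt"
    # Redirecting stdin makes the run non-interactive, as in the table above.
    ollama run "$model" < "$prompt_file" > "$out_file" || return 1
  done
}
```

For example, `run_prompts llama3.2 notes.txt todo.txt` would write `notes.answer.txt` and `todo.answer.txt`. The loop stops at the first failed run so a missing model does not produce a series of empty files.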

Interactive session slash commands

These commands are available once you are inside an ollama run session.

Command	What it does
`/?` or `/help`	Show all available slash commands
`/set <option> <value>`	Change a runtime parameter (e.g. `/set parameter temperature 0.5`)
`/show <info>`	Display model info, parameters, template, system prompt, or licence
`/load <model>`	Switch to a different model without leaving the session
`/save <model>`	Save the current session as a new model under the given name; reopen it later with `/load <model>`
`/clear`	Clear the current conversation history (resets context)
`/bye` or `Ctrl+D`	Exit the interactive session

Modelfile Reference

A Modelfile is a plain-text configuration file (similar in concept to a Dockerfile) that defines a custom model — its base, system prompt, parameters, and template. You create a custom model by pointing ollama create at a Modelfile.

Modelfile instructions

Instruction	Required	What it does
`FROM`	Yes	Base model to build from (e.g. `FROM llama3.2` or a path to a GGUF file)
`PARAMETER <name> <value>`	No	Set a model parameter (see Key Parameters Reference below for all valid names)
`TEMPLATE`	No	Override the prompt template used to format messages sent to the model
`SYSTEM`	No	Set a system prompt that is injected at the start of every conversation
`ADAPTER`	No	Path to a LoRA adapter file (`.gguf`) to apply to the base model
`LICENSE`	No	Specify the licence under which the model is distributed
`MESSAGE <role> <content>`	No	Pre-populate conversation history (role is `user`, `assistant`, or `system`)

Minimal working Modelfile example

FROM llama3.2
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM "You are a helpful assistant."

Building a model from a Modelfile

Command	What it does
`ollama create <name> -f Modelfile`	Build a custom model named `<name>` from a file called `Modelfile` in the current directory
`ollama create <name> -f ./path/to/Modelfile`	Build from a Modelfile at an explicit path
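
Putting the Modelfile and the create command together, a build step might look like the sketch below. `build_custom_model` is a hypothetical wrapper, and the Modelfile contents simply repeat the minimal example from this guide with the base model passed in as an argument.

```shell
# Write a Modelfile into a scratch directory, build it, and print the result.
# Usage: build_custom_model <new-name> <base-model>
# Hypothetical wrapper for illustration.
build_custom_model() {
  name=$1
  base=$2
  workdir=$(mktemp -d) || return 1
  cat > "$workdir/Modelfile" <<EOF
FROM $base
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM "You are a helpful assistant."
EOF
  ollama create "$name" -f "$workdir/Modelfile" || return 1
  # Print the stored Modelfile to confirm the instructions were applied.
  ollama show "$name" --modelfile
}
```

A call such as `build_custom_model my-llama llama3.2` leaves the base model untouched and registers `my-llama` alongside it, ready for `ollama run my-llama`.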

Key Parameters Reference

Parameter	Range / Type	Default	What it controls	Typical use
`temperature`	0.0 – 2.0	0.8	Randomness and creativity of output	Lower (0.1–0.3) for factual/deterministic; higher (0.8–1.2) for creative writing
`top_p`	0.0 – 1.0	0.9	Nucleus sampling — limits token pool to top cumulative probability	Reduce to 0.5–0.7 to make output more focused
`top_k`	Integer	40	Top-k sampling — limits token pool to the top k candidates	Lower values reduce diversity; 0 disables top-k
`num_ctx`	Integer	2048 (4096 in newer releases)	Context window size in tokens (prompt + response)	Increase for long documents; uses more VRAM/RAM
`num_predict`	Integer	-1 (no limit; 128 in older releases)	Maximum number of tokens to generate in a response	Set low for short answers; `-1` means unlimited
`repeat_penalty`	Float	1.1	Penalises repeated tokens to reduce looping/repetition	Increase to 1.3–1.5 if the model repeats itself
`seed`	Integer	0 (random)	Random seed for reproducible output	Set a fixed value for deterministic, repeatable responses
`stop`	String	—	Stop sequence — generation halts when this string is produced	Useful to terminate responses at a known boundary token
`num_gpu`	Integer	Auto	Number of model layers to offload to GPU	Set to `0` to force CPU-only inference
`num_thread`	Integer	Auto	Number of CPU threads used for inference	Set explicitly if Ollama is competing with other processes
`mirostat`	0, 1, or 2	0	Mirostat sampling algorithm (0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)	Alternative to top-p/top-k; can produce more consistent perplexity
`mirostat_eta`	Float	0.1	Mirostat learning rate	Lower = slower adaptation to target perplexity
`mirostat_tau`	Float	5.0	Mirostat target entropy (perplexity)	Higher = more diverse output
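
These parameters are not limited to Modelfiles: `/api/generate` and `/api/chat` accept the same names in an `options` object, which overrides them for a single request. The sketch below builds such a request body. The function names and the specific values are illustrative assumptions, and the payload builder does no JSON escaping.

```shell
# Build a /api/generate body that overrides parameters for one request.
# Values are illustrative, not recommendations.
generate_payload() {
  model=$1
  prompt=$2
  # NOTE: no JSON escaping is done; keep quotes and backslashes out of the prompt.
  printf '{"model":"%s","prompt":"%s","stream":false,"options":{"temperature":0.2,"num_ctx":4096,"seed":42}}\n' \
    "$model" "$prompt"
}

# Send the body to a local server on the default port (reads the body from stdin).
generate_with_options() {
  generate_payload "$1" "$2" | curl -s http://localhost:11434/api/generate -d @-
}
```

Keeping the payload builder separate from the network call makes it easy to inspect the JSON first, for example with `generate_payload llama3.2 "What is Linux?"`.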

REST API Quick Reference

Endpoints

Method	Endpoint	What it does
`POST`	`/api/generate`	Single-turn completion; set `"stream": false` to get the full response in one JSON object
`POST`	`/api/chat`	Multi-turn chat using a `messages` array with `role` and `content` fields
`POST`	`/api/embed`	Generate vector embeddings for a string or array of strings
`GET`	`/api/tags`	List all locally installed models (equivalent to `ollama list`)
`POST`	`/api/show`	Return model info, parameters, and template for a named model
`POST`	`/api/pull`	Pull a model from the registry via API
`POST`	`/api/push`	Push a model to the registry via API
`POST`	`/api/create`	Create a custom model. Older releases accept a Modelfile string in the request body; newer releases take structured fields such as `from` and `system` instead
`DELETE`	`/api/delete`	Delete a model by name
`GET`	`/api/ps`	List models currently loaded in memory with their VRAM/RAM usage
`POST`	`/api/copy`	Copy a model to a new name

curl examples

Generate (non-streaming):

curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.2","prompt":"What is Linux?","stream":false}'

Chat (non-streaming):

curl http://localhost:11434/api/chat \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "What is Linux?"}],
    "stream": false
  }'

Environment Variables

Variable	Default	What it controls
`OLLAMA_HOST`	`127.0.0.1:11434`	Address and port the API server binds to. Set to `0.0.0.0:11434` to accept connections from other hosts on the network.
`OLLAMA_ORIGINS`	`localhost`	CORS allowed origins. Set to `*` to permit all origins (useful for local web UIs).
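`OLLAMA_MODELS`	`~/.ollama/models`	Directory where model files are stored. Change to point to a larger disk. The Linux install script runs the service as the `ollama` user, whose store is `/usr/share/ollama/.ollama/models`.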
`OLLAMA_KEEP_ALIVE`	`5m`	How long a loaded model stays in memory after the last request. Accepts Go duration strings: `10m`, `1h`, `0` (unload immediately), `-1` (keep indefinitely).
`OLLAMA_NUM_PARALLEL`	`1`	Maximum number of requests processed simultaneously per model.
`OLLAMA_MAX_LOADED_MODELS`	3 × number of GPUs (3 for CPU inference); `1` in older releases	Maximum number of models that can be loaded into memory at the same time, provided they fit.
`OLLAMA_FLASH_ATTENTION`	Disabled	Set to `1` to enable Flash Attention, which reduces VRAM usage for long contexts on supported hardware.
`OLLAMA_GPU_OVERHEAD`	`0`	VRAM (in bytes) to reserve for the OS and other applications before Ollama allocates layers. Useful on shared systems.
`CUDA_VISIBLE_DEVICES`	All GPUs	Restrict Ollama to specific NVIDIA GPUs by index (e.g. `0`, `1`, or `0,1` for multi-GPU).
`OLLAMA_DEBUG`	Disabled	Set to `1` to enable verbose debug logging — useful for diagnosing GPU detection issues.
`OLLAMA_NOPRUNE`	Disabled	Set to `1` to prevent Ollama from pruning unused model blobs at startup.
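
For quick experiments, these variables can be set for a single command without touching any service configuration. The sketch below assumes a second server on a port of your choosing; the function names are hypothetical. The `ollama` CLI client also reads `OLLAMA_HOST`, so the same variable points client commands at that server.

```shell
# Start a foreground server on a non-default port with a longer keep-alive.
# The assignments apply to this one ollama invocation only.
serve_on_port() {
  port=$1
  OLLAMA_HOST="127.0.0.1:$port" OLLAMA_KEEP_ALIVE=10m ollama serve
}

# Point the CLI client at that server instead of the default port.
list_on_port() {
  port=$1
  OLLAMA_HOST="127.0.0.1:$port" ollama list
}
```

Run `serve_on_port 11435` in one terminal and `list_on_port 11435` in another to confirm that the client is talking to the second instance.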

Service Management

Linux (systemd)

Command	What it does
`systemctl status ollama`	Show current service status, PID, and recent log lines
`systemctl start ollama`	Start the Ollama service
`systemctl stop ollama`	Stop the Ollama service
`systemctl restart ollama`	Restart the Ollama service (required after changing environment variables)
`systemctl enable ollama`	Enable Ollama to start automatically at boot
`journalctl -u ollama -f`	Follow the Ollama service log in real time
`journalctl -u ollama --since "1 hour ago"`	Show Ollama log entries from the past hour

Setting environment variables on Linux (systemd)

Add Environment lines under the [Service] section of /etc/systemd/system/ollama.service:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=10m"

Then reload the daemon and restart the service:

sudo systemctl daemon-reload && sudo systemctl restart ollama
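
An alternative to editing the unit file directly is a systemd drop-in, which is kept in a separate directory and so is likely to survive a reinstall that rewrites `ollama.service`. The sketch below writes such a file. `write_ollama_override` is a hypothetical helper, and the two variables are the same examples used above.

```shell
# Write a systemd drop-in that sets Ollama environment variables.
# Usage: write_ollama_override [drop-in-directory]
# Hypothetical helper; needs root when targeting the default directory.
write_ollama_override() {
  dir=${1:-/etc/systemd/system/ollama.service.d}
  mkdir -p "$dir" || return 1
  cat > "$dir/override.conf" <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=10m"
EOF
}
```

After writing the file, reload the daemon and restart the service as before. Running `systemctl edit ollama` creates the same kind of drop-in interactively.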

macOS

Action	Notes
`ollama serve`	Start the server manually in a terminal session
Menu bar app	When installed via `.dmg`, Ollama runs as a menu bar application and starts automatically at login
`launchctl list | grep ollama`	Check whether Ollama is registered as a launch agent

Model Storage Locations

Platform	Default path
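Linux	`~/.ollama/models/` when you run `ollama serve` yourself; `/usr/share/ollama/.ollama/models/` for the systemd service created by the install script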
macOS	`~/.ollama/models/`
Windows	`C:\Users\%username%\.ollama\models\`

Directory structure

Subdirectory	Contents
`blobs/`	Actual model weight files, stored as content-addressed blobs. These are the large files that consume disk space.
`manifests/`	Lightweight JSON metadata files that map model names and tags to their blob hashes.

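Moving models to a different disk: Stop Ollama, move the models/ directory to the new location, set OLLAMA_MODELS to that path, then start Ollama again. On a Linux service install, the `ollama` user needs read and write access to the new directory.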
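
The file-moving part of that procedure could be scripted roughly as follows. This is a sketch under assumptions: `move_model_store` is a hypothetical helper, it refuses to overwrite an existing destination, and it leaves stopping and starting the service to you.

```shell
# Relocate the model store and print the setting Ollama needs afterwards.
# Usage: move_model_store <current-dir> <new-dir>
# Hypothetical helper; stop Ollama before running it.
move_model_store() {
  src=$1
  dest=$2
  if [ ! -d "$src" ]; then
    echo "no such directory: $src" >&2
    return 1
  fi
  if [ -e "$dest" ]; then
    echo "refusing to overwrite: $dest" >&2
    return 1
  fi
  mkdir -p "$(dirname "$dest")" || return 1
  mv "$src" "$dest" || return 1
  # This is the value to put in the service environment.
  echo "OLLAMA_MODELS=$dest"
}
```

The printed line can be pasted into the service environment. On a Linux service install, follow up with something like `chown -R ollama:ollama <new-dir>` so the service user can read and write the store.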

Useful One-Liners

Check total disk space used by models:

du -sh ~/.ollama/models/

List all local models with sizes:

ollama list

Unload a specific model from memory immediately:

curl -X POST http://localhost:11434/api/generate \
  -d '{"model":"llama3.2","keep_alive":0}'

Test that the API is responding:

curl http://localhost:11434/api/tags
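
In scripts that start Ollama and then use it immediately, the same request can serve as a readiness check. The sketch below polls until the API answers. `wait_for_ollama` is a hypothetical helper, and the defaults of 30 attempts one second apart are arbitrary choices.

```shell
# Poll the API until it answers or the attempts run out.
# Usage: wait_for_ollama [base-url] [attempts]
# Hypothetical helper; defaults are arbitrary.
wait_for_ollama() {
  base_url=${1:-http://localhost:11434}
  attempts=${2:-30}
  while [ "$attempts" -gt 0 ]; do
    # -f makes curl fail on HTTP errors; -s silences progress output.
    if curl -fs "$base_url/api/tags" > /dev/null; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 1
  done
  return 1
}
```

A typical use is `wait_for_ollama && ollama run llama3.2 "Hello"`, which only sends the prompt once the server is reachable.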

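Send a prompt with an inline system prompt (`ollama run` has no system-prompt flag; inside an interactive session, use `/set system` instead):

curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.2","system":"You are a Linux expert.","prompt":"Explain systemd","stream":false}'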

Generate embeddings for a string:

curl http://localhost:11434/api/embed \
  -d '{"model":"nomic-embed-text","input":"Hello world"}'

Pull multiple models in parallel (bash):

ollama pull llama3.2 & ollama pull nomic-embed-text & wait

Show which models are currently loaded in VRAM/RAM:

curl http://localhost:11434/api/ps
