Ollama Cloud Models: Remote Inference Explained

Ollama Cloud launched in September 2025 and quietly changed what local AI means. The feature lets you run models like DeepSeek-V3.1 (671B parameters) or Qwen3-Coder (480B) from any machine (laptop, Raspberry Pi, CI pipeline) without a local GPU, without downloading model weights, and using the same Ollama commands you already know. The only change is a cloud suffix on the model name, written :cloud or, for sized tags, -cloud (as in deepseek-v3.1:671b-cloud). This guide explains how the routing works under the hood, which cloud models are available (37 as of April 2026), how GPU-time pricing differs from token-based competitors, and when cloud inference makes sense over local.

What Is Ollama Cloud?

Ollama Cloud is a managed inference service built directly into the Ollama runtime. It extends the local Ollama daemon to route certain model requests to Ollama’s own datacentres across the US, Europe, and Asia-Pacific, rather than executing on local hardware. Cloud models are identified by a :cloud suffix and never need to be downloaded — no ollama pull required before running, no disk space consumed, no VRAM limit to worry about.

From an application’s perspective, nothing changes. The client still sends requests to localhost:11434. The local daemon detects the :cloud suffix, attaches your authentication credentials, and proxies the request to Ollama’s infrastructure. The response streams back in real time. Every tool that works with local Ollama — Open WebUI, Python SDK, JavaScript SDK, any OpenAI-compatible client — works identically with cloud models.
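As a rough illustration of the point above, the sketch below builds the same HTTP request for a local and a cloud model using only the Python standard library. It assumes the daemon's native /api/chat endpoint on localhost:11434; the helper names build_chat_request and send_chat are made up for this example.

```python
import json
import urllib.request

# Default address of the local Ollama daemon, as described in the text.
OLLAMA_LOCAL = "http://localhost:11434"


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the local daemon's /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream
    }
    return urllib.request.Request(
        url=f"{OLLAMA_LOCAL}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["message"]["content"]


# The two requests differ only in the model name inside the JSON body.
local_request = build_chat_request("llama3.2:3b", "Hello")
cloud_request = build_chat_request("gpt-oss:120b-cloud", "Hello")
```

Passing either request to send_chat would perform the call, provided the daemon is running and, for the cloud model, you are signed in.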

How Ollama Cloud Routing Works

Understanding the architecture avoids the most common gotchas. When you make a request with a :cloud model name, the local Ollama daemon’s generate and chat handlers detect a modelSourceCloud flag, normalise the model name for the remote endpoint, attach auth headers from your stored credentials, and forward the request to Ollama’s cloud infrastructure. The response is streamed back through the same local proxy in exactly the same format as a local inference response.

This proxy architecture means:

  • No model download — the model lives on Ollama’s servers, not yours
  • Identical API surface — same endpoints, same request format, same streaming behaviour
  • Auth is transparent — once you run ollama signin, credentials are stored and attached automatically
  • Local models are unaffected — cloud and local models coexist; model names route to the correct backend
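The routing steps described above (detect, normalise, attach auth, forward) can be pictured with a small sketch. This is not Ollama's actual implementation: the suffix list, the RoutedRequest type, and the route function are all assumptions made for illustration, based only on the model names used in this article.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Assumption: cloud models are recognisable by these name endings.
CLOUD_SUFFIXES = ("-cloud", ":cloud")


@dataclass
class RoutedRequest:
    backend: str                      # "local" or "cloud"
    model: str                        # model name sent to that backend
    headers: Dict[str, str] = field(default_factory=dict)


def is_cloud_model(name: str) -> bool:
    """Return True when the name carries a cloud suffix."""
    return name.endswith(CLOUD_SUFFIXES)


def normalise(name: str) -> str:
    """Strip the cloud suffix so the remote endpoint sees the bare name."""
    for suffix in CLOUD_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def route(name: str, token: Optional[str]) -> RoutedRequest:
    """Decide which backend serves the request and which headers it needs."""
    if not is_cloud_model(name):
        return RoutedRequest(backend="local", model=name)
    if token is None:
        raise PermissionError("cloud model requested but no stored credentials")
    return RoutedRequest(
        backend="cloud",
        model=normalise(name),
        headers={"Authorization": f"Bearer {token}"},
    )
```

For example, route("llama3.2:3b", None) stays local, while route("deepseek-v3.1:671b-cloud", token) is forwarded with the bare name and an Authorization header.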

One environment variable controls cloud access entirely: setting OLLAMA_NO_CLOUD=1 disables all cloud routing, useful for air-gapped environments or enforcing fully local operation.

Available Ollama Cloud Models

As of April 2026, there are 37 cloud models available. The full list is at ollama.com/search?c=cloud. Here are the most notable, grouped by use case:

| Model | Parameters | Context | Best for |
|---|---|---|---|
| qwen3-coder:480b-cloud | 480B | 262K | Coding, agents |
| devstral-2:123b-cloud | 123B | 262K | Coding agents (Mistral) |
| deepseek-v3.1:671b-cloud | 671B | 164K | General reasoning |
| deepseek-v3.2:cloud | 671B | 164K | General reasoning (latest) |
| gpt-oss:120b-cloud | 120B | 131K | General purpose |
| gpt-oss:20b-cloud | 20B | 131K | Faster, lighter tasks |
| glm-5:cloud | 744B (40B active) | 203K | MoE efficiency |
| kimi-k2.5:cloud | not listed | 262K in / 262K out | Long-context tasks |
| kimi-k2.6:cloud | not listed | 262K | Multimodal + long context |
| gemma4:31b-cloud | 31B | 262K | Google’s compact model |
| gemini-3-flash-preview:cloud | not listed | 1M | Massive context tasks |
| nemotron-3-nano:30b-cloud | 30B | 1M | NVIDIA’s 1M context model |
| mistral-large-3:675b-cloud | 675B | 262K | Mistral’s flagship |
| qwen3.5:cloud | 0.8B–122B | 262K | Multiple sizes via same name |

Notable: Gemini-3-Flash and Nemotron-3-Nano offer 1 million token context windows, which are impractical to serve on consumer hardware but trivial to use via :cloud.

Getting Started with Ollama Cloud

Three steps. Make sure you are on an Ollama release that supports cloud models first (0.12 or later, the series that shipped around the September 2025 launch); see how to update Ollama if needed.

Step 1: Sign in

ollama signin

This opens a browser tab to ollama.com/connect where you log in with your Ollama account. Credentials are stored locally and attached automatically to all subsequent cloud model requests.

Step 2: Run a cloud model

ollama run deepseek-v3.1:671b-cloud

No pull required. The model runs instantly on Ollama’s infrastructure. You will see streaming output in your terminal exactly as you would with a local model.

Step 3: Use it in your code

import ollama

response = ollama.chat(
    model='deepseek-v3.1:671b-cloud',
    messages=[{'role': 'user', 'content': 'Explain attention mechanisms.'}]
)
print(response.message.content)

The Python SDK, JavaScript SDK, and any OpenAI-compatible client work without modification — just swap the model name.

The API Endpoint Gotcha (Direct Access Without the Local Daemon)

If you want to call Ollama Cloud directly from a server or container without a local Ollama daemon installed, use an API key and call the cloud endpoint directly. This is where most people get tripped up by a URL mismatch.

Get your API key from ollama.com/settings/keys, then:

export OLLAMA_API_KEY=your_key_here

# CORRECT:
curl https://ollama.com/v1/chat/completions \
  -H "Authorization: Bearer $OLLAMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-v3.1:671b", "messages": [{"role": "user", "content": "Hello"}]}'

The common mistake is using /api/v1/ instead of /v1/ — the former returns a 404 with no helpful error message. The correct base URL is https://ollama.com/v1, which is OpenAI-compatible.

Using the Python client directly against the cloud endpoint (no local daemon):

import os
from ollama import Client

client = Client(
    host='https://ollama.com',
    headers={'Authorization': 'Bearer ' + os.environ['OLLAMA_API_KEY']}
)

response = client.chat(
    'deepseek-v3.1:671b',
    messages=[{'role': 'user', 'content': 'Hello'}]
)
print(response.message.content)

Note: when calling directly (no local daemon), omit the :cloud suffix — the cloud endpoint already knows it is serving cloud models. The :cloud suffix is only needed when routing through the local daemon.
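The two pitfalls in this section (the /v1 path and the leftover suffix) can be guarded against in code. The following is a minimal sketch using only the standard library; build_direct_request and send_request are hypothetical helper names, and the suffix check assumes cloud names end in -cloud or :cloud as they do in this article.

```python
import json
import urllib.request

# OpenAI-compatible base URL given in the text.
CLOUD_BASE_URL = "https://ollama.com/v1"


def build_direct_request(model: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for the cloud endpoint."""
    if model.endswith(("-cloud", ":cloud")):
        # The suffix only matters to the local daemon, so reject it here.
        raise ValueError(f"drop the cloud suffix for direct calls: {model!r}")
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{CLOUD_BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_request(request: urllib.request.Request) -> str:
    """Send the request and return the reply text from an OpenAI-style response."""
    with urllib.request.urlopen(request) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

In a real script the key would come from os.environ["OLLAMA_API_KEY"], and send_request(build_direct_request("deepseek-v3.1:671b", "Hello", key)) would perform the call.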

Switching Between Local and Cloud Models

One of Ollama Cloud’s most practical design decisions: switching between local and cloud execution requires zero code changes beyond the model name. The :cloud suffix is the entire switch.

import ollama

messages = [{'role': 'user', 'content': 'Summarise this changelog.'}]

# Local execution: runs on your own hardware
response = ollama.chat('llama3.2:3b', messages=messages)

# Cloud execution: identical call, runs on Ollama's servers
response = ollama.chat('gpt-oss:120b-cloud', messages=messages)

This makes it straightforward to build hybrid workflows: use a small local model for cheap, fast tasks and a large cloud model only for the steps that need it. The Ollama REST API guide covers the full parameter surface that applies to both local and cloud requests.
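One way to picture such a hybrid workflow is a small routing function that picks the model per request. The sketch below is only an illustration: the character threshold and the task labels are hypothetical values chosen for the example, not recommendations from Ollama.

```python
LOCAL_MODEL = "llama3.2:3b"
CLOUD_MODEL = "gpt-oss:120b-cloud"

# Hypothetical cut-off: prompts longer than this go to the cloud model.
MAX_LOCAL_CHARS = 4_000
# Hypothetical task labels that justify the larger model.
HEAVY_TASKS = {"code_review", "planning", "long_summary"}


def pick_model(task: str, prompt: str) -> str:
    """Return the cloud model for heavy or long requests, else the local one."""
    if task in HEAVY_TASKS or len(prompt) > MAX_LOCAL_CHARS:
        return CLOUD_MODEL
    return LOCAL_MODEL


def run(task: str, prompt: str) -> str:
    """Send the prompt to whichever model pick_model selects."""
    import ollama  # imported lazily so pick_model works without the SDK installed

    response = ollama.chat(
        model=pick_model(task, prompt),
        messages=[{"role": "user", "content": prompt}],
    )
    return response.message.content
```

Because only the model name changes, the rest of the call in run is identical for both backends.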

Ollama Cloud Pricing Explained

Ollama Cloud uses GPU-time billing, not per-token pricing. This is a fundamental difference from OpenAI, Anthropic, and Groq: you pay for how long the GPU runs, not how many tokens you generate. Ollama does not publish per-token caps; each plan is described by a relative usage ceiling instead.

| Plan | Price | Concurrent models | Usage ceiling |
|---|---|---|---|
| Free | $0/month | 1 | Light usage |
| Pro | $20/month or $200/year | 3 | ~50× more than Free |
| Max | $100/month | 10 | ~5× more than Pro |

A few things worth knowing about the billing model:

  • Session limits reset every 5 hours; weekly limits reset every 7 days
  • Local inference is always unlimited and free — cloud billing only applies to :cloud model requests
  • Concurrency matters — the Free tier allows only one cloud model at a time, which affects agent workflows that need parallel model calls
  • Performance tiers: shared infrastructure runs at approximately 95 tokens/second; dedicated capacity runs at ~210 tokens/second
  • Ollama has indicated that future plans will allow purchasing additional usage at per-token rates with cache-aware pricing
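To put the throughput and price figures above into concrete terms, here is a small back-of-the-envelope calculation. It uses only the numbers quoted in this section and ignores queueing, time to first token, and network latency, so treat the results as rough estimates.

```python
SHARED_TPS = 95      # approximate tokens/second on shared infrastructure
DEDICATED_TPS = 210  # approximate tokens/second on dedicated capacity


def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to stream a response of the given length at a steady rate."""
    return tokens / tokens_per_second


def annual_saving(monthly_price: float, yearly_price: float) -> float:
    """Difference between paying monthly for a year and paying yearly."""
    return monthly_price * 12 - yearly_price


# A 2,000-token answer on each performance tier, rounded to 0.1 s.
shared_time = round(generation_seconds(2000, SHARED_TPS), 1)
dedicated_time = round(generation_seconds(2000, DEDICATED_TPS), 1)

# Pro plan: $20/month versus $200/year.
pro_saving = annual_saving(20, 200)
```

Under these assumptions a 2,000-token answer takes about 21 seconds on shared infrastructure and under 10 seconds on dedicated capacity, and the yearly Pro plan saves $40 over twelve monthly payments.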

When to Use Cloud vs Local

Ollama Cloud is not a replacement for local inference — it is a complement. Here is a practical framework for deciding which to use:

Use Ollama Cloud when:

  • You need a model that won’t fit locally — 671B models need 400GB+ VRAM; cloud has no such constraint
  • Context window is the bottleneck — 1M token context on Gemini-3-Flash or Nemotron-3-Nano is impossible locally on consumer hardware
  • You are running on a low-power machine — Raspberry Pi, old laptop, or a CI/CD pipeline where a local GPU is not available
  • You want to evaluate a model before committing to hardware — test quality on cloud before buying a GPU to run it locally
  • Team access: Pro and Max plans allow several cloud models to run concurrently (3 and 10 respectively), making shared team access practical
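The memory figure in the first bullet can be sanity-checked with simple arithmetic. The sketch below estimates the space taken by the weights alone; it deliberately ignores the KV cache, activations, and runtime overhead, which is an assumption that makes the result a lower bound.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory for the model weights alone, in decimal gigabytes."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# A 671B-parameter model at common precisions.
at_4_bit = weight_memory_gb(671, 4)
at_8_bit = weight_memory_gb(671, 8)
at_16_bit = weight_memory_gb(671, 16)
```

At 4 bits per weight the weights alone come to roughly 335 GB, so a 400GB+ requirement is plausible once overhead is added; at 16 bits the figure is over 1.3 TB.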

Stick with local inference when:

  • Privacy is non-negotiable — even with Ollama’s no-data-retention policy, local inference keeps data entirely off any external server
  • You have the hardware and the model fits — local inference has zero recurring cost and often lower latency for models up to ~30B on a good GPU
  • High volume, sustained inference — at scale, GPU-time billing adds up; amortised hardware cost becomes cheaper
  • You need Thinking Mode — Ollama’s thinking mode with Qwen3 currently works on local models; cloud availability varies by model

Ollama Cloud vs Groq, OpenAI, and Anthropic

For developers choosing a cloud inference provider, here is how Ollama Cloud compares on the dimensions that matter most:

| | Ollama Cloud | Groq | OpenAI | Anthropic |
|---|---|---|---|---|
| Speed (tok/s) | ~95–210 | 300–1,240 | Variable | Variable |
| Pricing model | GPU-time subscription | Per-token | Per-token | Per-token |
| Custom models | Yes (push any Ollama model) | No | Fine-tuning only | No |
| Local/cloud switching | Single model name change | Different SDK/endpoint | Different SDK/endpoint | Different SDK/endpoint |
| Data retention | None (per Ollama) | Limited | Yes (opt-out) | Limited |
| Free tier | Yes | Yes | No | No |

Groq wins on raw speed. Ollama Cloud wins on custom model support and the seamless local/cloud switching experience. If you are already using Ollama locally, Cloud is the lowest-friction way to add large-model access — no new SDKs, no new endpoints to learn, no parallel toolchain to maintain.

Troubleshooting Ollama Cloud

Cloud model returns 401 Unauthorized

Your session has expired or credentials were not stored correctly. Run ollama signin again and try the request. If you are using a direct API key, verify the key at ollama.com/settings/keys and confirm OLLAMA_API_KEY is exported in your shell session.

Model runs locally instead of on cloud (responds instantly with wrong output)

Check the model name: the cloud suffix is required when you go through the local daemon. ollama run deepseek-v3.1:671b and ollama run deepseek-v3.1:671b-cloud refer to different models, and without the suffix Ollama treats the name as a local model. If you see a fast response that looks like it came from a much smaller model, a local model ran instead.

Streaming stops mid-response

Usually a network timeout between your machine and Ollama’s servers. Retry the request. If it happens consistently, check whether a proxy or firewall is terminating long-lived connections — Ollama Cloud uses the same streaming protocol as local inference and requires persistent connections for long responses.
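A client can detect a truncated stream and retry automatically. The sketch below assumes the newline-delimited JSON format of the daemon's native /api/chat stream, where the final chunk carries "done": true; the names IncompleteStream, collect_stream, and with_retries are hypothetical helpers written for this example.

```python
import json
import time
from typing import Callable, Iterable


class IncompleteStream(Exception):
    """Raised when the stream ends before a chunk with done=true arrives."""


def collect_stream(lines: Iterable[bytes]) -> str:
    """Join streamed message fragments, failing if the stream was cut short."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            return "".join(parts)
    raise IncompleteStream("connection closed before the final chunk")


def with_retries(call: Callable[[], str], attempts: int = 3, base_delay: float = 1.0) -> str:
    """Run call, retrying truncated streams and network errors with backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except (IncompleteStream, OSError):
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            time.sleep(base_delay * 2 ** attempt)  # wait 1x, 2x, 4x the base delay
    raise RuntimeError("unreachable")
```

In practice call would open the HTTP response and pass its lines to collect_stream. Retrying re-sends the whole request, so the usage for the failed attempt may still count against your limits.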

OLLAMA_NO_CLOUD=1 set but cloud models still route

Confirm the variable is exported (not just set) in the shell where Ollama is running: export OLLAMA_NO_CLOUD=1. If Ollama is running as a systemd service, add the variable to the service override file rather than the shell environment — the service does not inherit interactive shell variables.

Direct API calls return 404

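You are using the wrong endpoint path. For OpenAI-compatible clients the correct URL is https://ollama.com/v1/chat/completions, not /api/v1/chat/completions. The /v1 path is the OpenAI-compatible endpoint; the native /api/chat path is a separate API that expects Ollama's own request format (it is what the Python client shown earlier uses when pointed at https://ollama.com). Double-check the URL and retry.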

The Minions Protocol: Local + Cloud Collaboration

Stanford’s Hazy Research lab built a framework called Minions that pairs small local models served through Ollama with a large cloud model, and it shows what hybrid local/cloud inference can look like in practice. Rather than routing every request to the cloud, Minions uses the cloud model as an orchestrator and small local models as workers.

In Minion mode, a cloud model and a single local model work in dialogue: the cloud model breaks down the problem and critiques the local model’s output. Reported results: a 30× cost reduction versus full cloud inference while retaining 87% of full cloud performance.

In MinionS mode, the cloud model distributes subtasks to multiple local model instances running in parallel. Reported results: a 5.7× cost reduction with 97.9% of full cloud performance retained.
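The fan-out pattern described above can be sketched in a few lines. This is a loose illustration of the idea only, not the Minions implementation: worker and aggregate are hypothetical callables standing in for a local model call and a cloud model call, and the chunking by character count is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def split_into_chunks(text: str, chunk_chars: int) -> List[str]:
    """Cut the document into fixed-size pieces for the workers."""
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]


def fan_out(
    document: str,
    question: str,
    worker: Callable[[str, str], object],
    aggregate: Callable[[str, list], object],
    chunk_chars: int = 2000,
    max_workers: int = 4,
):
    """Run worker on every chunk in parallel, then combine the partial results."""
    chunks = split_into_chunks(document, chunk_chars)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the results in the same order as the chunks.
        partials = list(pool.map(lambda chunk: worker(question, chunk), chunks))
    return aggregate(question, partials)
```

In a real setup worker would call a small local model on one chunk and aggregate would ask the cloud model to merge the partial answers, so only the short partial results reach the cloud.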

For most users, Minions is an advanced use case, but it illustrates the direction Ollama Cloud is pointing: not a replacement for local inference, but a top-of-hierarchy coordinator that makes the combination of local hardware and cloud capacity more capable than either alone.