If you have searched for “ollama gpu not detected” then you are almost certainly staring at painfully slow inference times and a growing suspicion that your expensive graphics card is doing absolutely nothing. Whether you are running a local AI assistant for your business, testing large language models in-house, or evaluating Ollama before committing to a wider deployment, GPU acceleration is not optional. Without it, even a modest 7B model can take minutes to respond to a simple prompt.
This guide walks through every common cause of Ollama failing to detect a GPU, covering both NVIDIA CUDA errors and AMD ROCm failures, and explains how to verify and fix each one. It is written for IT managers and business owners running Windows, Linux, or WSL2 environments on their own hardware. If you are still deciding whether to self-host at all, our guide on how to run Ollama on a home server covers the full setup from scratch.
Why Ollama Falls Back to CPU
Ollama is designed to detect your GPU automatically at startup. When it does, model inference is offloaded to the GPU and responses are dramatically faster. When detection fails, Ollama silently falls back to CPU mode. There is no loud error message by default, which is exactly why so many users do not realise the problem exists until they benchmark their inference speeds or check the logs.
The root causes split cleanly into two camps: NVIDIA CUDA issues and AMD ROCm issues. Both frameworks act as the bridge between Ollama and your GPU hardware. If the framework is missing, misconfigured, or the wrong version, the bridge does not exist and Ollama falls back to CPU. Understanding which camp your hardware sits in is the first step before running any diagnostics.
There is also a third, less obvious category: environment issues. These include missing environment variables, containerisation problems (such as Docker not being configured to pass through the GPU), and permission errors in Linux that prevent Ollama from accessing GPU devices at all. Each of these is covered in detail below.
Step One: Confirm Whether Ollama Is Using Your GPU
Before fixing anything, confirm the problem. Run a model and immediately check GPU utilisation. On Windows, open Task Manager, click Performance, then select your GPU. If the GPU engine labelled “Compute” stays at zero while Ollama is processing a prompt, the GPU is not being used. On Linux, use the nvidia-smi command for NVIDIA cards or rocm-smi for AMD cards and watch the GPU utilisation column during inference.
You can also check the Ollama logs directly. On Linux, run journalctl -u ollama --no-pager | grep -i gpu. On Windows, Ollama logs are typically found at %LOCALAPPDATA%\Ollama\logs. Look for lines referencing CUDA, ROCm, or “no GPU found”. The logs will usually tell you exactly what Ollama tried to detect and why it gave up.
A quick and useful alternative is to run ollama ps whilst a model is loaded. The output will show which device the model is running on. If it says “CPU” where you expect to see your GPU name, you have confirmed the issue and can move on to the fixes below.
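The checks described above can be run from a Linux terminal in one short session. This sketch assumes ollama, nvidia-smi (or rocm-smi for AMD), and journalctl are on your PATH and that Ollama runs as a systemd service:

```shell
# Load a model in another terminal first (e.g. `ollama run llama3 "hello"`), then:

# 1. Confirm which device the loaded model is on (look for a GPU name vs "cpu")
ollama ps

# 2. Watch live GPU utilisation during inference (NVIDIA; use rocm-smi for AMD)
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1

# 3. Search the service logs for GPU detection lines
journalctl -u ollama --no-pager | grep -iE 'cuda|rocm|gpu'
```

If step 2 shows zero compute utilisation while a prompt is being processed, move on to the fixes below.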
Fixing NVIDIA CUDA Errors in Ollama
The most common Ollama CUDA error scenario is a missing or incompatible CUDA toolkit. Ollama requires CUDA 11.3 or later for NVIDIA GPU support. The toolkit is separate from your display driver. Many users install the display driver and assume CUDA is included, but a full CUDA toolkit installation is required for compute workloads.
Download the CUDA toolkit directly from NVIDIA’s developer site and install the version appropriate for your operating system. After installation, verify it worked by running nvcc --version in a terminal. If the command is not found, your PATH environment variable may need updating to include the CUDA binary directory, typically C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.x\bin on Windows.
Driver version mismatches are the second most common NVIDIA problem. Each CUDA version requires a minimum driver version, and if your driver is too old, CUDA will not function correctly. Run nvidia-smi and check the driver version shown at the top of the output, then cross-reference it against NVIDIA’s CUDA compatibility table. If you need to update, download the latest Game Ready or Studio driver from NVIDIA’s website, or use a tool like GeForce Experience if available. On a business server running a professional card such as an RTX 4000 Ada or an older Quadro, update through NVIDIA’s enterprise driver portal instead.
- Install the CUDA Toolkit from developer.nvidia.com (version 11.3 minimum, 12.x recommended)
- Confirm your NVIDIA driver meets the minimum version for your chosen CUDA release
- Verify with nvcc --version and nvidia-smi before restarting Ollama
- On Linux, ensure the nvidia-container-toolkit is installed if running Ollama in Docker
- Restart the Ollama service after any driver or toolkit change
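On Linux, the verification steps above boil down to a short sequence. The CUDA path below is the conventional install location and may differ on your system:

```shell
# Confirm the CUDA toolkit is on the PATH and report its version
nvcc --version

# Confirm the driver is loaded and note its version in the output banner
nvidia-smi

# If nvcc is not found, the toolkit's bin directory may be missing from PATH
# (adjust the path for your CUDA version and install location)
export PATH=/usr/local/cuda/bin:$PATH

# Restart the Ollama service so it re-runs GPU detection
sudo systemctl restart ollama
```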
Fixing AMD ROCm Not Working With Ollama
AMD ROCm support in Ollama is solid on Linux but limited on Windows, where it remains in its early stages. If you are attempting to use an AMD GPU on Windows and ROCm is not working, the honest answer is that you are fighting a stack that is still maturing. Linux is the recommended platform for AMD GPU inference with Ollama, and Ubuntu 22.04 or distributions based on it offer the best compatibility at the time of writing.
On Linux, begin by installing the ROCm stack from AMD’s official repository. AMD provides a straightforward installation script via amdgpu-install. After installation, verify with rocm-smi and check that your GPU is listed. If it is not, the most common cause is that your GPU is not on AMD’s officially supported list. Ollama can still drive many of these cards through the HSA_OVERRIDE_GFX_VERSION environment variable, which tells ROCm to treat your GPU as a different, supported architecture.
For example, if you have an RX 6600 (gfx1032) which may not be listed as supported, setting HSA_OVERRIDE_GFX_VERSION=10.3.0 before launching Ollama can enable GPU acceleration by mapping it to a supported GFX architecture. This is not guaranteed to work perfectly for every card, but it resolves ROCm not working for a wide range of RDNA 2 and RDNA 3 consumer GPUs that AMD has not formally certified for compute workloads.
- Use Linux (Ubuntu 22.04 strongly recommended) for AMD ROCm with Ollama
- Install ROCm using AMD’s official amdgpu-install script
- Verify GPU detection with rocm-smi after installation
- For unsupported consumer GPUs, set HSA_OVERRIDE_GFX_VERSION to the nearest supported architecture
- Add your user to the render and video groups: sudo usermod -a -G render,video $USER
- Log out and back in after adding groups, then test again
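The steps above can be sketched as a single session on Ubuntu. The amdgpu-install usecase flag follows AMD’s documented usage but may vary by release, and the 10.3.0 override value is the RDNA 2 example from earlier, not a universal setting:

```shell
# Install the ROCm stack via AMD's installer (Ubuntu; check AMD's docs for your release)
sudo amdgpu-install --usecase=rocm

# Allow your user to access the GPU device nodes, then log out and back in
sudo usermod -a -G render,video $USER

# Confirm the GPU is now visible to ROCm
rocm-smi

# For an unsupported RDNA 2 card, persist the override for the Ollama service
# via a systemd drop-in (the editor opens; add the two lines shown in comments):
sudo systemctl edit ollama
#   [Service]
#   Environment="HSA_OVERRIDE_GFX_VERSION=10.3.0"
sudo systemctl restart ollama
```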
WSL2 and Docker GPU Passthrough Issues
A significant number of UK businesses running Ollama on Windows do so inside WSL2 (Windows Subsystem for Linux) or Docker. Both introduce additional GPU passthrough requirements that are easy to miss. In WSL2, GPU support requires Windows 11 or Windows 10 21H2 or later, plus a WSL2-compatible NVIDIA driver installed on the Windows host. The driver must be 470.76 or later for CUDA in WSL2 to function. Crucially, you should not install a separate CUDA toolkit inside WSL2 unless you are sure of the versioning. The CUDA libraries within WSL2 are provided by the Windows host driver, not by a separate Linux installation.
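A quick way to check the WSL2 side is to run the following, partly from an elevated Windows terminal and partly inside the WSL2 distribution. It assumes a recent WSL2-compatible NVIDIA driver is installed on the Windows host:

```shell
# From an elevated Windows terminal: update the WSL2 kernel and GPU integration
wsl --update
wsl --shutdown

# Then inside the WSL2 distribution: the Windows host driver exposes nvidia-smi here.
# If this works but Ollama still sees no GPU, the problem is Ollama-side, not WSL2.
nvidia-smi

# The GPU libraries inside WSL2 are supplied by the Windows host driver here:
ls /usr/lib/wsl/lib
```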
For Docker on Linux, GPU passthrough requires the nvidia-container-toolkit package. Once installed, configure Docker to use the NVIDIA runtime by editing /etc/docker/daemon.json to add the NVIDIA runtime entry, then restart Docker. When launching Ollama in Docker, include the --gpus all flag. Without it, the container has no visibility of the host GPU regardless of what drivers are installed.
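The Linux Docker setup described above can be sketched as follows. The nvidia-ctk helper ships with the container toolkit and writes the daemon.json runtime entry for you; the package name is for Ubuntu/Debian, and the image and volume names follow Ollama’s published Docker instructions:

```shell
# Install the toolkit, then let its helper configure Docker's daemon.json
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Sanity check: the container should see the host GPU
docker run --rm --gpus all ubuntu nvidia-smi

# Run Ollama with GPU access and persistent model storage
docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama
```

Omitting --gpus all is the single most common reason a containerised Ollama runs on CPU.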
On Windows Docker Desktop, GPU support for NVIDIA is available through the WSL2 backend. Ensure GPU resources are enabled in Docker Desktop settings under Resources. AMD GPU passthrough in Docker on Windows is not currently supported in a reliable, production-ready way and should be treated as experimental for business deployments.
VRAM Limitations and Model Layer Offloading
Sometimes Ollama does detect the GPU but only partially uses it. This happens when the model you are loading is larger than your available VRAM: Ollama offloads as many layers as it can to the GPU and runs the remainder on the CPU. The result is that you see some GPU utilisation but performance is still poor. This is not a bug but a deliberate design choice, and it can feel indistinguishable from a GPU detection failure if you are not watching the layer offload count.
Check the Ollama logs during model loading to see how many layers were offloaded. A line such as “offloaded 20/32 layers to GPU” tells you that the model is too large for your VRAM and some layers are running on CPU. The solution is either to use a smaller quantisation variant of the model (such as a Q4_K_M instead of Q8 or full precision), reduce the context length via the OLLAMA_NUM_CTX environment variable, or upgrade your GPU to one with more VRAM.
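To check the offload count on a Linux install, grep the service logs during or after model loading. The model tag below is illustrative only; check the model’s page on ollama.com for the quantisation tags that actually exist:

```shell
# See how many layers were offloaded the last time a model loaded
journalctl -u ollama --no-pager | grep -i 'offloa'

# If you see something like "offloaded 20/33 layers to GPU", pull a smaller
# quantisation variant instead (tag is an example; verify on the model's page)
ollama pull llama2:13b-chat-q4_K_M
```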
For UK businesses buying GPU hardware specifically for local AI inference, NVIDIA RTX 4060 Ti 16GB cards offer good value at typically around GBP 500 to GBP 600, providing enough VRAM to run 13B models fully in GPU memory. For heavier workloads involving 30B or 70B models, data centre grade cards with 24GB or more of VRAM are necessary, though these carry a significantly higher price tag, often starting from around GBP 2,000 for enterprise-grade options.
| GPU | VRAM | Max Model Size (Full GPU) | Approx UK Price |
|---|---|---|---|
| RTX 4060 | 8GB | 7B (Q4) | From around GBP 280 |
| RTX 4060 Ti 16GB | 16GB | 13B (Q4-Q6) | From around GBP 520 |
| RTX 4090 | 24GB | 30B (Q4) | From around GBP 1,800 |
| RX 7900 XTX | 24GB | 30B (Q4) | From around GBP 900 |
| RTX 6000 Ada (Pro) | 48GB | 70B (Q4) | From around GBP 6,500 |
Environment Variables That Control GPU Behaviour in Ollama
Several environment variables directly control how Ollama interacts with your GPU. Knowing these gives you precise control over GPU behaviour beyond what the default configuration provides. The most important is CUDA_VISIBLE_DEVICES for NVIDIA systems. Setting this to 0 tells CUDA to use the first GPU. Setting it to -1 effectively disables GPU use entirely, which is a common accidental misconfiguration that causes Ollama to run on CPU without any obvious error.
On AMD systems, the equivalent is HIP_VISIBLE_DEVICES or ROCR_VISIBLE_DEVICES. Both control which GPU ROCm can see. If these are set incorrectly in a system-level profile, a bash script, or a Docker environment file, Ollama will see no GPU regardless of what hardware is installed. Always audit these variables before spending time on deeper troubleshooting.
- CUDA_VISIBLE_DEVICES=0 enables the first NVIDIA GPU
- CUDA_VISIBLE_DEVICES=-1 disables CUDA GPU access entirely
- HIP_VISIBLE_DEVICES=0 enables the first AMD GPU via ROCm
- HSA_OVERRIDE_GFX_VERSION maps an unsupported AMD GPU to a supported architecture
- OLLAMA_NUM_GPU sets how many GPUs Ollama will use (useful in multi-GPU setups)
- OLLAMA_NUM_CTX reduces the context window to lower VRAM demand
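Auditing these variables is quick. The file paths below are the typical places they end up on a systemd-based Linux distro; adjust for your setup:

```shell
# Search the usual suspects for GPU-related variables (paths are typical examples)
grep -RnsE 'CUDA_VISIBLE_DEVICES|HIP_VISIBLE_DEVICES|ROCR_VISIBLE_DEVICES|HSA_OVERRIDE_GFX_VERSION' \
  ~/.bashrc ~/.profile /etc/environment /etc/systemd/system/ollama.service.d 2>/dev/null

# Also check what the running Ollama service actually has in its environment
sudo systemctl show ollama --property=Environment
```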
If you are deploying Ollama as a service for your business, consider whether local GPU hardware is the right long-term approach versus cloud-hosted inference. Our comparison of Azure vs AWS vs Google Cloud for UK SMEs is worth reading if you are evaluating whether cloud-based AI inference makes more financial sense than maintaining on-premises GPU hardware.
Key Takeaways
- Ollama falls back to CPU silently when GPU detection fails. Always confirm GPU usage via ollama ps, Task Manager, or GPU monitoring tools before assuming everything is working correctly.
- NVIDIA CUDA errors are usually caused by a missing CUDA toolkit, an outdated driver, or a version mismatch. Install the toolkit separately and verify with nvcc --version.
- AMD ROCm not working on Windows is expected at this stage. Use Linux, ideally Ubuntu 22.04, for reliable AMD GPU inference with Ollama.
- For unsupported AMD consumer GPUs, the HSA_OVERRIDE_GFX_VERSION variable can unlock GPU acceleration by mapping to a supported architecture.
- WSL2 GPU support requires a Windows 11 or Windows 10 21H2 host with driver 470.76 or later. Docker requires the NVIDIA container toolkit and the --gpus all flag.
- Partial GPU usage often indicates a VRAM limitation, not a detection failure. Use smaller quantisation variants or reduce context length to fit models fully into GPU memory.
- Environment variables such as CUDA_VISIBLE_DEVICES=-1 can silently disable GPU use. Always audit these in your shell profile, systemd service file, and Docker compose files.
Related Guides
- How to Run Ollama on a Home Server
- Azure vs AWS vs Google Cloud for UK SMEs Comparison
- Best Servers for Small Business in 2025
- AI in Business: Why Now is the Time to Embrace the Future
- How to Configure Your First Server in 2025: Step-by-Step
Frequently Asked Questions
Why does Ollama say it is running on CPU when I have a GPU installed?
Ollama automatically falls back to CPU when it cannot detect a compatible GPU or the required compute framework (CUDA for NVIDIA, ROCm for AMD) is missing or misconfigured. This happens silently with no prominent warning. Check the Ollama logs for GPU-related lines and run ollama ps while a model is loaded to confirm which device it is actually using. In most cases the fix is installing or updating the CUDA toolkit, correcting your driver version, or resolving a misconfigured environment variable such as CUDA_VISIBLE_DEVICES being set to -1.
Does Ollama support AMD GPUs on Windows?
AMD ROCm support on Windows within Ollama is limited and should be considered experimental at this time. AMD ROCm is primarily developed for Linux, and Ollama’s Windows AMD GPU support is still maturing. If you need reliable AMD GPU inference, Linux (particularly Ubuntu 22.04) is the recommended platform. On Windows, many AMD GPU users find that Ollama will run on CPU regardless of the hardware present due to the immaturity of the Windows ROCm stack.
How do I fix Ollama GPU errors in WSL2?
GPU support in WSL2 requires Windows 11 or Windows 10 version 21H2 or later, with an NVIDIA driver version of 470.76 or higher installed on the Windows host. You should not install a standard Linux CUDA toolkit inside WSL2, as the CUDA libraries are provided by the Windows host driver, which WSL2 maps into the Linux environment automatically. If your GPU is still not detected inside WSL2, check that the WSL2 backend is fully updated via wsl --update in a Windows terminal, then restart WSL2 and relaunch Ollama.
My AMD GPU is not on the ROCm supported list. Can I still use it with Ollama?
Yes, in many cases. AMD consumer GPUs that are not on the official ROCm supported list can often be made to work by setting the HSA_OVERRIDE_GFX_VERSION environment variable to the architecture version of a supported GPU that is closest to your own. For example, many RDNA 2 cards work with HSA_OVERRIDE_GFX_VERSION=10.3.0. This is not officially supported by AMD but is widely used in the Ollama community and works reliably for a broad range of unsupported consumer cards. Set this variable before starting the Ollama service.
How much VRAM do I need to run Ollama models fully on the GPU?
The VRAM requirement depends on the model size and quantisation level. A 7B parameter model at Q4 quantisation typically requires around 4 to 5GB of VRAM. A 13B model at Q4 requires around 8 to 9GB, and a 70B model at Q4 requires around 40GB or more. For UK businesses buying dedicated inference hardware, an RTX 4060 Ti 16GB (typically from around GBP 520) handles 13B models fully in GPU memory and is a practical starting point. If you are working with larger models regularly, look at cards with 24GB or more of VRAM.
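The figures above follow a rough rule of thumb you can compute yourself. The coefficients here are assumptions, not exact values: roughly 0.57 bytes per parameter for Q4-quantised weights, plus about 1.5GB for context and runtime overhead:

```shell
# Back-of-envelope VRAM estimate for a Q4-quantised model.
# Both coefficients are rough assumptions, not exact figures.
estimate() {
  awk -v b="$1" 'BEGIN { printf "%.1f GB\n", b * 0.57 + 1.5 }'
}

estimate 7    # prints "5.5 GB"  -> fits an 8GB card
estimate 13   # prints "8.9 GB"  -> fits a 16GB card comfortably
estimate 70   # prints "41.4 GB" -> needs 48GB-class hardware
```

Treat the output as a sizing guide only; real usage varies with context length and quantisation variant.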


