Running LLMs Locally: Ollama, llama.cpp & More
Using large language models on your own PC – from installation to optimal configuration.
01 Local Language Models Overview
Open-weight Large Language Models (LLMs) such as Llama, Mistral, and Qwen can now be run on consumer hardware thanks to quantization and efficient runtime environments. The range extends from small 1B models that run on a laptop to 70B+ models that require a high-end workstation.
02 Popular Open-Source LLMs
The most well-known locally usable language models:
- Llama 3.x (Meta): 8B, 70B, 405B parameters. Excellent all-round models with good multilingual support.
- Mistral / Mixtral (Mistral AI): 7B and MoE variants. Very efficient and fast.
- Qwen 2.5 (Alibaba): 7B, 14B, 32B, 72B parameters. Strong coding and reasoning capabilities.
- Gemma 2 (Google): 2B, 9B, 27B parameters. Compact and efficient models.
- DeepSeek-V3 / R1: Very large MoE models; R1 adds chain-of-thought reasoning, and distilled R1 variants are available for smaller hardware.
- Phi-3/4 (Microsoft): Small but capable models (3.8B to 14B parameters), ideal for weaker hardware.
03 Ollama – The Easiest Way to Start
Ollama is the most user-friendly solution for local LLMs. Installation: Simply download from ollama.com and install. Then in the terminal: 'ollama run llama3.1' – done! Ollama manages models automatically, supports GGUF quantization, and provides an OpenAI-compatible API. Ideal for beginners and as a backend for chat UIs like Open WebUI.
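A minimal terminal session showing this workflow (the model tag is only an example; any model from the Ollama library works, and the OpenAI-compatible API is served on localhost:11434 by default):

```bash
# Download and chat with a model
ollama pull llama3.1
ollama run llama3.1 "Explain quantization in one sentence."

# Show models installed locally
ollama list

# Use the OpenAI-compatible API, e.g. as a backend for other tools
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1", "messages": [{"role": "user", "content": "Hello!"}]}'
```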
04 llama.cpp – Maximum Control
llama.cpp is the low-level solution for maximum performance and control. It allows fine-grained settings for quantization, context length, and GPU offloading. Advantages: mixed CPU/GPU inference, extremely efficient, compatible with all GGUF models. Disadvantages: requires more technical knowledge and command-line usage.
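As a sketch of what that control looks like in practice (binary names and flags match recent llama.cpp builds but can differ between versions; the model path is a placeholder):

```bash
# -c   context length in tokens
# -ngl number of layers offloaded to the GPU (0 = pure CPU inference)
# -t   CPU threads for the layers that remain on the CPU
./llama-cli -m ./models/llama-3.1-8b-instruct-Q4_K_M.gguf \
  -c 8192 -ngl 33 -t 8 \
  -p "Explain the GGUF format in two sentences."

# Same settings exposed as an OpenAI-compatible HTTP server
./llama-server -m ./models/llama-3.1-8b-instruct-Q4_K_M.gguf -c 8192 -ngl 33 --port 8080
```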
05 Understanding Quantization
Quantization reduces the precision of model weights to save memory (a practical example follows this list):
- FP16: Unquantized baseline, full quality, highest memory requirement (7B model ≈ 14 GB)
- Q8: Minimal quality loss, good compromise (7B model ≈ 8 GB)
- Q5_K_M: Good quality with significant memory savings (7B model ≈ 5 GB)
- Q4_K_M: Slight quality loss, greatly reduced memory (7B model ≈ 4 GB)
- Q2_K: Noticeable quality loss, only for experiments (7B model ≈ 3 GB)
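With Ollama, the quantization level is chosen simply by pulling the matching model tag. The tag names below are illustrative assumptions; check the model's page on ollama.com for the tags that actually exist:

```bash
# The default tag is typically a Q4_K_M build
ollama pull llama3.1:8b

# Explicitly pull a higher-quality Q8_0 build (roughly twice the download and memory size)
ollama pull llama3.1:8b-instruct-q8_0
```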
06 Chat Interfaces
For a pleasant chat experience, we recommend Open WebUI – a modern web interface that works with Ollama and provides a ChatGPT-like experience. Alternatives include LM Studio (desktop app with GUI), Jan (open-source desktop app), and text-generation-webui (feature-rich but more complex).
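Open WebUI is usually run as a Docker container next to a local Ollama instance. The command below follows the project's quick-start documentation; flags may change, so check the Open WebUI README for the current recommendation:

```bash
# Start Open WebUI and let it reach the Ollama server running on the host
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main

# The UI is then available at http://localhost:3000
```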
Hardware Recommendations
The best hardware for local AI generation. Our recommendations based on price-performance and compatibility.
Graphics Cards (GPU)
- NVIDIA RTX 3060 12GB (Entry, from ~$300): Best entry-level model for local AI. 12 GB VRAM is sufficient for SDXL and small LLMs.
- NVIDIA RTX 4070 Ti Super 16GB (Recommended, from ~$800): Ideal mid-range GPU. 16 GB VRAM for Flux, SDXL, and medium-sized LLMs.
- NVIDIA RTX 4090 24GB (High-End, from ~$1,800): High-end GPU for demanding models. 24 GB VRAM for Wan 2.2 14B and large LLMs.
- NVIDIA RTX 5090 32GB (Enthusiast, from ~$2,200): Maximum performance and VRAM. 32 GB for all current and future AI models.
Memory (RAM)
* Affiliate links: If you purchase through these links, we receive a small commission at no additional cost to you. This helps us keep ComfyVault free.