Running LLMs Locally: Ollama, llama.cpp & More
Using large language models on your own PC – from installation to optimal configuration.
01 Local Language Models Overview
Open-weight Large Language Models (LLMs) such as Llama, Mistral, and Qwen can now be run on consumer hardware thanks to quantization and efficient runtime environments. The range extends from small 1B models that run on a laptop to 70B+ models that require a high-end workstation.
02 Popular Open-Source LLMs
The most well-known locally usable language models:
- Llama 3.x (Meta): 8B, 70B, 405B parameters. Excellent all-round models with good multilingual support.
- Mistral / Mixtral (Mistral AI): 7B and MoE variants. Very efficient and fast.
- Qwen 2.5 (Alibaba): 7B, 14B, 32B, 72B parameters. Strong coding and reasoning capabilities.
- Gemma 2 (Google): 2B, 9B, 27B parameters. Compact and efficient models.
- DeepSeek-V3 / R1: Very large MoE models; R1 adds chain-of-thought reasoning, and distilled R1 variants are available for smaller hardware.
- Phi-3/4 (Microsoft): Small but capable models (3.8B to 14B parameters), ideal for weaker hardware.
03 Ollama – The Easiest Way to Start
Ollama is the most user-friendly solution for local LLMs. Installation: Simply download from ollama.com and install. Then in the terminal: 'ollama run llama3.1' – done! Ollama manages models automatically, supports GGUF quantization, and provides an OpenAI-compatible API. Ideal for beginners and as a backend for chat UIs like Open WebUI.
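A minimal terminal session showing this workflow (the model tag is only an example; any model from the Ollama library works, and the OpenAI-compatible API is served on localhost:11434 by default):

```bash
# Download and chat with a model
ollama pull llama3.1
ollama run llama3.1 "Explain quantization in one sentence."

# Show models installed locally
ollama list

# Use the OpenAI-compatible API, e.g. as a backend for other tools
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1", "messages": [{"role": "user", "content": "Hello!"}]}'
```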
04 llama.cpp – Maximum Control
llama.cpp is the low-level solution for maximum performance and control. It allows fine-grained settings for quantization, context length, and GPU offloading. Advantages: mixed CPU/GPU inference, extremely efficient, compatible with all GGUF models. Disadvantages: requires more technical knowledge and command-line usage.
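As a sketch of what that control looks like in practice (binary names and flags match recent llama.cpp builds but can differ between versions; the model path is a placeholder):

```bash
# -c   context length in tokens
# -ngl number of layers offloaded to the GPU (0 = pure CPU inference)
# -t   CPU threads for the layers that remain on the CPU
./llama-cli -m ./models/llama-3.1-8b-instruct-Q4_K_M.gguf \
  -c 8192 -ngl 33 -t 8 \
  -p "Explain the GGUF format in two sentences."

# Same settings exposed as an OpenAI-compatible HTTP server
./llama-server -m ./models/llama-3.1-8b-instruct-Q4_K_M.gguf -c 8192 -ngl 33 --port 8080
```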
05 Understanding Quantization
Quantization reduces the precision of model weights to save memory (a practical example follows this list):
- FP16: Unquantized baseline, full quality, highest memory requirement (7B model ≈ 14 GB)
- Q8: Minimal quality loss, good compromise (7B model ≈ 8 GB)
- Q5_K_M: Good quality with significant memory savings (7B model ≈ 5 GB)
- Q4_K_M: Slight quality loss, greatly reduced memory (7B model ≈ 4 GB)
- Q2_K: Noticeable quality loss, only for experiments (7B model ≈ 3 GB)
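With Ollama, the quantization level is chosen simply by pulling the matching model tag. The tag names below are illustrative assumptions; check the model's page on ollama.com for the tags that actually exist:

```bash
# The default tag is typically a Q4_K_M build
ollama pull llama3.1:8b

# Explicitly pull a higher-quality Q8_0 build (roughly twice the download and memory size)
ollama pull llama3.1:8b-instruct-q8_0
```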
06 Chat Interfaces
For a pleasant chat experience, we recommend Open WebUI – a modern web interface that works with Ollama and provides a ChatGPT-like experience. Alternatives include LM Studio (desktop app with GUI), Jan (open-source desktop app), and text-generation-webui (feature-rich but more complex).
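Open WebUI is usually run as a Docker container next to a local Ollama instance. The command below follows the project's quick-start documentation; flags may change, so check the Open WebUI README for the current recommendation:

```bash
# Start Open WebUI and let it reach the Ollama server running on the host
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main

# The UI is then available at http://localhost:3000
```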
Hardware Recommendations
The best hardware for local AI generation. Our recommendations based on price-performance and compatibility.
Graphics Cards (GPU)
- NVIDIA RTX 3060 12GB (Entry, from ~$300): Best entry-level model for local AI. 12 GB VRAM is sufficient for SDXL and small LLMs.
- NVIDIA RTX 4070 Ti Super 16GB (Recommended, from ~$800): Ideal mid-range GPU. 16 GB VRAM for Flux, SDXL, and medium-sized LLMs.
- NVIDIA RTX 4090 24GB (High-End, from ~$1,800): High-end GPU for demanding models. 24 GB VRAM for Wan 2.2 14B and large LLMs.
- NVIDIA RTX 5090 32GB (Enthusiast, from ~$2,200): Maximum performance and VRAM. 32 GB for all current and future AI models.
Memory (RAM)
* Affiliate links: If you purchase through these links, we receive a small commission at no additional cost to you. This helps us keep ComfyVault free.