Running LLMs Locally: Ollama, llama.cpp & More

Using large language models on your own PC – from installation to optimal configuration.

11 min read · Updated: February 6, 2026
Tags: LLM, Ollama, llama.cpp, Language Models, Chat

01 Local Language Models Overview

Open large language models (LLMs) such as Llama, Mistral, and Qwen can now be run on consumer hardware thanks to quantization and efficient runtime environments. The range extends from small 1B models that run on a laptop to 70B+ models that require a high-end workstation.

02 Popular Open-Source LLMs

The most well-known locally usable language models:

  • Llama 3.x (Meta): 8B, 70B, 405B parameters. Excellent all-round models with good multilingual support.
  • Mistral / Mixtral (Mistral AI): 7B and MoE variants. Very efficient and fast.
  • Qwen 2.5 (Alibaba): 7B, 14B, 32B, 72B parameters. Strong coding and reasoning capabilities.
  • Gemma 2 (Google): 2B, 9B, 27B parameters. Compact and efficient models.
  • DeepSeek-V3 / R1: Very large MoE models; R1 adds chain-of-thought reasoning, with distilled variants that fit on consumer hardware.
  • Phi-3/4 (Microsoft): Small but powerful models (3.8B–14B parameters), ideal for weaker hardware.

03 Ollama – The Easiest Way to Start

Ollama is the most user-friendly solution for local LLMs. Installation: Simply download from ollama.com and install. Then in the terminal: 'ollama run llama3.1' – done! Ollama manages models automatically, supports GGUF quantization, and provides an OpenAI-compatible API. Ideal for beginners and as a backend for chat UIs like Open WebUI.
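
A typical first session looks roughly like this (the model tag is an example; port 11434 and the /v1 endpoint are Ollama's documented defaults):

  # Download and chat with a model (pulls the weights on first run)
  ollama run llama3.1

  # List the models installed locally
  ollama list

  # Query the OpenAI-compatible API that Ollama serves by default
  curl http://localhost:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "llama3.1",
         "messages": [{"role": "user", "content": "Hello!"}]}'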

04 llama.cpp – Maximum Control

llama.cpp is the low-level solution for maximum performance and control. It allows fine-grained settings for quantization, context length, and GPU offloading. Advantages: mixed CPU/GPU inference via layer offloading, extremely efficient, compatible with all GGUF models. Disadvantages: requires more technical knowledge and command-line usage.
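
As a sketch, a typical invocation looks like this (the GGUF file name is a placeholder; context size, GPU layers, and thread count have to be tuned to your hardware):

  # Run a prompt; -c sets the context length, -ngl the number of layers
  # offloaded to the GPU, -t the CPU threads for the remaining layers
  ./llama-cli -m models/llama-3.1-8b-instruct-Q4_K_M.gguf \
      -p "Why is the sky blue?" -c 4096 -ngl 32 -t 8

  # Or expose an OpenAI-compatible server, similar to Ollama
  ./llama-server -m models/llama-3.1-8b-instruct-Q4_K_M.gguf -c 4096 -ngl 32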

05 Understanding Quantization

Quantization reduces the precision of model weights to save memory:

  • FP16: Unquantized baseline, full quality (7B model ≈ 14 GB)
  • Q8: Minimal quality loss, good compromise (7B model ≈ 8 GB)
  • Q5_K_M: Good quality with significant memory savings (7B model ≈ 5 GB)
  • Q4_K_M: Slight quality loss, greatly reduced memory (7B model ≈ 4 GB)
  • Q2_K: Noticeable quality loss, only for experiments (7B model ≈ 3 GB)
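
These sizes follow a simple rule of thumb: weight memory ≈ parameters × bits per weight ÷ 8. A minimal sketch, with approximate bits-per-weight values for each format (the KV cache adds to this at runtime):

  # Rough weight-memory estimate: params * bits_per_weight / 8
  awk 'BEGIN {
    params = 7e9
    split("FP16 Q8 Q5_K_M Q4_K_M Q2_K", name)
    split("16 8.5 5.7 4.8 3.4", bpw)
    for (i = 1; i <= 5; i++)
      printf "%-7s ≈ %4.1f GB\n", name[i], params * bpw[i] / 8 / 1e9
  }'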

06 Chat Interfaces

Tip

For a pleasant chat experience, we recommend Open WebUI – a modern web interface that works with Ollama and provides a ChatGPT-like experience. Alternatives include LM Studio (desktop app with GUI), Jan (open-source desktop app), and text-generation-webui (feature-rich but more complex).
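
Open WebUI is usually started via Docker; the sketch below follows the project's documented quickstart (the port mapping and volume name are its usual defaults) and expects an Ollama instance running on the host:

  # Start Open WebUI; the host mapping lets it reach Ollama on the host
  docker run -d -p 3000:8080 \
      --add-host=host.docker.internal:host-gateway \
      -v open-webui:/app/backend/data \
      --name open-webui ghcr.io/open-webui/open-webui:main

  # Then open http://localhost:3000 in the browser and pick a model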
