

llama.cpp
#2 v Lokální LLM runtimellama-cpp · od 10. März 2023 (Erstveröffentlichung durch Georgi Gerganov) · 28× · naposledy 30. 6. 2026
64
Momentum
llama.cpp is an open-source C/C++ library for local and cloud inference of large language models, created by Georgi Gerganov. It runs without external dependencies on a wide range of CPUs and GPUs and uses its own GGUF file format for quantized models. The project provides CLI tools and a server with an OpenAI-compatible API, forming the technical backbone of many popular local LLM applications such as Ollama and LM Studio. It is released under the MIT license and is free to download.
Vývoj momenta
04.04.03.07.
Vlastnosti
| Deployment (Self-Hosted/Cloud) | Self-hosted (local, server, Docker) as well as cloud deployment possible, e.g. via Hugging Face Inference Endpoints |
| Throughput/Latency | Highly hardware-dependent; example: RTX 3060 12GB approx. 42 tok/s (8B, Q4), M1 MacBook approx. 30-50 tok/s (7B quantized) |
| License | MIT License |
| Platform | Windows, Linux, macOS; runs on CPU, Apple Silicon (Metal), NVIDIA (CUDA), AMD (HIP), Intel/SYCL, Vulkan, RISC-V, among others |
| Price | Free, open source (no license fees) |
| Protocol Compatibility | OpenAI-compatible API endpoints (e.g. v1/chat/completions), grammar-based JSON output |
| Release Date | March 10, 2023 (initial release by Georgi Gerganov) |
| Supported Models/Providers | Llama, Mistral, Gemma, DeepSeek, gpt-oss, Phi, Qwen, and many more in GGUF format |