LLM Quantization Explained — How 4-Bit Models Run on Consumer GPUs

Modern large language models store their knowledge as billions of numbers called weights. At full 16-bit precision, a 70 billion parameter model takes 140 gigabytes of memory, roughly six times the 24 GB of VRAM on an RTX 4090. But thanks to a technique called quantization, you can run the same model in about 40 gigabytes, which fits across a pair of 24 GB cards or on a MacBook Pro with enough unified memory. The quality loss is surprisingly small, often within 2 percent of the original.

This is the technology that made local AI possible.

Why Quantization?

A model's weights determine how it processes input and generates output. At 16-bit floating point precision each weight takes 2 bytes, so a 70 billion parameter model needs 140 gigabytes just to load.

An RTX 4090 has 24 GB. Even a datacenter A100 tops out at 80 GB. The full model does not fit on a single GPU of either kind, let alone a consumer one.

The obvious solution is to make the numbers smaller. Instead of 16 bits per weight, use 8. Or 4. Or even 2. This is quantization — reducing the precision of the weights to compress the model into available memory. And it works much better than you’d expect.
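The arithmetic behind that idea is simple enough to sketch. The function name below is illustrative, and the figures cover weight storage only, not activations or the KV cache.

```python
def weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

PARAMS_70B = 70e9  # the 70 billion parameter model used as the running example

for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: {weight_size_gb(PARAMS_70B, bits):6.1f} GB")
```

Real quantized files come out a little larger than the 4-bit line suggests, because each block of weights also stores a scale factor.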

How Numbers Work in Neural Networks

The standard precision for AI models is FP16 — 16-bit floating point, which gives about 65,000 distinct values per weight. That’s more than enough precision for the model to distinguish between similar weights and make accurate predictions.

FP32, the original full precision, gives 4 billion values — overkill for inference. INT8 gives 256 values — still enough to capture the important differences between weights. INT4 gives 16 values. And INT2 gives just 4.

Each time you halve the bit width, you halve the memory footprint, but the number of possible values drops to its square root: 65,536 becomes 256, and 256 becomes 16. The art of quantization is finding the sweet spot where you've compressed the model as much as possible without losing meaningful capability.
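A quick way to see the trade-off is to count the bit patterns at each width. This is a minimal sketch with an illustrative helper name. For floating point formats the count is an upper bound, since some patterns are reserved for infinities and NaN.

```python
def num_levels(bits: int) -> int:
    """Number of distinct bit patterns available at a given width."""
    return 2 ** bits

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4), ("INT2", 2)]:
    print(f"{name}: {bits:>2} bits -> {num_levels(bits):,} patterns")

# Halving the width takes the square root of the number of levels.
assert num_levels(8) == num_levels(4) ** 2
```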

The Quantization Process

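Here's how it works. Take a group of FP16 weights that range, say, from -2.5 to +1.8. You want to represent these as INT8 values, which run from -128 to +127.

You compute a scale factor: the total range (4.3) divided by the number of integer steps (255). In this case, about 0.017. Then you divide each weight by the scale factor and round to the nearest integer, so the original weight 0.73 becomes 43. Because the range is not centered on zero, you also add a fixed integer offset, the zero point (here 20), so that -2.5 lands on -128 and +1.8 on +127; the value actually stored is 63. To dequantize, you subtract the offset and multiply back: 43 times 0.017 is about 0.73.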

The error is the rounding loss, which is at most half a step. With 255 steps that comes to about 0.2 percent of the range, so it's usually tiny.
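The worked example can be reproduced in a few lines. This is a sketch of plain affine (scale plus zero point) quantization, not the code of any particular library. The function names are made up for illustration, and the zero-point offset is included because an asymmetric range like -2.5 to +1.8 needs one to fit the INT8 window.

```python
def quant_params(w_min, w_max, q_min=-128, q_max=127):
    """Scale and integer zero point mapping [w_min, w_max] onto [q_min, q_max]."""
    scale = (w_max - w_min) / (q_max - q_min)
    zero_point = round(q_min - w_min / scale)
    return scale, zero_point

def quantize(w, scale, zero_point, q_min=-128, q_max=127):
    """Divide by the scale, round, shift by the zero point, clamp to range."""
    q = round(w / scale) + zero_point
    return max(q_min, min(q_max, q))

def dequantize(q, scale, zero_point):
    """Undo the shift and multiply back by the scale."""
    return (q - zero_point) * scale

scale, zp = quant_params(-2.5, 1.8)
q = quantize(0.73, scale, zp)
w_hat = dequantize(q, scale, zp)
print(f"scale={scale:.4f} zero_point={zp} stored={q} restored={w_hat:.4f}")
print(f"error={abs(w_hat - 0.73):.4f} (half a step is {scale / 2:.4f})")
```

The restored value differs from 0.73 by less than half a step, which is the rounding bound for any weight inside the range.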

The challenge comes from outlier weights. If most weights are between -1 and 1 but one weight is 50, the quantization window stretches to accommodate it, crushing the resolution for the important values. Modern quantization methods handle this through block-wise grouping.
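To put a number on the outlier problem, compare the step size with and without the single large weight. The sample values below are invented for illustration.

```python
def step_size(weights, bits=8):
    """Width of one quantization step when the window spans min..max."""
    steps = 2 ** bits - 1
    return (max(weights) - min(weights)) / steps

typical = [-1.0, -0.4, 0.0, 0.3, 1.0]   # illustrative weights between -1 and 1
with_outlier = typical + [50.0]          # the same group plus one outlier

print(f"step without outlier: {step_size(typical):.4f}")
print(f"step with outlier:    {step_size(with_outlier):.4f}")
print(f"resolution lost:      {step_size(with_outlier) / step_size(typical):.1f}x")
```

With the outlier in the group, every ordinary weight is rounded on a grid more than 25 times coarser.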

Block-Wise Quantization

Instead of computing one scale factor for the entire layer, weights are divided into small blocks. GGUF formats commonly use 32 weights per block, while GPTQ and AWQ often use groups of 128. Each block gets its own scale factor stored alongside the quantized values.

Why does this help? Because weight distributions vary across the network, and even within a single layer. Attention layers look different from feed-forward layers. Early layers look different from late layers. With block-wise quantization, each local group gets its own quantization window, so an outlier only stretches the window of its own block and precision is preserved everywhere else.
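The effect is easy to demonstrate on synthetic data. The sketch below uses simple symmetric 4-bit quantization with one scale per block. It is a simplified stand-in for the idea, not the actual GGUF, GPTQ, or AWQ layout, and the weights and helper names are made up.

```python
import random

def quantize_block(block, bits=4):
    """Symmetric absmax quantization of one block: returns (scale, integers)."""
    q_max = 2 ** (bits - 1) - 1                  # 7 for 4-bit
    amax = max(abs(w) for w in block)
    scale = amax / q_max if amax > 0 else 1.0
    return scale, [round(w / scale) for w in block]

def roundtrip(weights, block_size, bits=4):
    """Quantize and dequantize block by block, each with its own scale."""
    restored = []
    for i in range(0, len(weights), block_size):
        scale, ints = quantize_block(weights[i:i + block_size], bits)
        restored.extend(q * scale for q in ints)
    return restored

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

random.seed(0)
weights = [random.uniform(-1, 1) for _ in range(256)]
weights[5] = 50.0                                # a single outlier

per_tensor = mean_abs_error(weights, roundtrip(weights, block_size=len(weights)))
per_block = mean_abs_error(weights, roundtrip(weights, block_size=32))
print(f"one scale for all 256 weights: mean error {per_tensor:.3f}")
print(f"one scale per 32 weights:      mean error {per_block:.3f}")
```

With a single scale, the outlier forces every ordinary weight to round to zero. With blocks of 32, only the block containing the outlier suffers. The cost is storage: one extra scale per block.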

This is the fundamental insight that makes modern quantization practical. It’s used by every major format — GGUF, GPTQ, AWQ — and it’s why you can run a 70 billion parameter model at 4 bits with only a 2 percent quality loss.

K-Quants Decoded

If you’ve ever downloaded a model from Hugging Face, you’ve seen the alphabet soup: Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0. Here’s what they mean:

Q2_K: 2.56 bits per weight. The highest compression — about 16% of the original size. Quality is noticeably degraded.
Q4_K_M: 4.6 bits per weight. About 29% of the original size. The community sweet spot — within 2% quality of the original.
Q5_K_M: 5.3 bits per weight. About 33% of the original size. Near-lossless — degradation is essentially undetectable.
Q8_0: 8.5 bits per weight. About 53% of the original size. Virtually indistinguishable from full FP16.

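The "K" marks the k-quant family in llama.cpp. These formats group blocks into super-blocks of 256 weights and quantize the per-block scale factors as well, which is where the fractional bit counts come from. The _S, _M, and _L suffixes (small, medium, large) indicate the mix: larger mixes keep more of the tensors that matter most for output quality at a higher bit width.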
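The fractional bit counts and the size percentages in the list above follow from simple arithmetic. The sketch below assumes a block of 32 eight-bit weights plus one 16-bit scale, which is how a figure of 8.5 bits per weight arises. The helper names are illustrative.

```python
FP16_BITS = 16

def bits_per_weight(block_size, weight_bits, scale_bits=16):
    """Effective bits per weight when each block also stores one scale."""
    return (block_size * weight_bits + scale_bits) / block_size

def fraction_of_fp16(bpw):
    """Compressed size relative to a 16-bit original."""
    return bpw / FP16_BITS

print(f"32 x 8-bit weights + one 16-bit scale = {bits_per_weight(32, 8)} bits per weight")

# Bits-per-weight figures as quoted in the list above.
for name, bpw in [("Q2_K", 2.56), ("Q4_K_M", 4.6), ("Q5_K_M", 5.3), ("Q8_0", 8.5)]:
    print(f"{name}: {fraction_of_fp16(bpw):.0%} of the FP16 size")
```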

Quality vs Size

On perplexity benchmarks for a 7B model (lower is better, FP16 = 5.47):

Q8_0: 5.48 — virtually identical
Q6_K: 5.49
Q5_K_M: 5.51
Q4_K_M: 5.59 — ~2% degradation
Q3_K_M: 5.92 — 8% degradation, noticeable on reasoning
Q2_K: 7.19 — 31% degradation
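The degradation percentages are just the relative increase in perplexity over the FP16 baseline. A short sketch using the figures listed above (the function name is illustrative):

```python
FP16_PPL = 5.47  # baseline from the list above

perplexity = {
    "Q8_0": 5.48, "Q6_K": 5.49, "Q5_K_M": 5.51,
    "Q4_K_M": 5.59, "Q3_K_M": 5.92, "Q2_K": 7.19,
}

def degradation_pct(ppl, baseline=FP16_PPL):
    """Relative perplexity increase over the baseline, in percent."""
    return (ppl - baseline) / baseline * 100

for name, ppl in perplexity.items():
    print(f"{name:7s} {ppl:.2f}  +{degradation_pct(ppl):.1f}%")
```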

The knee in the quality curve is at Q4_K_M. Below that, quality drops faster than size improves. Above that, diminishing returns.

Choosing the Right Quant

Practical formula: memory in GB ≈ (parameters in billions × bits per weight) / 8 + 2 GB overhead. For a 70B model at Q4_K_M, that is 70 × 4.6 / 8 + 2 ≈ 42 GB.

Q4_K_M is your default — best quality-to-size ratio
Go lower (Q3/Q2) only if your hardware forces you
Go higher (Q5/Q6) if you have VRAM headroom
Never use Q2 for math or reasoning
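The formula and the rules of thumb combine into a small helper. This is a rough sketch: the bits-per-weight values are the ones quoted earlier in this article, the 2 GB overhead is the article's estimate, and the function names are invented for illustration.

```python
# Ordered from highest quality to lowest; values as quoted earlier in the article.
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q5_K_M": 5.3, "Q4_K_M": 4.6, "Q2_K": 2.56}

def estimate_gb(params_billions, bpw, overhead_gb=2.0):
    """(parameters in billions x bits per weight) / 8 + overhead, in GB."""
    return params_billions * bpw / 8 + overhead_gb

def best_quant(params_billions, vram_gb):
    """Highest-quality quant whose estimate fits in memory, or None."""
    for name, bpw in BITS_PER_WEIGHT.items():
        if estimate_gb(params_billions, bpw) <= vram_gb:
            return name
    return None

print(f"70B at Q4_K_M: about {estimate_gb(70, 4.6):.1f} GB")
print("7B on 24 GB: ", best_quant(7, 24))
print("70B on 48 GB:", best_quant(70, 48))
print("70B on 24 GB:", best_quant(70, 24))
```

A result of None means no listed quant fits, in which case the choices are a smaller model or offloading part of it to system RAM. The helper only checks memory, so the advice against Q2 for math and reasoning still applies.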

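Memory bandwidth matters more than raw compute. Generating each token means reading every weight from memory once. A 4090 reads its VRAM at ~1 TB/s. Typical CPU RAM runs at ~50 GB/s. So the same quantized model can generate tokens up to about 20x faster on the GPU.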
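A back-of-the-envelope bound makes the bandwidth point concrete. It assumes every generated token requires reading all the weights once and ignores compute, caching, and batching, so real throughput will be lower. The 4 GB model size is an illustrative figure, roughly a 7B model at Q4_K_M.

```python
def max_tokens_per_second(bandwidth_gb_s, model_gb):
    """Upper bound on generation speed if each token reads all weights once."""
    return bandwidth_gb_s / model_gb

MODEL_GB = 4.0  # illustrative: roughly a 7B model at about 4.6 bits per weight

gpu = max_tokens_per_second(1000, MODEL_GB)  # ~1 TB/s VRAM
cpu = max_tokens_per_second(50, MODEL_GB)    # ~50 GB/s system RAM

print(f"GPU bound: {gpu:.1f} tokens/s")
print(f"CPU bound: {cpu:.1f} tokens/s")
print(f"ratio:     {gpu / cpu:.0f}x")
```

The same bound explains why a smaller quant is faster as well as smaller: fewer bytes per token have to cross the memory bus.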

Bottom Line

Quantization is the unsung hero of local AI. It took models that required datacenter hardware and made them run on laptops. It was built by the open source community: Georgi Gerganov's llama.cpp, the GGUF format, the K-quant system. Without it, running AI at home would not be practical.

And the best part: for most use cases, you can’t tell the difference.