Docs

Key terms and concepts used on canirunaimodel, explained simply.

Hugging Face repo URL

The checker accepts public Hugging Face model pages like https://huggingface.co/google/gemma-3-4b-it. We normalize the URL, extract the repo id, then fetch metadata from Hugging Face before estimating local runtime needs.
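As an illustration of the normalize-and-extract step, here is a minimal sketch using only the Python standard library. The function name `extract_repo_id` and the exact validation rules are assumptions made for this example, not the checker's actual code.

```python
from urllib.parse import urlparse


def extract_repo_id(url: str) -> str:
    """Return 'owner/name' from a Hugging Face model page URL (hypothetical helper)."""
    candidate = url.strip()
    # Accept URLs pasted without a scheme, e.g. "huggingface.co/owner/name".
    if "://" not in candidate:
        candidate = "https://" + candidate
    parsed = urlparse(candidate)
    host = (parsed.hostname or "").lower()
    if host not in {"huggingface.co", "www.huggingface.co"}:
        raise ValueError(f"not a Hugging Face URL: {url!r}")
    # Drop empty path segments so trailing slashes do not matter.
    parts = [p for p in parsed.path.split("/") if p]
    # Assumption: model pages are /owner/name; datasets and Spaces use a prefix.
    # Single-segment paths are not handled by this sketch.
    if len(parts) < 2 or parts[0] in {"datasets", "spaces"}:
        raise ValueError(f"not a model repo URL: {url!r}")
    return f"{parts[0]}/{parts[1]}"


print(extract_repo_id("https://huggingface.co/google/gemma-3-4b-it"))
```

Extra path segments such as /tree/main are ignored, so links copied from a file browser view resolve to the same repo id.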

What we read from the repo

We inspect whatever the repo exposes: parameter counts, architecture names, tensor dtypes, storage totals, safetensors metadata, checkpoint filenames, tags, and adapter hints. If exact values are missing, we fall back to file size and repo structure.
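The file-size fallback can be pictured roughly like this. It is a sketch under stated assumptions: the helper name, the byte-per-parameter table, and the default of F16 are illustrative choices, not the checker's real logic.

```python
from typing import Optional

# Bytes used per parameter for common unquantized tensor types.
BYTES_PER_PARAM = {"F32": 4.0, "F16": 2.0, "BF16": 2.0}


def estimate_param_count(
    reported_params: Optional[int],
    weight_file_bytes: int,
    dtype: str = "F16",  # assumption: default used when the dtype is unknown
) -> int:
    """Prefer the repo's reported parameter count; otherwise infer it from file size."""
    if reported_params:
        return reported_params
    # Fallback: total weight bytes divided by bytes per parameter.
    return round(weight_file_bytes / BYTES_PER_PARAM[dtype])


# A repo with no parameter metadata but 14 GB of BF16 weight files:
print(estimate_param_count(None, 14_000_000_000, "BF16"))
```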

Parameters

When you see "7B" or "70B", that's the number of parameters (weights) in the model, in billions. More parameters generally mean a more capable model, but also more memory and slower generation. As a rough guide, a 7B model handles basic tasks well, 13B to 34B is a solid sweet spot, and 70B+ delivers near-frontier quality but needs serious hardware.

Size vs capability tradeoff

  • 1-3B - Fast
  • 7-8B - Good
  • 13-14B - Better
  • 27-34B - Great
  • 70B - Excellent
  • 405B+ - Frontier

Larger models are smarter; smaller models are faster.

Quantization

Quantization reduces the precision of a model's weights to make it smaller and faster, at the cost of some quality. The number in each format name is its nominal bit-width per weight:

Quality vs size (for a 7B model)

Format    Size      Quality retained vs original
F16       ~13 GB    100%
Q8_0      ~6.7 GB   ~99%
Q6_K      ~5.3 GB   ~95%
Q4_K_M    ~3.9 GB   ~88% (best balance)
Q2_K      ~2.5 GB   ~60%

Format    Bits   Quality     Notes
Q2_K      2      Low         Smallest size, noticeable quality loss
Q4_K_M    4      Good        Best balance of size and quality, most popular
Q6_K      6      Very good   Near-lossless, moderate size increase
Q8_0      8      Excellent   Minimal quality loss, larger file
F16       16     Original    Unquantized 16-bit weights, largest size
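
The relationship between bit-width and size can be sketched as parameters times bits divided by eight. The snippet below uses the nominal bit-widths from the table; the function name is made up for this example. Published file sizes will differ somewhat from these numbers, since quantized formats also store scaling data and a "7B" model rarely has exactly 7 billion parameters.

```python
# Nominal bits per weight, taken from the table above.
NOMINAL_BITS = {"Q2_K": 2, "Q4_K_M": 4, "Q6_K": 6, "Q8_0": 8, "F16": 16}


def nominal_weight_gb(params_billions: float, fmt: str) -> float:
    """Rough weight size in GB (10^9 bytes), ignoring format overhead."""
    bits = NOMINAL_BITS[fmt]
    # Billions of parameters times bytes per parameter gives gigabytes.
    return params_billions * bits / 8


for fmt in NOMINAL_BITS:
    print(f"{fmt:7s} {nominal_weight_gb(7, fmt):5.2f} GB")
```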

VRAM and RAM

VRAM is the memory on your GPU. RAM is system memory. To run a model well, the quantized weights usually need to fit in VRAM or in unified memory on Apple Silicon. If a model needs 8 GB of VRAM and your GPU has 6 GB, it usually will not run well and may fall back to much slower CPU inference.

Tensor type

Tensor type means how the weights are stored: for example F32, F16, BF16, or quantized formats. Heavier tensor types need more memory. If Hugging Face metadata does not say this clearly, we infer it from filenames, repo tags, or checkpoint format.
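
Filename inference might look something like the following sketch. The regular expression and the helper name `infer_tensor_type` are assumptions for illustration; real repos use many naming schemes, and this pattern only covers a few common ones.

```python
import re
from typing import Optional

# Matches labels such as Q4_K_M, Q8_0, F16, BF16, FP32 when they appear as a
# separate token in the filename (not glued to other letters or digits).
_TYPE_PATTERN = re.compile(
    r"(?<![A-Za-z0-9])(Q\d(?:_[A-Z0-9]+)*|BF16|F16|FP16|F32|FP32)(?![A-Za-z0-9])",
    re.IGNORECASE,
)


def infer_tensor_type(filename: str) -> Optional[str]:
    """Guess the tensor type from a checkpoint filename, or None if unclear."""
    match = _TYPE_PATTERN.search(filename)
    if match is None:
        return None
    label = match.group(1).upper()
    # Normalize the FP16/FP32 spellings to the F16/F32 names used in these docs.
    return {"FP16": "F16", "FP32": "F32"}.get(label, label)


print(infer_tensor_type("gemma-3-4b-it-Q4_K_M.gguf"))
print(infer_tensor_type("model.safetensors"))
```

When the function returns None, some other signal (repo tags or checkpoint format) is needed.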

Adapters and LoRAs

Some Hugging Face repos are not full models. They are adapters or LoRAs that need a separate base model at runtime. In those cases, the checker tries to estimate the real runtime footprint using the base model instead of only the adapter file size.
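
One way to picture this: adapters saved with the PEFT library include an adapter_config.json file whose base_model_name_or_path field names the base model. Whether the checker reads that exact field is an assumption here, and both helper names below are hypothetical.

```python
from typing import Optional


def resolve_base_model(adapter_config: dict) -> Optional[str]:
    """Return the base model named in a parsed adapter_config.json, if any."""
    base = adapter_config.get("base_model_name_or_path")
    return base or None


def runtime_footprint_gb(adapter_gb: float, base_weights_gb: float) -> float:
    """An adapter needs its base model loaded too, so the sizes add up."""
    return adapter_gb + base_weights_gb


config = {"peft_type": "LORA", "base_model_name_or_path": "google/gemma-3-4b-it"}
print(resolve_base_model(config))
# Illustrative sizes: a 0.05 GB adapter on top of 3.9 GB of base weights.
print(round(runtime_footprint_gb(0.05, 3.9), 2))
```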

MoE (Mixture of Experts)

A Mixture of Experts model splits its parameters into groups called experts. For each token, a router activates only a few of them. This lets the model approach the quality of a larger dense model while generating at closer to the speed of a smaller one. The tradeoff: the full model still needs to fit in memory, even though only part of it runs for any given token.

MoE expert routing (Mixtral example)

[Figure: a token passes through a router that selects the top 2 of 8 experts, and their results form the output. Active parameters: ~12.9B. Total parameters: 46.7B. VRAM must hold all 46.7B.]

Dense vs MoE Architecture

A dense model activates all its parameters for every token. An MoE model has more total parameters but only uses a subset per token. Dense models are simpler and more predictable in terms of memory and speed. MoE models can deliver better quality than their active parameter count suggests, but they need VRAM for their total parameter count.
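
To make the difference concrete, here is a small sketch using the Mixtral figures quoted above (46.7B total, ~12.9B active). The function name and the choice of 16-bit weights are assumptions for the example.

```python
def moe_footprint(total_params_b: float, active_params_b: float, bits_per_weight: float) -> dict:
    """Split an MoE model's cost into what must be resident and what is read per token."""
    bytes_per_param = bits_per_weight / 8
    return {
        # Every expert must be loaded, because any of them can be selected.
        "resident_gb": total_params_b * bytes_per_param,
        # Only the active experts' weights are read for a given token.
        "read_per_token_gb": active_params_b * bytes_per_param,
    }


mixtral = moe_footprint(46.7, 12.9, bits_per_weight=16)
print(f"resident: {mixtral['resident_gb']:.1f} GB")
print(f"read per token: {mixtral['read_per_token_gb']:.1f} GB")
```

Memory is sized by the first number, while speed tracks the second more closely.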

Context Length

Context length is how many tokens the model can process at once, input and output combined. Longer context is great for analyzing documents or long conversations, but uses more memory. Most local usage works fine with 4K to 8K context.
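
Much of the extra memory goes to the key-value (KV) cache, which grows linearly with context length. The sketch below uses the standard formula for a transformer's KV cache. The configuration values (32 layers, 32 KV heads, head dimension 128, 16-bit cache) are illustrative assumptions, roughly a 7B-class model without grouped-query attention, not numbers from any specific repo.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # 2 bytes for a 16-bit cache
) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


GIB = 1024 ** 3
for ctx in (4096, 8192, 32768):
    size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, context_tokens=ctx)
    print(f"{ctx:6d} tokens: {size / GIB:.1f} GiB")
```

Models that use grouped-query attention have fewer KV heads, which shrinks this cache proportionally.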

Tokens per Second (tok/s)

This is the inference speed, how fast the model generates text. A rough guide:

  • 60+ tok/s - Instant feel, great for interactive use
  • 30-60 tok/s - Fast and comfortable
  • 15-30 tok/s - Usable, slight wait
  • 5-15 tok/s - Workable for batch tasks
  • <5 tok/s - Painful for interactive use
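
The bands above translate directly into a lookup. In this sketch the function name is made up, and treating each boundary value (60, 30, 15, 5) as belonging to the faster band is an assumption, since the guide lists overlapping ranges.

```python
def describe_speed(tok_per_s: float) -> str:
    """Map a generation speed to the rough guide's description."""
    if tok_per_s >= 60:
        return "Instant feel, great for interactive use"
    if tok_per_s >= 30:
        return "Fast and comfortable"
    if tok_per_s >= 15:
        return "Usable, slight wait"
    if tok_per_s >= 5:
        return "Workable for batch tasks"
    return "Painful for interactive use"


for speed in (80, 42, 20, 8, 2):
    print(f"{speed:3d} tok/s: {describe_speed(speed)}")
```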

Verdicts

We turn the estimate into a simple verdict:

  • Runs great means there is healthy memory headroom.
  • Tight fit means it may run, but with little room to spare.
  • Too heavy means the footprint likely exceeds what your setup can handle comfortably.
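
A verdict rule along these lines could be sketched as follows. The 80% headroom threshold and the function signature are hypothetical values chosen for the example; the checker's real cutoffs are not documented here.

```python
def verdict(required_gb: float, available_gb: float, comfortable_ratio: float = 0.8) -> str:
    """Turn a memory estimate into one of the three verdicts.

    comfortable_ratio is an assumed threshold: at or below this share of
    available memory, the model is treated as having healthy headroom.
    """
    if required_gb <= available_gb * comfortable_ratio:
        return "Runs great"
    if required_gb <= available_gb:
        return "Tight fit"
    return "Too heavy"


print(verdict(3.9, 8))  # small model on an 8 GB GPU
print(verdict(7.5, 8))  # fits, but with little room to spare
print(verdict(8.0, 6))  # the 8 GB vs 6 GB case from the VRAM section
```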

GGUF Format

GGUF is the file format used by llama.cpp and tools like Ollama, LM Studio, and GPT4All. It stores quantized model weights in a single file that is ready to run on CPU or GPU. Even if you start from a Hugging Face repo URL, you often end up looking for the GGUF version when you actually want to run it locally.

Memory Bandwidth

Memory bandwidth, measured in GB/s, determines how fast data can be read from memory. During inference, the bottleneck is often reading model weights from memory for each generated token, so higher bandwidth means more tokens per second. This is why Apple Silicon Macs, whose unified memory is both large and fast, can run larger models surprisingly well, and why an RTX 4090 generates text faster than an RTX 4060 even at the same VRAM usage.

Memory bandwidth comparison (GB/s)

  • RTX 4060 - 272
  • M4 Pro - 273
  • RTX 4070 - 504
  • M4 Max - 546
  • 7900 XTX - 960
  • RTX 4090 - 1008
  • RTX 5090 - 1792

Higher bandwidth = faster tok/s at the same model size.
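
If reading the weights is the bottleneck, a simple ceiling on speed is bandwidth divided by the bytes read per token. The sketch below applies that to a few of the bandwidth figures above with a ~3.9 GB model (the Q4_K_M 7B size from the quantization section). The function name is an assumption, and real speeds land below this bound because compute, the KV cache, and other overheads also take time.

```python
# Bandwidth figures in GB/s, taken from the comparison above.
BANDWIDTH_GBPS = {"RTX 4060": 272, "M4 Max": 546, "RTX 4090": 1008}


def max_tok_per_s(bandwidth_gbps: float, weights_gb: float) -> float:
    """Upper bound on tokens per second when each token reads all weights once."""
    return bandwidth_gbps / weights_gb


for device, bandwidth in BANDWIDTH_GBPS.items():
    print(f"{device:9s} up to ~{max_tok_per_s(bandwidth, 3.9):.0f} tok/s")
```

For an MoE model, the weights read per token are closer to the active parameters than the total, which is why MoE models can generate faster than dense models of the same overall size.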

What comes next

The site is moving toward a broader Hugging Face toolbox. Models come first, and later this can expand into datasets, Spaces, and other HF-native pages using the same hardware-plus-metadata approach.