How to Meet Tabby ML Requirements for Local Model Deployment

Deploying machine‑learning models locally with Tabby requires meeting specific format, size, and performance criteria. This guide explains those requirements, shows how to adapt a model, and demonstrates a real deployment.

What Tabby ML Actually Demands: Breaking Down the Requirements

Core Constraints for On‑Device Inference

Running a 7 B LLaMA model on a laptop with Tabby highlights an often-overlooked aspect of local deployment: the environmental constraints Tabby enforces before it loads a model file. Tabby is deliberately strict to guarantee low-latency inference on a wide range of hardware, from MacBooks to low-end Windows PCs and embedded ARM boards.

Below is a checklist for preparing a model for Tabby. The runtime will refuse to load anything that violates these rules.

CPU‑only inference: Tabby does not support GPU offload. All tensor operations must run on the host CPU, so the model must fit into RAM while leaving headroom for the OS and the Tabby process.
Deterministic execution path: The runtime avoids nondeterministic code paths (for example, unseeded random number generators or nondeterministic kernel selection) so that the same prompt always yields the same token stream on the same hardware. Inference libraries that inject random jitter for performance are rejected.
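Thread limits: Tabby caps the number of inference threads at the number of physical cores. Note that os.cpu_count() reports logical CPUs, which is often twice the physical count on machines with simultaneous multithreading; psutil.cpu_count(logical=False) reports physical cores. Over-committing threads can cause thread-pool thrashing, which Tabby treats as a performance violation. Setting OMP_NUM_THREADS to the physical core count before launching the server avoids this issue.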
Contiguous memory allocation: Tabby expects the model to be loaded into a single contiguous memory block. Fragmented allocations (common when loading additional libraries after the model) trigger a “memory layout error.” Starting Tabby early in the session helps maintain a clean memory layout.
Latency ceiling: Tabby measures the time from prompt receipt to first token emission. If that latency exceeds ~200 ms on the target hardware, the model is flagged as “too slow” and will not start. Quantization and architectural pruning are therefore essential.
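
The thread-limit item in the checklist can be sketched in a few lines of Python. This is a minimal illustration, not part of Tabby: the helper names physical_core_count and thread_limited_env are hypothetical, and psutil is treated as an optional dependency with a fallback to the logical CPU count.

```python
import os


def physical_core_count():
    """Return the number of physical cores, falling back to logical CPUs."""
    try:
        import psutil  # optional third-party dependency
        cores = psutil.cpu_count(logical=False)
    except ImportError:
        cores = None
    # os.cpu_count() counts logical CPUs and may itself return None.
    return cores or os.cpu_count() or 1


def thread_limited_env(base_env=None):
    """Copy an environment mapping and pin OMP_NUM_THREADS to the core count."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(physical_core_count())
    return env


if __name__ == "__main__":
    env = thread_limited_env()
    print("OMP_NUM_THREADS =", env["OMP_NUM_THREADS"])
    # Pass `env` to subprocess.run(..., env=env) when launching the server.
    # The launch command itself depends on your installation and is omitted.
```

Building a copy of the environment, rather than mutating os.environ, keeps the setting scoped to the server process that receives it.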

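# sanity_check.py
import os
import sys
import time

import psutil


def cpu_only():
    # Heuristic only: flags running processes whose name contains "cuda".
    # It cannot prove that a GPU context is present or absent.
    for proc in psutil.process_iter(['name']):
        # 'name' is None when the process info cannot be read.
        if 'cuda' in (proc.info['name'] or '').lower():
            sys.exit('GPU processes detected; stop them before loading Tabby.')


def memory_check(model_path, max_ram_gb=8):
    # Rough estimate: RAM needed is about 1.5x the model size on disk.
    size_gb = os.path.getsize(model_path) / (1024**3)
    if size_gb * 1.5 > max_ram_gb:
        sys.exit(f'Model too large for target RAM ({size_gb:.2f} GB on disk).')


def latency_probe(model_path, prompt='Hello, world!'):
    # ASSUMPTION: tabby.infer is a placeholder for whatever call runs one
    # inference in your setup; it is not a documented Tabby API.
    import tabby
    # This times the whole call, so it is an upper bound on first-token latency.
    start = time.perf_counter()
    tabby.infer(model_path, prompt)
    elapsed = time.perf_counter() - start
    if elapsed > 0.2:
        sys.exit(f'Latency {elapsed:.3f}s exceeds the 200 ms limit.')


if __name__ == '__main__':
    if len(sys.argv) != 2:
        sys.exit('usage: python sanity_check.py <model.gguf>')
    model = sys.argv[1]
    cpu_only()
    memory_check(model)
    latency_probe(model)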

Running python sanity_check.py my_model.gguf catches most Tabby rejections before the server starts, avoiding cryptic “failed to load model” logs.
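
The latency probe in sanity_check.py times a complete inference call, which overestimates time to first token. If the inference call can return tokens as a stream, the first-token latency can be measured directly. The sketch below assumes only that tokens arrive through a Python iterator; first_token_latency and fake_stream are hypothetical names, and fake_stream is a stand-in for a real model.

```python
import time


def first_token_latency(token_stream):
    """Return (seconds until the first token, first token) for any iterator."""
    start = time.perf_counter()
    first = next(iter(token_stream))
    return time.perf_counter() - start, first


def fake_stream(delay_s=0.05):
    """Stand-in generator that simulates a model emitting tokens."""
    time.sleep(delay_s)  # simulated prompt processing before the first token
    yield "Hello"
    time.sleep(delay_s)
    yield " world"


if __name__ == "__main__":
    latency, token = first_token_latency(fake_stream())
    print(f"first token {token!r} after {latency * 1000:.0f} ms")
    print("within 200 ms budget:", latency <= 0.2)
```

Because a generator body does not run until the first next() call, the timer covers the simulated prompt processing but not the later tokens.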

Supported Model Formats and Size Limits

.gguf (GGUF, the successor to the GGML format): The default format for llama.cpp since August 2023 and the only format Tabby officially documents. GGUF bundles model weights, metadata, and quantization parameters into a single, self-describing file.
.bin (Legacy GGML): Older models may be provided as raw .bin files. Tabby can load them when a corresponding .json config describing architecture and quantization is supplied.
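
A GGUF file begins with the four ASCII bytes GGUF followed by a 32-bit version number, so the format can be checked cheaply before a model is handed to the runtime. The following is a minimal sketch; gguf_version is a hypothetical helper, and it assumes the common little-endian layout.

```python
import struct
import sys


def gguf_version(path):
    """Return the GGUF version number, or None if the file is not GGUF."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        return None
    # The magic is followed by an unsigned 32-bit version field.
    (version,) = struct.unpack("<I", header[4:8])
    return version


if __name__ == "__main__" and len(sys.argv) == 2:
    version = gguf_version(sys.argv[1])
    if version is None:
        sys.exit("Not a GGUF file; convert it before loading.")
    print(f"GGUF version {version}")
```

A check like this only confirms the container format. It says nothing about whether the architecture or quantization inside the file is supported.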

Formats such as ONNX, TensorFlow SavedModel, or PyTorch .pt are rejected outright, so models in those formats must be converted to GGUF first. The conversion pipeline is straightforward and can be integrated into a CI process.

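# 1. Export to Hugging Face format (optional but convenient)
# Assumes model.pt is a fully pickled transformers model, not a bare state_dict.
# weights_only=False unpickles arbitrary objects; use it only on trusted files.
python -c "
import torch
model = torch.load('model.pt', weights_only=False)
model.save_pretrained('hf_export')
"
# The converter also expects the tokenizer files inside hf_export.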

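# 2. Convert to GGUF with llama.cpp. Script and binary names differ between
#    llama.cpp versions (newer checkouts ship convert_hf_to_gguf.py and
#    llama-quantize), so check the README of your checkout.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
make   # builds the quantize tool on checkouts that still use the Makefile
mkdir -p gguf_output
python convert.py ../hf_export \
  --outtype f16 \
  --outfile gguf_output/ggml-model-f16.gguf
# convert.py does not emit q4_0 directly; quantize in a separate step.
./quantize gguf_output/ggml-model-f16.gguf \
  gguf_output/ggml-model-q4_0.gguf q4_0   # 4-bit quantization

# 3. Verify size. q4_0 stores about 4.5 bits per weight, so a 7 B model
#    comes to roughly 4 GB (a 13 B model would be above 7 GB).
ls -lh gguf_output/ggml-model-q4_0.gguf
# Example output:
# -rw-r--r-- 1 user user 4.2G Jan 12 10:30 ggml-model-q4_0.gguf

# 4. Run sanity check (from earlier)
python ../sanity_check.py gguf_output/ggml-model-q4_0.gguf

Quantizing to q4_0 compresses weights to 4-bit integers plus one 16-bit scale per block of 32 weights, about 4.5 bits per weight in total. That reduces the on-disk footprint by roughly 72 % compared with a 16-bit float checkpoint. Tabby caps the maximum model size at 5 GB for 4-bit models on an 8 GB RAM device. Larger files trigger an out-of-memory error or are refused during the initial load.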
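
The size arithmetic can be checked before running a conversion. The sketch below estimates the q4_0 file size from the parameter count, assuming about 4.5 bits per weight (32 four-bit weights plus one 16-bit scale per block) and ignoring metadata and any tensors kept at higher precision. The function names are hypothetical, and the 5 GB default comes from the limit stated above.

```python
Q4_0_BITS_PER_WEIGHT = 4.5  # 32 four-bit weights + one 16-bit scale per block


def estimated_size_gb(n_params, bits_per_weight=Q4_0_BITS_PER_WEIGHT):
    """Approximate file size in GB (10^9 bytes) for a given parameter count."""
    return n_params * bits_per_weight / 8 / 1e9


def fits_cap(n_params, cap_gb=5.0):
    """True if the estimated q4_0 size stays within the size cap."""
    return estimated_size_gb(n_params) <= cap_gb


if __name__ == "__main__":
    for label, n_params in [("7 B", 7e9), ("13 B", 13e9)]:
        size = estimated_size_gb(n_params)
        print(f"{label}: ~{size:.1f} GB at q4_0, fits 5 GB cap: {fits_cap(n_params)}")
```

Under these assumptions a 7 B model fits under the cap while a 13 B model does not, which is why the example pipeline targets the smaller model.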