How to Run TabbyML Inside Docker for Fast Local LLM Inference
Running TabbyML inside Docker provides fast, reproducible local LLM inference while eliminating manual dependency and environment management.
The Problem: Why Local LLM Inference Is Hard Without Docker
Hardware and Dependency Hell
Spinning up a local LLM with TabbyML often hits a hardware‑software mismatch. For example, the gemma-2b-it model requires at least 12 GB of VRAM for reasonable performance, while many laptops ship with 6 GB GPUs. The obvious workaround is to downgrade the model or hope the driver can handle the memory pressure.
- CUDA toolkit: the version that matches the driver (e.g., 525.85) can conflict with the torch wheel that TabbyML pulls in.
- PyTorch: nightly builds that promise torch.compile speedups may require gcc-12, which is not available in default apt repositories without adding a PPA.
- ONNX Runtime: the onnxruntime-gpu package often needs a specific cudnn version that resides in a Conda channel not enabled by default.
After multiple rounds of apt-get update && apt-get install …, conda env create …, and recurring ImportError: libcusolver.so.11 failures, the environment may finally import cleanly. Then the script either crashes on the first batch because the GPU runs out of memory, or silently falls back to CPU and runs dramatically slower.
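Errors such as ImportError: libcusolver.so.11 can be caught before a long job starts by asking the dynamic loader directly. The following is a minimal sketch, not part of TabbyML; the function name missing_shared_libs and the library names in the example list are assumptions chosen for illustration and should be adjusted to the CUDA version in use.

```python
import ctypes


def missing_shared_libs(names):
    """Return the subset of `names` that the dynamic loader cannot open."""
    missing = []
    for name in names:
        try:
            ctypes.CDLL(name)  # raises OSError when the library cannot be loaded
        except OSError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Example names only (assumption); pick the ones your wheels link against.
    wanted = ["libcusolver.so.11", "libcudart.so.12"]
    for lib in missing_shared_libs(wanted):
        print(f"missing: {lib}")
```

An empty result means every listed library was loadable in the current environment, which is a quick way to compare a host setup against a container.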
In a team setting, this “dependency hell” multiplies. Each developer’s OS version, GPU driver, and Python environment differ, so a combination that works on one machine fails on another. Locking versions in requirements.txt freezes Python packages but does not prevent drift in underlying system libraries.
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

# System deps
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        python3-pip git curl && \
    rm -rf /var/lib/apt/lists/*

# Python deps (the +cu121 torch wheel is served from the PyTorch index, not PyPI)
RUN pip3 install --no-cache-dir \
        --extra-index-url https://download.pytorch.org/whl/cu121 \
        torch==2.2.0+cu121 \
        tabbyml==0.6.1 \
        onnxruntime-gpu==1.17.0

# Expose all GPUs and the compute/utility driver libraries to the container
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
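The NVIDIA CUDA runtime image pins a matched set of CUDA user-space libraries, which removes most of the need to chase down specific cudnn or gcc versions on the host. The kernel driver itself is not part of the image: it stays on the host and is exposed to the container by the NVIDIA Container Toolkit. The result is a portable bundle that runs on any machine with that toolkit installed and an NVIDIA driver recent enough for CUDA 12.1.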
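Whether the container actually sees a GPU, and how much memory it has, can be checked from inside it before loading a model. The sketch below assumes nvidia-smi is on the PATH inside the container (the NVIDIA Container Toolkit normally provides it when the utility capability is enabled) and that it reports a numeric memory.total value. The helper names parse_gpu_rows and visible_gpus are illustrative, not part of any library.

```python
import shutil
import subprocess


def parse_gpu_rows(csv_text):
    """Parse 'name, memory_mib' rows into (name, memory_mib) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue  # ignore blank lines
        # Split on the last comma so a comma in the GPU name is preserved.
        name, _, mem = line.rpartition(",")
        gpus.append((name.strip(), int(mem.strip())))
    return gpus


def visible_gpus():
    """Return the GPUs nvidia-smi reports, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    for name, mem_mib in visible_gpus():
        print(f"{name}: {mem_mib} MiB")
```

An empty list from visible_gpus() inside the container usually means the container was started without --gpus or the host lacks the NVIDIA Container Toolkit.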
Isolation and Reproducibility Benefits
Even after aligning the hardware stack, environment drift can appear. A clean venv may start with the expected dependencies, but after a week of experimentation pip freeze can list dozens of transitive packages not present in the original requirements.txt. A teammate pulling the same repository and installing from requirements.txt might encounter RuntimeError: Expected all tensors to be on the same device because a different version of transformers introduced a changed tokenizer implementation.
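The drift described above can be made visible by comparing the pinned requirements.txt with the output of pip freeze. This is a minimal sketch under simplifying assumptions: it only understands name==version lines, it does not normalize hyphens versus underscores in package names, and the function names and version numbers in the example are illustrative.

```python
def parse_pins(lines):
    """Map package name -> pinned version for 'name==version' lines."""
    pins = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blanks and unpinned entries
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_drift(required_lines, frozen_lines):
    """Compare pinned requirements with `pip freeze` output."""
    required = parse_pins(required_lines)
    installed = parse_pins(frozen_lines)
    changed = {
        name: (required[name], installed[name])
        for name in required
        if name in installed and installed[name] != required[name]
    }
    extra = sorted(set(installed) - set(required))    # not in requirements.txt
    missing = sorted(set(required) - set(installed))  # pinned but not installed
    return {"changed": changed, "extra": extra, "missing": missing}


# Example with made-up versions: one package moved, one appeared transitively.
report = find_drift(
    ["torch==2.2.0", "transformers==4.38.0"],
    ["torch==2.2.0", "transformers==4.40.1", "tokenizers==0.19.1"],
)
print(report)
```

Running a check like this in CI flags the moment a local environment stops matching the file that teammates install from.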
Containers provide a hard boundary. When TabbyML runs inside Docker, the OS userland, Python packages, and CUDA libraries are sealed in image layers that change only when the image is explicitly rebuilt. Only the NVIDIA kernel driver stays on the host. This removes most of the "it works on my machine" failures.
- Check the model and hardware requirements on the TabbyML GitHub page.
- Pick the matching nvidia/cuda base image (e.g., 12.1.0-runtime-ubuntu22.04 for CUDA 12.1).
- Write a minimal Dockerfile that installs only what TabbyML needs.
- Build the image once and push it to a private registry so every CI runner and developer can pull the exact same image.
- Run inference with a one-liner:
docker run --gpus all -v $(pwd)/models:/app/models tabbyml:latest python -m tabbyml.serve --model gemma-2b-it
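When the one-liner is launched from scripts, building it as an argument list avoids shell-quoting problems with paths that contain spaces. The sketch below only assembles the command; the image tag, the /app/models mount point, and the tabbyml.serve module name are copied from the command above and have not been checked against TabbyML's own documentation. The function name docker_run_args is hypothetical.

```python
import shlex
from pathlib import Path


def docker_run_args(image, models_dir, model):
    """Build the argv list for the `docker run` one-liner."""
    host_models = Path(models_dir).resolve()  # bind mounts need absolute paths
    return [
        "docker", "run", "--gpus", "all",
        "-v", f"{host_models}:/app/models",
        image,
        # Entry point as given in this article (assumption, not verified).
        "python", "-m", "tabbyml.serve", "--model", model,
    ]


if __name__ == "__main__":
    args = docker_run_args("tabbyml:latest", "models", "gemma-2b-it")
    print(shlex.join(args))  # pass `args` to subprocess.run(...) to execute
```

Passing the list to subprocess.run without shell=True keeps the mount path intact even when the working directory has unusual characters.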
Because the container encapsulates the exact CUDA version and PyTorch wheel, performance measurements become comparable across machines without separate driver‑tuning sessions.
Another subtle win is security. Running an LLM often involves downloading model weights from arbitrary URLs. When those downloads happen inside a container, the host file system stays clean: only the mounted /app/models directory receives the files. If a model file is corrupted or malicious, deleting the container and clearing the mounted directory leaves no stray binaries elsewhere on the host. A container is not a full sandbox, but it does limit where downloaded files can land.
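A corrupted download can be detected before the model is loaded by comparing the file's SHA-256 digest with a checksum published by the model's source. This is a minimal sketch that assumes such a checksum is available; the function names are illustrative and the file path in the usage line is a placeholder.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large weights never sit fully in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model_file(path, expected_hex):
    """Return True when the file's SHA-256 matches the expected hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

A call such as verify_model_file("models/weights.bin", published_checksum) returning False is a signal to delete the file and download it again. A matching checksum shows the file is intact, not that its contents are trustworthy.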
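jobs:
  test-tabbyml:
    # GitHub-hosted ubuntu-latest runners have no GPU; "gpu" is a placeholder
    # label for a self-hosted runner with an NVIDIA GPU and container toolkit.
    runs-on: [self-hosted, gpu]
    container:
      image: ghcr.io/yourorg/tabbyml:latest
      options: --gpus all
    steps:
      - uses: actions/checkout@v3
      - name: Run unit tests
        run: pytest tests/unit
      - name: Run inference sanity check
        run: |
          python -m tabbyml.serve --model gemma-2b-it --max-tokens 32 &
          sleep 10
          # --fail makes curl exit non-zero on HTTP errors so the step fails
          curl --fail -X POST -H 'Content-Type: application/json' \
            http://localhost:8000/generate -d '{"prompt":"Hello"}'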
Since the image already pins the CUDA libraries, and a GPU-equipped runner exposes its driver through the NVIDIA Container Toolkit, the CI job does not stall on "cannot find CUDA". The pipeline becomes far more reproducible, so a failing test is much more likely to reflect an actual issue in the code than an environment mismatch.
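The fixed sleep 10 in the sanity check is either too short on a cold start or wasted time on a warm one. A polling loop is a common alternative; the sketch below is one way to write it, with hypothetical helper names, and it assumes only that the server answers HTTP on the port used in the workflow above.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(probe, timeout=60.0, interval=1.0):
    """Call `probe` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def http_probe(url):
    """Return a probe that succeeds once `url` answers at all."""
    def probe():
        try:
            urllib.request.urlopen(url, timeout=2).close()
            return True
        except urllib.error.HTTPError:
            return True   # server is up, even if this path returns an error status
        except (urllib.error.URLError, OSError):
            return False  # connection refused or timed out: not ready yet
    return probe
```

In the CI step, wait_until_ready(http_probe("http://localhost:8000/"), timeout=120) could replace the sleep, with the job failing early when it returns False.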