How to Set Up OpenHands AI Locally on a Single GPU

Running OpenHands AI locally on a single GPU can unlock fast, private code‑assistant capabilities, but it requires careful setup.

The Pain Points: What Makes Local OpenHands Deployment Hard?

When you run OpenHands on a workstation, the biggest hurdles are the model's memory appetite and the tangled web of Python‑CUDA dependencies, not a lack of tutorials. Below are the two most common blockers and how they tend to surface in a real‑world setup.

GPU memory limits and model size

OpenHands ships with several model variants—7 billion, 13 billion and even a 30 billion parameter checkpoint. Those numbers sound impressive until you translate them into VRAM consumption.

7B (fp16): roughly 13 GB of GPU memory when loaded uncompressed.
13B (fp16): pushes the limit to about 27 GB, which exceeds most consumer cards.
7B (int8‑quantized): drops to ~6 GB, but introduces a small latency penalty and occasional token‑level quality dip.
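
The figures above follow from simple arithmetic: parameter count times bytes per parameter. The sketch below is a rough estimator for the weights alone; the helper name weight_gib and the dtype table are illustrative assumptions, and real usage is higher once the CUDA context, activations and KV cache are added.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Covers the weights only; runtime overhead comes on top.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_gib(n_params, dtype):
    """Return the size of the model weights in GiB for a given dtype."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


for label, n_params, dtype in [("7B", 7e9, "fp16"),
                               ("13B", 13e9, "fp16"),
                               ("7B", 7e9, "int8")]:
    print(f"{label} {dtype}: {weight_gib(n_params, dtype):.1f} GiB")
# 7B fp16: 13.0 GiB
# 13B fp16: 24.2 GiB
# 7B int8: 6.5 GiB
```

Comparing the estimate with the card's free memory (not its total capacity) before loading gives an early warning that a variant will not fit.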

When the model does not fit, loading fails with an out‑of‑memory error like this:

Traceback (most recent call last):
  File "run_openhands.py", line 42, in <module>
    model = AutoModelForCausalLM.from_pretrained(...)
  File ".../torch/cuda/__init__.py", line 307, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA out of memory. Tried to allocate 13.55 GiB (GPU 0; 24.00 GiB total capacity; 5.31 GiB already allocated; 2.68 GiB free; 13.55 GiB reserved in total)

Several mitigations help.

Quantize at load time. Tools like bitsandbytes let you convert an fp16 checkpoint to int8 on‑the‑fly while it is being loaded:

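import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "openhands/7b-fp16"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# bitsandbytes must be installed; transformers calls into it when an
# 8-bit quantization config is passed, so no direct import is needed.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",          # let accelerate decide where each layer goes
    torch_dtype=torch.float16,  # dtype for the modules that stay unquantized
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)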

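Offload to CPU. The device_map="auto" strategy places part of the model in system RAM when it does not fit on the GPU. On a 16 GB laptop with a 6 GB GPU, offloading ~8 GB to CPU enables the 7B model to run, though generation is slower because offloaded weights are copied to the GPU on every forward pass. When combined with 8‑bit loading, CPU offload also requires llm_int8_enable_fp32_cpu_offload=True in the quantization config.
Use a smaller context window. OpenHands defaults to a 4096‑token context. Capping sequences at 2048 tokens halves the worst‑case KV‑cache size, which can save ~2–3 GB depending on architecture and batch size. Changing model.config.max_position_embeddings after loading does not shrink the cache by itself; enforce the cap by truncating prompts in the tokenizer (truncation=True, max_length=2048) and limiting max_new_tokens in generate().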
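
To see why halving the context halves the cache, it helps to write the KV‑cache size out explicitly. The sketch below assumes a Llama‑style 7B layout (32 layers, 32 key/value heads, head dimension 128, fp16 values); those shape numbers are an assumption for illustration and may not match the OpenHands checkpoints, so read the real values from the model's config.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   batch_size=1, bytes_per_value=2):
    """Worst-case KV-cache size in bytes.

    Each token stores one key and one value vector (hence the factor 2)
    per layer and per KV head.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Assumed Llama-style 7B shape; hypothetical for the OpenHands models.
ASSUMED_SHAPE = dict(n_layers=32, n_kv_heads=32, head_dim=128)

for seq_len in (4096, 2048):
    gib = kv_cache_bytes(seq_len, **ASSUMED_SHAPE) / 2**30
    print(f"{seq_len} tokens: {gib:.1f} GiB per sequence")
# 4096 tokens: 2.0 GiB per sequence
# 2048 tokens: 1.0 GiB per sequence
```

Under these assumptions the saving is about 1 GiB per concurrent sequence, so the total scales with the batch size the service runs.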

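Release cached memory between requests. PyTorch keeps freed blocks in its own cache; returning them to the driver makes room for the next large allocation:

torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver
# or, in a longer‑running service, drop references to old results first:
del outputs               # `outputs` holds the tensors from the previous request
torch.cuda.empty_cache()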

A helper script that prints the peak VRAM after each request can catch subtle bugs, such as stray tensors left on the GPU that cause a gradual OOM drift.
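
One possible shape for such a helper is a decorator around the request handler. This is a minimal sketch, not part of OpenHands: the name log_peak_vram is made up for illustration, and the wrapper simply does nothing on machines without PyTorch or a CUDA device.

```python
import functools

try:
    import torch
except ImportError:  # keeps the sketch importable without PyTorch
    torch = None


def cuda_available():
    """True only when PyTorch is installed and can see a CUDA device."""
    return torch is not None and torch.cuda.is_available()


def log_peak_vram(fn):
    """Print peak and still-held VRAM (GiB) after every call to fn."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if not cuda_available():
            return fn(*args, **kwargs)
        torch.cuda.reset_peak_memory_stats()  # start a fresh peak measurement
        try:
            return fn(*args, **kwargs)
        finally:
            peak = torch.cuda.max_memory_allocated() / 2**30
            held = torch.cuda.memory_allocated() / 2**30
            print(f"{fn.__name__}: peak {peak:.2f} GiB, "
                  f"still allocated {held:.2f} GiB")
    return wrapper
```

Decorating the function that handles one request is enough. If the "still allocated" figure keeps climbing from call to call, some tensor is being kept alive on the GPU between requests.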

Dependency hell: Python, CUDA, and library versions

The second beast is the version matrix. OpenHands sits on top of torch, transformers, bitsandbytes, and optionally flash‑attn. Each library has strict CUDA compatibility rules, and upgrading one can break the whole stack.

# naive pip install
pip install torch transformers bitsandbytes flash-attn openhands

Typical issues:

torch pulls the latest 2.2.0 wheel compiled for CUDA 12.1, but the GPU driver may only support CUDA 11.8.
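bitsandbytes fails at runtime with ImportError: libcuda.so.1: cannot open shared object file, which usually means the NVIDIA driver library is not visible to the process (for example in a container started without GPU access), or it complains that its prebuilt binary targets a different CUDA version than the one torch uses.
flash‑attn often compiles its CUDA kernels at install time, so the install fails when nvcc is missing or does not match the CUDA version torch was built against.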
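
Before debugging any of these, it helps to print what is actually installed. The sketch below uses only the standard library; the function names are illustrative, and the comparison rule (a wheel's CUDA version must not exceed what the driver supports) is a deliberately conservative simplification.

```python
from importlib import metadata

# Packages named in this section; adjust the tuple for your own stack.
PACKAGES = ("torch", "transformers", "bitsandbytes", "flash-attn", "accelerate")


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def major_minor(version):
    """Turn '12.1' or '11.8.0' into a comparable (major, minor) tuple."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def wheel_runs_on_driver(wheel_cuda, driver_cuda):
    """Conservative rule: the wheel's CUDA version must not exceed the driver's."""
    return major_minor(wheel_cuda) <= major_minor(driver_cuda)


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:14s} {version or 'not installed'}")
```

The driver's highest supported CUDA version appears in the header of nvidia-smi, and torch.version.cuda reports the version the installed wheel was built for; passing both to wheel_runs_on_driver flags the 12.1‑versus‑11.8 mismatch from the first bullet.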

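A pinned conda environment avoids most of these problems:

# environment.yml
name: openhands-env
channels:
  - pytorch        # listed first so torch, torchvision and torchaudio come from here
  - nvidia         # provides the CUDA runtime packages behind pytorch-cuda
  - conda-forge
dependencies:
  - python=3.10
  - pytorch=2.1.0
  - pytorch-cuda=11.8   # selects the CUDA 11.8 build (replaces the older cudatoolkit pin)
  - torchvision=0.16.0
  - torchaudio=2.1.0
  - pip
  - pip:
      - transformers==4.38.2
      - bitsandbytes==0.42.0
      - flash-attn==2.5.6
      - openhands==0.1.0
      - accelerate==0.28.0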

If bitsandbytes still picks the wrong CUDA binary, override its detection:

export BNB_CUDA_VERSION=118   # selects the CUDA 11.8 build, matching the environment

On Windows, install the CUDA toolkit separately and ensure nvcc is on the PATH. A common pitfall is a mismatched MSVC runtime between the conda‑installed cudatoolkit and the system‑wide CUDA driver, which leads to DLL load failed errors.

Another gotcha: torchvision and torchaudio must be pulled from the pytorch channel, not conda-forge. Mixing channels can silently swap in a CPU‑only or differently built torch, which shows up as torch.cuda.is_available() returning False or as cryptic errors like RuntimeError: Expected all tensors to be on the same device after calling model.to('cuda').

Finally, keep an eye on where the Hugging Face libraries cache downloads. Model weights are written to the directory given by HF_HOME (or the older TRANSFORMERS_CACHE variable). If that location is read‑only, as can happen in containers or shared installs, the first run crashes with a permission error, so point it at a writable directory.

# ~/.bashrc or equivalent
export HF_HOME="$HOME/.cache/huggingface"
export TRANSFORMERS_CACHE="$HF_HOME/transformers"   # older variable, deprecated in recent transformers releases
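
A quick way to confirm the setting took effect is to resolve the cache directory the same way and try writing to it. This is a sketch with made‑up helper names; the lookup order (TRANSFORMERS_CACHE, then HF_HOME, then ~/.cache/huggingface) is a simplification of what the libraries do.

```python
import os
import tempfile
from pathlib import Path


def resolve_cache_dir(env=os.environ):
    """Return the cache directory implied by the environment variables."""
    for var in ("TRANSFORMERS_CACHE", "HF_HOME"):
        value = env.get(var)
        if value:
            return Path(value).expanduser()
    return Path.home() / ".cache" / "huggingface"


def is_writable(directory):
    """Create the directory if needed and try to write a temporary file."""
    try:
        directory.mkdir(parents=True, exist_ok=True)
        with tempfile.TemporaryFile(dir=directory):
            pass
        return True
    except OSError:
        return False


if __name__ == "__main__":
    cache_dir = resolve_cache_dir()
    status = "writable" if is_writable(cache_dir) else "NOT writable"
    print(f"{cache_dir}: {status}")
```

Running this once inside the activated environment, before the first model download, surfaces the permission problem without waiting for a multi‑gigabyte transfer to fail.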