vllm-installer
This skill should be used when users need to install, configure, debug, or run vLLM inference server on NVIDIA GPUs (especially B200/H100/A100). It covers installation from PyPI or source, dependency management, environment setup, common error diagnosis and fixes, tensor parallelism configuration, and server startup/testing. The skill automatically checks for LSSD mount status and DeepEP installation for MoE models.
SKILL.md
License: MIT
vLLM Installer
This skill provides comprehensive guidance for installing, configuring, and debugging vLLM on NVIDIA GPUs with CUDA 12.x.
When to Use This Skill
- Installing vLLM on NVIDIA GPUs (B200/H100/A100)
- Debugging vLLM installation errors (missing libraries, version conflicts)
- Configuring tensor parallelism for different model architectures
- Setting up environment variables for CUDA and NVIDIA libraries
- Starting and testing the vLLM OpenAI-compatible API server
- Fixing common runtime errors (cuDNN, cusparseLt, FlashInfer issues)
Version Information (as of v0.14.1)
| Component | Version | Notes |
|---|---|---|
| vLLM | 0.14.1 | Latest stable (v0.15.0rc2 not yet on PyPI) |
| flashinfer-python | 0.5.3 | Attention backend |
| flashinfer-cubin | 0.5.3 | Must match flashinfer-python version |
| nixl | 0.9.0 | KV cache transfer (DMA-BUF, recommended for PD disaggregation) |
| nvidia-nccl-cu12 | 2.28.3 | Force reinstall |
| nvidia-cudnn-cu12 | 9.16.0.29 | Required for PyTorch 2.9+ |
| bitsandbytes | 0.46.1 | Quantization support |
| numpy | <2.3 | Required for numba compatibility |
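The exact pins in this table can be compared against the active environment. The following is a minimal sketch using only `importlib.metadata` from the standard library; the helper names (`installed_version`, `find_mismatches`, `report`) are illustrative and not part of this skill's scripts, and the range-style numpy requirement is left out because the sketch only handles exact pins.

```python
from importlib import metadata

# Exact pins copied from the version table above.
EXPECTED = {
    "vllm": "0.14.1",
    "flashinfer-python": "0.5.3",
    "flashinfer-cubin": "0.5.3",
    "nixl": "0.9.0",
    "nvidia-nccl-cu12": "2.28.3",
    "nvidia-cudnn-cu12": "9.16.0.29",
    "bitsandbytes": "0.46.1",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Map each package whose installed version differs to (wanted, found)."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


def report(expected=EXPECTED, lookup=installed_version):
    """Print one line per package that is missing or on a different version."""
    for package, (wanted, found) in find_mismatches(expected, lookup).items():
        print(f"{package}: expected {wanted}, found {found or 'not installed'}")
```

Calling `report()` after the installation steps prints nothing when every pin matches. The `lookup` parameter exists only so the comparison can be exercised without touching the real environment.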
DeepSeek-V3 FP8 on Blackwell (B200)
vLLM 0.14.1 handles DeepSeek-V3 FP8 correctly on Blackwell GPUs. SGLang 0.5.8, by contrast, produces garbage output because of an FP8 scale format mismatch.
| Framework | DeepSeek-V3 FP8 on Blackwell |
|---|---|
| vLLM 0.14.1 | ✅ Works correctly |
| SGLang 0.5.8 | ❌ Garbage output (scale format mismatch) |
If you need to run DeepSeek-V3 on Blackwell (B200) GPUs, use vLLM:
vllm serve deepseek-ai/DeepSeek-V3 \
--tensor-parallel-size 8 \
--port 8000 \
--download-dir /lssd/huggingface/hub \
--trust-remote-code \
--max-model-len 4096
Pre-Installation Checks
Step 0: Check Prerequisites
Before installing vLLM, the skill automatically checks:
- LSSD Mount Status - High-speed local SSD for model caching
- DeepEP Installation - Required for MoE models (DeepSeek-V3, DeepSeek-R1)
LSSD Check
# Check if /lssd is mounted
if mountpoint -q /lssd 2>/dev/null; then
    echo "✓ LSSD is mounted: $(df -h /lssd | tail -1 | awk '{print $2}')"
else
    echo "✗ LSSD is not mounted"
    echo " Run: /lssd-mounter"
fi
If LSSD is not mounted, use the lssd-mounter skill:
/lssd-mounter
DeepEP Check (for MoE models)
# Check if DeepEP is installed
python3 -c "import deep_ep; print('✓ DeepEP installed')" 2>/dev/null || \
python3 -c "import deepep; print('✓ DeepEP installed')" 2>/dev/null || \
echo "✗ DeepEP not installed (required for MoE models)"
If DeepEP is not installed and you need to run MoE models, use the deepep-installer skill:
/deepep-installer
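The same two-name lookup can be done from Python, which is convenient inside a diagnostic script. This is a sketch under the assumption that locating the module is enough; `importlib.util.find_spec` does not load the extension, so it will not catch the ABI errors described later in this document. The function names are illustrative.

```python
import importlib.util

# Module names tried by the shell check above.
DEEPEP_MODULE_NAMES = ("deep_ep", "deepep")


def first_available(names, find_spec=importlib.util.find_spec):
    """Return the first module name that can be located, or None."""
    for name in names:
        try:
            if find_spec(name) is not None:
                return name
        except (ImportError, ValueError):
            # find_spec can raise for broken or half-installed packages.
            continue
    return None


def deepep_status():
    """Return a one-line status message in the style of the shell check."""
    name = first_available(DEEPEP_MODULE_NAMES)
    if name is None:
        return "✗ DeepEP not installed (required for MoE models)"
    return f"✓ DeepEP installed (module: {name})"
```

`print(deepep_status())` gives the same one-line result as the shell version without spawning two interpreters.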
Installation Workflow
Pre-requisites (Ubuntu 24.04)
Ubuntu 24.04 doesn't include pip by default. Install it first:
sudo apt-get update
sudo apt-get install -y python3-pip
Step 1: Environment Setup
To set up the environment, ensure CUDA is properly configured:
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
# Set HuggingFace cache to LSSD (if available)
if [ -d /lssd/huggingface ]; then
    export HF_HOME=/lssd/huggingface
fi
Step 2: Install PyTorch
pip install torch torchvision torchaudio \
--index-url https://download.pytorch.org/whl/cu129
Step 3: Install vLLM
# Install from PyPI (recommended)
pip install vllm==0.14.1 \
--extra-index-url https://download.pytorch.org/whl/cu129
Step 4: Install NVIDIA Libraries
These libraries must be installed with --force-reinstall --no-deps to avoid version conflicts:
pip install nvidia-nccl-cu12==2.28.3 --force-reinstall --no-deps
pip install nvidia-cudnn-cu12==9.16.0.29 --force-reinstall --no-deps
pip install nvidia-cusparselt-cu12 --force-reinstall --no-deps
Step 5: Install FlashInfer
FlashInfer is the recommended attention backend for vLLM:
pip install flashinfer-python==0.5.3 flashinfer-cubin==0.5.3
Important: flashinfer-python and flashinfer-cubin versions MUST match exactly.
⚠️ WARNING: FlashInfer may change PyTorch version!
FlashInfer installation can upgrade PyTorch from 2.9.1 to 2.10.0, which breaks:
- vLLM (requires PyTorch 2.9.1)
- sgl-kernel (ABI mismatch)
- DeepEP (ABI mismatch)
After installing FlashInfer, always re-pin PyTorch and the NVIDIA libraries:
pip install flashinfer-python==0.5.3 flashinfer-cubin==0.5.3
pip install torch==2.9.1+cu129 --index-url https://download.pytorch.org/whl/cu129 --force-reinstall
pip install nvidia-nccl-cu12==2.28.3 nvidia-cudnn-cu12==9.16.0.29 --force-reinstall --no-deps
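To notice the version drift described above, the installed PyTorch version can be read from package metadata without importing `torch`. The sketch below assumes the pin is `2.9.1` as stated in this section and that local build suffixes such as `+cu129` should be ignored; the helper names are hypothetical.

```python
from importlib import metadata


def base_version(version):
    """Strip a local build suffix such as '+cu129' from a version string."""
    return version.split("+", 1)[0]


def torch_drifted(installed, pinned="2.9.1"):
    """Return True when the installed torch release differs from the pin."""
    return base_version(installed) != pinned


def check_torch(pinned="2.9.1"):
    """Describe the state of the torch install using package metadata only."""
    try:
        installed = metadata.version("torch")
    except metadata.PackageNotFoundError:
        return "torch is not installed"
    if torch_drifted(installed, pinned):
        return f"torch {installed} found, expected {pinned}: re-pin as shown above"
    return f"torch {installed} matches the pin"
```

Running `print(check_torch())` right after any `pip install` that might pull in a different PyTorch is a cheap way to catch the problem before vLLM, sgl-kernel, or DeepEP fail at import time.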
Step 6: Configure LD_LIBRARY_PATH
To fix library loading issues, run scripts/setup_env.sh or manually set:
# Collect all nvidia pip package lib paths
NVIDIA_LIB_PATHS=""
for d in /usr/local/lib/python3.*/dist-packages/nvidia/*/lib; do
    [ -d "$d" ] && NVIDIA_LIB_PATHS="${d}:${NVIDIA_LIB_PATHS}"
done
for d in "$HOME"/.local/lib/python3.*/site-packages/nvidia/*/lib; do
    [ -d "$d" ] && NVIDIA_LIB_PATHS="${d}:${NVIDIA_LIB_PATHS}"
done
export LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${NVIDIA_LIB_PATHS}${LD_LIBRARY_PATH}"
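The path collection above can also be expressed in Python, which avoids hard-coding the interpreter version in the glob. This is a sketch, not the contents of scripts/setup_env.sh; the function names are illustrative, and unlike the shell loop it appends directories in sorted order rather than prepending them.

```python
import glob
import os


def nvidia_lib_dirs(site_roots):
    """Find nvidia/*/lib directories under the given site-packages roots."""
    found = []
    for root in site_roots:
        pattern = os.path.join(root, "nvidia", "*", "lib")
        found.extend(sorted(d for d in glob.glob(pattern) if os.path.isdir(d)))
    return found


def build_ld_library_path(cuda_home, lib_dirs, existing=""):
    """Join CUDA's lib64, the NVIDIA wheel dirs, and any existing value."""
    parts = [os.path.join(cuda_home, "lib64"), *lib_dirs]
    # Skip empty entries so the result has no stray or trailing colons.
    parts.extend(p for p in existing.split(":") if p)
    return ":".join(parts)
```

One possible use is to pass `site.getsitepackages() + [site.getusersitepackages()]` as `site_roots`, feed the result and `os.environ.get("LD_LIBRARY_PATH", "")` into `build_ld_library_path`, and print an `export LD_LIBRARY_PATH=...` line for the shell to evaluate.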
Common Errors and Fixes
Error: libcudnn.so.9 not found
Symptom:
ImportError: libcudnn.so.9: cannot open shared object file
Fix:
pip install nvidia-cudnn-cu12==9.16.0.29 --force-reinstall --no-deps
# Then set LD_LIBRARY_PATH as described above
Error: libcusparseLt.so.0 not found
Symptom:
ImportError: libcusparseLt.so.0: cannot open shared object file
Fix:
pip install nvidia-cusparselt-cu12 --force-reinstall --no-deps
# Then set LD_LIBRARY_PATH as described above
Error: FlashInfer version mismatch
Symptom:
ModuleNotFoundError: No module named 'flashinfer.jit.cubin_loader'
or
FLASHINFER_CUBIN_DIR not found
Diagnosis: flashinfer-python and flashinfer-cubin versions don't match.
Fix:
pip install flashinfer-python==0.5.3 flashinfer-cubin==0.5.3 --force-reinstall
Error: WorkerProc failed to start
Symptom:
ERROR: WorkerProc failed to start.
File "vllm/v1/attention/selector.py" ...
Diagnosis: Usually caused by FlashInfer import failure.
Fix: Check FlashInfer versions match and LD_LIBRARY_PATH is set correctly.
Error: assert self.total_num_heads % tp_size == 0
Symptom:
AssertionError: assert self.total_num_heads % tp_size == 0
Diagnosis: The model's attention head count is not divisible by the tensor parallelism size.
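Fix: Choose a --tensor-parallel-size value that divides the model's attention head count. The table below lists head-count divisors only; models that use grouped-query attention may also be constrained by their KV-head count, so test unusual values such as 7 or 14 before relying on them: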
| Model | Attention Heads | Valid TP Values |
|---|---|---|
| Qwen2.5-7B | 28 | 1, 2, 4, 7, 14 |
| Qwen2.5-72B | 64 | 1, 2, 4, 8, 16, 32 |
| Llama-3-8B | 32 | 1, 2, 4, 8, 16, 32 |
| Llama-3-70B | 64 | 1, 2, 4, 8, 16, 32 |
| DeepSeek-R1 | 128 | 1, 2, 4, 8, 16, 32, 64 |
To find the attention head count for any model:
python3 -c "from transformers import AutoConfig; c = AutoConfig.from_pretrained('MODEL_NAME'); print(f'Attention heads: {c.num_attention_heads}')"
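Once the head count is known, the candidate values can be enumerated rather than guessed. The sketch below checks only the divisibility condition from the assertion above; `gpu_count` is an assumed parameter for the number of GPUs on the node, and other constraints a given model may impose are not modelled.

```python
def head_divisors(num_heads):
    """Return every positive integer that divides num_heads evenly."""
    return [tp for tp in range(1, num_heads + 1) if num_heads % tp == 0]


def candidate_tp_sizes(num_heads, gpu_count):
    """Tensor-parallel sizes that pass the head-count check and fit the node."""
    return [tp for tp in head_divisors(num_heads) if tp <= gpu_count]
```

For a 28-head model on an 8-GPU node, `candidate_tp_sizes(28, 8)` returns `[1, 2, 4, 7]`; for a 64-head model it returns `[1, 2, 4, 8]`.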
Starting the Server
To start the vLLM OpenAI-compatible API server:
# Load environment
source /vllm-workspace/vllm-env.sh
# Start server (adjust tp based on model architecture)
vllm serve Qwen/Qwen2.5-7B-Instruct \
--tensor-parallel-size 4 \
--port 8000 \
--host 0.0.0.0
Or using the Python module:
python3 -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct \
--tensor-parallel-size 4 \
--port 8000 \
--host 0.0.0.0
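Loading a large model can take several minutes, so scripts that start the server usually need to wait before sending requests. The following sketch polls the `/v1/models` endpoint that this document queries with curl; the retry count, delay, and function names are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request


def server_is_up(url, timeout=2.0):
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: not ready yet.
        return False


def wait_for_server(url, attempts=60, delay=5.0, probe=server_is_up, sleep=time.sleep):
    """Poll url until it responds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A call such as `wait_for_server("http://localhost:8000/v1/models")` returns `True` as soon as the server answers and `False` after roughly five minutes with the defaults shown.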
Disaggregated Prefill (PD Separation)
vLLM supports prefill-decode disaggregation where prefill and decode phases run on separate instances. This allows independent tuning of TTFT (time-to-first-token) and ITL (inter-token-latency).
KV Transfer Connectors
vLLM supports multiple KV transfer backends:
| Connector | Dependency | Use Case |
|---|---|---|
| NixlConnector | nixl | Recommended, uses DMA-BUF (no nvidia_peermem needed) |
| MooncakeConnector | mooncake | Requires nvidia_peermem kernel module |
| P2pNcclConnector | NCCL | Same-node P2P transfer |
| LMCacheConnector | lmcache | External KV cache |
Installing NIXL for Disaggregation
pip install --break-system-packages nixl==0.9.0
# IMPORTANT: NIXL may downgrade NVIDIA libraries, reinstall correct versions:
pip install nvidia-nccl-cu12==2.28.3 --force-reinstall --no-deps
pip install nvidia-cudnn-cu12==9.16.0.29 --force-reinstall --no-deps
Verifying NIXL
python3 -c "
from vllm.distributed.kv_transfer.kv_connector.v1.nixl_connector import NixlConnector
print('NixlConnector OK')
"
Disaggregated Prefill Configuration
Prefill Node:
vllm serve deepseek-ai/DeepSeek-V3 \
--tensor-parallel-size 8 \
--port 8100 \
--download-dir /lssd/huggingface/hub \
--kv-transfer-config '{
"kv_connector": "NixlConnector",
"kv_role": "kv_both",
"kv_buffer_device": "cuda"
}'
Decode Node:
vllm serve deepseek-ai/DeepSeek-V3 \
--tensor-parallel-size 8 \
--port 8200 \
--download-dir /lssd/huggingface/hub \
--kv-transfer-config '{
"kv_connector": "NixlConnector",
"kv_role": "kv_both",
"kv_buffer_device": "cuda"
}'
KV Transfer Config Parameters
| Parameter | Values | Description |
|---|---|---|
| kv_connector | NixlConnector, MooncakeConnector, P2pNcclConnector | KV transfer backend |
| kv_role | kv_producer, kv_consumer, kv_both | Role in KV transfer (kv_both for most cases) |
| kv_buffer_device | cuda, cpu | Buffer device (cuda recommended, cpu for TPU) |
| kv_ip | IP address | Connector IP for distributed connection |
| kv_port | Port number | Connector port (default: 14579) |
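Hand-written JSON inside shell quotes is easy to get wrong, so it can help to generate the `--kv-transfer-config` value programmatically. This sketch validates only the `kv_role` and `kv_buffer_device` values listed in the table above and builds the same command shape as the prefill and decode examples; the function names are hypothetical and the connector name is passed through unchecked.

```python
import json
import shlex

# Allowed values copied from the parameter table above.
ALLOWED_ROLES = {"kv_producer", "kv_consumer", "kv_both"}
ALLOWED_DEVICES = {"cuda", "cpu"}


def kv_transfer_config(connector="NixlConnector", role="kv_both", device="cuda"):
    """Return the JSON string passed to --kv-transfer-config."""
    if role not in ALLOWED_ROLES:
        raise ValueError(f"unknown kv_role: {role}")
    if device not in ALLOWED_DEVICES:
        raise ValueError(f"unknown kv_buffer_device: {device}")
    return json.dumps(
        {"kv_connector": connector, "kv_role": role, "kv_buffer_device": device}
    )


def serve_command(model, port, tp_size, config_json):
    """Build a shell-safe `vllm serve` command line."""
    args = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp_size),
        "--port", str(port),
        "--kv-transfer-config", config_json,
    ]
    return shlex.join(args)
```

For example, `serve_command("deepseek-ai/DeepSeek-V3", 8100, 8, kv_transfer_config())` yields a single quoted command line equivalent to the prefill example, minus the `--download-dir` flag.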
NIXL vs Mooncake
| Feature | NIXL | Mooncake |
|---|---|---|
| Memory registration | DMA-BUF (kernel native) | nvidia_peermem (kernel module) |
| Transport | UCX (TCP/RDMA/SHM) | RDMA or TCP |
| Kernel module required | No | nvidia_peermem (may fail on Open driver) |
| NVIDIA Open driver compatible | Yes | No |
| Recommended for | B200, new deployments | Legacy systems |
Recommendation: Use NixlConnector for new deployments, especially on systems with NVIDIA Open Kernel Module.
Testing the Server
To verify the server is working:
# List models
curl http://localhost:8000/v1/models
# Chat completion test
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 50
}'
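The same chat completion test can be scripted with the Python standard library, which makes it easier to assert on the reply. This is a minimal sketch that mirrors the curl payload above and assumes the usual OpenAI-style response shape (`choices[0].message.content`); the function names are illustrative.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=50):
    """Create a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_reply(response_body):
    """Extract the assistant text from an OpenAI-style response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


def chat(base_url, model, prompt, timeout=60):
    """Send one prompt and return the reply text (needs a running server)."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return first_reply(response.read())
```

With the server from the previous section running, `chat("http://localhost:8000", "Qwen/Qwen2.5-7B-Instruct", "Hello!")` returns the model's reply as a string.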
Diagnostic Script
To diagnose installation issues, run scripts/diagnose.py:
python3 scripts/diagnose.py
This script checks:
- CUDA installation and version
- PyTorch CUDA compatibility
- vLLM version and import status
- FlashInfer versions (python and cubin must match)
- Required library availability
- GPU detection and memory
- LD_LIBRARY_PATH configuration
- LSSD mount status (prompts to run /lssd-mounter if not mounted)
- DeepEP installation (prompts to run /deepep-installer if not installed, for MoE models)
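As an illustration of the import-status check in the list above, the sketch below imports each module in turn and records either its version or the error it raised, so one broken package does not hide the state of the others. It is not the code of scripts/diagnose.py; the module list and function name are assumptions.

```python
import importlib

# Import names for a few of the packages this document installs.
CORE_MODULES = ("torch", "vllm", "flashinfer")


def probe_imports(names, importer=importlib.import_module):
    """Import each module and record its version, or the error it raised."""
    results = {}
    for name in names:
        try:
            module = importer(name)
        except Exception as exc:  # ImportError, OSError from missing .so files, etc.
            results[name] = f"FAILED: {type(exc).__name__}: {exc}"
        else:
            results[name] = getattr(module, "__version__", "OK")
    return results
```

Iterating over `probe_imports(CORE_MODULES).items()` and printing each pair gives a compact report; errors such as a missing `libcudnn.so.9` appear next to the module that triggered them.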
Dependency Skills
LSSD Mounter
If LSSD is not mounted, the diagnostic script will prompt:
⚠ LSSD: Not mounted
High-speed local SSD recommended for model caching
To mount LSSD, use the lssd-mounter skill:
/lssd-mounter
DeepEP Installer
If DeepEP is not installed and you plan to run MoE models:
⚠ DeepEP: Not installed
DeepEP is required for MoE models (DeepSeek-V3, DeepSeek-R1)
To install DeepEP, use the deepep-installer skill:
/deepep-installer
Workflow Integration
When diagnosing vLLM installation:
- Run diagnose.py to check all prerequisites
- If LSSD not mounted → prompt to run /lssd-mounter
- If DeepEP not installed and MoE models needed → prompt to run /deepep-installer
- After dependencies installed, re-run diagnostic
- Proceed with vLLM server startup
Pre-downloading DeepSeek Weights
For faster DeepSeek-V3/R1 model loading, pre-download weights from GCS instead of HuggingFace:
# Check if already downloaded
DEEPSEEK_PATH="/lssd/huggingface/hub/models--deepseek-ai--DeepSeek-V3"
if [ -d "$DEEPSEEK_PATH" ]; then
    echo "✓ DeepSeek-V3 weights already exist: $DEEPSEEK_PATH"
    du -sh "$DEEPSEEK_PATH"
else
    echo "Downloading DeepSeek-V3 weights from GCS..."
    # Report success only if the copy actually succeeded
    if gcloud storage cp -r gs://chrisya-gpu-pg-ase1/huggingface /lssd/; then
        echo "✓ DeepSeek-V3 weights downloaded"
    else
        echo "✗ Download failed" >&2
    fi
fi
Notes:
- GCS bucket gs://chrisya-gpu-pg-ase1/huggingface contains pre-cached DeepSeek-V3 FP8 weights
- Downloading from GCS is much faster than HuggingFace (same-region high bandwidth)
- Weights are ~600GB, including complete safetensors files
- Use --download-dir /lssd/huggingface/hub when starting vLLM server
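The cache folder name used in the shell check follows the `models--<org>--<name>` pattern, so the path can be derived from the repository id instead of being typed by hand. The sketch below assumes that pattern and treats any non-empty folder as "present"; it does not verify that the download is complete. The function names are hypothetical.

```python
import os


def hub_cache_dir(hub_root, repo_id):
    """Map a repo id like 'org/name' to its models--org--name cache folder."""
    return os.path.join(hub_root, "models--" + repo_id.replace("/", "--"))


def weights_present(hub_root, repo_id):
    """Return True if the cache folder exists and contains at least one entry."""
    path = hub_cache_dir(hub_root, repo_id)
    if not os.path.isdir(path):
        return False
    with os.scandir(path) as entries:
        return any(entries)
```

For this document's layout, `hub_cache_dir("/lssd/huggingface/hub", "deepseek-ai/DeepSeek-V3")` gives the same path as `DEEPSEEK_PATH` in the shell snippet.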
vLLM vs SGLang
| Feature | vLLM | SGLang |
|---|---|---|
| Attention | PagedAttention | RadixAttention |
| Strength | High throughput batch | Multi-turn, structured output |
| API | OpenAI compatible | OpenAI compatible |
| Default Port | 8000 | 30000 |
Coexistence and Dependency Conflicts
vLLM and SGLang can coexist on the same system, but their pinned dependency versions conflict:
| Package | SGLang wants | vLLM installs |
|---|---|---|
| grpcio | 1.75.1 | 1.76.0 |
| timm | 1.0.16 | 1.0.24 |
| xgrammar | 0.1.27 | 0.1.29 |
| llguidance | <0.8.0,>=0.7.11 | 1.3.0 |
Recommendations:
- Production: Use separate Python virtual environments for each
- Development: Accept mismatches (usually works for basic inference)
- MoE models: Install DeepEP first, then either framework
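To see which of the mismatches in the table actually violates a requirement, a version can be tested against a requirement string. The following is a deliberately small sketch that handles only plain numeric versions and the operators `==`, `>=`, `<=`, `>`, `<`; it is not a full implementation of Python's version specifier rules, and the names are illustrative.

```python
import operator
import re

# Rows copied from the conflict table above: package -> (SGLang wants, vLLM installs).
CONFLICT_TABLE = {
    "grpcio": ("1.75.1", "1.76.0"),
    "timm": ("1.0.16", "1.0.24"),
    "xgrammar": ("0.1.27", "0.1.29"),
    "llguidance": ("<0.8.0,>=0.7.11", "1.3.0"),
}

OPERATORS = {
    "==": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}
CLAUSE = re.compile(r"\s*(==|>=|<=|>|<)?\s*(\d+(?:\.\d+)*)\s*")


def parse_version(text):
    """Turn '1.75.1' into (1, 75, 1); only plain numeric versions are handled."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies(installed, requirement):
    """Check a version against a requirement such as '<0.8.0,>=0.7.11' or '1.75.1'."""
    have = parse_version(installed)
    for clause in requirement.split(","):
        match = CLAUSE.fullmatch(clause)
        if match is None:
            raise ValueError(f"unsupported requirement clause: {clause!r}")
        want = parse_version(match.group(2))
        # Pad with zeros so that 1.0 and 1.0.0 compare as equal.
        width = max(len(have), len(want))
        left = have + (0,) * (width - len(have))
        right = want + (0,) * (width - len(want))
        # A bare version with no operator is treated as an exact pin.
        if not OPERATORS[match.group(1) or "=="](left, right):
            return False
    return True
```

Every row of `CONFLICT_TABLE` fails `satisfies(installed, wanted)`, which matches the table's framing as conflicts. For anything beyond a quick check, the `packaging` library's `SpecifierSet` is likely a more robust choice.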
sgl-kernel ABI Issues After vLLM Install
When vLLM is installed after SGLang, you may see sgl-kernel ABI errors:
Symptom:
ImportError: .../sgl_kernel/sm100/common_ops.abi3.so: undefined symbol: _ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_ib
Root Cause: vLLM pins PyTorch 2.9.1, but sgl-kernel may have been compiled against a different version.
Fix:
# Reinstall sgl-kernel after vLLM
pip install sgl-kernel==0.3.21 --force-reinstall --no-deps
pip install nvidia-nccl-cu12==2.28.3 nvidia-cudnn-cu12==9.16.0.29 --force-reinstall --no-deps
DeepEP Recompilation After vLLM Install
IMPORTANT: If DeepEP was installed before vLLM, you may need to recompile DeepEP after installing vLLM due to PyTorch ABI changes.
Symptom:
ImportError: .../deep_ep_cpp.cpython-312-x86_64-linux-gnu.so: undefined symbol: _ZNK3c1010TensorImpl15incref_pyobjectEv
Fix:
cd /tmp/deepep_build # or wherever DeepEP was cloned
rm -rf build/ dist/ *.egg-info
rm -rf ~/.local/lib/python3.12/site-packages/deep_ep-*.egg # Remove old egg
export CUDA_HOME=/usr/local/cuda-12.9
export LD_LIBRARY_PATH=/opt/deepep/nvshmem/lib:/opt/deepep/gdrcopy/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$CUDA_HOME/lib64/stubs:$LIBRARY_PATH
TORCH_CUDA_ARCH_LIST="10.0" NVSHMEM_DIR=/opt/deepep/nvshmem python3 setup.py install --user
Resources
- scripts/diagnose.py - Diagnostic script for installation issues
- scripts/setup_env.sh - Environment variable setup script
- references/version_matrix.md - Version compatibility matrix
- references/troubleshooting.md - Extended troubleshooting guide
Unified Environment Script
After installing DeepEP + SGLang + vLLM, use the unified environment script:
source /opt/deepep/unified-env.sh
This sets up all necessary environment variables for the complete stack.
Post-Installation Verification
After completing all installations, verify everything works:
source /opt/deepep/unified-env.sh
python3 -c "
import torch; print(f'PyTorch: {torch.__version__}')
import deep_ep; print('DeepEP: OK')
import sglang; print(f'SGLang: {sglang.__version__}')
import sgl_kernel; print(f'sgl-kernel: {sgl_kernel.__version__}')
import vllm; print(f'vLLM: {vllm.__version__}')
import nixl; print('NIXL: OK')
"
Version History
- 2026-01-29: DeepSeek-V3 FP8 Blackwell compatibility
  - NEW: Documented vLLM correctly handles DeepSeek-V3 FP8 on Blackwell (B200) GPUs
  - NEW: Added comparison table showing vLLM works while SGLang produces garbage output
  - NEW: Added numpy <2.3 version requirement (numba compatibility)
  - RECOMMENDATION: Use vLLM for DeepSeek-V3 on Blackwell
- 2026-01-29: Added FlashInfer PyTorch version warning
  - CRITICAL: Documented FlashInfer changing PyTorch to 2.10 issue
  - NEW: Added fix commands after FlashInfer installation
  - NEW: Added unified environment script reference
  - NEW: Added post-installation verification commands
- 2026-01-29: Added GCS DeepSeek weights pre-download
  - NEW: Added "Pre-downloading DeepSeek Weights" section in Dependency Skills
  - GCS source: gs://chrisya-gpu-pg-ase1/huggingface
  - Faster than HuggingFace download (same-region bandwidth)
- 2026-01-29: Added Disaggregated Prefill (PD Separation) with NIXL support
  - NEW: Added NixlConnector as recommended KV transfer backend
  - NEW: Added disaggregated prefill configuration examples
  - NEW: Added KV transfer config parameters documentation
  - NEW: Added NIXL vs Mooncake comparison table
  - Added nixl 0.9.0 to version table
- 2026-01-29: Updated based on full installation workflow with DeepEP + SGLang + vLLM
  - CRITICAL: Added DeepEP recompilation requirement after vLLM install
  - Documented PyTorch ABI incompatibility issue and fix
  - Added detailed dependency conflict table (numpy, llguidance versions)
- 2026-01-28: Updated based on installation experience
  - Added pip installation for Ubuntu 24.04
  - Added FlashInfer 0.5.3 unified version with SGLang
  - Documented SGLang/vLLM dependency conflicts with specific versions