Agent Skill
2/7/2026

vllm-installer

This skill should be used when users need to install, configure, debug, or run vLLM inference server on NVIDIA GPUs (especially B200/H100/A100). It covers installation from PyPI or source, dependency management, environment setup, common error diagnosis and fixes, tensor parallelism configuration, and server startup/testing. The skill automatically checks for LSSD mount status and DeepEP installation for MoE models.

yangwhale
npx skills add yangwhale/gpu-tpu-pedia

SKILL.md

license: MIT

vLLM Installer

This skill provides comprehensive guidance for installing, configuring, and debugging vLLM on NVIDIA GPUs with CUDA 12.x.

When to Use This Skill

  • Installing vLLM on NVIDIA GPUs (B200/H100/A100)
  • Debugging vLLM installation errors (missing libraries, version conflicts)
  • Configuring tensor parallelism for different model architectures
  • Setting up environment variables for CUDA and NVIDIA libraries
  • Starting and testing vLLM OpenAI-compatible API server
  • Fixing common runtime errors (cuDNN, cusparseLt, FlashInfer issues)

Version Information (as of v0.14.1)

| Component | Version | Notes |
|---|---|---|
| vLLM | 0.14.1 | Latest stable (v0.15.0rc2 not yet on PyPI) |
| flashinfer-python | 0.5.3 | Attention backend |
| flashinfer-cubin | 0.5.3 | Must match flashinfer-python version |
| nixl | 0.9.0 | KV cache transfer (DMA-BUF, recommended for PD disaggregation) |
| nvidia-nccl-cu12 | 2.28.3 | Force reinstall |
| nvidia-cudnn-cu12 | 9.16.0.29 | Required for PyTorch 2.9+ |
| bitsandbytes | 0.46.1 | Quantization support |
| numpy | <2.3 | Required for numba compatibility |
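
The exact pins in the table can be compared against the active environment. The following is a minimal sketch, assuming the packages were installed with pip under the distribution names shown in the table. `PINNED`, `installed_version`, and `check_pins` are illustrative names, not part of vLLM or of this skill's scripts.

```python
from importlib import metadata

# Exact pins copied from the version table (numpy is a range, so it is omitted here).
PINNED = {
    "vllm": "0.14.1",
    "flashinfer-python": "0.5.3",
    "flashinfer-cubin": "0.5.3",
    "nixl": "0.9.0",
    "nvidia-nccl-cu12": "2.28.3",
    "nvidia-cudnn-cu12": "9.16.0.29",
    "bitsandbytes": "0.46.1",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Return {package: (expected, found)} for every package that deviates from its pin."""
    mismatches = {}
    for name, expected in pins.items():
        found = lookup(name)
        # Ignore a local build tag such as "+cu129" when comparing.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every pinned package is present at the listed version. The `lookup` parameter exists only so the comparison can be exercised without touching the real environment.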

DeepSeek-V3 FP8 on Blackwell (B200)

vLLM 0.14.1 correctly handles DeepSeek-V3 FP8 on Blackwell GPUs, whereas SGLang 0.5.8 produces garbage output because of an FP8 scale format incompatibility.

| Framework | DeepSeek-V3 FP8 on Blackwell |
|---|---|
| vLLM 0.14.1 | ✅ Works correctly |
| SGLang 0.5.8 | ❌ Garbage output (scale format mismatch) |

If you need to run DeepSeek-V3 on Blackwell (B200) GPUs, use vLLM:

vllm serve deepseek-ai/DeepSeek-V3 \
    --tensor-parallel-size 8 \
    --port 8000 \
    --download-dir /lssd/huggingface/hub \
    --trust-remote-code \
    --max-model-len 4096

Pre-Installation Checks

Step 0: Check Prerequisites

Before installing vLLM, the skill automatically checks:

  1. LSSD Mount Status - High-speed local SSD for model caching
  2. DeepEP Installation - Required for MoE models (DeepSeek-V3, DeepSeek-R1)

LSSD Check

# Check if /lssd is mounted
if mountpoint -q /lssd 2>/dev/null; then
    echo "✓ LSSD is mounted: $(df -h /lssd | tail -1 | awk '{print $2}')"
else
    echo "✗ LSSD is not mounted"
    echo "  Run: /lssd-mounter"
fi

If LSSD is not mounted, use the lssd-mounter skill:

/lssd-mounter
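
The same mount check can be expressed in Python, which is how a script such as scripts/diagnose.py might perform it. The actual implementation of that script is not shown in this document, so the `lssd_status` helper below is a hypothetical sketch that relies only on the standard library.

```python
import os
import shutil


def lssd_status(path="/lssd"):
    """Return (mounted, total_bytes); total_bytes is None when path is not a mount point."""
    if not os.path.ismount(path):
        return False, None
    return True, shutil.disk_usage(path).total


if __name__ == "__main__":
    mounted, total = lssd_status()
    if mounted:
        print(f"LSSD is mounted: {total / 1e12:.1f} TB")
    else:
        print("LSSD is not mounted; run /lssd-mounter")
```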

DeepEP Check (for MoE models)

# Check if DeepEP is installed
python3 -c "import deep_ep; print('✓ DeepEP installed')" 2>/dev/null || \
python3 -c "import deepep; print('✓ DeepEP installed')" 2>/dev/null || \
echo "✗ DeepEP not installed (required for MoE models)"

If DeepEP is not installed and you need to run MoE models, use the deepep-installer skill:

/deepep-installer
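
A Python variant of the DeepEP check can test both module names without importing them. This is a sketch under the assumption that DeepEP is importable as either `deep_ep` or `deepep`, as in the shell check above. `find_deepep` is an illustrative helper name.

```python
from importlib.util import find_spec


def find_deepep(candidates=("deep_ep", "deepep")):
    """Return the first module name from candidates that can be located, or None."""
    for name in candidates:
        if find_spec(name) is not None:
            return name
    return None


if __name__ == "__main__":
    found = find_deepep()
    print(f"DeepEP found as {found}" if found else "DeepEP not installed (required for MoE models)")
```

Because `find_spec` only locates the module, it will not detect a DeepEP build that is present but fails at import time, for example after a PyTorch ABI change. The import-based shell check remains the stricter test.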

Installation Workflow

Pre-requisites (Ubuntu 24.04)

Ubuntu 24.04 doesn't include pip by default. Install it first:

sudo apt-get update
sudo apt-get install -y python3-pip

Step 1: Environment Setup

To set up the environment, ensure CUDA is properly configured:

export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH

# Set HuggingFace cache to LSSD (if available)
if [ -d /lssd/huggingface ]; then
    export HF_HOME=/lssd/huggingface
fi

Step 2: Install PyTorch

pip install torch torchvision torchaudio \
    --index-url https://download.pytorch.org/whl/cu129

Step 3: Install vLLM

# Install from PyPI (recommended)
pip install vllm==0.14.1 \
    --extra-index-url https://download.pytorch.org/whl/cu129

Step 4: Install NVIDIA Libraries

These libraries must be installed with --force-reinstall --no-deps to avoid version conflicts:

pip install nvidia-nccl-cu12==2.28.3 --force-reinstall --no-deps
pip install nvidia-cudnn-cu12==9.16.0.29 --force-reinstall --no-deps
pip install nvidia-cusparselt-cu12 --force-reinstall --no-deps

Step 5: Install FlashInfer

FlashInfer is the recommended attention backend for vLLM:

pip install flashinfer-python==0.5.3 flashinfer-cubin==0.5.3

Important: flashinfer-python and flashinfer-cubin versions MUST match exactly.

⚠️ WARNING: FlashInfer may change PyTorch version!

FlashInfer installation can upgrade PyTorch from 2.9.1 to 2.10.0, which breaks:

  • vLLM (requires PyTorch 2.9.1)
  • sgl-kernel (ABI mismatch)
  • DeepEP (ABI mismatch)

After installing FlashInfer, always reinstall the pinned PyTorch and NVIDIA libraries:

pip install flashinfer-python==0.5.3 flashinfer-cubin==0.5.3
pip install torch==2.9.1+cu129 --index-url https://download.pytorch.org/whl/cu129 --force-reinstall
pip install nvidia-nccl-cu12==2.28.3 nvidia-cudnn-cu12==9.16.0.29 --force-reinstall --no-deps
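
After the reinstall, it can be worth confirming that PyTorch is back on the expected release. The sketch below assumes the 2.9.1 requirement stated above. `base_version` and `torch_is_pinned` are illustrative helpers, and the version is read from package metadata so that torch itself is not imported.

```python
from importlib import metadata

EXPECTED_TORCH = "2.9.1"  # release the text above says vLLM requires


def base_version(version):
    """Strip a local build tag, e.g. '2.9.1+cu129' -> '2.9.1'."""
    return version.split("+", 1)[0]


def torch_is_pinned(installed, expected=EXPECTED_TORCH):
    """True when the installed torch version matches the expected release."""
    return installed is not None and base_version(installed) == expected


if __name__ == "__main__":
    try:
        installed = metadata.version("torch")
    except metadata.PackageNotFoundError:
        installed = None
    if torch_is_pinned(installed):
        print(f"torch {installed}: OK")
    else:
        print(f"torch {installed}: expected {EXPECTED_TORCH}; reinstall the pinned version")
```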

Step 6: Configure LD_LIBRARY_PATH

To fix library loading issues, run scripts/setup_env.sh or manually set:

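# Collect the lib directories of all pip-installed nvidia packages
# (system-wide dist-packages and per-user site-packages)
NVIDIA_LIB_PATHS=""
for d in /usr/local/lib/python3.*/dist-packages/nvidia/*/lib \
         "$HOME"/.local/lib/python3.*/site-packages/nvidia/*/lib; do
    [ -d "$d" ] && NVIDIA_LIB_PATHS="${d}:${NVIDIA_LIB_PATHS}"
done
# NVIDIA_LIB_PATHS ends with ':' when non-empty. Strip a trailing ':' from the result,
# because an empty entry makes the loader search the current directory.
LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${NVIDIA_LIB_PATHS}${LD_LIBRARY_PATH}"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH%:}"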
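
The path collection can also be done from Python, which avoids hard-coding the Python minor version in the glob. This is a sketch, not the contents of scripts/setup_env.sh. `nvidia_lib_dirs` and `build_ld_library_path` are hypothetical helpers, and the default `/usr/local/cuda` mirrors the CUDA_HOME value used in Step 1.

```python
import glob
import os
import site


def nvidia_lib_dirs(roots):
    """Return every <root>/nvidia/*/lib directory that exists, sorted within each root."""
    found = []
    for root in roots:
        for path in sorted(glob.glob(os.path.join(root, "nvidia", "*", "lib"))):
            if os.path.isdir(path):
                found.append(path)
    return found


def build_ld_library_path(cuda_home, lib_dirs, existing=""):
    """Join CUDA, pip-installed NVIDIA libs, and any existing value, skipping empty entries."""
    parts = [os.path.join(cuda_home, "lib64"), *lib_dirs]
    parts += [p for p in existing.split(":") if p]
    return ":".join(parts)


if __name__ == "__main__":
    roots = site.getsitepackages() + [site.getusersitepackages()]
    value = build_ld_library_path(
        os.environ.get("CUDA_HOME", "/usr/local/cuda"),
        nvidia_lib_dirs(roots),
        os.environ.get("LD_LIBRARY_PATH", ""),
    )
    # Print an export line that can be evaluated by the calling shell.
    print(f"export LD_LIBRARY_PATH={value}")
```

The script only prints the export line. A child process cannot change its parent shell's environment, so the output has to be evaluated or pasted into the shell that will launch vLLM.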

Common Errors and Fixes

Error: libcudnn.so.9 not found

Symptom:

ImportError: libcudnn.so.9: cannot open shared object file

Fix:

pip install nvidia-cudnn-cu12==9.16.0.29 --force-reinstall --no-deps
# Then set LD_LIBRARY_PATH as described above

Error: libcusparseLt.so.0 not found

Symptom:

ImportError: libcusparseLt.so.0: cannot open shared object file

Fix:

pip install nvidia-cusparselt-cu12 --force-reinstall --no-deps
# Then set LD_LIBRARY_PATH as described above

Error: FlashInfer version mismatch

Symptom:

ModuleNotFoundError: No module named 'flashinfer.jit.cubin_loader'

or

FLASHINFER_CUBIN_DIR not found

Diagnosis: flashinfer-python and flashinfer-cubin versions don't match.

Fix:

pip install flashinfer-python==0.5.3 flashinfer-cubin==0.5.3 --force-reinstall

Error: WorkerProc failed to start

Symptom:

ERROR: WorkerProc failed to start.
File "vllm/v1/attention/selector.py" ...

Diagnosis: Usually caused by a FlashInfer import failure.

Fix: Check that the flashinfer-python and flashinfer-cubin versions match and that LD_LIBRARY_PATH is set correctly.

Error: assert self.total_num_heads % tp_size == 0

Symptom:

AssertionError: assert self.total_num_heads % tp_size == 0

Diagnosis: The model's attention head count is not divisible by the tensor parallelism size.

Fix: Choose a --tensor-parallel-size value that divides the model's attention head count:

| Model | Attention Heads | Valid TP Values |
|---|---|---|
| Qwen2.5-7B | 28 | 1, 2, 4, 7, 14 |
| Qwen2.5-72B | 64 | 1, 2, 4, 8, 16, 32 |
| Llama-3-8B | 32 | 1, 2, 4, 8, 16, 32 |
| Llama-3-70B | 64 | 1, 2, 4, 8, 16, 32 |
| DeepSeek-R1 | 128 | 1, 2, 4, 8, 16, 32, 64 |

To find the attention head count for any model:

python3 -c "from transformers import AutoConfig; c = AutoConfig.from_pretrained('MODEL_NAME'); print(f'Attention heads: {c.num_attention_heads}')"
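
Once the head count is known, the candidate values for --tensor-parallel-size follow from the divisibility condition in the assertion. The helper below is a minimal sketch of that single condition. `valid_tp_sizes` is an illustrative name, and vLLM may apply further checks (for example on the key/value head count) that this sketch does not model.

```python
def valid_tp_sizes(num_heads, max_gpus=None):
    """Return the tensor-parallel sizes that divide num_heads, optionally capped at max_gpus."""
    limit = num_heads if max_gpus is None else min(num_heads, max_gpus)
    return [tp for tp in range(1, limit + 1) if num_heads % tp == 0]


if __name__ == "__main__":
    # Head counts taken from the table above; 8 GPUs per node is an assumed example.
    for model, heads in {"Qwen2.5-7B": 28, "Llama-3-8B": 32}.items():
        print(model, valid_tp_sizes(heads, max_gpus=8))
```

Capping at the number of GPUs in the node keeps the list to values that can actually be launched.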

Starting the Server

To start the vLLM OpenAI-compatible API server:

# Load environment
source /vllm-workspace/vllm-env.sh

# Start server (adjust tp based on model architecture)
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --tensor-parallel-size 4 \
    --port 8000 \
    --host 0.0.0.0

Or using the Python module:

python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --tensor-parallel-size 4 \
    --port 8000 \
    --host 0.0.0.0

Disaggregated Prefill (PD Separation)

vLLM supports prefill-decode (PD) disaggregation, in which the prefill and decode phases run on separate instances. This allows TTFT (time to first token) and ITL (inter-token latency) to be tuned independently.

KV Transfer Connectors

vLLM supports multiple KV transfer backends:

| Connector | Dependency | Use Case |
|---|---|---|
| NixlConnector | nixl | Recommended, uses DMA-BUF (no nvidia_peermem needed) |
| MooncakeConnector | mooncake | Requires nvidia_peermem kernel module |
| P2pNcclConnector | NCCL | Same-node P2P transfer |
| LMCacheConnector | lmcache | External KV cache |

Installing NIXL for Disaggregation

pip install --break-system-packages nixl==0.9.0

# IMPORTANT: NIXL may downgrade NVIDIA libraries; reinstall the pinned versions afterwards:
pip install nvidia-nccl-cu12==2.28.3 --force-reinstall --no-deps
pip install nvidia-cudnn-cu12==9.16.0.29 --force-reinstall --no-deps

Verifying NIXL

python3 -c "
from vllm.distributed.kv_transfer.kv_connector.v1.nixl_connector import NixlConnector
print('NixlConnector OK')
"

Disaggregated Prefill Configuration

Prefill Node:

vllm serve deepseek-ai/DeepSeek-V3 \
    --tensor-parallel-size 8 \
    --port 8100 \
    --download-dir /lssd/huggingface/hub \
    --kv-transfer-config '{
        "kv_connector": "NixlConnector",
        "kv_role": "kv_both",
        "kv_buffer_device": "cuda"
    }'

Decode Node:

vllm serve deepseek-ai/DeepSeek-V3 \
    --tensor-parallel-size 8 \
    --port 8200 \
    --download-dir /lssd/huggingface/hub \
    --kv-transfer-config '{
        "kv_connector": "NixlConnector",
        "kv_role": "kv_both",
        "kv_buffer_device": "cuda"
    }'

KV Transfer Config Parameters

| Parameter | Values | Description |
|---|---|---|
| kv_connector | NixlConnector, MooncakeConnector, P2pNcclConnector | KV transfer backend |
| kv_role | kv_producer, kv_consumer, kv_both | Role in KV transfer (kv_both for most cases) |
| kv_buffer_device | cuda, cpu | Buffer device (cuda recommended, cpu for TPU) |
| kv_ip | IP address | Connector IP for distributed connection |
| kv_port | Port number | Connector port (default: 14579) |
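
Hand-written JSON inside shell quotes is easy to get wrong, so the --kv-transfer-config value can be generated instead. The following sketch validates only the values listed in the table above. `kv_transfer_config` is a hypothetical helper, not a vLLM API, and the allowed sets reflect this document rather than vLLM's own validation.

```python
import json
import shlex

ALLOWED_ROLES = {"kv_producer", "kv_consumer", "kv_both"}  # from the parameter table
ALLOWED_DEVICES = {"cuda", "cpu"}


def kv_transfer_config(connector="NixlConnector", role="kv_both", device="cuda", **extra):
    """Return the JSON string for --kv-transfer-config after basic validation."""
    if role not in ALLOWED_ROLES:
        raise ValueError(f"unknown kv_role: {role}")
    if device not in ALLOWED_DEVICES:
        raise ValueError(f"unknown kv_buffer_device: {device}")
    config = {"kv_connector": connector, "kv_role": role, "kv_buffer_device": device}
    config.update(extra)  # optional keys such as kv_ip or kv_port
    return json.dumps(config)


if __name__ == "__main__":
    # shlex.quote makes the JSON safe to paste into a shell command line.
    print("--kv-transfer-config", shlex.quote(kv_transfer_config(kv_port=14579)))
```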

NIXL vs Mooncake

| Feature | NIXL | Mooncake |
|---|---|---|
| Memory registration | DMA-BUF (kernel native) | nvidia_peermem (kernel module) |
| Transport | UCX (TCP/RDMA/SHM) | RDMA or TCP |
| Kernel module required | No | nvidia_peermem (may fail on Open driver) |
| NVIDIA Open driver compatible | Yes | No |
| Recommended for | B200, new deployments | Legacy systems |

Recommendation: Use NixlConnector for new deployments, especially on systems running the NVIDIA open kernel modules.

Testing the Server

To verify the server is working:

# List models
curl http://localhost:8000/v1/models

# Chat completion test
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'
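
The same chat completion test can be issued from Python with only the standard library. This is a minimal sketch that assumes the server from the previous section is listening on port 8000. `build_chat_request` and `chat` are illustrative helpers, and the response is read using the standard OpenAI chat completion layout.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=50):
    """Build a POST request for the OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, prompt, timeout=60):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("http://localhost:8000", "Qwen/Qwen2.5-7B-Instruct", "Hello!")` should return the generated reply as a string. A connection error at this point usually means the server has not finished loading the model.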

Diagnostic Script

To diagnose installation issues, run scripts/diagnose.py:

python3 scripts/diagnose.py

This script checks:

  • CUDA installation and version
  • PyTorch CUDA compatibility
  • vLLM version and import status
  • FlashInfer versions (python and cubin must match)
  • Required library availability
  • GPU detection and memory
  • LD_LIBRARY_PATH configuration
  • LSSD mount status (prompts to run /lssd-mounter if not mounted)
  • DeepEP installation (prompts to run /deepep-installer if not installed, for MoE models)

Dependency Skills

LSSD Mounter

If LSSD is not mounted, the diagnostic script will prompt:

⚠ LSSD: Not mounted
  High-speed local SSD recommended for model caching
  To mount LSSD, use the lssd-mounter skill:
    /lssd-mounter

DeepEP Installer

If DeepEP is not installed and you plan to run MoE models:

⚠ DeepEP: Not installed
  DeepEP is required for MoE models (DeepSeek-V3, DeepSeek-R1)
  To install DeepEP, use the deepep-installer skill:
    /deepep-installer

Workflow Integration

When diagnosing vLLM installation:

  1. Run diagnose.py to check all prerequisites
  2. If LSSD not mounted → prompt to run /lssd-mounter
  3. If DeepEP not installed and MoE models needed → prompt to run /deepep-installer
  4. After dependencies installed, re-run diagnostic
  5. Proceed with vLLM server startup
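
The decision logic in the steps above can be sketched as a small pure function. `next_actions` is a hypothetical helper. It is not taken from scripts/diagnose.py, whose implementation is not shown in this document.

```python
def next_actions(lssd_mounted, deepep_installed, needs_moe):
    """Map diagnostic results to the follow-up steps listed in the workflow."""
    actions = []
    if not lssd_mounted:
        actions.append("run /lssd-mounter")
    if needs_moe and not deepep_installed:
        actions.append("run /deepep-installer")
    if actions:
        # Dependencies will change, so the diagnostic is repeated before serving.
        actions.append("re-run scripts/diagnose.py")
    else:
        actions.append("start the vLLM server")
    return actions


if __name__ == "__main__":
    print(next_actions(lssd_mounted=False, deepep_installed=False, needs_moe=True))
```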

Pre-downloading DeepSeek Weights

For faster DeepSeek-V3/R1 model loading, pre-download weights from GCS instead of HuggingFace:

# Check if already downloaded
DEEPSEEK_PATH="/lssd/huggingface/hub/models--deepseek-ai--DeepSeek-V3"

if [ -d "$DEEPSEEK_PATH" ]; then
    echo "✓ DeepSeek-V3 weights already exist: $DEEPSEEK_PATH"
    du -sh "$DEEPSEEK_PATH"
else
    echo "Downloading DeepSeek-V3 weights from GCS..."
    gcloud storage cp -r gs://chrisya-gpu-pg-ase1/huggingface /lssd/
    echo "✓ DeepSeek-V3 weights downloaded"
fi

Notes:

  • GCS bucket gs://chrisya-gpu-pg-ase1/huggingface contains pre-cached DeepSeek-V3 FP8 weights
  • Downloading from GCS is much faster than HuggingFace (same-region high bandwidth)
  • Weights are ~600GB, including complete safetensors files
  • Use --download-dir /lssd/huggingface/hub when starting vLLM server

vLLM vs SGLang

| Feature | vLLM | SGLang |
|---|---|---|
| Attention | PagedAttention | RadixAttention |
| Strength | High throughput batch | Multi-turn, structured output |
| API | OpenAI compatible | OpenAI compatible |
| Default Port | 8000 | 30000 |

Coexistence and Dependency Conflicts

vLLM and SGLang can coexist on the same system, but they pin conflicting versions of several dependencies:

| Package | SGLang wants | vLLM installs |
|---|---|---|
| grpcio | 1.75.1 | 1.76.0 |
| timm | 1.0.16 | 1.0.24 |
| xgrammar | 0.1.27 | 0.1.29 |
| llguidance | <0.8.0,>=0.7.11 | 1.3.0 |

Recommendations:

  1. Production: Use separate Python virtual environments for each
  2. Development: Accept mismatches (usually works for basic inference)
  3. MoE models: Install DeepEP first, then either framework
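
To see which of the listed packages actually violate SGLang's requirement, the versions vLLM installs can be checked against each spec. The sketch below uses a deliberately small comparison that handles only plain numeric versions and the `==`, `>=`, and `<` operators appearing in the conflict table. `parse_version` and `satisfies` are illustrative helpers, not a replacement for a full version-specifier parser.

```python
def parse_version(text):
    """Turn '1.3.0' into (1, 3, 0); only plain numeric versions are handled."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version, spec):
    """Check a version against a comma-separated spec using ==, >=, and < only."""
    found = parse_version(version)
    for clause in spec.split(","):
        clause = clause.strip()
        if clause.startswith(">="):
            ok = found >= parse_version(clause[2:])
        elif clause.startswith("<"):
            ok = found < parse_version(clause[1:])
        else:
            # A bare version is treated as an exact pin.
            ok = found == parse_version(clause.removeprefix("=="))
        if not ok:
            return False
    return True


# (SGLang requirement, version vLLM installs), copied from the conflict table.
CONFLICTS = {
    "grpcio": ("1.75.1", "1.76.0"),
    "timm": ("1.0.16", "1.0.24"),
    "xgrammar": ("0.1.27", "0.1.29"),
    "llguidance": ("<0.8.0,>=0.7.11", "1.3.0"),
}

if __name__ == "__main__":
    for name, (wanted, installed) in CONFLICTS.items():
        status = "ok" if satisfies(installed, wanted) else "conflict"
        print(f"{name}: SGLang wants {wanted}, found {installed} -> {status}")
```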

sgl-kernel ABI Issues After vLLM Install

When vLLM is installed after SGLang, you may see sgl-kernel ABI errors:

Symptom:

ImportError: .../sgl_kernel/sm100/common_ops.abi3.so: undefined symbol: _ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_ib

Root Cause: vLLM pins PyTorch 2.9.1, but sgl-kernel may have been compiled against a different version.

Fix:

# Reinstall sgl-kernel after vLLM
pip install sgl-kernel==0.3.21 --force-reinstall --no-deps
pip install nvidia-nccl-cu12==2.28.3 nvidia-cudnn-cu12==9.16.0.29 --force-reinstall --no-deps

DeepEP Recompilation After vLLM Install

IMPORTANT: If DeepEP was installed before vLLM, you may need to recompile DeepEP after installing vLLM due to PyTorch ABI changes.

Symptom:

ImportError: .../deep_ep_cpp.cpython-312-x86_64-linux-gnu.so: undefined symbol: _ZNK3c1010TensorImpl15incref_pyobjectEv

Fix:

cd /tmp/deepep_build  # or wherever DeepEP was cloned
rm -rf build/ dist/ *.egg-info
rm -rf ~/.local/lib/python3.12/site-packages/deep_ep-*.egg  # Remove old egg

export CUDA_HOME=/usr/local/cuda-12.9
export LD_LIBRARY_PATH=/opt/deepep/nvshmem/lib:/opt/deepep/gdrcopy/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$CUDA_HOME/lib64/stubs:$LIBRARY_PATH

TORCH_CUDA_ARCH_LIST="10.0" NVSHMEM_DIR=/opt/deepep/nvshmem python3 setup.py install --user

Resources

  • scripts/diagnose.py - Diagnostic script for installation issues
  • scripts/setup_env.sh - Environment variable setup script
  • references/version_matrix.md - Version compatibility matrix
  • references/troubleshooting.md - Extended troubleshooting guide

Unified Environment Script

After installing DeepEP + SGLang + vLLM, use the unified environment script:

source /opt/deepep/unified-env.sh

This sets up all necessary environment variables for the complete stack.

Post-Installation Verification

After completing all installations, verify everything works:

source /opt/deepep/unified-env.sh
python3 -c "
import torch; print(f'PyTorch: {torch.__version__}')
import deep_ep; print('DeepEP: OK')
import sglang; print(f'SGLang: {sglang.__version__}')
import sgl_kernel; print(f'sgl-kernel: {sgl_kernel.__version__}')
import vllm; print(f'vLLM: {vllm.__version__}')
import nixl; print('NIXL: OK')
"

Version History

  • 2026-01-29: DeepSeek-V3 FP8 Blackwell compatibility

    • NEW: Documented vLLM correctly handles DeepSeek-V3 FP8 on Blackwell (B200) GPUs
    • NEW: Added comparison table showing vLLM works while SGLang produces garbage output
    • NEW: Added numpy <2.3 version requirement (numba compatibility)
    • RECOMMENDATION: Use vLLM for DeepSeek-V3 on Blackwell
  • 2026-01-29: Added FlashInfer PyTorch version warning

    • CRITICAL: Documented FlashInfer changing PyTorch to 2.10 issue
    • NEW: Added fix commands after FlashInfer installation
    • NEW: Added unified environment script reference
    • NEW: Added post-installation verification commands
  • 2026-01-29: Added GCS DeepSeek weights pre-download

    • NEW: Added "Pre-downloading DeepSeek Weights" section in Dependency Skills
    • GCS source: gs://chrisya-gpu-pg-ase1/huggingface
    • Faster than HuggingFace download (same-region bandwidth)
  • 2026-01-29: Added Disaggregated Prefill (PD Separation) with NIXL support

    • NEW: Added NixlConnector as recommended KV transfer backend
    • NEW: Added disaggregated prefill configuration examples
    • NEW: Added KV transfer config parameters documentation
    • NEW: Added NIXL vs Mooncake comparison table
    • Added nixl 0.9.0 to version table
  • 2026-01-29: Updated based on full installation workflow with DeepEP + SGLang + vLLM

    • CRITICAL: Added DeepEP recompilation requirement after vLLM install
    • Documented PyTorch ABI incompatibility issue and fix
    • Added detailed dependency conflict table (numpy, llguidance versions)
  • 2026-01-28: Updated based on installation experience

    • Added pip installation for Ubuntu 24.04
    • Added FlashInfer 0.5.3 unified version with SGLang
    • Documented SGLang/vLLM dependency conflicts with specific versions