mastering-pytorch-rl-nlp
Expert guidance for PyTorch development covering Deep Reinforcement Learning and NLP Transformers. This skill provides comprehensive knowledge for building RL agents with TorchRL (DQN, PPO) and NLP systems with HuggingFace Transformers. Use this skill when working with PyTorch 2.7+, implementing reinforcement learning algorithms, fine-tuning transformer models, or deploying ML systems to production. Includes current best practices, verified library versions (Dec 2025), and warnings about deprecated APIs.
SKILL.md
---
name: mastering-pytorch-rl-nlp
description: |
  Expert guidance for PyTorch development covering Deep Reinforcement Learning and NLP Transformers. This skill provides comprehensive knowledge for building RL agents with TorchRL (DQN, PPO) and NLP systems with HuggingFace Transformers. Use this skill when working with PyTorch 2.7+, implementing reinforcement learning algorithms, fine-tuning transformer models, or deploying ML systems to production. Includes current best practices, verified library versions (Dec 2025), and warnings about deprecated APIs.
version: 1.0.0
category: machine-learning
triggers:
  - pytorch
  - torch
  - torchrl
  - reinforcement learning
  - RL with pytorch
  - DQN
  - Deep Q-Network
  - PPO
  - Proximal Policy Optimization
  - policy gradient
  - huggingface
  - transformers
  - BERT
  - GPT
  - fine-tuning
  - torch.compile
  - quantization
  - TorchServe
  - gymnasium
  - gym environment
  - NLP with PyTorch
  - transformer models
  - PEFT
  - LoRA
author: Rick Hightower
license: MIT
tags:
  - pytorch
  - deep-learning
  - reinforcement-learning
  - nlp
  - transformers
  - torchrl
  - huggingface
---
Mastering PyTorch: Deep RL and NLP
Expert guidance for PyTorch 2.7+ development covering Deep Reinforcement Learning with TorchRL and NLP with HuggingFace Transformers.
Verified Library Versions (December 2025)
| Library | Version | Notes |
|---|---|---|
| PyTorch | 2.9.1+ | Use 2.7+ minimum; CUDA 12.6+ wheels (2.7+ no longer ships cu124 builds) |
| TorchRL | 0.10.x | GymEnv, SyncDataCollector, ClipPPOLoss |
| HuggingFace Transformers | 4.56.2+ | AutoTokenizer, Trainer, pipeline |
| Gymnasium | 1.0.0+ | OpenAI Gym is DEPRECATED |
| PEFT | Current | LoRA fine-tuning |
| PettingZoo | Current | Multi-agent RL |
Deprecation Warnings
ALWAYS avoid these deprecated patterns:
| Deprecated | Use Instead |
|---|---|
| import gym | import gymnasium as gym |
| evaluation_strategy in TrainingArguments | eval_strategy |
| CUDA < 12.1 | CUDA 12.6 or newer |
| env.step() returning 4 values | 5 values: obs, reward, terminated, truncated, info |
Quick Reference
Device Setup (All Platforms)
import torch
# model: your torch.nn.Module, defined elsewhere
device = (
    torch.device("cuda") if torch.cuda.is_available() else
    torch.device("mps") if torch.backends.mps.is_available() else
    torch.device("xpu") if hasattr(torch, "xpu") and torch.xpu.is_available() else
    torch.device("cpu")
)
model = model.to(device)
model = torch.compile(model)  # JIT-compiles on the first call; later calls run faster
TorchRL Quick Start (DQN)
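from torchrl.envs import GymEnv
from torchrl.collectors import SyncDataCollector
from torchrl.data import ReplayBuffer, LazyTensorStorage
from torchrl.objectives import DQNLoss, HardUpdate
# qnet: a Q-value network (e.g. a QValueActor); policy: qnet plus exploration (e.g. EGreedyModule)
env = GymEnv("CartPole-v1", device=device)
collector = SyncDataCollector(env, policy, frames_per_batch=128, total_frames=100_000)
replay_buffer = ReplayBuffer(storage=LazyTensorStorage(10_000))
loss_module = DQNLoss(value_network=qnet, loss_function="smooth_l1", delay_value=True)
# delay_value=True keeps a target network; HardUpdate copies the online weights into it
# every value_network_update_interval calls to target_updater.step() (call it after each optimizer step)
target_updater = HardUpdate(loss_module, value_network_update_interval=1_000)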
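As a mental model of what DQNLoss optimizes with delay_value=True and loss_function="smooth_l1", the sketch below computes one-step TD targets from a target network's Q-values in plain Python. It assumes a TD(0) target and list-based Q-values. The function names are hypothetical, and none of this is TorchRL's implementation.

```python
def smooth_l1(x: float, beta: float = 1.0) -> float:
    """Smooth-L1 (Huber-style) penalty on a scalar TD error, PyTorch's formula with beta=1.0."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def dqn_target(reward: float, next_q_target: list, terminated: bool, gamma: float = 0.99) -> float:
    """TD(0) target r + gamma * max_a' Q_target(s', a'); no bootstrap from a terminal state."""
    if terminated:
        return reward
    return reward + gamma * max(next_q_target)


def dqn_loss(batch, gamma: float = 0.99) -> float:
    """Mean smooth-L1 loss over (q_taken, reward, next_q_target, terminated) tuples.

    q_taken is the online network's Q(s, a) for the action actually taken;
    next_q_target holds the target network's Q-values for every action in s'.
    """
    losses = [
        smooth_l1(q_taken - dqn_target(r, next_q, term, gamma))
        for q_taken, r, next_q, term in batch
    ]
    return sum(losses) / len(losses)
```

Only terminated zeroes out the bootstrap term. A transition cut off by a time limit (truncated) should still pass terminated=False and bootstrap from the target network.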
TorchRL Quick Start (PPO)
from torchrl.objectives import ClipPPOLoss
from torchrl.objectives.value import GAE
# actor: a ProbabilisticActor (the policy); critic: a ValueOperator (state-value network)
loss_fn = ClipPPOLoss(actor_network=actor, critic_network=critic, clip_epsilon=0.2, entropy_coef=0.01)
advantage_fn = GAE(value_network=critic, gamma=0.99, lmbda=0.95)
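The following plain-Python sketch shows what GAE(gamma=0.99, lmbda=0.95) and the clip_epsilon=0.2 objective compute for a single trajectory segment. It assumes there are no episode resets in the middle of the segment and that every quantity is a per-sample scalar. compute_gae and clipped_surrogate are hypothetical names, not TorchRL functions.

```python
def compute_gae(rewards, values, next_values, terminated, gamma=0.99, lmbda=0.95):
    """Generalized Advantage Estimation over one trajectory segment, computed backwards.

    delta_t = r_t + gamma * V(s_{t+1}) * (1 - terminated_t) - V(s_t)
    A_t     = delta_t + gamma * lmbda * (1 - terminated_t) * A_{t+1}
    Returns (advantages, value_targets), where value_target_t = A_t + V(s_t).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if terminated[t] else 1.0
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        gae = delta + gamma * lmbda * not_done * gae
        advantages[t] = gae
    value_targets = [a + v for a, v in zip(advantages, values)]
    return advantages, value_targets


def clipped_surrogate(ratio, advantage, clip_epsilon=0.2):
    """PPO-clip objective for one sample (maximized); ratio = pi_new(a|s) / pi_old(a|s)."""
    clipped_ratio = max(min(ratio, 1 + clip_epsilon), 1 - clip_epsilon)
    # Taking the minimum gives a pessimistic bound, removing the incentive to push the ratio past the clip range
    return min(ratio * advantage, clipped_ratio * advantage)
```

With lmbda=1.0 the advantages reduce to discounted returns minus the value baseline. With lmbda=0.0 they reduce to one-step TD errors. The 0.95 setting trades off between the two.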
HuggingFace Quick Start
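from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
# Simple inference
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
result = classifier("I love PyTorch!")  # a list of {"label": ..., "score": ...} dicts
# Fine-tuning
from transformers import Trainer, TrainingArguments
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
# train_ds / val_ds: datasets already tokenized with tokenizer (input_ids, attention_mask, labels)
training_args = TrainingArguments(
    output_dir="./results", eval_strategy="epoch",  # NOT evaluation_strategy
    learning_rate=2e-5, per_device_train_batch_size=8, num_train_epochs=2,
    fp16=True,  # fp16 mixed precision targets CUDA GPUs; use bf16=True on Ampere+ or drop it on CPU
)
trainer = Trainer(model=model, args=training_args, train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()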
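Trainer can report accuracy alongside eval_loss at each epoch if you give it a compute_metrics callable. That callable receives the predictions and label IDs. The sketch below is a minimal accuracy metric that assumes single-label classification with per-class logits as predictions. It is written without NumPy, but it also works when Trainer passes NumPy arrays.

```python
def compute_metrics(eval_pred):
    """Accuracy for Trainer(compute_metrics=...).

    eval_pred unpacks into (predictions, label_ids); for sequence classification,
    predictions are logits shaped (num_examples, num_labels).
    """
    logits, labels = eval_pred
    # argmax over each row of logits gives the predicted class index
    preds = [max(range(len(row)), key=lambda j: row[j]) for row in logits]
    correct = sum(int(p == l) for p, l in zip(preds, labels))
    return {"accuracy": correct / len(labels)}
```

Pass it as Trainer(..., compute_metrics=compute_metrics). The returned keys appear in the evaluation logs with an eval_ prefix.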
Detailed Guides
For comprehensive coverage, load the appropriate guide:
| Topic | Guide | When to Load |
|---|---|---|
| Tensors, Autograd, nn.Module | references/pytorch-fundamentals.md | PyTorch basics, device management |
| TorchRL, DQN, PPO | references/reinforcement-learning.md | RL algorithms, environments |
| HuggingFace, BERT, Fine-tuning | references/nlp-transformers.md | NLP tasks, transformer models |
| torch.compile, Quantization, DDP | references/optimization-deployment.md | Production, performance |
| CLIP, RLHF, Ethics | references/advanced-topics.md | Multi-modal, responsible AI |
Common Patterns
Gymnasium Environment (Modern API)
import gymnasium as gym
env = gym.make("CartPole-v1")
obs, info = env.reset()
while True:
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)  # 5 values!
    done = terminated or truncated  # terminated: true end state; truncated: time limit hit
    if done:
        break
env.close()
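The terminated/truncated split matters for learning targets. A terminated step is a true end state. A truncated step was cut off by a time limit and may still be bootstrapped. The hypothetical CoinFlipEnv below reproduces Gymnasium's reset() and step() return shapes in plain Python, so you can see both flags without installing anything. It is not a gymnasium.Env subclass.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy env mirroring the Gymnasium reset()/step() return shapes.

    The observation is the current coin (0 or 1). Guessing it earns 1.0, a wrong
    guess terminates the episode, and reaching max_steps truncates it.
    """

    def __init__(self, max_steps: int = 10):
        self.max_steps = max_steps
        self._rng = random.Random()
        self._t = 0
        self._coin = 0

    def reset(self, seed=None):
        if seed is not None:
            self._rng.seed(seed)
        self._t = 0
        self._coin = self._rng.randint(0, 1)
        return self._coin, {}  # (obs, info)

    def step(self, action: int):
        self._t += 1
        reward = 1.0 if action == self._coin else 0.0
        terminated = reward == 0.0  # true terminal state: do not bootstrap from it
        truncated = not terminated and self._t >= self.max_steps  # time limit: may still bootstrap
        self._coin = self._rng.randint(0, 1)
        return self._coin, reward, terminated, truncated, {"t": self._t}


def run_episode(env, policy, seed=None):
    """Roll out one episode; returns (total_reward, terminated, truncated)."""
    obs, info = env.reset(seed=seed)
    total = 0.0
    while True:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total += reward
        if terminated or truncated:
            return total, terminated, truncated
```

A policy that always echoes the observation runs until the time limit and ends truncated. A policy that always guesses wrong ends terminated on the first step.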
Training Loop with AMP
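import torch
# torch.cuda.amp.autocast / GradScaler are deprecated; use the device-generic torch.amp versions
# model, optimizer, dataloader and compute_loss are assumed to be defined elsewhere
scaler = torch.amp.GradScaler("cuda")
for batch in dataloader:
    optimizer.zero_grad()
    with torch.amp.autocast(device_type="cuda", dtype=torch.float16):
        loss = compute_loss(batch)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()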
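GradScaler protects fp16 gradients from underflow in three steps. It multiplies the loss by a scale factor. It skips any optimizer step whose unscaled gradients contain inf or NaN. It adjusts the scale as training proceeds. The class below is a simplified plain-Python model of that update rule and uses GradScaler's documented defaults (init_scale=2**16, growth_factor=2.0, backoff_factor=0.5, growth_interval=2000). LossScaleSketch is a hypothetical name, and the real implementation operates on tensors.

```python
import math


class LossScaleSketch:
    """Plain-Python model of dynamic loss scaling (not torch.amp.GradScaler itself)."""

    def __init__(self, init_scale=2.0**16, growth_factor=2.0, backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0
        self._found_inf = False

    def unscale_and_check(self, scaled_grads):
        """Divide grads by the scale; None means "skip optimizer.step()" (like scaler.step)."""
        grads = [g / self.scale for g in scaled_grads]
        self._found_inf = not all(math.isfinite(g) for g in grads)
        return None if self._found_inf else grads

    def update(self):
        """Back off on overflow; grow after growth_interval consecutive clean steps (like scaler.update)."""
        if self._found_inf:
            self.scale *= self.backoff_factor
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps == self.growth_interval:
                self.scale *= self.growth_factor
                self._good_steps = 0
```

bf16 autocast usually needs no loss scaling because bf16 has the same exponent range as fp32.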
PEFT/LoRA Fine-tuning
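from peft import LoraConfig, get_peft_model
# target_modules depends on the architecture: q_proj/v_proj match LLaMA-style models,
# while DistilBERT names its attention projections q_lin/v_lin and BERT uses query/value
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts
# Now fine-tune with far fewer trainable parameters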
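The parameter savings come from LoRA's low-rank update. A frozen d_out x d_in weight W gets a trainable product B A, where A has shape r x d_in and B has shape d_out x r, and the product is scaled by lora_alpha / r. The plain-Python sketch below counts those parameters and applies the update. The helper names are hypothetical, and the 768 x 768 example assumes a BERT-base-sized projection.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable params LoRA adds to one d_out x d_in weight: A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x), using nested lists as matrices; W stays frozen."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]

    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))  # low-rank path: project down to r dims, then back up
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]
```

For r=8 on a 768 x 768 matrix, LoRA adds 12,288 trainable parameters against 589,824 frozen ones, about 2%. LoRA initializes B to zero, so the adapted model starts out identical to the base model.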
Installation
# Create virtual environment
python -m venv .venv && source .venv/bin/activate
# PyTorch with CUDA 12.6 (PyTorch 2.7+ no longer publishes cu124 wheels)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
# RL libraries
pip install torchrl gymnasium pettingzoo
# NLP libraries
pip install transformers datasets peft accelerate
# Experiment tracking
pip install tensorboard wandb
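After installing, it is worth checking which versions pip actually resolved and comparing them with the version table above. The script below uses only importlib.metadata from the standard library. The package list comes from the pip commands and can be edited.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as passed to pip above
PACKAGES = ["torch", "torchrl", "gymnasium", "pettingzoo", "transformers", "datasets", "peft", "accelerate"]


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version string, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name:<14} {ver or 'NOT INSTALLED'}")
```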
Apple Silicon (M1-M4) Support
PyTorch MPS backend enables GPU acceleration on Apple Silicon Macs.
MPS Setup
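import os
# Enable MPS fallback for unsupported operations; set it before importing torch,
# because the flag is read when torch loads
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
import torch
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
model = model.to(device)  # model: your nn.Module, defined elsewhere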
MPS Limitations (Dec 2025)
- Not all operations supported - Unsupported ops raise an error unless PYTORCH_ENABLE_MPS_FALLBACK=1, which runs them on CPU (slower)
- No distributed training - Single GPU only
- No float64 support - Use float32
- SDPA can be unstable - Use eager attention if issues occur
- torch.compile limited - Test thoroughly, may need to disable
- Roughly 3x slower than an RTX 4090 (workload-dependent) - but with about 80% lower energy consumption
Best for Apple Silicon
- Prototyping and development
- Light to medium training workloads
- Local inference
- Learning and experimentation
Key Concepts
Reinforcement Learning
- Agent: Learns from environment interactions
- Environment: Provides states, rewards (use Gymnasium, not gym)
- Policy: Maps states to actions (actor in PPO)
- Value Function: Estimates expected discounted future return (critic in PPO)
- Experience Replay: Stores transitions for stable learning (DQN)
- Target Network: Delayed copy of the Q-network, refreshed periodically (hard update) or gradually (soft update), that supplies stable Q-targets (DQN)
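Both DQN stabilizers, experience replay and the target network, fit in a few lines of plain Python. This is a hypothetical, list-based illustration and not torchrl.data.ReplayBuffer or TorchRL's SoftUpdate/HardUpdate. It has three parts: a bounded buffer sampled uniformly, a hard-copy target update, and a Polyak (soft) target update. The tau=0.005 default is an assumed value.

```python
import random
from collections import deque


class ReplayBuffer:
    """Minimal uniform experience replay over arbitrary transition objects."""

    def __init__(self, capacity: int, seed=None):
        self._data = deque(maxlen=capacity)  # oldest transitions are evicted first
        self._rng = random.Random(seed)

    def add(self, transition):
        self._data.append(transition)

    def sample(self, batch_size: int):
        # Uniform sampling without replacement breaks the correlation between consecutive steps
        return self._rng.sample(list(self._data), batch_size)

    def __len__(self):
        return len(self._data)


def soft_update(target, online, tau=0.005):
    """Polyak averaging per parameter: target <- tau * online + (1 - tau) * target."""
    return [tau * o + (1 - tau) * t for t, o in zip(target, online)]


def hard_update(online):
    """Copy the online parameters wholesale into the target (typically every N steps)."""
    return list(online)
```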
NLP/Transformers
- Tokenizer: Converts text to token IDs
- Encoder: Processes input (BERT-style)
- Decoder: Generates output (GPT-style)
- Fine-tuning: Adapts pre-trained model to specific task
- PEFT/LoRA: Parameter-efficient fine-tuning (fewer trainable params)
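To make "text to token IDs" concrete, here is a hypothetical word-level tokenizer in plain Python. It adds BERT-style [CLS]/[SEP] markers, pads to a fixed length, and returns input_ids and attention_mask, the same keys HuggingFace tokenizers produce. Real BERT and GPT tokenizers split text into subwords (WordPiece and byte-level BPE respectively), so IDs from this toy will not match theirs.

```python
class ToyWordTokenizer:
    """Hypothetical word-level tokenizer: text -> token IDs and back."""

    SPECIAL = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3}

    def __init__(self, corpus):
        self.vocab = dict(self.SPECIAL)
        for text in corpus:
            for word in text.lower().split():
                self.vocab.setdefault(word, len(self.vocab))  # next free ID for unseen words
        self.inv_vocab = {i: w for w, i in self.vocab.items()}

    def encode(self, text, max_length=8):
        ids = [self.vocab["[CLS]"]]
        ids += [self.vocab.get(w, self.vocab["[UNK]"]) for w in text.lower().split()]
        ids = ids[: max_length - 1] + [self.vocab["[SEP]"]]  # truncate, keeping room for [SEP]
        attention_mask = [1] * len(ids) + [0] * (max_length - len(ids))  # 0 marks padding
        ids += [self.vocab["[PAD]"]] * (max_length - len(ids))
        return {"input_ids": ids, "attention_mask": attention_mask}

    def decode(self, ids):
        # Skips all special tokens (IDs 0-3), including [UNK]
        return " ".join(self.inv_vocab[i] for i in ids if i >= len(self.SPECIAL))
```

Words missing from the vocabulary map to [UNK], which is the problem subword tokenizers avoid by breaking rare words into known pieces.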
Optimization
- torch.compile: JIT compilation for faster execution
- Mixed Precision (AMP): fp16/bf16 for speed and memory
- Quantization: Reduce model size (int8)
- DDP: Distributed training across GPUs
- torchrun: Launch distributed training
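Int8 quantization stores each weight as an 8-bit integer plus a shared scale factor. The plain-Python sketch below assumes symmetric per-tensor quantization with a zero point of 0. PyTorch's quantization tooling also supports per-channel scales and nonzero zero points, so treat this as an illustration of the arithmetic rather than its implementation.

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8: scale = max|w| / 127, q = clamp(round(w / scale), -127, 127)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # all-zero tensor: any scale works
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q, scale):
    """Approximate reconstruction: w_hat = q * scale."""
    return [qi * scale for qi in q]
```

Each fp32 weight shrinks from 4 bytes to 1, roughly a 4x size reduction. The cost is a rounding error of at most scale / 2 per weight.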