Agent Skill
2/7/2026

mastering-pytorch-rl-nlp

Expert guidance for PyTorch development covering Deep Reinforcement Learning and NLP Transformers. This skill provides comprehensive knowledge for building RL agents with TorchRL (DQN, PPO) and NLP systems with HuggingFace Transformers. Use this skill when working with PyTorch 2.7+, implementing reinforcement learning algorithms, fine-tuning transformer models, or deploying ML systems to production. Includes current best practices, verified library versions (Dec 2025), and warnings about deprecated APIs.

S
spillwavesolutions
1GitHub Stars
1Views
npx skills add SpillwaveSolutions/mastering-pytorch-rl-nlp-agentic-skill

SKILL.md

Namemastering-pytorch-rl-nlp
DescriptionExpert guidance for PyTorch development covering Deep Reinforcement Learning and NLP Transformers. This skill provides comprehensive knowledge for building RL agents with TorchRL (DQN, PPO) and NLP systems with HuggingFace Transformers. Use this skill when working with PyTorch 2.7+, implementing reinforcement learning algorithms, fine-tuning transformer models, or deploying ML systems to production. Includes current best practices, verified library versions (Dec 2025), and warnings about deprecated APIs.

name: mastering-pytorch-rl-nlp description: | Expert guidance for PyTorch development covering Deep Reinforcement Learning and NLP Transformers. This skill provides comprehensive knowledge for building RL agents with TorchRL (DQN, PPO) and NLP systems with HuggingFace Transformers. Use this skill when working with PyTorch 2.7+, implementing reinforcement learning algorithms, fine-tuning transformer models, or deploying ML systems to production. Includes current best practices, verified library versions (Dec 2025), and warnings about deprecated APIs. version: 1.0.0 category: machine-learning triggers:

  • pytorch
  • torch
  • torchrl
  • reinforcement learning
  • RL with pytorch
  • DQN
  • Deep Q-Network
  • PPO
  • Proximal Policy Optimization
  • policy gradient
  • huggingface
  • transformers
  • BERT
  • GPT
  • fine-tuning
  • torch.compile
  • quantization
  • TorchServe
  • gymnasium
  • gym environment
  • NLP with PyTorch
  • transformer models
  • PEFT
  • LoRA author: Rick Hightower license: MIT tags:
  • pytorch
  • deep-learning
  • reinforcement-learning
  • nlp
  • transformers
  • torchrl
  • huggingface

Mastering PyTorch: Deep RL and NLP

Expert guidance for PyTorch 2.7+ development covering Deep Reinforcement Learning with TorchRL and NLP with HuggingFace Transformers.

Verified Library Versions (December 2025)

LibraryVersionNotes
PyTorch2.9.1+Use 2.7+ minimum, CUDA 12.4 recommended
TorchRL0.10.xGymEnv, SyncDataCollector, ClipPPOLoss
HuggingFace Transformers4.56.2+AutoTokenizer, Trainer, pipeline
Gymnasium1.0.0+OpenAI Gym is DEPRECATED
PEFTCurrentLoRA fine-tuning
PettingZooCurrentMulti-agent RL

Deprecation Warnings

ALWAYS avoid these deprecated patterns:

DeprecatedUse Instead
import gymimport gymnasium as gym
evaluation_strategy in Trainereval_strategy
CUDA < 12.1CUDA 12.4
env.step() returning 4 valuesUse 5 values: obs, reward, terminated, truncated, info

Quick Reference

Device Setup (All Platforms)

import torch

device = (
    torch.device("cuda") if torch.cuda.is_available() else
    torch.device("mps") if torch.backends.mps.is_available() else
    torch.device("xpu") if hasattr(torch.backends, 'xpu') and torch.backends.xpu.is_available() else
    torch.device("cpu")
)
model = model.to(device)
model = torch.compile(model)  # Optimize for speed

TorchRL Quick Start (DQN)

from torchrl.envs import GymEnv
from torchrl.collectors import SyncDataCollector
from torchrl.data import ReplayBuffer, LazyTensorStorage
from torchrl.objectives import DQNLoss, HardUpdate

env = GymEnv("CartPole-v1", device=device)
collector = SyncDataCollector(env, policy, frames_per_batch=128, total_frames=100_000)
replay_buffer = ReplayBuffer(storage=LazyTensorStorage(10_000))
loss_module = DQNLoss(value_network=qnet, loss_function="smooth_l1", delay_value=True)

TorchRL Quick Start (PPO)

from torchrl.objectives import ClipPPOLoss
from torchrl.objectives.value import GAE

loss_fn = ClipPPOLoss(actor_network=actor, critic_network=critic, clip_epsilon=0.2, entropy_coef=0.01)
advantage_fn = GAE(value_network=critic, gamma=0.99, lmbda=0.95)

HuggingFace Quick Start

from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

# Simple inference
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
result = classifier("I love PyTorch!")

# Fine-tuning
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir="./results", eval_strategy="epoch",  # NOT evaluation_strategy
    learning_rate=2e-5, per_device_train_batch_size=8, num_train_epochs=2, fp16=True
)
trainer = Trainer(model=model, args=training_args, train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()

Detailed Guides

For comprehensive coverage, load the appropriate guide:

TopicGuideWhen to Load
Tensors, Autograd, nn.Modulereferences/pytorch-fundamentals.mdPyTorch basics, device management
TorchRL, DQN, PPOreferences/reinforcement-learning.mdRL algorithms, environments
HuggingFace, BERT, Fine-tuningreferences/nlp-transformers.mdNLP tasks, transformer models
torch.compile, Quantization, DDPreferences/optimization-deployment.mdProduction, performance
CLIP, RLHF, Ethicsreferences/advanced-topics.mdMulti-modal, responsible AI

Common Patterns

Gymnasium Environment (Modern API)

import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset()

while True:
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)  # 5 values!
    done = terminated or truncated
    if done:
        break

Training Loop with AMP

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for batch in dataloader:
    optimizer.zero_grad()
    with autocast():
        loss = compute_loss(batch)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

PEFT/LoRA Fine-tuning

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)
# Now fine-tune with much fewer parameters

Installation

# Create virtual environment
python -m venv .venv && source .venv/bin/activate

# PyTorch with CUDA 12.4
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# RL libraries
pip install torchrl gymnasium pettingzoo

# NLP libraries
pip install transformers datasets peft accelerate

# Experiment tracking
pip install tensorboard wandb

Apple Silicon (M2/M3/M4) Support

PyTorch MPS backend enables GPU acceleration on Apple Silicon Macs.

MPS Setup

import torch
import os

# Enable MPS fallback for unsupported operations
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
model = model.to(device)

MPS Limitations (Dec 2025)

  • Not all operations supported - Some fall back to CPU
  • No distributed training - Single GPU only
  • No float64 support - Use float32
  • SDPA can be unstable - Use eager attention if issues occur
  • torch.compile limited - Test thoroughly, may need to disable
  • ~3x slower than RTX 4090 - But 80% lower energy consumption

Best for Apple Silicon

  • Prototyping and development
  • Light to medium training workloads
  • Local inference
  • Learning and experimentation

Key Concepts

Reinforcement Learning

  • Agent: Learns from environment interactions
  • Environment: Provides states, rewards (use Gymnasium, not gym)
  • Policy: Maps states to actions (actor in PPO)
  • Value Function: Estimates future rewards (critic in PPO)
  • Experience Replay: Stores transitions for stable learning (DQN)
  • Target Network: Slowly-updated copy for stable Q-targets (DQN)

NLP/Transformers

  • Tokenizer: Converts text to token IDs
  • Encoder: Processes input (BERT-style)
  • Decoder: Generates output (GPT-style)
  • Fine-tuning: Adapts pre-trained model to specific task
  • PEFT/LoRA: Parameter-efficient fine-tuning (fewer trainable params)

Optimization

  • torch.compile: JIT compilation for faster execution
  • Mixed Precision (AMP): fp16/bf16 for speed and memory
  • Quantization: Reduce model size (int8)
  • DDP: Distributed training across GPUs
  • torchrun: Launch distributed training
Skills Info
Original Name:mastering-pytorch-rl-nlpAuthor:spillwavesolutions