Agent Skill
2/7/2026

rag-architecture-skill

Build retrieval-augmented generation systems that ground LLMs in your data.

fabioc
npx skills add fabioc-aloha/WindowsWidget

SKILL.md


name: "RAG Architecture Skill"
description: "Build retrieval-augmented generation systems that ground LLMs in your data."
applyTo: "/rag,/retrieval,/embedding,/vector,/knowledge,/search"

RAG Architecture Skill

Build retrieval-augmented generation systems that ground LLMs in your data.

Core Principle

RAG = Retrieval + Generation. Instead of relying solely on the model's training data, retrieve relevant context at query time and include it in the prompt. This reduces hallucination and enables access to private/current data.

RAG Pipeline

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Query     │────▶│   Embed     │────▶│  Retrieve   │────▶│   Augment   │
│  "How do I  │     │  Query to   │     │  Top-K      │     │  Add to     │
│   deploy?"  │     │  Vector     │     │  Documents  │     │  Prompt     │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
                                                                   │
                                                                   ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Answer    │◀────│  Generate   │◀────│   Format    │◀────│  Context    │
│  Grounded   │     │  With LLM   │     │   Prompt    │     │  + Query    │
│  Response   │     │             │     │             │     │             │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
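The flow in the diagram can be sketched end to end with a toy setup. This is a minimal illustration, not a production design: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, the documents are made up, and the generation step is left out because it depends on the LLM client you use.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, k=2):
    """Embed the query and return the k most similar documents."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, context_docs):
    """Augment the query with the retrieved context."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"

documents = [
    "To deploy the service run the deploy script from the release branch",
    "The cafeteria menu changes every Monday",
    "Rollback a deploy by re-running the previous release tag",
]
top_docs = retrieve("How do I deploy the service", documents, k=2)
prompt = build_prompt("How do I deploy the service", top_docs)
# The prompt would now be sent to an LLM to produce the grounded answer.
```

In a real system the embeddings of the documents would be computed once at indexing time and stored, rather than recomputed per query as in this sketch.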

Indexing Pipeline

Document Processing

┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│    Load      │────▶│    Clean     │────▶│    Chunk     │────▶│    Embed     │
│  Documents   │     │  & Parse     │     │   Content    │     │   Chunks     │
└──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
                                                                      │
                                                                      ▼
                                                               ┌──────────────┐
                                                               │    Store     │
                                                               │  in Vector   │
                                                               │     DB       │
                                                               └──────────────┘

Chunking Strategies

| Strategy | Description | Best For |
| --- | --- | --- |
| Fixed Size | Split every N tokens/chars | Simple, predictable |
| Sentence | Split on sentence boundaries | Natural breaks |
| Paragraph | Split on paragraph breaks | Coherent units |
| Semantic | Split on topic changes | Meaningful segments |
| Recursive | Try large, fall back to smaller | Mixed content |
| Document | Keep whole documents | Short docs |

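# Recursive chunking example
# Requires the langchain-text-splitters package (older releases: langchain.text_splitter)
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # Measured in characters by default, not tokens
    chunk_overlap=200,  # Overlap prevents losing context at boundaries
    separators=["\n\n", "\n", ". ", " ", ""]  # Tried in order, largest unit first
)

# documents: a list of LangChain Document objects produced by your loader
chunks = splitter.split_documents(documents)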
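For comparison, the Fixed Size strategy from the table can be sketched without any library. This is a minimal character-based version; `chunk_fixed` and its parameter names are illustrative, not part of any package.

```python
def chunk_fixed(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_fixed(sample, chunk_size=10, chunk_overlap=3)
# pieces == ["abcdefghij", "hijklmnopq", "opqrstuvwx", "vwxyz"]
```

Each chunk repeats the last three characters of the previous one, which is the same idea as `chunk_overlap` in the recursive splitter.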

Chunk Size Tradeoffs

| Size | Pros | Cons / Notes |
| --- | --- | --- |
| Small (100-500) | Precise retrieval | May lose context |
| Medium (500-1500) | Balanced | Good default |
| Large (1500-3000) | Full context | Less precise, costly |

Rule of thumb: a chunk should contain enough context to be useful on its own.

Embedding Models

Model Comparison

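| Model | Dimensions | Speed | Quality | Cost |
| --- | --- | --- | --- | --- |
| text-embedding-3-small | 1536 | Fast | Good | Low |
| text-embedding-3-large | 3072 | Medium | Best | Medium |
| text-embedding-ada-002 | 1536 | Fast | Good | Low |
| Cohere embed-v3 | 1024 | Fast | Good | Low |
| BGE-large | 1024 | Local | Good | Free |
| E5-large | 1024 | Local | Good | Free |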

Embedding Best Practices

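# Normalize embeddings for cosine similarity
import numpy as np

def normalize(embedding):
    embedding = np.asarray(embedding, dtype=float)
    norm = np.linalg.norm(embedding)
    # Guard against division by zero for an all-zero vector
    return embedding / norm if norm > 0 else embedding

# Batch embeddings for efficiency
# embed_model is your embedding client; chunks here is a list of chunk texts
embeddings = embed_model.embed_documents(chunks)  # Not one at a time

# Cache embeddings - don't re-embed unchanged content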
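The reason normalization matters: once vectors have unit length, a plain dot product equals cosine similarity, so the vector store can use the cheaper inner-product metric. A small pure-Python check, with made-up two-dimensional vectors:

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]

cosine_score = cosine(a, b)
dot_of_normalized = dot(l2_normalize(a), l2_normalize(b))
# Both values are 0.96 up to floating-point rounding
```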

Vector Databases

Options

| Database | Type | Strengths | Use Case |
| --- | --- | --- | --- |
| Pinecone | Managed | Easy, scalable | Production |
| Weaviate | Managed/Self | Hybrid search | Enterprise |
| Qdrant | Self-hosted | Performance | Privacy-sensitive |
| Chroma | Embedded | Simple, local | Prototyping |
| pgvector | PostgreSQL ext | SQL + vectors | Existing Postgres |
| Azure AI Search | Managed | M365 integration | Azure ecosystem |
| FAISS | Library | Fast, offline | Local/research |

Index Types

| Index | Speed | Accuracy | Memory |
| --- | --- | --- | --- |
| Flat (exact) | Slow | 100% | High |
| IVF | Fast | ~95% | Medium |
| HNSW | Very fast | ~98% | High |
| PQ | Very fast | ~90% | Low |
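The accuracy column is best read as recall against exact search, and the figures are rough. A flat index is simple enough to sketch directly, and it doubles as the ground truth when measuring how much an approximate index misses. The function names and sample vectors below are illustrative assumptions.

```python
import heapq

def flat_search(query, vectors, k=3):
    """Exact top-k by inner product: scores every stored vector."""
    scored = (
        (sum(q * v for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    )
    return [idx for _, idx in heapq.nlargest(k, scored)]

def recall_vs_exact(approx_ids, exact_ids):
    """Fraction of the exact top-k that an approximate index also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
exact = flat_search([1.0, 0.0], vectors, k=2)

# Pretend an approximate index returned ids 0 and 3 for the same query
approx_recall = recall_vs_exact([0, 3], exact)
```

Running this comparison on a sample of real queries gives a measured recall figure for your own data, which is more reliable than the typical values in the table.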

Retrieval Strategies

Basic Retrieval

# Simple top-k retrieval
results = vector_store.similarity_search(query, k=5)

Hybrid Search

Combine semantic (vector) with keyword (BM25) search:

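# Weighted Reciprocal Rank Fusion (RRF)
# vector_search and bm25_search are placeholders for your own retrievers;
# each returns documents (with an .id attribute) ordered best-first.
RRF_K = 60  # Smoothing constant commonly used with RRF

def hybrid_search(query, k=5, alpha=0.5):
    # Over-fetch from both retrievers so fusion has enough candidates
    semantic_results = vector_search(query, k=k * 2)
    keyword_results = bm25_search(query, k=k * 2)

    # Fuse rankings: each list adds weight / (RRF_K + rank), with rank starting at 1
    scores = {}
    for rank, doc in enumerate(semantic_results, start=1):
        scores[doc.id] = scores.get(doc.id, 0.0) + alpha / (RRF_K + rank)
    for rank, doc in enumerate(keyword_results, start=1):
        scores[doc.id] = scores.get(doc.id, 0.0) + (1 - alpha) / (RRF_K + rank)

    # Returns (doc_id, fused_score) pairs, highest score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]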

Reranking

Two-stage retrieval for better precision:

from sentence_transformers import CrossEncoder

# Stage 1: Fast retrieval (get candidates)
candidates = vector_store.similarity_search(query, k=20)

# Stage 2: Rerank with cross-encoder (more accurate, but slower per pair)
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
# LangChain Document objects expose their text as page_content
scores = reranker.predict([(query, doc.page_content) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)[:5]

Query Transformation

# Hypothetical Document Embedding (HyDE)
# llm.generate is a placeholder assumed to return the completion as a string.
def hyde_search(query, k=5):
    # Generate hypothetical answer
    hypothetical = llm.generate(f"Write a passage that answers: {query}")
    # Search using the hypothetical (often a closer match to stored passages)
    return vector_store.similarity_search(hypothetical, k=k)

# Multi-query retrieval
def multi_query_search(query, k=3):
    # Generate query variations, one per line
    response = llm.generate(f"Generate 3 different ways to ask, one per line: {query}")
    variations = [line.strip() for line in response.splitlines() if line.strip()]
    # Search with each variation and drop duplicate chunks
    seen, results = set(), []
    for q in variations:
        for doc in vector_store.similarity_search(q, k=k):
            if doc.page_content not in seen:
                seen.add(doc.page_content)
                results.append(doc)
    return results

Prompt Augmentation

Basic RAG Prompt

Use the following context to answer the question. If the context doesn't
contain the answer, say "I don't have information about that."

Context:
{retrieved_documents}

Question: {user_query}

Answer:

Structured RAG Prompt

You are answering questions based on the provided documentation.

RULES:
1. Only use information from the provided context
2. Quote relevant passages when possible
3. If the context doesn't contain the answer, say so
4. If information is partial, acknowledge limitations

CONTEXT:
---
Source: {doc1.source}
{doc1.content}
---
Source: {doc2.source}
{doc2.content}
---

QUESTION: {query}

Provide your answer with citations to the source documents.
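Filling this template for a variable number of documents is usually done in code. A minimal sketch, assuming each retrieved document is a dict with `source` and `content` keys (a hypothetical shape; adapt it to whatever your retriever returns):

```python
RULES = """RULES:
1. Only use information from the provided context
2. Quote relevant passages when possible
3. If the context doesn't contain the answer, say so
4. If information is partial, acknowledge limitations"""

def build_structured_prompt(query, docs):
    """Fill the structured template with any number of retrieved documents."""
    blocks = [f"Source: {doc['source']}\n{doc['content']}" for doc in docs]
    # Separate documents with '---' lines, as in the template
    context = "---\n" + "\n---\n".join(blocks) + "\n---"
    return (
        "You are answering questions based on the provided documentation.\n\n"
        f"{RULES}\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"QUESTION: {query}\n\n"
        "Provide your answer with citations to the source documents."
    )

docs = [
    {"source": "deploy.md", "content": "Run the deploy script from the release branch."},
    {"source": "rollback.md", "content": "Re-run the previous release tag to roll back."},
]
prompt = build_structured_prompt("How do I deploy?", docs)
```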

Citation Handling

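# Include source metadata
context = ""
for i, doc in enumerate(retrieved_docs, start=1):
    # LangChain Document objects expose their text as page_content
    context += f"[{i}] Source: {doc.metadata['source']}\n{doc.page_content}\n\n"

# Prompt for citations (prompt is the RAG prompt string built from context and the query)
prompt += "\nCite sources using [1], [2], etc."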

Advanced Patterns

Parent Document Retrieval

Store small chunks for retrieval, but return larger parent context:

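# Index small chunks (e.g., 200 tokens)
# But store mapping to parent (e.g., full section)
# doc_store is a placeholder key-value store keyed by parent_id.

def retrieve_with_parent(query):
    small_chunks = vector_store.similarity_search(query, k=3)
    # dict.fromkeys de-duplicates parent IDs while keeping retrieval order
    parent_ids = dict.fromkeys(chunk.metadata['parent_id'] for chunk in small_chunks)
    return [doc_store.get(pid) for pid in parent_ids]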
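The indexing side of this pattern could look like the sketch below. It splits on whitespace as a rough stand-in for tokens, and `index_with_parents` and the dict layout are illustrative assumptions, not a library API.

```python
def index_with_parents(sections, child_size=200):
    """Split each parent section into small child chunks that remember their parent."""
    doc_store = {}     # parent_id -> full parent text
    child_chunks = []  # small chunks to embed and put in the vector store
    for parent_id, text in enumerate(sections):
        doc_store[parent_id] = text
        words = text.split()
        for start in range(0, len(words), child_size):
            child_chunks.append({
                "text": " ".join(words[start:start + child_size]),
                "metadata": {"parent_id": parent_id},
            })
    return doc_store, child_chunks

sections = ["alpha beta gamma delta epsilon", "zeta eta"]
doc_store, child_chunks = index_with_parents(sections, child_size=2)
```

Only the child chunks are embedded; the parent text is fetched by ID after retrieval, so the LLM sees the full section while the similarity search stays precise.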

Self-Query Retrieval

Let the LLM translate the user's question into a metadata filter:

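# User: "What did we decide about authentication in 2024?"
# LLM generates: {"filter": {"year": 2024, "topic": "authentication"}}

# Import paths match classic LangChain releases and may differ in newer versions.
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever

retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    "Meeting notes and decisions",  # Description of the document contents
    [
        # Field descriptions are illustrative; the LLM uses them to build filters
        AttributeInfo(name="year", description="Year of the meeting", type="integer"),
        AttributeInfo(name="topic", description="Main topic discussed", type="string"),
    ],
)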

Agentic RAG

Let an agent decide when/what to retrieve:

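# Tool and Agent are placeholders for your agent framework's classes;
# search_function, web_search and code_search are your own retrieval callables.
tools = [
    Tool(name="search_docs", description="Search internal documentation", func=search_function),
    Tool(name="search_web", description="Search the web for current info", func=web_search),
    Tool(name="search_code", description="Search codebase", func=code_search),
]

agent = Agent(
    llm=llm,
    tools=tools,
    system_prompt="Decide which sources to search based on the question."
)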

Evaluation Metrics

Retrieval Quality

| Metric | Measures | Formula |
| --- | --- | --- |
| Recall@K | Found relevant docs | Relevant in top-K / Total relevant |
| Precision@K | Top-K accuracy | Relevant in top-K / K |
| MRR | Rank of first relevant | 1 / rank of first relevant |
| NDCG | Ranking quality | Normalized discounted cumulative gain |
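These formulas translate directly into code. The sketch below assumes binary relevance (a document is either relevant or not) and uses made-up document IDs; MRR is the mean of `reciprocal_rank` over all evaluation queries.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Relevant documents found in the top k, divided by all relevant documents."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved, relevant, k):
    """Relevant documents found in the top k, divided by k."""
    return len(set(retrieved[:k]) & set(relevant)) / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

retrieved = ["d3", "d1", "d7", "d2"]  # ranked results for one query
relevant = {"d1", "d2", "d5"}         # labeled ground truth for that query

recall = recall_at_k(retrieved, relevant, k=3)        # 1 of 3 relevant docs found
precision = precision_at_k(retrieved, relevant, k=3)  # 1 of 3 results relevant
rr = reciprocal_rank(retrieved, relevant)             # first hit at rank 2
ndcg = ndcg_at_k(retrieved, relevant, k=4)
```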

Generation Quality

| Metric | Measures | How |
| --- | --- | --- |
| Faithfulness | Grounded in context | Check claims against sources |
| Relevance | Answers the question | Human evaluation |
| Completeness | Covers all aspects | Human evaluation |
| Hallucination rate | Made-up facts | Compare to source docs |

RAG Evaluation Tools

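# Using ragas library
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# dataset holds, per sample, the question, the generated answer, the retrieved
# contexts and a reference answer; exact column names depend on the ragas version.
# The metrics call an LLM, so a model/API key must be configured.
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision]
)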

Common Pitfalls

| Pitfall | Symptom | Solution |
| --- | --- | --- |
| Chunks too small | Answers lack context | Increase chunk size or use parent retrieval |
| Chunks too large | Irrelevant content included | Decrease size, improve chunking |
| Wrong K value | Too much/little context | Tune K based on evaluation |
| No metadata | Can't filter results | Add source, date, topic metadata |
| Stale index | Outdated answers | Implement refresh pipeline |
| Ignoring retrieved context | Hallucinations | Improve prompt, lower temperature |
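For the stale-index pitfall, one common way to build a refresh pipeline is to store a content hash per document at index time and re-embed only what changed. A minimal sketch; `plan_refresh` and the dict shapes are assumptions for illustration.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_refresh(current_docs, indexed_hashes):
    """Compare current documents with the hashes recorded at index time.

    current_docs:   {doc_id: text} as it exists in the source system now
    indexed_hashes: {doc_id: hash} recorded when the index was last built
    """
    to_upsert = [
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in current_docs]
    return to_upsert, to_delete

current_docs = {"a": "hello", "b": "changed text", "c": "brand new"}
indexed_hashes = {
    "a": content_hash("hello"),
    "b": content_hash("old text"),
    "d": content_hash("removed"),
}
to_upsert, to_delete = plan_refresh(current_docs, indexed_hashes)
```

Unchanged documents ("a" here) are skipped, which also avoids paying to re-embed them.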

Production Considerations

Caching

# Cache embeddings
embedding_cache = {}
def get_embedding(text):
    if text not in embedding_cache:
        embedding_cache[text] = embed_model.embed(text)
    return embedding_cache[text]

# Cache frequent queries
from functools import lru_cache

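@lru_cache(maxsize=1000)
def cached_search(query):
    # lru_cache keys on the query string itself; return a tuple so callers
    # cannot mutate the cached result
    return tuple(vector_store.search(query, k=5))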

Monitoring

# Log retrieved documents and latency for each query
import logging

logger = logging.getLogger(__name__)

logger.info({
    "query": query,
    "retrieved_docs": [d.id for d in results],
    "retrieval_time_ms": elapsed,
    "rerank_time_ms": rerank_elapsed,
    "total_time_ms": total_elapsed
})
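The timing fields in that log entry have to come from somewhere. One way to collect them is a small context manager around each stage; `timed` and the `metrics` dict are hypothetical helpers, and the loop body stands in for the real retrieval call.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(metrics, key):
    """Record the wall-clock duration of the enclosed block in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        metrics[key] = (time.perf_counter() - start) * 1000.0

metrics = {}
with timed(metrics, "retrieval_time_ms"):
    # Placeholder workload; the vector store query would go here
    total = sum(range(10_000))
```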

Cost Optimization

| Optimization | Savings |
| --- | --- |
| Batch embeddings | API calls |
| Cache frequent queries | Compute + API |
| Use smaller embedding model | API cost |
| Compress vectors (PQ) | Storage |
| Filter before semantic search | Compute |
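Filtering before semantic search means narrowing the candidates by metadata first and scoring only what survives. The sketch below uses an in-memory list with made-up records; most vector databases expose the same idea as a filter argument on the query, with syntax that varies by product.

```python
def filtered_search(query_vec, records, metadata_filter, k=3):
    """Apply an exact-match metadata filter, then rank survivors by inner product."""
    candidates = [
        record for record in records
        if all(record["metadata"].get(key) == value
               for key, value in metadata_filter.items())
    ]
    candidates.sort(
        key=lambda record: sum(q * v for q, v in zip(query_vec, record["vector"])),
        reverse=True,
    )
    return candidates[:k]

records = [
    {"id": "a", "vector": [1.0, 0.0], "metadata": {"year": 2024}},
    {"id": "b", "vector": [0.9, 0.1], "metadata": {"year": 2023}},
    {"id": "c", "vector": [0.5, 0.5], "metadata": {"year": 2024}},
]
hits = filtered_search([1.0, 0.0], records, {"year": 2024}, k=2)
hit_ids = [record["id"] for record in hits]
```

Record "b" scores well on similarity but is never scored at all, because the year filter removes it first.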

Synapses

See synapses.json for connections.
