Agent Skill
2/7/2026

mteb-retrieve

This skill provides guidance for semantic similarity retrieval tasks using embedding models (e.g., MTEB benchmarks, document ranking). It should be used when computing embeddings for documents/queries, ranking documents by similarity, or identifying top-k similar items. Covers data preprocessing, model selection, similarity computation, and result verification.

L
letta
48GitHub Stars
1Views
npx skills add letta-ai/skills

SKILL.md

Namemteb-retrieve
DescriptionThis skill provides guidance for semantic similarity retrieval tasks using embedding models (e.g., MTEB benchmarks, document ranking). It should be used when computing embeddings for documents/queries, ranking documents by similarity, or identifying top-k similar items. Covers data preprocessing, model selection, similarity computation, and result verification.

name: mteb-retrieve description: This skill provides guidance for semantic similarity retrieval tasks using embedding models (e.g., MTEB benchmarks, document ranking). It should be used when computing embeddings for documents/queries, ranking documents by similarity, or identifying top-k similar items. Covers data preprocessing, model selection, similarity computation, and result verification.

MTEB Retrieve

Overview

This skill guides semantic similarity retrieval tasks where documents must be ranked by their similarity to a query using embedding models. These tasks typically involve loading documents, computing embeddings, calculating similarity scores, and identifying documents at specific ranks.

Workflow

Step 1: Data Inspection and Preprocessing

Before computing embeddings, thoroughly inspect the input data format:

  1. Examine raw file contents - Read a sample of lines to understand the actual format
  2. Identify formatting artifacts - Look for:
    • Line number prefixes (e.g., 1→, 2→, 11→)
    • Index markers or delimiters
    • Whitespace padding or alignment characters
    • Header rows or metadata lines
  3. Clean the data - Remove any non-semantic content:
    • Strip line numbers and prefixes using regex (e.g., re.sub(r'^\s*\d+→', '', line))
    • Remove leading/trailing whitespace
    • Filter empty lines
  4. Validate preprocessing - Print sample cleaned documents to verify they contain only semantic content

Example preprocessing pattern:

import re

def clean_line(line):
    # Remove line number prefix like "  1→" or "11→"
    cleaned = re.sub(r'^\s*\d+[→\t]', '', line)
    return cleaned.strip()

documents = [clean_line(line) for line in raw_lines if clean_line(line)]

Step 2: Model Selection

Select an appropriate embedding model for the content language and domain:

  1. Check model language - Models often have language indicators in their names:
    • zh = Chinese (e.g., bge-small-zh-v1.5)
    • en = English (e.g., bge-small-en-v1.5)
    • No suffix often means multilingual or English
  2. Match model to content - Using a Chinese-optimized model for English text (or vice versa) produces suboptimal embeddings
  3. Consider model size - Larger models generally produce better embeddings but are slower

Step 3: Embedding Computation

When computing embeddings:

  1. Normalize embeddings - Use normalize_embeddings=True to enable cosine similarity via dot product
  2. Batch processing - For large document sets, process in batches to manage memory
  3. Verify dimensions - Confirm embedding dimensions match expectations for the model
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('model-name')
doc_embeddings = model.encode(documents, normalize_embeddings=True)
query_embedding = model.encode(query, normalize_embeddings=True)

Step 4: Similarity Computation and Ranking

  1. Compute similarities - Use dot product for normalized embeddings (equivalent to cosine similarity)
  2. Handle ties - Be aware that identical similarity scores produce arbitrary ordering
  3. Use correct indexing - For k-th highest, use index k-1 after sorting in descending order
import numpy as np

similarities = np.dot(doc_embeddings, query_embedding)
sorted_indices = np.argsort(similarities)[::-1]  # Descending order

# For 5th highest: index 4 (0-indexed)
fifth_highest_idx = sorted_indices[4]
fifth_highest_doc = documents[fifth_highest_idx]

Step 5: Result Verification

Before writing final results, verify correctness:

  1. Print document count - Confirm expected number of documents were loaded
  2. Show sample documents - Display first few cleaned documents to verify preprocessing
  3. Display top-k results - Print at least the top 5-10 documents with their similarity scores
  4. Cross-check output format - Ensure the output contains only the semantic content, not formatting artifacts
# Verification checklist
print(f"Total documents: {len(documents)}")
print(f"Sample document: {documents[0][:100]}...")
print("\nTop 10 by similarity:")
for i in range(min(10, len(sorted_indices))):
    idx = sorted_indices[i]
    print(f"  {i+1}. [{similarities[idx]:.4f}] {documents[idx][:50]}...")

Common Pitfalls

Data Format Issues

  • Line number prefixes - Input files often include line numbers (e.g., 1→Text) that corrupt embeddings if not removed
  • Invisible characters - Watch for tabs, non-breaking spaces, or Unicode formatting characters
  • Mixed encodings - Explicitly specify file encoding (encoding='utf-8')

Model Mismatches

  • Language mismatch - Using language-specific models on wrong-language content
  • Version confusion - Ensure model revision matches expected behavior

Indexing Errors

  • Off-by-one errors - k-th highest uses index k-1 in 0-indexed arrays
  • Original vs sorted indices - Track the mapping between sorted positions and original document indices

Verification Gaps

  • No sanity checks - Always verify document count, sample content, and score distribution
  • Missing tie handling - Document when ties exist and how they affect results
Skills Info
Original Name:mteb-retrieveAuthor:letta