ondevice-rag-engine
Build the on-device RAG (Retrieval-Augmented Generation) engine using NaturalLanguage framework embeddings, vector storage in SwiftData, and Accelerate-powered similarity search. This skill should be used when implementing semantic search, embedding generation, chunking, indexing, or vector similarity for the planner app.
On-Device RAG Engine
Overview
Implement a fully on-device retrieval-augmented generation pipeline. The RAG engine generates sentence embeddings using Apple's NaturalLanguage framework, stores vectors in SwiftData, and performs cosine similarity search using the Accelerate framework. No data leaves the device.
Embedding Generation
NLEmbedding Setup
import NaturalLanguage
final class EmbeddingService {
    private let embedding: NLEmbedding?
    init() {
        self.embedding = NLEmbedding.sentenceEmbedding(for: .english)
    }
    var isAvailable: Bool { embedding != nil }
    func generateVector(for text: String) -> [Double]? {
        embedding?.vector(for: text)
    }
    func generateVectors(for texts: [String]) -> [[Double]?] {
        texts.map { embedding?.vector(for: $0) }
    }
}
Critical notes:
- NLEmbedding.sentenceEmbedding(for:) returns nil if the model for that language is not downloaded to the device. Always check isAvailable before use.
- English sentence embeddings produce 512-dimensional vectors.
- All computation is on-device with no network calls.
- The embedding model is loaded once and reused. Do not re-create NLEmbedding instances per query.
Supported Languages
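To check whether a language is supported, call NLEmbedding.sentenceEmbedding(for:) for that language and test the result for nil; NLEmbedding.supportedSentenceEmbeddingRevisions(for:) reports which model revisions exist for a given language. English is the most reliable. For multi-language support, detect the language first with NLLanguageRecognizer, then load the matching model.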
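A minimal sketch of the detect-then-load flow, assuming one cached model per language is acceptable. The class name MultilingualEmbeddingService and its return shape are hypothetical; only NLLanguageRecognizer.dominantLanguage(for:), NLEmbedding.sentenceEmbedding(for:), and vector(for:) come from the framework. The cache is not synchronized, so this version assumes calls from a single thread or actor.

```swift
import Foundation
import NaturalLanguage

// Hypothetical wrapper: detects the language of each text and caches one
// sentence-embedding model per language.
final class MultilingualEmbeddingService {
    private var cache: [NLLanguage: NLEmbedding] = [:]

    // Returns the detected language together with the vector, or nil when
    // the text is blank, the language is unknown, or no model is available.
    func vector(for text: String) -> (language: NLLanguage, vector: [Double])? {
        let trimmed = text.trimmingCharacters(in: .whitespacesAndNewlines)
        guard !trimmed.isEmpty,
              let language = NLLanguageRecognizer.dominantLanguage(for: trimmed),
              let embedding = embedding(for: language),
              let vector = embedding.vector(for: trimmed) else { return nil }
        return (language, vector)
    }

    // Loads the model for a language once and reuses it afterwards.
    private func embedding(for language: NLLanguage) -> NLEmbedding? {
        if let cached = cache[language] { return cached }
        guard let loaded = NLEmbedding.sentenceEmbedding(for: language) else { return nil }
        cache[language] = loaded
        return loaded
    }
}
```

Vectors produced by models for different languages should not be assumed comparable, so a reasonable design is to store the detected language next to each embedding and only score a query against records in the same language.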
Chunking Strategy
Meeting data must be split into chunks optimized for sentence embeddings. Chunks that are too long lose specificity; chunks that are too short lose context.
Meeting Record Chunking
For each MeetingRecord, generate these chunks:
- Title + Date: "Meeting: {title} on {formatted date}"
- Attendees: "Attendees: {comma-separated names}"
- Purpose: "Purpose: {purpose}" (if not nil)
- Outcomes: "Outcomes: {outcomes}" (if not nil)
- Action Items: "Action items: {actionItems}" (if not nil)
- Location: "Location: {location}" (if not nil)
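The chunk templates above could be produced along these lines. This is a sketch under stated assumptions: MeetingFields is a hypothetical plain struct standing in for the text fields of MeetingRecord (the real SwiftData model is not shown in this document), and the long date style is an arbitrary choice for {formatted date}.

```swift
import Foundation

// Hypothetical stand-in for the text fields of a MeetingRecord.
struct MeetingFields {
    var title: String
    var date: Date
    var attendees: [String]
    var purpose: String?
    var outcomes: String?
    var actionItems: String?
    var location: String?
}

func generateMeetingChunks(_ meeting: MeetingFields) -> [String] {
    let formatter = DateFormatter()
    formatter.dateStyle = .long   // assumption: any human-readable style works
    formatter.timeStyle = .none

    var chunks = ["Meeting: \(meeting.title) on \(formatter.string(from: meeting.date))"]
    if !meeting.attendees.isEmpty {
        chunks.append("Attendees: \(meeting.attendees.joined(separator: ", "))")
    }
    // Optional fields only produce a chunk when they hold non-blank text.
    let optionalFields: [(label: String, value: String?)] = [
        ("Purpose", meeting.purpose),
        ("Outcomes", meeting.outcomes),
        ("Action items", meeting.actionItems),
        ("Location", meeting.location),
    ]
    for (label, value) in optionalFields {
        guard let text = value?.trimmingCharacters(in: .whitespacesAndNewlines),
              !text.isEmpty else { continue }
        chunks.append("\(label): \(text)")
    }
    return chunks
}
```

Skipping blank strings as well as nil values avoids indexing label-only chunks such as "Purpose: ", which would otherwise match many unrelated queries.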
Note Chunking
For each Note, split plainText into chunks:
- Split on paragraph boundaries (double newline)
- If a paragraph exceeds 200 words, split on sentence boundaries
- Prepend the note title to the first chunk: "Note: {title}. {first paragraph}"
- Target chunk size: 50-200 words (sweet spot for sentence embeddings)
func chunkText(_ text: String, title: String) -> [String] {
    let maxWords = 200
    // Count words on any whitespace, so single newlines inside a paragraph are handled
    func wordCount(_ s: String) -> Int { s.split(whereSeparator: \.isWhitespace).count }
    let paragraphs = text.components(separatedBy: "\n\n")
        .map { $0.trimmingCharacters(in: .whitespacesAndNewlines) }
        .filter { !$0.isEmpty }
    var chunks: [String] = []
    for (index, paragraph) in paragraphs.enumerated() {
        // Only the first chunk of the note carries the title
        let prefix = index == 0 ? "Note: \(title). " : ""
        if wordCount(paragraph) > maxWords {
            // Split on sentence boundaries (". " is a rough heuristic)
            let sentences = paragraph.components(separatedBy: ". ")
            var current = ""
            var isFirstChunk = true
            for sentence in sentences {
                if !current.isEmpty && wordCount(current) + wordCount(sentence) > maxWords {
                    // Restore the period that the split removed
                    chunks.append((isFirstChunk ? prefix : "") + current + ".")
                    isFirstChunk = false
                    current = sentence
                } else {
                    current += (current.isEmpty ? "" : ". ") + sentence
                }
            }
            if !current.isEmpty { chunks.append((isFirstChunk ? prefix : "") + current) }
        } else {
            chunks.append(prefix + paragraph)
        }
    }
    return chunks
}
Vector Storage
Data Conversion
Store [Double] as Data in SwiftData's EmbeddingRecord:
extension Array where Element == Double {
    var asData: Data {
        withUnsafeBytes { Data($0) }
    }
}
extension Data {
    var asDoubleArray: [Double] {
        withUnsafeBytes { Array($0.bindMemory(to: Double.self)) }
    }
}
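The bindMemory-based decoder above relies on the Data buffer being suitably aligned for Double and on the blob holding a whole number of Doubles. A more defensive variant, sketched here with hypothetical free functions encodeVector and decodeVector, copies the bytes into the array's own storage and rejects blobs of the wrong length:

```swift
import Foundation

// Hypothetical helpers; not part of the extensions above.
func encodeVector(_ vector: [Double]) -> Data {
    vector.withUnsafeBytes { Data($0) }
}

func decodeVector(_ data: Data) -> [Double]? {
    let stride = MemoryLayout<Double>.stride
    // Reject blobs that cannot be a whole number of Doubles.
    guard data.count % stride == 0 else { return nil }
    var vector = [Double](repeating: 0, count: data.count / stride)
    // Copy into the array's own storage, which is correctly aligned
    // regardless of how the Data buffer (or slice) is laid out.
    let copied = vector.withUnsafeMutableBufferPointer { data.copyBytes(to: $0) }
    return copied == data.count ? vector : nil
}
```

Both versions store Doubles in the host's native byte order, which is fine for data that never leaves the device.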
Indexing Pipeline
func indexMeetingRecord(_ record: MeetingRecord, context: ModelContext) async {
    // Delete existing embeddings for this record
    let existingEmbeddings = record.embeddings
    existingEmbeddings.forEach { context.delete($0) }
    // Generate chunks
    let chunks = generateMeetingChunks(record)
    // Generate and store embeddings
    for (index, chunk) in chunks.enumerated() {
        guard let vector = embeddingService.generateVector(for: chunk) else { continue }
        let embedding = EmbeddingRecord()
        embedding.vectorData = vector.asData
        embedding.sourceText = chunk
        embedding.sourceType = "meeting"
        embedding.chunkIndex = index
        embedding.createdAt = Date()
        embedding.meetingRecord = record
        context.insert(embedding)
    }
}
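Run indexing in a background task to avoid blocking the UI. Model objects belong to the ModelContext that fetched them, so pass the record's persistentModelID into the task and re-fetch the record in the background context:
let recordID = record.persistentModelID
Task.detached(priority: .utility) {
    let context = ModelContext(modelContainer)
    // Re-fetch here; the original record is bound to a different context
    guard let record = context.model(for: recordID) as? MeetingRecord else { return }
    await indexMeetingRecord(record, context: context)
    try? context.save()
}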
Similarity Search
Cosine Similarity with Accelerate
Use vDSP for fast vector operations instead of naive loops:
import Accelerate
func cosineSimilarity(_ a: [Double], _ b: [Double]) -> Double {
    precondition(a.count == b.count)
    var dotProduct: Double = 0
    var normA: Double = 0
    var normB: Double = 0
    vDSP_dotprD(a, 1, b, 1, &dotProduct, vDSP_Length(a.count))
    vDSP_dotprD(a, 1, a, 1, &normA, vDSP_Length(a.count))
    vDSP_dotprD(b, 1, b, 1, &normB, vDSP_Length(b.count))
    let denominator = sqrt(normA) * sqrt(normB)
    guard denominator > 0 else { return 0 }
    return dotProduct / denominator
}
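The function above computes cos(a, b) = (a · b) / (‖a‖ ‖b‖), using three dot products. For unit tests it can help to have a plain-loop reference to compare against. The sketch below is such a reference, under the hypothetical name referenceCosineSimilarity; it is meant for checking results, not as a replacement for the vDSP version.

```swift
import Foundation

// Plain-loop reference for validating the Accelerate implementation.
func referenceCosineSimilarity(_ a: [Double], _ b: [Double]) -> Double {
    precondition(a.count == b.count, "vectors must have equal length")
    var dot = 0.0, normA = 0.0, normB = 0.0
    for (x, y) in zip(a, b) {
        dot += x * y      // a · b
        normA += x * x    // ‖a‖², square root taken below
        normB += y * y    // ‖b‖²
    }
    let denominator = normA.squareRoot() * normB.squareRoot()
    // A zero vector has no direction; return 0 as the vDSP version does.
    return denominator > 0 ? dot / denominator : 0
}
```

A test can then assert that cosineSimilarity and this reference agree within a small tolerance on a handful of random vectors.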
Search Implementation
struct SearchResult {
    let embeddingRecord: EmbeddingRecord
    let score: Double
}
func search(query: String, topK: Int = 5, context: ModelContext) -> [SearchResult] {
    guard let queryVector = embeddingService.generateVector(for: query) else { return [] }
    // Fetch all embeddings
    let descriptor = FetchDescriptor<EmbeddingRecord>()
    guard let allEmbeddings = try? context.fetch(descriptor) else { return [] }
    // Score each embedding
    let scored = allEmbeddings.compactMap { record -> SearchResult? in
        let vector = record.vectorData.asDoubleArray
        guard vector.count == queryVector.count else { return nil }
        let score = cosineSimilarity(queryVector, vector)
        return SearchResult(embeddingRecord: record, score: score)
    }
    // Return top-K sorted by descending score
    return Array(scored.sorted { $0.score > $1.score }.prefix(topK))
}
Performance Optimization
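For large datasets (1000+ embeddings), fetching and decoding every vector on each query is expensive: a 512-dimensional [Double] vector occupies 4 KB, so 1,000 embeddings means about 4 MB read per search. Optimize: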
- Pre-filter by date range: Add a date predicate to the fetch to limit candidates
- Batch similarity: Load vectors in batches of 100, compute similarities, keep running top-K
- Minimum threshold: Skip results below 0.3 similarity (unlikely to be relevant)
- Cache query vectors: If the same query is run repeatedly, cache its embedding
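The batching and threshold ideas above can be combined in a small running top-K accumulator. The sketch below is one way to do it, not the engine's actual code: ScoredCandidate, TopK, and topCandidates are hypothetical names, and the candidates are assumed to be already scored. In a real pipeline each batch would come from its own fetch (for example via FetchDescriptor's fetchLimit and fetchOffset) and be scored before insertion.

```swift
import Foundation

// Hypothetical candidate: an identifier plus an already-computed score.
struct ScoredCandidate {
    let id: Int
    let score: Double
}

// Keeps the K best candidates seen so far, sorted by descending score.
struct TopK {
    let k: Int
    let minimumScore: Double
    private(set) var items: [ScoredCandidate] = []

    init(k: Int, minimumScore: Double = 0.3) {
        self.k = k
        self.minimumScore = minimumScore
    }

    mutating func insert(_ candidate: ScoredCandidate) {
        // Drop anything below the relevance threshold.
        guard k > 0, candidate.score >= minimumScore else { return }
        // Once full, ignore candidates no better than the current worst.
        if items.count == k, let worst = items.last, candidate.score <= worst.score { return }
        let index = items.firstIndex { $0.score < candidate.score } ?? items.count
        items.insert(candidate, at: index)
        if items.count > k { items.removeLast() }
    }
}

// Walks the candidates in fixed-size batches so that only one batch
// needs to be resident at a time.
func topCandidates(from all: [ScoredCandidate], k: Int, batchSize: Int = 100) -> [ScoredCandidate] {
    precondition(batchSize > 0, "batchSize must be positive")
    var best = TopK(k: k)
    var start = 0
    while start < all.count {
        let end = min(start + batchSize, all.count)
        for candidate in all[start..<end] { best.insert(candidate) }
        start = end
    }
    return best.items
}
```

With K small (5 by default in search), the linear insert is cheap, and memory use is bounded by one batch plus K results rather than the whole index.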
Re-indexing
Trigger re-indexing when:
- A MeetingRecord is created or its text fields are updated
- A Note is saved with new or changed plainText
- A MeetingRecord is deleted (delete its EmbeddingRecord entries via cascade)
Avoid re-indexing unchanged records. Compare a hash of the concatenated chunk source text against a stored hash before regenerating.
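One point to watch when storing such a hash: Swift's Hasher and hashValue are seeded randomly per process, so their output cannot be compared across launches. The sketch below uses 64-bit FNV-1a as a simple stable alternative; chunkFingerprint and needsReindex are hypothetical names, and where the fingerprint is stored (for example a string property on the record) is an assumption. A SHA-256 digest from CryptoKit would be a more conventional choice if collision resistance matters.

```swift
import Foundation

// Stable 64-bit FNV-1a fingerprint over the UTF-8 bytes of the chunks.
func chunkFingerprint(_ chunks: [String]) -> String {
    var hash: UInt64 = 0xcbf29ce484222325          // FNV offset basis
    // Join with the ASCII unit separator so ["ab", "c"] and ["a", "bc"]
    // do not produce the same byte stream.
    for byte in chunks.joined(separator: "\u{1F}").utf8 {
        hash ^= UInt64(byte)
        hash = hash &* 0x100000001b3               // FNV prime, wrapping multiply
    }
    return String(hash, radix: 16)
}

// True when the chunks differ from what was indexed last time,
// or when nothing has been indexed yet.
func needsReindex(chunks: [String], storedFingerprint: String?) -> Bool {
    chunkFingerprint(chunks) != storedFingerprint
}
```

The indexing pipeline would compute the chunks, call needsReindex, and only regenerate embeddings and update the stored fingerprint when it returns true.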
Query Pipeline Summary
- User enters a natural language question
- Embed the question using NLEmbedding
- Fetch candidate EmbeddingRecord entries (optionally filtered by date/type)
- Compute cosine similarity using Accelerate
- Return top-K results with source text previews and links to source MeetingRecord or Note
- Optionally pass results to the summarization service for a synthesized answer