Core Concepts
This page explains the fundamental concepts behind ragit and RAG systems in general.
What is RAG?
RAG (Retrieval-Augmented Generation) is a technique that enhances LLM responses by retrieving relevant context from a knowledge base before generating an answer.
The RAG Pipeline
User Question
         |
         v
+------------------+
| 1. Embed Query   |  Convert question to vector
+------------------+
         |
         v
+------------------+
| 2. Retrieve      |  Find similar chunks from index
+------------------+
         |
         v
+------------------+
| 3. Augment       |  Add context to prompt
+------------------+
         |
         v
+------------------+
| 4. Generate      |  LLM produces answer
+------------------+
         |
         v
Final Answer
Why RAG Matters
Current Information: LLMs have knowledge cutoffs; RAG provides up-to-date context
Domain Knowledge: Add specialized knowledge without fine-tuning
Reduced Hallucination: Grounds responses in retrieved facts
Transparency: Know exactly what sources informed the answer
Document Chunking
Documents are split into smaller pieces called “chunks” for efficient retrieval.
Chunk Size
The number of characters in each chunk significantly affects retrieval quality:
Small chunks (128-256 characters): More precise retrieval but may lose context
Medium chunks (512-1024 characters): Good balance of precision and context
Large chunks (2048+ characters): More context but less precise matching
from ragit import chunk_text
# Placeholder input; substitute the text of your own document
text = "Your document text goes here. " * 100
# Small chunks for precise retrieval
small_chunks = chunk_text(text, chunk_size=256, chunk_overlap=25)
# Large chunks for more context
large_chunks = chunk_text(text, chunk_size=1024, chunk_overlap=100)
Chunk Overlap
Overlap ensures information at chunk boundaries isn’t lost:
Without overlap:
[Chunk 1: "The quick brown"][Chunk 2: "fox jumps over"]
With overlap:
[Chunk 1: "The quick brown fox"][Chunk 2: "brown fox jumps over"]
Typical overlap values: 10-20% of chunk size.
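To make the interaction between chunk size and overlap concrete, here is a minimal character-based chunker. It is a sketch for illustration only, not ragit's chunk_text implementation: the name chunk_chars and its handling of the final chunk are assumptions.

```python
def chunk_chars(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows sharing chunk_overlap characters.

    Hypothetical helper for illustration; ragit's chunk_text may behave differently.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sentence = "The quick brown fox jumps over"
without_overlap = chunk_chars(sentence, chunk_size=10, chunk_overlap=0)
with_overlap = chunk_chars(sentence, chunk_size=10, chunk_overlap=4)
```

With an overlap of 4, the last four characters of each chunk reappear at the start of the next one, so a phrase that straddles a boundary is still found intact in at least one chunk. The cost is more chunks to embed and store.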
Embeddings
Embeddings are numerical vector representations of text that capture semantic meaning.
How Embeddings Work
from ragit.providers import OllamaProvider
provider = OllamaProvider()
# Similar sentences have similar embeddings
emb1 = provider.embed("The cat sat on the mat", model="mxbai-embed-large")
emb2 = provider.embed("A feline rested on the rug", model="mxbai-embed-large")
emb3 = provider.embed("Python is a programming language", model="mxbai-embed-large")
# emb1 and emb2 will be similar (both about cats)
# emb3 will be different (about programming)
Embedding Models
ragit supports any embedding model available through Ollama:
mxbai-embed-large: General-purpose, 1024 dimensions
nomic-embed-text: Good for long documents, 768 dimensions
all-minilm: Lightweight, 384 dimensions
Vector Similarity Search
When you ask a question, ragit finds chunks with similar embeddings.
Cosine Similarity
ragit uses cosine similarity to compare embeddings:
similarity = dot(query, document) / (|query| * |document|)
Range: -1 to 1
- 1.0: Vectors point in the same direction (most similar meaning)
- 0.0: Vectors are orthogonal (unrelated)
- -1.0: Vectors point in opposite directions
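The formula above can be written out in plain Python. This is a minimal sketch using only the standard library; the function name cosine_similarity is chosen for illustration and is not taken from ragit's API.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return dot(a, b) / (|a| * |b|) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])       # antiparallel vectors
```

Note that the score depends only on direction: scaling a vector by a positive constant leaves its similarity to any other vector unchanged.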
Pre-normalized Embeddings
ragit pre-normalizes embeddings at index time for fast retrieval:
# Instead of computing full cosine similarity each time:
# similarity = dot(a, b) / (norm(a) * norm(b))
# We normalize once at index time:
# normalized_a = a / norm(a)
# normalized_b = b / norm(b)
# Then similarity is just a dot product:
# similarity = dot(normalized_a, normalized_b)
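This removes both norm computations from every comparison at query time. Each comparison is still one dot product, so its cost grows linearly with the embedding dimension, but no per-vector normalization work remains.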
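The equivalence between the two approaches can be checked numerically. The following is a small standard-library sketch with made-up two-dimensional vectors; the helper names normalize and dot are illustrative and not part of ragit.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale v to unit length (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: list[float], b: list[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


query = [3.0, 4.0]
document = [4.0, 3.0]

# Full cosine similarity: both norms are recomputed for this comparison.
full = dot(query, document) / (
    math.sqrt(dot(query, query)) * math.sqrt(dot(document, document))
)

# Normalize once (for documents, at index time), then use a plain dot product.
fast = dot(normalize(query), normalize(document))
```

Both values agree up to floating-point rounding. The saving comes from doing the document normalization once per vector at index time rather than once per query.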
Top-K Retrieval
The top_k parameter controls how many chunks are retrieved:
from ragit import RAGAssistant
assistant = RAGAssistant("docs/")
# Retrieve more chunks for complex questions
results = assistant.retrieve("Explain the architecture", top_k=10)
# Retrieve fewer for simple lookups
results = assistant.retrieve("What is the version?", top_k=2)
Trade-offs
Higher top_k: More context, but may include irrelevant information
Lower top_k: More focused, but may miss relevant information
Typical values: 3-5 for focused answers, 5-10 for comprehensive responses.
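As a rough illustration of what top-k selection involves, the sketch below scores every indexed vector against the query and keeps the k best. It assumes all vectors are already unit length; the function top_k_chunks and the toy two-dimensional "embeddings" are hypothetical and do not reflect ragit's internal implementation.

```python
import heapq


def top_k_chunks(query_vec, index, k):
    """Return the k (score, chunk) pairs with the highest dot-product score.

    index is a list of (unit_vector, chunk_text) pairs. query_vec is assumed
    to be unit length as well, so the dot product equals cosine similarity.
    """
    scored = (
        (sum(q * d for q, d in zip(query_vec, vec)), chunk)
        for vec, chunk in index
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Toy 2-D vectors, made up for illustration and already unit length.
index = [
    ([1.0, 0.0], "chunk about installation"),
    ([0.0, 1.0], "chunk about licensing"),
    ([0.6, 0.8], "chunk about configuration"),
]
results = top_k_chunks([1.0, 0.0], index, k=2)
```

Raising k here would also return the licensing chunk, whose score is 0.0, which shows how a larger top_k can pull in context that is unrelated to the question.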
Prompt Augmentation
Retrieved chunks are added to the LLM prompt:
System: You are a helpful assistant. Answer based on the context.
Context:
[Chunk 1 content]
[Chunk 2 content]
[Chunk 3 content]
Question: {user_question}
Answer:
The LLM then generates a response grounded in the provided context.
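A prompt in this layout can be assembled with simple string formatting. The helper below is a hypothetical sketch that mirrors the template shown above; ragit's actual prompt wording and assembly code may differ.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a prompt in the layout shown above (hypothetical helper)."""
    context = "\n".join(chunks)  # one retrieved chunk per line
    return (
        "System: You are a helpful assistant. Answer based on the context.\n"
        "Context:\n"
        f"{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "What is the default chunk size?",
    ["Chunk 1 content", "Chunk 2 content"],
)
```

Because every retrieved chunk is pasted into the prompt, the prompt length grows with both top_k and chunk size, which is one reason those two parameters are tuned together.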
RAG Evaluation Metrics
ragit evaluates RAG quality using three metrics:
Answer Correctness
How well the generated answer matches the expected answer.
# Evaluates semantic similarity between:
# - Generated answer
# - Ground truth answer
Context Relevance
How relevant the retrieved chunks are to the question.
# Evaluates whether retrieved chunks contain
# information needed to answer the question
Faithfulness
Whether the answer is supported by the retrieved context.
# Checks that the answer doesn't contain
# information not present in the context
Combined Score
The final score combines all three metrics:
final_score = mean(answer_correctness, context_relevance, faithfulness)
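For a concrete instance of the formula, the sketch below averages three example metric values. The numbers are invented for illustration, and the assumption that each metric lies in the range 0 to 1 is not stated in this page.

```python
from statistics import mean


def combined_score(
    answer_correctness: float, context_relevance: float, faithfulness: float
) -> float:
    """Unweighted mean of the three metrics (each assumed to lie in [0, 1])."""
    return mean([answer_correctness, context_relevance, faithfulness])


# Made-up metric values for a single evaluated question.
score = combined_score(answer_correctness=0.9, context_relevance=0.6, faithfulness=0.75)
```

Because the mean is unweighted, a weak result on any one metric lowers the final score by the same amount regardless of which metric it is.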
Hyperparameter Optimization
RAG quality is sensitive to hyperparameters. ragit optimizes:
Indexing Parameters
chunk_size: Characters per chunk (256, 512, 1024)
chunk_overlap: Overlap between chunks (25, 50, 100)
Inference Parameters
num_chunks: Chunks to retrieve (3, 5, 7)
llm_model: Model for generation
embedding_model: Model for embeddings
The optimization process tests combinations and ranks by evaluation score.
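"Tests combinations and ranks by evaluation score" describes a grid search, which can be sketched generically as follows. The functions grid_search and toy_evaluate are hypothetical, and the scoring function is a placeholder: a real run would build an index, answer benchmark questions, and compute the metrics described earlier. ragit's own optimizer may work differently.

```python
from itertools import product


def grid_search(evaluate, chunk_sizes, chunk_overlaps, num_chunks_options):
    """Score every combination with evaluate() and return results, best first."""
    results = []
    for chunk_size, chunk_overlap, num_chunks in product(
        chunk_sizes, chunk_overlaps, num_chunks_options
    ):
        if chunk_overlap >= chunk_size:
            continue  # overlap must be smaller than the chunk itself
        config = {
            "chunk_size": chunk_size,
            "chunk_overlap": chunk_overlap,
            "num_chunks": num_chunks,
        }
        results.append((evaluate(config), config))
    return sorted(results, key=lambda item: item[0], reverse=True)


def toy_evaluate(config):
    """Placeholder score that peaks at chunk_size=512 and num_chunks=5."""
    return 1.0 / (1 + abs(config["chunk_size"] - 512) + abs(config["num_chunks"] - 5))


ranked = grid_search(toy_evaluate, [256, 512, 1024], [25, 50, 100], [3, 5, 7])
best_score, best_config = ranked[0]
```

The number of configurations is the product of the option counts (27 here), and each one requires re-indexing whenever an indexing parameter changes, so the search space is worth keeping small.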
Thread Safety
Important: RAGAssistant and SimpleVectorStore are NOT thread-safe.
from ragit import RAGAssistant

# WRONG: Sharing one assistant between threads
assistant = RAGAssistant("docs/")

def handle_request(question):
    return assistant.ask(question)  # Race condition!

# CORRECT: Each thread gets its own instance
def handle_request(question):
    assistant = RAGAssistant("docs/")  # Or use thread-local storage
    return assistant.ask(question)
For web applications, see the Platform Integration guide for proper patterns.
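The thread-local storage option mentioned in the comment above can be sketched with the standard library's threading.local. To keep the example self-contained, it uses a stand-in class, FakeAssistant, in place of RAGAssistant; the helper get_thread_instance is likewise hypothetical. In a real application the factory would presumably be something like lambda: RAGAssistant("docs/").

```python
import threading

_local = threading.local()  # attributes set on this object are per-thread


def get_thread_instance(factory):
    """Return this thread's own instance, creating it on first use."""
    if not hasattr(_local, "instance"):
        _local.instance = factory()
    return _local.instance


class FakeAssistant:
    """Stand-in for an object that is not thread-safe (hypothetical)."""

    def ask(self, question: str) -> str:
        return f"answer to: {question}"


def handle_request(question: str, seen: list) -> None:
    assistant = get_thread_instance(FakeAssistant)
    seen.append(assistant)  # keep a reference so instances can be compared later
    assistant.ask(question)


seen_instances = []
threads = [
    threading.Thread(target=handle_request, args=(f"q{i}", seen_instances))
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Compared with constructing a new instance inside every request handler, this builds one instance per thread and reuses it for all requests that thread serves, which matters when construction is expensive.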