Core Concepts

This page explains the fundamental concepts behind ragit and RAG systems in general.

What is RAG?

RAG (Retrieval-Augmented Generation) is a technique that enhances LLM responses by retrieving relevant context from a knowledge base before generating an answer.

The RAG Pipeline

User Question
     |
     v
+------------------+
|  1. Embed Query  |  Convert question to vector
+------------------+
     |
     v
+------------------+
|  2. Retrieve     |  Find similar chunks from index
+------------------+
     |
     v
+------------------+
|  3. Augment      |  Add context to prompt
+------------------+
     |
     v
+------------------+
|  4. Generate     |  LLM produces answer
+------------------+
     |
     v
Final Answer

Why RAG Matters

Current Information: LLMs have knowledge cutoffs; RAG provides up-to-date context
Domain Knowledge: Add specialized knowledge without fine-tuning
Reduced Hallucination: Grounding responses in retrieved facts
Transparency: Know exactly what sources informed the answer

Document Chunking

Documents are split into smaller pieces called “chunks” for efficient retrieval.

Chunk Size

The number of characters in each chunk significantly affects retrieval quality:

Small chunks (128-256): More precise retrieval but may lose context
Medium chunks (512-1024): Good balance of precision and context
Large chunks (2048+): More context but less precise matching

from ragit import chunk_text

# Small chunks for precise retrieval
small_chunks = chunk_text(text, chunk_size=256, chunk_overlap=25)

# Large chunks for more context
large_chunks = chunk_text(text, chunk_size=1024, chunk_overlap=100)

Chunk Overlap

Overlap ensures information at chunk boundaries isn’t lost:

Without overlap:
[Chunk 1: "The quick brown"][Chunk 2: "fox jumps over"]

With overlap:
[Chunk 1: "The quick brown fox"][Chunk 2: "brown fox jumps over"]

Typical overlap values: 10-20% of chunk size.

Embeddings

Embeddings are numerical vector representations of text that capture semantic meaning.

How Embeddings Work

from ragit.providers import OllamaProvider

provider = OllamaProvider()

# Similar sentences have similar embeddings
emb1 = provider.embed("The cat sat on the mat", model="mxbai-embed-large")
emb2 = provider.embed("A feline rested on the rug", model="mxbai-embed-large")
emb3 = provider.embed("Python is a programming language", model="mxbai-embed-large")

# emb1 and emb2 will be similar (both about cats)
# emb3 will be different (about programming)

Embedding Models

ragit supports any embedding model available through Ollama:

mxbai-embed-large: General-purpose, 1024 dimensions
nomic-embed-text: Good for long documents, 768 dimensions
all-minilm: Lightweight, 384 dimensions

Vector Similarity Search

When you ask a question, ragit finds chunks with similar embeddings.

Cosine Similarity

ragit uses cosine similarity to compare embeddings:

similarity = dot(query, document) / (|query| * |document|)

Range: -1 to 1
- 1.0: Identical meaning
- 0.0: Unrelated
- -1.0: Opposite meaning

Pre-normalized Embeddings

ragit pre-normalizes embeddings at index time for fast retrieval:

# Instead of computing full cosine similarity each time:
# similarity = dot(a, b) / (norm(a) * norm(b))

# We normalize once at index time:
# normalized_a = a / norm(a)
# normalized_b = b / norm(b)

# Then similarity is just a dot product:
# similarity = dot(normalized_a, normalized_b)

This makes retrieval O(1) per vector instead of O(n).

Top-K Retrieval

The top_k parameter controls how many chunks are retrieved:

from ragit import RAGAssistant

assistant = RAGAssistant("docs/")

# Retrieve more chunks for complex questions
results = assistant.retrieve("Explain the architecture", top_k=10)

# Retrieve fewer for simple lookups
results = assistant.retrieve("What is the version?", top_k=2)

Trade-offs

Higher top_k: More context, but may include irrelevant information
Lower top_k: More focused, but may miss relevant information

Typical values: 3-5 for focused answers, 5-10 for comprehensive responses.

Prompt Augmentation

Retrieved chunks are added to the LLM prompt:

System: You are a helpful assistant. Answer based on the context.

Context:
[Chunk 1 content]
[Chunk 2 content]
[Chunk 3 content]

Question: {user_question}

Answer:

The LLM then generates a response grounded in the provided context.

RAG Evaluation Metrics

ragit evaluates RAG quality using three metrics:

Answer Correctness

How well the generated answer matches the expected answer.

# Evaluates semantic similarity between:
# - Generated answer
# - Ground truth answer

Context Relevance

How relevant the retrieved chunks are to the question.

# Evaluates whether retrieved chunks contain
# information needed to answer the question

Faithfulness

Whether the answer is supported by the retrieved context.

# Checks that the answer doesn't contain
# information not present in the context

Combined Score

The final score combines all three metrics:

final_score = mean(answer_correctness, context_relevance, faithfulness)

Hyperparameter Optimization

RAG quality is sensitive to hyperparameters. ragit optimizes:

Indexing Parameters

chunk_size: Characters per chunk (256, 512, 1024)
chunk_overlap: Overlap between chunks (25, 50, 100)

Inference Parameters

num_chunks: Chunks to retrieve (3, 5, 7)
llm_model: Model for generation
embedding_model: Model for embeddings

The optimization process tests combinations and ranks by evaluation score.

Thread Safety

Important: RAGAssistant and SimpleVectorStore are NOT thread-safe.

# WRONG: Sharing assistant between threads
assistant = RAGAssistant("docs/")

def handle_request(question):
    return assistant.ask(question)  # Race condition!

# CORRECT: Each thread gets its own instance
def handle_request(question):
    assistant = RAGAssistant("docs/")  # Or use thread-local storage
    return assistant.ask(question)

For web applications, see the Platform Integration guide for proper patterns.