Core Concepts

This page explains the fundamental concepts behind ragit and RAG systems in general.

What is RAG?

RAG (Retrieval-Augmented Generation) is a technique that improves the responses of a large language model (LLM) by retrieving relevant context from a knowledge base before generating an answer.

The RAG Pipeline

User Question
     |
     v
+------------------+
|  1. Embed Query  |  Convert question to vector
+------------------+
     |
     v
+------------------+
|  2. Retrieve     |  Find similar chunks from index
+------------------+
     |
     v
+------------------+
|  3. Augment      |  Add context to prompt
+------------------+
     |
     v
+------------------+
|  4. Generate     |  LLM produces answer
+------------------+
     |
     v
Final Answer

Why RAG Matters

  • Current Information: LLMs have knowledge cutoffs; RAG provides up-to-date context

  • Domain Knowledge: Add specialized knowledge without fine-tuning

  • Reduced Hallucination: Responses are grounded in retrieved facts

  • Transparency: Know exactly what sources informed the answer

Document Chunking

Documents are split into smaller pieces called “chunks” for efficient retrieval.

Chunk Size

The number of characters in each chunk significantly affects retrieval quality:

  • Small chunks (128-256 characters): More precise retrieval but may lose context

  • Medium chunks (512-1024 characters): Good balance of precision and context

  • Large chunks (2048+ characters): More context but less precise matching

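from ragit import chunk_text

# Any document text as a single string (placeholder content)
text = "Replace this with the contents of your document. " * 50

# Small chunks for precise retrieval
small_chunks = chunk_text(text, chunk_size=256, chunk_overlap=25)

# Large chunks for more context
large_chunks = chunk_text(text, chunk_size=1024, chunk_overlap=100)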

Chunk Overlap

Overlap repeats the end of one chunk at the start of the next, so information that straddles a chunk boundary isn’t lost:

Without overlap:
[Chunk 1: "The quick brown"][Chunk 2: "fox jumps over"]

With overlap:
[Chunk 1: "The quick brown fox"][Chunk 2: "brown fox jumps over"]

Typical overlap values: 10-20% of chunk size.
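
The sliding-window idea can be sketched in a few lines of plain Python. This is a minimal illustration, not ragit's chunk_text implementation: the function name sliding_window_chunks is hypothetical, and it splits purely by character count without regard to word or sentence boundaries.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "The quick brown fox jumps over the lazy dog"
for chunk in sliding_window_chunks(sample, chunk_size=20, chunk_overlap=5):
    print(repr(chunk))
```

With chunk_size=20 and chunk_overlap=5, each window starts 15 characters after the previous one, so the last 5 characters of one chunk reappear at the start of the next.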

Embeddings

Embeddings are numerical vector representations of text that capture semantic meaning.

How Embeddings Work

from ragit.providers import OllamaProvider

provider = OllamaProvider()

# Similar sentences have similar embeddings
emb1 = provider.embed("The cat sat on the mat", model="mxbai-embed-large")
emb2 = provider.embed("A feline rested on the rug", model="mxbai-embed-large")
emb3 = provider.embed("Python is a programming language", model="mxbai-embed-large")

# emb1 and emb2 will be similar (both about cats)
# emb3 will be different (about programming)
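
One common way to quantify "similar" is cosine similarity, which compares the direction of two vectors. The sketch below assumes that measure (this page does not state which one ragit uses) and substitutes made-up three-dimensional vectors for real model output, so it runs without Ollama.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy stand-ins for emb1, emb2, emb3 (illustrative values, not real embeddings)
cat = [0.9, 0.1, 0.0]
feline = [0.8, 0.2, 0.1]
python_lang = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, feline))       # close to 1: similar meaning
print(cosine_similarity(cat, python_lang))  # close to 0: unrelated meaning
```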

Embedding Models

ragit supports any embedding model available through Ollama:

  • mxbai-embed-large: General-purpose, 1024 dimensions

  • nomic-embed-text: Good for long documents, 768 dimensions

  • all-minilm: Lightweight, 384 dimensions

Top-K Retrieval

The top_k parameter controls how many chunks are retrieved for each query:

from ragit import RAGAssistant

assistant = RAGAssistant("docs/")

# Retrieve more chunks for complex questions
results = assistant.retrieve("Explain the architecture", top_k=10)

# Retrieve fewer for simple lookups
results = assistant.retrieve("What is the version?", top_k=2)

Trade-offs

  • Higher top_k: More context, but may include irrelevant information

  • Lower top_k: More focused, but may miss relevant information

Typical values: 3-5 for focused answers, 5-10 for comprehensive responses.
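
To make the effect of top_k concrete, here is a minimal sketch of top-k selection over a tiny in-memory index. It is not how ragit stores or searches vectors: top_k_chunks is a hypothetical helper, the two-dimensional vectors are invented, and they are assumed to be unit length so that a plain dot product equals cosine similarity.

```python
import heapq


def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


def top_k_chunks(query_vec, indexed_chunks, top_k):
    """Return up to top_k (score, text) pairs, best match first."""
    scored = ((dot(query_vec, vec), text) for text, vec in indexed_chunks)
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[0])


# Toy index of (chunk text, unit-length vector) pairs
index = [
    ("Install with pip", [1.0, 0.0]),
    ("Release history", [0.0, 1.0]),
    ("Install and upgrade notes", [0.6, 0.8]),
]
query = [0.8, 0.6]  # unit-length toy query vector

for score, text in top_k_chunks(query, index, top_k=2):
    print(f"{score:.2f}  {text}")
```

Raising top_k past the number of indexed chunks simply returns every chunk, ranked from most to least similar.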

Prompt Augmentation

Retrieved chunks are inserted into the LLM prompt ahead of the user's question:

System: You are a helpful assistant. Answer based on the context.

Context:
[Chunk 1 content]
[Chunk 2 content]
[Chunk 3 content]

Question: {user_question}

Answer:

The LLM then generates a response grounded in the provided context.
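
Assembling that template is plain string formatting. The sketch below mirrors the layout shown above; build_prompt is a hypothetical helper name, and the prompt ragit actually sends may differ in wording.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Place retrieved chunks between the system instruction and the question."""
    context = "\n".join(chunks)
    return (
        "System: You are a helpful assistant. Answer based on the context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )


prompt = build_prompt(
    "What is chunk overlap?",
    [
        "Documents are split into smaller pieces called chunks.",
        "Overlap repeats text at chunk boundaries.",
    ],
)
print(prompt)
```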

RAG Evaluation Metrics

ragit evaluates RAG quality using three metrics:

Answer Correctness

How well the generated answer matches the expected answer.

# Evaluates semantic similarity between:
# - Generated answer
# - Ground truth answer

Context Relevance

How relevant the retrieved chunks are to the question.

# Evaluates whether retrieved chunks contain
# information needed to answer the question

Faithfulness

Whether the answer is supported by the retrieved context.

# Checks that the answer doesn't contain
# information not present in the context

Combined Score

The final score is the unweighted mean of the three metrics:

final_score = mean(answer_correctness, context_relevance, faithfulness)
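
As a worked example, the sketch below computes that mean for one set of made-up scores. It assumes each metric is a number between 0 and 1, which this page does not state; combined_score is a hypothetical helper name.

```python
from statistics import mean


def combined_score(answer_correctness, context_relevance, faithfulness):
    """Unweighted mean of the three evaluation metrics."""
    return mean([answer_correctness, context_relevance, faithfulness])


# Illustrative values only: strong answer, weaker retrieval
score = combined_score(answer_correctness=0.9, context_relevance=0.6, faithfulness=0.75)
print(round(score, 2))
```

Because every metric carries equal weight, a weak retrieval score pulls the total down just as much as a weak answer does.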

Hyperparameter Optimization

RAG quality is sensitive to hyperparameters. ragit optimizes:

Indexing Parameters

  • chunk_size: Characters per chunk (256, 512, 1024)

  • chunk_overlap: Overlap between chunks (25, 50, 100)

Inference Parameters

  • num_chunks: Chunks to retrieve (3, 5, 7)

  • llm_model: Model for generation

  • embedding_model: Model for embeddings

The optimization process tests combinations of these parameters and ranks each configuration by its evaluation score.
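
One simple way to test combinations is an exhaustive grid search, sketched below. This is an assumption about the general approach, not a description of ragit's optimizer: grid_search is a hypothetical helper, and evaluate is a toy scoring rule standing in for a real evaluation run over a question set.

```python
from itertools import product

search_space = {
    "chunk_size": [256, 512, 1024],
    "chunk_overlap": [25, 50, 100],
    "num_chunks": [3, 5, 7],
}


def evaluate(config: dict) -> float:
    """Hypothetical stand-in for a real evaluation; returns a made-up score."""
    # Toy rule: favors 512-character chunks, about 10% overlap, and 5 chunks
    size_penalty = abs(config["chunk_size"] - 512) / 1024
    overlap_penalty = abs(config["chunk_overlap"] / config["chunk_size"] - 0.10)
    k_penalty = abs(config["num_chunks"] - 5) / 10
    return 1.0 - size_penalty - overlap_penalty - k_penalty


def grid_search(space, score_fn):
    """Score every combination in space and return (score, config) pairs, best first."""
    names = list(space)
    configs = [dict(zip(names, values)) for values in product(*space.values())]
    scored = [(score_fn(config), config) for config in configs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


ranked = grid_search(search_space, evaluate)
best_score, best_config = ranked[0]
print(len(ranked), "configurations tested")
print("best:", best_config)
```

The number of runs grows multiplicatively with each added parameter value (here 3 x 3 x 3 = 27), which is why the candidate lists are kept short.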

Thread Safety

Important: RAGAssistant and SimpleVectorStore are NOT thread-safe.

from ragit import RAGAssistant

# WRONG: One assistant shared by every thread
assistant = RAGAssistant("docs/")

def handle_request(question):
    return assistant.ask(question)  # Race condition!

# CORRECT: Nothing is shared between threads
def handle_request(question):
    # A new instance is constructed on every call; thread-local storage
    # lets each thread create one instance and reuse it instead.
    assistant = RAGAssistant("docs/")
    return assistant.ask(question)
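
The thread-local option can be sketched with the standard library's threading.local. To keep the example runnable without ragit, StubAssistant is a hypothetical stand-in for RAGAssistant, and get_assistant is an illustrative helper name, not part of ragit.

```python
import threading


class StubAssistant:
    """Hypothetical stand-in for a non-thread-safe object such as RAGAssistant."""

    def __init__(self, source):
        self.source = source

    def ask(self, question):
        return f"answer to: {question}"


_local = threading.local()  # attributes set on this object are private to each thread


def get_assistant():
    """Return the calling thread's instance, creating it on first use."""
    if not hasattr(_local, "assistant"):
        _local.assistant = StubAssistant("docs/")
    return _local.assistant


def handle_request(question):
    return get_assistant().ask(question)


# Demo: each thread reuses its own instance and never sees another thread's
seen = {}


def worker(name):
    first = get_assistant()
    second = get_assistant()
    seen[name] = (first is second, first)


threads = [threading.Thread(target=worker, args=(f"t{i}",)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each thread pays the construction cost once, then reuses its instance for every later request it handles.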

For web applications, see the Platform Integration guide for proper patterns.