Core Concepts
=============

This page explains the fundamental concepts behind ragit and RAG systems in general.

What is RAG?
------------

RAG (Retrieval-Augmented Generation) is a technique that enhances LLM responses by retrieving relevant context from a knowledge base before generating an answer.

The RAG Pipeline
^^^^^^^^^^^^^^^^

.. code-block:: text

   User Question
        |
        v
   +------------------+
   |  1. Embed Query  |  Convert question to vector
   +------------------+
        |
        v
   +------------------+
   |  2. Retrieve     |  Find similar chunks from index
   +------------------+
        |
        v
   +------------------+
   |  3. Augment      |  Add context to prompt
   +------------------+
        |
        v
   +------------------+
   |  4. Generate     |  LLM produces answer
   +------------------+
        |
        v
   Final Answer

Why RAG Matters
^^^^^^^^^^^^^^^

- **Current Information**: LLMs have knowledge cutoffs; RAG provides up-to-date context
- **Domain Knowledge**: Add specialized knowledge without fine-tuning
- **Reduced Hallucination**: Grounding responses in retrieved facts
- **Transparency**: Know exactly what sources informed the answer

Document Chunking
-----------------

Documents are split into smaller pieces called "chunks" for efficient retrieval.

Chunk Size
^^^^^^^^^^

The number of characters in each chunk significantly affects retrieval quality:

- **Small chunks (128-256)**: More precise retrieval but may lose context
- **Medium chunks (512-1024)**: Good balance of precision and context
- **Large chunks (2048+)**: More context but less precise matching

.. code-block:: python

   from ragit import chunk_text

   # Small chunks for precise retrieval
   small_chunks = chunk_text(text, chunk_size=256, chunk_overlap=25)

   # Large chunks for more context
   large_chunks = chunk_text(text, chunk_size=1024, chunk_overlap=100)

Chunk Overlap
^^^^^^^^^^^^^

Overlap ensures information at chunk boundaries isn't lost:

.. code-block:: text

   Without overlap:
   [Chunk 1: "The quick brown"][Chunk 2: "fox jumps over"]

   With overlap:
   [Chunk 1: "The quick brown fox"][Chunk 2: "brown fox jumps over"]

Typical overlap values: 10-20% of chunk size.

Embeddings
----------

Embeddings are numerical vector representations of text that capture semantic meaning.

How Embeddings Work
^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   from ragit.providers import OllamaProvider

   provider = OllamaProvider()

   # Similar sentences have similar embeddings
   emb1 = provider.embed("The cat sat on the mat", model="mxbai-embed-large")
   emb2 = provider.embed("A feline rested on the rug", model="mxbai-embed-large")
   emb3 = provider.embed("Python is a programming language", model="mxbai-embed-large")

   # emb1 and emb2 will be similar (both about cats)
   # emb3 will be different (about programming)

Embedding Models
^^^^^^^^^^^^^^^^

ragit supports any embedding model available through Ollama:

- **mxbai-embed-large**: General-purpose, 1024 dimensions
- **nomic-embed-text**: Good for long documents, 768 dimensions
- **all-minilm**: Lightweight, 384 dimensions

Vector Similarity Search
------------------------

When you ask a question, ragit finds chunks with similar embeddings.

Cosine Similarity
^^^^^^^^^^^^^^^^^

ragit uses cosine similarity to compare embeddings:

.. code-block:: text

   similarity = dot(query, document) / (|query| * |document|)

   Range: -1 to 1
   - 1.0: Identical meaning
   - 0.0: Unrelated
   - -1.0: Opposite meaning

Pre-normalized Embeddings
^^^^^^^^^^^^^^^^^^^^^^^^^

ragit pre-normalizes embeddings at index time for fast retrieval:

.. code-block:: python

   # Instead of computing full cosine similarity each time:
   # similarity = dot(a, b) / (norm(a) * norm(b))

   # We normalize once at index time:
   # normalized_a = a / norm(a)
   # normalized_b = b / norm(b)

   # Then similarity is just a dot product:
   # similarity = dot(normalized_a, normalized_b)

This makes retrieval O(1) per vector instead of O(n).

Top-K Retrieval
---------------

The ``top_k`` parameter controls how many chunks are retrieved:

.. code-block:: python

   from ragit import RAGAssistant

   assistant = RAGAssistant("docs/")

   # Retrieve more chunks for complex questions
   results = assistant.retrieve("Explain the architecture", top_k=10)

   # Retrieve fewer for simple lookups
   results = assistant.retrieve("What is the version?", top_k=2)

Trade-offs
^^^^^^^^^^

- **Higher top_k**: More context, but may include irrelevant information
- **Lower top_k**: More focused, but may miss relevant information

Typical values: 3-5 for focused answers, 5-10 for comprehensive responses.

Prompt Augmentation
-------------------

Retrieved chunks are added to the LLM prompt:

.. code-block:: text

   System: You are a helpful assistant. Answer based on the context.

   Context:
   [Chunk 1 content]
   [Chunk 2 content]
   [Chunk 3 content]

   Question: {user_question}

   Answer:

The LLM then generates a response grounded in the provided context.

RAG Evaluation Metrics
----------------------

ragit evaluates RAG quality using three metrics:

Answer Correctness
^^^^^^^^^^^^^^^^^^

How well the generated answer matches the expected answer.

.. code-block:: python

   # Evaluates semantic similarity between:
   # - Generated answer
   # - Ground truth answer

Context Relevance
^^^^^^^^^^^^^^^^^

How relevant the retrieved chunks are to the question.

.. code-block:: python

   # Evaluates whether retrieved chunks contain
   # information needed to answer the question

Faithfulness
^^^^^^^^^^^^

Whether the answer is supported by the retrieved context.

.. code-block:: python

   # Checks that the answer doesn't contain
   # information not present in the context

Combined Score
^^^^^^^^^^^^^^

The final score combines all three metrics:

.. code-block:: python

   final_score = mean(answer_correctness, context_relevance, faithfulness)

Hyperparameter Optimization
---------------------------

RAG quality is sensitive to hyperparameters. ragit optimizes:

Indexing Parameters
^^^^^^^^^^^^^^^^^^^

- **chunk_size**: Characters per chunk (256, 512, 1024)
- **chunk_overlap**: Overlap between chunks (25, 50, 100)

Inference Parameters
^^^^^^^^^^^^^^^^^^^^

- **num_chunks**: Chunks to retrieve (3, 5, 7)
- **llm_model**: Model for generation
- **embedding_model**: Model for embeddings

The optimization process tests combinations and ranks by evaluation score.

Thread Safety
-------------

Important: ``RAGAssistant`` and ``SimpleVectorStore`` are **NOT thread-safe**.

.. code-block:: python

   # WRONG: Sharing assistant between threads
   assistant = RAGAssistant("docs/")

   def handle_request(question):
       return assistant.ask(question)  # Race condition!

   # CORRECT: Each thread gets its own instance
   def handle_request(question):
       assistant = RAGAssistant("docs/")  # Or use thread-local storage
       return assistant.ask(question)

For web applications, see the :doc:`integration` guide for proper patterns.