RAG Optimization
================

This guide covers how to use ragit's optimization engine to find the best hyperparameters for your RAG pipeline.

Why Optimize?
-------------

RAG quality is highly sensitive to hyperparameters:

- **Chunk size**: Too small loses context, too large dilutes relevance
- **Chunk overlap**: Affects information continuity at boundaries
- **Top-k retrieval**: More chunks add context but also add noise
- **Model selection**: Different models have different strengths

Manual tuning is time-consuming. ragit automates this process.
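
To make the first two parameters concrete, the sketch below splits a string into fixed-size character chunks with overlap. The ``chunk_text`` helper is hypothetical and not part of ragit's API; ragit's own chunker may split differently (for example on tokens or sentence boundaries).

.. code-block:: python

   def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
       """Split text into character chunks that share chunk_overlap characters."""
       if chunk_size <= 0:
           raise ValueError("chunk_size must be positive")
       if not 0 <= chunk_overlap < chunk_size:
           raise ValueError("chunk_overlap must be in [0, chunk_size)")

       step = chunk_size - chunk_overlap  # how far each chunk start advances
       chunks = []
       for start in range(0, len(text), step):
           chunks.append(text[start:start + chunk_size])
           if start + chunk_size >= len(text):
               break  # the last chunk already reaches the end of the text
       return chunks


   text = "abcdefghij" * 10  # 100 characters of illustrative input
   small = chunk_text(text, chunk_size=20, chunk_overlap=5)
   large = chunk_text(text, chunk_size=50, chunk_overlap=5)
   print(f"{len(small)} small chunks, {len(large)} large chunks")

Smaller chunks produce more, narrower retrieval units from the same text, which is why chunk size and top-k usually need to be tuned together.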

Setting Up an Experiment
------------------------

Prepare Documents
^^^^^^^^^^^^^^^^^

First, prepare your documents:

.. code-block:: python

   from ragit import Document, load_directory

   # Option 1: Create documents manually
   documents = [
       Document(
           id="intro",
           content="ragit is a RAG optimization library...",
           metadata={"source": "intro.txt"}
       ),
       Document(
           id="install",
           content="Install ragit using pip install ragit...",
           metadata={"source": "install.txt"}
       ),
   ]

   # Option 2: Load from files (load_directory is imported above)
   docs = load_directory("docs/", pattern="*.txt")
   documents = [
       Document(id=doc.id, content=doc.content, metadata=doc.metadata)
       for doc in docs
   ]

Create Benchmark Questions
^^^^^^^^^^^^^^^^^^^^^^^^^^

Create questions with expected answers:

.. code-block:: python

   from ragit import BenchmarkQuestion

   benchmark = [
       BenchmarkQuestion(
           question="How do I install ragit?",
           ground_truth="Install using pip install ragit",
           context="Installation documentation"
       ),
       BenchmarkQuestion(
           question="What is the default chunk size?",
           ground_truth="512 characters",
           context="Configuration section"
       ),
       BenchmarkQuestion(
           question="Is RAGAssistant thread-safe?",
           ground_truth="No, RAGAssistant is not thread-safe",
           context="Thread safety documentation"
       ),
       BenchmarkQuestion(
           question="What embedding models are supported?",
           ground_truth="mxbai-embed-large, nomic-embed-text, all-minilm",
           context="Model documentation"
       ),
       BenchmarkQuestion(
           question="How do I configure Ollama URL?",
           ground_truth="Set the OLLAMA_BASE_URL environment variable",
           context="Configuration section"
       ),
   ]

Guidelines for good benchmark questions:

- Use 5-20 questions for meaningful results
- Cover different aspects of your documents
- Make ground truth answers clear and specific
- Include both simple lookups and complex reasoning questions
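
Some of these guidelines can be checked mechanically before an experiment runs. The following is a minimal sketch, assuming a stand-in ``Question`` dataclass with the same ``question`` and ``ground_truth`` fields as above; ``check_benchmark`` is a hypothetical helper, not a ragit function.

.. code-block:: python

   from dataclasses import dataclass


   @dataclass
   class Question:
       """Stand-in for BenchmarkQuestion so the sketch runs on its own."""
       question: str
       ground_truth: str


   def check_benchmark(items: list[Question], min_size: int = 5, max_size: int = 20) -> list[str]:
       """Return a list of human-readable problems; empty means no issue found."""
       problems = []
       if not min_size <= len(items) <= max_size:
           problems.append(f"expected {min_size}-{max_size} questions, got {len(items)}")

       seen = set()
       for i, item in enumerate(items):
           key = item.question.strip().lower()  # normalise before comparing
           if key in seen:
               problems.append(f"question {i} is a duplicate")
           seen.add(key)
           if not item.ground_truth.strip():
               problems.append(f"question {i} has an empty ground truth")
       return problems


   sample = [
       Question("How do I install ragit?", "Install using pip install ragit"),
       Question("how do I install ragit? ", ""),
   ]
   for problem in check_benchmark(sample):
       print(problem)

Coverage and difficulty mix still need a human review; the check only catches size, duplicate, and empty-answer problems.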

Running the Experiment
----------------------

Basic Experiment
^^^^^^^^^^^^^^^^

Run with the default search space:

.. code-block:: python

   from ragit import RagitExperiment

   experiment = RagitExperiment(documents, benchmark)
   results = experiment.run()

   # Results are sorted by score (best first)
   print(f"Tested {len(results)} configurations")

   best = results[0]
   print(f"\nBest configuration: {best.pattern_name}")
   print(f"Final score: {best.final_score:.3f}")
   print(f"Execution time: {best.execution_time:.1f}s")

Custom Search Space
^^^^^^^^^^^^^^^^^^^

Define your own hyperparameter ranges:

.. code-block:: python

   from ragit import RagitExperiment

   experiment = RagitExperiment(documents, benchmark)

   # Define custom search space
   configs = experiment.define_search_space(
       chunk_sizes=[256, 512, 1024, 2048],
       chunk_overlaps=[0, 25, 50, 100],
       num_chunks=[3, 5, 7, 10],
       llm_models=["llama3", "mistral"],
       embedding_models=["mxbai-embed-large"],
       max_configs=50  # Cap the total; the full grid is 4 * 4 * 4 * 2 * 1 = 128
   )

   print(f"Search space: {len(configs)} configurations")

   # Run with custom configs
   results = experiment.run(configs=configs)
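
The size of a grid grows multiplicatively with every list you extend, which is why a cap matters. The sketch below enumerates the grid above with ``itertools.product`` and caps it by random sampling. How ragit chooses which configurations to keep under ``max_configs`` is not described in this guide, so the sampling step is an assumption made for illustration only.

.. code-block:: python

   import random
   from itertools import product

   search_space = {
       "chunk_size": [256, 512, 1024, 2048],
       "chunk_overlap": [0, 25, 50, 100],
       "num_chunks": [3, 5, 7, 10],
       "llm_model": ["llama3", "mistral"],
       "embedding_model": ["mxbai-embed-large"],
   }

   # Full Cartesian product: one dict per combination of values
   grid = [
       dict(zip(search_space, values))
       for values in product(*search_space.values())
   ]
   print(f"Full grid: {len(grid)} configurations")

   # Assumed capping strategy: a reproducible random sample of the grid
   max_configs = 50
   rng = random.Random(0)
   capped = rng.sample(grid, min(max_configs, len(grid)))
   print(f"After cap: {len(capped)} configurations")

Each evaluated configuration costs one full pass over the benchmark, so the capped count is a direct multiplier on run time.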

Limiting Configurations
^^^^^^^^^^^^^^^^^^^^^^^

For faster iteration, limit the search:

.. code-block:: python

   # Quick test with fewer configs
   configs = experiment.define_search_space(
       chunk_sizes=[512],
       chunk_overlaps=[50],
       num_chunks=[3, 5],
       max_configs=10
   )

   results = experiment.run(configs=configs)

Understanding Results
---------------------

Examining Results
^^^^^^^^^^^^^^^^^

.. code-block:: python

   from ragit import RagitExperiment

   experiment = RagitExperiment(documents, benchmark)
   results = experiment.run()

   # Top 5 configurations
   print("Top 5 Configurations:")
   print("-" * 60)

   for i, result in enumerate(results[:5], 1):
       print(f"\n{i}. {result.pattern_name}")
       print(f"   Score: {result.final_score:.3f}")
       print(f"   Time: {result.execution_time:.1f}s")
       print(f"   Indexing: {result.indexing_params}")
       print(f"   Inference: {result.inference_params}")

       # Detailed scores
       for metric, values in result.scores.items():
           print(f"   {metric}: {values['mean']:.3f}")

Result Attributes
^^^^^^^^^^^^^^^^^

Each ``EvaluationResult`` contains:

.. code-block:: python

   result = results[0]

   # Configuration name
   result.pattern_name        # "Pattern_1"

   # Indexing hyperparameters
   result.indexing_params     # {"chunk_size": 512, "chunk_overlap": 50}

   # Inference hyperparameters
   result.inference_params    # {"num_chunks": 5, "llm_model": "llama3"}

   # Evaluation scores
   result.scores              # {"answer_correctness": {"mean": 0.85}, ...}

   # Combined score
   result.final_score         # 0.82

   # Time taken
   result.execution_time      # 45.3 seconds

Evaluation Metrics
^^^^^^^^^^^^^^^^^^

Each configuration is scored on three metrics:

1. **Answer Correctness**: Semantic similarity between generated and expected answers
2. **Context Relevance**: How relevant the retrieved chunks are to the question
3. **Faithfulness**: Whether the answer is supported by the retrieved context

.. code-block:: python

   result = results[0]

   print("Detailed Scores:")
   print(f"  Answer Correctness: {result.scores['answer_correctness']['mean']:.3f}")
   print(f"  Context Relevance: {result.scores['context_relevance']['mean']:.3f}")
   print(f"  Faithfulness: {result.scores['faithfulness']['mean']:.3f}")
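
If one metric matters more for your application, you can re-rank configurations with your own weighting instead of relying on ``final_score`` alone. The sketch below is a plain weighted mean over a ``scores`` dictionary shaped like the one above. The ``weighted_score`` helper, the weights, and the score values are illustrative assumptions; this is not a description of how ragit computes ``final_score``.

.. code-block:: python

   def weighted_score(scores: dict, weights: dict) -> float:
       """Weighted mean of the per-metric 'mean' values named in weights."""
       total_weight = sum(weights.values())
       if total_weight <= 0:
           raise ValueError("weights must sum to a positive number")
       weighted_sum = sum(scores[name]["mean"] * w for name, w in weights.items())
       return weighted_sum / total_weight


   # Illustrative values only, not output from a real run
   scores = {
       "answer_correctness": {"mean": 0.85},
       "context_relevance": {"mean": 0.80},
       "faithfulness": {"mean": 0.81},
   }

   # Count answer correctness twice as much as the other two metrics
   weights = {"answer_correctness": 2.0, "context_relevance": 1.0, "faithfulness": 1.0}
   print(f"Custom score: {weighted_score(scores, weights):.4f}")

To re-rank a full run, the same function could be applied to each ``result.scores`` and used as a sort key.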

Applying Optimal Settings
-------------------------

Use the best configuration in your application:

.. code-block:: python

   from ragit import RAGAssistant, RagitExperiment

   # Run experiment
   experiment = RagitExperiment(documents, benchmark)
   results = experiment.run()
   best = results[0]

   # Extract optimal parameters
   chunk_size = best.indexing_params["chunk_size"]
   chunk_overlap = best.indexing_params["chunk_overlap"]
   llm_model = best.inference_params.get("llm_model", "llama3")

   # Create optimized assistant
   assistant = RAGAssistant(
       "docs/",
       chunk_size=chunk_size,
       chunk_overlap=chunk_overlap,
       llm_model=llm_model
   )

   # Use the optimized assistant
   answer = assistant.ask("Your question here")

Saving and Loading Results
--------------------------

Export results for analysis:

.. code-block:: python

   import json
   from ragit import RagitExperiment

   experiment = RagitExperiment(documents, benchmark)
   results = experiment.run()

   # Export to JSON
   results_data = [result.to_dict() for result in results]
   with open("experiment_results.json", "w") as f:
       json.dump(results_data, f, indent=2)

   # Export best config
   best = results[0]
   config = {
       "chunk_size": best.indexing_params["chunk_size"],
       "chunk_overlap": best.indexing_params["chunk_overlap"],
       "num_chunks": best.inference_params.get("num_chunks", 5),
       "llm_model": best.inference_params.get("llm_model"),
       "final_score": best.final_score
   }
   with open("best_config.json", "w") as f:
       json.dump(config, f, indent=2)
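
Loading the saved configuration back needs only the standard library. The sketch below reads a file with the same keys as ``best_config.json`` above and fails early if one is missing. ``load_best_config`` is a hypothetical helper, and the demo writes illustrative values to a temporary file so that it runs on its own.

.. code-block:: python

   import json
   import tempfile
   from pathlib import Path

   REQUIRED_KEYS = {"chunk_size", "chunk_overlap", "num_chunks", "llm_model"}


   def load_best_config(path) -> dict:
       """Read a saved configuration and check that the expected keys exist."""
       with open(path) as f:
           config = json.load(f)
       missing = REQUIRED_KEYS - set(config)
       if missing:
           raise KeyError(f"missing keys in {path}: {sorted(missing)}")
       return config


   # Demo with illustrative values; in practice point this at best_config.json
   with tempfile.TemporaryDirectory() as tmp:
       path = Path(tmp) / "best_config.json"
       path.write_text(json.dumps({
           "chunk_size": 512,
           "chunk_overlap": 50,
           "num_chunks": 5,
           "llm_model": "llama3",
           "final_score": 0.82,
       }))
       config = load_best_config(path)

   print(config["chunk_size"], config["chunk_overlap"], config["llm_model"])

The loaded values can then be passed to ``RAGAssistant`` in the same way as in the previous section.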

Advanced Optimization
---------------------

Custom Provider
^^^^^^^^^^^^^^^

Use a custom provider for the experiment:

.. code-block:: python

   from ragit import RagitExperiment
   from ragit.providers import OllamaProvider

   # Custom provider with different settings
   provider = OllamaProvider(
       base_url="http://gpu-server:11434",
       timeout=300
   )

   experiment = RagitExperiment(
       documents,
       benchmark,
       provider=provider
   )
   results = experiment.run()

Progress Tracking
^^^^^^^^^^^^^^^^^

The experiment shows progress using tqdm:

.. code-block:: text

   Evaluating configurations: 100%|██████████| 24/24 [05:32<00:00, 13.83s/it]

For programmatic progress tracking:

.. code-block:: python

   from ragit import RagitExperiment

   experiment = RagitExperiment(documents, benchmark)

   # Access results incrementally
   configs = experiment.define_search_space(max_configs=10)

   for i, config in enumerate(configs, 1):
       result = experiment.evaluate_config(config)
       print(f"Config {i}/{len(configs)}: {result.final_score:.3f}")

Optimization Tips
-----------------

Start Broad, Then Narrow
^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   # Phase 1: Broad search
   configs = experiment.define_search_space(
       chunk_sizes=[256, 512, 1024],
       chunk_overlaps=[25, 50, 100],
       num_chunks=[3, 5, 7],
       max_configs=30
   )
   results = experiment.run(configs=configs)

   # Find best chunk_size from Phase 1
   best_chunk_size = results[0].indexing_params["chunk_size"]

   # Phase 2: Fine-tune around best
   fine_configs = experiment.define_search_space(
       chunk_sizes=[best_chunk_size - 128, best_chunk_size, best_chunk_size + 128],
       chunk_overlaps=[25, 50, 75, 100],
       num_chunks=[4, 5, 6],
       max_configs=20
   )
   final_results = experiment.run(configs=fine_configs)
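
Subtracting a fixed step from the best value can produce a chunk size that is zero, negative, or smaller than an overlap in the list. A small guard avoids building such configurations. The ``neighbors`` and ``valid_overlaps`` helpers below are hypothetical; whether ragit validates these combinations itself is not covered in this guide.

.. code-block:: python

   def neighbors(best: int, step: int, minimum: int = 1) -> list[int]:
       """Return best and its two neighbours at distance step, dropping values below minimum."""
       candidates = [best - step, best, best + step]
       return [value for value in candidates if value >= minimum]


   def valid_overlaps(overlaps: list[int], chunk_sizes: list[int]) -> list[int]:
       """Keep only overlaps that are smaller than every candidate chunk size."""
       smallest = min(chunk_sizes)
       return [overlap for overlap in overlaps if 0 <= overlap < smallest]


   # Illustrative Phase 1 winner near the low end of the range
   best_chunk_size = 128
   sizes = neighbors(best_chunk_size, step=64, minimum=64)
   overlaps = valid_overlaps([25, 50, 75, 100], sizes)
   print(sizes, overlaps)

The filtered lists can be passed as ``chunk_sizes`` and ``chunk_overlaps`` in the Phase 2 call.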

Quality vs Speed Trade-offs
^^^^^^^^^^^^^^^^^^^^^^^^^^^

- **Smaller chunks**: Faster embedding per chunk and more precise retrieval, but less context in each chunk
- **Fewer num_chunks**: Faster generation, but less context for the answer
- **Smaller LLM**: Faster responses, potentially lower quality

.. code-block:: python

   # Optimize for speed
   fast_configs = experiment.define_search_space(
       chunk_sizes=[256, 512],
       num_chunks=[2, 3],
       llm_models=["mistral"]  # Fast model
   )

   # Optimize for quality
   quality_configs = experiment.define_search_space(
       chunk_sizes=[512, 1024],
       num_chunks=[5, 7, 10],
       llm_models=["llama3:70b"]  # Large model
   )
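
Once both searches have run, one way to compare them is to keep only the configurations that no other configuration beats on both score and time. The sketch below computes that Pareto front over a stand-in ``Run`` dataclass carrying the ``final_score`` and ``execution_time`` fields described earlier. The helper and the sample numbers are illustrative assumptions, not ragit output.

.. code-block:: python

   from dataclasses import dataclass


   @dataclass
   class Run:
       """Stand-in for an evaluation result with only the fields used here."""
       name: str
       final_score: float
       execution_time: float


   def pareto_front(runs: list[Run]) -> list[Run]:
       """Keep runs that are not dominated on (higher score, lower time)."""
       front = []
       for run in runs:
           dominated = any(
               other.final_score >= run.final_score
               and other.execution_time <= run.execution_time
               and (
                   other.final_score > run.final_score
                   or other.execution_time < run.execution_time
               )
               for other in runs
           )
           if not dominated:
               front.append(run)
       return sorted(front, key=lambda r: r.execution_time)


   runs = [
       Run("Pattern_1", 0.82, 45.3),
       Run("Pattern_2", 0.79, 20.1),
       Run("Pattern_3", 0.75, 30.0),  # slower and worse than Pattern_2
       Run("Pattern_4", 0.84, 90.0),
   ]
   for run in pareto_front(runs):
       print(f"{run.name}: score={run.final_score:.2f}, time={run.execution_time:.1f}s")

Every run left on the front is a defensible choice; which one to deploy depends on how much latency a small score gain is worth.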

Representative Benchmark
^^^^^^^^^^^^^^^^^^^^^^^^

Your benchmark should reflect real usage:

- Include questions users actually ask
- Cover different difficulty levels
- Test edge cases as well as typical queries
- Update benchmark as usage patterns change