RAGAssistant API

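The RAGAssistant class provides a high-level interface for retrieval-augmented generation (RAG): indexing documents, retrieving relevant chunks, and generating answers with an LLM.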

Note

RAGAssistant is thread-safe as of v0.11.0. It uses lock-free atomic operations with immutable state, allowing concurrent reads while another thread writes. See Platform Integration for usage patterns.

Class Reference

Bases: object

High-level RAG assistant for document Q&A and generation.

Handles document indexing, retrieval, and LLM generation in one simple API.

Parameters:

documents (list[Document] or str or Path) – Documents to index. Can be:
  - a list of Document objects
  - a path to a single file
  - a path to a directory (all .txt, .md, and .rst files in it are loaded)
embed_fn (Callable[[str], list[float]], optional) – Function that takes text and returns an embedding vector. If provided, creates a FunctionProvider internally.
generate_fn (Callable, optional) – Function for text generation, accepting either (prompt) or (prompt, system_prompt). If you provide generate_fn, you must also provide embed_fn.
provider (BaseEmbeddingProvider or BaseLLMProvider, optional) – Provider for embeddings (and optionally the LLM). If embed_fn is provided, the provider is not used for embeddings.
embedding_model (str, optional) – Embedding model name (used with provider).
llm_model (str, optional) – LLM model name (used with provider).
chunk_size (int, optional) – Chunk size for splitting documents (default: 512).
chunk_overlap (int, optional) – Overlap between chunks (default: 50).

Raises:

ValueError – If neither embed_fn nor provider is provided.

Thread Safety

This class uses lock-free atomic operations for thread safety. Multiple threads can safely call retrieve() while another thread calls add_documents(). The IndexState is immutable, and reference swaps are atomic under Python's GIL.

Examples

>>> # With custom embedding function (retrieval-only)
>>> assistant = RAGAssistant(docs, embed_fn=my_embed)
>>> results = assistant.retrieve("query")
>>>
>>> # With custom embedding and LLM functions (full RAG)
>>> assistant = RAGAssistant(docs, embed_fn=my_embed, generate_fn=my_llm)
>>> answer = assistant.ask("What is X?")
>>>
>>> # With Ollama provider (supports nomic-embed-text)
>>> from ragit.providers import OllamaProvider
>>> assistant = RAGAssistant(docs, provider=OllamaProvider())
>>>
>>> # Save and load index for persistence
>>> assistant.save_index("/path/to/index")
>>> loaded = RAGAssistant.load_index("/path/to/index", provider=OllamaProvider())

__init__(documents: list[Document] | str | Path, embed_fn: Callable[[str], list[float]] | None = None, generate_fn: Callable[..., str] | None = None, provider: BaseEmbeddingProvider | BaseLLMProvider | None = None, embedding_model: str | None = None, llm_model: str | None = None, chunk_size: int = 512, chunk_overlap: int = 50)

add_documents(documents: list[Document] | str | Path) → int

Add documents to the existing index incrementally.

This method is thread-safe. It creates a new IndexState and atomically swaps the reference, ensuring readers always see a consistent state.

Parameters:

documents – Documents to add.

Returns:

Number of chunks added.

Raises:

IndexingError – If the embedding count doesn’t match the chunk count.
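The copy-on-write pattern described above (build a new immutable state, then swap a single reference) can be sketched as follows. `IndexState` here is a simplified stand-in holding only chunk texts (the real one also carries an embedding matrix), and `MiniIndex` is a hypothetical name. A reader fetches `self._state` once and works on that snapshot, so a concurrent swap never exposes a half-updated index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexState:
    """Simplified, immutable stand-in for the index state."""
    chunks: tuple[str, ...] = ()


class MiniIndex:
    """Hypothetical container illustrating copy-on-write updates."""

    def __init__(self) -> None:
        self._state = IndexState()

    def add_documents(self, new_chunks: list[str]) -> int:
        old = self._state                                         # snapshot the current state
        new = IndexState(chunks=old.chunks + tuple(new_chunks))  # build a brand-new object
        self._state = new                                         # single reference assignment
        return len(new_chunks)

    def retrieve_all(self) -> tuple[str, ...]:
        state = self._state  # read the reference once; this snapshot never changes
        return state.chunks
```

This sketch only makes readers safe against one writer at a time; two writers running concurrently could each build from the same old state and one update would be lost unless writes are serialized.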

remove_documents(source_path_pattern: str) → int

Remove documents matching a source path pattern.

This method is thread-safe. It creates a new IndexState and atomically swaps the reference.

Parameters:

source_path_pattern – Glob pattern to match ‘source’ metadata.

Returns:

Number of chunks removed.
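A sketch of pattern-based removal, assuming glob semantics like Python's `fnmatch` (the exact matching rules RAGAssistant uses are not specified here). Chunks are modeled as plain dicts with a `source` entry under `metadata`; the function name and data shape are hypothetical.

```python
from fnmatch import fnmatch


def remove_by_source(chunks: list[dict], pattern: str) -> tuple[list[dict], int]:
    """Return (kept_chunks, removed_count) for chunks whose metadata['source'] matches pattern."""
    kept = [c for c in chunks if not fnmatch(c["metadata"].get("source", ""), pattern)]
    return kept, len(chunks) - len(kept)
```

With `fnmatch`, `*` also matches `/`, so a pattern such as `docs/*.md` removes Markdown files in subdirectories of `docs/` as well.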

update_documents(documents: list[Document] | str | Path) → int

Update existing documents by removing their old chunks and adding the new versions.

Each document's source path identifies which existing chunks to remove.

Parameters:

documents – New versions of documents.

Returns:

Number of chunks added.
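Following the description above, an update amounts to remove-by-source followed by add. A sketch using the same hypothetical dict-based chunks, where `new_chunks` are the already-chunked new versions; like the documented method, it returns the number of chunks added. The function name is illustrative only.

```python
def update_by_source(chunks: list[dict], new_chunks: list[dict]) -> tuple[list[dict], int]:
    """Drop chunks whose source appears in new_chunks, then append new_chunks."""
    updated_sources = {c["metadata"]["source"] for c in new_chunks}
    kept = [c for c in chunks if c["metadata"].get("source") not in updated_sources]
    return kept + new_chunks, len(new_chunks)
```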

retrieve(query: str, top_k: int = 3) → list[tuple[Chunk, float]]

Retrieve relevant chunks for a query.

Uses vectorized cosine similarity to score the query against all chunks in one pass. This method is thread-safe: it reads a consistent snapshot of the index.

Parameters:

query (str) – Search query.
top_k (int) – Number of chunks to return (default: 3).

Returns:

List of (chunk, similarity_score) tuples, sorted by relevance.

Return type:

list[tuple[Chunk, float]]

Examples

>>> results = assistant.retrieve("how to create a route")
>>> for chunk, score in results:
...     print(f"{score:.2f}: {chunk.content[:100]}...")
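For intuition, the scoring that retrieve() describes (cosine similarity between the query embedding and every chunk embedding, then a descending sort truncated to top_k) looks like this in plain Python. RAGAssistant does the same computation in vectorized form over its embedding matrix; this loop-based version and the `top_k_cosine` name are illustrative only.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k_cosine(
    query_vec: list[float], chunk_vecs: list[list[float]], top_k: int = 3
) -> list[tuple[int, float]]:
    """Return (chunk_index, score) pairs for the top_k most similar chunks, best first."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```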

retrieve_with_context(query: str, top_k: int = 3, window_size: int = 1, min_score: float = 0.0) → list[tuple[Chunk, float]]

Retrieve chunks with adjacent context expansion (window search).

For each retrieved chunk, also includes adjacent chunks from the same document to provide more context. This is useful when relevant information spans multiple chunks.

Pattern inspired by ai4rag window_search.

Parameters:

query (str) – Search query.
top_k (int) – Number of initial chunks to retrieve (default: 3).
window_size (int) – Number of adjacent chunks to include on each side (default: 1). Set to 0 to disable window expansion.
min_score (float) – Minimum similarity score threshold (default: 0.0).

Returns:

List of (chunk, similarity_score) tuples, sorted by relevance. Adjacent chunks have slightly lower scores.

Return type:

list[tuple[Chunk, float]]

Examples

>>> # Get chunks with 1 adjacent chunk on each side
>>> results = assistant.retrieve_with_context("query", window_size=1)
>>> for chunk, score in results:
...     print(f"{score:.2f}: {chunk.content[:50]}...")
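The window expansion can be sketched as follows. Each hit is identified by `(doc_id, chunk_index, score)`; neighbours within `window_size` positions in the same document are added with the hit's score multiplied by a decay factor. The `0.9` decay value, the function name, and the data shapes are assumptions for illustration; the documentation only says that adjacent chunks receive slightly lower scores.

```python
def expand_window(
    hits: list[tuple[str, int, float]],
    doc_lengths: dict[str, int],
    window_size: int = 1,
    decay: float = 0.9,  # hypothetical score penalty for neighbouring chunks
) -> list[tuple[str, int, float]]:
    """Add neighbouring chunks around each (doc_id, chunk_index, score) hit.

    Keeps the best score seen for each chunk and returns results sorted by score.
    """
    best: dict[tuple[str, int], float] = {}
    for doc_id, idx, score in hits:
        for offset in range(-window_size, window_size + 1):
            pos = idx + offset
            if 0 <= pos < doc_lengths[doc_id]:  # stay inside the same document
                s = score if offset == 0 else score * decay
                key = (doc_id, pos)
                best[key] = max(best.get(key, float("-inf")), s)
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [(doc_id, pos, s) for (doc_id, pos), s in ranked]
```

Setting `window_size=0` returns only the original hits, matching the documented way to disable expansion.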

get_context_with_window(query: str, top_k: int = 3, window_size: int = 1, min_score: float = 0.0) → str

Get formatted context with adjacent chunk expansion.

Merges adjacent chunks into continuous passages, so text repeated where chunks overlap appears only once.

Parameters:

query (str) – Search query.
top_k (int) – Number of initial chunks to retrieve.
window_size (int) – Number of adjacent chunks on each side.
min_score (float) – Minimum similarity score threshold.

Returns:

Formatted context string with merged chunks.

Return type:

str

get_context(query: str, top_k: int = 3) → str

Get formatted context string from retrieved chunks.

Parameters:

query (str) – Search query.
top_k (int) – Number of chunks to include.

Returns:

Formatted context string.

Return type:

str

generate(prompt: str, system_prompt: str | None = None, temperature: float = 0.7) → str

Generate text using the LLM (without retrieval).

Parameters:

prompt (str) – User prompt.
system_prompt (str, optional) – System prompt for context.
temperature (float) – Sampling temperature (default: 0.7).

Returns:

Generated text.

Return type:

str

Raises:

NotImplementedError – If no LLM is configured.

ask(question: str, system_prompt: str | None = None, top_k: int = 3, temperature: float = 0.7) → str

Ask a question using RAG (retrieve + generate).

Parameters:

question (str) – Question to answer.
system_prompt (str, optional) – System prompt. Defaults to a helpful assistant prompt.
top_k (int) – Number of context chunks to retrieve (default: 3).
temperature (float) – Sampling temperature (default: 0.7).

Returns:

Generated answer.

Return type:

str

Raises:

NotImplementedError – If no LLM is configured.

Examples

>>> answer = assistant.ask("How do I create a REST API?")
>>> print(answer)

generate_code(request: str, language: str = 'python', top_k: int = 3, temperature: float = 0.7) → str

Generate code based on documentation context.

Parameters:

request (str) – Description of what code to generate.
language (str) – Programming language (default: “python”).
top_k (int) – Number of context chunks to retrieve.
temperature (float) – Sampling temperature.

Returns:

Generated code (cleaned, without markdown).

Return type:

str

Raises:

NotImplementedError – If no LLM is configured.

Examples

>>> code = assistant.generate_code("create a REST API with user endpoints")
>>> print(code)
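The Returns entry says the generated code is cleaned of Markdown. A common approach is to extract the body of the first fenced block when the model wraps its answer in one; the regex below is an assumed approach rather than the library's implementation, and `strip_markdown_fences` is a hypothetical name.

```python
import re

FENCE = "`" * 3  # three backticks, built indirectly so this snippet can sit inside a Markdown fence

# Opening fence with an optional language tag, then a lazily matched body up to the closing fence.
_FENCED = re.compile(FENCE + r"[\w+#.-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def strip_markdown_fences(text: str) -> str:
    """Return the body of the first fenced block in text, or text itself if there is none."""
    match = _FENCED.search(text)
    body = match.group(1) if match else text
    return body.strip()
```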

property num_chunks: int – Number of indexed chunks.

property chunk_count: int – Number of indexed chunks (alias for num_chunks).

property is_indexed: bool – Whether the index contains any documents.

property num_documents: int – Number of loaded documents.

property has_llm: bool – Whether an LLM is configured.

save_index(path: str | Path) → None

Save index to disk for later restoration.

Saves the index in an efficient format:

- chunks.json: chunk metadata and content
- embeddings.npy: NumPy array of embeddings (binary format)
- metadata.json: index configuration

Parameters:

path – Directory path to save index files.

Example

>>> assistant.save_index("/path/to/index")
>>> # Later...
>>> loaded = RAGAssistant.load_index("/path/to/index", provider=provider)

classmethod load_index(path: str | Path, provider: BaseEmbeddingProvider | BaseLLMProvider | None = None) → RAGAssistant

Load a previously saved index.

Parameters:

path – Directory path containing saved index files.
provider – Provider for embeddings/LLM (required for new queries).

Returns:

RAGAssistant instance with loaded index.

Raises:

IndexingError – If loaded index is corrupted (count mismatch).
FileNotFoundError – If index files don’t exist.

Example

>>> loaded = RAGAssistant.load_index("/path/to/index", provider=OllamaProvider())
>>> results = loaded.retrieve("query")

Quick Reference

Constructor

from ragit import RAGAssistant
from ragit.providers import OllamaProvider

assistant = RAGAssistant(
    source,                          # str, Path, or list of sources
    provider=OllamaProvider(),       # Required unless embed_fn is given
    chunk_size=512,                  # Characters per chunk
    chunk_overlap=50,                # Overlap between chunks
    llm_model="llama3",              # LLM model name
    embedding_model="mxbai-embed-large",  # Embedding model name
    pattern="*.txt",                 # File pattern for directories
    recursive=False                  # Recursive directory search
)

Methods

ask()

Ask a question and get an answer using RAG.

answer = assistant.ask("What is the installation process?")
print(answer)

retrieve()

Retrieve relevant chunks without generating an answer.

results = assistant.retrieve("database configuration", top_k=5)

for chunk, score in results:
    print(f"Score: {score:.3f}")
    print(f"Source: {chunk.doc_id}")
    print(f"Content: {chunk.content[:100]}...")

generate()

Direct LLM generation without retrieval.

response = assistant.generate("Explain what RAG means")
print(response)

generate_code()

Generate code with context from documents.

code = assistant.generate_code("Write a function to parse JSON")
print(code)

Examples

Basic Usage

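from ragit import RAGAssistant
from ragit.providers import OllamaProvider

# Create assistant (a provider or embed_fn is required)
assistant = RAGAssistant("docs/", provider=OllamaProvider())

# Ask questions
answer = assistant.ask("How do I get started?")
print(answer)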

Custom Configuration

from ragit import RAGAssistant
from ragit.providers import OllamaProvider

assistant = RAGAssistant(
    "docs/",
    provider=OllamaProvider(),          # Required unless embed_fn is given
    chunk_size=1024,                    # Larger chunks
    chunk_overlap=100,                  # More overlap
    llm_model="llama3:70b",             # Larger model
    embedding_model="nomic-embed-text"  # Different embeddings
)

Multiple Document Sources

from ragit import RAGAssistant
from ragit.providers import OllamaProvider

# Load from multiple sources
assistant = RAGAssistant(
    [
        "README.md",           # Single file
        "docs/",               # Directory
        "examples/tutorial/"   # Another directory
    ],
    provider=OllamaProvider(),
)

Recursive Loading

from ragit import RAGAssistant
from ragit.providers import OllamaProvider

# Load all markdown files recursively
assistant = RAGAssistant(
    "project/",
    provider=OllamaProvider(),
    pattern="**/*.md",
    recursive=True
)

Getting Sources with Answers

from ragit import RAGAssistant
from ragit.providers import OllamaProvider

assistant = RAGAssistant("docs/", provider=OllamaProvider())

question = "How do I configure logging?"

# Get sources
sources = assistant.retrieve(question, top_k=3)

# Get answer
answer = assistant.ask(question)

print(f"Answer: {answer}\n")
print("Based on:")
for chunk, score in sources:
    print(f"  - {chunk.doc_id} (relevance: {score:.2f})")

Index Persistence

Save and load indexes to avoid re-computing embeddings:

from ragit import RAGAssistant
from ragit.providers import OllamaProvider

# Build and save index
assistant = RAGAssistant("docs/", provider=OllamaProvider())
assistant.save_index("./my_index")

# Load index later (much faster)
loaded = RAGAssistant.load_index("./my_index", provider=OllamaProvider())
results = loaded.retrieve("query")

Thread Safety

RAGAssistant is thread-safe as of v0.11.0:

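import threading
from ragit import RAGAssistant
from ragit.providers import OllamaProvider

assistant = RAGAssistant("docs/", provider=OllamaProvider())

# Safe: concurrent reads
readers = [threading.Thread(target=lambda: assistant.retrieve("query")) for _ in range(10)]
for t in readers:
    t.start()

# Safe: read while writing (doc is a Document object to add)
writer = threading.Thread(target=lambda: assistant.add_documents([doc]))
reader = threading.Thread(target=lambda: assistant.retrieve("query"))
writer.start()
reader.start()

# Wait for all threads to finish
for t in [*readers, writer, reader]:
    t.join()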

Internal Attributes

These attributes are available but considered internal:

# Access index state (immutable IndexState dataclass)
state = assistant._state
print(f"Total chunks: {len(state.chunks)}")
print(f"Matrix shape: {state.embedding_matrix.shape}")

# Use public properties instead
print(f"Chunk count: {assistant.chunk_count}")
print(f"Is indexed: {assistant.is_indexed}")