AI-Powered Documentation Search¶

Hera includes an optional AI search feature that lets you ask natural-language questions about the documentation and source code. It uses a self-hosted LLM (via Ollama) to generate answers from the indexed content.

The entire pipeline runs locally — no data is sent to external services.

How it works¶

When enabled, a floating search button appears on every documentation page. Click it to open a search panel where you can type questions in plain English. The system:

1. Searches the indexed docs and docstrings for relevant content (via Qdrant vector search)
2. Fetches the full text of each match from Cassandra (source of truth)
3. Sends the top matches and your question to a local LLM (Ollama)
4. Streams the answer back in real time
5. Shows source references with relevance scores
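
The retrieval steps above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the in-memory `VECTORS` and `TEXTS` dictionaries stand in for Qdrant and Cassandra, the three-dimensional vectors stand in for real embeddings, and `build_prompt` is a hypothetical helper, not part of Hera. The final generation and streaming steps (Ollama) are left out.

```python
import math

# Hypothetical stand-ins: a real deployment would keep VECTORS in Qdrant
# and TEXTS in Cassandra, keyed by the same chunk id.
VECTORS = {
    "c1": [1.0, 0.0, 0.0],
    "c2": [0.0, 1.0, 0.0],
    "c3": [0.7, 0.7, 0.0],
}
TEXTS = {
    "c1": "Projects are created with the project API.",
    "c2": "Data sources can be versioned.",
    "c3": "A project can load a repository of data sources.",
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_search(query_vec, top_k):
    """Step 1: rank chunk ids by similarity to the query vector."""
    scored = [(cosine(query_vec, vec), cid) for cid, vec in VECTORS.items()]
    scored.sort(reverse=True)
    return scored[:top_k]


def build_prompt(question, query_vec, top_k=2):
    """Steps 2-3: fetch full text for the hits and assemble the LLM prompt."""
    hits = vector_search(query_vec, top_k)
    context = "\n".join(f"[{score:.2f}] {TEXTS[cid]}" for score, cid in hits)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return prompt, hits


prompt, hits = build_prompt("How do I create a project?", [1.0, 0.1, 0.0])
print(prompt)
```

The `hits` list keeps each score next to its chunk id, which is the information the last step needs to show source references with relevance scores.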

Quick start (all-in-one)¶

If you just want everything running with one command:

make rag-setup
make rag-docs-serve

make rag-setup installs the dependencies, starts all services, downloads the LLM, and builds the index. make rag-docs-serve then starts the API and serves the docs with the search widget enabled.

For step-by-step control, follow the three phases below.

Phase 1: Set up the RAG environment¶

Prerequisites¶

Docker — all three backend services run as containers
~6 GB disk — for the LLM model (~4 GB) + databases (~2 GB)
~4 GB RAM — for embedding model + Ollama inference

Start the services¶

The RAG system uses three backend services:

Service	Purpose	Default port
Qdrant	Vector store for fast similarity search	`6333`
Cassandra	Text store (source of truth for chunk content)	`9042`
Ollama	Self-hosted LLM inference	`11434`

Start all three at once:

make rag-services-up

Or start individually:

make rag-qdrant-up       # Vector store
make rag-cassandra-up    # Text store (can take 30–60 s to become ready)
make rag-ollama-up       # LLM server

Download the LLM model¶

make rag-ollama-pull

This downloads the default model (llama3, ~4 GB). To use a different model:

OLLAMA_MODEL=mistral make rag-ollama-pull

Verify services are running¶

make rag-services-status

You should see three running containers: hera-qdrant, hera-cassandra, hera-ollama.

Data locations¶

Service	Data stored at
Qdrant	`~/qdrant-data/`
Cassandra	`~/cassandra-data/`
Ollama	`~/ollama-data/`

Override with environment variables: QDRANT_DATA, CASS_DATA, OLLAMA_DATA.
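
As an illustration of the override rule, the sketch below resolves each data directory from its environment variable and falls back to the default from the table. `resolve_data_dirs` is a hypothetical helper written for this page. The real logic presumably lives in the Makefile, so treat this only as a way to preview which paths would be used.

```python
import os
from pathlib import Path

# Defaults copied from the table above.
DATA_DEFAULTS = {
    "QDRANT_DATA": "~/qdrant-data",
    "CASS_DATA": "~/cassandra-data",
    "OLLAMA_DATA": "~/ollama-data",
}


def resolve_data_dirs(env=None):
    """Map each variable name to its override, or to the expanded default."""
    env = os.environ if env is None else env
    return {
        var: Path(env.get(var, default)).expanduser()
        for var, default in DATA_DEFAULTS.items()
    }


for var, path in resolve_data_dirs().items():
    print(f"{var} -> {path}")
```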

Configuration¶

All RAG settings use the RAG_ prefix and can be set in a .env file (see .env.example):

# .env (in project root)
RAG_OLLAMA_MODEL=llama3
RAG_OLLAMA_BASE_URL=http://localhost:11434
RAG_QDRANT_HOST=localhost
RAG_QDRANT_PORT=6333
RAG_CASSANDRA_HOSTS=["localhost"]
RAG_CASSANDRA_PORT=9042
RAG_EMBED_MODEL=BAAI/bge-small-en-v1.5
RAG_TOP_K=5
RAG_CHUNK_SIZE=1000
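
To make the RAG_ prefix convention concrete, here is a minimal sketch of how prefixed variables could be mapped onto typed settings using only the standard library. `RagSettings` and `load_settings` are hypothetical names, and Hera's actual settings loader may work differently (for example, it may read the .env file itself, which this sketch does not do). The defaults mirror the sample .env above.

```python
import json
import os
from dataclasses import dataclass, field


@dataclass
class RagSettings:
    # Defaults mirror the sample .env above.
    ollama_model: str = "llama3"
    ollama_base_url: str = "http://localhost:11434"
    qdrant_host: str = "localhost"
    qdrant_port: int = 6333
    cassandra_hosts: list = field(default_factory=lambda: ["localhost"])
    cassandra_port: int = 9042
    embed_model: str = "BAAI/bge-small-en-v1.5"
    top_k: int = 5
    chunk_size: int = 1000


def load_settings(env=None, prefix="RAG_"):
    """Build RagSettings from prefixed variables, falling back to defaults."""
    env = os.environ if env is None else env
    settings = RagSettings()
    for name, default in list(vars(settings).items()):
        raw = env.get(prefix + name.upper())
        if raw is None:
            continue
        if isinstance(default, list):
            value = json.loads(raw)  # list values are JSON, e.g. '["localhost"]'
        elif isinstance(default, int):
            value = int(raw)
        else:
            value = raw
        setattr(settings, name, value)
    return settings


print(load_settings({"RAG_QDRANT_PORT": "16333"}))
```

Note that RAG_CASSANDRA_HOSTS is written as a JSON list in the sample, which is why the sketch parses list-typed fields with `json.loads`.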

Phase 2: Set up the RAG index¶

Install Python dependencies¶

pip install -e .[rag]

This installs sentence-transformers, qdrant-client, cassandra-driver, fastapi, typer, and related packages.

Build the index¶

make rag-index

Or via the CLI directly:

hera-rag-search index --docs docs --code hera

What gets indexed¶

Content type	Source	Chunking strategy
Markdown docs	`docs/**/*.md`	Split by headings (H1/H2/H3), sliding window for long sections
Python docstrings	`hera/**/*.py`	One chunk per module/class/function docstring (AST-parsed)
Jupyter notebooks	`docs/**/*.ipynb`	One chunk per cell

Small chunks (< 80 characters) are automatically skipped.
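
The Markdown strategy described above (split at H1 to H3 headings, window long sections, skip anything under 80 characters) could look roughly like the following. This is a sketch and not Hera's indexer: the function names are hypothetical, the 200-character overlap is an assumed value, and the chunk size is assumed to be measured in characters. It also does not special-case `#` lines inside fenced code blocks, which a real indexer would need to handle.

```python
import re

HEADING = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_headings(markdown):
    """Yield (section_title, body) pairs, splitting at H1/H2/H3 headings."""
    title, lines = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            if lines:
                yield title, "\n".join(lines).strip()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    if lines:
        yield title, "\n".join(lines).strip()


def sliding_window(text, size=1000, overlap=200):
    """Cut long text into overlapping windows of at most `size` characters."""
    if len(text) <= size:
        return [text]
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text) - overlap, step)]


def chunk_markdown(markdown, size=1000, overlap=200, min_chars=80):
    """Split by heading, window long sections, drop chunks under min_chars."""
    chunks = []
    for title, body in split_by_headings(markdown):
        for piece in sliding_window(body, size, overlap):
            if len(piece) >= min_chars:
                chunks.append({"section": title, "text": piece})
    return chunks


sample = "# Intro\n" + "a" * 100 + "\n## Tiny\nshort\n## Long\n" + "b" * 2500
for chunk in chunk_markdown(sample):
    print(chunk["section"], len(chunk["text"]))
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.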

Index output¶

After indexing, you'll see a summary like:

Indexed: 450 doc chunks, 380 docstrings, 25 notebook cells
Total: 855 chunks in Qdrant + Cassandra
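
The docstring chunks counted in that summary are described above as AST-parsed, one per module, class, or function. A minimal sketch of that idea with Python's standard `ast` module follows. `extract_docstrings` is a hypothetical helper and the dotted naming scheme is an assumption, so the real indexer may record different metadata.

```python
import ast


def extract_docstrings(source, module_name="<module>"):
    """Return (qualified_name, docstring) pairs for a module and the
    classes and functions defined directly inside it (recursively)."""
    tree = ast.parse(source)
    found = []

    def visit(node, prefix):
        doc = ast.get_docstring(node)
        if doc:
            found.append((prefix, doc))
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
                visit(child, f"{prefix}.{child.name}")

    visit(tree, module_name)
    return found


example = '"""Toolkit module."""\n\ndef load():\n    """Load the data."""\n    return 1\n'
for name, doc in extract_docstrings(example, "hera.example"):
    print(name, "->", doc)
```

Because the source is parsed rather than imported, this approach does not execute any project code while indexing.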

Rebuild options¶

# Incremental: only index new/changed files
make rag-index

# Full rebuild: wipe everything and re-index
make rag-reindex

# Docs only (skip Python docstrings)
make rag-index-docs

Auto re-index on file changes¶

make rag-watch

This watches docs/ and hera/ for changes and automatically re-indexes modified files, using a 2-second debounce so that a burst of saves triggers a single re-index.
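
For readers unfamiliar with debouncing, the sketch below shows the general technique: change events are collected, and the batch is released only after no new event has arrived for the delay period. The `Debouncer` class is hypothetical and is not the watcher Hera ships. The injectable `clock` exists only so the behavior can be demonstrated without real waiting.

```python
import time


class Debouncer:
    """Collect changed paths and release them once no new event has
    arrived for `delay` seconds."""

    def __init__(self, delay=2.0, clock=time.monotonic):
        self.delay = delay
        self.clock = clock
        self.pending = set()
        self.last_event = None

    def notify(self, path):
        """Record a change event and restart the quiet-period timer."""
        self.pending.add(path)
        self.last_event = self.clock()

    def ready(self):
        """Return the sorted batch if the quiet period has elapsed, else []."""
        if not self.pending or self.clock() - self.last_event < self.delay:
            return []
        batch, self.pending = sorted(self.pending), set()
        return batch


# Demonstration with a fake clock so no real time passes.
now = [0.0]
debouncer = Debouncer(delay=2.0, clock=lambda: now[0])
debouncer.notify("docs/index.md")
now[0] = 1.0
debouncer.notify("docs/index.md")  # second save restarts the timer
now[0] = 3.0
print(debouncer.ready())
```

A watcher loop would call `notify` from its file-event callback and poll `ready` periodically, re-indexing whatever batch comes back.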

Phase 3: Activate the RAG¶

Start the REST API server¶

make rag-serve

The API runs at http://localhost:8765. Verify it's working:

curl http://localhost:8765/health
# {"status": "ok", "model": "llama3"}
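
If you start the API from a script, it can help to poll the health endpoint instead of checking by hand. The sketch below does that with the standard library. `wait_for_health` and `fetch_json` are hypothetical helpers, the retry counts are arbitrary, and the only assumption about the response is the `status` field shown in the sample output above.

```python
import json
import time
import urllib.request


def fetch_json(url, timeout=2.0):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_health(url="http://localhost:8765/health", attempts=10,
                    delay=1.0, fetch=fetch_json, sleep=time.sleep):
    """Poll the endpoint until it reports status "ok" or attempts run out.

    Returns the decoded body on success and None otherwise.
    """
    for _ in range(attempts):
        try:
            body = fetch(url)
            if body.get("status") == "ok":
                return body
        except (OSError, ValueError):
            pass  # server not reachable yet, or body was not valid JSON
        sleep(delay)
    return None


if __name__ == "__main__":
    result = wait_for_health()
    print("API ready:" if result else "API not reachable", result)
```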

To browse the docs with the search widget, run:

make rag-docs-serve

This starts both the RAG API and MkDocs with the search widget enabled. The floating "Ask AI" button appears in the bottom-right corner of every page.

To enable manually:

RAG_ENABLED=true mkdocs serve

Use from the command line¶

# Ask a question (LLM generates the answer)
hera-rag-search search "How do I create a project and load a repository?"

# Search without LLM (show matching chunks only)
hera-rag-search search "demography API" --raw

# Filter by content type
hera-rag-search search "addDataSource" --type docstring
hera-rag-search search "project lifecycle" --type markdown

# Filter by source path
hera-rag-search search "LSM" --source toolkits/

Use from Python¶

from hera.utils.rag import RAGSearch

rag = RAGSearch()

# Full answer
answer = rag.ask("How does the risk assessment toolkit work?")
print(answer)

# Streaming answer
for token in rag.stream("Show me the demography API"):
    print(token, end="", flush=True)

# Raw chunks (no LLM)
chunks = rag.retrieve("data source versioning", top_k=10)
for c in chunks:
    print(f"{c['score']:.3f}  {c['source']} § {c['section']}")

Use the REST API directly¶

# Search with LLM answer
curl -X POST http://localhost:8765/search \
  -H "Content-Type: application/json" \
  -d '{"query": "How do I create a project?", "top_k": 5}'

# Stream answer (SSE)
curl -N "http://localhost:8765/stream?q=How+does+the+LSM+toolkit+work"

# Chunks only (no LLM)
curl -X POST http://localhost:8765/search \
  -H "Content-Type: application/json" \
  -d '{"query": "How do I create a project?", "no_llm": true}'

API docs are available at http://localhost:8765/docs (OpenAPI/Swagger).
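
The same calls can be made from Python without extra dependencies. The sketch below mirrors the curl examples: it builds the POST /search request and parses `data:` lines from a Server-Sent Events stream. The helper names are hypothetical, and the shape of the JSON response and of each streamed event is not specified on this page, so check the OpenAPI docs before relying on particular fields.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8765"


def build_search_request(query, top_k=5, no_llm=False, base_url=BASE_URL):
    """Build the POST /search request with the JSON body from the curl examples."""
    payload = {"query": query, "top_k": top_k}
    if no_llm:
        payload["no_llm"] = True
    return urllib.request.Request(
        f"{base_url}/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query, **kwargs):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_search_request(query, **kwargs)) as resp:
        return json.loads(resp.read().decode("utf-8"))


def iter_sse_data(lines):
    """Yield the payload of each `data:` line from an SSE stream."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            # Per the SSE format, a single leading space after the colon is dropped.
            yield value[1:] if value.startswith(" ") else value
```

To stream, open the /stream URL with `urllib.request.urlopen` and pass the response object to `iter_sse_data`, since iterating an HTTP response yields its lines as bytes.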

Combined: serve + auto re-index¶

make rag-serve-watch

This starts the API server and watches for file changes simultaneously.

Troubleshooting¶

Cassandra takes a long time to start¶

Cassandra can take 30–60 seconds to become ready after the container starts. The make rag-cassandra-up target waits automatically. If you see connection errors, wait a minute and retry.
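
When scripting around the startup delay, one option is to wait until the port accepts TCP connections before indexing. The sketch below is a generic helper, not what the Makefile target does, and `wait_for_port` is a hypothetical name. An open port is only a rough signal: Cassandra may accept connections slightly before it is ready to serve queries, so a retry on the first real request is still worthwhile.

```python
import socket
import time


def wait_for_port(host="localhost", port=9042, timeout=90.0, interval=2.0):
    """Return True once a TCP connection to host:port succeeds,
    or False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


if __name__ == "__main__":
    print("Cassandra port open:", wait_for_port())
```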

Ollama model download is slow¶

The default llama3 model is ~4 GB. Use a smaller model for faster setup:

export RAG_OLLAMA_MODEL=phi3
make rag-ollama-pull

Port conflicts¶

If ports 6333, 9042, or 11434 are already in use, either stop the conflicting service or change the RAG ports in .env:

RAG_QDRANT_PORT=16333
RAG_CASSANDRA_PORT=19042
RAG_OLLAMA_BASE_URL=http://localhost:21434
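
To find out which of the default ports is taken before starting the services, a quick check is to try binding each one. This sketch uses hypothetical helper names and tests only the loopback interface. A successful bind suggests the port is free, but some container networking setups reserve ports without a listening process, so treat the result as a hint.

```python
import socket

DEFAULT_PORTS = {"qdrant": 6333, "cassandra": 9042, "ollama": 11434}


def port_is_free(port, host="127.0.0.1"):
    """Return True if host:port can be bound, i.e. nothing else holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def find_conflicts(ports=DEFAULT_PORTS):
    """Return the names of services whose port is already taken."""
    return [name for name, port in ports.items() if not port_is_free(port)]


if __name__ == "__main__":
    print("Ports already in use:", find_conflicts() or "none")
```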

Embedding model download¶

The sentence-transformers model (BAAI/bge-small-en-v1.5, ~130 MB) downloads automatically on first use. Subsequent runs use the cached version.

Re-indexing after documentation changes¶

If search results seem stale, rebuild the index:

make rag-reindex

Cleanup¶

# Stop all services (keep data)
make rag-services-down

# Stop services AND delete all data (qdrant, cassandra, ollama)
make rag-clean

Disabling AI search¶

Set RAG_ENABLED=false (the default) or simply don't start the RAG server. The docs site works normally without it: the widget appears only when RAG_ENABLED=true and the RAG API is reachable.

All Makefile targets¶

Target	Purpose
`make rag-setup`	Full setup: install + services + model + index
`make rag-services-up`	Start Qdrant + Cassandra + Ollama
`make rag-services-down`	Stop all services
`make rag-services-status`	Show service status
`make rag-ollama-pull`	Download LLM model
`make rag-index`	Build search index
`make rag-reindex`	Wipe and rebuild index
`make rag-index-docs`	Index docs only (skip docstrings)
`make rag-search`	Search with default query
`make rag-search-raw`	Search without LLM
`make rag-serve`	Start REST API server
`make rag-serve-watch`	Serve + auto re-index on changes
`make rag-watch`	Watch files for auto re-indexing
`make rag-docs-serve`	MkDocs + RAG widget enabled
`make rag-docs-build`	Build static site + RAG widget
`make rag-clean`	Stop services + delete all data