# Seven Production RAG Lessons

## 1. Naive Chunking Destroys Retrieval Quality
Fixed-size 512-token chunks routinely split a claim from its supporting evidence. Semantic chunking, which detects topic boundaries via the sentence-transformer similarity of adjacent sentences, improved RAGAS context precision by roughly 22%.
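
A minimal sketch of that boundary detection, assuming sentence-level input; the model name and the 0.55 threshold are illustrative stand-ins, not the exact values from our pipeline:

```python
# Semantic chunking sketch: start a new chunk wherever similarity
# between adjacent sentences drops below a threshold. Model and
# threshold are illustrative, not our production values.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(sentences: list[str], threshold: float = 0.55) -> list[str]:
    if not sentences:
        return []
    embeddings = model.encode(sentences)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # A similarity drop against the previous sentence marks a topic boundary.
        if cos_sim(embeddings[i - 1], embeddings[i]).item() < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```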

## 2. Vector Search Alone Is Not Enough
Dense embeddings miss exact-keyword queries such as IDs, error codes, and product names. We run BM25 (sparse) and pgvector (dense) retrieval in parallel and merge the ranked lists with Reciprocal Rank Fusion; RAGAS answer relevancy went from 0.71 to 0.84.
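
The fusion step itself is a few lines. In this sketch, `bm25_ids` and `dense_ids` stand in for whatever ranked chunk-ID lists your two retrievers return; `k=60` is the constant from the original RRF paper:

```python
# Reciprocal Rank Fusion: merge two ranked lists of chunk IDs.
# The retriever calls that produce bm25_ids / dense_ids are assumed.
from collections import defaultdict

def rrf_merge(bm25_ids: list[str], dense_ids: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in (bm25_ids, dense_ids):
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1/(k + rank) for every chunk it ranked.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```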

## 3. Ship an Eval Suite Before You Ship the Feature
Build a 100-question eval set, score it with RAGAS metrics and an LLM-as-judge, and wire it into CI so that deployment is blocked if any metric drops by more than 0.1. Without this you cannot prove you improved anything.
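
One way to wire the gate, assuming an earlier CI step has written the RAGAS scores to JSON; the file paths and metric layout here are hypothetical:

```python
# CI gate sketch: fail the build if any eval metric regresses by more
# than 0.1 versus the stored baseline. Producing the scores (RAGAS over
# the 100-question set) is assumed to happen in an upstream step.
import json
import sys

MAX_DROP = 0.1

def check_regression(baseline_path: str, current_path: str) -> None:
    baseline = json.load(open(baseline_path))
    current = json.load(open(current_path))
    failures = [
        f"{name}: {baseline[name]:.2f} -> {score:.2f}"
        for name, score in current.items()
        if baseline.get(name, 0.0) - score > MAX_DROP
    ]
    if failures:
        print("Eval regression, blocking deploy:\n" + "\n".join(failures))
        sys.exit(1)  # Non-zero exit fails the CI job.

if __name__ == "__main__":
    check_regression("eval/baseline.json", "eval/current.json")
```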

## 4. Drift Is Silent
A RAG system scoring 0.87 at launch can be scoring 0.71 six months later. Documents change, query patterns shift, and nothing throws an error along the way. Nightly drift detection with Evidently AI plus automated re-ingestion solves this.
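
We use Evidently for this in production; as a rough illustration of what the nightly job checks, here is a hand-rolled equivalent that compares recent query embeddings against a launch-time baseline. The file names and the 0.05 threshold are made up for the example:

```python
# Hand-rolled stand-in for the nightly drift check (Evidently AI does
# this for us in production). Compares the centroid of recent query
# embeddings to a baseline captured at launch.
import numpy as np

def embedding_drift(baseline: np.ndarray, recent: np.ndarray) -> float:
    """Cosine distance between mean embeddings (0 = no drift)."""
    a, b = baseline.mean(axis=0), recent.mean(axis=0)
    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - float(cosine)

# Hypothetical dump files of (n, dim) query-embedding matrices.
if embedding_drift(np.load("launch_queries.npy"),
                   np.load("last_7_days.npy")) > 0.05:
    print("Query drift detected: trigger re-ingestion and re-eval")
```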

## 5. Guardrails Are Not Optional for Public Systems
Direct prompt injection, indirect injection via ingested documents, and context-extraction attempts all show up within 48 hours of any public launch.
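
A pattern list like the sketch below is only a first line of defense against the casual attacks; run it over both user queries and ingested documents (that is where indirect injection hides), and back it with a real guardrail layer:

```python
# First-line heuristic only: catches lazy injection attempts, not a
# substitute for a proper guardrail layer. Patterns are illustrative.
import re

INJECTION_PATTERNS = [
    r"ignore (all |the )?(previous|prior|above) instructions",
    r"reveal (your )?(system|hidden) prompt",
    r"you are now\b",
]

def looks_like_injection(text: str) -> bool:
    # Apply to user queries AND to documents at ingestion time.
    return any(re.search(p, text, re.IGNORECASE) for p in INJECTION_PATTERNS)
```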

## 6. Token Cost Compounds Fast
Left unmonitored, token spend turns into surprise bills. Our fix is a `spend_log` table, a daily Grafana cost dashboard, and semantic caching: if a new query's embedding has cosine similarity above 0.98 with a cached one, return the cached response and skip the LLM call.
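
The cache logic, sketched with an in-memory store standing in for whatever backing store you use in production; the 0.98 cosine threshold is the one quoted above:

```python
# Semantic cache sketch. Embeddings are normalized on insert so the
# lookup is a plain dot product; the linear scan is a simplification.
import numpy as np

class SemanticCache:
    def __init__(self, threshold: float = 0.98):
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []  # (embedding, response)

    def get(self, query_emb: np.ndarray) -> str | None:
        q = query_emb / np.linalg.norm(query_emb)
        for emb, response in self.entries:
            if float(np.dot(q, emb)) > self.threshold:
                return response  # Cache hit: no LLM call.
        return None  # Cache miss: caller invokes the LLM, then put().

    def put(self, query_emb: np.ndarray, response: str) -> None:
        self.entries.append((query_emb / np.linalg.norm(query_emb), response))
```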

## 7. Build the /debug Endpoint First
`GET /ai/rag/debug?q=query` returns the query, its expansions, the top-20 chunks before reranking, the top-5 after reranking, and the final assembled prompt. This single endpoint saved 4 hours of debugging in week one of production traffic.
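
A FastAPI sketch of the payload shape; `expand_query`, `retrieve`, `rerank`, and `build_prompt` are hypothetical stubs standing in for your real pipeline hooks:

```python
# Debug endpoint sketch: expose every intermediate pipeline stage.
from fastapi import FastAPI

app = FastAPI()

# Hypothetical stubs -- replace with your actual pipeline calls.
def expand_query(q: str) -> list[str]: return [q]
def retrieve(expansions: list[str], k: int) -> list[str]: return []
def rerank(q: str, chunks: list[str]) -> list[str]: return chunks
def build_prompt(q: str, chunks: list[str]) -> str: return q

@app.get("/ai/rag/debug")
def rag_debug(q: str) -> dict:
    expansions = expand_query(q)
    candidates = retrieve(expansions, k=20)   # top-20 pre-rerank
    ranked = rerank(q, candidates)[:5]        # top-5 post-rerank
    return {
        "query": q,
        "expansions": expansions,
        "pre_rank_chunks": candidates,
        "post_rank_chunks": ranked,
        "final_prompt": build_prompt(q, ranked),
    }
```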