Lesson Learned: RAG Vectorization Gap - Critical Knowledge Base Failure
ID: LL-017
Date: December 12, 2025
Severity: HIGH
Category: data_integrity, verification
Discovered By: CEO questioned RAG status
Root Cause: CTO failed to monitor vectorization completeness
The Failure
87% of RAG documents (972/1113) were NOT vectorized.
The system had:
- 1,113 documents in data/rag/in_memory_store.json (text only)
- Only 141 documents in ChromaDB with actual vector embeddings
- 972 documents could only be found via keyword search, NOT semantic search
Impact: When asking "What did Buffett say about market timing?", the system could only find documents containing those exact words, missing conceptually similar content that used different wording.
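
To make the impact concrete, here is a minimal sketch of keyword-only retrieval. The two documents and the `keyword_search` helper are hypothetical and are not taken from the actual system; the sketch only shows that exact-term matching skips a passage that makes the same point in different words, which is the passage semantic search over embeddings is meant to recover.

```python
def keyword_search(query: str, docs: list[str]) -> list[str]:
    """Return the docs that contain every query term (case-insensitive substring match)."""
    terms = query.lower().split()
    return [doc for doc in docs if all(term in doc.lower() for term in terms)]


# Hypothetical corpus: both passages are about the same idea.
DOCS = [
    "Buffett on market timing: he says he never tries it.",
    "Buffett argues nobody can predict short-term price swings, so stay invested.",
]

hits = keyword_search("Buffett market timing", DOCS)
# Only the first passage matches; the second never uses the words "market" or "timing".
```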
Why It Wasn't Caught
- Health check script was incomplete (scripts/verify_rag_hygiene.py)
  - Checked if files exist ✅
  - Checked if dependencies installed ✅
  - DID NOT check in-memory vs ChromaDB document count gap ❌
- Dashboard showed document counts but not vectorization status
  - Listed sources and counts
  - Never compared what's stored vs what's vectorized ❌
- No automatic alerting for vectorization gaps
  - No threshold-based alerts
  - No daily vectorization health check
- CTO didn't proactively audit RAG completeness
  - Assumed system was working
  - Didn't verify vectorization after data ingestion
The Fix
1. Added Vectorization Gap Check to verify_rag_hygiene.py
def _check_vectorization_gap(self) -> None:
    """CRITICAL: Check if all documents are vectorized."""
    in_mem_count = len(in_memory_store["documents"])
    chroma_count = chromadb_collection.count()
    gap = in_mem_count - chroma_count
    if gap > 0:
        # FAIL if >10% unvectorized, otherwise WARN
        pct_unvectorized = gap / in_mem_count * 100
        status = "FAIL" if pct_unvectorized > 10 else "WARN"
        message = f"VECTORIZATION GAP: {gap} docs ({pct_unvectorized:.0f}%) NOT vectorized"
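
The excerpt above depends on state held elsewhere in the script. As a self-contained sketch of the same logic, the function below takes the two counts as plain integers and returns a result object. The `GapResult` type, the `fail_pct` parameter, and the function name are assumptions made for illustration, not the script's actual interface; the 10% cutoff and the message formats follow this document.

```python
from dataclasses import dataclass


@dataclass
class GapResult:
    status: str              # "PASS", "WARN", or "FAIL"
    gap: int                 # documents without embeddings
    pct_unvectorized: float  # gap as a percentage of the text store
    message: str


def check_vectorization_gap(in_mem_count: int, chroma_count: int,
                            fail_pct: float = 10.0) -> GapResult:
    """Compare the text-store count with the vector-store count."""
    gap = max(in_mem_count - chroma_count, 0)
    if gap == 0:
        return GapResult("PASS", 0, 0.0,
                         "Vectorization Gap: PASS (0 documents unvectorized)")
    # gap > 0 implies in_mem_count > 0, so the division is safe.
    pct = gap / in_mem_count * 100
    status = "FAIL" if pct > fail_pct else "WARN"
    message = f"VECTORIZATION GAP: {gap} docs ({pct:.0f}%) NOT vectorized"
    return GapResult(status, gap, pct, message)


# The incident's numbers: 1,113 stored documents, 141 vectorized.
incident = check_vectorization_gap(1113, 141)
```

With the incident's counts this yields a gap of 972 documents and a FAIL status, matching the 87% figure reported at the top of this lesson.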
2. Added to Progress Dashboard
The dashboard now shows:
- Total documents vs vectorized count
- Vectorization progress bar
- Gap warning with action item
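
One way the dashboard line could be rendered is sketched below. This is not the code in generate_progress_dashboard.py; the function name, the bar characters, and the wording of the action item are assumptions chosen for illustration.

```python
def render_vectorization_bar(total: int, vectorized: int, width: int = 20) -> str:
    """Render a one-line text progress bar with a gap warning when needed."""
    frac = vectorized / total if total else 0.0
    frac = min(max(frac, 0.0), 1.0)          # clamp in case the counts disagree
    filled = int(frac * width)
    bar = "#" * filled + "-" * (width - filled)
    line = f"[{bar}] {vectorized}/{total} vectorized ({frac * 100:.0f}%)"
    gap = total - vectorized
    if gap > 0:
        # Hypothetical action-item wording.
        line += f" | WARNING: {gap} unvectorized, re-run vectorization"
    return line


incident_line = render_vectorization_bar(1113, 141)
```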
3. Prevention Checklist
Before and after any RAG ingestion:
- Verify ChromaDB is accessible
- Check sentence-transformers can load model
- After ingestion, compare counts: in_memory == chromadb
- Run python scripts/verify_rag_hygiene.py to confirm
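
The count comparison in the checklist can be sketched as follows. It assumes the JSON store keeps its entries under a top-level "documents" key, as the gap check in this document suggests; the helper names are hypothetical. The vector-store count is passed in as a plain integer so the sketch has no external dependencies; in practice it would come from the ChromaDB collection's `count()` method.

```python
import json
from pathlib import Path


def count_in_memory_docs(store_path: Path) -> int:
    """Count entries under the 'documents' key of the JSON text store."""
    data = json.loads(store_path.read_text(encoding="utf-8"))
    return len(data["documents"])


def verify_counts_match(in_memory_count: int, chroma_count: int) -> None:
    """Raise if the text store and the vector store disagree."""
    if in_memory_count != chroma_count:
        raise RuntimeError(
            f"count mismatch: in_memory={in_memory_count}, chromadb={chroma_count}"
        )
```

Raising instead of logging is deliberate here: a mismatch after ingestion should stop the pipeline so that it is noticed.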
Root Cause Analysis
Technical: The in-memory fallback stores documents WITHOUT embeddings when:
- ChromaDB isn't installed
- sentence-transformers can't download the model (network blocked)
- HuggingFace is unreachable
Process: No verification step after ingestion to confirm vectorization succeeded.
Cultural: CTO assumed system was working without verification. CEO had to discover the gap.
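
The technical and process causes above suggest making the fallback loud instead of silent. The sketch below is one possible shape for that, not the project's ingestion code: `ingest`, `VectorizationError`, the `embed` callable, and the `allow_text_only` flag are all hypothetical names. It counts the documents that could not be embedded and refuses to finish quietly unless the caller explicitly opts in.

```python
from typing import Callable, Optional, Sequence

Embedder = Callable[[str], Sequence[float]]


class VectorizationError(RuntimeError):
    """Raised when documents would end up without embeddings."""


def ingest(docs: Sequence[str], embed: Optional[Embedder],
           allow_text_only: bool = False) -> dict:
    """Embed each document and report how many fell back to text-only."""
    vectorized = 0
    text_only = 0
    for doc in docs:
        try:
            if embed is None:
                raise RuntimeError("no embedding backend available")
            embed(doc)          # a real pipeline would store this vector
            vectorized += 1
        except Exception:
            text_only += 1      # the path that previously failed silently
    if text_only and not allow_text_only:
        raise VectorizationError(
            f"{text_only} of {len(docs)} documents could not be vectorized"
        )
    return {"vectorized": vectorized, "text_only": text_only}
```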
Prevention Protocol
Automated Checks (Added)
- Pre-commit hook: Verify RAG hygiene passes
- CI/CD check: Run verify_rag_hygiene.py in GitHub Actions
- Dashboard alert: Red warning if vectorization < 90%
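
For a pre-commit hook or CI job to block on this check, the script has to signal failure through its exit code. The sketch below shows that pattern under stated assumptions: the `--total` and `--vectorized` flags are hypothetical and are not the real command-line interface of verify_rag_hygiene.py, and the 90% floor mirrors the dashboard alert listed above.

```python
import argparse
import sys
from typing import Optional, Sequence

MIN_VECTORIZED_PCT = 90.0  # mirrors the "< 90%" dashboard alert


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Return 0 when vectorization coverage is acceptable, 1 otherwise."""
    parser = argparse.ArgumentParser(description="Fail the build on a vectorization gap")
    parser.add_argument("--total", type=int, required=True,
                        help="documents in the text store")
    parser.add_argument("--vectorized", type=int, required=True,
                        help="documents with embeddings")
    args = parser.parse_args(argv)

    pct = 100.0 if args.total == 0 else args.vectorized / args.total * 100
    if pct < MIN_VECTORIZED_PCT:
        print(f"FAIL: only {pct:.1f}% of documents are vectorized", file=sys.stderr)
        return 1
    print(f"OK: {pct:.1f}% of documents are vectorized")
    return 0
```

A script entry point would call `sys.exit(main())`, so that a non-zero return value fails the hook or workflow step.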
Manual Verification (Required)
After any RAG data ingestion:
python scripts/verify_rag_hygiene.py
# Must see: "Vectorization Gap: PASS (0 documents unvectorized)"
Monitoring Threshold
| Metric | Green | Yellow | Red |
|---|---|---|---|
| Vectorization % | >95% | 80-95% | <80% |
| Gap Count | 0-50 | 51-200 | >200 |
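
The table can be expressed as a small classifier, sketched below. The function names are hypothetical, and combining the two metrics by taking the worse level is an assumption; the table itself does not say how they interact. Note that these bands are looser than the 10% FAIL cutoff used in the gap check.

```python
LEVELS = ["green", "yellow", "red"]  # ordered from best to worst


def vectorization_pct_level(pct: float) -> str:
    """Green above 95%, yellow from 80% to 95%, red below 80%."""
    if pct > 95:
        return "green"
    if pct >= 80:
        return "yellow"
    return "red"


def gap_count_level(gap: int) -> str:
    """Green for 0-50 documents, yellow for 51-200, red above 200."""
    if gap <= 50:
        return "green"
    if gap <= 200:
        return "yellow"
    return "red"


def overall_level(total: int, vectorized: int) -> str:
    """Assumed combination rule: report the worse of the two levels."""
    pct = 100.0 if total == 0 else vectorized / total * 100
    levels = [vectorization_pct_level(pct), gap_count_level(total - vectorized)]
    return max(levels, key=LEVELS.index)
```

For the incident's counts (141 of 1,113 vectorized), both metrics land in the red band.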
CEO Directive
"How did you let this failure occur? You expected me to question you? Don't you have a lessons learned in your RAG and ML to prevent such knowledge gaps???"
CTO Acknowledgment: This was a CTO failure. The monitoring existed but was incomplete. I should have:
- Proactively audited RAG completeness
- Built proper vectorization gap detection
- Not assumed the system was working without verification
Files Modified
- scripts/verify_rag_hygiene.py - Added vectorization gap check
- scripts/generate_progress_dashboard.py - Added RAG vectorization visualization
- rag_knowledge/lessons_learned/ll_017_rag_vectorization_gap_dec12.md - This file
Prevention Rules
- Apply lessons learned from this incident
- Add automated checks to prevent recurrence
- Update RAG knowledge base
Related Lessons
- ll_009_ci_syntax_failure_dec11.md - Another "assumed it worked" failure
- ll_010_dead_code_and_dormant_systems_dec11.md - Systems that look active but aren't
Key Takeaway: VERIFY, DON'T ASSUME. If something can fail silently, it will.