A curated resource map of tools, frameworks, techniques, and learning materials for building Retrieval-Augmented Generation (RAG) systems. This repository catalogs the RAG ecosystem and provides links to authoritative sources, tutorials, and implementations to help you explore and build RAG applications.
Retrieval-Augmented Generation (RAG) is a sophisticated technique in Generative AI that enhances Large Language Models (LLMs) by dynamically retrieving and incorporating relevant context from external knowledge sources during the generation process. Unlike traditional LLMs that rely solely on pre-trained knowledge, RAG systems enable models to access up-to-date, domain-specific, or proprietary information, significantly improving accuracy, reducing hallucinations, and enabling real-time knowledge integration.
- Reduced Hallucinations: Grounds responses in retrieved factual information
- Domain Adaptation: Enables LLMs to work with specialized knowledge without fine-tuning
- Real-time Updates: Incorporates latest information without model retraining
- Cost Efficiency: More economical than fine-tuning for domain-specific tasks
- Transparency: Provides source attribution for generated content
- Privacy & Security: Keeps sensitive data in private knowledge bases
- ℹ️ General Information on RAG
- 🏗️ Architecture Patterns
- 🎯 Advanced Approaches
- 🧰 Frameworks that Facilitate RAG
- 🐍 Python Ecosystem for RAG
- 🛠️ Techniques
- 📊 Metrics & Evaluation
- 💾 Databases
- 🔌 Platform-Specific RAG Implementations
- 🚀 Production Considerations
- 💡 Best Practices
RAG addresses a fundamental limitation of LLMs: their static knowledge cutoff and inability to access external information. Traditional RAG implementations employ a retrieval pipeline that enriches LLM prompts with contextually relevant documents from a knowledge base. For example, when querying about renovation materials for a specific house, the LLM may have general renovation knowledge but lacks details about that particular property. An RAG system can retrieve relevant documents (e.g., blueprints, material specifications, local building codes) to provide accurate, context-aware responses.
- Complete basic RAG implementation in Python: Full-stack RAG example with LangChain and Chroma
- LangChain RAG Tutorial: Comprehensive guide to building RAG applications
- LlamaIndex RAG Tutorial: Getting started with LlamaIndex for RAG
- Haystack RAG Pipeline: Building RAG pipelines with Haystack
- Production RAG patterns and best practices: Production-ready RAG optimization strategies
- LangChain Production Guide: Deploying LangChain applications to production
- Python Async Best Practices: Writing efficient async Python code for AI applications
RAG systems can be architected using various patterns depending on requirements:
- Naive RAG: Basic retrieve-then-generate pipeline without optimization
- Advanced RAG: Incorporates query rewriting, re-ranking, and context compression
- Modular RAG: Composable components for retrieval, ranking, and generation
- Agentic RAG: LLM-driven agents that make retrieval decisions dynamically
- Self-RAG: Models that self-reflect on retrieval quality and adjust strategies
- Graph RAG: Leverages knowledge graphs for structured information retrieval
RAG implementations vary in complexity, from simple document retrieval to advanced techniques integrating iterative feedback loops, multi-agent systems, and domain-specific enhancements. Modern approaches include:
- Vision-RAG: Embeds entire pages as images, allowing vision models to handle reasoning directly without parsing text-RAG.
- Cache-Augmented Generation (CAG): Preloads relevant documents into a model’s context and stores the inference state (Key-Value (KV) cache).
- Agentic RAG: Also known as retrieval agents, can make decisions on retrieval processes.
- Corrective RAG (CRAG): Methods to correct or refine the retrieved information before integration into LLM responses.
- Retrieval-Augmented Fine-Tuning (RAFT): Techniques to fine-tune LLMs specifically for enhanced retrieval and generation tasks.
- Self Reflective RAG: Models that dynamically adjust retrieval strategies based on model performance feedback.
- RAG Fusion: Techniques combining multiple retrieval methods for improved context integration.
- Temporal Augmented Retrieval (TAR): Considering time-sensitive data in retrieval processes.
- Plan-then-RAG (PlanRAG): Strategies involving planning stages before executing RAG for complex tasks.
- GraphRAG: A structured approach using knowledge graphs for enhanced context integration and reasoning.
- FLARE - An approach that incorporates active retrieval-augmented generation to improve response quality.
- GNN-RAG: Graph neural retrieval for large language modeling reasoning.
- Multimodal RAG: Extends RAG to handle multiple modalities such as text, images, and audio.
- VideoRAG: Extends RAG to videos using Large Video Language Models (LVLMs) to retrieve and integrate visual and textual content for multimodal generation.
- REFRAG: Optimizes RAG decoding by compressing retrieved context into embeddings before generation, reducing latency while maintaining output quality.
- InstructRAG: Enhances RAG systems through instruction-based fine-tuning using self-synthesized rationales to improve retrieval and generation quality.
- Haystack: LLM orchestration framework to build customizable, production-ready LLM applications.
- LangChain: An all-purpose framework for working with LLMs.
- Semantic Kernel: An SDK from Microsoft for developing Generative AI applications.
- LlamaIndex: Framework for connecting custom data sources to LLMs.
- Dify: An open-source LLM app development platform.
- Cognita: Open-source RAG framework for building modular and production ready applications.
- Verba: Open-source application for RAG out of the box.
- Mastra: Typescript framework for building AI applications.
- Letta: Open source framework for building stateful LLM applications.
- Flowise: Drag & drop UI to build customized LLM flows.
- Swiftide: Rust framework for building modular, streaming LLM applications.
- CocoIndex: ETL framework to index data for AI, such as RAG; with realtime incremental updates.
- Pathway: Performant open-source Python ETL framework with Rust runtime, supporting 300+ data sources.
- Pathway AI Pipelines: A production-ready RAG framework supporting real-time indexing, retrieval, and change tracking across diverse data sources.
- LiteLLM: Unified interface for multiple LLM providers (OpenAI, Anthropic, Hugging Face, Replicate) with logging, monitoring, and cost tracking.
Python is the most mature ecosystem for RAG today, with extensive support for LLMs, embeddings, vector databases, evaluation, and production tooling.
See the full guide: Python Ecosystem for RAG
- Data cleaning techniques: Pre-processing steps to refine input data and improve model performance.
- Strategies
- Tagging and Labeling: Adding semantic tags or labels to retrieved data to enhance relevance.
- Chain of Thought (CoT): Encouraging the model to think through problems step by step before providing an answer.
- Chain of Verification (CoVe): Prompting the model to verify each step of its reasoning for accuracy.
- Self-Consistency: Generating multiple reasoning paths and selecting the most consistent answer.
- Zero-Shot Prompting: Designing prompts that guide the model without any examples.
- Few-Shot Prompting: Providing a few examples in the prompt to demonstrate the desired response format.
- Reason & Act (ReAct) prompting: Combines reasoning (e.g. CoT) with acting (e.g. tool calling).
- Caching
- Prompt Caching: Optimizes LLMs by storing and reusing precomputed attention states.
- Structuring
- Token-Oriented Object Notation: A compact, deterministic JSON format for LLM prompts.
Chunking strategy is one of the most critical decisions in RAG system design, directly impacting retrieval precision and context quality. The optimal approach depends on document types, domain characteristics, and query patterns.
-
- Use Case: Simple, uniform documents where structure is less important
- Characteristics: Divides text into consistent-sized segments (typically 256-512 tokens) with configurable overlap (10-20%)
- Pros: Simple to implement, predictable chunk sizes, efficient processing
- Cons: May split sentences/paragraphs, loses document structure, can fragment semantic units
- Implementation: CharacterTextSplitter (LangChain), SentenceSplitter (LlamaIndex)
-
- Use Case: Documents with hierarchical structure (markdown, HTML, code)
- Characteristics: Recursively splits by separators (paragraphs → sentences → words) until desired chunk size
- Pros: Preserves natural boundaries, respects document hierarchy, better semantic coherence
- Cons: More complex, variable chunk sizes, requires careful separator configuration
- Implementation: RecursiveCharacterTextSplitter (LangChain)
-
- Use Case: Structured documents with clear sections (markdown headers, PDF sections, database records)
- Characteristics: Segments based on document metadata, formatting cues, or structural elements
- Pros: Maintains document structure, preserves context, enables metadata-rich retrieval
- Cons: Requires structured input, may create very large or very small chunks
- Implementation: MarkdownHeaderTextSplitter (LangChain)
- Multimodal: Handle images and text with models like OpenCLIP
-
- Use Case: Documents where semantic coherence is critical (narratives, technical documentation)
- Characteristics: Uses embedding similarity to identify natural semantic boundaries
- Pros: Preserves semantic units, adapts to content, improves retrieval relevance
- Cons: Computationally expensive, requires embedding model, less predictable chunk sizes
- Best For: High-quality retrieval where context preservation is paramount
-
- Use Case: Complex documents requiring intelligent segmentation decisions
- Characteristics: Uses LLMs to analyze content and determine optimal chunk boundaries
- Pros: Highly adaptive, understands context, can apply domain knowledge
- Cons: High cost, slower processing, requires LLM API access
- Best For: Specialized domains where standard chunking fails
Chunking Best Practices:
- Overlap Strategy: Use 10-20% overlap to maintain context across boundaries
- Size Optimization: Balance chunk size (larger = more context, smaller = better precision)
- Metadata Preservation: Retain document structure, headers, and formatting in chunk metadata
- Multi-Granularity: Consider hierarchical approaches (small chunks for retrieval, larger for context)
Embeddings are the foundation of semantic search in RAG systems. The choice of embedding model significantly impacts retrieval quality.
-
Model Selection
- MTEB Leaderboard: Comprehensive benchmark for evaluating embedding models across multiple tasks and languages. Consider models that perform well on tasks relevant to your use case (retrieval, clustering, classification).
- Model Characteristics: Evaluate models based on:
- Dimensions: Higher dimensions (768-1024) generally offer better quality but increase storage and compute costs
- Context Length: Ensure models support your document chunk sizes
- Multilingual Support: Required for international applications
- Domain Specialization: General-purpose vs. domain-specific (e.g., scientific, legal, medical)
-
Custom Embeddings
- Fine-tuning: Adapt pre-trained models to your domain using contrastive learning, triplet loss, or supervised fine-tuning
- Training from Scratch: For highly specialized domains with sufficient labeled data
- Multi-Modal Embeddings: For applications requiring text, image, or audio understanding (e.g., CLIP, ImageBind)
- Ensemble Methods: Combine multiple embedding models for improved robustness
- Search Methods
- Vector Store Flat Index
- Simple and efficient form of retrieval.
- Content is vectorized and stored as flat content vectors.
- Hierarchical Index Retrieval
- Hierarchically narrow data to different levels.
- Executes retrievals by hierarchical order.
- Hypothetical Questions
- Used to increase similarity between database chunks and queries (same with HyDE).
- LLM is used to generate specific questions for each text chunk.
- Converts these questions into vector embeddings.
- During search, matches queries against this index of question vectors.
- Hypothetical Document Embeddings (HyDE)
- Used to increase similarity between database chunks and queries (same with Hypothetical Questions).
- LLM is used to generate a hypothetical response based on the query.
- Converts this response into a vector embedding.
- Compares the query vector with the hypothetical response vector.
- Small to Big Retrieval
- Improves retrieval by using smaller chunks for search and larger chunks for context.
- Smaller child chunks refers to bigger parent chunks
- Contextual Retrieval
- Enhances RAG retrieval accuracy by preserving document context that is typically lost during chunking.
- Each text chunk is enriched with a short, model-generated summary before embedding and indexing, resulting in Contextual Embeddings and Contextual BM25.
- This combined approach improves both semantic and lexical matching, reducing retrieval failure rates when paired with reranking.
- Adaptive Retrieval
- Dynamically decide when and how much to retrieve during generation.
- Query Reformulation and Expansion
- Automatically rewrites or expands the query before retrieval to boost recall.
- Useful for long or ambiguous user queries.
- Vector Store Flat Index
- Re-ranking: Enhances search results in RAG pipelines by reordering initially retrieved documents, prioritizing those most semantically relevant to the query.
Ensuring high-quality, safe, and reliable responses is critical for production RAG systems.
-
Hallucination Mitigation
- Detection Techniques: Implement methods to identify when models generate unsupported information
- Grounding Verification: Cross-reference generated claims with retrieved context
- Confidence Scoring: Assign confidence scores to generated responses based on source quality
- Source Attribution: Require citations for all factual claims
- Retrieval Quality: Improve retrieval precision to reduce hallucination risk
-
Guardrails & Safety
- Implementation Guide: Comprehensive approach to implementing safety mechanisms
- Content Moderation: Filter harmful, biased, or inappropriate content at input and output stages
- Bias Mitigation: Detect and mitigate biases in retrieved content and generated responses
- Fact-Checking: Verify claims against authoritative sources or knowledge bases
- Toxicity Detection: Use classifiers to identify and filter toxic content
-
Prompt Injection Prevention
- Security Guide: Understanding and preventing prompt injection attacks
- Input Validation: Rigorously validate and sanitize all external inputs using whitelisting, length limits, and pattern matching
- Content Separation: Use clear delimiters, templating systems, and role-based prompts to separate instructions from user data
- Output Monitoring: Continuously monitor responses for anomalies, unexpected behaviors, or security violations
- Rate Limiting: Implement rate limits and abuse detection to prevent systematic attacks
- Sandboxing: Isolate LLM execution environments to limit potential damage from successful injections
These metrics are used to measure the similarity between embeddings, which is crucial for evaluating how effectively RAG systems retrieve and integrate external documents or data sources. By selecting appropriate similarity metrics, you can optimize the performance and accuracy of your RAG system. Alternatively, you may develop custom metrics tailored to your specific domain or niche to capture domain-specific nuances and improve relevance.
-
- Measures the cosine of the angle between two vectors in a multi-dimensional space.
- Highly effective for comparing text embeddings where the direction of the vectors represents semantic information.
- Commonly used in RAG systems to measure semantic similarity between query embeddings and document embeddings.
-
- Calculates the sum of the products of corresponding entries of two sequences of numbers.
- Equivalent to cosine similarity when vectors are normalized.
- Simple and efficient, often used with hardware acceleration for large-scale computations.
-
- Computes the straight-line distance between two points in Euclidean space.
- Can be used with embeddings but may lose effectiveness in high-dimensional spaces due to the "curse of dimensionality."
- Often used in clustering algorithms like K-means after dimensionality reduction.
-
- Measures the similarity between two finite sets as the size of the intersection divided by the size of the union of the sets.
- Useful when comparing sets of tokens, such as in bag-of-words models or n-gram comparisons.
- Less applicable to continuous embeddings produced by LLMs.
Note: Cosine Similarity and Dot Product are generally seen as the most effective metrics for measuring similarity between high-dimensional embeddings.
Response evaluation in RAG solutions involves assessing the quality of language model outputs using diverse metrics. Here are structured approaches to evaluating these responses:
-
Automated Benchmarking
- BLEU: Evaluates the overlap of n-grams between machine-generated and reference outputs, providing insight into precision.
- ROUGE: Measures recall by comparing n-grams, skip-bigrams, or longest common subsequence with reference outputs.
- METEOR: Focuses on exact matches, stemming, synonyms, and alignment for machine translation.
-
Human Evaluation Involves human judges assessing responses for:
- Relevance: Alignment with user queries.
- Fluency: Grammatical and stylistic quality.
- Factual Accuracy: Verifying claims against authoritative sources.
- Coherence: Logical consistency within responses.
-
Model Evaluation Leverages pre-trained evaluators to benchmark outputs against diverse criteria:
- TuringBench: Offers comprehensive evaluations across language benchmarks.
- Hugging Face Evaluate: Calculates alignment with human preferences.
-
Key Dimensions for Evaluation
- Groundedness: Assesses if responses are based entirely on provided context. Low groundedness may indicate reliance on hallucinated or irrelevant information.
- Completeness: Measures if the response answers all aspects of a query.
- Approaches: AI-assisted retrieval scoring and prompt-based intent verification.
- Utilization: Evaluates the extent to which retrieved data contributes to the response.
- Analysis: Use LLMs to check the inclusion of retrieved chunks in responses.
These tools can assist in evaluating the performance of your RAG system, from tracking user feedback to logging query interactions and comparing multiple evaluation metrics over time.
- LangFuse: Open-source tool for tracking LLM metrics, observability, and prompt management.
- Ragas: Framework that helps evaluate RAG pipelines.
- LangSmith: A platform for building production-grade LLM applications, allows you to closely monitor and evaluate your application.
- Hugging Face Evaluate: Tool for computing metrics like BLEU and ROUGE to assess text quality.
- Weights & Biases: Tracks experiments, logs metrics, and visualizes performance.
Vector databases are critical components of RAG systems, providing efficient storage and similarity search capabilities for embeddings. The selection of an appropriate database depends on factors such as scale, latency requirements, deployment model (cloud vs. on-premises), and feature needs (hybrid search, filtering, etc.). The list below features database systems suitable for RAG applications:
- Apache Cassandra: Distributed NoSQL database management system.
- MongoDB Atlas: Globally distributed, multi-model database service with integrated vector search.
- Vespa: Open-source big data processing and serving engine designed for real-time applications.
- Elasticsearch: Provides vector search capabilities along with traditional search functionalities.
- OpenSearch: Distributed search and analytics engine, forked from Elasticsearch.
- Chroma DB: An AI-native open-source embedding database.
- Milvus: An open-source vector database for AI-powered applications.
- Pinecone: A serverless vector database, optimized for machine learning workflows.
- Oracle AI Vector Search: Integrates vector search capabilities within Oracle Database for semantic querying based on vector embeddings.
- Pgvector: An open-source extension for vector similarity search in PostgreSQL.
- Azure Cosmos DB: Globally distributed, multi-model database service with integrated vector search.
- Couchbase: A distributed NoSQL cloud database.
- Lantern: A privacy-aware personal search engine.
- LlamaIndex: Employs a straightforward in-memory vector store for rapid experimentation.
- Neo4j: Graph database management system.
- Qdrant: An open-source vector database designed for similarity search.
- Redis Stack: An in-memory data structure store used as a database, cache, and message broker.
- SurrealDB: A scalable multi-model database optimized for time-series data.
- Weaviate: A open-source cloud-native vector search engine.
- FAISS: A library for efficient similarity search and clustering of dense vectors, designed to handle large-scale datasets and optimized for fast retrieval of nearest neighbors.
Building production-grade RAG systems requires addressing several critical aspects beyond the core retrieval and generation pipeline:
- Indexing Throughput: Design pipelines to handle high-volume document ingestion with incremental updates
- Query Latency: Optimize retrieval speed through efficient indexing (HNSW, IVF), caching strategies, and parallel processing
- Concurrent Requests: Implement connection pooling, request queuing, and load balancing for high-traffic scenarios
- Resource Management: Monitor GPU/CPU utilization, memory consumption, and database connection pools
- Observability: Implement comprehensive logging, tracing, and metrics collection (latency, throughput, error rates)
- Health Checks: Monitor embedding service availability, vector database connectivity, and LLM API status
- Error Handling: Implement retry logic, circuit breakers, and graceful degradation strategies
- A/B Testing: Compare different retrieval strategies, chunking methods, and prompt templates
- Incremental Updates: Support real-time or near-real-time document indexing without full re-indexing
- Version Control: Track document versions, embedding model versions, and prompt templates
- Data Quality: Implement validation pipelines to detect corrupted embeddings, missing metadata, or stale content
- Backup & Recovery: Regular backups of vector indexes and metadata stores
- Access Control: Implement authentication, authorization, and audit logging
- Data Privacy: Encrypt data at rest and in transit, support data residency requirements
- Content Filtering: Apply content moderation, PII detection, and compliance checks
- Rate Limiting: Protect against abuse and ensure fair resource allocation
- Embedding Caching: Cache frequently accessed embeddings to reduce API costs
- Selective Retrieval: Use query routing to avoid unnecessary retrieval operations
- Model Selection: Balance cost and performance when choosing embedding and LLM models
- Resource Right-sizing: Optimize infrastructure based on actual usage patterns
For detailed implementation guides for specific platforms, see the documentation:
- Supabase Integration Guide: Building RAG systems with Supabase, pgvector, and Edge Functions
- Domain-Aware Chunking: Use semantic or document-structure-based chunking over fixed-size for better context preservation
- Overlap Management: Include strategic overlap (10-20%) to maintain context across boundaries
- Metadata Preservation: Retain document structure, headers, and formatting cues in chunk metadata
- Multi-Granularity: Consider hierarchical chunking (small chunks for retrieval, larger chunks for context)
- Model Evaluation: Use MTEB leaderboard and domain-specific benchmarks to select appropriate models
- Dimension Optimization: Balance embedding dimensions (higher = better quality, lower = faster retrieval)
- Domain Fine-tuning: Fine-tune embeddings on domain-specific data when possible
- Consistency: Ensure the same embedding model is used for indexing and querying
- Hybrid Search: Combine semantic (vector) and lexical (BM25/keyword) search for improved recall
- Re-ranking: Apply cross-encoders or learned-to-rank models to improve precision
- Query Understanding: Implement query classification, intent detection, and query expansion
- Result Diversification: Avoid redundant results by implementing diversity constraints
- Clear Instructions: Provide explicit instructions on how to use retrieved context
- Source Attribution: Request citations and require grounding in provided context
- Few-Shot Examples: Include examples demonstrating desired response format and quality
- Context Compression: Use techniques like summarization or extraction when context exceeds limits
- Multi-Dimensional Metrics: Evaluate relevance, accuracy, completeness, and groundedness
- Human-in-the-Loop: Incorporate human feedback for continuous improvement
- Synthetic Evaluation: Generate test queries and expected outputs for automated testing
- Production Monitoring: Track user satisfaction, query patterns, and failure modes
- Feedback Loops: Collect user feedback, query logs, and performance metrics
- Experimentation: Systematically test improvements (chunking, retrieval, prompts) with controlled experiments
- Model Updates: Plan for embedding model upgrades and migration strategies
- Documentation: Maintain clear documentation of architecture, decisions, and operational procedures
This is a community-driven resource and continues to evolve. Contributions are welcome! If you'd like to add resources, fix errors, or improve organization:
- Fork the repository
- Create a branch for your changes
- Submit a pull request with a clear description
For new entries, ensure links are working, descriptions are accurate and concise, and content fits the appropriate section.
This project is licensed under the CC0 1.0 Universal.