😎 Awesome Retrieval Augmented Generation (RAG)

A curated resource map of tools, frameworks, techniques, and learning materials for building Retrieval-Augmented Generation (RAG) systems. This repository catalogs the RAG ecosystem and provides links to authoritative sources, tutorials, and implementations to help you explore and build RAG applications.

Overview

Retrieval-Augmented Generation (RAG) is a sophisticated technique in Generative AI that enhances Large Language Models (LLMs) by dynamically retrieving and incorporating relevant context from external knowledge sources during the generation process. Unlike traditional LLMs that rely solely on pre-trained knowledge, RAG systems enable models to access up-to-date, domain-specific, or proprietary information, significantly improving accuracy, reducing hallucinations, and enabling real-time knowledge integration.

Key Benefits

Reduced Hallucinations: Grounds responses in retrieved factual information
Domain Adaptation: Enables LLMs to work with specialized knowledge without fine-tuning
Real-time Updates: Incorporates latest information without model retraining
Cost Efficiency: More economical than fine-tuning for domain-specific tasks
Transparency: Provides source attribution for generated content
Privacy & Security: Keeps sensitive data in private knowledge bases

Content

ℹ️ General Information on RAG

RAG addresses a fundamental limitation of LLMs: their static knowledge cutoff and inability to access external information. Traditional RAG implementations employ a retrieval pipeline that enriches LLM prompts with contextually relevant documents from a knowledge base. For example, when querying about renovation materials for a specific house, the LLM may have general renovation knowledge but lacks details about that particular property. An RAG system can retrieve relevant documents (e.g., blueprints, material specifications, local building codes) to provide accurate, context-aware responses.

Implementation Resources

Python Tutorials & Examples

Complete basic RAG implementation in Python: Full-stack RAG example with LangChain and Chroma
LangChain RAG Tutorial: Comprehensive guide to building RAG applications
LlamaIndex RAG Tutorial: Getting started with LlamaIndex for RAG
Haystack RAG Pipeline: Building RAG pipelines with Haystack

Production & Best Practices

Production RAG patterns and best practices: Production-ready RAG optimization strategies
LangChain Production Guide: Deploying LangChain applications to production
Python Async Best Practices: Writing efficient async Python code for AI applications

🏗️ Architecture Patterns

RAG systems can be architected using various patterns depending on requirements:

Naive RAG: Basic retrieve-then-generate pipeline without optimization
Advanced RAG: Incorporates query rewriting, re-ranking, and context compression
Modular RAG: Composable components for retrieval, ranking, and generation
Agentic RAG: LLM-driven agents that make retrieval decisions dynamically
Self-RAG: Models that self-reflect on retrieval quality and adjust strategies
Graph RAG: Leverages knowledge graphs for structured information retrieval

🎯 Advanced Approaches

RAG implementations vary in complexity, from simple document retrieval to advanced techniques integrating iterative feedback loops, multi-agent systems, and domain-specific enhancements. Modern approaches include:

Vision-RAG: Embeds entire pages as images, allowing vision models to handle reasoning directly without parsing text-RAG.
Cache-Augmented Generation (CAG): Preloads relevant documents into a model’s context and stores the inference state (Key-Value (KV) cache).
Agentic RAG: Also known as retrieval agents, can make decisions on retrieval processes.
Corrective RAG (CRAG): Methods to correct or refine the retrieved information before integration into LLM responses.
Retrieval-Augmented Fine-Tuning (RAFT): Techniques to fine-tune LLMs specifically for enhanced retrieval and generation tasks.
Self Reflective RAG: Models that dynamically adjust retrieval strategies based on model performance feedback.
RAG Fusion: Techniques combining multiple retrieval methods for improved context integration.
Temporal Augmented Retrieval (TAR): Considering time-sensitive data in retrieval processes.
Plan-then-RAG (PlanRAG): Strategies involving planning stages before executing RAG for complex tasks.
GraphRAG: A structured approach using knowledge graphs for enhanced context integration and reasoning.
FLARE - An approach that incorporates active retrieval-augmented generation to improve response quality.
GNN-RAG: Graph neural retrieval for large language modeling reasoning.
Multimodal RAG: Extends RAG to handle multiple modalities such as text, images, and audio.
VideoRAG: Extends RAG to videos using Large Video Language Models (LVLMs) to retrieve and integrate visual and textual content for multimodal generation.
REFRAG: Optimizes RAG decoding by compressing retrieved context into embeddings before generation, reducing latency while maintaining output quality.
InstructRAG: Enhances RAG systems through instruction-based fine-tuning using self-synthesized rationales to improve retrieval and generation quality.

🧰 Frameworks that Facilitate RAG

Haystack: LLM orchestration framework to build customizable, production-ready LLM applications.
LangChain: An all-purpose framework for working with LLMs.
Semantic Kernel: An SDK from Microsoft for developing Generative AI applications.
LlamaIndex: Framework for connecting custom data sources to LLMs.
Dify: An open-source LLM app development platform.
Cognita: Open-source RAG framework for building modular and production ready applications.
Verba: Open-source application for RAG out of the box.
Mastra: Typescript framework for building AI applications.
Letta: Open source framework for building stateful LLM applications.
Flowise: Drag & drop UI to build customized LLM flows.
Swiftide: Rust framework for building modular, streaming LLM applications.
CocoIndex: ETL framework to index data for AI, such as RAG; with realtime incremental updates.
Pathway: Performant open-source Python ETL framework with Rust runtime, supporting 300+ data sources.
Pathway AI Pipelines: A production-ready RAG framework supporting real-time indexing, retrieval, and change tracking across diverse data sources.
LiteLLM: Unified interface for multiple LLM providers (OpenAI, Anthropic, Hugging Face, Replicate) with logging, monitoring, and cost tracking.

🐍 Python Ecosystem for RAG

Python is the most mature ecosystem for RAG today, with extensive support for LLMs, embeddings, vector databases, evaluation, and production tooling.

See the full guide: Python Ecosystem for RAG

🛠️ Techniques

Data cleaning

Data cleaning techniques: Pre-processing steps to refine input data and improve model performance.

Prompting

Strategies
- Tagging and Labeling: Adding semantic tags or labels to retrieved data to enhance relevance.
- Chain of Thought (CoT): Encouraging the model to think through problems step by step before providing an answer.
- Chain of Verification (CoVe): Prompting the model to verify each step of its reasoning for accuracy.
- Self-Consistency: Generating multiple reasoning paths and selecting the most consistent answer.
- Zero-Shot Prompting: Designing prompts that guide the model without any examples.
- Few-Shot Prompting: Providing a few examples in the prompt to demonstrate the desired response format.
- Reason & Act (ReAct) prompting: Combines reasoning (e.g. CoT) with acting (e.g. tool calling).
Caching
- Prompt Caching: Optimizes LLMs by storing and reusing precomputed attention states.
Structuring
- Token-Oriented Object Notation: A compact, deterministic JSON format for LLM prompts.

Chunking

Chunking strategy is one of the most critical decisions in RAG system design, directly impacting retrieval precision and context quality. The optimal approach depends on document types, domain characteristics, and query patterns.

Fixed-Size Chunking
- Use Case: Simple, uniform documents where structure is less important
- Characteristics: Divides text into consistent-sized segments (typically 256-512 tokens) with configurable overlap (10-20%)
- Pros: Simple to implement, predictable chunk sizes, efficient processing
- Cons: May split sentences/paragraphs, loses document structure, can fragment semantic units
- Implementation: CharacterTextSplitter (LangChain), SentenceSplitter (LlamaIndex)
Recursive Chunking
- Use Case: Documents with hierarchical structure (markdown, HTML, code)
- Characteristics: Recursively splits by separators (paragraphs → sentences → words) until desired chunk size
- Pros: Preserves natural boundaries, respects document hierarchy, better semantic coherence
- Cons: More complex, variable chunk sizes, requires careful separator configuration
- Implementation: RecursiveCharacterTextSplitter (LangChain)
Document-Based Chunking
- Use Case: Structured documents with clear sections (markdown headers, PDF sections, database records)
- Characteristics: Segments based on document metadata, formatting cues, or structural elements
- Pros: Maintains document structure, preserves context, enables metadata-rich retrieval
- Cons: Requires structured input, may create very large or very small chunks
- Implementation: MarkdownHeaderTextSplitter (LangChain)
- Multimodal: Handle images and text with models like OpenCLIP
Semantic Chunking
- Use Case: Documents where semantic coherence is critical (narratives, technical documentation)
- Characteristics: Uses embedding similarity to identify natural semantic boundaries
- Pros: Preserves semantic units, adapts to content, improves retrieval relevance
- Cons: Computationally expensive, requires embedding model, less predictable chunk sizes
- Best For: High-quality retrieval where context preservation is paramount
Agentic Chunking
- Use Case: Complex documents requiring intelligent segmentation decisions
- Characteristics: Uses LLMs to analyze content and determine optimal chunk boundaries
- Pros: Highly adaptive, understands context, can apply domain knowledge
- Cons: High cost, slower processing, requires LLM API access
- Best For: Specialized domains where standard chunking fails

Chunking Best Practices:

Overlap Strategy: Use 10-20% overlap to maintain context across boundaries
Size Optimization: Balance chunk size (larger = more context, smaller = better precision)
Metadata Preservation: Retain document structure, headers, and formatting in chunk metadata
Multi-Granularity: Consider hierarchical approaches (small chunks for retrieval, larger for context)

Embeddings

Embeddings are the foundation of semantic search in RAG systems. The choice of embedding model significantly impacts retrieval quality.

Model Selection
- MTEB Leaderboard: Comprehensive benchmark for evaluating embedding models across multiple tasks and languages. Consider models that perform well on tasks relevant to your use case (retrieval, clustering, classification).
- Model Characteristics: Evaluate models based on:
  - Dimensions: Higher dimensions (768-1024) generally offer better quality but increase storage and compute costs
  - Context Length: Ensure models support your document chunk sizes
  - Multilingual Support: Required for international applications
  - Domain Specialization: General-purpose vs. domain-specific (e.g., scientific, legal, medical)
Custom Embeddings
- Fine-tuning: Adapt pre-trained models to your domain using contrastive learning, triplet loss, or supervised fine-tuning
- Training from Scratch: For highly specialized domains with sufficient labeled data
- Multi-Modal Embeddings: For applications requiring text, image, or audio understanding (e.g., CLIP, ImageBind)
- Ensemble Methods: Combine multiple embedding models for improved robustness

Retrieval

Search Methods
- Vector Store Flat Index
  - Simple and efficient form of retrieval.
  - Content is vectorized and stored as flat content vectors.
- Hierarchical Index Retrieval
  - Hierarchically narrow data to different levels.
  - Executes retrievals by hierarchical order.
- Hypothetical Questions
  - Used to increase similarity between database chunks and queries (same with HyDE).
  - LLM is used to generate specific questions for each text chunk.
  - Converts these questions into vector embeddings.
  - During search, matches queries against this index of question vectors.
- Hypothetical Document Embeddings (HyDE)
  - Used to increase similarity between database chunks and queries (same with Hypothetical Questions).
  - LLM is used to generate a hypothetical response based on the query.
  - Converts this response into a vector embedding.
  - Compares the query vector with the hypothetical response vector.
- Small to Big Retrieval
  - Improves retrieval by using smaller chunks for search and larger chunks for context.
  - Smaller child chunks refers to bigger parent chunks
- Contextual Retrieval
  - Enhances RAG retrieval accuracy by preserving document context that is typically lost during chunking.
  - Each text chunk is enriched with a short, model-generated summary before embedding and indexing, resulting in Contextual Embeddings and Contextual BM25.
  - This combined approach improves both semantic and lexical matching, reducing retrieval failure rates when paired with reranking.
- Adaptive Retrieval
  - Dynamically decide when and how much to retrieve during generation.
- Query Reformulation and Expansion
  - Automatically rewrites or expands the query before retrieval to boost recall.
  - Useful for long or ambiguous user queries.
Re-ranking: Enhances search results in RAG pipelines by reordering initially retrieved documents, prioritizing those most semantically relevant to the query.

Response Quality & Safety

Ensuring high-quality, safe, and reliable responses is critical for production RAG systems.

Hallucination Mitigation
- Detection Techniques: Implement methods to identify when models generate unsupported information
- Grounding Verification: Cross-reference generated claims with retrieved context
- Confidence Scoring: Assign confidence scores to generated responses based on source quality
- Source Attribution: Require citations for all factual claims
- Retrieval Quality: Improve retrieval precision to reduce hallucination risk
Guardrails & Safety
- Implementation Guide: Comprehensive approach to implementing safety mechanisms
- Content Moderation: Filter harmful, biased, or inappropriate content at input and output stages
- Bias Mitigation: Detect and mitigate biases in retrieved content and generated responses
- Fact-Checking: Verify claims against authoritative sources or knowledge bases
- Toxicity Detection: Use classifiers to identify and filter toxic content
Prompt Injection Prevention
- Security Guide: Understanding and preventing prompt injection attacks
- Input Validation: Rigorously validate and sanitize all external inputs using whitelisting, length limits, and pattern matching
- Content Separation: Use clear delimiters, templating systems, and role-based prompts to separate instructions from user data
- Output Monitoring: Continuously monitor responses for anomalies, unexpected behaviors, or security violations
- Rate Limiting: Implement rate limits and abuse detection to prevent systematic attacks
- Sandboxing: Isolate LLM execution environments to limit potential damage from successful injections

📊 Metrics & Evaluation

Similarity Metrics for Embeddings

These metrics are used to measure the similarity between embeddings, which is crucial for evaluating how effectively RAG systems retrieve and integrate external documents or data sources. By selecting appropriate similarity metrics, you can optimize the performance and accuracy of your RAG system. Alternatively, you may develop custom metrics tailored to your specific domain or niche to capture domain-specific nuances and improve relevance.

Cosine Similarity
- Measures the cosine of the angle between two vectors in a multi-dimensional space.
- Highly effective for comparing text embeddings where the direction of the vectors represents semantic information.
- Commonly used in RAG systems to measure semantic similarity between query embeddings and document embeddings.
Dot Product
- Calculates the sum of the products of corresponding entries of two sequences of numbers.
- Equivalent to cosine similarity when vectors are normalized.
- Simple and efficient, often used with hardware acceleration for large-scale computations.
Euclidean Distance
- Computes the straight-line distance between two points in Euclidean space.
- Can be used with embeddings but may lose effectiveness in high-dimensional spaces due to the "curse of dimensionality."
- Often used in clustering algorithms like K-means after dimensionality reduction.
Jaccard Similarity
- Measures the similarity between two finite sets as the size of the intersection divided by the size of the union of the sets.
- Useful when comparing sets of tokens, such as in bag-of-words models or n-gram comparisons.
- Less applicable to continuous embeddings produced by LLMs.

Note: Cosine Similarity and Dot Product are generally seen as the most effective metrics for measuring similarity between high-dimensional embeddings.

Response Evaluation Metrics

Response evaluation in RAG solutions involves assessing the quality of language model outputs using diverse metrics. Here are structured approaches to evaluating these responses:

Automated Benchmarking
- BLEU: Evaluates the overlap of n-grams between machine-generated and reference outputs, providing insight into precision.
- ROUGE: Measures recall by comparing n-grams, skip-bigrams, or longest common subsequence with reference outputs.
- METEOR: Focuses on exact matches, stemming, synonyms, and alignment for machine translation.
Human Evaluation Involves human judges assessing responses for:
- Relevance: Alignment with user queries.
- Fluency: Grammatical and stylistic quality.
- Factual Accuracy: Verifying claims against authoritative sources.
- Coherence: Logical consistency within responses.
Model Evaluation Leverages pre-trained evaluators to benchmark outputs against diverse criteria:
- TuringBench: Offers comprehensive evaluations across language benchmarks.
- Hugging Face Evaluate: Calculates alignment with human preferences.
Key Dimensions for Evaluation
- Groundedness: Assesses if responses are based entirely on provided context. Low groundedness may indicate reliance on hallucinated or irrelevant information.
- Completeness: Measures if the response answers all aspects of a query.
- Approaches: AI-assisted retrieval scoring and prompt-based intent verification.
- Utilization: Evaluates the extent to which retrieved data contributes to the response.
- Analysis: Use LLMs to check the inclusion of retrieved chunks in responses.

Tools

These tools can assist in evaluating the performance of your RAG system, from tracking user feedback to logging query interactions and comparing multiple evaluation metrics over time.

LangFuse: Open-source tool for tracking LLM metrics, observability, and prompt management.
Ragas: Framework that helps evaluate RAG pipelines.
LangSmith: A platform for building production-grade LLM applications, allows you to closely monitor and evaluate your application.
Hugging Face Evaluate: Tool for computing metrics like BLEU and ROUGE to assess text quality.
Weights & Biases: Tracks experiments, logs metrics, and visualizes performance.

💾 Databases

Vector databases are critical components of RAG systems, providing efficient storage and similarity search capabilities for embeddings. The selection of an appropriate database depends on factors such as scale, latency requirements, deployment model (cloud vs. on-premises), and feature needs (hybrid search, filtering, etc.). The list below features database systems suitable for RAG applications:

Benchmarks

Picking a vector database

Distributed Data Processing and Serving Engines:

Apache Cassandra: Distributed NoSQL database management system.
MongoDB Atlas: Globally distributed, multi-model database service with integrated vector search.
Vespa: Open-source big data processing and serving engine designed for real-time applications.

Search Engines with Vector Capabilities:

Elasticsearch: Provides vector search capabilities along with traditional search functionalities.
OpenSearch: Distributed search and analytics engine, forked from Elasticsearch.

Vector Databases:

Chroma DB: An AI-native open-source embedding database.
Milvus: An open-source vector database for AI-powered applications.
Pinecone: A serverless vector database, optimized for machine learning workflows.
Oracle AI Vector Search: Integrates vector search capabilities within Oracle Database for semantic querying based on vector embeddings.

Relational Database Extensions:

Pgvector: An open-source extension for vector similarity search in PostgreSQL.

Other Database Systems:

Azure Cosmos DB: Globally distributed, multi-model database service with integrated vector search.
Couchbase: A distributed NoSQL cloud database.
Lantern: A privacy-aware personal search engine.
LlamaIndex: Employs a straightforward in-memory vector store for rapid experimentation.
Neo4j: Graph database management system.
Qdrant: An open-source vector database designed for similarity search.
Redis Stack: An in-memory data structure store used as a database, cache, and message broker.
SurrealDB: A scalable multi-model database optimized for time-series data.
Weaviate: A open-source cloud-native vector search engine.

Vector Search Libraries and Tools:

FAISS: A library for efficient similarity search and clustering of dense vectors, designed to handle large-scale datasets and optimized for fast retrieval of nearest neighbors.

🚀 Production Considerations

Building production-grade RAG systems requires addressing several critical aspects beyond the core retrieval and generation pipeline:

Scalability & Performance

Indexing Throughput: Design pipelines to handle high-volume document ingestion with incremental updates
Query Latency: Optimize retrieval speed through efficient indexing (HNSW, IVF), caching strategies, and parallel processing
Concurrent Requests: Implement connection pooling, request queuing, and load balancing for high-traffic scenarios
Resource Management: Monitor GPU/CPU utilization, memory consumption, and database connection pools

Reliability & Monitoring

Observability: Implement comprehensive logging, tracing, and metrics collection (latency, throughput, error rates)
Health Checks: Monitor embedding service availability, vector database connectivity, and LLM API status
Error Handling: Implement retry logic, circuit breakers, and graceful degradation strategies
A/B Testing: Compare different retrieval strategies, chunking methods, and prompt templates

Data Management

Incremental Updates: Support real-time or near-real-time document indexing without full re-indexing
Version Control: Track document versions, embedding model versions, and prompt templates
Data Quality: Implement validation pipelines to detect corrupted embeddings, missing metadata, or stale content
Backup & Recovery: Regular backups of vector indexes and metadata stores

Security & Compliance

Access Control: Implement authentication, authorization, and audit logging
Data Privacy: Encrypt data at rest and in transit, support data residency requirements
Content Filtering: Apply content moderation, PII detection, and compliance checks
Rate Limiting: Protect against abuse and ensure fair resource allocation

Cost Optimization

Embedding Caching: Cache frequently accessed embeddings to reduce API costs
Selective Retrieval: Use query routing to avoid unnecessary retrieval operations
Model Selection: Balance cost and performance when choosing embedding and LLM models
Resource Right-sizing: Optimize infrastructure based on actual usage patterns

🔌 Platform-Specific RAG Implementations

For detailed implementation guides for specific platforms, see the documentation:

Supabase Integration Guide: Building RAG systems with Supabase, pgvector, and Edge Functions

💡 Best Practices

Chunking Strategy

Domain-Aware Chunking: Use semantic or document-structure-based chunking over fixed-size for better context preservation
Overlap Management: Include strategic overlap (10-20%) to maintain context across boundaries
Metadata Preservation: Retain document structure, headers, and formatting cues in chunk metadata
Multi-Granularity: Consider hierarchical chunking (small chunks for retrieval, larger chunks for context)

Embedding Selection

Model Evaluation: Use MTEB leaderboard and domain-specific benchmarks to select appropriate models
Dimension Optimization: Balance embedding dimensions (higher = better quality, lower = faster retrieval)
Domain Fine-tuning: Fine-tune embeddings on domain-specific data when possible
Consistency: Ensure the same embedding model is used for indexing and querying

Retrieval Optimization

Hybrid Search: Combine semantic (vector) and lexical (BM25/keyword) search for improved recall
Re-ranking: Apply cross-encoders or learned-to-rank models to improve precision
Query Understanding: Implement query classification, intent detection, and query expansion
Result Diversification: Avoid redundant results by implementing diversity constraints

Prompt Engineering

Clear Instructions: Provide explicit instructions on how to use retrieved context
Source Attribution: Request citations and require grounding in provided context
Few-Shot Examples: Include examples demonstrating desired response format and quality
Context Compression: Use techniques like summarization or extraction when context exceeds limits

Evaluation Framework

Multi-Dimensional Metrics: Evaluate relevance, accuracy, completeness, and groundedness
Human-in-the-Loop: Incorporate human feedback for continuous improvement
Synthetic Evaluation: Generate test queries and expected outputs for automated testing
Production Monitoring: Track user satisfaction, query patterns, and failure modes

Iterative Improvement

Feedback Loops: Collect user feedback, query logs, and performance metrics
Experimentation: Systematically test improvements (chunking, retrieval, prompts) with controlled experiments
Model Updates: Plan for embedding model upgrades and migration strategies
Documentation: Maintain clear documentation of architecture, decisions, and operational procedures

Contributing

This is a community-driven resource and continues to evolve. Contributions are welcome! If you'd like to add resources, fix errors, or improve organization:

Fork the repository
Create a branch for your changes
Submit a pull request with a clear description

For new entries, ensure links are working, descriptions are accurate and concise, and content fits the appropriate section.

License

This project is licensed under the CC0 1.0 Universal.

Name		Name	Last commit message	Last commit date
Latest commit History 81 Commits
docs		docs
LICENSE		LICENSE
README.md		README.md

License

Danielskry/Awesome-RAG

Folders and files

Latest commit

History

Repository files navigation

😎 Awesome Retrieval Augmented Generation (RAG)

Overview

Key Benefits

Content

ℹ️ General Information on RAG

Implementation Resources

Python Tutorials & Examples

Production & Best Practices

🏗️ Architecture Patterns

🎯 Advanced Approaches

🧰 Frameworks that Facilitate RAG

🐍 Python Ecosystem for RAG

🛠️ Techniques

Data cleaning

Prompting

Chunking

Embeddings

Retrieval

Response Quality & Safety

📊 Metrics & Evaluation

Similarity Metrics for Embeddings

Response Evaluation Metrics

Tools

💾 Databases

Benchmarks

Distributed Data Processing and Serving Engines:

Search Engines with Vector Capabilities:

Vector Databases:

Relational Database Extensions:

Other Database Systems:

Vector Search Libraries and Tools:

🚀 Production Considerations

Scalability & Performance

Reliability & Monitoring

Data Management

Security & Compliance

Cost Optimization

🔌 Platform-Specific RAG Implementations

💡 Best Practices

Chunking Strategy

Embedding Selection

Retrieval Optimization

Prompt Engineering

Evaluation Framework

Iterative Improvement

Contributing

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 5

Packages