I built this project to understand how Retrieval-Augmented Generation (RAG) systems work and to gain hands-on experience with document retrieval pipelines.
The project helped me learn about:
- PDF-based information retrieval
- Semantic search
- Vector embeddings
- Text chunking strategies
- Context retrieval
- LLM-based question answering
I mainly developed this system as a foundation for my current work involving medical data retrieval and healthcare-related AI applications. Working on this project gave me practical experience with how documents are processed, stored, retrieved, and used for generating responses in real-world RAG systems.
Live demo: https://rag-based-pdf-question-answering-system.streamlit.app/
Source code: https://github.com/Erebo/RAG-based-PDF-Question-Answering-System
- Upload and process PDF documents
- Ask questions directly from uploaded PDFs
- Retrieval-Augmented Generation (RAG) pipeline
- Semantic chunk retrieval
- Adjustable chunk size and overlap
- Configurable top-k retrieval
- Retrieved context visualization
- Clean and interactive Streamlit UI
- Python
- Streamlit
- LangChain
- FAISS / Vector Store
- Sentence Transformers
- LLMs
- PyPDF
- Semantic Search
The application follows a standard RAG (Retrieval-Augmented Generation) workflow.
Users upload one or more PDF files.
The system extracts text from the uploaded PDFs.
The extracted text is divided into smaller chunks, controlled by two configurable settings:
- Chunk size
- Chunk overlap
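As a rough illustration of how chunk size and chunk overlap interact, here is a minimal character-based splitter. It is a sketch only: the function name `chunk_text` and its defaults are hypothetical, and the project itself may rely on a LangChain text splitter with different splitting rules.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that share an overlap.

    Hypothetical helper for illustration; not the project's actual splitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    # Each new chunk starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2):
        print(chunk)
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of producing more chunks to embed and store.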
Each chunk is converted into a vector embedding that captures its semantic meaning.
The embeddings are stored in a vector store for efficient similarity search.
The user query is converted into an embedding and matched against the stored vectors.
The top-k most relevant chunks are retrieved based on semantic similarity.
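The retrieval step can be sketched with plain cosine similarity over toy vectors. This is an assumption-laden illustration: the names `cosine_similarity` and `top_k_chunks` are hypothetical, the vectors are hand-written rather than produced by an embedding model, and a vector store such as FAISS would replace the linear scan in practice.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_vecs, chunks, k=3):
    """Return the k (score, chunk) pairs most similar to the query.

    Hypothetical helper: a brute-force scan standing in for a vector store lookup.
    """
    scored = [
        (cosine_similarity(query_vec, vec), chunk)
        for vec, chunk in zip(chunk_vecs, chunks)
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 2-D "embeddings" chosen by hand for illustration only.
    chunks = ["chunk A", "chunk B", "chunk C"]
    chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for score, chunk in top_k_chunks([1.0, 0.0], chunk_vecs, chunks, k=2):
        print(f"{score:.3f}  {chunk}")
```

Raising `k` gives the language model more context to work with, but also increases the chance of passing in loosely related chunks.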
The retrieved context is passed to the language model, which generates a response grounded in that context.
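One possible way to hand retrieved chunks to a language model is to assemble them into a single prompt. The sketch below is illustrative only: `build_prompt` and the prompt wording are hypothetical and not taken from the project, and the actual model call is left out.

```python
def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Combine retrieved chunks and the user question into one prompt string.

    Hypothetical helper; the real project may use a LangChain prompt template.
    """
    # Number each chunk so the model (and the reader) can tell them apart.
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}"
        for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    print(build_prompt("What is RAG?", ["First retrieved chunk.", "Second retrieved chunk."]))
```

Instructing the model to stay within the supplied context is a common way to reduce unsupported answers, though it does not guarantee correctness.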
- PDF Upload System
- Interactive Chat Interface
- Retrieval Configuration Panel
- Retrieved Context Viewer
- Chunk Size
- Chunk Overlap
- Top-k Retrieval
git clone https://github.com/Erebo/RAG-based-PDF-Question-Answering-System.git
cd RAG-based-PDF-Question-Answering-System
pip install -r requirements.txt
streamlit run app.py

- Research paper analysis
- Academic PDF querying
- AI-assisted document retrieval
- Knowledge extraction from reports
- Understanding long-form documents efficiently
- Adaptive chunking strategies
- Hybrid retrieval systems
- Citation-aware responses
- Multi-document memory
- Streaming response generation
- Medical-domain optimized retrieval
- Better context ranking mechanisms
Mahadi Rahman Jihad
Research Enthusiast | AI | Computer Vision | Healthcare AI
GitHub: https://github.com/Erebo
This project was developed as a personal learning and research project to better understand the practical implementation of Retrieval-Augmented Generation systems and document retrieval pipelines, with future AI and healthcare-related applications in mind.