🏦 Centrale Rischi Financial Data Parser (Secure & Scalable)

📋 Executive Summary

This repository hosts a production-grade, containerized data extraction engine designed to parse complex, multi-page financial PDF reports (specifically Centrale Rischi documents). The solution automates the extraction of critical financial indicators, tabular data, and metadata, transforming unstructured PDF content into structured, queryable datasets.

Key Achievement: The parser is built to process high volumes of financial documents. It handles dynamic table structures, multi-page layouts, and data normalization, which reduces manual data entry.

🏗️ Architecture & Workflow

The system is architected as a modular microservice, leveraging Docker for portability and ease of deployment.

Core Components:

Parser Engine (src/core):
- Utilizes Camelot (Lattice mode) for high-fidelity table extraction.
- Integrates PyMuPDF (Fitz) for layout analysis, metadata extraction (headers, dates), and page orientation detection.
API Layer (src/api):
- A robust Flask API endpoint (/parse_document) handles file uploads and processing requests.
- Implements asynchronous processing using threading to ensure API responsiveness.
Data Pipeline (src/storage):
- S3 Integration: Securely uploads raw PDF backups to AWS S3.
- DynamoDB Integration: Stores structured parsing results and processing status in AWS DynamoDB.
Infrastructure (docker/):
- Fully containerized using Docker.
- Deployed with Gunicorn as the WSGI HTTP server for production performance.
- Includes Nginx configuration for reverse proxy handling.
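
The "accept now, parse in the background" behaviour described for the API layer can be sketched with the standard library alone. This is a minimal illustration, not the repository's code: `accept_document`, the `STATUS` dictionary (standing in for the DynamoDB status table), and the status strings are hypothetical names chosen for the example.

```python
import threading
import uuid
from typing import Callable, Dict, Tuple

# In-memory stand-in for the DynamoDB status table (assumption for this sketch).
STATUS: Dict[str, str] = {}
_LOCK = threading.Lock()


def _set_status(request_id: str, status: str) -> None:
    """Record the processing status for one request, guarded by a lock."""
    with _LOCK:
        STATUS[request_id] = status


def _run(request_id: str, pdf_bytes: bytes, parse: Callable[[bytes], object]) -> None:
    """Worker body: run the parser and record whether it succeeded."""
    try:
        parse(pdf_bytes)
        _set_status(request_id, "DONE")
    except Exception:
        _set_status(request_id, "FAILED")


def accept_document(
    pdf_bytes: bytes, parse: Callable[[bytes], object]
) -> Tuple[dict, threading.Thread]:
    """Start parsing in a background thread and return an acceptance response.

    The thread is returned as well so callers (and tests) can wait on it.
    """
    request_id = str(uuid.uuid4())
    _set_status(request_id, "PROCESSING")
    worker = threading.Thread(
        target=_run, args=(request_id, pdf_bytes, parse), daemon=True
    )
    worker.start()
    response = {
        "statusCode": 200,
        "request_id": request_id,
        "message": "File accepted for processing.",
    }
    return response, worker
```

In a Flask view, the response dictionary would be returned to the client right away while the worker thread keeps running; the caller can later look up the `request_id` to check the stored status.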

🛠️ Technical Specifications

| Component | Technology | Purpose |
| --- | --- | --- |
| Language | Python 3.10+ | Core logic and scripting. |
| Web Framework | Flask | RESTful API for document submission. |
| PDF Extraction | Camelot-py, PyMuPDF (Fitz) | Table and text extraction logic. |
| Cloud Services | AWS S3, DynamoDB | Storage and NoSQL database for results. |
| Containerization | Docker & Docker Compose | Deployment consistency and isolation. |
| Server | Gunicorn & Nginx | Production-ready application serving. |
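
As a rough illustration of the PDF extraction row, the sketch below reads ruled tables with Camelot's lattice flavor and normalizes amount cells. It is not the repository's implementation: `extract_tables` and `parse_amount` are hypothetical helpers, and the Italian number format (`.` for thousands, `,` for decimals) is an assumption about the report contents.

```python
from decimal import Decimal, InvalidOperation
from typing import List, Optional


def parse_amount(cell: str) -> Optional[Decimal]:
    """Convert an amount such as '1.234,56' to Decimal.

    Assumption: '.' separates thousands and ',' marks the decimals.
    Returns None for cells that are empty or not numeric.
    """
    text = cell.strip().replace(".", "").replace(",", ".")
    if not text:
        return None
    try:
        return Decimal(text)
    except InvalidOperation:
        return None


def extract_tables(pdf_path: str) -> List[List[List[str]]]:
    """Return every ruled table in the PDF as a list of rows of cell strings."""
    # Third-party dependency (camelot-py), imported lazily so that the
    # normalization helper above works without it installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages="all", flavor="lattice")
    # Each Camelot table exposes its cells as a pandas DataFrame in `.df`.
    return [table.df.values.tolist() for table in tables]
```

Lattice mode relies on the ruling lines drawn between cells, so it suits reports with fully bordered tables; cells that are not amounts simply come back as `None` from `parse_amount` and can be kept as text.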

🚀 Setup & Installation

Prerequisites

- Docker and Docker Compose installed.
- AWS credentials configured via environment variables.

1. Clone the Repository

git clone https://github.com/Arazmalek/centrale_rischi_parser.git
cd centrale_rischi_parser

2. Configuration

Create a .env file in the root directory (based on .env.example) and populate it with your credentials. Note: Never commit real credentials to Git.

# .env example
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=eu-south-1
S3_BUCKET_NAME=your-bucket-name
DYNAMO_TABLE_NAME=your-table-name
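
One way the application could read these variables at startup is sketched below. The `Settings` class and `load_settings` function are hypothetical names for this example, not code from the repository; the sketch fails fast when a variable is missing instead of surfacing the problem on the first AWS call.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Variable names taken from the .env example above.
REQUIRED = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION",
    "S3_BUCKET_NAME",
    "DYNAMO_TABLE_NAME",
)


@dataclass(frozen=True)
class Settings:
    """Non-secret settings; the access keys stay in the environment."""

    aws_region: str
    s3_bucket_name: str
    dynamo_table_name: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Validate that all required variables are set and return the settings."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        aws_region=env["AWS_REGION"],
        s3_bucket_name=env["S3_BUCKET_NAME"],
        dynamo_table_name=env["DYNAMO_TABLE_NAME"],
    )
```

Keeping the access keys out of the `Settings` object means they are never copied into application state or logged by accident; the AWS SDK can read them from the environment on its own.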

3. Build and Run with Docker

docker-compose up --build

The API will be available at http://localhost:80 (via Nginx) or http://localhost:5000 (direct).

📖 Usage

API Endpoint: `/parse_document`

Method: POST
Payload: multipart/form-data with a file field named `file`.

Example Request (cURL):

curl -X POST -F "file=@/path/to/sample_report.pdf" http://localhost/parse_document

Response:

{
    "statusCode": 200,
    "request_id": "unique-uuid-v4",
    "message": "File accepted for processing."
}
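
For callers that prefer Python over cURL, the same upload can be sketched with the standard library. `build_upload_request` and `read_request_id` are hypothetical helper names, and the sketch assumes the response body has exactly the shape shown above.

```python
import json
import urllib.request
import uuid


def build_upload_request(url: str, filename: str, pdf_bytes: bytes) -> urllib.request.Request:
    """Build a multipart/form-data POST with one file field named 'file'."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url,
        data=head + pdf_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def read_request_id(response_body: bytes) -> str:
    """Return the request_id from an acceptance response, or raise on failure."""
    payload = json.loads(response_body)
    if payload.get("statusCode") != 200:
        raise RuntimeError(f"Upload rejected: {payload}")
    return payload["request_id"]


# Usage against a running instance (not executed here):
#   with open("/path/to/sample_report.pdf", "rb") as fh:
#       req = build_upload_request(
#           "http://localhost/parse_document", "sample_report.pdf", fh.read()
#       )
#   with urllib.request.urlopen(req) as resp:
#       request_id = read_request_id(resp.read())
```

Because processing is asynchronous, a 200 response only confirms that the file was accepted; the returned `request_id` is the handle for finding the parsing result later.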

🔒 Security & Disclaimer

⚠️ IMPORTANT DISCLAIMER:

Synthetic Data: The PDF files provided in the data_samples/ directory are synthetically generated dummy documents. No real financial data or Personally Identifiable Information (PII) from real clients is used, stored, or processed in this public repository.

Credential Safety: All sensitive configuration (AWS keys, DB endpoints) has been abstracted into environment variables. The code provided here is a sanitized version of the production system, designed for portfolio demonstration purposes.

Araz Malekazari | Senior Data Engineer & System Architect
