
Data Engineering Zoomcamp

Overview

Welcome to my Data Engineering Zoomcamp repository, a comprehensive guide to mastering data engineering concepts and practices through hands-on exercises and real-world applications. It includes the learning materials, notes, homework, projects, and extra exercises completed during the Data Engineering Zoomcamp.

Table of Contents

  • Description
  • Technologies Used
  • Modules
  • Resources
  • Final Project
  • Contributing
  • License
  • Acknowledgments

Description

This repository contains all the materials and code developed during the Data Engineering Zoomcamp. The course covers various aspects of data engineering, including data ingestion, data transformation, and data warehousing, using modern tools and frameworks provided by the course's sponsorship partners.

Technologies Used

  • Python: For scripting and data manipulation.
  • Kestra: For orchestrating workflows.
  • dlt: For data ingestion.
  • PostgreSQL: As the relational database management system.
  • BigQuery: For data warehousing, partitioning & clustering, and machine learning.
  • Docker: For containerization of applications.
  • dbt: For analytics engineering.
  • Spark: For batch processing.
  • Apache Kafka: For real-time data streaming.
  • Pandas: For data analysis and manipulation.
  • SQLAlchemy: For database interaction.
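
Several of these tools work together in a typical course pipeline: Pandas handles the data manipulation and SQLAlchemy handles the database interaction. The sketch below shows that combination; the table and column names are illustrative, and SQLite is used here so the snippet runs anywhere, whereas in the course a PostgreSQL connection URL would be used instead.

```python
import pandas as pd
from sqlalchemy import create_engine

# Illustrative sample of taxi-style trip data (column names are hypothetical).
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2024-01-01 08:15", "2024-01-01 09:30"]),
    "trip_distance": [2.5, 7.1],
    "total_amount": [12.80, 28.50],
})

# With PostgreSQL this would look like:
#   create_engine("postgresql://user:password@localhost:5432/ny_taxi")
# An in-memory SQLite engine keeps the sketch self-contained.
engine = create_engine("sqlite://")

# Write the DataFrame to a table, then query it back with SQL.
trips.to_sql("yellow_taxi_data", engine, index=False, if_exists="replace")
result = pd.read_sql("SELECT COUNT(*) AS n FROM yellow_taxi_data", engine)
print(result["n"].iloc[0])  # 2
```

Swapping the engine URL is the only change needed to point the same code at the PostgreSQL container used throughout the course.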

Modules

  • Module 1: Containerization and Infrastructure as Code
      • Introduction to GCP
      • Docker and Docker Compose
      • Running PostgreSQL with Docker
      • Infrastructure setup with Terraform
      • Homework
  • Module 2: Workflow Orchestration
      • Data Lakes and Workflow Orchestration
      • Workflow orchestration with Kestra
      • Homework
  • Workshop: Data Ingestion with dlt
      • API reading and pipeline scalability
      • Data normalization and incremental loading
      • Homework
  • Module 3: Data Warehouse
      • Introduction to BigQuery
      • Partitioning, clustering, and best practices
      • Machine learning in BigQuery
  • Module 4: Analytics Engineering
      • dbt (data build tool) with PostgreSQL & BigQuery
      • Testing, documentation, and deployment
      • Data visualization with Metabase
  • Module 5: Batch Processing
      • Introduction to Apache Spark
      • DataFrames and SQL
      • Internals of GroupBy and Joins
  • Module 6: Streaming
      • Introduction to Kafka
      • Kafka Streams and KSQL
      • Schema management with Avro
  • Final Project
      • Apply all concepts learned in a real-world scenario
      • Peer review and feedback process
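
The incremental-loading idea covered in the dlt workshop can be sketched in plain Python: keep a high-water mark (the latest timestamp already loaded) and append only the records beyond it. This is a conceptual sketch, not the dlt API; all names here are illustrative.

```python
from datetime import datetime

# Simulated destination table and its high-water mark
# (the latest timestamp already loaded).
destination = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 2)},
]
high_water_mark = max(row["updated_at"] for row in destination)

# Simulated source extract, including one already-loaded record.
source = [
    {"id": 2, "updated_at": datetime(2024, 1, 2)},
    {"id": 3, "updated_at": datetime(2024, 1, 3)},
]

# Incremental load: append only records newer than the high-water mark,
# instead of reloading the full table on every run.
new_rows = [row for row in source if row["updated_at"] > high_water_mark]
destination.extend(new_rows)

print(len(destination))  # 3
```

In dlt the same bookkeeping is handled for you by declaring a cursor column on a resource; the point of the sketch is only the high-water-mark mechanism itself.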

Resources

Course Notes

All course notes are centralized in this directory.

Final Project

The Final Project directory showcases a data engineering project designed to empower business intelligence insights. It includes:

  • ETL Pipelines: Well-structured Extract, Transform, Load (ETL) processes that integrate diverse data sources into a cohesive data warehouse.
  • Data Models: Comprehensive data models optimized for analytical queries, facilitating efficient data retrieval and reporting.
  • Documentation: Detailed guides on pipeline architecture, data flow, and usage instructions for stakeholders.
  • Dashboards: Interactive dashboards built using BI tools, demonstrating key metrics and visualizations derived from the processed data.
  • Testing Suite: Automated tests to validate data integrity and pipeline performance, ensuring reliable analytics.

This directory serves as a practical demonstration of data engineering principles applied to generate actionable business insights, aligning with the goals of a business intelligence analyst.
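
As a minimal sketch of the kind of data-integrity check such a testing suite might run, the snippet below validates a pandas DataFrame standing in for pipeline output; the column names and rules are illustrative, not the project's actual tests.

```python
import pandas as pd

# Hypothetical pipeline output to validate.
output = pd.DataFrame({
    "order_id": [101, 102, 103],
    "amount": [19.99, 5.00, 42.50],
})

def validate(df: pd.DataFrame) -> list[str]:
    """Return a list of integrity violations (empty means the data passed)."""
    problems = []
    if df["order_id"].isna().any():
        problems.append("order_id contains nulls")
    if df["order_id"].duplicated().any():
        problems.append("order_id contains duplicates")
    if (df["amount"] < 0).any():
        problems.append("amount contains negative values")
    return problems

print(validate(output))  # []
```

Checks like these are cheap to run after every pipeline stage, so a bad load fails fast instead of silently corrupting downstream dashboards.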

Contributing

All contributions from the community are welcome 👍. To ensure a smooth collaboration process, please follow these guidelines:

  1. Fork the Repository: Start by forking the repository to your own GitHub account.
  2. Clone Your Fork: Clone your forked repository to your local machine using:
    git clone https://github.com/your-username/repo-name.git
  3. Create a Branch: Create a new branch for your feature or bug fix:
    git checkout -b category/reference/description-in-kebab-case
  4. Make Changes: Implement your changes and ensure they are well-documented.
  5. Commit Your Changes: Commit your changes with a clear message:
    git commit -m 'category: do something; do some other things'
  6. Push to Your Fork: Push your changes to your forked repository:
    git push origin category/reference/description-in-kebab-case
  7. Submit a Pull Request: Navigate to the original repository and submit a pull request. Provide a detailed description of your changes and why they should be merged.

We appreciate your contributions and will review your pull request as soon as possible. Please follow the simplified naming convention for branches and commits as summarized here.

License

This project is licensed under the Apache 2.0 License. You are free to use, modify, and distribute these materials, provided that proper attribution is given to the original authors.

For more details, please refer to the LICENSE file in the repository.

Acknowledgments

  • We would like to thank all the instructors for their hard work and dedication to the Data Engineering Zoomcamp.

  • Special thanks to our sponsorship partners Kestra and dlt for their support.
