
Data Engineering Zoomcamp

Overview

Welcome to my Data Engineering Zoomcamp repository, a comprehensive guide to mastering data engineering concepts and practices through hands-on exercises and real-world applications. It includes the learning materials, notes, homework, projects, and extra exercises completed during the Data Engineering Zoomcamp.

Table of Contents

  • Description
  • Technologies Used
  • Modules
  • Resources
  • Final Project
  • Contributing
  • License
  • Acknowledgments

Description

This repository contains all the materials and code developed during the Data Engineering Zoomcamp. The course covers various aspects of data engineering, including data ingestion, data transformation, and data warehousing, using modern tools and frameworks provided by the course's sponsorship partners.

Technologies Used

  • Python: For scripting and data manipulation.
  • Kestra: For orchestrating workflows.
  • dlt: For data ingestion.
  • PostgreSQL: As the relational database management system.
  • BigQuery: For data warehousing, partitioning & clustering, and machine learning.
  • Docker: For containerization of applications.
  • dbt: For analytics engineering.
  • Spark: For batch processing.
  • Apache Kafka: For real-time data streaming.
  • Pandas: For data analysis and manipulation.
  • SQLAlchemy: For database interaction.
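
Several of these tools work together in a typical course pipeline: Pandas handles the data manipulation and SQLAlchemy handles the database interaction. The sketch below shows that combination; the table and column names are illustrative, and SQLite is used here so the snippet runs anywhere, whereas in the course a PostgreSQL connection URL would be used instead.

```python
import pandas as pd
from sqlalchemy import create_engine

# Illustrative sample of taxi-style trip data (column names are hypothetical).
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2024-01-01 08:15", "2024-01-01 09:30"]),
    "trip_distance": [2.5, 7.1],
    "total_amount": [12.80, 28.50],
})

# With PostgreSQL this would look like:
#   create_engine("postgresql://user:password@localhost:5432/ny_taxi")
# An in-memory SQLite engine keeps the sketch self-contained.
engine = create_engine("sqlite://")

# Write the DataFrame to a table, then query it back with SQL.
trips.to_sql("yellow_taxi_data", engine, index=False, if_exists="replace")
result = pd.read_sql("SELECT COUNT(*) AS n FROM yellow_taxi_data", engine)
print(result["n"].iloc[0])  # 2
```

Swapping the engine URL is the only change needed to point the same code at the PostgreSQL container used throughout the course.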

Modules

  • Module 1: Containerization and Infrastructure as Code
      • Introduction to GCP
      • Docker and Docker Compose
      • Running PostgreSQL with Docker
      • Infrastructure setup with Terraform
      • Homework
  • Module 2: Workflow Orchestration
      • Data Lakes and Workflow Orchestration
      • Workflow orchestration with Kestra
      • Homework
  • Workshop: Data Ingestion with dlt
      • API reading and pipeline scalability
      • Data normalization and incremental loading
      • Homework
  • Module 3: Data Warehouse
      • Introduction to BigQuery
      • Partitioning, clustering, and best practices
      • Machine learning in BigQuery
  • Module 4: Analytics Engineering
      • dbt (data build tool) with PostgreSQL & BigQuery
      • Testing, documentation, and deployment
      • Data visualization with Metabase
  • Module 5: Batch Processing
      • Introduction to Apache Spark
      • DataFrames and SQL
      • Internals of GroupBy and Joins
  • Module 6: Streaming
      • Introduction to Kafka
      • Kafka Streams and KSQL
      • Schema management with Avro
  • Final Project
      • Apply all concepts learned in a real-world scenario
      • Peer review and feedback process
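
The incremental-loading idea covered in the dlt workshop can be sketched in plain Python: keep a high-water mark (the latest timestamp already loaded) and append only the records beyond it. This is a conceptual sketch, not the dlt API; all names here are illustrative.

```python
from datetime import datetime

# Simulated destination table and its high-water mark
# (the latest timestamp already loaded).
destination = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 2)},
]
high_water_mark = max(row["updated_at"] for row in destination)

# Simulated source extract, including one already-loaded record.
source = [
    {"id": 2, "updated_at": datetime(2024, 1, 2)},
    {"id": 3, "updated_at": datetime(2024, 1, 3)},
]

# Incremental load: append only records newer than the high-water mark,
# instead of reloading the full table on every run.
new_rows = [row for row in source if row["updated_at"] > high_water_mark]
destination.extend(new_rows)

print(len(destination))  # 3
```

In dlt the same bookkeeping is handled for you by declaring a cursor column on a resource; the point of the sketch is only the high-water-mark mechanism itself.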

Resources

Course Notes

All course notes are centralized in this directory.

Final Project

The Final Project directory showcases a data engineering project designed to empower business intelligence insights. It includes:

  • ETL Pipelines: Well-structured Extract, Transform, Load (ETL) processes that integrate diverse data sources into a cohesive data warehouse.
  • Data Models: Comprehensive data models optimized for analytical queries, facilitating efficient data retrieval and reporting.
  • Documentation: Detailed guides on pipeline architecture, data flow, and usage instructions for stakeholders.
  • Dashboards: Interactive dashboards built using BI tools, demonstrating key metrics and visualizations derived from the processed data.
  • Testing Suite: Automated tests to validate data integrity and pipeline performance, ensuring reliable analytics.

This directory serves as a practical demonstration of data engineering principles applied to generate actionable business insights, aligning with the goals of a business intelligence analyst.
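
As a minimal sketch of the kind of data-integrity check such a testing suite might run, the snippet below validates a pandas DataFrame standing in for pipeline output; the column names and rules are illustrative, not the project's actual tests.

```python
import pandas as pd

# Hypothetical pipeline output to validate.
output = pd.DataFrame({
    "order_id": [101, 102, 103],
    "amount": [19.99, 5.00, 42.50],
})

def validate(df: pd.DataFrame) -> list[str]:
    """Return a list of integrity violations (empty means the data passed)."""
    problems = []
    if df["order_id"].isna().any():
        problems.append("order_id contains nulls")
    if df["order_id"].duplicated().any():
        problems.append("order_id contains duplicates")
    if (df["amount"] < 0).any():
        problems.append("amount contains negative values")
    return problems

print(validate(output))  # []
```

Checks like these are cheap to run after every pipeline stage, so a bad load fails fast instead of silently corrupting downstream dashboards.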

Contributing

All contributions from the community are welcome 👍. To ensure a smooth collaboration process, please follow these guidelines:

  1. Fork the Repository: Start by forking the repository to your own GitHub account.
  2. Clone Your Fork: Clone your forked repository to your local machine using:
    git clone https://github.com/your-username/repo-name.git
  3. Create a Branch: Create a new branch for your feature or bug fix:
    git checkout -b category/reference/description-in-kebab-case
  4. Make Changes: Implement your changes and ensure they are well-documented.
  5. Commit Your Changes: Commit your changes with a clear message:
    git commit -m 'category: do something; do some other things'
  6. Push to Your Fork: Push your changes to your forked repository:
    git push origin category/reference/description-in-kebab-case
  7. Submit a Pull Request: Navigate to the original repository and submit a pull request. Provide a detailed description of your changes and why they should be merged.

We appreciate your contributions and will review your pull request as soon as possible. Please follow the simplified naming convention for branches and commits as summarized here.

License

This project is licensed under the Apache 2.0 License. You are free to use, modify, and distribute these materials, provided that proper attribution is given to the original authors.

For more details, please refer to the LICENSE file in the repository.

Acknowledgments

  • We would like to thank all the instructors for their hard work and dedication to the Data Engineering Zoomcamp.

  • Special thanks to our sponsorship partners Kestra and dlt for their support.
