I'm a Cloud Data Engineer building scalable, reliable, and cost-efficient cloud data platforms.
I specialize in turning raw, messy, multi-source data into trusted analytics layers and ML-ready pipelines
through a mix of modern ELT, streaming systems, and strong distributed systems fundamentals.
🎓 MSCS @ Northeastern University (2022–2024)
☁️ Focus: Cloud-Native Data Engineering
🔗 Connect: GitHub: wyang10 • LinkedIn: linkedin.com/in/awhy
- Focused on building cloud-native, event-driven data systems on AWS and GCP.
- Experienced in delivering data platforms and analytics pipelines with data quality and schema governance.
- Strong in reliability engineering (idempotency, DLQ/replay, observability), IaC (Terraform), Kubernetes, and CI/CD.
Data Engineer — LumiereX (Jan 2025 – Present)
- Built event-driven serverless ELT ingestion on AWS (S3, API Gateway, Lambda, SQS, Glue, Step Functions).
- Improved data-quality layers and optimized Spark jobs for cost and performance.
- Implemented reliability-engineering practices (idempotency, DLQ/replay, observability).
Software Engineer Intern — VisionX (Jan 2024 – Jul 2024)
- Contributed to a Kafka → Flink streaming pipeline that enables real-time ML scoring on IoT sensor data.
- Focused on modules including schema governance, ingestion reliability, and validation checks.
- Containerized Flink jobs with Docker and deployed them to Kubernetes.
- Orchestration: EventBridge → Step Functions → Glue Job + optional Great Expectations gate.
- Catalog / Query: Glue Data Catalog + Crawler + Athena tables over silver-layer Parquet.
- Replay / Recovery: replay & dlq-redrive scripts for backfill and poison-message recovery.
- Idempotency: DynamoDB TTL for object-level dedup, optional GSI for audit.
- CI/CD: GitHub Actions pipelines (Lambda build+deploy, Terraform plan+apply).
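The idempotency layer above can be sketched as a conditional DynamoDB put with a TTL attribute; table and attribute names (`pk`, `ttl`) are assumptions for illustration, not the project's actual schema:

```python
# Hypothetical sketch of object-level dedup via a DynamoDB conditional put.
# The attribute names "pk" and "ttl" are assumptions for this example.
import time


def mark_if_new(ddb, table: str, object_key: str, ttl_days: int = 7) -> bool:
    """Return True if this object has not been processed yet.

    The conditional PutItem succeeds only when no item with the same
    partition key exists; the TTL attribute lets DynamoDB expire old
    dedup records automatically, so replays outside the window reprocess.
    """
    try:
        ddb.put_item(
            TableName=table,
            Item={
                "pk": {"S": object_key},
                "ttl": {"N": str(int(time.time()) + ttl_days * 86400)},
            },
            ConditionExpression="attribute_not_exists(pk)",
        )
        return True  # first delivery: safe to process
    except ddb.exceptions.ConditionalCheckFailedException:
        return False  # duplicate delivery: skip without side effects
```

A Lambda consumer would call this before doing any work, making SQS at-least-once delivery effectively exactly-once at the object level.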
- End-to-End, Reproducible ML Pipeline — engineered a modular, production-style ML system for predicting in-hospital mortality.
- Goes from raw CSV → cleaned features → baseline models → a reproducible CLI pipeline, with optional SMOTE to address severe class imbalance.
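The staged CLI described above can be sketched with `argparse` subcommands; the stage names and flags here are assumptions, not the project's actual interface:

```python
# Minimal sketch of a staged, reproducible pipeline CLI.
# Stage names ("clean", "features", "train", "evaluate") are assumptions.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pipeline")
    sub = parser.add_subparsers(dest="stage", required=True)
    for stage in ("clean", "features", "train", "evaluate"):
        s = sub.add_parser(stage, help=f"run the {stage} stage")
        s.add_argument("--input", required=True, help="input path")
        s.add_argument("--output", required=True, help="output path")
        # A fixed default seed keeps every stage deterministic and re-runnable.
        s.add_argument("--seed", type=int, default=42)
    return parser
```

Each stage reads only its declared input and writes only its declared output, so any step (e.g. retraining with SMOTE enabled) can be re-run in isolation without invalidating upstream artifacts.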
- A production-ready ELT & data-quality framework using Airflow + dbt + Snowflake + Great Expectations + CI/CD.
- Automates data ingestion, transformation, testing, and lineage into a reproducible orchestration system.
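The quality-gate idea behind this framework can be illustrated with a simplified, pure-Python stand-in (it is not the Great Expectations API; the column names and rules are assumptions):

```python
# Simplified stand-in for an expectation-style data-quality gate.
# Column names and rules below are illustrative assumptions.

def run_quality_gate(rows, rules):
    """Apply per-column checks; return a list of failure messages.

    An empty list means the gate passes and downstream transforms may run;
    a non-empty list would fail the orchestrated task and halt the DAG.
    """
    failures = []
    for i, row in enumerate(rows):
        for column, check in rules.items():
            value = row.get(column)
            if not check(value):
                failures.append(f"row {i}: {column}={value!r} failed check")
    return failures


RULES = {
    # analogous to expect_column_values_to_not_be_null
    "id": lambda v: v is not None,
    # analogous to expect_column_values_to_be_between
    "amount": lambda v: v is not None and 0 <= v <= 1_000_000,
}
```

In the real framework, Great Expectations suites play this role and Airflow fails the task when a validation result comes back unsuccessful, keeping bad batches out of the warehouse.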
- I design modular, observable pipelines that are easy to test, debug, and scale.
- I prioritize trade-offs that maximize team velocity, reliability, and cloud spend efficiency.
- I enjoy collaborations involving data modeling, pipeline quality, and distributed system design.
Languages & Tools
Python (Pandas, PySpark) • SQL • Java • Bash
Cloud & Orchestration
GCP (BigQuery, Dataflow) • AWS (S3, EMR, Glue, Lambda, SQS, Step Functions, IAM)
GitHub Actions • Airflow • dbt • Docker • Kubernetes • Terraform
Big Data & Storage
Spark • Kafka • Flink • Databricks • Delta Lake
Snowflake • Parquet • SCD Type 2 • dimensional modeling
Data Quality & CI/CD
Great Expectations • dbt tests • automated lineage • monitoring

