I'm a Cloud Data Engineer building scalable, reliable, and cost-efficient cloud data platforms.
I specialize in turning raw, messy, multi-source data into trusted analytics layers and ML-ready pipelines
through a mix of modern ELT, streaming systems, and strong distributed systems fundamentals.
🎓 MSCS @ Northeastern University (2022–2024)
☁️ Focus: Cloud-Native Data Engineering
🔗 Connect: GitHub: wyang10 • LinkedIn: linkedin.com/in/awhy
- Focused on building cloud-native, event-driven data systems on AWS and GCP.
- Experienced in delivering data platforms and analytics pipelines with data quality and schema governance.
- Strong in reliability engineering (idempotency, DLQ/replay, observability), IaC (Terraform), Kubernetes, and CI/CD.
Data Engineer — LumiereX (Jan 2025 – Present)
- Built event-driven serverless ELT ingestion on AWS (S3, API Gateway, Lambda, SQS, Glue, Step Functions).
- Improved data-quality layers and optimized Spark jobs for cost and performance.
- Implemented reliability-engineering practices (idempotency, DLQ/replay, observability).
Software Engineer Intern — VisionX (Jan 2024 – Jul 2024)
- Contributed to a Kafka → Flink streaming pipeline that enables real-time ML scoring on IoT sensor data.
- Focused on modules including schema governance, ingestion reliability, and validation checks.
- Containerized Flink jobs with Docker and deployed them to Kubernetes.
- Orchestration: EventBridge → Step Functions → Glue Job + optional Great Expectations gate.
- Catalog / Query: Glue Data Catalog + Crawler + Athena tables over silver-layer Parquet.
- Replay / Recovery: replay & dlq-redrive scripts for backfill and poison-message recovery.
- Idempotency: DynamoDB TTL for object-level dedup, optional GSI for audit.
- CI/CD: GitHub Actions pipelines (Lambda build+deploy, Terraform plan+apply).
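The idempotency layer above can be sketched as a conditional DynamoDB put with a TTL attribute; table and attribute names (`pk`, `ttl`) are assumptions for illustration, not the project's actual schema:

```python
# Hypothetical sketch of object-level dedup via a DynamoDB conditional put.
# The attribute names "pk" and "ttl" are assumptions for this example.
import time


def mark_if_new(ddb, table: str, object_key: str, ttl_days: int = 7) -> bool:
    """Return True if this object has not been processed yet.

    The conditional PutItem succeeds only when no item with the same
    partition key exists; the TTL attribute lets DynamoDB expire old
    dedup records automatically, so replays outside the window reprocess.
    """
    try:
        ddb.put_item(
            TableName=table,
            Item={
                "pk": {"S": object_key},
                "ttl": {"N": str(int(time.time()) + ttl_days * 86400)},
            },
            ConditionExpression="attribute_not_exists(pk)",
        )
        return True  # first delivery: safe to process
    except ddb.exceptions.ConditionalCheckFailedException:
        return False  # duplicate delivery: skip without side effects
```

A Lambda consumer would call this before doing any work, making SQS at-least-once delivery effectively exactly-once at the object level.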
- End-to-End, Reproducible ML Pipeline — engineered a modular, production-style ML system for predicting in-hospital mortality.
- Goes from raw CSV → cleaned features → baseline models → a reproducible CLI pipeline, with optional SMOTE to address severe class imbalance.
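The staged CLI described above can be sketched with `argparse` subcommands; the stage names and flags here are assumptions, not the project's actual interface:

```python
# Minimal sketch of a staged, reproducible pipeline CLI.
# Stage names ("clean", "features", "train", "evaluate") are assumptions.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pipeline")
    sub = parser.add_subparsers(dest="stage", required=True)
    for stage in ("clean", "features", "train", "evaluate"):
        s = sub.add_parser(stage, help=f"run the {stage} stage")
        s.add_argument("--input", required=True, help="input path")
        s.add_argument("--output", required=True, help="output path")
        # A fixed default seed keeps every stage deterministic and re-runnable.
        s.add_argument("--seed", type=int, default=42)
    return parser
```

Each stage reads only its declared input and writes only its declared output, so any step (e.g. retraining with SMOTE enabled) can be re-run in isolation without invalidating upstream artifacts.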
- A production-ready ELT & data-quality framework using Airflow + dbt + Snowflake + Great Expectations + CI/CD.
- Automates data ingestion, transformation, testing, and lineage into a reproducible orchestration system.
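The quality-gate idea behind this framework can be illustrated with a simplified, pure-Python stand-in (it is not the Great Expectations API; the column names and rules are assumptions):

```python
# Simplified stand-in for an expectation-style data-quality gate.
# Column names and rules below are illustrative assumptions.

def run_quality_gate(rows, rules):
    """Apply per-column checks; return a list of failure messages.

    An empty list means the gate passes and downstream transforms may run;
    a non-empty list would fail the orchestrated task and halt the DAG.
    """
    failures = []
    for i, row in enumerate(rows):
        for column, check in rules.items():
            value = row.get(column)
            if not check(value):
                failures.append(f"row {i}: {column}={value!r} failed check")
    return failures


RULES = {
    # analogous to expect_column_values_to_not_be_null
    "id": lambda v: v is not None,
    # analogous to expect_column_values_to_be_between
    "amount": lambda v: v is not None and 0 <= v <= 1_000_000,
}
```

In the real framework, Great Expectations suites play this role and Airflow fails the task when a validation result comes back unsuccessful, keeping bad batches out of the warehouse.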
- I design modular, observable pipelines that are easy to test, debug, and scale.
- I prioritize trade-offs that maximize team velocity, reliability, and cloud spend efficiency.
- I enjoy collaborations involving data modeling, pipeline quality, and distributed system design.
Languages & Tools
Python (Pandas, PySpark) • SQL • Java • Bash
Cloud & Orchestration
GCP (BigQuery, Dataflow) • AWS (S3, EMR, Glue, Lambda, SQS, Step Functions, IAM)
GitHub Actions • Airflow • dbt • Docker • Kubernetes • Terraform
Big Data & Storage
Spark • Kafka • Flink • Databricks • Delta Lake
Snowflake • Parquet • SCD Type 2 • dimensional modeling
Data Quality & CI/CD
Great Expectations • dbt tests • automated lineage • monitoring

