VIVEK KUMAR
Data Engineer

Cloud-native data pipelines on Azure & AWS.
Medallion Lakehouse · PySpark · Databricks · Kafka

10M+ Records/Day
40% Processing Time ↓
90% Manual Cut
8M+ Records/Run (AWS)
Work History

Experience

Dataminerz Innovative Solutions
Data Engineer Intern
Jan 2026 – May 2026  ·  Noida, Uttar Pradesh
⬡ Project 1 — Guidepoint · Medallion Architecture (Azure / Databricks)
  • Architected a Medallion (Bronze → Silver → Gold) pipeline on Azure Databricks that runs 40% faster than the legacy batch processes
  • Implemented Lakehouse on ADLS Gen2 ingesting 10M+ records/day from structured & semi-structured sources using PySpark
  • Built ETL workflows with PySpark & SQL achieving 35% improvement in transformation efficiency across Silver and Gold layers
  • Delivered analytics-ready Gold datasets for 3+ stakeholder groups via Azure Synapse — 50+ GB daily throughput
  • Reduced pipeline change-request turnaround by 25% through cross-functional collaboration
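A minimal sketch of the Bronze → Silver → Gold flow described in the bullets above. It uses plain Python instead of PySpark so it runs without a cluster. The record fields, the `demo` source label, and the function names are illustrative assumptions, not the project's actual code.

```python
from collections import defaultdict

# Hypothetical raw records; field names are assumptions for illustration only.
RAW = [
    {"id": "1", "region": " north ", "amount": "10.5"},
    {"id": "2", "region": "South", "amount": "bad"},      # malformed amount
    {"id": "1", "region": " north ", "amount": "10.5"},   # duplicate of id 1
    {"id": "3", "region": "north", "amount": "4.5"},
]


def to_bronze(raw):
    """Bronze: land records unchanged, adding only ingestion metadata."""
    return [dict(r, _source="demo") for r in raw]


def to_silver(bronze):
    """Silver: cast types, normalise text, drop malformed rows and duplicates."""
    seen, out = set(), []
    for r in bronze:
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        if r["id"] in seen:
            continue  # keep only the first occurrence of each id
        seen.add(r["id"])
        out.append({
            "id": int(r["id"]),
            "region": r["region"].strip().lower(),
            "amount": amount,
        })
    return out


def to_gold(silver):
    """Gold: aggregate cleaned rows into an analytics-ready total per region."""
    totals = defaultdict(float)
    for r in silver:
        totals[r["region"]] += r["amount"]
    return dict(totals)


SILVER = to_silver(to_bronze(RAW))
GOLD = to_gold(SILVER)
```

In a Databricks setting each function would typically correspond to a job that reads one layer's table and writes the next; here the layers are just in-memory lists.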
⬡ Project 2 — Structurely · AWS Medallion Architecture Pipeline
  • Built Landing → Bronze → Silver → Gold pipeline ingesting Salesforce CRM & MongoDB into Aurora PostgreSQL — 8M+ records/run
  • Developed 4 AWS Glue PySpark jobs: full ingestion, cleaning, transformation, loading — 90% less manual handling
  • Designed metadata-driven control plane on Aurora PostgreSQL with pipeline_config & pipeline_audit tables and 5 PL/pgSQL functions
  • Configured S3 multi-layer Parquet storage — 30% storage cost reduction vs CSV
  • Set up Glue Crawlers to auto-update Glue Data Catalog after every S3 write — saving 5+ hrs/week of manual schema management
  • Orchestrated zero-touch daily execution via AWS Glue Workflow + EventBridge; all secrets managed through Secrets Manager
Technical Arsenal

Skills

Languages
Python · SQL · PySpark
Data Engineering & ETL
Medallion Architecture · Lakehouse · ETL Design · Incremental Load · Watermark Mgmt · CDC / SCD
Azure Stack
Azure Databricks · ADLS Gen2 · Azure Synapse · Azure Data Factory · Azure SQL DB · Key Vault · Event Hubs · Blob Storage
AWS Stack
AWS Glue · S3 · Lambda · EventBridge · Aurora PostgreSQL · Secrets Manager · CloudWatch · EC2
Big Data & Streaming
Apache Spark · Apache Kafka · Apache Flink · Apache Airflow · Debezium CDC · Apache Iceberg · HDFS
Databases & Analytics
PostgreSQL · MongoDB · ClickHouse · MySQL · dbt · Power BI · Grafana
DevOps & Tools
Docker · Git / GitHub · Linux · Bash · VS Code
PySpark / Big Data · 90%
Azure Stack · 87%
AWS Stack · 82%
Python / SQL · 92%
Kafka / Streaming · 78%
Built Things

Projects

01 / Azure · Lakehouse
Guidepoint Medallion Architecture
Enterprise-grade Bronze → Silver → Gold pipeline on Azure Databricks ingesting 10M+ records/day. Full Lakehouse on ADLS Gen2 with Delta Lake. Analytics-ready outputs served through Azure Synapse.
Databricks · ADLS Gen2 · PySpark · Azure Synapse · Delta Lake
02 / AWS · Medallion
Structurely AWS Data Pipeline
Metadata-driven Medallion pipeline from Salesforce CRM & MongoDB → Aurora PostgreSQL ingesting 8M+ records/run. Glue PySpark jobs, S3 Parquet, EventBridge orchestration.
AWS Glue · S3 · EventBridge · Aurora PostgreSQL · PySpark
03 / Batch · Big Data
Retail Analytics Platform
End-to-end analytics pipeline with Docker, Airflow, Spark, HDFS, EC2, and Power BI dashboards. Production-grade batch orchestration with full monitoring.
Docker · Airflow · Spark · HDFS · EC2 · Power BI
04 / Real-time · B.Tech Final Year
SENTINEL — Disaster Monitoring
Real-time pipeline ingesting weather, earthquake & disaster data via OpenWeatherMap, USGS, GDACS APIs. WebSocket-powered live dashboard with Chart.js visualizations.
WebSocket · Chart.js · OpenWeatherMap · USGS · GDACS
05 / Streaming
Real-Time E-Commerce Analytics
Large-scale streaming with Kafka KRaft, Debezium CDC, Apache Flink, Iceberg, ClickHouse OLAP, Airflow, dbt, and Grafana dashboards — fully containerized on Docker + EC2.
Kafka · Debezium · Flink · Iceberg · ClickHouse · dbt
06 / MongoDB · AWS
MongoDB Atlas → AWS Pipeline
Python incremental extraction from MongoDB Atlas to S3 with PostgreSQL audit tracking. Replicated with AWS-native services: Glue, Lambda, EventBridge, Secrets Manager, CloudWatch.
MongoDB Atlas · AWS Lambda · Glue · S3 · CloudWatch
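A minimal sketch of watermark-based incremental extraction, the general technique behind a pipeline like this one. It works on an in-memory list rather than a live MongoDB collection. The `updated_at` field, the sample documents, and the function names are assumptions for illustration, not the project's code.

```python
from datetime import datetime, timezone


def extract_incremental(documents, watermark):
    """Return documents modified strictly after the watermark, oldest first,
    together with the new watermark to persist for the next run."""
    fresh = sorted(
        (d for d in documents if d["updated_at"] > watermark),
        key=lambda d: d["updated_at"],
    )
    new_watermark = fresh[-1]["updated_at"] if fresh else watermark
    return fresh, new_watermark


def ts(day):
    """Helper: a UTC timestamp on the given day of January 2026."""
    return datetime(2026, 1, day, tzinfo=timezone.utc)


# Hypothetical source documents.
DOCS = [
    {"_id": "a", "updated_at": ts(1)},
    {"_id": "b", "updated_at": ts(3)},
    {"_id": "c", "updated_at": ts(2)},
]

# First run picks up everything after day 1; the second run finds nothing new.
first_batch, wm = extract_incremental(DOCS, ts(1))
second_batch, wm2 = extract_incremental(DOCS, wm)
```

In practice the watermark would be stored between runs, for example in an audit table, and the strict `>` comparison can miss documents that share the exact watermark timestamp, so some pipelines re-read a small overlap window and deduplicate.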
Get In Touch

Contact

Let's Build Together.
Education
B.Tech — Computer Science & Engineering
B.S.A. College of Engineering & Technology, Mathura
AKTU · 2022 – 2026
Certifications
SQL Certification, HCL GUVI · 2025 ✓
DP-203: Azure Data Engineer Associate · In Progress
AWS Certified Cloud Practitioner · In Progress
Open To
Data Engineer · Cloud Engineer · DevOps Engineer · Data Analyst