A production-grade Data Lakehouse designed for extreme cost efficiency and scalability. This architecture ingests raw edge logs, processes them via serverless ETL, and stores them in Apache Iceberg format for ACID-compliant analytics.
Architecture
flowchart TB
    subgraph Users
        Browser[Browser]
    end

    subgraph GitHub
        Repo[GitHub Repository]
        Actions[GitHub Actions]
    end

    subgraph AWS_Global["AWS Global"]
        WAF[AWS WAF<br/>Web ACL]
        Route53[Route53<br/>DNS]
        ACM[ACM Certificate<br/>us-east-1]
        CloudFront[CloudFront Pro<br/>Primary Distribution]
        CloudFrontDR[CloudFront Pro<br/>DR Distribution]
        LambdaDR[Lambda<br/>DR Failover]
        CWHealthAlarm[CloudWatch Alarm<br/>Health Check]
    end

    subgraph AWS_EU["AWS eu-central-1"]
        S3[S3 Bucket<br/>allaboutdata.eu]
        KMS[KMS<br/>Website CMK]
        KMSLake[KMS<br/>Datalake CMK]
    end

    subgraph AWS_DR["AWS eu-central-2 (DR)"]
        S3DR[S3 Bucket<br/>allaboutdata.eu-dr]
        KMSDR[KMS<br/>DR Customer Managed Key]
    end

    subgraph DataLake["AWS Data Lake"]
        S3Lake[S3 Data Lake Bucket]
        GlueCatalog[Glue Data Catalog<br/>Bronze Table]
        EventBridge[EventBridge<br/>Scheduler]
        GlueJob[Glue Spark Job<br/>Bronze → Silver]
        Iceberg[Iceberg Warehouse<br/>Silver Table]
        GoldJob[Glue Spark Job<br/>Silver → Gold]
        Gold[Gold Warehouse<br/>Parquet Aggregations]
        Athena[Athena<br/>SQL Analytics]
        LambdaDash[Lambda<br/>Dashboard Export]
        GlueDQ[Glue Data Quality<br/>DQDL Rulesets]
    end

    subgraph Observability["AWS Observability"]
        CWDashboard[CloudWatch<br/>Dashboard]
        CWAlarms[CloudWatch<br/>Alarms]
        SNS[SNS<br/>Email Alerts]
    end

    subgraph External
        GoogleWorkspace[Google Workspace<br/>Email]
    end

    %% User Flow
    Browser -->|HTTPS Request| Route53
    Route53 -->|DNS Resolution| CloudFront
    WAF -->|Protection| CloudFront
    WAF -->|Protection| CloudFrontDR
    CloudFront -->|Origin| S3
    CloudFrontDR -->|Origin| S3DR
    ACM -.->|TLS Certificate| CloudFront
    ACM -.->|TLS Certificate| CloudFrontDR

    %% DR Failover Flow
    Route53 -.->|Health Check| CWHealthAlarm
    CWHealthAlarm -->|SNS → ALARM/OK| LambdaDR
    LambdaDR -.->|AssociateAlias| CloudFront
    LambdaDR -.->|AssociateAlias| CloudFrontDR

    %% CI/CD Flow
    Repo -->|Push to main| Actions
    Actions -->|S3 Sync| S3
    Actions -->|Cache Invalidation| CloudFront

    %% Email (Google Workspace)
    Route53 -.->|MX Record| GoogleWorkspace

    %% Data Lake Flow
    CloudFront -->|Access Logs| S3Lake
    S3Lake -->|TSV Logs| GlueCatalog
    S3Lake -.->|S3 Events| EventBridge
    EventBridge -->|6h Trigger| GlueJob
    EventBridge -->|Daily Trigger| GoldJob
    GlueCatalog -->|Incremental Read| GlueJob
    GlueJob -->|Write Iceberg V2| Iceberg
    Iceberg -->|Delta Read| GoldJob
    GoldJob -->|Write Parquet 128MB| Gold
    Iceberg -->|Query| Athena
    Gold -->|Query| Athena

    %% Dashboard Export Flow
    GoldJob -.->|Job Complete| EventBridge
    EventBridge -->|Trigger| LambdaDash
    LambdaDash -->|Query Gold| Athena
    LambdaDash -->|Write JSON| S3

    %% Cross-Region Replication
    S3 -->|S3 CRR| S3DR

    %% Encryption
    KMS -.->|SSE-KMS| S3
    KMSDR -.->|SSE-KMS| S3DR
    KMSLake -.->|SSE-KMS| S3Lake

    %% Data Quality (Daily at 3 AM)
    EventBridge -->|Daily 3 AM| GlueDQ
    GlueDQ -->|Validate| Iceberg
    GlueDQ -->|Validate| Gold

    %% Observability Flow
    GlueJob -.->|Metrics| CWDashboard
    GoldJob -.->|Metrics| CWDashboard
    LambdaDash -.->|Metrics| CWDashboard
    CWAlarms -->|Alert| SNS
    SNS -->|Email| GoogleWorkspace

    %% Styling
    classDef aws fill:#FF9900,stroke:#232F3E,color:#232F3E
    classDef github fill:#24292E,stroke:#24292E,color:#fff
    classDef user fill:#4285F4,stroke:#1a73e8,color:#fff
    classDef external fill:#34A853,stroke:#1e8e3e,color:#fff
    classDef datalake fill:#8C4FFF,stroke:#232F3E,color:#fff
    classDef observability fill:#DD3522,stroke:#232F3E,color:#fff
    classDef security fill:#1A8FE3,stroke:#232F3E,color:#fff

    class Route53,ACM,CloudFront,CloudFrontDR,S3,S3DR aws
    class WAF,KMS,KMSDR,KMSLake security
    class Repo,Actions github
    class Browser user
    class GoogleWorkspace external
    class S3Lake,GlueCatalog,EventBridge,GlueJob,Iceberg,GoldJob,Gold,Athena,LambdaDash,GlueDQ datalake
    class CWDashboard,CWAlarms,CWHealthAlarm,SNS observability
    class LambdaDR datalake
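To make the Bronze → Silver step in the diagram more concrete, here is a minimal sketch of how tab-separated CloudFront access logs could be parsed and typed before being written to the Silver table. It is plain Python rather than the actual Glue Spark job, the sample log is hypothetical and trimmed to a handful of fields, and the output column names (`ts`, `edge_location`, `bytes_sent`, `uri`, `status`) are assumptions made for illustration.

```python
import io
from datetime import datetime, timezone
from typing import Dict, Iterator, List

# Hypothetical sample in the CloudFront standard log layout: '#' header lines
# followed by tab-separated rows. Real logs carry many more fields.
SAMPLE_LOG = (
    "#Version: 1.0\n"
    "#Fields: date time x-edge-location sc-bytes c-ip cs-method cs-uri-stem sc-status\n"
    "2024-01-01\t12:00:00\tFRA56-P1\t1024\t203.0.113.7\tGET\t/index.html\t200\n"
    "2024-01-01\t12:00:05\tFRA56-P1\t512\t203.0.113.8\tGET\t/missing\t404\n"
)


def parse_cloudfront_log(text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per request, keyed by the names in the '#Fields:' header."""
    fields: List[str] = []
    for line in io.StringIO(text):
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#"):
            values = line.split("\t")
            # Skip malformed rows instead of failing the whole batch.
            if len(values) == len(fields):
                yield dict(zip(fields, values))


def to_silver(row: Dict[str, str]) -> dict:
    """Cast raw string fields to typed columns (column names are assumptions)."""
    ts = datetime.strptime(f"{row['date']} {row['time']}", "%Y-%m-%d %H:%M:%S")
    return {
        "ts": ts.replace(tzinfo=timezone.utc),  # CloudFront logs timestamps in UTC
        "edge_location": row["x-edge-location"],
        "bytes_sent": int(row["sc-bytes"]),
        "uri": row["cs-uri-stem"],
        "status": int(row["sc-status"]),
    }


silver_rows = [to_silver(r) for r in parse_cloudfront_log(SAMPLE_LOG)]
```

Reading the column names from the `#Fields:` header, rather than hard-coding positions, keeps the parser tolerant of log files that carry additional fields.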
The Stack
- Ingest: CloudFront + S3 (Real-time)
- Storage: Apache Iceberg (Bronze/Silver)
- Compute: AWS Glue (Serverless Spark)
- IaC: Terraform (100% Code)
Key Metrics
| Metric | Value |
|---|---|
| Cost | < $1.00 / month (Idle) |
| Latency | < 5 mins (Log to Insight) |
| Format | Apache Iceberg (ACID) |
Why Iceberg?
Apache Iceberg brings warehouse-level reliability to the data lake. Schema evolution, time travel, and partition evolution come standard — no vendor lock-in required.
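As a rough illustration of time travel and schema evolution, the sketch below builds the corresponding statements as they could be submitted to Athena against an Iceberg table. The table name `lakehouse.silver_access_logs` and the column `country` are hypothetical, and the helpers only format strings; they are not part of this project's code.

```python
from datetime import datetime, timezone


def time_travel_query(table: str, as_of: datetime) -> str:
    """Build a SELECT that reads the table as it existed at `as_of`.

    `as_of` should be timezone-aware; it is normalized to UTC.
    """
    stamp = as_of.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return f"SELECT * FROM {table} FOR TIMESTAMP AS OF TIMESTAMP '{stamp} UTC'"


def add_column_statement(table: str, column: str, col_type: str) -> str:
    """Build a schema-evolution statement that adds one column."""
    return f"ALTER TABLE {table} ADD COLUMNS ({column} {col_type})"


# Hypothetical table and column names, for illustration only.
# The values are interpolated directly, so pass trusted identifiers only.
query = time_travel_query(
    "lakehouse.silver_access_logs",
    datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
)
ddl = add_column_statement("lakehouse.silver_access_logs", "country", "string")

print(query)
print(ddl)
```

Adding a column this way is a metadata change in Iceberg: existing data files are not rewritten, and older rows return null for the new column.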
The serverless compute layer (AWS Glue) means you only pay when data is being processed. During quiet periods, the entire pipeline costs fractions of a cent.
Lessons Learned
- Start with Bronze/Silver — Don’t over-engineer the medallion architecture upfront
- Iceberg compaction matters — Schedule regular compaction to keep query performance healthy
- Terraform everything — Infrastructure drift is the enemy of reproducibility
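
The compaction lesson can be sketched as a small helper that produces the maintenance statements for a scheduled run. This assumes compaction is executed through Athena's `OPTIMIZE` and `VACUUM` statements for Iceberg tables; the project may just as well use a Glue or Spark-based mechanism. The table name and the partition filter are hypothetical.

```python
from typing import List, Optional


def maintenance_statements(table: str, where: Optional[str] = None) -> List[str]:
    """Return the statements for one maintenance pass over an Iceberg table.

    OPTIMIZE rewrites small data files into larger ones (bin packing);
    VACUUM expires old snapshots and removes files that are no longer referenced.
    """
    optimize = f"OPTIMIZE {table} REWRITE DATA USING BIN_PACK"
    if where:
        # Restrict the rewrite to recent partitions to keep each run cheap.
        optimize += f" WHERE {where}"
    return [optimize, f"VACUUM {table}"]


# Hypothetical table name and filter, for illustration only.
statements = maintenance_statements(
    "lakehouse.silver_access_logs",
    where="log_date >= DATE '2024-01-01'",
)
for statement in statements:
    print(statement)
```

Each statement would be submitted as a separate query, for example from the same scheduler that already triggers the Glue jobs.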