Serverless Data Lakehouse

  • Luis de Sousa
  • Feb 10, 2026

A production-grade data lakehouse designed for low cost (under $1.00 per month when idle) and scalability. The architecture ingests raw CloudFront edge logs, processes them with serverless ETL on AWS Glue, and stores them in Apache Iceberg format for ACID-compliant analytics.

Architecture

flowchart TB
    subgraph Users
        Browser[Browser]
    end
    subgraph GitHub
        Repo[GitHub Repository]
        Actions[GitHub Actions]
    end
    subgraph AWS_Global["AWS Global"]
        WAF[AWS WAF<br/>Web ACL]
        Route53[Route53<br/>DNS]
        ACM[ACM Certificate<br/>us-east-1]
        CloudFront[CloudFront Pro<br/>Primary Distribution]
        CloudFrontDR[CloudFront Pro<br/>DR Distribution]
        LambdaDR[Lambda<br/>DR Failover]
        CWHealthAlarm[CloudWatch Alarm<br/>Health Check]
    end
    subgraph AWS_EU["AWS eu-central-1"]
        S3[S3 Bucket<br/>allaboutdata.eu]
        KMS[KMS<br/>Website CMK]
        KMSLake[KMS<br/>Datalake CMK]
    end
    subgraph AWS_DR["AWS eu-central-2 (DR)"]
        S3DR[S3 Bucket<br/>allaboutdata.eu-dr]
        KMSDR[KMS<br/>DR Customer Managed Key]
    end
    subgraph DataLake["AWS Data Lake"]
        S3Lake[S3 Data Lake Bucket]
        GlueCatalog[Glue Data Catalog<br/>Bronze Table]
        EventBridge[EventBridge<br/>Scheduler]
        GlueJob[Glue Spark Job<br/>Bronze → Silver]
        Iceberg[Iceberg Warehouse<br/>Silver Table]
        GoldJob[Glue Spark Job<br/>Silver → Gold]
        Gold[Gold Warehouse<br/>Parquet Aggregations]
        Athena[Athena<br/>SQL Analytics]
        LambdaDash[Lambda<br/>Dashboard Export]
        GlueDQ[Glue Data Quality<br/>DQDL Rulesets]
    end
    subgraph Observability["AWS Observability"]
        CWDashboard[CloudWatch<br/>Dashboard]
        CWAlarms[CloudWatch<br/>Alarms]
        SNS[SNS<br/>Email Alerts]
    end
    subgraph External
        GoogleWorkspace[Google Workspace<br/>Email]
    end

    %% User Flow
    Browser -->|HTTPS Request| Route53
    Route53 -->|DNS Resolution| CloudFront
    WAF -->|Protection| CloudFront
    WAF -->|Protection| CloudFrontDR
    CloudFront -->|Origin| S3
    CloudFrontDR -->|Origin| S3DR
    ACM -.->|TLS Certificate| CloudFront
    ACM -.->|TLS Certificate| CloudFrontDR

    %% DR Failover Flow
    Route53 -.->|Health Check| CWHealthAlarm
    CWHealthAlarm -->|SNS → ALARM/OK| LambdaDR
    LambdaDR -.->|AssociateAlias| CloudFront
    LambdaDR -.->|AssociateAlias| CloudFrontDR

    %% CI/CD Flow
    Repo -->|Push to main| Actions
    Actions -->|S3 Sync| S3
    Actions -->|Cache Invalidation| CloudFront

    %% Email (Google Workspace)
    Route53 -.->|MX Record| GoogleWorkspace

    %% Data Lake Flow
    CloudFront -->|Access Logs| S3Lake
    S3Lake -->|TSV Logs| GlueCatalog
    S3Lake -.->|S3 Events| EventBridge
    EventBridge -->|6h Trigger| GlueJob
    EventBridge -->|Daily Trigger| GoldJob
    GlueCatalog -->|Incremental Read| GlueJob
    GlueJob -->|Write Iceberg V2| Iceberg
    Iceberg -->|Delta Read| GoldJob
    GoldJob -->|Write Parquet 128MB| Gold
    Iceberg -->|Query| Athena
    Gold -->|Query| Athena

    %% Dashboard Export Flow
    GoldJob -.->|Job Complete| EventBridge
    EventBridge -->|Trigger| LambdaDash
    LambdaDash -->|Query Gold| Athena
    LambdaDash -->|Write JSON| S3

    %% Cross-Region Replication
    S3 -->|S3 CRR| S3DR

    %% Encryption
    KMS -.->|SSE-KMS| S3
    KMSDR -.->|SSE-KMS| S3DR
    KMSLake -.->|SSE-KMS| S3Lake

    %% Data Quality (Daily at 3 AM)
    EventBridge -->|Daily 3 AM| GlueDQ
    GlueDQ -->|Validate| Iceberg
    GlueDQ -->|Validate| Gold

    %% Observability Flow
    GlueJob -.->|Metrics| CWDashboard
    GoldJob -.->|Metrics| CWDashboard
    LambdaDash -.->|Metrics| CWDashboard
    CWAlarms -->|Alert| SNS
    SNS -->|Email| GoogleWorkspace

    %% Styling
    classDef aws fill:#FF9900,stroke:#232F3E,color:#232F3E
    classDef github fill:#24292E,stroke:#24292E,color:#fff
    classDef user fill:#4285F4,stroke:#1a73e8,color:#fff
    classDef external fill:#34A853,stroke:#1e8e3e,color:#fff
    classDef datalake fill:#8C4FFF,stroke:#232F3E,color:#fff
    classDef observability fill:#DD3522,stroke:#232F3E,color:#fff
    classDef security fill:#1A8FE3,stroke:#232F3E,color:#fff
    class Route53,ACM,CloudFront,CloudFrontDR,S3,S3DR aws
    class WAF,KMS,KMSDR,KMSLake security
    class Repo,Actions github
    class Browser user
    class GoogleWorkspace external
    class S3Lake,GlueCatalog,EventBridge,GlueJob,Iceberg,GoldJob,Gold,Athena,LambdaDash,GlueDQ datalake
    class CWDashboard,CWAlarms,CWHealthAlarm,SNS observability
    class LambdaDR datalake
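
The DR failover path in the diagram (health-check alarm, SNS, Lambda, AssociateAlias) could look roughly like the sketch below. It is not the project's actual handler. The distribution IDs, the alias, and the function names are hypothetical, and it assumes the SNS message is a standard CloudWatch alarm notification carrying a NewStateValue field. The CloudFront client is injected so the logic can be exercised without AWS credentials; with boto3 the call used is the client's associate_alias(TargetDistributionId=..., Alias=...).

```python
import json

# Hypothetical identifiers for illustration only; not taken from the project.
PRIMARY_DISTRIBUTION_ID = "E_PRIMARY_EXAMPLE"
DR_DISTRIBUTION_ID = "E_DR_EXAMPLE"
ALIAS = "allaboutdata.eu"  # assumed from the bucket name in the diagram


def pick_target(alarm_state):
    """Map a CloudWatch alarm state to the distribution that should own the alias."""
    if alarm_state == "ALARM":
        return DR_DISTRIBUTION_ID
    if alarm_state == "OK":
        return PRIMARY_DISTRIBUTION_ID
    return None  # e.g. INSUFFICIENT_DATA: leave routing untouched


def handler(event, context=None, cloudfront=None):
    """Lambda-style entry point for an SNS-delivered CloudWatch alarm notification."""
    if cloudfront is None:
        import boto3  # imported lazily so the sketch runs without AWS installed

        cloudfront = boto3.client("cloudfront")

    moved = []
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        target = pick_target(message.get("NewStateValue"))
        if target is None:
            continue
        # Move the alternate domain name to the chosen distribution.
        cloudfront.associate_alias(TargetDistributionId=target, Alias=ALIAS)
        moved.append(target)
    return moved
```

Keeping the state-to-target mapping in its own function makes the failback case (alarm returns to OK) as easy to test as the failover case.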

The Stack

  • Ingest: CloudFront + S3 (Real-time)
  • Storage: S3 with raw TSV logs (Bronze), Apache Iceberg (Silver), and Parquet aggregations (Gold)
  • Compute: AWS Glue (Serverless Spark)
  • IaC: Terraform (100% Code)
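
To make the ingest-to-compute handoff concrete, here is a minimal sketch of the kind of parsing a Bronze to Silver step has to do on CloudFront standard access logs, which are tab-separated with a "#Fields:" header line. It is plain Python rather than the project's Glue Spark job, the output column names (ts, edge_location, path, status, bytes) are hypothetical, and the sample uses only the first nine fields of the real format for brevity.

```python
def parse_cloudfront_log(lines):
    """Yield one dict per request from CloudFront standard (TSV) access log lines."""
    fields = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#Fields:"):
            # Header names are space-separated; data rows are tab-separated.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#"):
            continue  # skip "#Version" and blank lines
        values = line.split("\t")
        # CloudFront writes "-" for missing values; normalise those to None.
        yield {k: (None if v == "-" else v) for k, v in zip(fields, values)}


def to_silver(record):
    """Cast a raw (Bronze) record into typed columns. Column names are hypothetical."""
    return {
        "ts": f"{record['date']}T{record['time']}Z",  # log timestamps are UTC
        "edge_location": record["x-edge-location"],
        "path": record["cs-uri-stem"],
        "status": int(record["sc-status"]),
        "bytes": int(record["sc-bytes"]),
    }


SAMPLE = [
    "#Version: 1.0\n",
    "#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status\n",
    "2026-02-10\t12:00:01\tFRA56-P1\t2048\t203.0.113.7\tGET\t"
    "d111111abcdef8.cloudfront.net\t/index.html\t200\n",
]

silver_rows = [to_silver(r) for r in parse_cloudfront_log(SAMPLE)]
```

Reading the field names from the header rather than hard-coding positions keeps the parser tolerant of fields being appended to the log format.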

Key Metrics

Metric    Value
Cost      < $1.00 / month (idle)
Latency   < 5 min (log to insight)
Format    Apache Iceberg (ACID)

Why Iceberg?

Apache Iceberg brings warehouse-level reliability to the data lake. Schema evolution, time travel, and partition evolution come standard, and because Iceberg is an open table format, none of it requires vendor lock-in.
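
As an illustration of time travel, the sketch below builds the Athena SQL for reading an Iceberg table as of a past instant or a specific snapshot, using Athena's FOR TIMESTAMP AS OF and FOR VERSION AS OF clauses. The table name lakehouse.silver_logs and the helper names are hypothetical. The helpers interpolate the table name directly, so they are only suitable for trusted input.

```python
from datetime import datetime, timezone


def time_travel_query(table, as_of, columns="*"):
    """Build an Athena SELECT that reads an Iceberg table as of a past instant."""
    if as_of.tzinfo is None:
        raise ValueError("as_of must be timezone-aware")
    stamp = as_of.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return f"SELECT {columns} FROM {table} FOR TIMESTAMP AS OF TIMESTAMP '{stamp} UTC'"


def version_travel_query(table, snapshot_id, columns="*"):
    """Build an Athena SELECT that reads a specific Iceberg snapshot."""
    return f"SELECT {columns} FROM {table} FOR VERSION AS OF {int(snapshot_id)}"


# Hypothetical table name, for illustration only.
query = time_travel_query(
    "lakehouse.silver_logs", datetime(2026, 2, 1, tzinfo=timezone.utc)
)
```

A query like this is handy after a bad load: compare the table as of yesterday with the current state before deciding whether to roll back.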

The serverless compute layer (AWS Glue) means you only pay when data is being processed. During quiet periods, the entire pipeline costs fractions of a cent.

Lessons Learned

  1. Start with Bronze/Silver — Don’t over-engineer the medallion architecture upfront
  2. Iceberg compaction matters — Schedule regular compaction to keep query performance healthy
  3. Terraform everything — Infrastructure drift is the enemy of reproducibility
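
To show why compaction matters, the following sketch plans a bin-pack style rewrite: it groups small data files into batches of at most a target size, so many small reads become a few large ones. This is a simplified first-fit-decreasing illustration, not Iceberg's actual planner. The function name is hypothetical, and the 128 MB target is borrowed from the Gold write step in the diagram.

```python
MB = 1024 * 1024
TARGET_BYTES = 128 * MB  # assumed target, mirroring the 128 MB Parquet files in the diagram


def plan_compaction(file_sizes, target=TARGET_BYTES):
    """Group small files into rewrite batches of at most `target` bytes.

    Uses first-fit decreasing: place each file, largest first, into the
    first batch that still has room.
    """
    bins = []  # each entry is [total_bytes, [file sizes]]
    for size in sorted(file_sizes, reverse=True):
        if size >= target:
            continue  # already at or above the target; leave it alone
        for batch in bins:
            if batch[0] + size <= target:
                batch[0] += size
                batch[1].append(size)
                break
        else:
            bins.append([size, [size]])
    # A batch holding a single file gains nothing from a rewrite.
    return [files for _, files in bins if len(files) > 1]


plan = plan_compaction([100 * MB, 20 * MB, 8 * MB, 200 * MB, 60 * MB, 60 * MB])
# plan groups the five small files into two batches; the 200 MB file is untouched
```

In practice the engine does this planning for you. In Athena, for example, the statement is OPTIMIZE table_name REWRITE DATA USING BIN_PACK, which can be run on a schedule.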