Airflow + S3 Pipeline
Built a production-style Airflow 3.2.1 data ingestion pipeline with S3 using LocalStack and Docker — zero AWS credentials required.
Overview
A locally-running Airflow 3.2.1 pipeline that mirrors production topology: CeleryExecutor, separate service containers, and a medallion-architecture S3 layout (raw → processed → analytics), all backed by LocalStack instead of real AWS.
Problem
I wanted to learn how a real Airflow pipeline talks to S3, but I didn't want to spin up AWS resources, worry about costs, or accidentally leave something running that charges my credit card.
Constraints
- No AWS account or credentials — everything must run locally
- Airflow 3.x has significant breaking changes from 2.x, and most tutorials target 2.x
- Must mirror production topology, not just a toy LocalExecutor setup
Approach
Ran eight Docker containers (Postgres, Redis, LocalStack, and five Airflow 3.x services) using a customized version of the official Airflow docker-compose. Configured LocalStack as an S3 stand-in and set up three buckets following the medallion pattern. Built two DAGs: a scheduled ingestion pipeline and an event-driven sensor pipeline.
Key Decisions
Use LocalStack instead of real S3
LocalStack responds to the same API calls as real S3, enabling zero-cost, fast iteration with no credentials. The only configuration difference is the endpoint_url on the aws_default connection.
Alternatives considered:
- MinIO
- Real AWS with Free Tier
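A minimal sketch of what that single configuration difference could look like, assuming the connection is supplied through Airflow's `AIRFLOW_CONN_<CONN_ID>` environment variable in JSON form. The `localstack` hostname, port 4566 (LocalStack's default), and the region are assumptions, not values taken from the project; the `test`/`test` credentials match the local setup described later in this write-up.

```python
import json

# Assumption: "localstack" is the Compose service name and 4566 is the port
# LocalStack listens on. Inside a container, "localhost" would resolve to the
# container itself rather than to LocalStack.
LOCALSTACK_ENDPOINT = "http://localstack:4566"


def build_aws_default_connection(endpoint_url: str = LOCALSTACK_ENDPOINT) -> str:
    """Return a JSON definition for the aws_default connection."""
    connection = {
        "conn_type": "aws",
        "login": "test",      # dummy access key ID used for local development
        "password": "test",   # dummy secret access key
        "extra": {
            "region_name": "us-east-1",    # assumed region
            "endpoint_url": endpoint_url,  # the only LocalStack-specific setting
        },
    }
    return json.dumps(connection)


if __name__ == "__main__":
    # Print a line that could be pasted into an .env file.
    print(f"AIRFLOW_CONN_AWS_DEFAULT='{build_aws_default_connection()}'")
```

Dropping the `endpoint_url` key (and swapping the dummy credentials for a real credential source) is, in principle, all it would take to point the same connection at real S3.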
Use CeleryExecutor with Redis instead of LocalExecutor
CeleryExecutor mirrors production topology: separate worker containers, a real message broker, and the same connection issues you'd hit in staging. Debugging these locally means fewer surprises later.
Alternatives considered:
- LocalExecutor
Adopt the medallion architecture (raw → processed → analytics)
Organizing data into three buckets along the medallion pattern is a common layout on real data platforms. It gave me a realistic data flow to implement: parallel ingestion, fan-in aggregation, and quality gating.
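To make the layout concrete, here is a small sketch of a helper that builds object locations for each layer. The bucket names, the `dt=` date partitioning, and the `orders` dataset are all hypothetical; the project only states that there is one bucket per layer.

```python
from datetime import date

# Hypothetical bucket names, one per medallion layer.
BUCKETS = {
    "raw": "raw-data",
    "processed": "processed-data",
    "analytics": "analytics-data",
}


def layer_uri(layer: str, dataset: str, run_date: date, filename: str) -> str:
    """Build a date-partitioned S3 URI for one medallion layer."""
    if layer not in BUCKETS:
        raise ValueError(f"unknown layer: {layer!r}")
    return f"s3://{BUCKETS[layer]}/{dataset}/dt={run_date.isoformat()}/{filename}"


if __name__ == "__main__":
    # The same dataset and date map to a predictable location in every layer.
    for layer in ("raw", "processed", "analytics"):
        print(layer_uri(layer, "orders", date(2025, 1, 15), "orders.json"))
```

Keeping the path scheme identical across layers means a downstream task can derive its input location from its own output location by changing only the layer name.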
Tech Stack
- Apache Airflow 3.2.1
- Docker / Docker Compose
- LocalStack
- PostgreSQL
- Redis
- CeleryExecutor
- Python
Result & Impact
- 8 Docker containers
- 3 S3 buckets (medallion)
- 2 DAGs
Building this project taught me more about Airflow than reading documentation ever did. The split-service architecture in 3.x, the worker callback mechanism, the JWT auth flow — these are things you don't really understand until you break them and fix them. Doing it all locally with LocalStack meant I could iterate fast, break things without consequences, and actually see how the pieces fit together.
Learnings
- Major version upgrades often rename or remove familiar commands — always check the migration guide
- Distributed systems require explicit service discovery configuration; defaulting to localhost inside containers won't work
- Shared secrets must be explicitly synchronized across services — auto-generated secrets will silently mismatch
- Optional runtime context fields can be None on manual or non-standard executions — always handle the fallback case
- Templating engines only render in designated fields; assuming they work everywhere leads to silent bugs
- Reproducing production topology locally surfaces the same issues early, when they're cheap to fix
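The point about optional context fields can be sketched as follows. This assumes the field in question is `logical_date`, which is my guess rather than something stated above, and `resolve_partition_date` is a hypothetical helper, not part of Airflow.

```python
from datetime import datetime, timezone
from typing import Any, Mapping, Optional


def resolve_partition_date(
    context: Mapping[str, Any], now: Optional[datetime] = None
) -> str:
    """Pick a YYYY-MM-DD partition string, tolerating a missing logical_date."""
    logical_date = context.get("logical_date")
    if logical_date is not None:
        return logical_date.strftime("%Y-%m-%d")
    # Fallback for runs where the field is absent or None (for example,
    # a manually triggered run): use the supplied clock or the current UTC time.
    fallback = now or datetime.now(timezone.utc)
    return fallback.strftime("%Y-%m-%d")
```

Calling `.strftime()` directly on `context["logical_date"]` would raise `AttributeError` whenever the value is `None`, which is exactly the kind of failure that only shows up on non-scheduled runs.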
Architecture
The project runs eight Docker containers: Postgres for Airflow metadata, Redis as the Celery broker, LocalStack as an S3 stand-in, and five Airflow services (api-server, scheduler, dag-processor, worker, triggerer). The docker-compose setup is based on the official Airflow defaults with customizations.
The Pipeline
The main DAG (s3_data_ingestion) runs daily at 6am with three parallel ingest tasks, a fan-in analytics step, and a data quality gate. A second DAG (s3_sensor_ingestion) uses S3KeySensor for event-driven processing.
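The shape of the main DAG (parallel ingest, fan-in, then a gate) can be illustrated without Airflow. The sketch below is plain Python, not the project's DAG code: the source names, the fake rows, and the specific checks in `quality_gate` are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical source names standing in for the three ingest tasks.
SOURCES = ["orders", "customers", "products"]


def ingest(source: str) -> dict:
    """Stand-in for one ingest task: report how many rows it wrote."""
    rows = [{"source": source, "id": i} for i in range(3)]
    return {"source": source, "row_count": len(rows)}


def aggregate(results: list) -> dict:
    """Fan-in step: combine the outputs of all ingest tasks."""
    return {
        "sources": sorted(r["source"] for r in results),
        "total_rows": sum(r["row_count"] for r in results),
    }


def quality_gate(summary: dict, expected_sources: list) -> dict:
    """Fail loudly if a source is missing or no rows arrived."""
    missing = set(expected_sources) - set(summary["sources"])
    if missing:
        raise ValueError(f"missing sources: {sorted(missing)}")
    if summary["total_rows"] == 0:
        raise ValueError("no rows ingested")
    return summary


def run_pipeline() -> dict:
    # The three ingests are independent, so they can run side by side.
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        results = list(pool.map(ingest, SOURCES))
    # Aggregation starts only after every ingest has finished.
    return quality_gate(aggregate(results), SOURCES)
```

In an Airflow DAG each function would be its own task, with the scheduler and workers providing the parallelism instead of a thread pool, and a daily 6am schedule would typically be written as the cron expression `0 6 * * *`.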
Local vs Production
| Local dev | Production |
|---|---|
| LocalStack | Real S3 + IAM roles |
| test/test credentials | IRSA or instance profiles |
| Static JWT secret in .env | Vault or Secrets Manager |
| Single Postgres container | Managed Postgres with replicas |
| docker-compose | Kubernetes + Helm |
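For the static JWT secret in the local column, one way to produce a value that every Airflow service can share is to generate it once and write it to `.env`. This is a sketch under the assumption that the secret is read from the `[api_auth] jwt_secret` option via the environment variable below; check the variable name against the configuration reference for your Airflow version. `render_env_line` is a hypothetical helper.

```python
import secrets
from typing import Optional

# Assumption: this environment variable maps to the [api_auth] jwt_secret option.
ENV_VAR = "AIRFLOW__API_AUTH__JWT_SECRET"


def render_env_line(secret: Optional[str] = None) -> str:
    """Return an .env line holding one secret for all services to share."""
    # Generate a random URL-safe value only when none is supplied.
    value = secret or secrets.token_urlsafe(48)
    return f"{ENV_VAR}={value}"


if __name__ == "__main__":
    # Run once, paste the output into .env, and reference that file from
    # every Airflow service so they all sign and verify with the same key.
    print(render_env_line())
```

Generating the value once and injecting it everywhere avoids the silent mismatch that occurs when each container generates its own secret at startup.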
Source Code
The full project is available on GitHub at omgsian/airflow-s3-project.