Airflow + S3 Pipeline

Data Engineer · 2026 · 2 weeks · 3 min read

Built a production-style Airflow 3.2.1 data ingestion pipeline against S3 using LocalStack and Docker, with no real AWS account or credentials required.

Overview

A locally running Airflow 3.2.1 pipeline that mirrors production topology: CeleryExecutor, separate service containers, and a medallion-architecture S3 layout (raw → processed → analytics), all backed by LocalStack instead of real AWS.

Problem

I wanted to learn how a real Airflow pipeline talks to S3, but I didn't want to spin up AWS resources, worry about costs, or accidentally leave something running that charges my credit card.

Constraints

  • No AWS account or credentials — everything must run locally
  • Airflow 3.x has significant breaking changes from 2.x, and most tutorials target 2.x
  • Must mirror production topology, not just a toy LocalExecutor setup

Approach

Ran eight Docker containers (Postgres, Redis, LocalStack, and five Airflow 3.x services) using a customized version of the official Airflow docker-compose. Configured LocalStack as an S3 stand-in and set up three buckets following the medallion pattern. Built two DAGs: a scheduled ingestion pipeline and an event-driven sensor pipeline.

Key Decisions

Use LocalStack instead of real S3

Reasoning:

LocalStack emulates the S3 API, so the same client calls work against it as against real S3, enabling zero-cost, fast iteration with no real credentials. The main configuration difference is the endpoint_url on the aws_default connection.

Alternatives considered:
  • MinIO
  • Real AWS with Free Tier
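As a rough illustration of the endpoint_url point, here is a minimal sketch of building the aws_default connection as a JSON value for the AIRFLOW_CONN_AWS_DEFAULT environment variable. The hostname localstack, the region, and the helper name are assumptions for illustration; 4566 is LocalStack's default port, and test/test are the dummy credentials used locally.

```python
import json


def localstack_aws_conn(
    endpoint_url: str = "http://localstack:4566",  # assumed service hostname
    region_name: str = "us-east-1",                # assumed region
) -> str:
    """Build a JSON connection definition that points Airflow's AWS hooks at LocalStack."""
    return json.dumps(
        {
            "conn_type": "aws",
            "login": "test",      # dummy access key accepted by LocalStack
            "password": "test",   # dummy secret key
            "extra": {
                "endpoint_url": endpoint_url,
                "region_name": region_name,
            },
        }
    )


# Airflow reads connections from AIRFLOW_CONN_<CONN_ID> environment variables.
env = {"AIRFLOW_CONN_AWS_DEFAULT": localstack_aws_conn()}
```

Pointing the same connection at real AWS would mean dropping endpoint_url and replacing the dummy credentials with a proper credential source.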

Use CeleryExecutor with Redis instead of LocalExecutor

Reasoning:

CeleryExecutor mirrors production topology: separate worker containers, a real message broker, and the same connection issues you'd hit in staging. Debugging these locally means fewer surprises later.

Alternatives considered:
  • LocalExecutor

Adopt the medallion architecture (raw → processed → analytics)

Reasoning:

Splitting storage into three buckets along the medallion pattern is a common way for real data platforms to organize things. It gave me a realistic data flow to implement: parallel ingestion, fan-in aggregation, and quality gating.
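To make the layout concrete, here is a small sketch of how objects might be addressed across the three layers. The bucket names, the dt= date partition, and the helper names are all hypothetical; the project itself only specifies that there are three buckets for raw, processed, and analytics data.

```python
from datetime import date

# Hypothetical bucket names, one per medallion layer.
BUCKETS = {
    "raw": "raw-data",
    "processed": "processed-data",
    "analytics": "analytics-data",
}


def object_key(dataset: str, run_date: date, filename: str) -> str:
    """Build a date-partitioned key such as orders/dt=2026-01-05/part-0.json."""
    return f"{dataset}/dt={run_date.isoformat()}/{filename}"


def s3_uri(layer: str, dataset: str, run_date: date, filename: str) -> str:
    """Return the full s3:// URI for an object in the given layer."""
    if layer not in BUCKETS:
        raise ValueError(f"unknown layer: {layer!r}")
    return f"s3://{BUCKETS[layer]}/{object_key(dataset, run_date, filename)}"
```

Keeping the key scheme identical across layers means a file can be traced from raw to analytics by swapping only the bucket.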

Tech Stack

  • Apache Airflow 3.2.1
  • Docker / Docker Compose
  • LocalStack
  • PostgreSQL
  • Redis
  • CeleryExecutor
  • Python

Result & Impact

  • 8 Docker containers
  • 3 S3 buckets (medallion)
  • 2 DAGs

Building this project taught me more about Airflow than reading documentation ever did. The split-service architecture in 3.x, the worker callback mechanism, the JWT auth flow — these are things you don't really understand until you break them and fix them. Doing it all locally with LocalStack meant I could iterate fast, break things without consequences, and actually see how the pieces fit together.

Learnings

  • Major version upgrades often rename or remove familiar commands — always check the migration guide
  • Distributed systems require explicit service discovery configuration; defaulting to localhost inside containers won't work
  • Shared secrets must be explicitly synchronized across services — auto-generated secrets will silently mismatch
  • Optional runtime context fields can be None on manual or non-standard executions — always handle the fallback case
  • Templating engines only render in designated fields; assuming they work everywhere leads to silent bugs
  • Reproducing production topology locally surfaces the same issues early, when they're cheap to fix
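The point about optional context fields can be sketched in plain Python. This is an illustration, not the project's code: the function name is hypothetical, and which field was actually None in the project is an assumption. data_interval_start and logical_date are standard Airflow context keys.

```python
from datetime import datetime, timezone


def resolve_run_date(context: dict) -> str:
    """Return a YYYY-MM-DD partition date, tolerating missing or None fields."""
    # Try the usual context fields in order of preference.
    for field in ("data_interval_start", "logical_date"):
        value = context.get(field)
        if value is not None:
            return value.strftime("%Y-%m-%d")
    # Fallback for manual or non-standard runs where neither field is set.
    return datetime.now(timezone.utc).strftime("%Y-%m-%d")
```

Inside a task, this would be called with the runtime context so that a manually triggered run still produces a usable partition date instead of failing on None.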

Architecture

The project runs eight Docker containers: Postgres for Airflow metadata, Redis as the Celery broker, LocalStack as an S3 stand-in, and five Airflow services (api-server, scheduler, dag-processor, worker, triggerer). The docker-compose setup is based on the official Airflow defaults with customizations.

The Pipeline

The main DAG (s3_data_ingestion) runs daily at 6am with three parallel ingest tasks, a fan-in analytics step, and a data quality gate. A second DAG (s3_sensor_ingestion) uses S3KeySensor for event-driven processing.
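The dependency shape of the main DAG can be modeled without Airflow at all. The sketch below uses the standard library's graphlib to show which tasks can run together; the task names are hypothetical, since only the shape (three parallel ingests, a fan-in step, then a quality gate) is given here.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical names).
DEPENDENCIES = {
    "ingest_a": set(),
    "ingest_b": set(),
    "ingest_c": set(),
    "build_analytics": {"ingest_a", "ingest_b", "ingest_c"},  # fan-in
    "quality_gate": {"build_analytics"},
}


def execution_waves(dependencies: dict) -> list:
    """Group tasks into waves; tasks within one wave have no unmet dependencies."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        waves.append(ready)
        sorter.done(*ready)                 # mark the wave as finished
    return waves


waves = execution_waves(DEPENDENCIES)
# waves == [["ingest_a", "ingest_b", "ingest_c"], ["build_analytics"], ["quality_gate"]]
```

In Airflow terms, the first wave corresponds to tasks a CeleryExecutor can hand to workers concurrently, while the later waves wait on upstream success.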

Local vs Production

Local dev                    Production
LocalStack                   Real S3 + IAM roles
test/test credentials        IRSA or instance profiles
Static JWT secret in .env    Vault or Secrets Manager
Single Postgres container    Managed Postgres with replicas
docker-compose               Kubernetes + Helm

Source Code

The full project is available on GitHub at omgsian/airflow-s3-project.