Building an Airflow + S3 Pipeline That Runs Entirely on Your Laptop

I wanted to learn how a real Airflow pipeline talks to S3, but I didn’t want to spin up AWS resources, worry about costs, or accidentally leave something running that charges my credit card. So I built the whole thing locally using Docker and LocalStack, and it ended up being way more educational than I expected.

Here’s what I learned.

The setup: Airflow 3.2.1 + CeleryExecutor + LocalStack

```mermaid
graph TB
    subgraph Airflow Services
        API["api-server\n(UI + REST API)"]
        SCHED[scheduler]
        DAGPROC[dag-processor]
        WORKER[worker]
        TRIGGER[triggerer]
    end

    PG[("Postgres\n(metadata db)")]
    REDS[("Redis\n(Celery broker)")]

    subgraph LocalStack
        S3["S3 API\n(endpoint_url)"]
    end

    subgraph S3 Buckets - Medallion Pattern
        RAW["raw-data-bucket\n(CSV / JSON)"]
        PROC["processed-data-bucket\n(cleaned + enriched)"]
        ANAL["analytics-data-bucket\n(aggregated summaries)"]
    end

    DAGPROC -->|"parses DAG files"| SCHED
    SCHED -->|"schedules tasks"| WORKER
    WORKER -->|"Task SDK callbacks"| API
    TRIGGER -->|"deferred tasks"| WORKER
    REDS -->|"broker messages"| WORKER
    API --- PG
    SCHED --- PG
    WORKER --- PG

    WORKER -->|"S3Hook / S3KeySensor"| S3
    S3 --> RAW
    S3 --> PROC
    S3 --> ANAL

    RAW -->|"ingest_*"| PROC
    PROC -->|"generate_analytics_summary"| ANAL
```

The project runs eight Docker containers: Postgres for Airflow metadata, Redis as the Celery broker, LocalStack pretending to be S3, and then five Airflow services (api-server, scheduler, dag-processor, worker, triggerer). That sounds like a lot, but it mirrors what a production Airflow cluster actually looks like. I didn’t write the bulk of the Docker setup myself: I started from the defaults that Apache Airflow provides and customised them.

Airflow 3.x replaces the 2.x webserver with an api-server and runs DAG parsing as its own dag-processor service. The api-server handles the UI and REST API, the dag-processor parses your DAG files, the scheduler decides what to run, the worker executes tasks, and the triggerer handles async deferred tasks. Each one runs in its own container. This matters because the connection issues that would bite you in production also bite you locally. It do be nasty.

Because I’m too poor to afford to pay for AWS (joking, it’s not too bad to run this myself), I chose LocalStack, which mocks AWS services, runs locally, and responds to the same API calls as real S3. My Airflow S3Hook and S3KeySensor talk to it instead of AWS, which means no real credentials, zero cost, and fast iteration. The trick is configuring the aws_default Airflow connection with an endpoint_url pointing at LocalStack. Without that, everything tries to hit real AWS and fails.
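One way to set that connection is through an environment variable, since Airflow accepts a JSON-serialized connection in `AIRFLOW_CONN_<CONN_ID>`. The sketch below builds that value. The hostname `localstack`, port 4566 (LocalStack's default edge port), and the region are assumptions about the compose file, not details taken from this project.

```python
import json


def localstack_aws_conn(endpoint_url: str = "http://localstack:4566") -> str:
    """Return a JSON connection string for the aws_default connection."""
    conn = {
        "conn_type": "aws",
        # Dummy credentials; LocalStack does not validate them by default.
        "login": "test",
        "password": "test",
        "extra": {
            # Without endpoint_url, boto3 resolves the real AWS endpoint.
            "endpoint_url": endpoint_url,
            # Assumed region; any valid region name works for local use.
            "region_name": "us-east-1",
        },
    }
    return json.dumps(conn)


if __name__ == "__main__":
    # Paste the printed line into the environment shared by the Airflow services.
    print(f"AIRFLOW_CONN_AWS_DEFAULT='{localstack_aws_conn()}'")
```

Setting the connection through the environment keeps it identical across the scheduler, worker, and triggerer containers without clicking through the UI.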

The pipeline: raw -> processed -> analytics

I set up three S3 buckets following the medallion pattern (raw, processed, analytics) because that’s how most real data platforms organize things:

raw-data-bucket: landing zone for source CSV and JSON files
processed-data-bucket: cleaned and enriched data with metadata attached
analytics-data-bucket: aggregated summaries ready for dashboards

The main DAG (s3_data_ingestion) runs daily at 6am and looks like this:

```mermaid
graph LR
    LIST[list_raw_data] --> ING_C[ingest_customers]
    LIST --> ING_T[ingest_transactions]
    LIST --> ING_P[ingest_products]
    ING_C --> GEN[generate_analytics_summary]
    ING_T --> GEN
    ING_P --> GEN
    GEN --> DQ[data_quality_check]
```

The three ingest tasks run in parallel since they’re independent. Then generate_analytics_summary fans in all three outputs and computes things like total revenue, transaction counts by status, and customers by country. Finally, data_quality_check validates that records exist, required fields are present, and product prices are positive. If anything fails, the whole DAG run fails loudly, which is exactly what you want.
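As a rough illustration of those last two steps, here is the kind of logic involved, written as plain Python with no Airflow or S3 in the picture. The field names (`amount`, `status`, `country`, `id`, `price`) are assumptions about the sample data, not the project's actual schema.

```python
from collections import Counter


def summarize(transactions: list[dict], customers: list[dict]) -> dict:
    """Aggregate revenue, status counts, and customers per country."""
    return {
        "total_revenue": round(sum(float(t["amount"]) for t in transactions), 2),
        "transactions_by_status": dict(Counter(t["status"] for t in transactions)),
        "customers_by_country": dict(Counter(c["country"] for c in customers)),
    }


def quality_check(
    records: list[dict],
    required: tuple[str, ...],
    positive: tuple[str, ...] = (),
) -> None:
    """Raise ValueError on the first failed check so the calling task fails."""
    if not records:
        raise ValueError("no records found")
    # Fields that must be positive are implicitly required as well.
    must_exist = tuple(required) + tuple(positive)
    for i, rec in enumerate(records):
        missing = [f for f in must_exist if rec.get(f) in (None, "")]
        if missing:
            raise ValueError(f"record {i} is missing required fields: {missing}")
        for f in positive:
            if float(rec[f]) <= 0:
                raise ValueError(f"record {i} has non-positive {f}: {rec[f]}")


if __name__ == "__main__":
    products = [{"id": 1, "price": "9.99"}, {"id": 2, "price": "24.00"}]
    quality_check(products, required=("id",), positive=("price",))
    print("products passed")
```

Raising an exception is what makes the failure loud: an uncaught exception in a task marks the task as failed, and the DAG run fails with it.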

There’s also a second DAG (s3_sensor_ingestion) that uses S3KeySensor to wait for specific files to appear in S3 before processing them. It’s an event-driven pattern instead of a scheduled one, useful when you don’t control when upstream data arrives.
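Conceptually, a sensor in its default poke mode is a check that gets retried until it succeeds or a timeout expires. The sketch below shows that loop in plain Python. It is not how `S3KeySensor` is implemented, and `key_exists` is a hypothetical callable standing in for the S3 lookup; only the `poke_interval` and `timeout` names mirror real sensor parameters.

```python
import time
from typing import Callable


def wait_for_key(
    key_exists: Callable[[str], bool],
    key: str,
    poke_interval: float = 30.0,
    timeout: float = 3600.0,
) -> None:
    """Block until key_exists(key) is true, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while not key_exists(key):
        # Give up if the next check would land past the deadline.
        if time.monotonic() + poke_interval > deadline:
            raise TimeoutError(f"{key} did not appear within {timeout} seconds")
        time.sleep(poke_interval)
```

The real sensor adds things this loop lacks, such as freeing the worker slot between checks, but the control flow is the same idea: downstream processing starts only once the file exists.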

The gotchas

Airflow 3.x has some significant differences from 2.x, and most tutorials you find online are still for 2.x. Here are the things that tripped me up:

airflow db init doesn’t exist anymore. Use airflow db migrate.

Workers need to call back to the API server. This is new in Airflow 3.x. The Task SDK in the worker makes HTTP calls back to the api-server during execution. If you don’t set AIRFLOW__CORE__EXECUTION_API_SERVER_URL, the worker tries localhost:8080 inside its own container, which has nothing listening. Tasks fail with connection refused errors that look like a network problem but are really a configuration problem.

JWT secrets must match across all services. If you don’t explicitly set AIRFLOW__API_AUTH__JWT_SECRET, each container generates its own random secret. The worker signs tokens with one key, the api-server verifies with a different key, and you get “Invalid auth token: Signature verification failed.” This one took a while to figure out.
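Both of these problems come down to environment variables that must be set, and set consistently, across the Airflow containers. A small pre-flight check along these lines can catch them before any task runs. It is a hypothetical helper, not part of Airflow, and the service names and the `airflow-apiserver` hostname in the usage note are assumptions about the compose file.

```python
from urllib.parse import urlparse

URL_VAR = "AIRFLOW__CORE__EXECUTION_API_SERVER_URL"
JWT_VAR = "AIRFLOW__API_AUTH__JWT_SECRET"


def check_airflow_env(
    env_by_service: dict[str, dict[str, str]],
    task_runners: tuple[str, ...] = ("worker",),
) -> list[str]:
    """Return human-readable problems found in the per-service environments."""
    problems = []
    seen_secrets = set()
    for service, env in env_by_service.items():
        secret = env.get(JWT_VAR)
        if secret:
            seen_secrets.add(secret)
        else:
            problems.append(f"{service}: {JWT_VAR} is not set")
    # Every service must sign and verify tokens with the same key.
    if len(seen_secrets) > 1:
        problems.append(f"{JWT_VAR} differs between services")
    for service in task_runners:
        url = env_by_service.get(service, {}).get(URL_VAR)
        if not url:
            problems.append(f"{service}: {URL_VAR} is not set")
        elif urlparse(url).hostname in ("localhost", "127.0.0.1"):
            # Inside a container, localhost is the container itself.
            problems.append(f"{service}: {URL_VAR} points at localhost")
    return problems
```

With a shared secret on every service and a worker URL such as `http://airflow-apiserver:8080/execution/`, the function returns an empty list.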

context['ds'] can be missing on manual runs. In Airflow 3.x, logical_date is optional for manually triggered runs, and when it is None the keys derived from it, such as ds, are left out of the context. If your tasks rely on context['ds'] and someone clicks “Trigger DAG” in the UI, you get a KeyError. I wrote a small helper that falls back through logical_date, data_interval_start, and today’s date.
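A minimal sketch of such a helper, treating the task context as a plain mapping. The name `resolve_ds` is made up for illustration, and the helper in the actual project may differ in detail.

```python
from datetime import datetime, timezone
from typing import Any, Mapping


def resolve_ds(context: Mapping[str, Any]) -> str:
    """Return a YYYY-MM-DD string even when the run has no logical date."""
    # Use ds when Airflow provided it (scheduled runs).
    ds = context.get("ds")
    if ds:
        return ds
    # Otherwise fall back through the datetime-valued context keys.
    for key in ("logical_date", "data_interval_start"):
        value = context.get(key)
        if value is not None:
            return value.strftime("%Y-%m-%d")
    # Last resort: today's date in UTC.
    return datetime.now(timezone.utc).strftime("%Y-%m-%d")
```

Inside a task, calling `resolve_ds(context)` instead of indexing `context["ds"]` keeps manual and scheduled runs on the same code path.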

Jinja templates in params don’t get rendered inside Python callables. I initially tried {{ params.prefix or '' }} inside a params dict, but values in params aren’t rendered as templates, so the Jinja string was passed literally as the S3 prefix. The fix was to just read context["params"]["prefix"] directly.
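The fix can be sketched as a callable that pulls the value out of the context itself. The function names and the sample keys below are hypothetical; only the `params` and `prefix` names come from the text above.

```python
from typing import Any, Mapping


def resolve_prefix(context: Mapping[str, Any]) -> str:
    """Plain-Python equivalent of the intended `{{ params.prefix or '' }}`."""
    params = context.get("params") or {}
    return params.get("prefix") or ""


def filter_keys(keys: list[str], **context: Any) -> list[str]:
    """Keep only the object keys that start with the run's prefix."""
    prefix = resolve_prefix(context)
    # An empty prefix matches every key, like listing the whole bucket.
    return [k for k in keys if k.startswith(prefix)]
```

Because the value is read at execution time, a prefix supplied when triggering the run is picked up without any templating involved.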

What makes this “production-style”

It’s not production-ready; you wouldn’t run docker-compose up and call it a day. But the topology is production-style:

CeleryExecutor with a real Redis broker (not LocalExecutor)
Separate containers for each Airflow service (not a monolithic webserver)
Proper health checks on every service with startup ordering via depends_on
A custom Dockerfile that bakes in dependencies instead of pip-installing at runtime

The idea is that when you eventually move to Kubernetes with the official Helm chart, the same connection issues and configuration requirements apply. Debugging the JWT problem locally means you won’t be surprised when it happens in staging.

Or have your boss pay for managed Airflow, because who’s got the time to manage infrastructure on their own?

What I’d change for real AWS

The biggest differences between this local setup and a real production deployment:

| Local dev | Production |
| --- | --- |
| LocalStack | Real S3 + IAM roles |
| `test/test` credentials | IRSA or instance profiles |
| Static JWT secret in `.env` | Vault or Secrets Manager |
| Single Postgres container | Managed Postgres with replicas |
| `docker-compose` | Kubernetes + Helm |

The docker-compose stack is an accurate mirror of the production structure, which is what makes it useful. You’re not just running Airflow, you’re running it the way it runs in production, just smaller.

Why this was worth doing

Building this project taught me more about Airflow than reading documentation ever did. The split-service architecture in 3.x, the worker callback mechanism, the JWT auth flow: these are things you don’t really understand until you break them and fix them. And doing it all locally with LocalStack meant I could iterate fast, break things without consequences, and actually see how the pieces fit together.