Parquet and Iceberg: Why I Stopped Treating My Data Lake Like a Folder

My early data lakes were just folders of CSVs on S3. They worked — barely. Queries were slow, schemas drifted without warning, and there was no way to roll back a bad write. I treated S3 like a hard drive and paid the price every time I needed to answer a question.

Parquet was the first upgrade. It’s a columnar file format: the same principle behind Redshift’s performance, but applied to files. Values from the same column are stored together, compressed together, and read together. A 2GB CSV of order data might shrink to 200MB in Parquet, and querying a single column can be ten times faster because the engine reads only that column’s data and skips the rest. Parquet also stores metadata (schema, row counts, and min/max values for each column within each row group) that lets query engines like Athena skip entire row groups they don’t need. It’s not just smaller. It’s smarter.
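To make the row-group skipping concrete, here is a minimal sketch in plain Python. It is a toy model, not the Parquet format or any library’s API: the `RowGroup` class, the `scan` function, and the sample order data are all made up for illustration. The idea it shows is the one described above: statistics are recorded when data is written, and a reader compares them against the filter before touching any values.

```python
from dataclasses import dataclass, field


@dataclass
class RowGroup:
    """Toy stand-in for a Parquet row group (hypothetical, for illustration)."""
    columns: dict                      # column name -> list of values
    stats: dict = field(init=False)    # column name -> (min, max)

    def __post_init__(self):
        # Computed once at "write time", like the statistics kept in file metadata.
        self.stats = {name: (min(vals), max(vals)) for name, vals in self.columns.items()}


def scan(row_groups, filter_col, lower, upper, project_col):
    """Return project_col values for rows where lower <= filter_col <= upper.

    Also returns how many row groups actually had their values read.
    """
    results, groups_read = [], 0
    for group in row_groups:
        lo, hi = group.stats[filter_col]
        if hi < lower or lo > upper:
            continue  # range cannot match: skipped using metadata only
        groups_read += 1
        # Only the two columns the query needs are touched.
        for key, value in zip(group.columns[filter_col], group.columns[project_col]):
            if lower <= key <= upper:
                results.append(value)
    return results, groups_read


# Made-up order data, one row group per month.
row_groups = [
    RowGroup({"order_date": [20240101, 20240115], "total": [10.0, 25.5]}),
    RowGroup({"order_date": [20240201, 20240220], "total": [40.0, 12.0]}),
    RowGroup({"order_date": [20240305, 20240330], "total": [99.0, 5.0]}),
]

totals, groups_read = scan(row_groups, "order_date", 20240201, 20240229, "total")
print(totals, groups_read)
```

Only the February row group is read here; the other two are ruled out by their min/max ranges alone. This is also why sorting or partitioning data by the columns you filter on tends to make pruning more effective.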

But Parquet alone doesn’t give you a table. You still have to track which files belong to which partition, manage schemas, and handle partial writes manually. That’s where it still felt fragile.

Apache Iceberg solves this by adding a table abstraction on top of Parquet (or ORC). An Iceberg table tracks snapshots, and each snapshot is a consistent, immutable set of data files. When you write data, you create a new snapshot. The old one stays until you expire it. This gives you ACID transactions on your data lake: readers always see a consistent state, writes never corrupt in-progress reads, and you can roll back to any previous snapshot that hasn’t been expired with a single command.
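The snapshot idea is easier to see in code. What follows is a toy model in plain Python, not the Iceberg API or its metadata layout: `ToyTable`, its methods, and the file names are all hypothetical. It only illustrates the mechanism described above, where every write produces a new immutable file list and the table is just a pointer to the current one.

```python
class ToyTable:
    """Toy model of snapshot-based table metadata (hypothetical, not Iceberg's API)."""

    def __init__(self):
        self.snapshots = {}     # snapshot_id -> immutable tuple of data file names
        self.current_id = None  # the table is a pointer to one snapshot
        self._next_id = 1

    def commit(self, added_files):
        """Create a new snapshot: the current file set plus the added files."""
        base = self.snapshots.get(self.current_id, ())
        snapshot_id = self._next_id
        self._next_id += 1
        self.snapshots[snapshot_id] = base + tuple(added_files)
        self.current_id = snapshot_id  # readers see the write only after this swap
        return snapshot_id

    def files(self, snapshot_id=None):
        """Files visible at a snapshot; defaults to the current one."""
        key = self.current_id if snapshot_id is None else snapshot_id
        return self.snapshots[key]

    def rollback(self, snapshot_id):
        """Point the table back at an earlier snapshot that still exists."""
        if snapshot_id not in self.snapshots:
            raise KeyError(f"snapshot {snapshot_id} is unknown or expired")
        self.current_id = snapshot_id


table = ToyTable()
first = table.commit(["orders-0001.parquet"])
reader_view = table.files()                      # a reader pins this file list
second = table.commit(["orders-0002.parquet"])   # suppose this write was bad
table.rollback(first)

print(reader_view)          # unchanged by the later write
print(table.files(second))  # the bad snapshot is still readable by ID
print(table.files())        # the table itself is back at the first snapshot
```

Because a snapshot is never modified after it is created, the reader’s view stays valid no matter what is committed afterwards, and rolling back is just moving the pointer. In a real table, expiring a snapshot removes it from the metadata, which is why rollback only works for snapshots that are still retained.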

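Schema evolution works cleanly too. Add a column, rename a column, or widen a type (for example, int to long), and Iceberg handles the mapping so old data files don’t need rewriting. It can do this because it tracks columns by a unique ID rather than by name or position. Arbitrary type changes are not supported; only safe promotions are. Time travel lets you query the table as of a specific snapshot or timestamp. For the first time, my data lake had the guarantees I used to think only a warehouse could provide.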
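Here is a small sketch of why renames and added columns don’t force a rewrite. Iceberg tracks each column by a field ID; the rest of this example is a simplified toy, with a hypothetical `read_rows` function and made-up rows standing in for data files. Rows are keyed by field ID, and the current schema decides what each ID is called at read time.

```python
def read_rows(data_file, schema):
    """Project a data file's rows through the current schema.

    data_file: list of rows, each a dict keyed by field ID.
    schema:    dict mapping field ID -> current column name.
    Field IDs missing from an older file come back as None.
    """
    return [
        {name: row.get(field_id) for field_id, name in schema.items()}
        for row in data_file
    ]


# Written when the schema was {1: "id", 2: "amount"}.
old_file = [{1: 1001, 2: 25.5}]

schema = {1: "id", 2: "amount"}
schema[2] = "total"      # rename: same field ID, new name
schema[3] = "currency"   # add: new field ID that old files don't contain

# Written after the schema change.
new_file = [{1: 1002, 2: 40.0, 3: "EUR"}]

rows = read_rows(old_file, schema) + read_rows(new_file, schema)
print(rows)
```

The old file is never touched: its values for field 2 simply show up under the new name, and the column it never had reads as null. Matching by name instead would have turned the rename into a dropped column plus a new empty one.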

The combination — Parquet for efficient storage, Iceberg for table management — is what the industry is calling the lakehouse pattern. You get warehouse semantics on lake-priced storage. It’s changed how I think about architecture. Instead of forcing everything into Redshift, I can keep data in open formats on S3 and still get the reliability I need.