Yesterday I showed how to query local Parquet and Delta files using pandas and deltalake. These libraries work, but once you start loading big Parquet files you'll see your system stall while memory usage spikes.
A colleague suggested giving DuckDB a try. I had never heard of it, but let's discover it together.
What is DuckDB?
DuckDB is an in-process analytical database — think SQLite, but built for OLAP workloads instead of transactional ones. It runs entirely inside your Python process (no server to spin up, no connection string to manage) and is optimized for the kinds of queries data engineers run every day: large scans, aggregations, joins, and window functions over columnar data.
A few things that make it stand out:
- It reads files directly. You don't import data into DuckDB before querying it. You point it at a Parquet file, a folder of Parquet files, or a Delta table, and it queries them in place. No ETL step, no intermediate copy.
- It's columnar and vectorized. DuckDB processes data in column-oriented batches, which is exactly how Parquet stores data on disk. This alignment means it can read only the columns and row groups it needs, skipping the rest entirely — techniques known as projection and predicate pushdown.
- It's fast. Benchmarks consistently put DuckDB ahead of pandas and Spark for single-node analytical queries, often by a significant margin. It also parallelizes across your CPU cores automatically.
- It speaks standard SQL. No custom API to learn. If you know SQL, you know DuckDB.
Sounds great, right?
Installation
pip install duckdb
That's it. No Docker, no JVM, no config files.
Querying local Parquet files
Once you have your OneLake files synced locally via OneLake File Explorer, they'll live somewhere like:
C:\Users\<you>\OneLake - <workspace>\<lakehouse>.Lakehouse\Tables\<table name>
DuckDB can query these directly. Note: use forward slashes in your SQL strings — backslashes cause issues.
import duckdb
# Query a single Parquet file
result = duckdb.query("""
SELECT *
FROM 'C:/Users/you/OneLake - MyWorkspace/MyLakehouse.Lakehouse/Tables/orders/part-00001.parquet'
LIMIT 5
""").df()
# Query all Parquet files in a folder using a glob
result = duckdb.query("""
SELECT *
FROM 'C:/Users/you/OneLake - MyWorkspace/MyLakehouse.Lakehouse/Tables/orders/*.parquet'
LIMIT 5
""").df()
The .df() at the end returns a pandas DataFrame, which is handy for further processing or display in a notebook.
Only read what you need
This is where DuckDB really earns its place in a data engineering toolkit. When you add a WHERE clause, DuckDB doesn't read the entire file and then filter — it uses the Parquet file's built-in metadata (row group statistics) to skip chunks of the file that can't possibly match your filter.
# DuckDB reads only the row groups that could contain status = 'shipped'
result = duckdb.query("""
SELECT customer_id, order_date, amount
FROM 'C:/Users/you/OneLake - MyWorkspace/MyLakehouse.Lakehouse/Tables/orders/*.parquet'
WHERE status = 'shipped'
AND order_date >= '2024-01-01'
""").df()
Similarly, if you only select a few columns, DuckDB reads only those columns off disk — not the full row. On a wide table with 50+ columns, this can reduce I/O by an order of magnitude.
Querying Delta Lake tables
Delta tables are stored as a folder of Parquet files plus a _delta_log/ transaction log. DuckDB supports Delta via the delta extension's delta_scan function, which reads the transaction log to determine which files are part of the current table state. Recent DuckDB versions auto-install the extension on first use; if yours doesn't, run INSTALL delta; LOAD delta; once beforehand.
table_path = "C:/Users/you/OneLake - MyWorkspace/MyLakehouse.Lakehouse/Tables/orders"
result = duckdb.query(f"""
SELECT *
FROM delta_scan('{table_path}')
WHERE status = 'shipped'
""").df()
This is important: if you just glob the Parquet files in a Delta folder directly, you might accidentally include files that were deleted or replaced by later transactions. Using delta_scan ensures you're reading a consistent, correct snapshot of the table.
Using a persistent connection for multiple queries
By default, duckdb.query() uses an in-memory connection that's discarded after each call. For notebook workflows where you want to run multiple queries, create a connection once and reuse it:
import duckdb
con = duckdb.connect() # In-memory, but persistent across calls
# Register a path as a named view for convenience
con.execute("""
CREATE OR REPLACE VIEW orders AS
SELECT * FROM delta_scan('C:/Users/you/.../Tables/orders')
""")
con.execute("""
CREATE OR REPLACE VIEW customers AS
SELECT * FROM 'C:/Users/you/.../Tables/customers/*.parquet'
""")
# Now query as if they were tables
result = con.execute("""
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
LIMIT 10
""").df()
This makes your notebooks much cleaner, especially when working with multiple tables.
Mixing DuckDB with Pandas
DuckDB integrates seamlessly with the rest of the Python data stack. You can query a pandas DataFrame directly from SQL:
import pandas as pd
# Some existing DataFrame in your notebook
df = pd.read_parquet("...")
# Query it with SQL — DuckDB can see Python variables
result = duckdb.query("SELECT * FROM df WHERE amount > 1000").df()
When to reach for DuckDB?
DuckDB isn't always the right tool — but it's the right tool more often than people expect. Here's a quick guide:
| Situation | DuckDB? |
| --- | --- |
| Need to aggregate or filter a large Parquet table without loading it all | ✅ Yes |
| Running ad hoc SQL exploration in a notebook | ✅ Yes |
| Joining multiple local Parquet/Delta tables | ✅ Yes |
| Need Delta time travel (reading historical versions) | ❌ Use delta-rs |
| Simple row-level transforms on a small table | ⚠️ Pandas is fine |
For single-node workloads on data that fits on a local disk — which covers a huge portion of real-world data engineering tasks — DuckDB is hard to beat.
Wrapping up
DuckDB fills a gap that pandas can't: fast, memory-efficient SQL over large local files, with no infrastructure overhead. When you combine that with OneLake File Explorer's ability to sync your Fabric Lakehouse tables locally, you get a surprisingly powerful local development setup — query terabytes of data with standard SQL, iterate quickly, and push results back to Fabric when you're done.
I've added DuckDB to my default data analytics toolkit, and I'd recommend doing the same.
