
Run SQL queries on local parquet and delta files using DuckDB

Yesterday I showed how we could query local Parquet and Delta files using pandas and deltalake. Although these libraries work, once you start loading big Parquet files your system stalls while memory usage spikes.

A colleague suggested giving DuckDB a try. I had never heard of it, but let's discover it together.

What is DuckDB?

DuckDB is an in-process analytical database — think SQLite, but built for OLAP workloads instead of transactional ones. It runs entirely inside your Python process (no server to spin up, no connection string to manage) and is optimized for the kinds of queries data engineers run every day: large scans, aggregations, joins, and window functions over columnar data.


A few things that make it stand out:

  • It reads files directly. You don't import data into DuckDB before querying it. You point it at a Parquet file, a folder of Parquet files, or a Delta table, and it queries them in place. No ETL step, no intermediate copy.
  • It's columnar and vectorized. DuckDB processes data in column-oriented batches, which is exactly how Parquet stores data on disk. This alignment means it can read only the columns and row groups it needs, skipping the rest entirely. Skipping unneeded columns is called projection pushdown; skipping row groups based on your filters is predicate pushdown.
  • It's fast. Benchmarks consistently put DuckDB ahead of pandas and Spark for single-node analytical queries, often by a significant margin. It also parallelizes across your CPU cores automatically.
  • It speaks standard SQL. No custom API to learn. If you know SQL, you know DuckDB.

Sounds great, right?

Installation

pip install duckdb

That's it. No Docker, no JVM, no config files.

Querying local Parquet files

Once you have your OneLake files synced locally via OneLake File Explorer, they'll live somewhere like:

C:\Users\<you>\OneLake - <workspace>\<lakehouse>.Lakehouse\Tables\<table name>

DuckDB can query these directly. Note: use forward slashes in your SQL strings — backslashes cause issues.
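If you are copying paths out of Windows Explorer, a quick replace keeps the SQL happy (the path below is just an illustration):

```python
# Windows paths use backslashes; convert them before embedding in SQL
win_path = r"C:\Users\you\OneLake - MyWorkspace\MyLakehouse.Lakehouse\Tables\orders"
sql_path = win_path.replace("\\", "/")
print(sql_path)
```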

import duckdb

# Query a single Parquet file
result = duckdb.query("""
    SELECT *
    FROM 'C:/Users/you/OneLake - MyWorkspace/MyLakehouse.Lakehouse/Tables/orders/part-00001.parquet'
    LIMIT 5
""").df()

# Query all Parquet files in a folder using a glob
result = duckdb.query("""
    SELECT *
    FROM 'C:/Users/you/OneLake - MyWorkspace/MyLakehouse.Lakehouse/Tables/orders/*.parquet'
    LIMIT 5
""").df()

The .df() at the end returns a pandas DataFrame, which is handy for further processing or display in a notebook.



Only read what you need

This is where DuckDB really earns its place in a data engineering toolkit. When you add a WHERE clause, DuckDB doesn't read the entire file and then filter — it uses the Parquet file's built-in metadata (row group statistics) to skip chunks of the file that can't possibly match your filter.

# DuckDB reads only the row groups that could contain status = 'shipped'
result = duckdb.query("""
    SELECT customer_id, order_date, amount
    FROM 'C:/Users/you/OneLake - MyWorkspace/MyLakehouse.Lakehouse/Tables/orders/*.parquet'
    WHERE status = 'shipped'
      AND order_date >= '2024-01-01'
""").df()

Similarly, if you only select a few columns, DuckDB reads only those columns off disk — not the full row. On a wide table with 50+ columns, this can reduce I/O by an order of magnitude.

Querying Delta Lake tables

Delta tables are stored as a folder of Parquet files plus a _delta_log/ transaction log. DuckDB supports Delta through its delta extension, whose delta_scan function reads the transaction log to determine which files are part of the current table state. Recent DuckDB versions autoload the extension on first use; on older versions you may need to run INSTALL delta; LOAD delta; first.

table_path = "C:/Users/you/OneLake - MyWorkspace/MyLakehouse.Lakehouse/Tables/orders"

result = duckdb.query(f"""
    SELECT *
    FROM delta_scan('{table_path}')
    WHERE status = 'shipped'
""").df()

This is important: if you just glob the Parquet files in a Delta folder directly, you might accidentally include files that were deleted or replaced by later transactions. Using delta_scan ensures you're reading a consistent, correct snapshot of the table.


Using a persistent connection for multiple queries

By default, duckdb.query() uses an in-memory connection that's discarded after each call. For notebook workflows where you want to run multiple queries, create a connection once and reuse it:


import duckdb

con = duckdb.connect()  # In-memory, but persistent across calls

# Register a path as a named view for convenience
con.execute("""
    CREATE OR REPLACE VIEW orders AS
    SELECT * FROM delta_scan('C:/Users/you/.../Tables/orders')
""")

con.execute("""
    CREATE OR REPLACE VIEW customers AS
    SELECT * FROM 'C:/Users/you/.../Tables/customers/*.parquet'
""")

# Now query as if they were tables
result = con.execute("""
    SELECT o.order_id, c.name
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    LIMIT 10
""").df()

This makes your notebooks much cleaner, especially when working with multiple tables.

Mixing DuckDB with Pandas

DuckDB integrates seamlessly with the rest of the Python data stack. You can query a pandas DataFrame directly from SQL:

import pandas as pd

# Some existing DataFrame in your notebook
df = pd.read_parquet("...")

# Query it with SQL — DuckDB can see Python variables
result = duckdb.query("SELECT * FROM df WHERE amount > 1000").df()

When to reach for DuckDB?

DuckDB isn't always the right tool — but it's the right tool more often than people expect. Here's a quick guide:

Situation                                                                | DuckDB?
-------------------------------------------------------------------------|------------------
Need to aggregate or filter a large Parquet table without loading it all | ✅ Yes
Running ad hoc SQL exploration in a notebook                             | ✅ Yes
Joining multiple local Parquet/Delta tables                              | ✅ Yes
Need Delta time travel (reading historical versions)                     | ❌ Use delta-rs
Simple row-level transforms on a small table                             | ⚠️ Pandas is fine

For single-node workloads on data that fits on a local disk — which covers a huge portion of real-world data engineering tasks — DuckDB is hard to beat.

Wrapping up

DuckDB fills a gap that pandas can't: fast, memory-efficient SQL over large local files, with no infrastructure overhead. When you combine that with OneLake File Explorer's ability to sync your Fabric Lakehouse tables locally, you get a surprisingly powerful local development setup — query terabytes of data with standard SQL, iterate quickly, and push results back to Fabric when you're done.

I've added DuckDB to my default data analytics toolkit, and I'd recommend you do the same.

More information

DuckDB – An in-process SQL OLAP database management system
