
Python vs PySpark notebooks in MS Fabric

Being new to Microsoft Fabric, I noticed that you have two options when writing notebooks in Python: run your code with PySpark (backed by a Spark cluster) or with plain Python (running natively on the notebook's compute). Both options look almost identical on the surface — you're still writing Python syntax either way — but under the hood they behave very differently, and picking the wrong one can cost you time, money, and unnecessary complexity.

In this post I'll outline the key differences and offer some heuristics for deciding which engine to reach for.

Python vs PySpark: what's actually different?

When you select PySpark in a Fabric notebook, your code runs on a distributed Apache Spark cluster. Fabric spins up a cluster, distributes your data across multiple worker nodes, and executes transformations in parallel. The core abstraction is the DataFrame (or RDD), and operations are lazy — nothing actually runs until you trigger an action like .show() or .write().

When you select Python, your code runs on a single machine — the notebook's own compute node. There's no cluster to spin up, no distributed execution, and no Spark overhead. You work with familiar Python libraries like pandas, scikit-learn, matplotlib, or anything else you'd pip install.
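By contrast, a small aggregation in plain pandas runs eagerly and immediately, with no cluster involved. A minimal example (assuming pandas is available, as it typically is in Fabric's default Python environment):

```python
# Plain-Python mode: everything runs eagerly on the notebook's own compute.
# Assumes pandas is available (typical in Fabric's default environment).
import pandas as pd

df = pd.DataFrame({
    "region": ["EU", "EU", "US", "US", "US"],
    "sales": [100, 150, 200, 50, 75],
})

# Eager execution: this line computes the result right away.
totals = df.groupby("region")["sales"].sum()
print(totals.to_dict())  # {'EU': 250, 'US': 325}
```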

The distinction matters because these two modes have very different performance profiles, startup costs, and ideal use cases.

When to choose PySpark?

Go with PySpark when your data is large. Spark earns its keep when you're working with datasets that don't fit comfortably in memory on a single machine — think hundreds of gigabytes to petabytes. Spark distributes the work across nodes, so what would take hours on a single machine can be done in minutes.

PySpark is the right call when you're working with Delta Lake and lakehouses at scale. Fabric's lakehouse is built on Delta Lake, and Spark has native, highly optimized support for reading and writing Delta tables. If you're building data pipelines that process large volumes of structured or semi-structured data, PySpark gives you the best throughput.

Choose PySpark for parallel, distributed transformations. Joining two 50 GB tables, aggregating billions of rows, or running window functions across a partitioned dataset — these are exactly what Spark was designed for. Trying to do this in pandas will either crash your session or take forever.

PySpark shines in production pipelines. When you're building repeatable, scheduled data engineering workflows that need to scale reliably, the fault tolerance and horizontal scalability of Spark are genuine advantages.

A good rule of thumb: if your data is bigger than a few gigabytes, or if you expect it to grow significantly, start with PySpark.

When to choose Python?

Choose Python when your dataset is small. If you're working with a few thousand to a few million rows that comfortably fit in memory, pandas will be faster than PySpark — often dramatically so. Spark has real startup overhead (cluster initialization alone can take 1–2 minutes), and for small data that cost is never recovered.

Python is better for data exploration and ad hoc analysis. When you're iterating quickly, exploring a new dataset, or sketching out ideas, the immediacy of plain Python is a huge advantage. You don't want to wait for cluster startup every time you tweak a visualization or re-run a cell.

Reach for Python when you need rich library access. The Python ecosystem is vast. If your work involves machine learning (scikit-learn, XGBoost, PyTorch), statistical modeling (statsmodels), advanced visualization (Plotly, seaborn), or anything specialized, native Python gives you direct, uncomplicated access. While Spark ML exists, it doesn't cover everything, and integrating non-Spark libraries into a PySpark notebook adds friction.
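As a sketch of that directness, here is a scikit-learn model trained in-process on synthetic data. This assumes scikit-learn and NumPy are installed (both are common in Fabric's default Python environment); the data and parameters are purely illustrative.

```python
# Sketch: fitting a scikit-learn model directly in a native Python notebook.
# Assumes scikit-learn and numpy are installed; the data here is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(seed=42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.1, size=200)  # y ≈ 3x + 2

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)  # close to 3.0 and 2.0
```

No cluster, no serialization of the model across nodes — the entire fit happens in the notebook's own process, which is exactly what these libraries expect.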

Python is ideal for any task that is fundamentally single-machine. Training a model on a prepared dataset, generating a report, calling an API, processing a small file — none of these benefit from distribution.
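A sketch of such a single-machine task, using only the standard library (the file contents and column names here are illustrative):

```python
# Sketch: processing a small CSV entirely on one machine with the standard
# library -- no cluster needed. File contents and columns are illustrative.
import csv
import io
from statistics import mean

# Stand-in for a small file you might read from the notebook's file system.
raw = io.StringIO(
    "order_id,amount\n"
    "1,19.99\n"
    "2,5.50\n"
    "3,12.00\n"
)

rows = list(csv.DictReader(raw))
amounts = [float(r["amount"]) for r in rows]
print(len(rows), round(sum(amounts), 2), round(mean(amounts), 2))
```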

If you're writing scripts, utilities, or helper functions that don't touch large datasets at all, native Python is simpler and more appropriate.

The gray zone

The trickiest decisions come when you're somewhere in the middle — say, a 2–10 GB dataset. Here, either engine could work, and the right answer depends on a few factors:

  • How often will this run? If it's a one-time exploration, Python + pandas (or the faster alternatives like Polars) is probably fine. If it'll run daily in production and data will grow, PySpark is the safer bet.
  • What are you doing with the data? Simple filtering and aggregations are fast in pandas even at moderate scale. Complex multi-table joins or large shuffles favor Spark.
  • Do you have a Spark cluster already running? If one is active, the overhead is already paid. If not, starting one for a 3 GB job is usually not worth it.
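The factors above can be folded into a rough heuristic. The helper below is purely illustrative — the thresholds are my own assumptions, not official Microsoft guidance:

```python
# Illustrative decision helper for the gray zone. Thresholds are rough
# personal heuristics, not official Microsoft Fabric guidance.
def recommend_engine(size_gb: float, runs_in_production: bool = False,
                     heavy_joins: bool = False,
                     spark_cluster_active: bool = False) -> str:
    if size_gb > 10:
        return "PySpark"   # clearly big data
    if size_gb < 2:
        return "Python"    # clearly small data
    # 2-10 GB gray zone: weigh the other factors.
    if runs_in_production or heavy_joins:
        return "PySpark"
    if spark_cluster_active:
        return "PySpark"   # startup cost already paid
    return "Python"

print(recommend_engine(3))                           # Python
print(recommend_engine(5, runs_in_production=True))  # PySpark
```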

Here is a quick decision guide that can help:

Situation                                       Recommended engine
Data larger than ~10 GB                         PySpark
Production data pipelines at scale              PySpark
Large Delta Lake reads/writes                   PySpark
Complex distributed joins or aggregations       PySpark
Data exploration and prototyping                Python
Data smaller than a few GB                      Python
Machine learning with scikit-learn / PyTorch    Python
Quick ad hoc analysis                           Python
Calling APIs or working with files              Python

A more detailed guide is provided by Microsoft and can be found here: Choosing Between Python and PySpark Notebooks in Microsoft Fabric - Microsoft Fabric | Microsoft Learn

A note on cost

In Microsoft Fabric, running a Spark session consumes capacity units even when the cluster is idle. Native Python notebooks are generally much cheaper to run for lightweight tasks because they don't spin up a cluster. If you're running lots of small, frequent notebooks, using Python instead of PySpark can meaningfully reduce your Fabric spend.

The choice between PySpark and Python in Microsoft Fabric isn't really about preference — it's about matching the right tool to the right job. PySpark is a powerful distributed engine that earns its overhead when data is large and pipelines need to scale. Python is a fast, flexible, and cost-effective option for smaller data, exploratory work, and tasks that lean on the broader Python ecosystem.

A practical approach: start with Python for exploration and early development. Once you understand the data and the transformations you need, switch to PySpark if scale demands it. Many mature Fabric workflows use both — Python notebooks for lightweight orchestration and analysis, PySpark notebooks for the heavy lifting in the middle of the pipeline.

More information

Choosing Between Python and PySpark Notebooks in Microsoft Fabric - Microsoft Fabric | Microsoft Learn

Use Python experience on Notebook - Microsoft Fabric | Microsoft Learn
