The lock that killed my migration

The timeout appeared without warning. A migration pipeline that had run fine in staging suddenly ground to a halt in production, throwing lock wait timeouts across multiple worker threads. The application was querying and writing to the same table, and somewhere in that dance of reads and writes, things had seized up entirely.

Here is a walk through the investigation, from the first diagnostic query all the way to the fix — along with an explanation of why it worked.

Find the blocker

The first tool in any SQL Server lock investigation is sys.dm_exec_requests joined against sys.dm_exec_sessions. This query shows you every session that is currently blocked, who is blocking it, and what SQL both parties are running:
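The article's original query block did not survive the page export. A common shape for this diagnostic, reconstructed to match the output columns shown below (the `OUTER APPLY` fetches the blocker's most recent statement, since the blocker may be idle between fetches):

```sql
-- Sketch of a standard blocking-chain query; column aliases chosen to
-- match the report in this post, not necessarily the author's exact script.
SELECT
    r.blocking_session_id  AS Blocking,
    r.session_id           AS Blocked,
    r.wait_type            AS [Wait type],
    r.wait_time / 1000.0   AS [Wait (sec)],
    r.status               AS [Status],
    blocked.text           AS Blocked_SQL,
    blocker.text           AS Blocking_SQL
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s
    ON s.session_id = r.session_id
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS blocked
OUTER APPLY (
    SELECT t.text
    FROM sys.dm_exec_connections AS c
    CROSS APPLY sys.dm_exec_sql_text(c.most_recent_sql_handle) AS t
    WHERE c.session_id = r.blocking_session_id
) AS blocker
WHERE r.blocking_session_id <> 0;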

A second query dug into the exact lock modes held on the specific table in question:
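This query was also lost in the export. A hedged reconstruction using sys.dm_tran_locks (the join to sys.partitions is needed because page- and key-level resources reference a heap-or-B-tree ID rather than the object ID directly):

```sql
-- Sketch: list every lock held or requested on dbo.PrognoseAntwoord.
SELECT
    tl.request_session_id,
    tl.resource_type,     -- OBJECT, PAGE, KEY, ...
    tl.request_mode,      -- S, IS, IX, X, ...
    tl.request_status     -- GRANT or WAIT
FROM sys.dm_tran_locks AS tl
LEFT JOIN sys.partitions AS p
    ON p.hobt_id = tl.resource_associated_entity_id
WHERE tl.resource_database_id = DB_ID()
  AND (tl.resource_associated_entity_id = OBJECT_ID('dbo.PrognoseAntwoord')
       OR p.object_id = OBJECT_ID('dbo.PrognoseAntwoord'));
```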

The output told the story

The results were unambiguous. Three sessions. One blocker. Two victims.

Blocking | Blocked | Wait type | Wait (sec) | Status
155      | 163     | LCK_M_IX  | 29.240     | suspended
155      | 164     | LCK_M_IX  | 29.241     | suspended

Blocked_SQL (sessions 163 and 164):
(@DocumentKey nvarchar(4000),@Id int) UPDATE dbo.PrognoseAntwoord SET PdfDocumentKey = @DocumentKey WHERE PrognoseAntwoordID=@Id

Blocking_SQL (session 155):
SELECT pa.PrognoseAntwoordID as Id, pa.Rapport as Data, 'application/pdf' as Formaat FROM [dbo].[PrognoseAntwoord] pa WHERE pa.PdfDocumentKey IS NULL ORDER BY pa.PrognoseAntwoordID ASC

Blocked sessions 163 and 164 were both attempting an UPDATE dbo.PrognoseAntwoord SET PdfDocumentKey = @DocumentKey WHERE PrognoseAntwoordID = @Id. They had been waiting nearly 30 seconds, with no prospect of completing.

Session 155 — the blocker — was not running an UPDATE. It was running the SELECT: a streaming fetch of every row where PdfDocumentKey IS NULL, ordered by ID.

The reader was starving the writers. Not through any bug in the UPDATE logic, but simply by staying open.

Understanding the mechanism

What SQL Server does by default

Under SQL Server's default isolation level — READ COMMITTED — a SELECT statement acquires shared (S) locks on the rows and pages it reads, plus intent shared (IS) locks higher up the lock hierarchy. In normal usage those locks are released as the scan advances. But when a result set is consumed row by row over a live connection, as a streaming pipeline does, the connection remains open for the full duration of the fetch, and the engine holds locks far longer than a conventional single-shot query would.

Why Intent Exclusive locks conflict

An UPDATE statement must first acquire an Intent Exclusive (IX) lock at the table and page level before taking exclusive (X) locks on the rows it modifies. IX is compatible with other intent locks (IS, IX), but it conflicts with full shared (S) locks on the same resource. So sessions 163 and 164 could not even begin their updates while session 155 held shared locks open across the table.
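The conflict can be reproduced in two query windows. This is a minimal sketch, not the pipeline's actual code: REPEATABLE READ is used in session A to keep shared locks held until commit, which mimics what the long-lived streaming read does under READ COMMITTED; the key value and document key are placeholders.

```sql
-- Session A: acquire shared locks and keep them by leaving the
-- transaction open (stands in for the slow streaming fetch).
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN TRANSACTION;
SELECT * FROM dbo.PrognoseAntwoord WHERE PdfDocumentKey IS NULL;
-- ... transaction intentionally left open ...

-- Session B (separate connection): waits on LCK_M_IX until session A
-- commits or rolls back, exactly as sessions 163 and 164 did.
UPDATE dbo.PrognoseAntwoord
SET PdfDocumentKey = N'doc-key-placeholder'
WHERE PrognoseAntwoordID = 1;
```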

Root cause

  • Session 155 (FetchDocuments) streams a SELECT on dbo.PrognoseAntwoord, keeping shared locks alive for the entire pipeline duration.
  • Sessions 163 & 164 (ProcessMigratedDocuments) attempt to UPDATE the same table and block on LCK_M_IX — Intent Exclusive — which is incompatible with those held shared locks.
  • The pipeline is blocking itself: readers block writers, the writers wait, and eventually the lock wait timeout fires. (Strictly speaking this is blocking, not a deadlock — nothing is waiting in a cycle, the writers are simply starved.)

The fix: READ UNCOMMITTED on the fetch connection

The solution is to change the isolation level of the fetch connection to READ UNCOMMITTED. At this level the SELECT acquires no shared locks at all: session 155 can scan the entire table without holding a single lock, and sessions 163 and 164 can immediately take their Intent Exclusive locks and apply their updates. The pipeline runs to completion without contention.
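On the fetch connection the change is a single statement before the streaming SELECT opens (the query text here is the one from the blocking report; the table hint WITH (NOLOCK) is the per-query equivalent if you cannot change the connection's isolation level):

```sql
-- Run once on the fetch connection, before opening the streaming reader.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

SELECT pa.PrognoseAntwoordID AS Id,
       pa.Rapport            AS Data,
       'application/pdf'     AS Formaat
FROM dbo.PrognoseAntwoord AS pa
WHERE pa.PdfDocumentKey IS NULL
ORDER BY pa.PrognoseAntwoordID ASC;
```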

A word on the trade-offs

READ UNCOMMITTED means the SELECT can theoretically read rows that a concurrent transaction has written but not yet committed — so-called dirty reads. In most OLTP scenarios that is unacceptable. But in a migration pipeline where the SELECT is only looking for rows where PdfDocumentKey IS NULL, and the UPDATE is only ever setting that column to a non-null value, a dirty read would at worst cause the pipeline to attempt processing a document twice — which should be idempotent by design anyway.

When to be cautious: If the application cannot tolerate dirty reads at all, consider SNAPSHOT isolation or READ COMMITTED SNAPSHOT ISOLATION (RCSI) as alternatives. RCSI gives each statement a consistent view of committed data, SNAPSHOT isolation extends that consistency to the whole transaction, and neither blocks writers. RCSI requires a single database-level setting and is often the right long-term answer for high-concurrency applications.
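Enabling RCSI is a one-time setting per database. The database name below is a placeholder, and WITH ROLLBACK IMMEDIATE kicks out open transactions so the switch does not wait behind them; plan this for a maintenance window:

```sql
-- One-time switch: READ COMMITTED statements now read from row
-- versions in tempdb instead of taking shared locks.
ALTER DATABASE [YourDatabase]
SET READ_COMMITTED_SNAPSHOT ON
WITH ROLLBACK IMMEDIATE;
```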

For this particular pipeline — a batch migration with idempotent update logic — READ UNCOMMITTED is the most direct and lowest-overhead fix. The locks disappear, the workers proceed in parallel, and the timeout never fires again.

Conclusion

Lock contention bugs are among the hardest production issues to reproduce, because they depend on timing, concurrency, and connection lifetime — none of which behave the same way in a staging environment with a single worker thread. When you see LCK_M_IX waits piling up, start with sys.dm_exec_requests and follow the chain. The blocker is almost always something you would not expect.
