08 · gpu

GPU under load: making Panfrost usable

GPU reset is the nuclear option. Close paths are worse: scheduler fences, GEM mappings, and per-file MMU lifetime all have to agree.

◐ partial

The PinePhone Pro has a Mali-T860 MP4 GPU. FreeBSD’s panfrost driver — ported from Linux’s drivers/gpu/drm/panfrost/ and living in our overlay at src/sys/dev/drm/panfrost/ — handles GPU jobs, GEM buffer allocation, and the IOMMU page tables that map userland buffers into GPU virtual memory. For light loads (sway compositing, gtk apps, single-window video) it is solid. For heavy loads (Hyprland with animations, glmark2, anything pushing real fragment shader workload through the chip) it would crash the entire system inside the first minute, taking USB networking, the display, and SSH down with it.

As of kernel #77, built from commit be91afa (Refcount Panfrost file lifetime), the basic claim is now much stronger: Sway is using Mali-T860 (Panfrost), glxgears runs around 278 FPS, closing glxgears through both timeout and a Sway window kill no longer panics, and the post-test driver counters show job_cnt=0, sc->jobs[]=NULL, irq_fault=0, mmu_faults=0, and resets=0. The remaining qualifier is not “does Panfrost work at all?” It is “what do Hyprland, WebGL, and real sustained shader load still expose?”

The crash wasn’t one bug. It started as four bugs that only manifested together when a GPU job timed out. Later, when rendering was finally stable, the last obvious panic moved to process teardown: closing a healthy GL client freed per-file MMU state while jobs and GEM mappings could still reference it.

This essay is not a single arc. It’s the sequence of invariants we had to discover before the GPU became usable.

[WAR STORY]

Patch 1 — inline GPU reset, not deferred

panfrost / scheduler

▸ symptom

glmark2 runs for ~40 seconds. A complex shader job overruns the 5-second job timeout in panfrost_job_timedout. Serial console shows panfrost: GPU job timeout, queueing reset task. Then nothing. The display freezes mid-frame. SSH dies. Phone has to be hard-rebooted via the side button.

▸ hypothesis 1

The reset task is taking too long once it runs. Add some logging in the reset path. Boot, run, hang, reboot, read serial. The reset task never started. The taskqueue handler was queued but the queue was wedged.

▸ breakthrough

The DRM scheduler stops itself when a job times out — drm_sched_stop parks the scheduler kthread for the affected slot. Concurrently, any new DRM atomic commits from the compositor block in drm_atomic_helper_wait_for_vblanks, which is waiting on a fence that the parked scheduler can no longer signal. The taskqueue we deferred the reset to runs on the same kthread pool that the scheduler runs on. The reset task gets queued behind the parked scheduler entry. It never runs. Deadlock.

▸ fix

Match Linux behavior: do the entire GPU reset inline in the timedout_job callback. Stop all schedulers, mask job interrupts, soft-stop hardware jobs, hard-reset the GPU, re-submit pending jobs, and restart schedulers, all before returning from timedout_job. Commit eb0e2d1 (panfrost: inline GPU reset in timedout callback, fix build config) implements panfrost_reset_nolock exactly as Linux does, with a concrete ordering: schedulers stop first (so userland can’t submit anything new), then JOB_INT_MASK = 0 (so the IRQ handler can’t re-enter the reset path), then HW soft-stop, then the actual device_reset, then resubmit, then re-arm schedulers.
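The ordering above can be checked mechanically. The following is a minimal userland sketch, not the driver's code: the step names mirror the prose, while `reset_trace`, `model_timedout_job`, and `trace_is_ordered` are hypothetical helpers invented here to show how a trace of the reset could be validated against the required sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Steps of the inline reset, in the order the text gives them. */
enum reset_step {
    STEP_STOP_SCHEDULERS,   /* userland can no longer submit */
    STEP_MASK_JOB_IRQ,      /* JOB_INT_MASK = 0 */
    STEP_SOFT_STOP_JOBS,
    STEP_DEVICE_RESET,
    STEP_RESUBMIT_JOBS,
    STEP_START_SCHEDULERS,
    STEP_COUNT
};

/* Hypothetical trace buffer standing in for real hardware side effects. */
struct reset_trace {
    enum reset_step steps[STEP_COUNT];
    size_t n;
};

static void record_step(struct reset_trace *t, enum reset_step s)
{
    assert(t->n < STEP_COUNT);
    t->steps[t->n++] = s;
}

/* Model of a timedout_job callback that finishes the reset before returning. */
void model_timedout_job(struct reset_trace *t)
{
    t->n = 0;
    record_step(t, STEP_STOP_SCHEDULERS);
    record_step(t, STEP_MASK_JOB_IRQ);
    record_step(t, STEP_SOFT_STOP_JOBS);
    record_step(t, STEP_DEVICE_RESET);
    record_step(t, STEP_RESUBMIT_JOBS);
    record_step(t, STEP_START_SCHEDULERS);
}

/* Returns 1 only if every step ran exactly once and in the required order. */
int trace_is_ordered(const struct reset_trace *t)
{
    if (t->n != STEP_COUNT)
        return 0;
    for (size_t i = 0; i < t->n; i++)
        if (t->steps[i] != (enum reset_step)i)
            return 0;
    return 1;
}
```

A trace that resets the device before masking the job interrupt would fail `trace_is_ordered`, which is the class of mistake the fixed ordering is meant to rule out.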

[WAR STORY]

Patch 2 — page-table teardown before MMU reset

panfrost / iommu

▸ symptom

After the inline-reset patch, the system survives most timeouts. But occasionally — and reliably under sustained heavy load — postclose (the path that runs when a GPU client closes its DRM file descriptor) panics with “l3e found (indexes 1 5 12 7)”. Stack trace points at smmu_pmap_remove_pages in sys/arm64/iommu/iommu_pmap.c. The IOMMU has live L3 page-table entries when we tried to release the page table. There is a KASSERT for this exact case.

▸ hypothesis 1

Bug in smmu_pmap_remove_pages. We could weaken the assertion: clear the entries instead of panicking. Tried it briefly. Commit d50072f (iommu: clear stale L3 entries instead of panicking in smmu_pmap_remove_pages) in fact did this for several days. It “worked” in the sense that the panic stopped, but the GPU now leaked physical memory: every closed client left mapped pages pinned forever.

▸ hypothesis 2

The IOMMU code is right; we’re calling it wrong. panfrost_postclose was calling panfrost_mmu_pgtable_free directly on a page table that still had user buffers mapped into GPU VA. The Linux driver walks drm_mm (the GPU-side allocator) and explicitly unmaps every node before tearing down the page table. We weren’t doing that.

▸ fix

e275d21 panfrost: properly unmap all GPU pages in pgtable_free (fix postclose panic) walks the drm_mm allocator first, calls pmap_gpu_remove for each node, then tears down the page table. The IOMMU assertion stays in place — it is a real invariant and weakening it just hid this bug.

drm_mm_for_each_node(node, &pfile->mm) {
    len = node->size << PAGE_SHIFT;
    pmap_gpu_remove(&mmu->p, node->start << PAGE_SHIFT, len);
}
smmu_pmap_remove_pages(&mmu->p);
smmu_pmap_release(&mmu->p);

Reverted the iommu_pmap weakening as part of the same commit. Don’t paper over driver bugs by relaxing IOMMU safety checks.

▸ lesson

GPU drivers and IOMMU drivers have a contract: the GPU driver must tear down its mappings before releasing the page tables. The IOMMU layer enforces this with assertions. When the assertion fires, the bug is in the GPU driver, not the IOMMU.
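The contract can be illustrated with a toy model. This is a userland sketch under stated assumptions: `toy_pgtable`, `toy_map`, `toy_unmap`, and `toy_release` are hypothetical names, and the real IOMMU layer panics via KASSERT where this model merely returns an error.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PT_ENTRIES 16

/* Toy page table: one flag per entry, nonzero means "still mapped". */
struct toy_pgtable {
    unsigned char live[TOY_PT_ENTRIES];
};

/* Mark a page range as mapped (what buffer mapping does on the GPU side). */
void toy_map(struct toy_pgtable *pt, size_t start, size_t npages)
{
    assert(start + npages <= TOY_PT_ENTRIES);
    for (size_t i = 0; i < npages; i++)
        pt->live[start + i] = 1;
}

/* Remove a page range; the GPU driver's half of the contract. */
void toy_unmap(struct toy_pgtable *pt, size_t start, size_t npages)
{
    assert(start + npages <= TOY_PT_ENTRIES);
    for (size_t i = 0; i < npages; i++)
        pt->live[start + i] = 0;
}

/*
 * Release check; the IOMMU layer's half of the contract.
 * Returns 0 if the table is empty, -1 if a live entry remains.
 */
int toy_release(const struct toy_pgtable *pt)
{
    for (size_t i = 0; i < TOY_PT_ENTRIES; i++)
        if (pt->live[i])
            return -1;
    return 0;
}
```

Releasing with a mapping still in place fails; unmapping every range first makes the same release succeed, which is the order the fix restores.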

[WAR STORY]

Patch 8 — Firefox glxtest waits forever on a Panfrost BO (open)

panfrost / dma fence

▸ symptom

On 2026-05-06, after Firefox was installed and launched under Sway, the phone looked wedged from the panel but both USB Ethernet and WiFi still answered. The kernel had already killed Firefox once for memory reclaim failure, and a stale fuzzel lock made the launcher look sick, but the compositor problem was lower-level. procstat -k showed Sway’s main thread blocked here:

sway
  drm_ioctl
  panfrost_ioctl_wait_bo
  dma_resv_wait_timeout_rcu

The companion glxtest process was stuck in a locked sysctl path, and one Firefox content process was suspended while another thread waited in dsp_clone opening /dev/dsp. Killing clients did not recover the desktop. SIGKILL did not reap Sway; it became an orphan under PID 1 and remained in state R. sudo reboot stopped sshd, but the kernel kept answering ICMP for more than a minute. The phone came back only after reboot.

▸ hypothesis 1

This is not the old drm_modeset_lock signature. There was no fresh hw_done / flip_done timeout cascade in the receipt. The blocking point was the Panfrost BO wait ioctl, so the first split is between “the GPU job/fence never completed” and “the wait path cannot observe completion because the FreeBSD dma-resv/fence port is missing a wakeup or lifetime edge.”

▸ hypothesis 2

Firefox startup is a useful reproducer because glxtest asks Mesa/GL to probe the renderer before the browser proper settles. That makes it a smaller target than WebGL content: one helper process, one compositor, and no need to load a large page. The next run should not start by killing Firefox; it should first capture stacks and counters.

▸ breakthrough

The new read-only receipt is mise run debug:gpu:wedge:phone -- <name>. It records process state, procstat -k stacks, Sway IPC if it answers, DRM/Panfrost sysctls, thermal state, dmesg, and messages into logs/gpu-wedge/. A healthy post-reboot baseline on 2026-05-06 classified as unclassified; the next frozen Firefox/Sway desktop should classify as panfrost-bo-wait if the same stack appears.

▸ lesson

The critical observation was that “the phone is wedged” had three layers: network was alive, userland clients were sick, and the compositor was stuck in a kernel wait that even SIGKILL could not unwind. If the capture script says panfrost-bo-wait, debug Panfrost fence completion and BO lifetime first. If it says drm-modeset-wedge, return to the rk_vop atomic-helper path from patch 7. They are different bugs until a shared lower-level fence/lifetime fault proves otherwise.
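The triage rule can be sketched as a string match over a captured kernel stack. This is an illustration only, not the capture script: `classify_wedge` is a hypothetical function, the class names come from the text, and the marker strings used for the modeset case are assumptions chosen from symbols mentioned elsewhere in this essay.

```c
#include <assert.h>
#include <string.h>

/*
 * Classify a procstat -k style stack dump by the first marker found.
 * The modeset markers are assumed; the real script may key on others.
 */
const char *classify_wedge(const char *kstack)
{
    assert(kstack != NULL);

    /* Compositor blocked in the Panfrost BO wait ioctl. */
    if (strstr(kstack, "panfrost_ioctl_wait_bo") != NULL)
        return "panfrost-bo-wait";

    /* Assumed markers for the atomic-helper / modeset-lock wedge. */
    if (strstr(kstack, "drm_modeset_lock") != NULL ||
        strstr(kstack, "drm_atomic_helper_wait_for_dependencies") != NULL)
        return "drm-modeset-wedge";

    return "unclassified";
}
```

Checking the Panfrost marker first matches the lesson's priority: a BO wait sends the investigation to fence completion and BO lifetime before anything in the display path.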

[WAR STORY]

Patch 3 — skip wait_for_vblanks during GPU reset

rk_drm / panfrost

▸ symptom

With patches 1 and 2 in place, the system mostly survives GPU timeouts — kernel doesn’t panic, GPU resets, jobs resubmit. But after a timeout, the display sometimes freezes for 20+ seconds before recovering, and during that window USB networking dies (host can’t reach 10.0.0.2). Eventually the system recovers, but a 20-second freeze is unacceptable on a phone.

▸ hypothesis 1

The GPU is in a degraded mode after reset. Maybe the next compositor frame is taking the slow path. Added timing to commit_tail. Each phase of the atomic commit was logged. The result was unambiguous: commit_hw_done returned in microseconds; wait_for_vblanks was sitting for the full 20 seconds.

▸ breakthrough

The compositor calls drm_atomic_helper_wait_for_vblanks to confirm the new frame has been scanned out. Inside it, the helper waits on the same scheduler fence machinery the GPU driver uses. When drm_sched_stop parks the scheduler thread mid-reset, anything waiting on that thread’s fences blocks until the thread restarts. Our compositor was waiting on a fence whose backing thread was parked.

We don’t actually need wait_for_vblanks — we send the vblank event immediately in atomic_begin (essay 7’s commit 736d191). The wait is pure overhead, and it deadlocks when the GPU is being reset.

▸ fix

604bd8f rk_drm: skip wait_for_vblanks in commit_tail to prevent GPU reset deadlock deletes the drm_atomic_helper_wait_for_vblanks call from rk_drm_atomic_commit_tail. The function is now five short helper calls in a row with no waits. Display freezes during GPU reset dropped from 20 seconds to under one second.

[WAR STORY]

Patch 4 — remove MMU from in-use list before teardown

panfrost / mmu / IRQ

▸ symptom

Even with patches 1-3, we get a low-rate crash signature: NULL dereference at offset 0x20 in panfrost_mmu_irq_handler. Specifically, pfile->mm_lock (which is at offset 0x20 in struct panfrost_file) is being accessed on a pfile that’s been freed.

▸ hypothesis 1

Use-after-free in the GEM object teardown. We had real bugs in gem_create_object_with_handle (0ed920a panfrost: fix use-after-free in gem_create_object_with_handle) and the gem_open error path (bdd62bb panfrost: add debug logging to gem_free_object and gem_open error path). Neither matched the offset 0x20 pattern. Different bug.

▸ hypothesis 2

The MMU IRQ handler is racing pgtable teardown. Look at the order of operations in panfrost_mmu_pgtable_free:

1. drm_mm_for_each_node — unmap each page region
2. smmu_pmap_remove_pages — release page-table pages
3. smmu_pmap_release — release the pmap itself
4. TAILQ_REMOVE(&sc->mmu_in_use, mmu, next) — unregister

During steps 1-3, an MMU fault can arrive. The IRQ handler walks sc->mmu_in_use to find the pfile for the faulting AS. It finds our half-torn-down mmu. Dereferences pfile->mm_lock. Pfile is mostly-freed memory. NULL or junk at offset 0x20. Crash.

▸ breakthrough

The IRQ handler treats mmu_in_use as the source of truth for “which mappings are live.” We treated it as cleanup. Wrong. Whatever’s on mmu_in_use will be touched by the IRQ handler. Remove it first, then tear down — not the other way around.

▸ fix

9bd776a panfrost: remove mmu from in-use list before teardown to fix IRQ race swaps the order. Before unmapping anything, hold as_mtx and remove the mmu from mmu_in_use. After that point the IRQ handler cannot find this pfile, so the teardown is safe to proceed. This is a pure ordering fix — no new locks, no new fields. The diff is twelve lines.

mtx_lock_spin(&sc->as_mtx);
if (mmu->as >= 0) {
    sc->as_alloc_set &= ~(1 << mmu->as);
    TAILQ_REMOVE(&sc->mmu_in_use, mmu, next);
}
mtx_unlock_spin(&sc->as_mtx);

/* Now safe to unmap — MMU IRQ handler won't find us. */
drm_mm_for_each_node(node, &pfile->mm) {
    len = node->size << PAGE_SHIFT;
    pmap_gpu_remove(&mmu->p, node->start << PAGE_SHIFT, len);
}
smmu_pmap_remove_pages(&mmu->p);
smmu_pmap_release(&mmu->p);

A fifth fix landed alongside: f930a3d panfrost: signal pending fences during GPU reset signals pending DRM scheduler fences during reset with dma_fence_set_error(-ETIMEDOUT) then dma_fence_signal. Without it, atomic commits waiting on in-flight GPU work hang forever after the reset — a slower-motion patch-3 deadlock. Iterate sched->ring_mirror_list per slot and signal anything unsignaled.
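The fence pass can be modeled in a few lines. This is a userland sketch, not the scheduler's data structures: `toy_fence` and `toy_signal_pending` are hypothetical, and a plain array stands in for the per-slot list the text mentions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy fence: just the two bits of state the reset path cares about. */
struct toy_fence {
    int signaled;
    int error;   /* 0 on success, negative errno on failure */
};

/*
 * Signal every still-pending fence with an error so waiters wake up.
 * Already-signaled fences are left untouched.
 * Returns the number of fences signaled by this call.
 */
size_t toy_signal_pending(struct toy_fence *fences, size_t n, int error)
{
    size_t count = 0;

    assert(error < 0);
    for (size_t i = 0; i < n; i++) {
        if (fences[i].signaled)
            continue;
        fences[i].error = error;   /* set the error before signaling */
        fences[i].signaled = 1;
        count++;
    }
    return count;
}
```

Setting the error before signaling matters in the real API too: a waiter that wakes on the signal should already be able to see why the work did not complete.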

▸ lesson

Hardware timeout recovery is the worst path in the driver. It runs concurrently with everything else, it can’t take long, and it has to coordinate with subsystems whose authors have never thought about your reset semantics. The bugs you ship in normal-path code will be obvious; the bugs you ship in the timeout path will hide for weeks because they only fire when the GPU misbehaves and someone is also doing something else with the screen. Test reset paths under production load, with the compositor running, USB networking active, and concurrent GEM allocations — not in isolation.

Per the project’s GPU-status memory note (project_gpu_crash_fixes.md), the first stable milestone was “normal use works; heavy stress still needs proof.” Kernel #77 advances that: normal Sway compositing, GL renderer selection, glxgears, and the close path are now clean on hardware. The remaining open work is narrower: run Hyprland and browser/WebGL stress with the same counters and serial capture, then decide whether the reset path is actually production-stable or merely good enough for normal UI load.

The driver runs Sway, gtk apps, mpv playback, glxinfo, and glxgears with hardware acceleration. Hyprland and browser/WebGL stress are the next validation target. ◐ partial now reflects untested high-load coverage, not a known failure in the normal compositor path.

[WAR STORY]

Patch 5 — the atomic_helper NULL deref under sway theme reload

rk_drm / rk_plane

▸ symptom

Reloading a sway theme — which fires a burst of atomic commits in under 100 ms — would reliably reboot the phone. Captured via serial photography (the boot console is gone after the CRU hack, so we can’t tee boot output, but post-boot panic frames do appear on serial briefly before reboot):

rk_vop@: atomic_begin: event=0xffff... call #1
rk_vop@: atomic_flush:  event=0           call #1
... (eleven good pairs)
rk_vop@: atomic_begin: event=0xffff... call #12
Fatal data abort:
  far: 0                         ← faulting address = NULL
  esr: 0x96000004                ← translation fault, level 0
WARNING [list_empty(&lock->head)]                 failed in drm_modeset_lock.c
WARNING [drm_modeset_is_locked(&crtc->mutex)]     failed at drm_atomic_helper.c:617
WARNING [drm_modeset_is_locked(&plane->mutex)]    failed at drm_atomic_helper.c:892
[drm] *ERROR* [CRTC:33:crtc-0] hw_done timed out
[drm] *ERROR* [PLANE:31:plane-0] flip_done timed out
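The esr value in the dump can be decoded by hand. A small sketch of that decoding follows, assuming the standard AArch64 ESR layout (exception class in bits 31:26, data fault status code in bits 5:0, write-not-read in bit 6); the function names are hypothetical helpers for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Exception class: ESR bits [31:26]. 0x25 is a data abort at the same EL. */
unsigned esr_ec(uint32_t esr)
{
    return (esr >> 26) & 0x3f;
}

/* Data fault status code: ISS bits [5:0] for data aborts. */
unsigned esr_dfsc(uint32_t esr)
{
    return esr & 0x3f;
}

/* WnR, ISS bit 6: 1 if the faulting access was a write, 0 if a read. */
int esr_is_write(uint32_t esr)
{
    return (int)((esr >> 6) & 1);
}

/*
 * DFSC 0b0001LL encodes a translation fault at level LL.
 * Returns the level (0..3), or -1 if this is not a translation fault.
 */
int esr_translation_fault_level(uint32_t esr)
{
    unsigned dfsc = esr_dfsc(esr);

    if ((dfsc & 0x3c) == 0x04)
        return (int)(dfsc & 0x3);
    return -1;
}
```

For 0x96000004 this yields exception class 0x25, a read access, and a level 0 translation fault, which is what a load through a NULL pointer (far = 0) looks like when nothing is mapped at the bottom of the address space.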

The wall of clknode_link_recalc: Attempt to use unresolved linked clock: clkin_gmac lines around the panic is incidental: the GMAC clock is referenced by a node we don’t use, and clk_link.c was unconditionally printf-ing on every call. With sway hammering atomic commits the message rate hit hundreds per second and obscured the actual fault. (Fixed separately by rate-limiting the clknode_link_* warnings; see the patches index.)

▸ hypothesis 1

The lock-assertion warnings (lines 617 and 892) made me chase a lock-ordering bug for a while: maybe a workqueue was running atomic_check without holding mode_config.connection_mutex. But those lines are in drm_atomic_helper_check_modeset and _check_planes, called during userland’s DRM_IOCTL_MODE_ATOMIC path. Userland always takes the locks. The warnings firing meant we were re-entering check from somewhere else, and the previous commit had already wedged (hw_done timed out is from drm_atomic_helper_wait_for_dependencies — sitting 10 seconds for the prior commit before kicking off the next one). That’s a symptom, not the cause.

▸ hypothesis 2

Look at what runs between a successful pair and the wedged one. The atomic commit calls into our rk_vop_plane_atomic_update. That function had — at line 197 — this:

if (!plane->state->visible)
    panic("plane is not visible");

When sway drops a visible window or the cursor moves off-screen, the helper still calls atomic_update on the now-invisible plane (Linux’s helper skips it; FreeBSD’s helper does not). The first 11 commits had visible planes. The 12th had an invisible cursor plane. panic() ran. The kernel started the panic path concurrent with another core still inside drm_atomic_helper_wait_for_dependencies — which is where the timeout warnings came from. The NULL deref was the panic handler unwinding through partially-released DRM state. The lock-assertion warnings were the tail end of the same panic, dumped in non-deterministic order.

▸ breakthrough

The panic("plane is not visible") was placeholder code marked /* TODO */, a leftover from when the driver was first written and the author hadn’t decided how to handle plane disable yet. Under normal use (one window, no cursor moves) it never triggered. Under sway theme reload it triggered on every invocation. Once you see the /* TODO */, the rest of the analysis is post-hoc.

▸ fix

Three coordinated changes, currently in HEAD (Revert "fusb302: wait for chip BMC ACTIVITY=0 before TX_START"):

  1. rk_vop_plane_atomic_update: replace the panic with a graceful “gate the window off” path — if state->fb == NULL || !state->visible, clear WIN0_CTRL0_EN (or WIN2_CTRL0_EN for the cursor plane), latch via REG_CFG_DONE, return. No more panic on theme reload.
  2. rk_vop_plane_atomic_disable: previously a stub. Now it actually disables the window. Otherwise a freed framebuffer keeps scanning out until something else writes the plane registers.
  3. rk_crtc_atomic_begin / rk_crtc_atomic_flush: defensive crtc->state == NULL guard so a future race in the helper can’t NULL-deref through us. Also dropped the per-call device_printf that was emitting two lines per commit and helped flood the serial console during the failure.
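The first change can be sketched as a userland model. This is not the rk_vop code: `toy_vop`, `toy_plane_state`, and `toy_plane_atomic_update` are hypothetical, the enable-bit position is an assumption, and a counter stands in for the REG_CFG_DONE latch write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_WIN_EN (1u << 0)   /* assumed position of the window-enable bit */

/* The two pieces of plane state the gate looks at. */
struct toy_plane_state {
    const void *fb;
    bool visible;
};

/* Fake register file: one control register plus a latch counter. */
struct toy_vop {
    uint32_t win_ctrl0;
    unsigned cfg_done_writes;
};

/*
 * Plane update that gates the window off instead of panicking when
 * there is no framebuffer or the plane is not visible.
 */
void toy_plane_atomic_update(struct toy_vop *vop,
                             const struct toy_plane_state *st)
{
    assert(vop != NULL);

    if (st == NULL || st->fb == NULL || !st->visible) {
        vop->win_ctrl0 &= ~TOY_WIN_EN;   /* stop scanning this window out */
        vop->cfg_done_writes++;          /* latch the change */
        return;
    }

    /* Normal path: real code would program address, stride, format here. */
    vop->win_ctrl0 |= TOY_WIN_EN;
    vop->cfg_done_writes++;
}
```

The worst outcome of a wrong decision in the gated branch is a window that stays dark for a frame, which is recoverable in a way the old panic was not.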

Separately, clk_link.c got a ratecheck() so the clkin_gmac warnings stop drowning out real diagnostics.

▸ lesson

A panic() left in a driver under a /* TODO */ comment is a time bomb. It works fine until the load pattern that exercises that branch shows up — and then it doesn’t just fail, it brings the whole machine down with no recovery. Replace panic() with a printf and a “best effort” handler the first time you write the driver; come back and do it properly later. The cost of a wrong best-effort behavior is at most a visual glitch; the cost of a wrong panic() is “user can’t even get a backtrace before the phone reboots.” For backtraces post-mortem, see the new GPU debugging recipe — savecore-to-swap, dmesg snapshots, and the serial-capture script.

[WAR STORY]

Patch 6 — close glxgears without freeing the MMU underneath it

panfrost / GEM / MMU lifetime

▸ symptom

After the scheduler and clock fixes, glxgears finally behaved like a real GPU workload: smooth, hardware-rendered, and stable for as long as it was left running. Then closing the window rebooted the phone. The panic was a Fatal data abort after process exit, with a low fault address consistent with a NULL-derived or freed-structure dereference.

▸ hypothesis 1

The GPU was still executing work at close. That had been true earlier, but kernel #77 counters before the close-path fix told a different story: job submits and completions matched, sc->jobs[] was idle, no MMU faults were recorded, and the renderer was stable while the process lived. The failure moved from “GPU cannot run the workload” to “driver teardown frees something still referenced by asynchronous DRM objects.”

▸ breakthrough

Linux Panfrost does not let a GEM mapping or scheduled job point directly at file-private MMU storage without owning its lifetime. It gives the MMU context its own refcount. Our FreeBSD port kept the MMU embedded in struct panfrost_file, then panfrost_postclose() destroyed scheduler entities, freed the page table, tore down drm_mm, and freed pfile immediately. Jobs still had job->pfile; mappings still had mapping->mmu = &pfile->mmu. That is exactly the use-after-free shape the panic showed.

▸ fix

be91afa Refcount Panfrost file lifetime ports the important part of Linux’s lifetime model into the current FreeBSD shape. struct panfrost_file now has a refcount; panfrost_postclose() destroys scheduler entities and drops only the file’s own reference; GEM mappings take a file reference when created and release it after panfrost_gem_teardown_mapping(); submitted jobs take a file reference once the scheduler owns them and release it in panfrost_job_cleanup(). The final panfrost_mmu_pgtable_free(), drm_mm_takedown(), lock destroy, and free(pfile) happen only on the last reference.

That was deployed as kernel #77 and tested on hardware:

OpenGL renderer string: Mali-T860 (Panfrost)
1381 frames in 5.0 seconds = 276.179 FPS
1394 frames in 5.0 seconds = 278.665 FPS
post-close: job_cnt=0, sc->jobs[0..2]=0, irq_fault=0, mmu_faults=0, resets=0

▸ lesson

File close is not a synchronous “nothing can touch this anymore” boundary in a DRM driver. Fences, scheduler jobs, GEM handles, BO mappings, and MMU contexts all have separate lifetimes. If any object stores a pointer into file-private memory, either that memory needs its own refcount or close must explicitly drain every possible asynchronous user before freeing it. Upstream Panfrost already encoded that lesson; the FreeBSD port needed to preserve it.
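The lifetime rule can be shown with a toy refcount. This is a single-threaded userland sketch, not the driver's implementation: `toy_file` and its helpers are hypothetical, the `freed_flag` exists only so the example can observe the free, and a real driver would need an atomic counter.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for per-file GPU state that jobs and mappings point into. */
struct toy_file {
    int refs;
    int *freed_flag;   /* test hook: set to 1 when the file is freed */
};

/* Open: the file descriptor itself owns the first reference. */
struct toy_file *toy_file_open(int *freed_flag)
{
    struct toy_file *f = malloc(sizeof(*f));

    if (f == NULL)
        return NULL;
    f->refs = 1;
    f->freed_flag = freed_flag;
    *freed_flag = 0;
    return f;
}

/* Taken by each mapping and each submitted job. */
void toy_file_get(struct toy_file *f)
{
    assert(f->refs > 0);
    f->refs++;
}

/* Dropped by postclose, mapping teardown, and job cleanup. */
void toy_file_put(struct toy_file *f)
{
    assert(f->refs > 0);
    if (--f->refs == 0) {
        /* Last user gone: only now is page-table teardown safe. */
        *f->freed_flag = 1;
        free(f);
    }
}
```

Closing the file while one mapping and one job are outstanding leaves the structure alive; it is freed only when the last of the three holders drops its reference, regardless of the order in which they finish.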

[WAR STORY]

Patch 7 — the modeset-lock wedge under sustained load (open)

drm core / rk_vop

▸ symptom

After patches 1–6, light desktop work is stable. But a heavier load — Hyprland compositing while pkg install runs, or a dd benchmark over WiFi while the screen is being repainted — reliably wedges the phone within a few minutes. SSH, USB networking, and the display all stop at the same time. A photograph of the framebuffer captures the post-mortem (the kernel is too sick to schedule shutdown, and the serial console is part of what’s wedged):

[Figure: DRM modeset-lock wedge, framebuffer post-mortem. Reproduced under sustained Hyprland + WiFi load. The aarch64 register dump above the warnings is the actual root cause; the modeset-lock WARNs are downstream debris.]

▸ phone · crash · 2026-04-29

The signature has two parts. First, an aarch64 Fatal data abort register dump (x0..x29, sp, lr, elr, spsr, far). Then a cascade of WARNs from the DRM core:

WARNING !list_empty(&lock->head) failed at .../drm/core/drm_modeset_lock.c:268
WARNING drm_modeset_is_locked(&crtc->mutex) failed at .../drm/core/drm_atomic_helper.c:617
WARNING drm_modeset_is_locked(&dev->mode_config.connection_mutex) failed at .../drm/core/drm_atomic_helper.c:667
WARNING drm_modeset_is_locked(&plane->mutex) failed at .../drm/core/drm_atomic_helper.c:892
(drm) ERROR: [CRTC:33:crtc-0]    hw_done timed out
(drm) ERROR: [CRTC:33:crtc-0]    flip_done timed out
(drm) ERROR: [CONNECTOR:35:DSI-1] hw_done timed out
(drm) ERROR: [CONNECTOR:35:DSI-1] flip_done timed out
(drm) ERROR: [PLANE:31:plane-0]   hw_done timed out
(drm) ERROR: [PLANE:32:plane-1]   hw_done timed out

The order matters. The data abort fires first; the modeset warnings are downstream debris from a half-completed atomic commit whose threads couldn’t finish unwinding. Distinct from the panic("plane is not visible") we fixed in patch 5 — that one fired on a clean code path. This one is preceded by a corrupted control-flow trap.

▸ hypothesis 1

A panfrost lifetime regression — patch 6 fixed glxgears close, but maybe Hyprland’s pattern (rapid reload cycles, multi-window GL contexts) hits a different teardown race. Tentative; the register dump should pin the function once we can decode it.

▸ hypothesis 2

The atomic-commit pipeline itself, not panfrost. The 2026-04-30 cross-driver audit (see appendix: cross-driver audit) walked rk_drm.c::rk_drm_atomic_commit_tail, rk_vop.c::rk_crtc_atomic_begin/flush, and rk_vop_intr against Linux mainline drm_atomic_helper_commit_tail_rpm and rockchip_drm_vop.c::vop_crtc_atomic_flush. Three structural divergences, all in the same code path:

  1. rk_drm.c:147-165 deliberately omits drm_atomic_helper_wait_for_vblanks between commit_hw_done and cleanup_planes. A later commit’s wait_for_dependencies then sees a commit->flip_done that nobody completes. This matches the flip_done timed out line of the wedge exactly.
  2. rk_vop.c:280 INTR_CLEAR0 = ~0 clobbers any FS_INTR that arrives between the status read and the ack. Frame completions race the ack and are lost — vblank counter advances stop, the CRTC commit chain stalls.
  3. Both rk_crtc_atomic_begin and rk_crtc_atomic_flush handle state->event with no driver-side vop->event latch (Linux mainline keeps the event on the driver between flush and the IRQ handler that drains it). vblank_get is taken in flush at line 470 with no matching vblank_put reachable from the IRQ path.

Each of those is independently a “pipeline stalls under load” bug. Two or three together explain why heavier compositing reproduces the wedge in minutes.
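The second divergence, the lost frame-start interrupt, is easy to reproduce in a model. The sketch below is a userland simulation with a write-1-to-clear status word: `toy_intr`, `toy_handle_irq`, and the bit positions are assumptions for illustration, not the VOP register layout.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_LINE_INTR (1u << 1)   /* assumed bit: some unrelated interrupt */
#define TOY_FS_INTR   (1u << 5)   /* assumed bit: frame-start interrupt */

/* Write-1-to-clear interrupt status, as seen by the handler. */
struct toy_intr {
    uint32_t pending;
};

static uint32_t toy_read_status(const struct toy_intr *r)
{
    return r->pending;
}

static void toy_write_clear(struct toy_intr *r, uint32_t mask)
{
    r->pending &= ~mask;
}

/*
 * One pass of the handler. late_bits models interrupts that the
 * hardware raises after the status read but before the ack.
 * Returns the bits the handler actually saw and will service.
 */
uint32_t toy_handle_irq(struct toy_intr *r, uint32_t late_bits, int ack_all)
{
    uint32_t status = toy_read_status(r);

    r->pending |= late_bits;                     /* race window */
    toy_write_clear(r, ack_all ? ~0u : status);  /* ~0 clobbers late bits */
    return status;
}
```

With `ack_all` set, a frame-start bit that arrives in the race window is cleared without ever being serviced; acking only the bits that were read leaves it pending for the next pass.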

▸ lesson

The fix-the-symptom path is tempting — WARN_ON at drm_modeset_lock.c:268 is just a “lock list looks weird” warning, and we could try to silence it. But the register dump above the warnings says the kernel already trapped through a corrupted pointer; muting the warnings would lose the only post-mortem we have.

The first two code-side fixes are now in the overlay: wait_for_vblanks is back in the commit tail and rk_vop_intr acks only the status bits it read. The page-flip event latch between atomic_flush and the FS_INTR vblank IRQ is still open. Next bench session: build and run scripts/wedge-repro with the gated WARN_ON kdb_backtrace + panic instrumentation from e22bae6 (drm: gate WARN_ON kdb_backtrace+panic behind sysctls for wedge debugging). If the wedge persists, capture dev.rk_vop.0.vop_event_latched_total, dev.rk_vop.0.fs_intr_total, and dev.rk_vop.0.vblank_refs_outstanding alongside the backtrace and split the next investigation between “event never latched”, “FS_INTR stopped”, and “atomic commit waited on a non-VOP fence.”