08 · gpu

GPU under load: four patches for Panfrost

GPU reset is the nuclear option. It touches the scheduler, the IOMMU, the display, and the IRQ subsystem. Choreograph or freeze.

◐ partial

The PinePhone Pro has a Mali-T860 MP4 GPU. FreeBSD’s panfrost driver — ported from Linux’s drivers/gpu/drm/panfrost/ and living in our overlay at src/sys/dev/drm/panfrost/ — handles GPU jobs, GEM buffer allocation, and the IOMMU page tables that map userland buffers into GPU virtual memory. For light loads (sway compositing, gtk apps, single-window video) it is solid. For heavy loads (Hyprland with animations, glmark2, anything pushing real fragment shader workload through the chip) it would crash the entire system inside the first minute, taking USB networking, the display, and SSH down with it.

The crash isn’t one bug. It’s four bugs that only manifest together when a GPU job times out. A timeout triggers a GPU reset, and the reset path threads through the scheduler, the IOMMU, the display compositor, and the IRQ controller — each of which is doing its own thing concurrently with the failing GPU. Every one of those subsystems had its own race. Four patches landed across two days (April 6 → April 7) to choreograph the reset.

This essay is not a single arc. It’s four small ones, each fixing a different mode of the same crash.

[WAR STORY]

Patch 1 — inline GPU reset, not deferred

panfrost / scheduler

▸ symptom

glmark2 runs for ~40 seconds. A complex shader job overruns the 5-second job timeout in panfrost_job_timedout. Serial console shows panfrost: GPU job timeout, queueing reset task. Then nothing. The display freezes mid-frame. SSH dies. Phone has to be hard-rebooted via the side button.

▸ hypothesis 1

The reset task is taking too long once it runs. Add some logging in the reset path. Boot, run, hang, reboot, read serial. The reset task never started. The taskqueue handler was queued but the queue was wedged.

▸ breakthrough

The DRM scheduler stops itself when a job times out — drm_sched_stop parks the scheduler kthread for the affected slot. Concurrently, any new DRM atomic commits from the compositor block in drm_atomic_helper_wait_for_vblanks, which is waiting on a fence that the parked scheduler can no longer signal. The taskqueue we deferred the reset to runs on the same kthread pool that the scheduler runs on. The reset task gets queued behind the parked scheduler entry. It never runs. Deadlock.

▸ fix

Match Linux behavior: do the entire GPU reset inline in the timedout_job callback, before returning. eb0e2d1 panfrost: inline GPU reset in timedout callback, fix build config implements panfrost_reset_nolock exactly as Linux does, with a concrete ordering:

1. Stop all schedulers — userland can't submit anything new.
2. JOB_INT_MASK = 0 — the IRQ handler can't re-enter the reset path.
3. Soft-stop any jobs still on the hardware.
4. Hard-reset the GPU (the actual device_reset).
5. Re-submit the pending jobs.
6. Re-arm the schedulers.

[WAR STORY]

Patch 2 — page-table teardown before MMU reset

panfrost / iommu

▸ symptom

After the inline-reset patch, the system survives most timeouts. But occasionally — and reliably under sustained heavy load — postclose (the path that runs when a GPU client closes its DRM file descriptor) panics with “l3e found (indexes 1 5 12 7)”. The stack trace points at smmu_pmap_remove_pages in sys/arm64/iommu/iommu_pmap.c: the IOMMU still had live L3 page-table entries when we tried to release the page table. There is a KASSERT for exactly this case.

▸ hypothesis 1

Bug in smmu_pmap_remove_pages. We could weaken the assertion — clear the entries instead of panicking. d50072f iommu: clear stale L3 entries instead of panicking in smmu_pmap_remove_pages did exactly that, and it stayed in the tree for several days. It “worked” in the sense that the panic stopped, but the GPU now leaked physical memory: every closed client left its mapped pages pinned forever.

▸ hypothesis 2

The IOMMU code is right; we’re calling it wrong. panfrost_postclose was calling panfrost_mmu_pgtable_free directly on a page table that still had user buffers mapped into GPU VA. The Linux driver walks drm_mm (the GPU-side allocator) and explicitly unmaps every node before tearing down the page table. We weren’t doing that.

▸ fix

e275d21 panfrost: properly unmap all GPU pages in pgtable_free (fix postclose panic) walks the drm_mm allocator first, calls pmap_gpu_remove for each node, then tears down the page table. The IOMMU assertion stays in place — it is a real invariant and weakening it just hid this bug.

drm_mm_for_each_node(node, &pfile->mm) {
    len = node->size << PAGE_SHIFT;
    pmap_gpu_remove(&mmu->p, node->start << PAGE_SHIFT, len);
}
smmu_pmap_remove_pages(&mmu->p);
smmu_pmap_release(&mmu->p);

Reverted the iommu_pmap weakening as part of the same commit. Don’t paper over driver bugs by relaxing IOMMU safety checks.

▸ lesson

GPU drivers and IOMMU drivers have a contract: the GPU driver must tear down its mappings before releasing the page tables. The IOMMU layer enforces this with assertions. When the assertion fires, the bug is in the GPU driver, not the IOMMU.

[WAR STORY]

Patch 3 — skip wait_for_vblanks during GPU reset

rk_drm / panfrost

▸ symptom

With patches 1 and 2 in place, the system mostly survives GPU timeouts — kernel doesn’t panic, GPU resets, jobs resubmit. But after a timeout, the display sometimes freezes for 20+ seconds before recovering, and during that window USB networking dies (host can’t reach 10.0.0.2). Eventually the system recovers, but a 20-second freeze is unacceptable on a phone.

▸ hypothesis 1

The GPU is in a degraded mode after reset. Maybe the next compositor frame is taking the slow path. Added timing to commit_tail, logging each phase of the atomic commit. The result was unambiguous: commit_hw_done returned in microseconds; wait_for_vblanks sat for the full 20 seconds.

▸ breakthrough

The compositor calls drm_atomic_helper_wait_for_vblanks to confirm the new frame has been scanned out. Inside it, the helper waits on the same scheduler fence machinery the GPU driver uses. When drm_sched_stop parks the scheduler thread mid-reset, anything waiting on that thread’s fences blocks until the thread restarts. Our compositor was waiting on a fence whose backing thread was parked.

We don’t actually need wait_for_vblanks — we send the vblank event immediately in atomic_begin (essay 7’s commit 736d191). The wait is pure overhead, and it deadlocks when the GPU is being reset.

▸ fix

604bd8f rk_drm: skip wait_for_vblanks in commit_tail to prevent GPU reset deadlock deletes the drm_atomic_helper_wait_for_vblanks call from rk_drm_atomic_commit_tail. The function is now five short helper calls in a row with no waits. Display freezes during GPU reset dropped from 20 seconds to under one second.

[WAR STORY]

Patch 4 — remove MMU from in-use list before teardown

panfrost / mmu / IRQ

▸ symptom

Even with patches 1-3, we get a low-rate crash signature: NULL dereference at offset 0x20 in panfrost_mmu_irq_handler. Specifically, pfile->mm_lock (which is at offset 0x20 in struct panfrost_file) is being accessed on a pfile that’s been freed.

▸ hypothesis 1

Use-after-free in the GEM object teardown. We had real bugs in gem_create_object_with_handle (0ed920a panfrost: fix use-after-free in gem_create_object_with_handle) and the gem_open error path (bdd62bb panfrost: add debug logging to gem_free_object and gem_open error path). Neither matched the offset-0x20 pattern. Different bug.

▸ hypothesis 2

The MMU IRQ handler is racing pgtable teardown. Look at the order of operations in panfrost_mmu_pgtable_free:

1. drm_mm_for_each_node — unmap each page region
2. smmu_pmap_remove_pages — release page-table pages
3. smmu_pmap_release — release the pmap itself
4. TAILQ_REMOVE(&sc->mmu_in_use, mmu, next) — unregister

During steps 1-3, an MMU fault can arrive. The IRQ handler walks sc->mmu_in_use to find the pfile for the faulting AS. It finds our half-torn-down mmu. Dereferences pfile->mm_lock. Pfile is mostly-freed memory. NULL or junk at offset 0x20. Crash.

▸ breakthrough

The IRQ handler treats mmu_in_use as the source of truth for “which mappings are live.” We treated it as cleanup. Wrong. Whatever’s on mmu_in_use will be touched by the IRQ handler. Remove it first, then tear down — not the other way around.

▸ fix

9bd776a panfrost: remove mmu from in-use list before teardown to fix IRQ race swaps the order. Before unmapping anything, hold as_mtx and remove the mmu from mmu_in_use. After that point the IRQ handler cannot find this pfile, so the teardown is safe to proceed. This is a pure ordering fix — no new locks, no new fields. The diff is twelve lines.

mtx_lock_spin(&sc->as_mtx);
if (mmu->as >= 0) {
    sc->as_alloc_set &= ~(1 << mmu->as);
    TAILQ_REMOVE(&sc->mmu_in_use, mmu, next);
}
mtx_unlock_spin(&sc->as_mtx);

/* Now safe to unmap — MMU IRQ handler won't find us. */
drm_mm_for_each_node(node, &pfile->mm) {
    len = node->size << PAGE_SHIFT;
    pmap_gpu_remove(&mmu->p, node->start << PAGE_SHIFT, len);
}
smmu_pmap_remove_pages(&mmu->p);
smmu_pmap_release(&mmu->p);

A fifth fix landed alongside: f930a3d panfrost: signal pending fences during GPU reset signals pending DRM scheduler fences during reset with dma_fence_set_error(-ETIMEDOUT) then dma_fence_signal. Without it, atomic commits waiting on in-flight GPU work hang forever after the reset — a slower-motion patch-3 deadlock. Iterate sched->ring_mirror_list per slot and signal anything unsignaled.

▸ lesson

Hardware timeout recovery is the worst path in the driver. It runs concurrently with everything else, it can’t take long, and it has to coordinate with subsystems whose authors have never thought about your reset semantics. The bugs you ship in normal-path code will be obvious; the bugs you ship in the timeout path will hide for weeks because they only fire when the GPU misbehaves and someone is also doing something else with the screen. Test reset paths under production load, with the compositor running, USB networking active, and concurrent GEM allocations — not in isolation.

Per the project’s GPU-status memory note (project_gpu_crash_fixes.md): four patches landed; the system is stable for normal use; system freeze on heavy GPU load is not fully solved. There’s still a failure mode where USB networking dies even though the kernel doesn’t panic. The reset succeeds, the schedulers restart, but something downstream — possibly an unsignaled fence we haven’t found yet, possibly a stuck compositor — keeps the display chain wedged. Investigation continues.

The driver runs sway, gtk apps, mpv playback, and most of glmark2 reliably. Hyprland with animations is borderline. Pure stress tests (glmark2 terrain at full quality, or Chromium with WebGL benchmarks) can still wedge the system. ◐ partial reflects that bound.

[WAR STORY]

Patch 5 — the atomic_helper NULL deref under sway theme reload

rk_drm / rk_plane

▸ symptom

Reloading a sway theme — which fires a burst of atomic commits in under 100 ms — would reliably reboot the phone. Captured via serial photography (the boot console is gone after the CRU hack, so we can’t tee boot output, but post-boot panic frames do appear on serial briefly before reboot):

rk_vop@: atomic_begin: event=0xffff... call #1
rk_vop@: atomic_flush:  event=0           call #1
... (eleven good pairs)
rk_vop@: atomic_begin: event=0xffff... call #12
Fatal data abort:
  far: 0                         ← faulting address = NULL
  esr: 0x96000004                ← translation fault, level 4
WARNING [list_empty(&lock->head)]                 failed in drm_modeset_lock.c
WARNING [drm_modeset_is_locked(&crtc->mutex)]     failed at drm_atomic_helper.c:617
WARNING [drm_modeset_is_locked(&plane->mutex)]    failed at drm_atomic_helper.c:892
[drm] *ERROR* [CRTC:33:crtc-0] hw_done timed out
[drm] *ERROR* [PLANE:31:plane-0] flip_done timed out

The wall of “clkmode_link_recalc: Attempt to use unresolved linked clock: clkin_gmac” lines around the panic is incidental: the GMAC clock is referenced by a node we don’t use, and clk_link.c was unconditionally printf-ing on every call. With sway hammering atomic commits the message rate hit hundreds per second and obscured the actual fault. (Fixed separately by rate-limiting the clknode_link_* warnings — see the patches index.)

▸ hypothesis 1

The lock-assertion warnings (lines 617 and 892) made me chase a lock-ordering bug for a while: maybe a workqueue was running atomic_check without holding mode_config.connection_mutex. But those lines are in drm_atomic_helper_check_modeset and _check_planes, called during userland’s DRM_IOCTL_MODE_ATOMIC path. Userland always takes the locks. The warnings firing meant we were re-entering check from somewhere else, and the previous commit had already wedged (hw_done timed out is from drm_atomic_helper_wait_for_dependencies — sitting 10 seconds for the prior commit before kicking off the next one). That’s a symptom, not the cause.

▸ hypothesis 2

Look at what runs between a successful pair and the wedged one. The atomic commit calls into our rk_vop_plane_atomic_update. That function had — at line 197 — this:

if (!plane->state->visible)
    panic("plane is not visible");

When sway drops a visible window or the cursor moves off-screen, the helper still calls atomic_update on the now-invisible plane (Linux’s helper skips it; FreeBSD’s helper does not). The first 11 commits had visible planes. The 12th had an invisible cursor plane. panic() ran. The kernel started the panic path concurrent with another core still inside drm_atomic_helper_wait_for_dependencies — which is where the timeout warnings came from. The NULL deref was the panic handler unwinding through partially-released DRM state. The lock-assertion warnings were the tail end of the same panic, dumped in non-deterministic order.

▸ breakthrough

The panic("plane is not visible") was placeholder code marked /* TODO */ — a leftover from when the driver was first written and the author hadn’t decided how to handle plane disable yet. Under normal use (one window, no cursor moves) it never triggered. Under sway theme reload it triggered every invocation. Once you see the /* TODO */, the rest of the analysis is post-hoc.

▸ fix

Three coordinated changes in HEAD drm: fix rk_plane panic that triggered DRM atomic_helper NULL deref:

  1. rk_vop_plane_atomic_update: replace the panic with a graceful “gate the window off” path — if state->fb == NULL || !state->visible, clear WIN0_CTRL0_EN (or WIN2_CTRL0_EN for the cursor plane), latch via REG_CFG_DONE, return. No more panic on theme reload.
  2. rk_vop_plane_atomic_disable: previously a stub. Now it actually disables the window. Otherwise a freed framebuffer keeps scanning out until something else writes the plane registers.
  3. rk_crtc_atomic_begin / rk_crtc_atomic_flush: defensive crtc->state == NULL guard so a future race in the helper can’t NULL-deref through us. Also dropped the per-call device_printf that was emitting two lines per commit and helped flood the serial console during the failure.

Separately, clk_link.c got a ratecheck() so the clkin_gmac warnings stop drowning out real diagnostics.

▸ lesson

A panic() left in a driver under a /* TODO */ comment is a time bomb. It works fine until the load pattern that exercises that branch shows up — and then it doesn’t just fail, it brings the whole machine down with no recovery. Replace panic() with a printf and a “best effort” handler the first time you write the driver; come back and do it properly later. The cost of a wrong best-effort behavior is at most a visual glitch; the cost of a wrong panic() is “user can’t even get a backtrace before the phone reboots.” For backtraces post-mortem, see the new GPU debugging recipe — savecore-to-swap, dmesg snapshots, and the serial-capture script.