Appendix · story

Work log: GPU stress, thermal guard, and Panfrost DVFS

Sway-visible glxgears, glmark2, and the first OPP-aware Panfrost governor.

2026-05-05 — visible Sway GPU stress

The first false start was package drift: installing mesa-demos and glmark2 pulled in the stock libdrm, which broke Mesa’s FreeBSD platform device enumeration, so Mesa fell back toward the wrong renderer path. Reinstalling the patched aarch64 package from /tmp/libdrm-2.4.131,1.pkg and locking it restored the expected strings in /usr/local/lib/libdrm.so.2: hw.dri.%d.busid, hw.dri.%d.name, and platform:.
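The log does not say how the expected strings were checked. Here is a minimal C sketch of that check. It assumes a plain byte search of the library file is enough: it reads the file and reports whether all three markers listed above are present. The function names (buf_contains, looks_like_patched_libdrm, check_libdrm_file) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Marker strings the patched libdrm is expected to contain (from the log). */
static const char *const libdrm_markers[] = {
    "hw.dri.%d.busid",
    "hw.dri.%d.name",
    "platform:",
};

/* Binary-safe substring search: true if needle occurs anywhere in buf. */
bool buf_contains(const unsigned char *buf, size_t len, const char *needle)
{
    size_t n = strlen(needle);
    if (n == 0)
        return true;
    for (size_t i = 0; i + n <= len; i++) {
        if (memcmp(buf + i, needle, n) == 0)
            return true;
    }
    return false;
}

/* True only if every marker string is present in the buffer. */
bool looks_like_patched_libdrm(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < sizeof libdrm_markers / sizeof libdrm_markers[0]; i++) {
        if (!buf_contains(buf, len, libdrm_markers[i]))
            return false;
    }
    return true;
}

/* Returns 1 if the file has all markers, 0 if not, -1 on I/O error. */
int check_libdrm_file(const char *path)
{
    assert(path != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    if (fseek(f, 0, SEEK_END) != 0) {
        fclose(f);
        return -1;
    }
    long size = ftell(f);
    if (size < 0 || fseek(f, 0, SEEK_SET) != 0) {
        fclose(f);
        return -1;
    }
    unsigned char *buf = malloc(size > 0 ? (size_t)size : 1);
    if (buf == NULL) {
        fclose(f);
        return -1;
    }
    size_t got = fread(buf, 1, (size_t)size, f);
    fclose(f);
    int result = looks_like_patched_libdrm(buf, got) ? 1 : 0;
    free(buf);
    return result;
}
```

Run against the reinstalled /usr/local/lib/libdrm.so.2, check_libdrm_file should return 1 if the patched package is in place. A stock library that presumably lacks these FreeBSD strings would return 0.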

Sway then needed a restart so that the lazily started Xwayland would come up against the repaired library stack. After that, glxinfo -B run from a swaymsg exec context reported:

OpenGL renderer string: Mali-T860 (Panfrost)
OpenGL version string: 3.1 Mesa 24.1.7
OpenGL ES profile version string: OpenGL ES 3.1 Mesa 24.1.7

vblank_mode=0 glxgears -info ran as a visible Xwayland window on the phone. It printed five 5-second FPS samples:

1332.359
1342.642
1338.187
1340.760
1341.549

Average: 1339.099 FPS. The renderer line was Mali-T860 (Panfrost). During the run, hw.clock.aclk_gpu.frequency stayed at 384000000 (384 MHz), hw.regulator.vdd_gpu.uvolt stayed at 1000000 (1.0 V), and TSADC reported the GPU temperature rising from roughly 63.1 C to 66.7 C.
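The average is the plain arithmetic mean of the five samples. The small C helpers below reproduce it and also report the spread between the slowest and fastest sample. The helper names are illustrative and do not come from the original tooling.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n FPS samples; n must be non-zero. */
double mean_fps(const double *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return sum / (double)n;
}

/* Difference between the fastest and the slowest sample. */
double spread_fps(const double *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    double lo = samples[0], hi = samples[0];
    for (size_t i = 1; i < n; i++) {
        if (samples[i] < lo)
            lo = samples[i];
        if (samples[i] > hi)
            hi = samples[i];
    }
    return hi - lo;
}
```

With the five samples above, mean_fps returns about 1339.0994, which rounds to the logged 1339.099. spread_fps returns about 10.283, under 1% of the mean.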

The long glmark2 --size 420x760 pass was intentionally guarded. It ran 16 scenes before the monitor killed it at a GPU temperature of 78.125 C. Selected scene results and the stop line:

[build] use-vbo=false: FPS: 429
[texture] texture-filter=nearest: FPS: 1253
[bump] bump-render=high-poly: FPS: 263
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 202
thermal stop at cpu=72777 gpu=78125
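The log shows only the guard's output, with readings in millidegrees Celsius; its implementation and thresholds are not given. Below is a minimal sketch of the stop decision and the stop line. It assumes a simple at-or-above limit per sensor. The limit values in the usage note are placeholders, and the actual stop step (for example, sending SIGTERM to glmark2 with kill(2)) is left out. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical limits in millidegrees Celsius; the log gives no thresholds. */
struct guard_limits {
    int cpu_max_mc;
    int gpu_max_mc;
};

/* Stop the benchmark once either sensor reaches its limit. */
bool guard_should_stop(const struct guard_limits *lim, int cpu_mc, int gpu_mc)
{
    assert(lim != NULL);
    return cpu_mc >= lim->cpu_max_mc || gpu_mc >= lim->gpu_max_mc;
}

/* Emit the stop message in the same shape as the logged line. */
int guard_format_stop(char *out, size_t len, int cpu_mc, int gpu_mc)
{
    return snprintf(out, len, "thermal stop at cpu=%d gpu=%d", cpu_mc, gpu_mc);
}

/* Parse such a line back into its two readings; true on success. */
bool guard_parse_stop(const char *line, int *cpu_mc, int *gpu_mc)
{
    return sscanf(line, "thermal stop at cpu=%d gpu=%d", cpu_mc, gpu_mc) == 2;
}
```

With placeholder limits { 80000, 78000 }, the logged readings cpu=72777 gpu=78125 trip the GPU limit, and guard_format_stop reproduces the logged line.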

No panfrost fault appeared in dmesg; the only relevant kernel noise was the TSADC comparator storm guard disarming COMP_INT from the still-live comparator test state, plus one DRM vblank warning.

The complete score came from a shorter, still visible pass:

vblank_mode=0 glmark2 --size 320x480 -b :duration=1 --results fps:cpu:shader

That run completed with glmark2 Score: 859. It topped out near GPU 72.2 C, while the GPU clock and regulator again stayed fixed at 384 MHz and 1.0 V.

◐ partial GPU acceleration was real enough for Sway, glxgears, and a complete short glmark2 pass. At this point in the day, the missing pieces were GPU DVFS/OPP policy and a thermal policy that could run the default glmark2 duration without a hand-written guard.

2026-05-05 — OPP-aware Panfrost DVFS

The next pass wired Panfrost to the RK3399 GPU OPP table and the mali-supply regulator instead of treating gpu_clock_mhz as a bare clock write. Commit 59bde6a (panfrost: add OPP-aware GPU DVFS) added the OPP parser, voltage-safe clock transitions, thermal caps, and these runtime sysctls:
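The commit's code is not shown here. "Voltage-safe clock transitions" usually means the standard DVFS ordering: raise the regulator before raising the clock, and lower the clock before lowering the regulator. The C sketch below implements that ordering. The struct, hook types, and function names are hypothetical stand-ins for the driver's clock and regulator calls. It also assumes voltage never decreases as frequency increases, which holds for the OPP table dumped later in this log.

```c
#include <assert.h>
#include <stddef.h>

/* One operating performance point: clock rate and regulator target. */
struct opp {
    unsigned long freq_hz;
    int target_uv;
};

enum dvfs_step { STEP_SET_VOLTAGE, STEP_SET_CLOCK };

/*
 * Order the two hardware writes so the GPU is never clocked faster than its
 * current voltage supports: raise voltage before raising the clock, and lower
 * the clock before lowering voltage. Returns the number of steps (0 = no-op).
 */
size_t plan_opp_transition(const struct opp *cur, const struct opp *next,
                           enum dvfs_step out[2])
{
    assert(cur != NULL && next != NULL && out != NULL);
    if (cur->freq_hz == next->freq_hz && cur->target_uv == next->target_uv)
        return 0;
    if (next->freq_hz > cur->freq_hz) {
        out[0] = STEP_SET_VOLTAGE;
        out[1] = STEP_SET_CLOCK;
    } else {
        out[0] = STEP_SET_CLOCK;
        out[1] = STEP_SET_VOLTAGE;
    }
    return 2;
}

/* Hypothetical hardware hooks standing in for clock and regulator calls. */
typedef int (*set_uv_fn)(void *ctx, int uv);
typedef int (*set_hz_fn)(void *ctx, unsigned long hz);

/*
 * Apply the plan, stopping at the first failure. With a monotonic table the
 * first step of either ordering is safe on its own, so a partial transition
 * leaves the voltage at least as high as the current clock needs.
 */
int apply_opp_transition(void *ctx, set_uv_fn set_uv, set_hz_fn set_hz,
                         const struct opp *cur, const struct opp *next)
{
    enum dvfs_step steps[2];
    size_t n = plan_opp_transition(cur, next, steps);
    for (size_t i = 0; i < n; i++) {
        int err = (steps[i] == STEP_SET_VOLTAGE) ? set_uv(ctx, next->target_uv)
                                                 : set_hz(ctx, next->freq_hz);
        if (err != 0)
            return err;
    }
    return 0;
}
```

For example, moving from OPP1 (297 MHz, 825000 uV) to OPP4 (600 MHz, 925000 uV) plans voltage first, then clock. The reverse move plans clock first, then voltage, and a same-OPP request plans nothing.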

dev.panfrost.0.gpu_opp
dev.panfrost.0.gpu_max_auto_mhz
dev.panfrost.0.gpu_dvfs_mode
dev.panfrost.0.gpu_dvfs_state
dev.panfrost.0.gpu_opp_table

The initial governor booted and idled down, but it exposed two real policy bugs. First, it logged every no-op idle transition. Commit 737b881 (panfrost: suppress no-op DVFS transitions) now skips same-OPP transitions. Second, changing OPPs while GPU work was in flight could wedge a visible glxgears run. Commit d3b80a7 (panfrost: avoid active-job DVFS transitions) makes job submission raise the clock before JS_COMMAND_START and prevents the periodic governor from changing OPP while active_jobs > 0.

On kernel #170, the controlled Sway tests passed:

GL_RENDERER = Mali-T860 (Panfrost)
7341 frames in 5.0 seconds = 1468.195 FPS
7239 frames in 5.0 seconds = 1447.637 FPS

The performance-mode run held OPP4 at 594000000 Hz and 925000 uV; GPU temperature rose only from roughly 53.9 C to 56.7 C. The auto-mode run also reached OPP4 under sustained load, without panfrost faults, timeouts, or panics in dmesg.

Two follow-up commits turned that into phone-usable behavior. Commit 70a9c84 (panfrost: quiet DVFS transition logging) makes per-OPP transition logs debug-only (dev.panfrost.0.debug_mask bit 0x40) and extends the idle-down holdoff to reduce churn. Commit 3bfc1d0 (panfrost: avoid max clock for light auto work) fixes the more important idle-power bug: auto mode no longer jumps straight to max-auto for a tiny compositor submission. Light work now wakes the GPU only to the powersave floor, and sustained measured load ramps it up from there.

Kernel #172 verified the final policy on hardware:

dev.panfrost.0.gpu_opp_table:
0: 200 MHz target=825000 min=825000 max=1150000
1: 297 MHz target=825000 min=825000 max=1150000 current
2: 400 MHz target=825000 min=825000 max=1150000
3: 500 MHz target=875000 min=875000 max=1150000
4: 600 MHz target=925000 min=925000 max=1150000 max-auto
5: 800 MHz target=1100000 min=1100000 max=1150000

mode=auto
cur_opp=1
max_auto_opp=4
powersave_opp=1
last_load=0
active_jobs=0
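The dumped table can be carried as a static array. The C sketch below copies the values from the kernel #172 dump. It checks two invariants that a voltage-safe governor depends on, and it maps a MHz cap to an OPP index. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One row of dev.panfrost.0.gpu_opp_table; voltages in microvolts. */
struct gpu_opp {
    unsigned mhz;
    unsigned target_uv;
    unsigned min_uv;
    unsigned max_uv;
};

/* Rows copied from the kernel #172 dump, sorted by frequency. */
static const struct gpu_opp rk3399_gpu_opps[] = {
    { 200,  825000,  825000, 1150000 },
    { 297,  825000,  825000, 1150000 },
    { 400,  825000,  825000, 1150000 },
    { 500,  875000,  875000, 1150000 },
    { 600,  925000,  925000, 1150000 },
    { 800, 1100000, 1100000, 1150000 },
};

#define NUM_GPU_OPPS (sizeof rk3399_gpu_opps / sizeof rk3399_gpu_opps[0])

/* Highest OPP index whose frequency does not exceed cap_mhz; -1 if none. */
int opp_index_for_cap(unsigned cap_mhz)
{
    assert(NUM_GPU_OPPS > 0);
    int best = -1;
    for (size_t i = 0; i < NUM_GPU_OPPS; i++) {
        if (rk3399_gpu_opps[i].mhz <= cap_mhz)
            best = (int)i;
    }
    return best;
}

/* Check table invariants that voltage-safe transitions rely on. */
bool opp_table_valid(void)
{
    for (size_t i = 0; i < NUM_GPU_OPPS; i++) {
        const struct gpu_opp *o = &rk3399_gpu_opps[i];
        if (o->target_uv < o->min_uv || o->target_uv > o->max_uv)
            return false;            /* target outside regulator window */
        if (i > 0 && (o->mhz <= rk3399_gpu_opps[i - 1].mhz ||
                      o->target_uv < rk3399_gpu_opps[i - 1].target_uv))
            return false;            /* must be sorted, voltage non-decreasing */
    }
    return true;
}
```

Under this mapping, the thermal guard's 400 MHz cap described later selects OPP2. The restored 600 MHz cap selects OPP4, which matches the max-auto marker in the dump.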

A 20-second idle sample stayed mostly at OPP1 (297 MHz / 825 mV) with occasional OPP0 (200 MHz / 825 mV) and no spurious OPP4 jumps from light compositor activity. That closes the “fixed-clock Panfrost” gap. What remains is thermal policy: TSADC readout exists and the governor has warm/hot caps, but comparator IRQs, TSHUT, cpufreq throttling, and a real user-visible thermal/battery policy are still separate work.

◐ partial GPU DVFS now works well enough for Sway and short GL stress. The remaining GPU items are longer thermal-soak validation, browser/WebGL coverage, and the separate DRM modeset-lock wedge.

2026-05-05 evening — glmark2 no longer wedges Panfrost

The first full 420x760 glmark2 rerun found a real Panfrost reset bug, not a thermal limit. Temperature stayed below the warm policy threshold, but the texture scene left job slots 0 and 1 in STATUS=0x8, active_jobs kept climbing after glmark2 exited, and every reset failed:

timeouts=36
resets=25
reset_success=0
reset_fail=25
active_jobs=40
panfrost_reset_nolock: device reset failed: -1

Commit 2b134eb (panfrost: wait for GPU reset completion) fixed the immediate reset path. It waits for GPU_IRQ_RESET_COMPLETED using real microsecond delays, attempts a hard reset if the soft reset times out, and restores the GPU interrupt masks after reset.

After booting kernel #175, the short glmark2 --size 320x480 -b :duration=1 --results fps:cpu:shader pass completed with score 882 and no timeouts or resets. The default-duration 420x760 pass then completed with score 609:

timeouts=0
resets=0
reset_success=0
reset_fail=0
active_jobs=0
policy_level=2
policy_gpu_cap_mhz=400
policy_cpu_cap_mhz=1200

The thermal guard engaged during the long run and behaved as intended. It stopped powerd and capped GPU max-auto at 400 MHz and the CPU at 1200 MHz. After cooldown, it restored powerd, GPU max-auto at 600 MHz, and the CPU at 1416 MHz. Two Panfrost MMU fault counters latched without a visible failure (status=0x660003c2, addr=0x2000000), so the browser/WebGL soak is still open.

◐ partial The Panfrost glmark2 wedge is closed on kernel #175. Remaining GPU work is WebGL/browser stress and the separate DRM modeset-lock wedge.