Appendix · recipe

GPU debugging: capturing the next crash

Serial logging, swap-backed mini-dumps, and dmesg snapshots — so the next panic isn't lost to the reboot.

The PinePhone Pro reboots on any unhandled kernel fault. When the host is not already logging /dev/ttyUSB1, the best artifact is often a phone photo of the framebuffer after the crash. That is not good enough: the register dump, DRM warnings, and reset path need to land in a file on the laptop before the board is power-cycled.

This page is the recipe for not losing the next one.

0. Snapshot a live wedge over SSH

When the panel is frozen but USB Ethernet still answers, capture the live process stacks before killing clients, restarting Sway, or rebooting:

mise run debug:gpu:wedge:phone -- firefox-glxtest-wedge

The task writes logs/gpu-wedge/<timestamp>-<name>.log. It is read-only: it records process state, procstat -k stacks, Sway IPC if it still answers, DRM/Panfrost sysctls, thermal state, dmesg, and /var/log/messages. The local analysis then classifies the log against the known signatures.

This is the first command to run on the next browser/WebGL or compositor freeze. It exists because the 2026-05-06 Firefox bench left sway unkillable inside panfrost_ioctl_wait_bo; even SIGKILL could not reap it, and sudo reboot stopped sshd but then hung late while the kernel still answered ICMP.
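As a rough illustration of signature matching on a wedge log, the sketch below checks for the one signature this page names, a stack parked in panfrost_ioctl_wait_bo. The function name and the output labels are hypothetical; the real analysis run by the mise task may use different names and cover more cases.

```sh
#!/bin/sh
# Sketch only: classify_wedge_log and its labels are illustrative,
# not the names the project's analysis emits.
# Usage: classify_wedge_log <logfile>
classify_wedge_log() {
    log=$1
    if grep -q 'panfrost_ioctl_wait_bo' "$log"; then
        # A thread is blocked in the Panfrost buffer-object wait ioctl,
        # the state sway was stuck in on 2026-05-06.
        echo 'panfrost-wait-bo'
    else
        echo 'unclassified'
    fi
}
```

Run it against any file under logs/gpu-wedge/ to triage an old capture without rerunning the task.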

For browser media tests the receipt also captures /dev/sndstat, mixer(8), PulseAudio / PipeWire / virtual_oss process stacks, RT5640 dmesg lines, and the selected Panfrost counters. That separation mattered on 2026-05-06: the first YouTube run did not show a compositor wait; it showed PulseAudio’s OSS thread pegged. Forcing Firefox’s cubeb backend to OSS through mobile-config-firefox and disabling PulseAudio autospawn made the same YouTube launch avoid PulseAudio entirely. The phone still got hot, but the receipt showed ordinary Firefox/media CPU saturation rather than a GPU timeout, DRM modeset lock, or PulseAudio spin.

1. Live serial logging

Before triggering anything that might crash (running glmark2, reloading a sway theme, resetting WiFi after a wedged transfer, or plugging in the gigabit-ethernet dongle for the first time), start a serial capture on the laptop. mise run serial:capture -- <name> wraps tools/capture-serial.sh, which opens picocom against /dev/ttyUSB1 at 1500000 baud and logs everything to logs/serial/<timestamp>-<name>.log. Use the FT232 dongle; the CP2102 doesn’t go that high:

mise run serial:capture -- stress-test
# … hit Ctrl-A Ctrl-X to exit
grep -nE 'panic|Fatal|abort|WARNING' logs/serial/*-stress-test.log

Even when the crash hard-reboots the phone before any disk write completes, the serial transcript on the laptop is intact.
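A plain grep lists every matching line, but the first hit and the lines around it are usually what matters. A minimal sketch, reusing the pattern from the grep above; the function name and the 20/40-line context window are arbitrary choices, not part of the project tooling:

```sh
#!/bin/sh
# Print the first panic-like line of a serial log with surrounding context.
# Usage: first_panic_context <logfile>
first_panic_context() {
    log=$1
    # Line number of the first match, or empty if nothing matched.
    first=$(grep -nE 'panic|Fatal|abort|WARNING' "$log" | head -n 1 | cut -d: -f1)
    if [ -z "$first" ]; then
        echo "no panic signature in $log" >&2
        return 1
    fi
    # Show 20 lines before and 40 lines after, clamped to the start of the file.
    start=$((first - 20))
    [ "$start" -ge 1 ] || start=1
    sed -n "${start},$((first + 40))p" "$log"
}
```

The non-zero return on a clean log makes it usable as a check at the end of a test script.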

For longer risky sessions, use detached capture so the terminal running the test does not own the serial port:

mise run serial:capture:daemon -- wifi-reset
mise run serial:capture:status
# … run the risky test, reboot, or hard-reset the phone …
grep -nE 'panic|Fatal|abort|WARNING' "$(cat logs/serial/.capture.log)"
mise run serial:capture:stop

Detached capture records U-Boot, the EFI loader, kernel boot, and post-boot console output as long as the FT232 adapter stays plugged in. It also answers the loader’s ESC[6n cursor-position query; a raw cat logger can record that query but leave the phone stuck before FreeBSD reaches USB networking.

2. Mini-dump on panic via savecore (caveat: needs a real swap)

FreeBSD writes a kernel mini-dump to the swap partition on panic(), then savecore copies it to /var/crash/ on the next boot. This is the most useful piece of post-mortem state — it captures the panic message, register state, and a stack trace into a file you can read on honor with kgdb.

The overlay’s rc.conf sets dumpdev="AUTO" + savecore_enable="YES" + savecore_flags="-z", and tools/configure-dump.sh applies the same to a live phone:

./tools/configure-dump.sh
ssh pinephone 'dumpon -l && ls /var/crash'

However, today’s Honeyguide image has no swap partition, and md(4)-backed swap files do not support DIOCSKERNELDUMP (verified: the ioctl returns "Operation not supported"). So dumpon -l reports /dev/null, and savecore will not capture a panic until one of the following happens:

  1. The next image build is repartitioned to include a real swap slice (~512 MB is plenty for a mini-dump on a 4 GB phone). Edit honeyguide/img/create_img_clean.sh.
  2. netdump(4) is wired up to honor over the USB-Ethernet link. The cdce/usb_template driver stack would need DEBUGNET hooks first (net.netdump.enabled will refuse to come up otherwise).

Until then, fall back to sections 1 and 3.
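Before a risky session it can be worth confirming whether a dump device is configured at all. The helper below is a sketch that interprets the output of dumpon -l when passed in as a string; the function name and messages are illustrative.

```sh
#!/bin/sh
# Interpret the output of `dumpon -l`, passed in as a string.
# Returns 0 when a real dump device is configured, 1 otherwise.
dump_device_ok() {
    dev=$1
    case $dev in
        ''|/dev/null)
            # /dev/null is what dumpon -l reports when nothing is configured.
            echo 'no dump device: savecore will have nothing to recover'
            return 1 ;;
        *)
            echo "dump device: $dev"
            return 0 ;;
    esac
}
```

From the laptop this would be called as dump_device_ok "$(ssh pinephone 'dumpon -l')".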

After a panic and reboot, fish out the dump:

ssh pinephone 'ls -lhrt /var/crash | tail'
ssh pinephone 'cat /var/crash/info.last'   # panic string + uptime + version
ssh pinephone 'gunzip -c /var/crash/vmcore.last.gz' \
    | ssh honor 'cat > /tmp/vmcore'
ssh honor 'kgdb \
  ~/pine64-freebsd/honeyguide/obj.clang/.../sys/PINEPHONE_PRO/kernel.debug \
  /tmp/vmcore'

The kernel.debug (the unstripped one) lives in the buildkernel object tree on honor. bt in kgdb gives you the panic backtrace.
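For a quick look before starting kgdb, the panic string can be pulled straight out of the info file. This sketch assumes the usual savecore layout, in which the file contains an indented "Panic String:" line; the function name is illustrative.

```sh
#!/bin/sh
# Print the panic string recorded in a savecore info file.
# Assumes a line of the form "  Panic String: <text>".
# Usage: panic_string [info-file]   (defaults to /var/crash/info.last)
panic_string() {
    sed -n 's/^[[:space:]]*Panic String:[[:space:]]*//p' "${1:-/var/crash/info.last}"
}
```

On the phone, panic_string with no argument reads /var/crash/info.last; pass a path to read a copied file on the laptop.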

To verify the path works without waiting for a real crash, deliberately panic the phone:

ssh pinephone 'sudo sysctl debug.kdb.panic=1'
# … wait for reboot, then check /var/crash

(Don’t do this casually — it’s a hard panic.)

3. Per-minute dmesg snapshots

Mini-dumps require a clean panic(). A hang — kernel still alive but display dead, IRQs frozen, watchdog bites — leaves nothing for savecore. For those, we want the most recent dmesg written to disk before things went sideways.

mise run debug:wifi:setup:phone installs tools/install-dmesg-snapshots.sh on the phone: a per-minute cron that writes dmesg to /var/log/dmesg-snapshots/dmesg-HHMM.log and rotates anything older than an hour. Disk impact is bounded (60 small files, total under a megabyte).

mise run debug:wifi:setup:phone
# After a hang and reboot:
ssh pinephone 'ls -lhrt /var/log/dmesg-snapshots/ | tail'

The HHMM in the filename is the time the snapshot was taken, so you can match it against your serial log to see what dmesg held in the last minute before the wedge.
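The snapshot job itself is small. The following is a sketch of what such a job could look like, not the contents of tools/install-dmesg-snapshots.sh; the function name and the DMESG_CMD override (there so the function can be exercised off the phone) are assumptions.

```sh
#!/bin/sh
# Write one dmesg snapshot named by the current HHMM, then drop old ones.
# Usage: snapshot_dmesg [snapshot-dir]
snapshot_dmesg() {
    snap_dir=${1:-/var/log/dmesg-snapshots}
    mkdir -p "$snap_dir" || return 1
    # DMESG_CMD is a hypothetical override; by default the real dmesg runs.
    ${DMESG_CMD:-dmesg} > "$snap_dir/dmesg-$(date +%H%M).log"
    # Rotate: remove snapshots last modified more than 60 minutes ago.
    find "$snap_dir" -type f -name 'dmesg-*.log' -mmin +60 -delete
}
```

Run once a minute, this keeps at most about 60 files, because each HHMM name is either overwritten an hour later or deleted by the find.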

4. Snapshot the live WiFi state before and after a risky test

For the bwfm(4) / BCM43455 work, the most common failure mode is not an immediate panic but a firmware command rejection followed by a wedged or missing interface. Keep debug reads passive. The state sysctls used by the current harness should report cached counters and MMIO state; they should not issue fresh SDIO CMD52/CMD53 transactions from sysctl context.

mise run debug:wifi:phone collects the pieces that are easy to lose track of mid-session and writes the snapshot to logs/wifi/<timestamp>-<name>.log. The script appends a short receipt analysis at the end of the log; rerun it manually with mise run debug:wifi:analyze -- logs/wifi/<file>.log if you want to compare older captures.

mise run module:refresh:phone -- bwfm_sdio
mise run module:compare:phone -- bwfm_sdio
mise run debug:wifi:phone -- before-scan bwfm_sdio
ssh pinephone 'sudo ifconfig wlan0 scan'
mise run debug:wifi:phone -- after-scan bwfm_sdio

For transfer tests, prefer the bounded harness and leave verbose trace dumps off unless you are specifically debugging the trace ring:

SIZE_MIB=2 POLL_SECS=5 DEBUG_TIMEOUT=5 DUMP_TRACE=0 \
  mise run debug:wifi:transfer -- both sdio-irq-clock-fix-small

The transfer harness appends the same analysis after it stops the poller. The important line is the one that starts with summary: classification=... and names the outcome. irq-active-watchdog means the host delivered function interrupts to bwfm_sdio; irq-armed-poll-fallback means the function IRQ was claimed but the watchdog/poll path did the work. completed-with-usb-stalls is still useful evidence, but it means the USB management link timed out during the run, so check the serial log.
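When comparing several runs, it helps to pull just the classification out of each log. The sketch below assumes the summary line starts at the beginning of a line and that the value is a single token without whitespace; both function names are illustrative, and the explanations simply restate the meanings given above.

```sh
#!/bin/sh
# Extract the last classification value from a WiFi or transfer log.
# Usage: wifi_classification <logfile>
wifi_classification() {
    sed -n 's/^summary: classification=\([^[:space:]]*\).*$/\1/p' "$1" | tail -n 1
}

# Map a classification value to the meaning documented on this page.
explain_classification() {
    case $1 in
        irq-active-watchdog)
            echo 'host delivered function interrupts to bwfm_sdio' ;;
        irq-armed-poll-fallback)
            echo 'function IRQ claimed, but the watchdog/poll path did the work' ;;
        completed-with-usb-stalls)
            echo 'USB management link timed out; check the serial log' ;;
        *)
            echo "unrecognised classification: $1" ;;
    esac
}
```

For example, explain_classification "$(wifi_classification logs/wifi/<file>.log)" prints a one-line reading of a capture.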

If a debug sysctl ever hangs the phone by itself, treat that as a driver bug. The next transfer should wait until the sysctl has been made passive again.

5. Reading a captured panic

A typical fault frame on aarch64 looks like:

Fatal data abort:
  x0:  0xffff...    x7:  0xffffffffffffff
  ...
  far: 0            esr: 0x96000004

For NULL derefs, the most useful follow-up is:

  1. Find the function name from pc (addr2line or kgdb’s info line *0x...).
  2. Look at the instruction at pc — usually a ldr xN, [xM, #offset]. xM is the base pointer (matches a register in the dump); offset tells you which field.
  3. Trace back from the call site to find what made xM NULL.
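The esr value in the fault frame narrows things down before any disassembly. A minimal decoder, using the architectural field positions of the syndrome register (EC in bits 31:26, and for data aborts WnR in bit 6 and DFSC in bits 5:0); the function name is illustrative:

```sh
#!/bin/sh
# Decode the fields of an aarch64 ESR value that matter for a data abort.
# Usage: decode_esr 0x96000004
decode_esr() {
    esr=$(($1))
    ec=$(( (esr >> 26) & 0x3f ))   # exception class
    wnr=$(( (esr >> 6) & 1 ))      # 1 = write, 0 = read (data aborts only)
    dfsc=$(( esr & 0x3f ))         # data fault status code
    printf 'ec=0x%02x wnr=%d dfsc=0x%02x\n' "$ec" "$wnr" "$dfsc"
}

decode_esr 0x96000004   # prints: ec=0x25 wnr=0 dfsc=0x04
```

For the frame above, EC 0x25 is a data abort taken without a change in exception level (a fault in the kernel itself), WnR 0 is a read, and DFSC 0x04 is a level 0 translation fault. Together with far: 0 that is a read through a NULL pointer.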

The panic that prompted this page was deliberate: a panic("plane is not visible") call left as a /* TODO */ in rk_vop_plane_atomic_update. The lock-assertion warnings further down the log were tail-end damage as the panic unwound state still held by another core. See the war story.

6. When all else fails

If every other capture fails (savecore unconfigured, hang too hard, serial cable disconnected), the last line of defence is a screen photograph through the privacy switch. Keep a phone-with-decent-camera nearby. Aim at the EFI framebuffer; the panic frame stays visible for ~1 second before the watchdog reboots. Decode by hand. We have the technology to do better than this — use the steps above so we don’t have to.