GPU debugging: capturing the next crash

The PinePhone Pro reboots on any unhandled kernel fault. The CRU hack (see essay 7) prevents the early-boot serial console from coming back up, so by the time the kernel logs “Fatal data abort” the host serial port is already clear of the boot banner, and within a second the chip resets. That leaves a window of roughly one second to photograph the screen; after that, all state is gone.

This page is the recipe for not losing the next one.

1. Live serial logging

Before triggering anything that might crash — running glmark2, reloading a sway theme, plugging in the gigabit-ethernet dongle for the first time — start a serial capture on the laptop. The script tools/capture-serial.sh opens picocom against /dev/ttyUSB1 at 1500000 baud (the FT232 dongle — the CP2102 doesn’t go that high) and tees everything to logs/serial/<timestamp>-<name>.log:

./tools/capture-serial.sh stress-test
# … hit Ctrl-A Ctrl-X to exit
grep -nE 'panic|Fatal|abort|WARNING' logs/serial/*-stress-test.log

Even when the crash hard-reboots the phone before any disk write completes, the serial transcript on the laptop is intact.
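For orientation, a wrapper of this kind can be very small. The following is only a sketch of what such a script might look like, not the contents of tools/capture-serial.sh: the function names, the environment overrides and the timestamp format are all assumptions.

```sh
#!/bin/sh
# Hypothetical sketch of a serial capture wrapper.
# Port and baud rate come from the text above; everything else is assumed.
PORT=${PORT:-/dev/ttyUSB1}
BAUD=${BAUD:-1500000}
LOG_DIR=${LOG_DIR:-logs/serial}

# Build logs/serial/<timestamp>-<name>.log (timestamp format is an assumption).
log_path() {
    printf '%s/%s-%s.log\n' "$LOG_DIR" "$(date +%Y%m%d-%H%M%S)" "$1"
}

# Open the port with picocom and tee everything it prints into the log file.
capture() {
    name=${1:?usage: capture <name>}
    mkdir -p "$LOG_DIR" || return 1
    log=$(log_path "$name")
    echo "logging to $log" >&2
    picocom -b "$BAUD" "$PORT" | tee "$log"
}
```

Calling capture stress-test would then behave like the command shown above. Because tee writes each chunk as it arrives, whatever reached the laptop before the reset is already in the file.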

2. Mini-dump on panic via savecore (caveat: needs a real swap)

FreeBSD writes a kernel mini-dump to the swap partition on panic(), then savecore copies it to /var/crash/ on the next boot. This is the most useful piece of post-mortem state — it captures the panic message, register state, and a stack trace into a file you can read on honor with kgdb.

The overlay’s rc.conf sets dumpdev="AUTO" + savecore_enable="YES" + savecore_flags="-z", and tools/configure-dump.sh applies the same to a live phone:

./tools/configure-dump.sh
ssh pinephone 'dumpon -l && ls /var/crash'

BUT: today’s Honeyguide image has no swap partition, and md(4)-backed swap files do not support DIOCSKERNELDUMP (verified: the ioctl returns Operation not supported). So dumpon -l will report /dev/null, and savecore will not capture a panic until one of the following is true:

The next image build is repartitioned to include a real swap slice (~512 MB is plenty for a mini-dump on a 4 GB phone). Edit honeyguide/img/create_img_clean.sh.
netdump(4) is wired up to honor over the USB-Ethernet link. The cdce/usb_template driver stack would need DEBUGNET hooks first (net.netdump.enabled will refuse to come up otherwise).

Until then, fall back to sections 1 and 3.
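A pre-flight check along the following lines can tell you, before a test run, whether a panic would reach savecore at all. It is a minimal sketch: dump_status is a hypothetical helper, and it relies only on the behaviour described above, namely that dumpon -l reports /dev/null when no dump device is configured.

```sh
#!/bin/sh
# Hypothetical helper: classify the output of `dumpon -l`.
# On the phone it would be called as: dump_status "$(dumpon -l)"
dump_status() {
    case "$1" in
        ''|/dev/null)
            echo "no dump device: a panic will NOT be captured by savecore"
            return 1
            ;;
        *)
            echo "dump device configured: $1"
            return 0
            ;;
    esac
}
```

The non-zero return value makes it easy to gate a stress test on it, for example by refusing to start unless the check passes or a serial capture is already running.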

After a panic and reboot, fish out the dump:

ssh pinephone 'ls -lhrt /var/crash | tail'
ssh pinephone 'cat /var/crash/info.last'   # panic string + uptime + version
ssh pinephone 'gunzip -c /var/crash/vmcore.last.gz' \
    | ssh honor 'cat > /tmp/vmcore'
ssh honor 'kgdb \
  ~/pine64-freebsd/honeyguide/obj.clang/.../sys/PINEPHONE_PRO/kernel.debug \
  /tmp/vmcore'

The kernel.debug (the unstripped one) lives in the buildkernel object tree on honor. bt in kgdb gives you the panic backtrace.

To verify the path works without waiting for a real crash, deliberately panic the phone:

ssh pinephone 'sudo sysctl debug.kdb.panic=1'
# … wait for reboot, then check /var/crash

(Don’t do this casually — it’s a hard panic.)

3. Per-minute dmesg snapshots

Mini-dumps require a clean panic(). A hang — kernel still alive but display dead, IRQs frozen, watchdog bites — leaves nothing for savecore. For those, we want the most recent dmesg written to disk before things went sideways.

tools/install-dmesg-snapshots.sh adds a per-minute cron job for root on the phone that writes dmesg to /var/log/dmesg-snapshots/dmesg-HHMM.log and deletes any snapshot older than an hour. Disk impact is bounded (60 small files, total under a megabyte).

./tools/install-dmesg-snapshots.sh
# After a hang and reboot:
ssh pinephone 'ls -lhrt /var/log/dmesg-snapshots/ | tail'

The HHMM in the filename is the time the snapshot was taken, so you can match it against your serial log to see what was on dmesg in the minute before the wedge.
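The job itself could look roughly like the sketch below. This is an illustration of the technique, not the script that tools/install-dmesg-snapshots.sh actually installs: the snapshot function, the SNAP_DIR and DMESG_CMD overrides and the temp-file step are assumptions.

```sh
#!/bin/sh
# Hypothetical per-minute snapshot job.
SNAP_DIR=${SNAP_DIR:-/var/log/dmesg-snapshots}
DMESG_CMD=${DMESG_CMD:-dmesg}   # overridable so the job can be dry-run off-device

snapshot() {
    mkdir -p "$SNAP_DIR" || return 1
    out="$SNAP_DIR/dmesg-$(date +%H%M).log"
    # Write to a temporary name first, then rename, so an interrupted write
    # never leaves a truncated file under the final name.
    $DMESG_CMD > "$out.tmp" && mv "$out.tmp" "$out"
    # Delete snapshots whose modification time is more than 60 minutes old.
    find "$SNAP_DIR" -name 'dmesg-*.log' -mmin +60 -delete
}
```

A script that ends by calling snapshot could be scheduled from root's crontab with a line such as `* * * * * /path/to/dmesg-snapshot.sh`, where the path is a placeholder.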

4. Reading a captured panic

A typical fault frame on aarch64 looks like:

Fatal data abort:
  x0:  0xffff...    x7:  0xffffffffffffff
  ...
  far: 0            esr: 0x96000004

far is the faulting address. 0 means a NULL deref; small offsets (0x20, 0x40) mean we deref’d through a NULL struct pointer to a field at that offset.
esr decodes the fault class. The top byte 0x96 encodes exception class 0x25, “Data abort taken without a change in EL”, with the IL bit set; the low six bits are the fault status code. 0x04 is “translation fault, level 0”: there is no page table entry at all, consistent with NULL.
x0 is usually the first argument or this-equivalent at the point of the call.
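The esr split can be done mechanically rather than by eye. The sketch below pulls out the exception class (bits 31:26), the IL bit (bit 25) and the low six bits, which are the fault status code when the exception is a data abort. decode_esr is a hypothetical helper name, and it deliberately ignores the remaining syndrome bits.

```sh
#!/bin/sh
# Hypothetical helper: split an AArch64 ESR value into its main fields.
decode_esr() {
    esr=$(( $1 ))
    ec=$((   (esr >> 26) & 0x3f ))   # exception class, bits 31:26
    il=$((   (esr >> 25) & 0x1 ))    # instruction length bit, bit 25
    dfsc=$((  esr        & 0x3f ))   # fault status code (data aborts), bits 5:0
    printf 'EC=0x%02x IL=%d DFSC=0x%02x\n' "$ec" "$il" "$dfsc"
}

decode_esr 0x96000004   # prints: EC=0x25 IL=1 DFSC=0x04
```

EC 0x25 is the data abort without a change in EL, and DFSC 0x04 is the level 0 translation fault, matching the frame above.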

For NULL derefs, the most useful follow-up is:

Find the function name from pc (addr2line or kgdb’s info line *0x...).
Look at the instruction at pc — usually a ldr xN, [xM, #offset]. xM is the base pointer (matches a register in the dump); offset tells you which field.
Trace back from the call site to find what made xM NULL.

The panic that prompted this page was deliberate: a panic("plane is not visible") left as a /* TODO */ in rk_vop_plane_atomic_update. The lock-assertion warnings further down the log were tail-end damage as the panic unwound state still held by another core. See the war story.

5. When all else fails

If every other capture fails (savecore unconfigured, hang too hard, serial cable disconnected), the last line of defence is a screen photograph through the privacy switch. Keep a phone-with-decent-camera nearby. Aim at the EFI framebuffer; the panic frame stays visible for ~1 second before the watchdog reboots. Decode by hand. We have the technology to do better than this — use the steps above so we don’t have to.