DWC3 gadget from scratch

● working

The previous essay said ssh pinephone works. This one is what it took to get there. The DWC3 device-mode driver at src/sys/dev/usb/controller/dwc3/dwc3_gadget.c is 2,707 lines I wrote against the Synopsys DesignWare DWC3 inside the RK3399S, with the host-side bus traced on a Linux laptop running usbmon, because otherwise the only thing that could print debug output was the device under test. There is no upstream FreeBSD device-mode DWC3. Starting points were Linux 5.2, Barebox, and the DWC3 databook. None of them agree about EP0.

Two arcs took ten days. EP0 SETUP, where the hardware reported 8-byte completions with a buffer full of zeros. And bulk TX, where CDC Ethernet would push two packets and deadlock — except sometimes it didn’t, and there’s the rub.

[WAR STORY]

The EP0 SETUP TRB infinite loop

dwc3 / ep0

▸ symptom

Plug USB-C into the host. Linux journal logs “device descriptor read/64, error -110” three times. Phone-side serial shows the DWC3 firing XferComplete events on EP0 OUT immediately after each SETUP TRB is queued, but ep0_bounce[0..7] is 00 00 00 00 00 00 00 00. The host gives up, the device retries, the host gives up. Forever. No data ever flows.

▸ hypothesis 1

DMA cache coherency. EP0 bounce buffer is BUS_DMA_COHERENT, which on arm64 with a non-coherent SoC like RK3399 doesn’t actually mean coherent. The DWC3 might be writing data we read as a stale cache line. Switched to BUS_DMA_NOCACHE. 50b4566 dwc3_gadget: move SETUP TRB to connect-done, use BUS_DMA_NOCACHE

Same result. Bounce still zeros. Not coherency — the DWC3 isn’t writing anything to that address.

▸ hypothesis 2

The DWC3 is auto-completing the SETUP TRB during bus reset, before the host has sent a SETUP. Linux waits for connect-done before the first SETUP queue; we were queuing during reset. Moved SETUP to connect-done. 50b4566 dwc3_gadget: move SETUP TRB to connect-done, use BUS_DMA_NOCACHE

Fewer spurious completions during reset, but steady-state is identical. Even after enumeration starts properly we get XferComplete with zero data. Added a “spurious SETUP, re-queuing” path; the phone logged that line forever. 0d538c3 dwc3_gadget: add spurious SETUP detection and re-queue — note the commit message: “This is not a timing issue (re-queuing loops infinitely) but a hardware state issue.”

▸ hypothesis 3

EP0 state inherited from U-Boot. Maybe rk2aw left the controller half-initialized. Did DCTL.CSFTRST at attach plus a full CRU reset of the PHY. 9a99f0d dwc3: use DCTL.CSFTRST for device-mode reset, add CRU reset dance

No change. Reset was clean. Something we were programming was wrong.

▸ breakthrough

Two register-define bugs in dwc3.h. DWC3_DEPCFG_EP_NUMBER(x) was ((x) << 1) — bit 1. The databook puts EP number at bits [29:25]. 60f1183 dwc3: fix DEPCFG EP_NUMBER (bits [29:25] not [1:0]) and TRBCTL_SETUP (6 not 2)

The actual root cause of the zero-data SETUP: per databook table 6-3, the TRBCTL encoding is 1=Normal, 2=Control-Setup, 3=Status-2, 4=Status-3, 5=Control-Data, 6=Isochronous-First, 7=Isochronous, 8=Link. We were using 6 for SETUP — which is Isochronous-First. The DWC3 saw TRBCTL=6, decided this was the first TRB of an iso transfer, and auto-completed without data because that’s what iso-first does. 92b3f13 dwc3_gadget: fix TRBCTL values — CONTROL_SETUP is 2 not 6
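A minimal sketch of how definitions like these can be pinned down with compile-time checks, so that a wrong shift or encoding fails the build instead of the bus. This is not the driver's dwc3.h: the EX_ names are hypothetical, and the field masks plus the HWO, IOC and TRBCTL positions in the TRB control word are assumptions modelled on the layout the Linux driver uses, to be checked against the databook.

```c
#include <assert.h>
#include <stdint.h>

/* The old definition quoted above: shifts the endpoint number to bit 1. */
#define EX_DEPCFG_EP_NUMBER_OLD(n)  ((uint32_t)(n) << 1)

/* Corrected: endpoint number in bits [29:25] (5-bit mask is an assumption). */
#define EX_DEPCFG_EP_NUMBER(n)      (((uint32_t)(n) & 0x1fu) << 25)

/* TRB control word fields (positions assumed, see the lead-in). */
#define EX_TRB_CTRL_HWO             (1u << 0)
#define EX_TRB_CTRL_IOC             (1u << 11)
#define EX_TRB_CTRL_TRBCTL(n)       (((uint32_t)(n) & 0x3fu) << 4)

/* TRBCTL encodings exactly as listed in the text. */
enum ex_trbctl {
	EX_TRBCTL_NORMAL            = 1,
	EX_TRBCTL_CONTROL_SETUP     = 2,
	EX_TRBCTL_CONTROL_STATUS2   = 3,
	EX_TRBCTL_CONTROL_STATUS3   = 4,
	EX_TRBCTL_CONTROL_DATA      = 5,
	EX_TRBCTL_ISOCHRONOUS_FIRST = 6,
	EX_TRBCTL_ISOCHRONOUS       = 7,
	EX_TRBCTL_LINK              = 8
};

/* Compile-time guards: each constant is checked against its documented value. */
static_assert(EX_DEPCFG_EP_NUMBER(1) == 0x02000000u, "EP number must start at bit 25");
static_assert(EX_DEPCFG_EP_NUMBER_OLD(1) != EX_DEPCFG_EP_NUMBER(1), "old shift was wrong");
static_assert(EX_TRBCTL_CONTROL_SETUP == 2, "SETUP is 2");
static_assert(EX_TRBCTL_ISOCHRONOUS_FIRST == 6, "6 is Isochronous-First, not SETUP");

/* Control word for a SETUP TRB; the HWO|IOC flag choice is illustrative. */
uint32_t
ex_setup_trb_ctrl(void)
{
	return EX_TRB_CTRL_HWO | EX_TRB_CTRL_IOC |
	    EX_TRB_CTRL_TRBCTL(EX_TRBCTL_CONTROL_SETUP);
}
```

The assertions cost nothing at run time, and they turn a transcription slip like the ones above into a build error.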

▸ fix

Patch one register definition file. The next boot, the very first SETUP TRB completed with real bytes:

SETUP: 80 06 00 01 00 00 40 00

80 06 = bmRequestType=0x80, bRequest=GET_DESCRIPTOR. 00 01 = wValue 0x0100: descriptor type Device, index 0. 00 00 = wIndex (language ID, zero here). 40 00 = wLength 64. The host was asking the right question; we’d been answering the wrong one for two weeks because of one wrong constant in a header.
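The byte-by-byte reading above can be written as a small decoder. This is a user-space sketch, not code from the driver: struct ex_setup and ex_decode_setup are hypothetical names. The field layout is the standard 8-byte USB SETUP packet, with the three 16-bit fields little-endian.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard USB SETUP packet, decoded into host-order fields. */
struct ex_setup {
	uint8_t  bmRequestType;
	uint8_t  bRequest;
	uint16_t wValue;   /* GET_DESCRIPTOR: high byte = type, low byte = index */
	uint16_t wIndex;
	uint16_t wLength;
};

/* Read a little-endian 16-bit value from two bytes. */
static uint16_t
ex_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/* Decode the 8 raw bytes as they land in the bounce buffer. */
struct ex_setup
ex_decode_setup(const uint8_t *raw)
{
	struct ex_setup s;

	assert(raw != NULL);
	s.bmRequestType = raw[0];
	s.bRequest      = raw[1];
	s.wValue        = ex_le16(&raw[2]);
	s.wIndex        = ex_le16(&raw[4]);
	s.wLength       = ex_le16(&raw[6]);
	return s;
}
```

Fed the eight bytes from the log line, the decoder yields bRequest 6 (GET_DESCRIPTOR), descriptor type 1 in the high byte of wValue, and wLength 64.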

▸ lesson

Register-define bugs are silent and indistinguishable from working hardware. The TRB completion event fires either way; the difference shows up only on the wire, in what the host sees. The “reset and re-queue” instinct is wrong: the hardware is doing exactly what you’re telling it to. Find these by tracing the bus from a different machine and checking every bit position and encoding value against a known-working register-write trace.

The other half of EP0 took its own path. The original integration tried FreeBSD’s usb_template framework, which defers usbd_transfer_done callbacks to the USB process kthread. By the time ctrl_start ran and started the EP0 IN transfer, the host had already retried the bus. DWC3 EP0 needs the response inside the same ISR: the host gives up at 50 ms, and on a contended kernel we were missing the deadline by a millisecond or two. Reverted the kthread path and handle EP0 inline. 2ed4468 Revert usb_template integration — deferred callbacks too slow for EP0

That plus the TRBCTL fix, plus smaller cleanups (5db8e09, d2cf055, 52f48b9), got EP0 enumerating cleanly. Then bulk data started, and broke in its own way.

[WAR STORY]

The lost TX completion

dwc3 / bulk

▸ symptom

CDC ECM enumeration completes. Host sees enxaabbccddeef0, assigns 10.0.0.1, sends a ping. Phone receives it (RX completion fires, ICMP packet enters the network stack). Phone tries to reply. The reply mbuf gets into dwc3_gadget_if_start, gets DMA-prepared, STARTTRANSFER is issued — and then nothing. No XferComplete on EP1 IN. Subsequent TX attempts queue and stall. After about a minute the host’s ARP entry expires. Sometimes the second packet works. Sometimes 100 packets work and the 101st loses a completion. Reproducible only by sustained traffic.

▸ hypothesis 1

DMA coherency on TX. The buffer was allocated BUS_DMA_COHERENT and synced PREWRITE before STARTTRANSFER; maybe the hardware’s clear of TRB.HWO wasn’t making it through the cache. Switched to non-cached memory, added an explicit invalidate. No change.

▸ hypothesis 2

Single TX buffer reuse. Maybe the previous DMA hadn’t finished. Audited: tx_busy set at submit, cleared on completion, counters checked. tx_busy was clear, transfer issued, no completion ever arrived. Not a reuse race.

▸ hypothesis 3

TRB ring index drift. Per-endpoint ring with trb_enqueue/trb_dequeue. prepare_trb advances enqueue; start_transfer reads from dequeue, which never moved on TX. The second TX wrote a fresh TRB at slot 1; STARTTRANSFER pointed at stale TRB[0]. Hardware processed the stale TRB (HWO=0 already), saw nothing, silently completed without firing IOC. 9255d7a dwc3_gadget: fix TRB ring index mismatch — always use TRB[0]

▸ breakthrough

Right diagnosis. The smoking gun was logging in 9f13c7b dwc3_gadget: add debug prints for bulk RX/TX completion that printed TRB.bpl/bph/size/ctrl after each submit — bpl values stuck at the first packet’s DMA address through three submits. Single-buffer mode means there’s never more than one outstanding transfer per endpoint; the ring is overkill. So instead of fixing the dequeue advance, always use TRB[0]:

/*
 * Single-buffer mode: always use TRB[0].  We only have one
 * outstanding transfer per endpoint at a time, so there's no
 * need to advance through the ring.  Reset both indices to 0
 * to keep prepare_trb and start_transfer in sync.
 */
ep->trb_enqueue = 0;
ep->trb_dequeue = 0;
trb = &ep->trb_ring[0];

▸ fix

That’s the version that’s been stable since April 3. It pins enqueue/dequeue to 0 in dwc3_gadget_ep_prepare_trb, and start_transfer reads from the same slot. RX got a parallel fix: the previous code was advancing trb_dequeue on every RX completion and re-queueing into TRB[1], TRB[2], etc., which mostly worked but would occasionally lose a completion when the indices wrapped. 9255d7a dwc3_gadget: fix TRB ring index mismatch — always use TRB[0]
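The index mismatch is easy to reproduce in a user-space model. The sketch below is a toy, not the driver: the ex_ names, the 4-slot ring, and the way "hardware" clears HWO are all assumptions made for illustration. It shows the advancing prepare leaving start_transfer pointed at a stale slot, and the pinned prepare keeping both in step.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EX_RING_SLOTS 4u
#define EX_TRB_HWO    (1u << 0)   /* hardware-owned flag (position assumed) */

struct ex_trb { uint32_t bpl; uint32_t ctrl; };

struct ex_ep {
	struct ex_trb ring[EX_RING_SLOTS];
	unsigned enq, deq;
};

void
ex_ep_init(struct ex_ep *ep)
{
	memset(ep, 0, sizeof(*ep));
}

/* Buggy prepare: fills the slot at enq, then advances enq. */
void
ex_prepare_advancing(struct ex_ep *ep, uint32_t dma)
{
	ep->ring[ep->enq].bpl = dma;
	ep->ring[ep->enq].ctrl = EX_TRB_HWO;
	ep->enq = (ep->enq + 1) % EX_RING_SLOTS;
}

/* Fixed prepare: single-buffer mode, both indices pinned to slot 0. */
void
ex_prepare_pinned(struct ex_ep *ep, uint32_t dma)
{
	ep->enq = 0;
	ep->deq = 0;
	ep->ring[0].bpl = dma;
	ep->ring[0].ctrl = EX_TRB_HWO;
}

/* start_transfer hands the hardware the slot at deq, which nothing advances. */
struct ex_trb *
ex_start_transfer(struct ex_ep *ep)
{
	assert(ep->deq < EX_RING_SLOTS);
	return &ep->ring[ep->deq];
}

/* Model of the hardware finishing a TRB: it clears HWO. */
void
ex_hw_complete(struct ex_trb *trb)
{
	trb->ctrl &= ~EX_TRB_HWO;
}
```

With the advancing prepare, the second submit writes slot 1 while start_transfer still returns slot 0, with the first packet's address and HWO already clear. That matches the stuck bpl values in the debug log.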

▸ lesson

Two-pointer ring index code is famously easy to get subtly wrong. If your driver only ever has one outstanding transfer at a time, do not implement a ring. The Linux DWC3 driver implements a real ring because Linux supports streaming bulk endpoints with multiple in-flight TRBs; we don’t, we won’t, and the half-implemented version was strictly worse than no ring at all. The CDC ECM TX ring we do have (8-slot, d4dfbbd dwc3_gadget: allocate 8 TX ring DMA buffers in attach + 569ebb7 dwc3_gadget: rewrite if_start with TX ring — queue up to 8 packets) sits at a different layer: it queues mbufs in software so we can submit the next one as soon as the previous TX completes. That’s not a hardware ring; that’s a software queue feeding a single-slot hardware path.
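The shape of a software queue feeding a single-slot hardware path can be sketched as follows. This is a toy model, not the driver's if_start: packets are plain ints standing in for mbufs, the ex_ names are hypothetical, and whether the in-flight packet counts against the eight slots is an assumption (here it does not).

```c
#include <assert.h>

#define EX_TXQ_SLOTS 8u

struct ex_txq {
	int      slot[EX_TXQ_SLOTS]; /* waiting packets, FIFO order */
	unsigned head, count;
	int      hw_busy;            /* at most one transfer in flight */
	int      hw_pkt;             /* packet currently owned by hardware */
};

/* Submit the next waiting packet if the single hardware slot is idle. */
void
ex_tx_kick(struct ex_txq *q)
{
	if (q->hw_busy || q->count == 0)
		return;
	q->hw_pkt = q->slot[q->head];
	q->head = (q->head + 1) % EX_TXQ_SLOTS;
	q->count--;
	q->hw_busy = 1;
}

/* Queue a packet; returns -1 when all eight software slots are full. */
int
ex_tx_enqueue(struct ex_txq *q, int pkt)
{
	if (q->count == EX_TXQ_SLOTS)
		return -1;
	q->slot[(q->head + q->count) % EX_TXQ_SLOTS] = pkt;
	q->count++;
	ex_tx_kick(q);
	return 0;
}

/* Completion path: release the finished packet, start the next one. */
int
ex_tx_complete(struct ex_txq *q)
{
	int done;

	assert(q->hw_busy);
	done = q->hw_pkt;
	q->hw_busy = 0;
	ex_tx_kick(q);
	return done;
}
```

The hardware-facing state is one flag and one packet. All the index arithmetic lives in the software FIFO, where a mistake costs ordering or a dropped packet rather than a lost completion event.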

An honest gap remains. Linux’s DWC3 driver handles batched completions with a mask-and-drain pattern: at IRQ entry, set GEVNTSIZ.intmask=1; drain the entire event ring; clear the mask at exit. A completion can land between your last drain and your IRQ-return, and without the mask you’ll never see it. We don’t do that because we don’t multi-buffer. If we ever push CDC throughput hard enough to need multiple in-flight TRBs, we will need that pattern too.
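For reference, the mask-and-drain shape looks roughly like this. It is a toy model under stated assumptions, not Linux's handler and not something this driver does: the event buffer is a plain array, intmask is an ordinary field standing in for GEVNTSIZ.intmask, all ex_ names are hypothetical, and what the controller does when the mask is cleared is not modelled.

```c
#include <assert.h>
#include <stdint.h>

#define EX_EVQ_SLOTS 16u

struct ex_evq {
	uint32_t ev[EX_EVQ_SLOTS];
	unsigned rd, wr;   /* free-running indices; slot = index % EX_EVQ_SLOTS */
	int      intmask;  /* stands in for GEVNTSIZ.intmask */
};

typedef void (*ex_ev_handler)(struct ex_evq *, uint32_t, void *);

/* "Hardware" side: append one event to the buffer. */
void
ex_evq_post(struct ex_evq *q, uint32_t ev)
{
	assert(q->wr - q->rd < EX_EVQ_SLOTS);
	q->ev[q->wr % EX_EVQ_SLOTS] = ev;
	q->wr++;
}

/* IRQ body: mask, drain until the buffer is empty, then unmask. */
unsigned
ex_evq_irq(struct ex_evq *q, ex_ev_handler fn, void *arg)
{
	unsigned handled = 0;

	q->intmask = 1;
	while (q->rd != q->wr) {
		uint32_t ev = q->ev[q->rd % EX_EVQ_SLOTS];

		q->rd++;
		if (fn != 0)
			fn(q, ev, arg);
		handled++;
	}
	q->intmask = 0;
	return handled;
}

/* Example handler: the first event makes one more completion land mid-drain. */
void
ex_chain_once(struct ex_evq *q, uint32_t ev, void *arg)
{
	int *chained = arg;

	(void)ev;
	if (!*chained) {
		*chained = 1;
		ex_evq_post(q, 0xC0DEu);
	}
}
```

Because the loop re-checks the write index on every pass, an event that arrives while the handler is running is picked up in the same invocation instead of waiting for an interrupt that may not come.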

Two more fixes to actually pass packets: f580710 dwc3_gadget: fix CDC Ethernet data path — different MAC, send notification — phone and host both had MAC aa:bb:cc:dd:ee:f0, so ARP couldn’t disambiguate; phone is now …f2. And we have to send USB_CDC_NOTIFY_NETWORK_CONNECTION on EP2 IN after SET_CONFIGURATION or Linux’s cdc_ether stays in operstate “unknown” forever. The first attempt at that notification used tx_buf and collided with bulk TX; 981ac97 dwc3_gadget: fix CDC notification buffer collision with bulk TX/RX moved it to ep0_bounce, which is idle during SET_CONFIGURATION STATUS.
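For orientation, the notification itself is an 8-byte header with no payload. The sketch below builds it the way the CDC specification lays it out; the function name is hypothetical, and using interface 0 as the communications interface in the usage is an assumption, not something taken from the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CDC notification code for NETWORK_CONNECTION. */
#define EX_CDC_NOTIFY_NETWORK_CONNECTION 0x00u

/*
 * Fill an 8-byte NETWORK_CONNECTION notification.
 * bmRequestType 0xA1 = device-to-host, class, interface.
 * wValue is 1 for connected, 0 for disconnected; wLength is 0.
 * Multi-byte fields are little-endian on the wire.
 */
size_t
ex_build_network_connection(uint8_t *out, uint16_t iface, int connected)
{
	assert(out != NULL);
	out[0] = 0xA1u;
	out[1] = EX_CDC_NOTIFY_NETWORK_CONNECTION;
	out[2] = connected ? 1u : 0u;       /* wValue low byte */
	out[3] = 0u;                        /* wValue high byte */
	out[4] = (uint8_t)(iface & 0xffu);  /* wIndex: interface number */
	out[5] = (uint8_t)(iface >> 8);
	out[6] = 0u;                        /* wLength = 0 */
	out[7] = 0u;
	return 8;
}
```

At eight bytes, the notification fits in a buffer the size of a SETUP packet, which is why a bounce buffer sized for SETUP can carry it.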

A side story: do not modify dwc3_gadget.c

[WAR STORY]

The debug-printf catastrophe

dwc3 / DO NOT TOUCH

▸ symptom

April 8, 2026. I wrap noisy device_printf calls in #ifdef DWC3_DEBUG because the driver is “stable now.” Macro is undefined. Build succeeds. Reboot. USB networking does not come up. SSH — the only path into the phone — is dead.

▸ breakthrough

The debug printfs weren’t decoration. They were buffering timing. The device_printf in dwc3_ep0_start_setup added ~30 µs of serial-console latency between STARTTRANSFER and the next operation — exactly enough margin that the SETUP queue happened in a quiet window. With the printf removed, STARTTRANSFER raced the bus reset and EP0 wedged.

▸ fix

Reverted with a one-line commit message: adfcb21 revert dwc3_gadget DWC3_DEBUG changes — broke USB networking. The macro stays undefined in normal builds; the printfs stay verbose.

▸ lesson

In hardware drivers, debug printfs are not free decoration. They buffer timing, and the timing is part of the contract. Our EP0 is implicitly racing the host on the SETUP-vs-reset window. The race is invisible because the printf delay covers it. The right fix is to remove the race — but that’s a multi-day patch on a driver that’s currently the only path SSH takes into the phone. So the printf stays. Filed under “tech debt, accepted, with a note.”

A second touch: the watchdog that wasn’t locked

The “do not touch” rule held until April 23, when sustained TX load finally exposed something the hands-off-the-driver discipline had been hiding. Capturing the phone-screenshot batch (eleven scp transfers, ~1.7 MB each, back-to-back over CDC ECM) reliably wedged the kernel inside dwc3 cleanup. That one was worth touching.

[WAR STORY]

The TX watchdog races the ISR

dwc3 / locking

▸ symptom

Mid-batch transfer, the phone reboots. Serial catches a fatal data abort: far=0, esr=0x96000004, backtrace inside the dwc3 cleanup path. Single small scp works. Two work. Eleven in a row, every time, takes the kernel down. Always inside the TX recovery branch.

▸ hypothesis 1

Mbuf double-free in the recovery path. The recovery branch frees tx_mbuf and clears tx_busy; maybe a late ISR completion frees it again. Audited the m_freem calls. The pointer was nulled after free in every path — no double-free that grep could find.

▸ hypothesis 2

Stale TRB pointer dereferenced after an endpoint reconfigure. The recovery branch calls dwc3_gadget_configure_ep to reset EP1 IN; if a completion event arrived between the reconfigure and the cleanup, sc->eps[3] could be inconsistent. Added defensive null checks. The crash moved by a few instructions but kept happening.

▸ breakthrough

callout_init(&sc->tx_watchdog, 0) at attach. Per callout(9), a zero mpsafe argument does not tie the callout to any driver lock: softclock runs the handler under Giant, which is not the lock the rest of the driver uses, so relative to everything else in the driver the watchdog had no serialization at all. Every other path through dwc3_gadget.c is implicitly under USB_BUS_LOCK (the ISR takes it; the ifnet start callback takes it). The watchdog never took it. The recovery branch and the ISR were stomping on sc->eps[], the in-flight mbuf pointer, and the TRB ring concurrently. The NULL deref was just the loudest victim.

▸ fix

Convert to callout_init_mtx against the bus lock, assert ownership at watchdog entry, and split dwc3_gadget_if_start into a _locked helper plus a wrapper so the ISR/watchdog can call the inner version without recursive lock attempts. 6b15dc5 dwc3_gadget: serialize TX watchdog against ISR via USB_BUS_LOCK

/*
 * MPSAFE-with-mutex: the callout layer takes the bus lock before
 * calling dwc3_gadget_tx_watchdog and releases it after. This is
 * what serializes the watchdog against the ISR.
 */
callout_init_mtx(&sc->tx_watchdog, &sc->sc_bus.bus_mtx, 0);

static void
dwc3_gadget_tx_watchdog(void *arg)
{
    struct dwc3_gadget_softc *sc = arg;

    USB_BUS_LOCK_ASSERT(&sc->sc_bus, MA_OWNED);
    /* …recovery branch now safe against ISR… */
}

Verified by re-running the eleven-transfer phone-shot batch — zero wedges, zero stalls, zero recovery-path entries. The watchdog still fires occasionally under load and clears cleanly.
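The _locked helper plus wrapper split is a general pattern, and a user-space model shows why it avoids recursive lock attempts. This is a sketch with a toy lock, not the driver's code: struct ex_mtx, ex_callout_fire and the other ex_ names are hypothetical, and ex_callout_fire only imitates what the text says the callout layer does for a callout_init_mtx callout (take the lock, call the handler, release it).

```c
#include <assert.h>

/* Toy non-recursive lock: acquiring it twice trips an assertion. */
struct ex_mtx { int held; };

void
ex_mtx_lock(struct ex_mtx *m)
{
	assert(!m->held);
	m->held = 1;
}

void
ex_mtx_unlock(struct ex_mtx *m)
{
	assert(m->held);
	m->held = 0;
}

struct ex_softc {
	struct ex_mtx bus_mtx;
	int tx_kicks;       /* counts entries into the TX start path */
};

/* Inner version: caller must already hold the bus lock. */
void
ex_if_start_locked(struct ex_softc *sc)
{
	assert(sc->bus_mtx.held);
	sc->tx_kicks++;
}

/* Wrapper for callers that do not hold the lock (the ifnet entry point). */
void
ex_if_start(struct ex_softc *sc)
{
	ex_mtx_lock(&sc->bus_mtx);
	ex_if_start_locked(sc);
	ex_mtx_unlock(&sc->bus_mtx);
}

/* Watchdog body: runs with the lock held, so it calls the inner version. */
void
ex_tx_watchdog(struct ex_softc *sc)
{
	assert(sc->bus_mtx.held);
	ex_if_start_locked(sc);
}

/* Stand-in for the callout layer: lock, call the handler, unlock. */
void
ex_callout_fire(struct ex_softc *sc)
{
	ex_mtx_lock(&sc->bus_mtx);
	ex_tx_watchdog(sc);
	ex_mtx_unlock(&sc->bus_mtx);
}
```

If the watchdog called ex_if_start instead of the _locked variant, the toy lock's assertion would fire on the second acquisition. A non-recursive kernel mutex would catch the same mistake.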

▸ lesson

A callout set up with plain callout_init is never serialized against a driver mutex: with mpsafe=0 the handler runs under Giant, and with mpsafe=1 it runs with no lock at all. That is fine for a body that only touches its own argument and ticks, and catastrophic for anything that walks shared driver state. If the body needs the same lock as the ISR, use callout_init_mtx and let the callout layer enforce it rather than taking the lock inside the callout body: with the lock associated, callout_stop and callout_reset called under that lock cannot race a handler that is already on its way in.