Skip to content

holepunch: replace timing-based test wait with identify-driven synchronization#3506

Open
Deln0r wants to merge 1 commit into
libp2p:masterfrom
Deln0r:fix/issue-3440-holepunch-flaky-test
Open

holepunch: replace timing-based test wait with identify-driven synchronization#3506
Deln0r wants to merge 1 commit into
libp2p:masterfrom
Deln0r:fix/issue-3440-holepunch-flaky-test

Conversation

@Deln0r
Copy link
Copy Markdown

@Deln0r Deln0r commented May 13, 2026

Updates #3440. Not closing because I couldn't reliably reproduce the original flake to confirm the fix in CI (see the "Test plan" caveat at the bottom).

Why the test is flaky

TestFailuresOnResponder currently relies on a 100 ms time.Sleep plus an EventuallyWithT poll of h1.Mux().Protocols(). The issue notes that the order of those two steps was found to matter empirically without a clear root cause. Reading (*Service).waitForPublicAddr in p2p/protocol/holepunch/svc.go, the actual asynchronous events the test depends on are:

  1. Before h1.Connect: holepunch.Service must have installed its stream handler. That call is gated on the local node observing at least one public address, so it is genuinely asynchronous from libp2p.New returning.
  2. Between h1.Connect and h2.NewStream(... holepunch.Protocol): h2 must have learned about h1's protocol set via the identify exchange. NewStream dispatches on h2's peerstore view, which the identify subscription populates asynchronously after Connect.

The current code only covers (1). (2) is implicit, which is why the test still races when identify lags Connect.

Change

  • Replace the magic sleep + 100 ms-cadence EventuallyWithT with a tighter Eventually (10 ms poll, 5 s budget) and use slices.Contains for the protocol check.
  • After h1.Connect, add a second Eventually that polls h2.Peerstore().GetProtocols(h1.ID()) until it contains holepunch.Protocol. This is the deterministic version of "wait for identify to land".

Net effect: both async dependencies are explicitly waited on, the happy path takes ~tens of milliseconds instead of a fixed 100 ms sleep, and the comment about the order being load-bearing goes away.

Test plan — caveat

I couldn't reproduce the original flake on my workstation in either branch: stock master panics in simnet.SimConn.WriteTo ("upPacketReceiver is nil. Did you forget to call simconn.SetUpPacketReceiver?") on macOS / Go 1.25, before the test body even runs. The panic reproduces on stock upstream/master with no changes, so it's not caused by this PR. If CI has a working repro path for the flake, the right validation is to run TestFailuresOnResponder with -count=100 on this branch and confirm zero failures.

I'm happy to follow up with a focused environment fix for the simconn panic if that's of independent interest — let me know.

Notes

  • The fix is principled rather than empirical, which I think is appropriate given the original comment ("Apparently changing the order ... flakiness — I don't know why this works").
  • No new dependencies, no behaviour change to production code, no test renames.

…onization

The TestFailuresOnResponder subtests block on a 100 ms sleep followed
by an EventuallyWithT poll of h1.Mux().Protocols(); libp2p#3440 records
that the order of those two steps was found to matter empirically
without a clear root cause. The actual sequencing requirement is:

  1. Before h1.Connect, h1 must have installed its holepunch stream
     handler. (*Service).waitForPublicAddr in
     p2p/protocol/holepunch/svc.go gates SetStreamHandler on
     observing a public address, so this step is genuinely async
     from libp2p.New returning.

  2. Between h1.Connect and h2.NewStream, h2 must have learned about
     h1's protocol set via the identify exchange. NewStream
     dispatches on h2's peerstore view, which the identify
     subscription populates asynchronously after Connect.

Replace the magic sleep with an explicit poll of h1.Mux().Protocols()
on a 10 ms cadence, then add a second poll on
h2.Peerstore().GetProtocols(h1.ID()) after Connect. Both poll intervals
are tight (10 ms) so a stable steady state is observed quickly without
adding fixed latency to the happy path.

Updates libp2p#3440.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant