Fix tests of PoolCounter extension
The PoolCounter test suite is flaky, especially on CI, where it usually fails with traces such as:

  Scenario: Just readers               # features/simulation.feature:2
    When I start 1024 reader like sims # features/step_definitions/sim_steps.rb:1
Can't keep trying.  Giving up.
Can't keep trying.  Giving up.
    And wait 10 seconds                # features/step_definitions/sim_steps.rb:19
    Then all sims report success       # features/step_definitions/sim_steps.rb:22
      Cannot assign requested address - connect(2) for (Errno::EADDRNOTAVAIL)
      ./features/support/client.rb:51:in `connect' 
      ./features/support/client.rb:51:in `connect'
      ./features/support/client.rb:14:in `request'
      ./features/support/sim.rb:56:in `send_to_client'
      ./features/support/reader_sim.rb:8:in `act'
      ./features/support/sim.rb:23:in `block in start'
      features/simulation.feature:5:in `Then all sims report success'

Some scenarios spawn 1024 clients connecting to the daemon; the logic is in daemon/tests/features/support/client.rb, which has a note about EADDRNOTAVAIL. To work around exhaustion of TCP ephemeral ports, the client binds with SO_REUSEADDR in an attempt to reuse a port, but a failure still occurs eventually.
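For illustration, the workaround described above (set SO_REUSEADDR, connect, retry when no local address is available) could look roughly like this in Python. This is only a sketch, not the code from client.rb: the helper name connect_with_retry, the number of attempts and the delay are all assumptions made for the example.

```python
import errno
import socket
import time


def connect_with_retry(host, port, attempts=5, delay=0.1):
    """Connect to (host, port), retrying on EADDRNOTAVAIL.

    Hypothetical helper: the name, attempts and delay are illustrative only.
    """
    last_error = None
    for _ in range(attempts):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            # Set SO_REUSEADDR before connecting, mirroring the workaround
            # described in the task.
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            sock.connect((host, port))
            return sock
        except OSError as e:
            # Always release the descriptor of the failed attempt.
            sock.close()
            if e.errno != errno.EADDRNOTAVAIL:
                raise
            last_error = e
            # Wait a little before the next attempt.
            time.sleep(delay)
    raise last_error
```

Closing the socket on every failed attempt matters here: otherwise each retry would leak a file descriptor and make the second hypothesis (descriptor exhaustion) more likely.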

With 1024 clients spawned, maybe that reaches the file descriptor limit, which seems to default to 1024?
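A quick way to check that guess is to print the limit for the process running the tests. A minimal sketch using Python's standard resource module (Unix only); the helper name fd_limits is made up for the example:

```python
import resource


def fd_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = fd_limits()
print(f"soft={soft} hard={hard}")
```

If the soft limit turns out to be 1024, then 1024 client sockets plus stdin, stdout, stderr and anything else the process has open would not fit.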

I rephrased the task details to give more information. The test suite establishes 1024 socket connections to poolcounterd and eventually one fails with EADDRNOTAVAIL, either because:

  • ephemeral ports are exhausted (though the client should reuse address)
  • there are no more file descriptors available
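The errno may help tell the two hypotheses apart. On Linux, running out of file descriptors is normally reported as EMFILE (per process) or ENFILE (system-wide), while EADDRNOTAVAIL from connect(2) means no usable local address or port was available. A small sketch of that distinction; classify_connect_error is a hypothetical helper and not part of the test suite:

```python
import errno


def classify_connect_error(exc):
    """Map an OSError raised while opening or connecting a socket to a likely cause."""
    if exc.errno == errno.EADDRNOTAVAIL:
        return "no local address/port available"
    if exc.errno in (errno.EMFILE, errno.ENFILE):
        return "file descriptor limit reached"
    return "other"
```

Under that reading, the EADDRNOTAVAIL in the trace above would point at the first hypothesis rather than the second, though this has not been verified against the daemon.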

Change 417021 had a related patch set uploaded (by Hashar; owner: Hashar):
[mediawiki/extensions/PoolCounter@master] tests: wait before retrying connection

I can reproduce by running poolcounterd under strace and then running:

bundle exec cucumber --name 'Just readers' features/simulation.feature

That consistently fails on my machine.

I kept trying to figure out a solution last night, but eventually I gave up. I am not familiar enough with sockets / low-level TCP and I don't know anything about the PoolCounter protocol.

This test is now non-voting and removed from gate-and-submit. Please revert If7bd90e81e1e98d3b6a170fa8c9a0d6690b7eb36 when the test is fixed and should run voting on gate-and-submit. Thanks.

FTR: The entry for steward/maintainer for poolcounter is empty on Dev/Maint

Fairly low-priority, assuming it is a test-specific issue. The main issue at present is that the tests are flaky, which makes contributions difficult to review in Jenkins.

Snipped from an email I wrote:

Tangentially, it would be nice if we could get the poolcounter daemon
tests voting again, they're currently too flaky:

They're written in ruby/cucumber/rspec right now, since that's what Nik
really liked when he was developing CirrusSearch. Since we're moving
away from that stack for browser tests, I was thinking about porting it
to a different language that more of us are familiar with aka Python but
I never made any actual progress on that.

Change 417021 abandoned by Hashar:
tests: wait before retrying connection

You can check my commit message at

Roughly, the step "When I start 1024 search like sims" attempts to open 1024 TCP sockets. When repeated, that eventually exhausts the pool of sockets available in Linux. I am not sure why it was set to 1024; maybe lowering it to 128 would do it.

Change 455385 had a related patch set uploaded (by Legoktm; owner: Legoktm):
[mediawiki/services/poolcounter@master] Port test suite to Python

Change 455385 merged by jenkins-bot:
[mediawiki/services/poolcounter@master] Port test suite to Python

There's still one flaky test, but that's marked as xfailing so it won't cause CI builds to fail. Filed T203890: PoolCounter's is flaky due to a bug for it.