Page MenuHomePhabricator

Fix tests of PoolCounter extension
Closed, ResolvedPublic

Description

The PoolCounter test suite is flappy, specially on CI it usually fails with traces such as:

  Scenario: Just readers               # features/simulation.feature:2
    When I start 1024 reader like sims # features/step_definitions/sim_steps.rb:1
Can't keep trying.  Giving up.
Can't keep trying.  Giving up.
    And wait 10 seconds                # features/step_definitions/sim_steps.rb:19
    Then all sims report success       # features/step_definitions/sim_steps.rb:22
      Cannot assign requested address - connect(2) for 127.0.0.1:7531 (Errno::EADDRNOTAVAIL)
      ./features/support/client.rb:51:in `connect' 
      ./features/support/client.rb:51:in `connect'
      ./features/support/client.rb:14:in `request'
      ./features/support/sim.rb:56:in `send_to_client'
      ./features/support/reader_sim.rb:8:in `act'
      ./features/support/sim.rb:23:in `block in start'
      features/simulation.feature:5:in `Then all sims report success'

Some scenario spawn 1024 clients connecting to the daemon, the logic being in daemon/tests/features/support/client.rb. There is a note about EADDRNOTAVAIL. To work around exhaustion of TCP ephemeral ports, the client bind with SO_REUSEADDR in an attempt to reuse a port, but there is still a failure eventually.

With 1024 clients spawned, maybe that reach the file descriptor limit which seems to default to 1024?

Event Timeline

Umherirrender removed Umherirrender as the assignee of this task.
hashar subscribed.

I rephrased the task detail to give more information. The test suite establish 1024 socket connections to poolcounterd and eventually one fail with EADDRNOTAVAIL. Either because:

  • ephemeral ports are exhausted (though the client should reuse address)
  • there is no more file descriptors available

Change 417021 had a related patch set uploaded (by Hashar; owner: Hashar):
[mediawiki/extensions/PoolCounter@master] tests: wait before retrying connection

https://gerrit.wikimedia.org/r/417021

I can reproduce by running poolcounterd under strace and then running:

bundle exec cucumber --name 'Just readers' features/simulation.feature

That constantly fails on my machine.

I kept trying to figure out a solution yesterday night, but eventually I give up. I am not familiar enough with sockets / low level TCP and I don't know anything about the PoolCounter protocol.

This test is now non-voting and removed from gate-and-submit. Please revert If7bd90e81e1e98d3b6a170fa8c9a0d6690b7eb36 when the test is fixed and should run voting on gate-and-submit. Thanks.

greg subscribed.

FTR: The entry for steward/maintainer for poolcounter is empty on Dev/Maint

Krinkle subscribed.

Fairly low-priority, assuming it is a test-specific issue. The current issue is mainly that the tests are flaky and makes contributions difficult to review in Jenkins.

Snipped from an email I wrote:

Tangentially, it would be nice if we could get the poolcounter daemon
tests voting again, they're currently too flaky:
https://phabricator.wikimedia.org/T178517.

They're written in ruby/cucumber/rspec right now, since that's what Nik
really liked when he was developing CirrusSearch. Since we're moving
away from that stack for browser tests, I was thinking about porting it
to a different language that more of us are familiar with aka Python but
I never made any actual progress on that.

Change 417021 abandoned by Hashar:
tests: wait before retrying connection

https://gerrit.wikimedia.org/r/417021

You can check my commit message at https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/PoolCounter/+/417021/

Roughly the step When I start 1024 search like sims attempts to open 1024 tcp sockets. When repeated that eventually exhausts the pool of sockets available in Linux. I am not sure why it has been set at 1024, maybe lowering to 128 would do it.

Change 455385 had a related patch set uploaded (by Legoktm; owner: Legoktm):
[mediawiki/services/poolcounter@master] Port test suite to Python

https://gerrit.wikimedia.org/r/455385

Change 455385 merged by jenkins-bot:
[mediawiki/services/poolcounter@master] Port test suite to Python

https://gerrit.wikimedia.org/r/455385

There's still one flaky test, but that's marked as xfailing so it won't cause CI builds to fail. Filed T203890: PoolCounter's test_errors.py:test_locking_while_waiting is flaky due to a bug for it.