Helium was rebooted, and the API was out for 8 minutes.
helium is a poolcounter machine. Unfortunately, MediaWiki has a 0.5-second timeout before falling back to the next poolcounter server in line, which is too high.
- Mentioned In
- T32452: gracefully handle a poolcounterd outage
T123723: Move bacula director and storage daemon off helium?
T123734: Migrate pool counters to trusty/jessie
T112834: Prefixsearch requests time out on mobile testwiki
rOMWC7efefbc3d42c: poolcounter: enable connect_timeout for testwiki
rOMWCbbdc66ecb3fa: poolcounter: add connect_timeout in codfw
rEPOCaf3ce612a811: Add support for connect_timeout
rMEXT23743609d36b: Updated mediawiki/extensions Project: mediawiki/extensions/PoolCounter…
T104996: Ferm rules for backup roles
- Mentioned Here
- P1887 Fast_reconnect.php
T83729: Fix monitoring of poolcounter service
T32452: gracefully handle a poolcounterd outage
T65027: Improve poolcounter error messages.
So, we think the issue here is that a timeout of 0.5 seconds is way too high: if a server is completely down (so it doesn't respond to SYNs in any way), the connection attempts just pile up.
On the other hand, a very short timeout can result in too many false positives, so what @tstarling suggested was to do what we already do for MySQL:
- set a short timeout, a bit higher than twice the average RTT between the poolcounter and the appservers
- retry the connection N times, where N=2 is usually reasonable
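As a rough illustration of the short-timeout-plus-retry idea, here is a minimal Python sketch. It is not the PoolCounter extension's code (that is PHP and uses fsockopen()); the function names, the 20 ms default timeout and the server list format are all assumptions made for the example.

```python
import socket


def connect_with_retries(host, port, timeout_s=0.02, attempts=2):
    """Try to open a TCP connection, giving up quickly on each attempt.

    Returns a connected socket, or None if every attempt failed.
    The 20 ms default is illustrative, not a value from this task.
    """
    for _ in range(attempts):
        try:
            # create_connection raises OSError on refusal or timeout
            return socket.create_connection((host, port), timeout=timeout_s)
        except OSError:
            continue
    return None


def first_reachable(servers, timeout_s=0.02, attempts=2):
    """Walk a list of (host, port) pairs and return the first socket that connects."""
    for host, port in servers:
        sock = connect_with_retries(host, port, timeout_s, attempts)
        if sock is not None:
            return sock
    return None
```

With this shape, a powered-down server costs at most `timeout_s * attempts` before the next server in the list is tried, instead of a full 0.5 seconds per request.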
I ran a test using fsockopen(), as the PoolCounter extension does, and the results are promising; see P1887.
We can get the penalty down to the order of 30-40 milliseconds if a server is down, which I think our systems would be able to withstand.
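To make the arithmetic behind that penalty explicit, here is a small sketch. The RTT, the safety margin and the 15 ms timeout are assumed numbers for illustration; the task does not state the actual values used.

```python
def connect_timeout_s(avg_rtt_s, margin=1.25):
    """Timeout "a bit higher than twice the average RTT"; the margin is an assumption."""
    return 2 * avg_rtt_s * margin


def worst_case_penalty_s(timeout_s, attempts=2):
    """Time lost on a dead server: every attempt runs into its full timeout."""
    return timeout_s * attempts


# Hypothetical example: a 15 ms timeout tried twice costs 30 ms before falling back.
penalty = worst_case_penalty_s(0.015, attempts=2)
```

Under these assumptions, timeouts of roughly 15-20 ms with N=2 would give a penalty in the 30-40 ms range, compared with 500 ms for the old single timeout.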
After patching HHVM to add support for float timeouts, I ran the following test:
- reduced the poolcounter config on a server in codfw to a single nonexistent IP (which simulates the "machine is powered down" scenario)
- Requested repeatedly
curl -H 'Host: test.wikipedia.org' -H 'X-Forwarded-Proto: https' 'http://localhost/w/api.php?action=query&format=json&generator=prefixsearch&redirects=true&gpssearch=a&gps&gpslimit=20&list=search&srsearch=a&srnamespace=0&srwhat=text&srinfo=suggestion&srprop=&sroffset=0&srlimit=1&prop=pageterms%7Cpageimages&wbptterms=description&piprop=thumbnail&pithumbsize=320&pilimit=20&continue=' -v
which previously hung while connecting to the poolcounter
- Observed error messages about being unable to connect to the poolcounter in the logs on fluorine
I am pretty confident the problem is now solved.