
Stop a poolcounter server failure from being a SPOF for the service and the API (and the site)
Closed, Resolved (Public)

Description

https://wikitech.wikimedia.org/wiki/Incident_documentation/20150709-poolcounter

TL;DR
Helium was rebooted and the API was out for 8 minutes.
Reason:
Helium is a poolcounter machine. Unfortunately, MediaWiki waits for a 0.5-second timeout before falling back to the next poolcounter server in line, which is too high.

Details

Related Gerrit Patches:
operations/mediawiki-config (master): poolcounter: enable connect_timeout for testwiki
operations/mediawiki-config (master): poolcounter: add connect_timeout in codfw
mediawiki/extensions/PoolCounter (master): Add support for connect_timeout

Event Timeline

Matanya created this task. Jul 9 2015, 6:17 PM
Matanya raised the priority of this task from to Needs Triage.
Matanya updated the task description.
Matanya added a project: acl*sre-team.
Matanya added a subscriber: Matanya.
Restricted Application added a subscriber: Aklapper. Jul 9 2015, 6:17 PM
fgiunchedi triaged this task as High priority. Jul 20 2015, 2:22 PM
fgiunchedi added a subscriber: fgiunchedi.
Dzahn added a subscriber: Dzahn. Aug 10 2015, 5:21 PM

There was an API outage caused by a poolcounter server dropping packets due to an issue with ferm rules: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150806-poolcounter

Joe set Security to None.
fgiunchedi reassigned this task from fgiunchedi to Joe. Aug 11 2015, 8:30 AM
fgiunchedi added subscribers: tstarling, Joe.

@Joe has kindly agreed to investigate this; he has already been bouncing ideas around with @tstarling and others.

Joe added a comment. Aug 17 2015, 9:17 AM

So, we think the issue here is that a timeout of 0.5 seconds is far too high: if a server is completely down (so it doesn't respond to SYNs in any way), the connection attempts just pile up.

On the other hand, a very short timeout can result in too many false positives, so what @tstarling suggested was to do what we already do for MySQL:

  • set a short timeout, a bit higher than twice the average RTT between the poolcounter and the appservers
  • retry the connection N times, where N=2 is usually reasonable
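
The two points above can be sketched roughly as follows. This is an illustrative Python sketch only, not the PoolCounter extension's code (which is PHP and uses fsockopen()); the function names, the 20 ms default timeout and the server list format are assumptions made up for the example.

```python
import socket


def connect_with_retry(host, port, connect_timeout=0.02, attempts=2):
    """Try to open a TCP connection, giving up quickly on each attempt.

    connect_timeout (seconds) and attempts are hypothetical defaults.
    Returns a connected socket, or None if every attempt failed.
    """
    for _ in range(attempts):
        try:
            # The returned socket keeps connect_timeout as its I/O timeout;
            # a caller would normally reset it with settimeout() afterwards.
            return socket.create_connection((host, port), timeout=connect_timeout)
        except OSError:
            # Covers both timeouts (no answer to the SYN) and refusals.
            continue
    return None


def connect_to_first_available(servers, connect_timeout=0.02, attempts=2):
    """Walk a list of (host, port) pairs and return the first live connection."""
    for host, port in servers:
        sock = connect_with_retry(host, port, connect_timeout, attempts)
        if sock is not None:
            return sock
    return None
```

With this shape, a dead server costs at most connect_timeout × attempts before the client moves on to the next one, instead of a single long wait.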

I ran a test using fsockopen(), as the poolcounter extension does, and the results are promising; see P1887.

We can get to a penalty on the order of ~30-40 milliseconds if a server is down, which I think our systems would be able to withstand.
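
For illustration, the arithmetic behind such a penalty could look like the Python sketch below. The RTT samples and the 20% margin are invented for the example (the measured values are in P1887 and are not reproduced here); only the rule "a bit higher than twice the average RTT, retried N times" comes from the discussion above.

```python
def connect_timeout_from_rtt(rtt_samples_ms, margin=1.2):
    """Timeout a bit higher than twice the average RTT.

    margin is a hypothetical safety factor, not a value from the task.
    """
    average_rtt = sum(rtt_samples_ms) / len(rtt_samples_ms)
    return 2 * average_rtt * margin


def worst_case_penalty_ms(timeout_ms, attempts=2):
    """Time lost on a dead server: every attempt runs into the timeout."""
    return timeout_ms * attempts


# Hypothetical RTT samples in milliseconds, chosen only to show the shape
# of the calculation.
samples = [7.0, 8.0, 9.0]
timeout_ms = connect_timeout_from_rtt(samples)
penalty_ms = worst_case_penalty_ms(timeout_ms, attempts=2)
print(f"timeout={timeout_ms:.1f} ms, worst-case penalty={penalty_ms:.1f} ms")
```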

Change 231996 had a related patch set uploaded (by Giuseppe Lavagetto):
Add support for connect_timeout

https://gerrit.wikimedia.org/r/231996

Change 231996 merged by jenkins-bot:
Add support for connect_timeout

https://gerrit.wikimedia.org/r/231996

Change 238108 had a related patch set uploaded (by Giuseppe Lavagetto):
poolcounter: add connect_timeout in codfw

https://gerrit.wikimedia.org/r/238108

Change 238109 had a related patch set uploaded (by Giuseppe Lavagetto):
poolcounter: enable connect_timeout for testwiki

https://gerrit.wikimedia.org/r/238109

Change 238108 merged by jenkins-bot:
poolcounter: add connect_timeout in codfw

https://gerrit.wikimedia.org/r/238108

Change 238109 merged by jenkins-bot:
poolcounter: enable connect_timeout for testwiki

https://gerrit.wikimedia.org/r/238109

I think we should merge T32452 into this one.

Joe added a comment. Feb 2 2016, 10:08 AM

After patching HHVM to add support for float timeouts, I ran the following test:

  1. Reduced the poolcounter config on a server in codfw to point to a single nonexistent IP (which simulates the "machine is powered down" scenario)
  2. Repeatedly requested the following, which previously hung while connecting to the poolcounter:
curl -H 'Host: test.wikipedia.org' -H 'X-Forwarded-Proto: https' 'http://localhost/w/api.php?action=query&format=json&generator=prefixsearch&redirects=true&gpssearch=a&gps&gpslimit=20&list=search&srsearch=a&srnamespace=0&srwhat=text&srinfo=suggestion&srprop=&sroffset=0&srlimit=1&prop=pageterms%7Cpageimages&wbptterms=description&piprop=thumbnail&pithumbsize=320&pilimit=20&continue=' -v
  3. Observed error messages about being unable to connect to the poolcounter in the logs on fluorine

I am pretty confident the problem is now solved.

Joe closed this task as Resolved. Feb 2 2016, 10:08 AM