
Stop a poolcounter server failure from being a SPOF for the service and the API (and the site)
Closed, Resolved · Public

Description

https://wikitech.wikimedia.org/wiki/Incident_documentation/20150709-poolcounter

TL;DR
Helium was rebooted and the API was out for 8 minutes.
Reason:
Helium is a poolcounter machine. Unfortunately MediaWiki has a 0.5 s timeout before falling back to the next poolcounter server in line, which is too high.

Event Timeline

Matanya raised the priority of this task to Needs Triage.
Matanya updated the task description.
Matanya added a project: acl*sre-team.
Matanya subscribed.
fgiunchedi subscribed.

There was an API outage due to the poolcounter server dropping packets because of an issue with ferm rules: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150806-poolcounter

fgiunchedi added subscribers: tstarling, Joe.

@Joe has kindly agreed to investigate this; he has already been bouncing ideas around with @tstarling and others.

So, we think the issue here is that a timeout of 0.5 seconds is way too high and, if a server is completely down (so it doesn't respond to SYNs in any way), the connection attempts just pile up.

On the other hand, a very short timeout can result in too many false positives, so what @tstarling suggested was to do what we already do for mysql:

  • set a short timeout, a bit higher than twice the average RTT between the poolcounter server and the appservers
  • retry the connection N times, where N=2 is usually reasonable

I made a test using fsockopen(), like the poolcounter extension does, and the results are promising; see P1887.

We can get the penalty down to the order of ~30-40 milliseconds if a server is down, which I think our systems would be able to withstand.
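
To make this concrete, here is a minimal sketch of the connect-with-retry idea along the lines of the fsockopen() test above (this is not the actual extension code; the timeout, retry count and server layout are illustrative assumptions):

```
<?php
// Minimal sketch of the "short timeout + N retries" idea (illustrative only).
// The timeout is a bit above 2x the typical intra-DC RTT; N=2 attempts per
// server before falling through to the next one in the pool.
function tryConnect( array $servers, float $connectTimeout = 0.01, int $retries = 2 ) {
	foreach ( $servers as $server ) {
		for ( $attempt = 0; $attempt < $retries; $attempt++ ) {
			$errno = 0;
			$errstr = '';
			// fsockopen() accepts a float timeout in seconds.
			$conn = @fsockopen( $server['host'], $server['port'], $errno, $errstr, $connectTimeout );
			if ( $conn !== false ) {
				return $conn; // connected
			}
		}
		// This server looks down; try the next one in the pool.
	}
	return false; // every server failed; the caller degrades gracefully
}

// Worst case with two servers, two attempts each and a 10 ms timeout:
// 2 * 2 * 10 ms ≈ 40 ms of added latency, in line with the ~30-40 ms
// penalty mentioned above.
```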

Change 231996 had a related patch set uploaded (by Giuseppe Lavagetto):
Add support for connect_timeout

https://gerrit.wikimedia.org/r/231996

Change 231996 merged by jenkins-bot:
Add support for connect_timeout

https://gerrit.wikimedia.org/r/231996

Change 238108 had a related patch set uploaded (by Giuseppe Lavagetto):
poolcounter: add connect_timeout in codfw

https://gerrit.wikimedia.org/r/238108

Change 238109 had a related patch set uploaded (by Giuseppe Lavagetto):
poolcounter: enable connect_timeout for testwiki

https://gerrit.wikimedia.org/r/238109

Change 238108 merged by jenkins-bot:
poolcounter: add connect_timeout in codfw

https://gerrit.wikimedia.org/r/238108

Change 238109 merged by jenkins-bot:
poolcounter: enable connect_timeout for testwiki

https://gerrit.wikimedia.org/r/238109
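
For reference, with the patches above the client-side configuration would look roughly like this (a sketch only: the variable and the servers/timeout keys follow MediaWiki's PoolCounter client settings, but the host names and values here are assumptions for illustration, not copied from the actual wmf-config change):

```
<?php
// Illustrative PoolCounter client configuration (values are examples).
$wgPoolCountClientConf = [
	'servers' => [
		'poolcounter-primary.example.wmnet',  // hypothetical primary server
		'poolcounter-fallback.example.wmnet', // hypothetical fallback server
	],
	'timeout' => 0.5,          // overall request timeout, in seconds
	'connect_timeout' => 0.01, // new: per-attempt TCP connect timeout, in seconds
];
```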

After patching HHVM to add support for float timeouts, I did the following test:

  1. Reduced the poolcounter config on a server in codfw to point to a nonexistent IP (which simulates the "machine is powered down" scenario).
  2. Repeatedly requested

curl -H 'Host: test.wikipedia.org' -H 'X-Forwarded-Proto: https' 'http://localhost/w/api.php?action=query&format=json&generator=prefixsearch&redirects=true&gpssearch=a&gps&gpslimit=20&list=search&srsearch=a&srnamespace=0&srwhat=text&srinfo=suggestion&srprop=&sroffset=0&srlimit=1&prop=pageterms%7Cpageimages&wbptterms=description&piprop=thumbnail&pithumbsize=320&pilimit=20&continue=' -v

     which previously hung while connecting to the poolcounter.

  3. Observed error messages in the logs on fluorine about being unable to connect to the poolcounter.

I am pretty confident the problem is now solved.

Change 605436 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/mediawiki-config@master] Use a short connect timeout for PoolCounter

https://gerrit.wikimedia.org/r/605436

Reopening pending deployment of my patch above.

Change 605436 merged by jenkins-bot:
[operations/mediawiki-config@master] Use a short connect timeout for PoolCounter

https://gerrit.wikimedia.org/r/605436

Mentioned in SAL (#wikimedia-operations) [2020-06-15T23:56:55Z] <tstarling@deploy1001> Synchronized wmf-config/PoolCounterSettings.php: reducing connect timeout per T105378 (duration: 01m 00s)

tstarling lowered the priority of this task from High to Low. Jun 16 2020, 12:14 AM

Since deployment 15 minutes ago, we have had 6 connection timeouts logged. We typically get ~5 per day. It's probably fine -- the reason we are setting the timeout so short is that there is not much production impact from a timeout. Let's leave the task open for a day or so, then I'll post the log stats, in case we need to do some further tweaking.

The count over 24 hours was 1689 connection timeout errors. I think that crosses my threshold of concern. Another tweak may be needed.

It does sound like a regression at first. However, https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?panelId=12&fullscreen&tab=visualization&orgId=1&from=now-24h&to=now calculates[1] 229,122,112 processed requests.

1700 failed requests (rounding up) over roughly 229 million processed requests (rounding down) amounts to a failure rate of about 0.00075%. That sounds pretty low (it would mean an availability of 99.99925%, i.e. five nines and then some, which is pretty awesome in my book). I have my doubts we are going to see any particular gain from tweaking, to be honest. Most end users' network connections have worse failure rates (and in-DC things like the jobqueue are going to retry anyway).

That being said, 10ms might indeed be a bit too aggressive and even a 100% increase (to 20ms) isn't going to endanger the infrastructure. Feel free to try it out, but I don't expect miracles.

One thing that is not clear to me: are those fully failed requests, or did MediaWiki just fall back to the next server in the pool?

[1] Use sum(increase(poolcounter_processed_count{job="poolcounter_exporter"}[$_interval])) for the exact date range.
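
Spelling out the arithmetic behind the percentages above, using the rounded figures already quoted (a quick back-of-the-envelope check, nothing more):

```
<?php
// Back-of-the-envelope failure rate from the numbers quoted above.
$failures  = 1700;        // connection timeout errors over 24h (rounded up)
$processed = 229122112;   // processed requests over the same period

$failureRate  = $failures / $processed;     // ~7.4e-6
$availability = ( 1 - $failureRate ) * 100; // ~99.99926%

printf( "failure rate: %.5f%%, availability: %.5f%%\n",
	$failureRate * 100, $availability );
// failure rate: ~0.00074%, availability: ~99.99926%
```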

The log messages represent failure of all servers in the pool, although the error is not user-visible. PoolCounter_ConnectionManager::open() fails silently after two attempts to connect to a server. Once the loop over all hosts in PoolCounter_ConnectionManager::get() fails to get a valid connection, the last error is wrapped in a Status and returned to PoolCounterWork::execute(), which logs the error. It's those log messages that we see. The impact is that PoolCounterWork::execute() runs the work locally, as if the pool population was zero -- a fair assumption unless the cause of the connection error is a very heavy system-wide overload.

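Paraphrasing that path as a condensed sketch (this is not the real MediaWiki code; only the behaviour described above is reflected, and the method bodies are stubs):

```
<?php
// Condensed sketch of the fallback path described above (illustrative only).
class PoolCounter_ConnectionManager_Sketch {
	public function get( array $servers ) {
		$lastError = null;
		foreach ( $servers as $server ) {
			// open() silently makes two short connect attempts to one server.
			$conn = $this->open( $server );
			if ( $conn ) {
				return [ 'conn' => $conn, 'error' => null ];
			}
			$lastError = "timed out connecting to $server";
		}
		// No server reachable: the caller wraps the last error in a Status,
		// logs it, and runs the work locally as if the pool were empty.
		return [ 'conn' => null, 'error' => $lastError ];
	}

	private function open( string $server ) {
		// two short connect attempts with fsockopen(), as sketched earlier;
		// returns false silently if both fail
		return false;
	}
}
```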

Interesting. So the issue seems to be on the client side (MediaWiki) then, not on the poolcounter service side. There are a couple of scenarios I had in mind that could explain transient connectivity issues to the poolcounter service, e.g.:

  • maintenance (e.g. reboots) of the poolcounter hosts
  • resource saturation of the nodes the poolcounter VMs run on

but neither would account for not being able to connect to both poolcounter instances in such a short timeframe (~40 ms, if I calculate correctly). The first because we did not do any such maintenance recently, and more importantly because it would not happen in the same second anyway; the latter because the nodes the VMs are on are not saturated at the same time. The network is definitely fallible of course, but for the failure to hit 2 different destinations, the problem is probably close to the source rather than the destination of the connection.

Considering that the connection errors are apparently harmless and local to the client, and considering that we don't have as much room to move as I first thought in increasing timeouts (since they are multiplied by 4), I'm going to reclose the bug. If excessive timeouts, client-side stalls or network saturation prove to be a problem, that could be a different bug. The problem of poolcounter being a SPOF should be fixed.