
Stop a poolcounter server failure from being a SPOF for the service and the API (and the site)
Closed, Resolved · Public

Description

https://wikitech.wikimedia.org/wiki/Incident_documentation/20150709-poolcounter

TL;DR
Helium was rebooted and the API was out for 8 minutes.
Reason:
Helium is a poolcounter machine. Unfortunately MediaWiki has a 0.5 s timeout before falling back to the next poolcounter server in line, which is too high.

Event Timeline

Matanya raised the priority of this task to Needs Triage.
Matanya updated the task description.
Matanya added a project: acl*sre-team.
Matanya subscribed.
fgiunchedi subscribed.

There was an API outage due to the poolcounter server dropping packets because of an issue with ferm rules: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150806-poolcounter

fgiunchedi added subscribers: tstarling, Joe.

@Joe has kindly agreed to investigate this; he has already been bouncing ideas around with @tstarling and others.

So, we think the issue here is that a timeout of 0.5 seconds is way too high and, if a server is completely down (so it doesn't respond to SYNs in any way), the connection attempts just pile up.

On the other hand, a very short timeout can result in too many false positives, so what @tstarling suggested was to do what we already do for mysql:

  • set a short timeout, a bit higher than twice the average RTT between the poolcounter server and the appservers
  • retry the connection N times, where N=2 is usually reasonable

I made a test using fsockopen(), like the poolcounter extension does, and the results are promising; see P1887.

We can get the penalty down to the order of ~30-40 milliseconds if a server is down, which I think our systems would be able to withstand.
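
To make this concrete, here is a minimal sketch of the connect-with-retry idea along the lines of the fsockopen() test above (this is not the actual extension code; the timeout, retry count and server layout are illustrative assumptions):

```
<?php
// Minimal sketch of the "short timeout + N retries" idea (illustrative only).
// The timeout is a bit above 2x the typical intra-DC RTT; N=2 attempts per
// server before falling through to the next one in the pool.
function tryConnect( array $servers, float $connectTimeout = 0.01, int $retries = 2 ) {
	foreach ( $servers as $server ) {
		for ( $attempt = 0; $attempt < $retries; $attempt++ ) {
			$errno = 0;
			$errstr = '';
			// fsockopen() accepts a float timeout in seconds.
			$conn = @fsockopen( $server['host'], $server['port'], $errno, $errstr, $connectTimeout );
			if ( $conn !== false ) {
				return $conn; // connected
			}
		}
		// This server looks down; try the next one in the pool.
	}
	return false; // every server failed; the caller degrades gracefully
}

// Worst case with two servers, two attempts each and a 10 ms timeout:
// 2 * 2 * 10 ms ≈ 40 ms of added latency, in line with the ~30-40 ms
// penalty mentioned above.
```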

Change 231996 had a related patch set uploaded (by Giuseppe Lavagetto):
Add support for connect_timeout

https://gerrit.wikimedia.org/r/231996

Change 231996 merged by jenkins-bot:
Add support for connect_timeout

https://gerrit.wikimedia.org/r/231996

Change 238108 had a related patch set uploaded (by Giuseppe Lavagetto):
poolcounter: add connect_timeout in codfw

https://gerrit.wikimedia.org/r/238108

Change 238109 had a related patch set uploaded (by Giuseppe Lavagetto):
poolcounter: enable connect_timeout for testwiki

https://gerrit.wikimedia.org/r/238109

Change 238108 merged by jenkins-bot:
poolcounter: add connect_timeout in codfw

https://gerrit.wikimedia.org/r/238108

Change 238109 merged by jenkins-bot:
poolcounter: enable connect_timeout for testwiki

https://gerrit.wikimedia.org/r/238109
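
For reference, with the patches above the client-side configuration would look roughly like this (a sketch only: the variable and the servers/timeout keys follow MediaWiki's PoolCounter client settings, but the host names and values here are assumptions for illustration, not copied from the actual wmf-config change):

```
<?php
// Illustrative PoolCounter client configuration (values are examples).
$wgPoolCountClientConf = [
	'servers' => [
		'poolcounter-primary.example.wmnet',  // hypothetical primary server
		'poolcounter-fallback.example.wmnet', // hypothetical fallback server
	],
	'timeout' => 0.5,          // overall request timeout, in seconds
	'connect_timeout' => 0.01, // new: per-attempt TCP connect timeout, in seconds
];
```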

After patching HHVM to add support for float timeouts, I did the following test:

  1. Reduced the poolcounter config on a server in codfw to point to a nonexistent IP (which simulates the "machine is powered down" scenario).
  2. Repeatedly requested

curl -H 'Host: test.wikipedia.org' -H 'X-Forwarded-Proto: https' 'http://localhost/w/api.php?action=query&format=json&generator=prefixsearch&redirects=true&gpssearch=a&gps&gpslimit=20&list=search&srsearch=a&srnamespace=0&srwhat=text&srinfo=suggestion&srprop=&sroffset=0&srlimit=1&prop=pageterms%7Cpageimages&wbptterms=description&piprop=thumbnail&pithumbsize=320&pilimit=20&continue=' -v

     which previously hung while connecting to the poolcounter.

  3. Observed error messages in the logs on fluorine about being unable to connect to the poolcounter.

I am pretty confident the problem is now solved.

Change 605436 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/mediawiki-config@master] Use a short connect timeout for PoolCounter

https://gerrit.wikimedia.org/r/605436

Reopening pending deployment of my patch above.

Change 605436 merged by jenkins-bot:
[operations/mediawiki-config@master] Use a short connect timeout for PoolCounter

https://gerrit.wikimedia.org/r/605436

Mentioned in SAL (#wikimedia-operations) [2020-06-15T23:56:55Z] <tstarling@deploy1001> Synchronized wmf-config/PoolCounterSettings.php: reducing connect timeout per T105378 (duration: 01m 00s)

tstarling lowered the priority of this task from High to Low. Jun 16 2020, 12:14 AM

Since deployment 15 minutes ago, we have had 6 connection timeouts logged. We typically get ~5 per day. It's probably fine -- the reason we are setting the timeout so short is that there is not much production impact from a timeout. Let's leave the task open for a day or so, then I'll post the log stats, in case we need to do some further tweaking.

The count over 24 hours was 1689 connection timeout errors. I think that crosses my threshold of concern. Another tweak may be needed.

It does sound like a regression at first. However, https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?panelId=12&fullscreen&tab=visualization&orgId=1&from=now-24h&to=now calculates[1] 229,122,112 processed requests.

1700 failed requests (rounding up) over roughly 229 million processed requests (rounding down) amounts to a failure rate of about 0.00075%. That sounds pretty low (it would mean an availability of 99.99925%, i.e. five nines and then some, which is pretty awesome in my book). I have my doubts we are going to see any particular gain from tweaking, to be honest. Most end users' network connections have worse failure rates (and in-DC things like the jobqueue are going to retry anyway).

That being said, 10ms might indeed be a bit too aggressive and even a 100% increase (to 20ms) isn't going to endanger the infrastructure. Feel free to try it out, but I don't expect miracles.

One thing that is not clear to me: are those fully failed requests, or did MediaWiki just fall back to the next server in the pool?

[1] Use sum(increase(poolcounter_processed_count{job="poolcounter_exporter"}[$_interval])) for the exact date range.
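
Spelling out the arithmetic behind the percentages above, using the rounded figures already quoted (a quick back-of-the-envelope check, nothing more):

```
<?php
// Back-of-the-envelope failure rate from the numbers quoted above.
$failures  = 1700;        // connection timeout errors over 24h (rounded up)
$processed = 229122112;   // processed requests over the same period

$failureRate  = $failures / $processed;     // ~7.4e-6
$availability = ( 1 - $failureRate ) * 100; // ~99.99926%

printf( "failure rate: %.5f%%, availability: %.5f%%\n",
	$failureRate * 100, $availability );
// failure rate: ~0.00074%, availability: ~99.99926%
```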

The log messages represent failure of all servers in the pool, although the error is not user-visible. PoolCounter_ConnectionManager::open() fails silently after two attempts to connect to a server. Once the loop over all hosts in PoolCounter_ConnectionManager::get() fails to get a valid connection, the last error is wrapped in a Status and returned to PoolCounterWork::execute(), which logs the error. It's those log messages that we see. The impact is that PoolCounterWork::execute() runs the work locally, as if the pool population was zero -- a fair assumption unless the cause of the connection error is a very heavy system-wide overload.

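Paraphrasing that path as a condensed sketch (this is not the real MediaWiki code; only the behaviour described above is reflected, and the method bodies are stubs):

```
<?php
// Condensed sketch of the fallback path described above (illustrative only).
class PoolCounter_ConnectionManager_Sketch {
	public function get( array $servers ) {
		$lastError = null;
		foreach ( $servers as $server ) {
			// open() silently makes two short connect attempts to one server.
			$conn = $this->open( $server );
			if ( $conn ) {
				return [ 'conn' => $conn, 'error' => null ];
			}
			$lastError = "timed out connecting to $server";
		}
		// No server reachable: the caller wraps the last error in a Status,
		// logs it, and runs the work locally as if the pool were empty.
		return [ 'conn' => null, 'error' => $lastError ];
	}

	private function open( string $server ) {
		// two short connect attempts with fsockopen(), as sketched earlier;
		// returns false silently if both fail
		return false;
	}
}
```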

Interesting. So the issue seems to be on the client side (MediaWiki) then, not on the poolcounter service side. There are a couple of scenarios I had in mind that could explain transient connectivity issues to the poolcounter service, e.g.:

  • maintenance (e.g. reboots) of the poolcounter hosts
  • resource saturation of the nodes the poolcounter VMs run on

but neither would account for not being able to connect to both poolcounter instances in such a short timeframe (~40 ms, if I calculate correctly). The first because we did not do any such maintenance recently, and more importantly because it would not happen in the same second anyway; the latter because the nodes the VMs are on are not saturated at the same time. The network is definitely fallible of course, but for the failure to hit 2 different destinations, the problem is probably close to the source rather than the destination of the connection.

Considering that the connection errors are apparently harmless and local to the client, and considering that we don't have as much room to move as I first thought in increasing timeouts (since they are multiplied by 4), I'm going to reclose the bug. If excessive timeouts, client-side stalls or network saturation prove to be a problem, that could be a different bug. The problem of poolcounter being a SPOF should be fixed.