gracefully handle a poolcounterd outage
Closed, Resolved (Public)

Description

Author: afeldman

Description:
PoolCounter as currently deployed is a SPOF in our infrastructure. If it's enabled in MediaWiki and the poolcounterd server is completely down, an error page will be displayed for any article in need of parsing.

There is a separate RT ticket to make poolcounterd redundant in our infrastructure but we'd still like to make sure total failure is handled gracefully.


Version: unspecified
Severity: normal

Details

Reference
bz30452

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 11:58 PM)
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz30452.
bzimport added a subscriber: Unknown Object (MLST).

I thought I fixed this in r84322, which was deployed in March.

shealen.clare wrote:

(In reply to comment #0)

PoolCounter as currently deployed is a SPOF in our infrastructure. If it's
enabled in MediaWiki and the poolcounterd server is completely down, an error
page will be displayed for any article in need of parsing.

There is a separate RT ticket to make poolcounterd redundant in our
infrastructure but we'd still like to make sure total failure is handled
gracefully.

Are the conditions available for you to reproduce this bug (e.g. poolcounter server down), or can we trust Tim that it's been fixed in https://www.mediawiki.org/wiki/Special:Code/MediaWiki/84322 ?

A connection error would return a Status of type fatal, so with r84322 the Apache instance would do the work itself.
The poolcounter failing won't result in downtime for the wiki *if* Michael Jackson doesn't die while it is down. If he does, we would be subject to the same overload as without the poolcounter (and the solution is just to restart it).

Assuming that the server would cope with all those connections in an overload (fd max, tcp buffers...), this is fixed.
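
To make the fallback described above concrete, here is a minimal sketch in Python. It is not the MediaWiki code from r84322: the Status class, run_with_pool, and unreachable_acquire are hypothetical names chosen for illustration. The only behaviour taken from this thread is that a connection error yields a fatal status and the requesting process then does the work itself.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Status:
    """Hypothetical result of a lock request (not MediaWiki's Status class)."""
    ok: bool
    fatal: bool = False
    message: str = ""


def run_with_pool(acquire: Callable[[], Status],
                  release: Callable[[], None],
                  do_work: Callable[[], str],
                  show_error: Callable[[Status], str]) -> str:
    """Run do_work under a pool lock, degrading gracefully on a fatal status."""
    status = acquire()
    if status.ok:
        try:
            return do_work()
        finally:
            release()
    if status.fatal:
        # The pool counter itself is unreachable: do the work without a lock
        # instead of showing an error page.
        return do_work()
    # Non-fatal refusal (assumed here to mean the pool is full): report it.
    return show_error(status)


def unreachable_acquire() -> Status:
    """Simulate a poolcounterd outage: the connection attempt fails."""
    try:
        raise ConnectionRefusedError("poolcounterd is down")
    except OSError as exc:
        return Status(ok=False, fatal=True, message=str(exc))
```

With unreachable_acquire passed as the acquire callback, run_with_pool returns whatever do_work returns and never calls release, which is the "do the work itself" path. As noted above, this removes the error page but not the overload risk during a traffic spike.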

shealen.clare wrote:

With the exception of the recent conversation I generated, this bug has not been touched in at least six months. With this in mind, I've been asked by the bugmeister to bump this bug's priority down from "High". Concerns should be addressed to mah@everybody.org.

Does not look like "high" priority to me, hence setting to normal.

More general info: https://wikitech.wikimedia.org/wiki/PoolCounter

FWIW we just had another outage in prod due to poolcounter being unavailable so this is still an issue.

https://phabricator.wikimedia.org/T104996#1516149

tstarling claimed this task.

As detailed in T105378, this was fixed by reducing the connect timeout.
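
For illustration only, a short connect timeout combined with a local fallback might look like the Python sketch below. The function names and the 0.1 second default are assumptions made for this example; the actual value and code are in T105378 and are not reproduced here.

```python
import socket
from typing import Callable, Optional


def connect_or_none(host: str, port: int,
                    connect_timeout: float = 0.1) -> Optional[socket.socket]:
    """Try to reach the lock server, giving up quickly if it is unavailable.

    connect_timeout is a hypothetical value, not the one used in production.
    """
    try:
        return socket.create_connection((host, port), timeout=connect_timeout)
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return None


def render(host: str, port: int, parse: Callable[[], str]) -> str:
    """Parse with the lock server if reachable, otherwise parse anyway."""
    conn = connect_or_none(host, port)
    if conn is None:
        # Server down: the short timeout bounds how long each request waits
        # before falling back to doing the work locally.
        return parse()
    try:
        # A real client would request and release a lock over conn here.
        return parse()
    finally:
        conn.close()
```

The point of the short timeout is that an unreachable server costs each request only a bounded, small delay, rather than tying up workers that are waiting on a connection that will never complete.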