
SiteStatsInit::refresh() triggered inappropriately, caused downtime
Closed, Resolved, Public

Description

On pl.wikipedia.org from 05:58:45 onwards, SiteStatsInit::refresh() began to be called several times per second. It's not known at this stage why SiteStats::isSane() returned false.

The binlog shows that the refresh() queries were often executed in autocommit mode, meaning that the DELETE query was committed before the INSERT query began. This would have caused isSane() to return false until the new row insert was committed, leading to a flood of attempted refreshes.
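The window described above can be reproduced in miniature. The following is only a sketch under stated assumptions: it uses Python's sqlite3 module in autocommit mode as a stand-in for MySQL, a simplified two-column site_stats table, and a hypothetical is_sane() helper that only checks whether the row exists (the real SiteStats::isSane() may check more than that).

```python
import os
import sqlite3
import tempfile

QUERY = "SELECT ss_total_edits FROM site_stats WHERE ss_row_id = 1"

def is_sane(reader):
    """Hypothetical stand-in for SiteStats::isSane(): true only if the row exists."""
    return len(reader.execute(QUERY).fetchall()) == 1

def demo_autocommit_window():
    path = os.path.join(tempfile.mkdtemp(), "stats.db")
    # isolation_level=None means every statement is committed on its own (autocommit).
    writer = sqlite3.connect(path, isolation_level=None)
    reader = sqlite3.connect(path, isolation_level=None)
    writer.execute(
        "CREATE TABLE site_stats (ss_row_id INTEGER PRIMARY KEY, ss_total_edits INTEGER)"
    )
    writer.execute("INSERT INTO site_stats VALUES (1, 100)")

    observed = [is_sane(reader)]   # row present before the refresh
    writer.execute("DELETE FROM site_stats WHERE ss_row_id = 1")
    observed.append(is_sane(reader))   # DELETE is already committed: row is gone
    writer.execute("INSERT INTO site_stats VALUES (1, 101)")
    observed.append(is_sane(reader))   # row is back once the INSERT commits

    writer.close()
    reader.close()
    return observed

print(demo_autocommit_window())
```

Any request that runs the sanity check between the two statements sees a missing row and, in the scenario reported here, would start another refresh.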

Eventually, a flood of SELECT COUNT(*) queries at around 07:10 caused an overload on all s2 slaves, leading to an overload of the apache pool and site-wide downtime. SiteStatsInit was disabled and all related queries were killed. When the dust settled, the site_stats row was missing, and had to be recovered from binlogs.

I suggest removing the isSane() checks from loadAndLazyInit(), and doing a refresh only from maintenance scripts or web-based upgrade. SiteStats::load() should be able to tolerate a missing site_stats row, and the accessor functions should return false without giving a PHP warning. Additionally, the refresh should be done with REPLACE instead of DELETE and INSERT.
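A rough sketch of the last two suggestions, again using Python's sqlite3 as a stand-in for MySQL and a simplified table. The function names load_stats(), total_edits() and refresh_stats() are hypothetical and are not MediaWiki functions; the point is only that a single REPLACE statement never leaves the row absent between two committed statements, and that readers can cope with a missing row without raising.

```python
import sqlite3

def load_stats(conn):
    """Return the stored edit count, or None if the site_stats row is missing."""
    rows = conn.execute(
        "SELECT ss_total_edits FROM site_stats WHERE ss_row_id = 1"
    ).fetchall()
    return rows[0][0] if rows else None

def total_edits(conn):
    """Accessor that returns False (no error, no refresh) when the row is missing."""
    value = load_stats(conn)
    return False if value is None else value

def refresh_stats(conn, new_total):
    # One statement: the old row is replaced atomically, so there is no
    # committed state in which the row does not exist.
    conn.execute(
        "REPLACE INTO site_stats (ss_row_id, ss_total_edits) VALUES (1, ?)",
        (new_total,),
    )

def demo_replace_refresh():
    conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit
    conn.execute(
        "CREATE TABLE site_stats (ss_row_id INTEGER PRIMARY KEY, ss_total_edits INTEGER)"
    )
    before = total_edits(conn)   # row missing: accessor returns False
    refresh_stats(conn, 100)     # creates the row
    refresh_stats(conn, 101)     # replaces it in a single statement
    return before, total_edits(conn)

print(demo_replace_refresh())
```

In this sketch the refresh would be called only from a maintenance script, never from the read path, which is the separation proposed above.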


Version: 1.18.x
Severity: normal

Details

Reference
bz34156

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 12:17 AM
bzimport set Reference to bz34156.
bzimport added a subscriber: Unknown Object (MLST).
Krinkle added a project: Platform Engineering.

I'm not very familiar with the code myself, but I would expect that, after all this, the original issue is no longer possible. CC-ing Tim to validate, just in case :)

aaron claimed this task.
aaron added a subscriber: aaron.

The current code will never trigger this path if $wgMiserMode is set, which it is in production.