Update of QueryPages failing on commons with "MASTER_POS_WAIT() or MASTER_GTID_WAIT() failed: MySQL server has gone away"
Closed, ResolvedPublicbug

Description

Hi, a lot of (if not even all) special pages on commons dont get updated since a month now. Before they got updated abut once or twice a week. Some examples: https://commons.wikimedia.org/wiki/Special:BrokenRedirects, https://commons.wikimedia.org/wiki/Special:DoubleRedirects, https://commons.wikimedia.org/wiki/Special:UncategorizedCategories or https://commons.wikimedia.org/wiki/Special:ListDuplicatedFiles. I guess there is something wrong on the server site, like a non running cron or similar, which should be fixed.

JuTa created this task.Sep 16 2017, 7:01 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 16 2017, 7:01 PM
JuTa updated the task description. (Show Details)Sep 16 2017, 7:06 PM
JuTa updated the task description. (Show Details)Sep 16 2017, 7:09 PM

(Likely unrelated to the MediaWiki-General-or-Unknown code base as this is about Wikimedia sites)

Zoranzoki21 updated the task description. (Show Details)Sep 16 2017, 8:21 PM

(Likely unrelated to the MediaWiki-General-or-Unknown code base as this is about Wikimedia sites)

Sorry, missclick :)

[3b1aca6d12a4ea6c0eecc3a7] [no req] Wikimedia\Rdbms\DBExpectedError from line 847 of /srv/mediawiki/php-1.30.0-wmf.18/includes/libs/rdbms/database/DatabaseMysqlBase.php: MASTER_POS_WAIT() or MASTER_GTID_WAIT() failed: MySQL server has gone away (10.64.0.93)
Backtrace:
#0 /srv/mediawiki/php-1.30.0-wmf.18/includes/libs/rdbms/loadbalancer/LoadBalancer.php(597): Wikimedia\Rdbms\DatabaseMysqlBase->masterPosWait(Wikimedia\Rdbms\MySQLMasterPos, integer)
#1 /srv/mediawiki/php-1.30.0-wmf.18/includes/libs/rdbms/loadbalancer/LoadBalancer.php(510): Wikimedia\Rdbms\LoadBalancer->doWait(integer, boolean, integer)
#2 /srv/mediawiki/php-1.30.0-wmf.18/includes/libs/rdbms/lbfactory/LBFactory.php(361): Wikimedia\Rdbms\LoadBalancer->waitForAll(Wikimedia\Rdbms\MySQLMasterPos, integer)
#3 /srv/mediawiki/php-1.30.0-wmf.18/includes/GlobalFunctions.php(3032): Wikimedia\Rdbms\LBFactory->waitForReplication(array)
#4 /srv/mediawiki/php-1.30.0-wmf.18/maintenance/updateSpecialPages.php(157): wfWaitForSlaves()
#5 /srv/mediawiki/php-1.30.0-wmf.18/maintenance/updateSpecialPages.php(47): UpdateSpecialPages->doSpecialPageCacheUpdates(Wikimedia\Rdbms\DatabaseMysqli)
#6 /srv/mediawiki/php-1.30.0-wmf.18/maintenance/doMaintenance.php(92): UpdateSpecialPages->execute()
#7 /srv/mediawiki/php-1.30.0-wmf.18/maintenance/updateSpecialPages.php(164): require_once(string)
#8 /srv/mediawiki/multiversion/MWScript.php(99): require_once(string)
#9 {main}

I'm not really clear what this means. Is it saying that slave lag was so high on db1081 on september 16 that it timedout waiting for slaves? (That's probably not what it means, since it happens every time but with a different slave https://logstash.wikimedia.org/goto/9769e35ad59971a214bf88d323d480b1 )

Bawolff renamed this task from No update of special pages on commons to Update of QueryPages failing on commons with "MASTER_POS_WAIT() or MASTER_GTID_WAIT() failed: MySQL server has gone away".Sep 18 2017, 3:00 PM
jcrespo added a subscriber: jcrespo.

There was no lag on the last occurrence of that error:
https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?panelId=4&fullscreen&orgId=1&from=1505552411468&to=1505562592019
https://grafana.wikimedia.org/dashboard/db/mediawiki-mysql-loadbalancer?panelId=1&fullscreen&orgId=1&from=1505557748993&to=1505561233557

(the precision of the monitoring doesn't mean there was not lag at all, but is enough to say it was less than 60 seconds and most likely close to 0, both enough to say it is not the source of the errors). In any case, even if there was lag on db1081, that shouldn't influence by itself the overall execution.

That, combined with the fact that it fails every time means that it is not an isolated incident. My bet would be on a query problem or a logic problem (some kind of long transaction expecting not-repeatable read data mode). The reason why it could fail on commons only is that it it a long query for some commons particularity (large number of images, large number of articles, large number of templates, etc.).

Someone should manually run the job and trace the execution/database patterns.

(Likely unrelated to the MediaWiki-General-or-Unknown code base as this is about Wikimedia sites)

I'd say likely a mediawiki bug right now, based on detecting non-existent lag. If it is not that, a secondary options it would be the lag check (there are some issues I transmitted to performance about the measuring method- not sure if they already changed that) or a watchdog. This could be the DB-level query killer when a server gets overloaded- but looks very unlikely. Probably the (mediawiki) write watchdog for inserts taking more than X seconds? I would ask performance for his opinion- even if probably a query killer means not that the query killer is bad, but that the used query is.

This is in the context of the expensive special page updates, so there will be read queries here against vslow slaves that are very long (on the order of an hour)

there will be read queries here against vslow slaves that are very long (on the order of an hour)

I know, the watchdog has those into account, but under extreme pressure (outage-like, 5000 concurrent connections) it starts killing everthing. I do not see that happening, so I commented:

but looks very unlikely

I do not know how the mediawiki large-write-transaction killer works.

Jeff_G added a subscriber: Jeff_G.

Change 380888 had a related patch set uploaded (by Brian Wolff; owner: Brian Wolff):
[mediawiki/core@master] Reopen db connection after doing SpecialPageCacheUpdates

https://gerrit.wikimedia.org/r/380888

Change 380888 merged by jenkins-bot:
[mediawiki/core@master] Reopen db connection after doing SpecialPageCacheUpdates

https://gerrit.wikimedia.org/r/380888

Bawolff closed this task as Resolved.Sep 28 2017, 4:31 PM
Bawolff claimed this task.

Assuming this works, special pages on commons should start updating again on oct 5.


The cause seems to be similar to T171027 - A spike in amount of recentchanges entries from wikidata makes counting the number of active users for Special:Statistics really slow, and we were only reconnecting to db after query page updates and not callback cache updates.