Hi, a lot of (if not all) special pages on Commons have not been updated for a month now. Before that, they were updated about once or twice a week. Some examples: https://commons.wikimedia.org/wiki/Special:BrokenRedirects, https://commons.wikimedia.org/wiki/Special:DoubleRedirects, https://commons.wikimedia.org/wiki/Special:UncategorizedCategories or https://commons.wikimedia.org/wiki/Special:ListDuplicatedFiles. I guess there is something wrong on the server side, such as a cron job that is not running, which should be fixed.
All of these are enabled on Commons, so it looks like something is wrong with the cron job. But they appear to be updating on other wikis. Weird.
[3b1aca6d12a4ea6c0eecc3a7] [no req] Wikimedia\Rdbms\DBExpectedError from line 847 of /srv/mediawiki/php-1.30.0-wmf.18/includes/libs/rdbms/database/DatabaseMysqlBase.php: MASTER_POS_WAIT() or MASTER_GTID_WAIT() failed: MySQL server has gone away (10.64.0.93)
#0 /srv/mediawiki/php-1.30.0-wmf.18/includes/libs/rdbms/loadbalancer/LoadBalancer.php(597): Wikimedia\Rdbms\DatabaseMysqlBase->masterPosWait(Wikimedia\Rdbms\MySQLMasterPos, integer)
#1 /srv/mediawiki/php-1.30.0-wmf.18/includes/libs/rdbms/loadbalancer/LoadBalancer.php(510): Wikimedia\Rdbms\LoadBalancer->doWait(integer, boolean, integer)
#2 /srv/mediawiki/php-1.30.0-wmf.18/includes/libs/rdbms/lbfactory/LBFactory.php(361): Wikimedia\Rdbms\LoadBalancer->waitForAll(Wikimedia\Rdbms\MySQLMasterPos, integer)
#3 /srv/mediawiki/php-1.30.0-wmf.18/includes/GlobalFunctions.php(3032): Wikimedia\Rdbms\LBFactory->waitForReplication(array)
#4 /srv/mediawiki/php-1.30.0-wmf.18/maintenance/updateSpecialPages.php(157): wfWaitForSlaves()
#5 /srv/mediawiki/php-1.30.0-wmf.18/maintenance/updateSpecialPages.php(47): UpdateSpecialPages->doSpecialPageCacheUpdates(Wikimedia\Rdbms\DatabaseMysqli)
#6 /srv/mediawiki/php-1.30.0-wmf.18/maintenance/doMaintenance.php(92): UpdateSpecialPages->execute()
#7 /srv/mediawiki/php-1.30.0-wmf.18/maintenance/updateSpecialPages.php(164): require_once(string)
#8 /srv/mediawiki/multiversion/MWScript.php(99): require_once(string)
I'm not really clear on what this means. Is it saying that slave lag was so high on db1081 on September 16 that it timed out waiting for slaves? (That's probably not what it means, since it happens every time, but with a different slave: https://logstash.wikimedia.org/goto/9769e35ad59971a214bf88d323d480b1 )
There was no lag on the last occurrence of that error:
(The precision of the monitoring cannot rule out lag entirely, but it is enough to say that lag was under 60 seconds and most likely close to 0; either way, lag is not the source of the errors.) In any case, even if there was lag on db1081, that by itself shouldn't affect the overall execution.
That, combined with the fact that it fails every time, means that it is not an isolated incident. My bet would be on a query problem or a logic problem (some kind of long transaction expecting non-repeatable-read behavior). The reason it could fail on Commons only is that the query is long there because of some Commons particularity (large number of images, large number of articles, large number of templates, etc.).
Someone should manually run the job and trace the execution/database patterns.
(Likely unrelated to the MediaWiki-General-or-Unknown code base as this is about Wikimedia sites)
I'd say likely a MediaWiki bug right now, based on it detecting non-existent lag. If it is not that, a secondary option would be the lag check (there are some issues with the measuring method that I passed on to Performance; not sure if they have already changed that) or a watchdog. This could be the DB-level query killer that kicks in when a server gets overloaded, but looks very unlikely. Perhaps the (MediaWiki) write watchdog for inserts taking more than X seconds? I would ask Performance for their opinion, although a killed query probably means not that the query killer is bad, but that the query being used is.
there will be read queries here against vslow slaves that are very long (on the order of an hour)
I know; the watchdog takes those into account, but under extreme pressure (outage-like, 5000 concurrent connections) it starts killing everything. I do not see that happening, so I commented:
but looks very unlikely
I do not know how the mediawiki large-write-transaction killer works.
Assuming this works, special pages on Commons should start updating again on Oct 5.
The cause seems to be similar to T171027: a spike in the number of recentchanges entries from Wikidata makes counting the number of active users for Special:Statistics really slow, and we were only reconnecting to the DB after query page updates, not after callback cache updates.
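To make the failure mode concrete, here is a minimal simulation of the pattern described above. It is not MediaWiki code: `FakeClock`, `FakeConnection`, `run_updates`, the 600-second idle timeout and the one-hour step durations are all hypothetical names and values chosen for illustration. It only assumes that the server drops a connection that sits idle while a long step runs elsewhere, and that the next replication wait on that stale connection then fails.

```python
class FakeClock:
    """Simulated clock so the example runs instantly (hypothetical helper)."""
    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class FakeConnection:
    """Stand-in for a DB handle that the server drops after being idle too long."""
    def __init__(self, idle_timeout, clock):
        self.idle_timeout = idle_timeout
        self.clock = clock
        self.last_used = clock()
        self.reconnects = 0

    def wait_for_replication(self):
        # Using a connection that has been idle past the timeout fails,
        # mimicking the "MySQL server has gone away" error in the trace.
        if self.clock() - self.last_used > self.idle_timeout:
            raise ConnectionError("MySQL server has gone away")
        self.last_used = self.clock()

    def reconnect(self):
        self.last_used = self.clock()
        self.reconnects += 1


def run_updates(conn, steps, reconnect_after):
    """Run each (kind, step) pair, reconnecting only after the listed kinds."""
    for kind, step in steps:
        step()  # long-running work that never touches `conn`
        if kind in reconnect_after:
            conn.reconnect()
        conn.wait_for_replication()


def demo(reconnect_after, idle_timeout=600, step_seconds=3600):
    """Return True if both update kinds complete, False if the connection died."""
    clock = FakeClock()
    conn = FakeConnection(idle_timeout, clock)
    steps = [
        ("query_page", lambda: clock.advance(step_seconds)),
        ("callback", lambda: clock.advance(step_seconds)),  # e.g. a slow count
    ]
    try:
        run_updates(conn, steps, reconnect_after)
        return True
    except ConnectionError:
        return False
```

In this sketch, `demo({"query_page"})` models the old behaviour (reconnecting only after query page updates) and `demo({"query_page", "callback"})` models reconnecting after both kinds of update, so the slow callback step no longer leaves a stale connection behind for the replication wait.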