Not sure why, needs investigation.
The Cirrus issue is T186765. The replica errors continued on for all DB servers pretty much.
• demon | |
Feb 7 2018, 11:51 PM |
F13336489: Screen Shot 2018-02-08 at 4.20.37 PM.png | |
Feb 9 2018, 12:31 AM |
F13232102: Screen Shot 2018-02-07 at 4.04.04 PM.png | |
Feb 8 2018, 12:09 AM |
Not sure why, needs investigation.
The Cirrus issue is T186765. The replica errors continued on for all DB servers pretty much.
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Resolved | Release | • demon | T183959 1.31.0-wmf.20 deployment blockers | ||
Resolved | • demon | T186764 refreshLinks/jobqueue issues in wmf.20 causing MW-reported replag |
@demon where did you see that lag?
I have been taking a first look at graphs for a given error and I cannot see lag being graphed or any significant spike on any of the hosts for that shard for that time before you rolled back (20:39 according to SAL).
I wasn't able to find any alerts being fired of on IRC either.
By inspecting a bit all the errors caused between 20:36 and 20:40 the majority (if not all) are coming from (https://logstash.wikimedia.org/goto/b8c75a3003ea381d1e35af0b1ccff067):
cli_argv /srv/mediawiki-staging/multiversion/MWScript.php runJobs.php --wiki=commonswiki --maxjobs 3000 --maxtime 50s --type refreshLinks --nothrottle url /rpc/RunJobs.php?wiki=mediawikiwiki&type=CentralAuthCreateLocalAccountJob&maxtime=60&maxmem=300M
Here I am not seeing any big error rate: https://logstash.wikimedia.org/goto/e133241bc85e40f49cab1040a48d836d
Finally, checking s2 master, between 20:35 and 20:41 most of the writes are UPDATE /* LinksUpdate::updateLinksTimestamp
Yeah, but I am not able to see lag anywhere, not grafana or logtash complaining, just those errors I pasted at T186764#3954890
Inspecting s3 binlogs between 20:35 and 20:41 there are quite some INSERT /* LinksUpdate::incrTableUpdate */ IGNORE INTO templatelinks and some of them are a bit weird: https://phabricator.wikimedia.org/P6668 (private paste because I have pasted the full insert)
Inspecting binlogs before the issue I wasn't able to find any insert like that. Could be coincidence, I don't know.
@demon In addition to T186764#3954890, the fact that this happened exactly at code deploy means this is a MediaWiki-libs-Rdbms not a DBA issue. Check with the maintainers of that. Maybe some load balancer issue? CC @aaron
I pinged DBA for help investigating, not because I thought it was a problem on your end :)
Still very very busted. Raising priority again since this is actively breaking the train moving forward.
00:16 <+logmsgbot> !log demon@tin rebuilt and synchronized wikiversions files: group1 to wmf.20 *duck and cover*
Rolled back immediately. But I have stacktraces! There's two, but obviously same proximate area of code (jobqueue/refreshLinks):
#0 /srv/mediawiki/php-1.31.0-wmf.20/includes/libs/rdbms/lbfactory/LBFactory.php(414): Wikimedia\Rdbms\LBFactory->waitForReplication(array) #1 /srv/mediawiki/php-1.31.0-wmf.20/includes/jobqueue/jobs/RefreshLinksJob.php(258): Wikimedia\Rdbms\LBFactory->commitAndWaitForReplication(string, integer) #2 /srv/mediawiki/php-1.31.0-wmf.20/includes/jobqueue/jobs/RefreshLinksJob.php(122): RefreshLinksJob->runForTitle(Title) #3 /srv/mediawiki/php-1.31.0-wmf.20/includes/jobqueue/JobRunner.php(294): RefreshLinksJob->run() #4 /srv/mediawiki/php-1.31.0-wmf.20/includes/jobqueue/JobRunner.php(193): JobRunner->executeJob(RefreshLinksJob, Wikimedia\Rdbms\LBFactoryMulti, BufferingStatsdDataFactory, integer) #5 /srv/mediawiki/rpc/RunJobs.php(47): JobRunner->run(array) #6 {main}
#0 /srv/mediawiki/php-1.31.0-wmf.20/includes/libs/rdbms/lbfactory/LBFactory.php(414): Wikimedia\Rdbms\LBFactory->waitForReplication(array) #1 /srv/mediawiki/php-1.31.0-wmf.20/includes/jobqueue/utils/PurgeJobUtils.php(75): Wikimedia\Rdbms\LBFactory->commitAndWaitForReplication(string, integer) #2 [internal function]: Closure$PurgeJobUtils::invalidatePages(integer) #3 /srv/mediawiki/php-1.31.0-wmf.20/includes/libs/rdbms/database/Database.php(2839): call_user_func_array(Closure$PurgeJobUtils::invalidatePages;2704, array) #4 /srv/mediawiki/php-1.31.0-wmf.20/includes/libs/rdbms/database/Database.php(2768): Wikimedia\Rdbms\Database->runOnTransactionIdleCallbacks(integer) #5 /srv/mediawiki/php-1.31.0-wmf.20/includes/jobqueue/utils/PurgeJobUtils.php(79): Wikimedia\Rdbms\Database->onTransactionIdle(Closure$PurgeJobUtils::invalidatePages;2704, string) #6 /srv/mediawiki/php-1.31.0-wmf.20/includes/deferred/LinksUpdate.php(405): PurgeJobUtils::invalidatePages(Wikimedia\Rdbms\DatabaseMysqli, integer, array) #7 /srv/mediawiki/php-1.31.0-wmf.20/includes/deferred/LinksUpdate.php(229): LinksUpdate->invalidateImageDescriptions(array) #8 /srv/mediawiki/php-1.31.0-wmf.20/includes/deferred/LinksUpdate.php(176): LinksUpdate->doIncrementalUpdate() #9 /srv/mediawiki/php-1.31.0-wmf.20/includes/deferred/DeferredUpdates.php(259): LinksUpdate->doUpdate() #10 /srv/mediawiki/php-1.31.0-wmf.20/includes/deferred/DeferredUpdates.php(210): DeferredUpdates::runUpdate(LinksUpdate, Wikimedia\Rdbms\LBFactoryMulti, string, integer) #11 /srv/mediawiki/php-1.31.0-wmf.20/includes/deferred/DeferredUpdates.php(131): DeferredUpdates::execute(array, string, integer) #12 /srv/mediawiki/php-1.31.0-wmf.20/includes/MediaWiki.php(904): DeferredUpdates::doUpdates(string) #13 /srv/mediawiki/php-1.31.0-wmf.20/includes/MediaWiki.php(724): MediaWiki->restInPeace(string, boolean) #14 [internal function]: Closure$MediaWiki::doPostOutputShutdown() #15 {main}
I know @Ottomata and @Pchelolo did some work earlier today on jobs/change-prop so CCing them, but this was with wmf.20 before & after their work so I may be wrong. Oh and re-pinging @aaron because JobQueue/refreshLinks generally.
Change 409199 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: make DOMAIN_ANY work on new MySQL connections
Change 409199 merged by jenkins-bot:
[mediawiki/core@master] rdbms: make DOMAIN_ANY ignore bogus MySQL DB names in config
Change 409327 had a related patch set uploaded (by Chad; owner: Aaron Schulz):
[mediawiki/core@wmf/1.31.0-wmf.20] rdbms: make DOMAIN_ANY ignore bogus MySQL DB names in config
Change 409327 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.20] rdbms: make DOMAIN_ANY ignore bogus MySQL DB names in config
Do you have a one-line summary of what was the issue? "bug on X patch/deployment of Y feature"- not looking to blame anyone, just genuinely curious.
@demon because we may have had misunderstanding in the past (T186764#3955920), maybe adding "DBA (not dba team/blocked external)" would be better understood as "please help us if you have the time, but you don't own it"? I am not demanding that, just suggesting if that could help avoiding missunderstandings if it doesn't take a lot of time? What do you think?