Page MenuHomePhabricator

refreshLinks/jobqueue issues in wmf.20 causing MW-reported replag
Closed, ResolvedPublic

Description

Not sure why, needs investigation.

Screen Shot 2018-02-07 at 4.04.04 PM.png (1×2 px, 356 KB)

The Cirrus issue is T186765. The replica errors continued on for all DB servers pretty much.

Event Timeline

demon triaged this task as Unbreak Now! priority.Feb 7 2018, 11:51 PM
demon created this task.
Marostegui lowered the priority of this task from Unbreak Now! to Medium.Feb 8 2018, 6:27 AM
Marostegui subscribed.

Decreasing priority as this is not happening anymore

@demon where did you see that lag?
I have been taking a first look at graphs for a given error and I cannot see lag being graphed or any significant spike on any of the hosts for that shard for that time before you rolled back (20:39 according to SAL).
I wasn't able to find any alerts being fired of on IRC either.

By inspecting a bit all the errors caused between 20:36 and 20:40 the majority (if not all) are coming from (https://logstash.wikimedia.org/goto/b8c75a3003ea381d1e35af0b1ccff067):

cli_argv	       	/srv/mediawiki-staging/multiversion/MWScript.php runJobs.php --wiki=commonswiki --maxjobs 3000 --maxtime 50s --type refreshLinks --nothrottle

url	       	/rpc/RunJobs.php?wiki=mediawikiwiki&type=CentralAuthCreateLocalAccountJob&maxtime=60&maxmem=300M

Here I am not seeing any big error rate: https://logstash.wikimedia.org/goto/e133241bc85e40f49cab1040a48d836d

Finally, checking s2 master, between 20:35 and 20:41 most of the writes are UPDATE /* LinksUpdate::updateLinksTimestamp

This coincides with my deployment @ 20:36 and rollback @ 20:39.

It was over on the group1 dashboard, which reports MW and HHVM errors

This coincides with my deployment @ 20:36 and rollback @ 20:39.

Yeah, but I am not able to see lag anywhere, not grafana or logtash complaining, just those errors I pasted at T186764#3954890
Inspecting s3 binlogs between 20:35 and 20:41 there are quite some INSERT /* LinksUpdate::incrTableUpdate */ IGNORE INTO templatelinks and some of them are a bit weird: https://phabricator.wikimedia.org/P6668 (private paste because I have pasted the full insert)

Inspecting binlogs before the issue I wasn't able to find any insert like that. Could be coincidence, I don't know.

jcrespo added subscribers: aaron, jcrespo.

@demon In addition to T186764#3954890, the fact that this happened exactly at code deploy means this is a MediaWiki-libs-Rdbms not a DBA issue. Check with the maintainers of that. Maybe some load balancer issue? CC @aaron

I pinged DBA for help investigating, not because I thought it was a problem on your end :)

demon raised the priority of this task from Medium to Unbreak Now!.Feb 9 2018, 12:31 AM
demon added subscribers: Pchelolo, Ottomata.

Still very very busted. Raising priority again since this is actively breaking the train moving forward.

00:16 <+logmsgbot> !log demon@tin rebuilt and synchronized wikiversions files: group1 to wmf.20 *duck and cover*

Rolled back immediately. But I have stacktraces! There's two, but obviously same proximate area of code (jobqueue/refreshLinks):

Job runners/Terbium
#0 /srv/mediawiki/php-1.31.0-wmf.20/includes/libs/rdbms/lbfactory/LBFactory.php(414): Wikimedia\Rdbms\LBFactory->waitForReplication(array)
#1 /srv/mediawiki/php-1.31.0-wmf.20/includes/jobqueue/jobs/RefreshLinksJob.php(258): Wikimedia\Rdbms\LBFactory->commitAndWaitForReplication(string, integer)
#2 /srv/mediawiki/php-1.31.0-wmf.20/includes/jobqueue/jobs/RefreshLinksJob.php(122): RefreshLinksJob->runForTitle(Title)
#3 /srv/mediawiki/php-1.31.0-wmf.20/includes/jobqueue/JobRunner.php(294): RefreshLinksJob->run()
#4 /srv/mediawiki/php-1.31.0-wmf.20/includes/jobqueue/JobRunner.php(193): JobRunner->executeJob(RefreshLinksJob, Wikimedia\Rdbms\LBFactoryMulti, BufferingStatsdDataFactory, integer)
#5 /srv/mediawiki/rpc/RunJobs.php(47): JobRunner->run(array)
#6 {main}
Normal apaches (I think)
#0 /srv/mediawiki/php-1.31.0-wmf.20/includes/libs/rdbms/lbfactory/LBFactory.php(414): Wikimedia\Rdbms\LBFactory->waitForReplication(array)
#1 /srv/mediawiki/php-1.31.0-wmf.20/includes/jobqueue/utils/PurgeJobUtils.php(75): Wikimedia\Rdbms\LBFactory->commitAndWaitForReplication(string, integer)
#2 [internal function]: Closure$PurgeJobUtils::invalidatePages(integer)
#3 /srv/mediawiki/php-1.31.0-wmf.20/includes/libs/rdbms/database/Database.php(2839): call_user_func_array(Closure$PurgeJobUtils::invalidatePages;2704, array)
#4 /srv/mediawiki/php-1.31.0-wmf.20/includes/libs/rdbms/database/Database.php(2768): Wikimedia\Rdbms\Database->runOnTransactionIdleCallbacks(integer)
#5 /srv/mediawiki/php-1.31.0-wmf.20/includes/jobqueue/utils/PurgeJobUtils.php(79): Wikimedia\Rdbms\Database->onTransactionIdle(Closure$PurgeJobUtils::invalidatePages;2704, string)
#6 /srv/mediawiki/php-1.31.0-wmf.20/includes/deferred/LinksUpdate.php(405): PurgeJobUtils::invalidatePages(Wikimedia\Rdbms\DatabaseMysqli, integer, array)
#7 /srv/mediawiki/php-1.31.0-wmf.20/includes/deferred/LinksUpdate.php(229): LinksUpdate->invalidateImageDescriptions(array)
#8 /srv/mediawiki/php-1.31.0-wmf.20/includes/deferred/LinksUpdate.php(176): LinksUpdate->doIncrementalUpdate()
#9 /srv/mediawiki/php-1.31.0-wmf.20/includes/deferred/DeferredUpdates.php(259): LinksUpdate->doUpdate()
#10 /srv/mediawiki/php-1.31.0-wmf.20/includes/deferred/DeferredUpdates.php(210): DeferredUpdates::runUpdate(LinksUpdate, Wikimedia\Rdbms\LBFactoryMulti, string, integer)
#11 /srv/mediawiki/php-1.31.0-wmf.20/includes/deferred/DeferredUpdates.php(131): DeferredUpdates::execute(array, string, integer)
#12 /srv/mediawiki/php-1.31.0-wmf.20/includes/MediaWiki.php(904): DeferredUpdates::doUpdates(string)
#13 /srv/mediawiki/php-1.31.0-wmf.20/includes/MediaWiki.php(724): MediaWiki->restInPeace(string, boolean)
#14 [internal function]: Closure$MediaWiki::doPostOutputShutdown()
#15 {main}

I know @Ottomata and @Pchelolo did some work earlier today on jobs/change-prop so CCing them, but this was with wmf.20 before & after their work so I may be wrong. Oh and re-pinging @aaron because JobQueue/refreshLinks generally.

demon renamed this task from DB spike during rollout of wmf.20 to group1 to refreshLinks/jobqueue issues in wmf.20 causing MW-reported replag.Feb 9 2018, 12:32 AM

I know @Ottomata and @Pchelolo did some work earlier today on jobs/change-prop so CCing them

The work we were doing was about librdkafka update and we did hold off the transition of refreshLinks job to kafka because of that, so you can be sure there's no possibility our stuff was related to this.

I know @Ottomata and @Pchelolo did some work earlier today on jobs/change-prop so CCing them

The work we were doing was about librdkafka update and we did hold off the transition of refreshLinks job to kafka because of that, so you can be sure there's no possibility our stuff was related to this.

Thanks for clarifying, I wasn't sure what your overall status was :)

Change 409199 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: make DOMAIN_ANY work on new MySQL connections

https://gerrit.wikimedia.org/r/409199

Change 409199 merged by jenkins-bot:
[mediawiki/core@master] rdbms: make DOMAIN_ANY ignore bogus MySQL DB names in config

https://gerrit.wikimedia.org/r/409199

Change 409327 had a related patch set uploaded (by Chad; owner: Aaron Schulz):
[mediawiki/core@wmf/1.31.0-wmf.20] rdbms: make DOMAIN_ANY ignore bogus MySQL DB names in config

https://gerrit.wikimedia.org/r/409327

Change 409327 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.20] rdbms: make DOMAIN_ANY ignore bogus MySQL DB names in config

https://gerrit.wikimedia.org/r/409327

demon claimed this task.

Do you have a one-line summary of what was the issue? "bug on X patch/deployment of Y feature"- not looking to blame anyone, just genuinely curious.

@demon because we may have had misunderstanding in the past (T186764#3955920), maybe adding "DBA (not dba team/blocked external)" would be better understood as "please help us if you have the time, but you don't own it"? I am not demanding that, just suggesting if that could help avoiding missunderstandings if it doesn't take a lot of time? What do you think?