
Some jobs in refreshLinksPrioritized seem to repeat themselves forever
Closed, Resolved, Public

Description

Looking at the rate of some CirrusSearch jobs, we found that this rate is far higher on metawiki than on the other wikis (~200 updates/sec out of ~400/sec across all our sites).
Digging further, it seems that the refreshLinksPrioritized queue is being populated with jobs that are repeated over and over using the same meta.request_id UUID.

kafkacat -b kafka-main1005.eqiad.wmnet:9092 -t eqiad.mediawiki.job.refreshLinksPrioritized -e -o beginning  | grep '"request_id":"5c452897402f7057c3b099a0"' | head -n 10000 | jq -r .params.title > 5c452897402f7057c3b099a0.lst
sort -u 5c452897402f7057c3b099a0.lst | wc -l
39

Example titles:

B18_1128_WMCH_mob_lrg-WMCH-2018-moblrg-priceofcoffee/en
B18_1128_WMCH_mob_lrg-WMCH-2018-moblrg-selectamount/en
Centralnotice-B18_1128_WMCH_mob_lrg_clone-WMCH-2018-moblrg-allreaders
Centralnotice-B18_1128_WMCH_mob_lrg_clone-WMCH-2018-moblrg-donatenow/it
Centralnotice-B18_1128_WMCH_mob_lrg-WMCH-2018-moblrg-donationsaveraging/de
Centralnotice-B18_1128_WMCH_mob_lrg-WMCH-2018-moblrg-gotoarticle/it
Centralnotice-B18_1128_WMCH_mob_lrg-WMCH-2018-moblrg-selectpaymentmethod/de
Centralnotice-Bridges_Across_Culture-line1/ar
Centralnotice-CommunityInsightsResults2018_1-line1/fr
Centralnotice-CommunityWishlistVote-CommunityWishlist-CNTranslationVote2-title/lb
Centralnotice-CommunityWishlistVote-CommunityWishlist-CNTranslationVote3-title/ja
Centralnotice-FR2015_translations-go-to-article/pl
Centralnotice-FR2015_translations-how-often/pt
Centralnotice-FR2015_translations-just-once/pl
Centralnotice-FR2015_translations-mob-lg-greeting/pt
Centralnotice-FR2015_translations-mob-lg-greeting/sk
Centralnotice-FR2015_translations-mob-lg-pp-four/uk
Centralnotice-FR2015_translations-mob-sm-pp-three/sk
Centralnotice-FR2015_translations-now/sk
Centralnotice-FR2015_translations-skip/pl
Centralnotice-template-Bridges_Across_Culture
Centralnotice-template-WikiMOOC_2018_B
Centralnotice-template-WMCH1819_dsk_lg
Centralnotice-WMCH1819_mob_conversion_nocss_robin-WMCH-2018-moblrg-allreaders/fr
Centralnotice-WMCH1819_mob_conversion_nocss_robin-WMCH-2018-moblrg-donationsaveraging/de
Centralnotice-WMCH1819_mob_conversion_nocss_robin-WMCH-2018-moblrg-noeasytime/it
Centralnotice-WMCH1819_mob_conversion_nocss_robin-WMCH-2018-moblrg-oneminute/fr
Centralnotice-WMCH1819_mob_conversion_nocss_robin-WMCH-2018-moblrg-selectpaymentmethod/de
CommunityInsightsResults2018_3-line2/en
TestBanner_AGreen-msg1/en
WMCH1819_dsk_lg_clone-WMCH-2018-dsklrg-msg/en
WMCH1819_dsk_sm-WMCH-2018-dsksm-text/en
WMCH1819_mob_conversion_nocss_robin-WMCH-2018-moblrg-donatenow/en
WMCH1819_mob_conversion_nocss_robin-WMCH-2018-moblrg-ifeveryone/en
WMCH1819_mob_conversion_nocss_robin-WMCH-2018-moblrg-openknowledge/en
WMCH1819_mob_conversion_nocss_robin-WMCH-2018-moblrg-thankyou/en
WMCH1819_mob_conversion_nocss_robin-WMCH-2018-moblrg-tothepoint/en
WMIT_5X1000_2018_desktop_A-cinquepermille3/en

My understanding is that changeprop propagates the request_id when hitting MW, so seeing the same request_id in multiple jobs could indicate one of the following:

  • the jobs were submitted from the same PHP request
  • there's a collision of request_id
  • a request_id is being hard-coded in the code-base
  • a job is creating another job

I'm leaning toward the last option. A job creating another job is normal in itself, but here something seems to recursively trigger the same set of jobs, with no TTL mechanism to stop the loop.
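To make the suspected mechanism concrete, here is a rough, hypothetical sketch of how a request ID can survive across job generations (this is not the actual MediaWiki/changeprop code; the child-job shape and the call sites are assumptions for illustration, though Job::getParams() and the WebRequest request-ID helpers do exist):

// When a job executes, the runner can restore the originating request ID so that
// logging, and any jobs enqueued from inside this job, carry it forward.
$parentRequestId = $job->getParams()['requestId'] ?? null;
if ( $parentRequestId !== null ) {
    WebRequest::overrideRequestId( $parentRequestId );
}

// ... the job does its work and, somewhere down the line, enqueues another job ...
$childParams = [
    'title' => $title->getDBkey(),             // $title: the page the follow-up job targets
    'requestId' => WebRequest::getRequestId(), // same ID as the parent job
];

// With no depth/TTL field in the params, a chain of jobs that (indirectly)
// re-enqueues itself loops forever under a single request_id.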

First job found in kafka is:

{"$schema":"/mediawiki/job/1.0.0","meta":{"uri":"https://placeholder.invalid/wiki/Special:Badtitle","request_id":"5c452897402f7057c3b099a0","id":"a00c199e-d223-4e89-b0f9-318cd9deb88d","dt":"2023-02-09T13:21:13Z","domain":"meta.wikimedia.org","stream":"mediawiki.job.refreshLinksPrioritized"},"database":"metawiki","type":"refreshLinksPrioritized","sha1":"10125a36bba3625a8afa186ba0ef41a9ffe24ebd","params":{"isOpportunistic":true,"rootJobTimestamp":"20230209132113","namespace":8,"title":"Centralnotice-FR2015_translations-now/sk","requestId":"5c452897402f7057c3b099a0","causeAction":"RefreshLinksJob","causeAgent":"unknown"},"mediawiki_signature":"2ce3d32e26a58119172834c60e69b8030cab90f1"}

Last one is:

{"$schema":"/mediawiki/job/1.0.0","meta":{"uri":"https://placeholder.invalid/wiki/Special:Badtitle","request_id":"5c452897402f7057c3b099a0","id":"8e8ece45-55c8-48bf-9f42-8ee9a31dcb78","dt":"2023-02-16T14:21:06Z","domain":"meta.wikimedia.org","stream":"mediawiki.job.refreshLinksPrioritized"},"database":"metawiki","type":"refreshLinksPrioritized","sha1":"26f521e2cca0ba13b7deee415ae3a284feba0248","params":{"isOpportunistic":true,"rootJobTimestamp":"20230216142106","namespace":866,"title":"WMCH1819_mob_conversion_nocss_robin-WMCH-2018-moblrg-donatenow/en","requestId":"5c452897402f7057c3b099a0","causeAction":"RefreshLinksJob","causeAgent":"unknown"},"mediawiki_signature":"bec5f0612a931926a94cdcd216095c02f02b4235"}

The 5c452897402f7057c3b099a0 request_id can be found in logstash since 2022-11-18.

The few pages I checked all seem to be protected with pr_cascade = 1, so I'm tempted to blame something fishy in how cascade-protected backlinks are propagated.

Currently the rate of refreshLinksPrioritized jobs for metawiki can be up to 400 evt/sec:

 dcausse@stat1004:~$ kafkacat -b kafka-main1005.eqiad.wmnet:9092 -t eqiad.mediawiki.job.refreshLinksPrioritized  -o end | grep '"domain":"meta.wikimedia.org"' | pv -l >/dev/null
% Auto-selecting Consumer mode (use -P or -C to override)
% Reached end of topic eqiad.mediawiki.job.refreshLinksPrioritized [0] at offset 6587943883
2.68k 0:00:08 [ 407 /s] [                                        <=>                                                                                                                                                                                                                     ]

while it fluctuates between 5 and 20 evt/sec for everything else.

Event Timeline

pinging @daniel, in case you have ideas on who could help us to understand and possibly fix this behavior, thanks!

After Petr, no one really understands jobqueue anymore :( I have a feeling this is a bigger issue than it looks. cc. @Krinkle

dcausse renamed this task from Some jobs (refreshLinksPrioritized?) seem to repeat themselves forever (rootJobTimestamp not propagated?) to Some jobs in refreshLinksPrioritized seem to repeat themselves forever. Feb 17 2023, 5:12 PM
dcausse updated the task description.

I know nothing about Kafka or changeprop/jobqueue. Maybe @Eevans can help?

I'm afraid not; I have no experience with changeprop. @hnowlan (tag, you're it)?

I have looked at https://meta.wikimedia.org/w/index.php?title=MediaWiki:Centralnotice-FR2015_translations-volunteers/sk, which gets refreshLinksPrioritized jobs overnight with requestId=185f339a721cfbf471e219c7. That has happened every day of March, overnight; for the last few days it has consistently started around 3:00 UTC, give or take 0-20 minutes. Maybe a background or timed maintenance job is triggering the condition?

Today, the spam for that specific page started at 3:39:21 UTC and stopped at 9:18:21 UTC, with apparently one execution every ~20 seconds.

Krinkle triaged this task as High priority. Apr 10 2023, 9:13 AM

Confirmation

I went looking in Kafka under eqiad.mediawiki.job.refreshLinksPrioritized and didn't find much there, which makes sense as codfw is primary right now. Looking under codfw.mediawiki.job.refreshLinksPrioritized instead returned a large volume of matching jobs for meta.wikimedia.org, and indeed many with the same request ID.

I could not find any today with the request ID that @dcausse observed in February, nor were there any today with the request ID that @hashar observed two weeks ago. However, I did find an otherwise identical and seemingly endless repetition of MediaWiki jobs of type refreshLinksPrioritized for meta.wikimedia.org, with titles in the MediaWiki namespace such as:

{"$schema":"/mediawiki/job/1.0.0","meta":{"uri":"https://placeholder.invalid/wiki/Special:Badtitle","request_id":"933bea6c4bde5eeb73ee95f5","id":"a5472f52-7d69-41b2-90fa-01aa1359c1ba","dt":"2023-04-06T03:48:41Z","domain":"meta.wikimedia.org","stream":"mediawiki.job.refreshLinksPrioritized"},"database":"metawiki","type":"refreshLinksPrioritized","sha1":"e92d20b97ab00f5db596267f3fea15ddbd92fb23","params":{"isOpportunistic":true,"rootJobTimestamp":"20230406034841","namespace":8,"title":"Centralnotice-Wikimania2022regs_V2-festival/bg","requestId":"933bea6c4bde5eeb73ee95f5","causeAction":"RefreshLinksJob","causeAgent":"unknown"},"mediawiki_signature":"8fcbaedcd661cc2cb9f8d84748f9bda4c6603255"}

I collected it by first dumping about 33 GB of Meta-Wiki job descriptions from Kafka: I opened the stream with -o beginning and exited it by hand after letting it run for a few minutes.

$ kafkacat -C -b kafka-main1005.eqiad.wmnet:9092 -t codfw.mediawiki.job.refreshLinksPrioritized -e -o beginning | grep '"domain":"meta' > codfw.mediawiki.job.refreshLinksPrioritized-metawiki.txt

There were about 84,000 jobs with the same request ID for meta.wikimedia.org in the roughly 27 hours from 2023-04-06 03:48 UTC to 2023-04-07 06:36 UTC.

ack 933bea6c4bde5eeb73ee95f5 codfw.mediawiki.job.refreshLinksPrioritized-metawiki.txt | wc -l
84837

They are for many different titles though, so that number is not by itself the repetition count. There were about 7,000 jobs for the exact same (wiki, job type, job parameters, title, request ID) combination.

$ ack 933bea6c4bde5eeb73ee95f5 codfw.mediawiki.job.refreshLinksPrioritized-metawiki.txt | fgrep Centralnotice-Wikimania2022regs_V2-festival/bg | wc -l
7716

Is it BacklinkCache?

David mentioned that BacklinkCache might play a role in starting or amplifying these recursive jobs. There are only a handful of places in our codebase that create jobs of type refreshLinksPrioritized. Via Codesearch we find there are two direct creations of it: RefreshLinksJob::newPrioritized and RefreshSecondaryDataUpdate::getAsJobSpecification.

I'm ignoring the RefreshSecondaryDataUpdate case for now, because the request IDs in question did not match any fatal exceptions in Logstash. This is relevant because RefreshSecondaryDataUpdate is a DeferredUpdate and those run on web requests. The getAsJobSpecification method is used when the update fails, in which case it falls back to a job after logging the failure.

RefreshLinksJob::newPrioritized in turn also has two callers:

LinksUpdate is used after an edit that may have a cascading impact on other pages. It calls $backlinkCache->getCascadeProtectedLinkPages(), loops over the result, and queues a high-priority RefreshLinksJob for each of them via RefreshLinksJob::newPrioritized. This is thus indeed one of the handful of places where such a job can originate.
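A rough sketch of that first path (hypothetical code, not the exact LinksUpdate implementation; the cause parameters are invented for illustration):

// $title is the page that was just edited.
$services = \MediaWiki\MediaWikiServices::getInstance();
$backlinkCache = $services->getBacklinkCacheFactory()->getBacklinkCache( $title );
$jobs = [];
// Every page that is cascade-protected through $title gets a high-priority refresh.
foreach ( $backlinkCache->getCascadeProtectedLinkPages() as $page ) {
    $jobs[] = RefreshLinksJob::newPrioritized(
        $page,
        [ 'causeAction' => 'page-edit', 'causeAgent' => 'unknown' ]
    );
}
$services->getJobQueueGroup()->lazyPush( $jobs );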

The matching method in WikiPage is used during pageviews, when the pageview in question happens to be a ParserCache miss. It calls RefreshLinksJob::newPrioritized for the current page in certain situations, specifically when the current page is protected from editing via cascading page protection.
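And a similarly rough sketch of the second path (again hypothetical and simplified; the isOpportunistic flag is taken from the job params shown earlier):

// $title is the page being viewed on a ParserCache miss. If it is protected
// via cascading protection, queue a prioritized refresh so the set of
// protected transclusions gets re-evaluated.
if ( $title->isCascadeProtected() ) {
    \MediaWiki\MediaWikiServices::getInstance()->getJobQueueGroup()->lazyPush(
        RefreshLinksJob::newPrioritized( $title, [ 'isOpportunistic' => true ] )
    );
}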

Looking at getCascadeProtectedLinkPages in BacklinkCache.php, I first tried to determine what kind of database query it makes. How does it decide which pages to queue a job for, and might that be broken? Perhaps it is accidentally selecting all, or otherwise too many unrelated, pages from the whole wiki. Something like that.

I copied its code to eval.php, made it standalone, and tried to run it in different ways as a dry run (calling getSQL instead of fetchResultSet). I could not get it to return even a single row, though.

BacklinkCache.php
# Adapted from getCascadeProtectedLinkPages():

// Tried one at a time:
$title = Title::makeTitle( NS_MAIN, 'Sandbox' );
$title = Title::makeTitle( NS_MAIN, 'Main_Page' );
$title = Title::makeTitle( NS_MEDIAWIKI, 'Centralnotice-Wikimania2022regs_V2-festival/bg' );

$linksMigration = \MediaWiki\MediaWikiServices::getInstance()->getLinksMigration();
$linkConds = $linksMigration->getLinksConditions( 'templatelinks', $title  );
$db = wfGetDB(DB_REPLICA);
print $db->newSelectQueryBuilder()->select( [ 'page_namespace', 'page_title', 'page_id' ] )->from( 'templatelinks' )->join( 'page_restrictions', null, 'tl_from = pr_page' )->join( 'page', null, 'page_id = tl_from' )->where( $linkConds )->andWhere( [ 'pr_cascade' => 1 ] )->distinct()->getSQL();

//> SELECT DISTINCT page_namespace,page_title,page_id  FROM `templatelinks` JOIN `page_restrictions` ON ((tl_from = pr_page)) JOIN `page` ON ((page_id = tl_from))   WHERE (1=0) AND pr_cascade = 1  

Note how it includes the impossible 1=0 condition. This comes from the LinksMigration service, which is taking the "read new" path (relating to T299417 and the new linktarget table). However, while the table indeed exists and is non-empty on Meta-Wiki, there are no rows for the titles I tried:

mwmaint2002$ sql metawiki
> (metawiki)> SELECT * FROM linktarget WHERE lt_namespace=8 AND lt_title LIKE 'Central%' LIMIT 10;
+-------+--------------+-----------------------------------------------------+
| lt_id | lt_namespace | lt_title                                            |
+-------+--------------+-----------------------------------------------------+
| 78673 |            8 | CentralNotice/Resources/BannerReset.js              |

> SELECT * FROM linktarget WHERE (lt_namespace=0 AND lt_title='Main_Page') OR (lt_namespace=8 AND lt_title LIKE 'Centralnotice-Wikimania2022regs%') LIMIT 10;
Empty set (0.001 sec)

In case I wasn't doing it right, I also tried to call the method as-is from the higher level service:

mwscript eval.php metawiki
$bc = BacklinkCache::get(Title::newFromText('Main_Page'));
$it = $bc->getCascadeProtectedLinkPages();
return iterator_to_array($it);
//> array(0) {}

So... for now I'll move on and assume BacklinkCache isn't returning unexpectedly many results, as it's not returning anything at all, let alone doing so for "too many" pages or in ways that might cause 7,000 recursive jobs every day.

The next avenue to explore could be WikiPage.php: triggerOpportunisticLinksUpdate. I might take a look at that one tomorrow if someone else hasn't. It has a nice long comment explaining/justifying why it's doing something fairly expensive and unusual, which makes it a good next suspect to chase.

Many of the pages that are getting refreshLinks are translation pages that shouldn't allow direct editing (e.g. Centralnotice-Wikimania2022regs_V2-festival/bg). Maybe something in the Translate extension makes them pretend to be protected pages? Just a thought; I don't know much about the internals of this.

Note how it includes the impossible 1=0 condition. This comes from the LinksMigration service, which is taking the "read new" path (relating to T299417 and the new linktarget table). However, while the table indeed exists and is non-empty on Meta-Wiki, there are no rows for the titles I tried:

I did that, so here's a quick explanation of why it adds 1=0. The query basically wants to get the linktarget id for a title; if the title doesn't exist in the linktarget table (e.g. it has never been used as a template), it adds 1=0 to say there shouldn't be any corresponding rows in the templatelinks table.
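A minimal sketch of that logic (hypothetical, not the actual LinksMigration code; $db and $title as in the earlier snippet):

// Resolve the title to a linktarget row; templatelinks references it via tl_target_id.
$ltId = $db->newSelectQueryBuilder()
    ->select( 'lt_id' )
    ->from( 'linktarget' )
    ->where( [
        'lt_namespace' => $title->getNamespace(),
        'lt_title' => $title->getDBkey(),
    ] )
    ->caller( __METHOD__ )
    ->fetchField();

// No linktarget row means the title has never been linked as a template,
// so force an empty result set instead of an unconstrained join.
$linkConds = $ltId !== false ? [ 'tl_target_id' => (int)$ltId ] : [ '1=0' ];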

Investigated another example. I think that this edit: https://meta.wikimedia.org/w/index.php?title=MediaWiki:FundraisingBanners/SmallPrint-2017&oldid=24946127, made on 2023-04-27T15:31:24 with request_id 393add43-367f-4953-b9ef-86fa050efa72, is the origin of 693984 refreshLinksPrioritized jobs so far (all for a set of 3908 pages).

WikiPage.php: triggerOpportunisticLinksUpdate does seem to be called when accessing the ParserOutput. I have the impression that the problem might be:

cirrusSearchLinksUpdate -> calls getParserOutput() to get the page content on a page with the cascade restriction -> WikiPage::triggerOpportunisticLinksUpdate() -> creates a RefreshLinksPrioritized job -> LinksUpdate -> back to square one with a new cirrusSearchLinksUpdate job on the same page.

My suggestion: Get rid of triggerOpportunisticLinksUpdate completely, it's speculative without proven benefit and has caused issues before.

WikiPage.php: triggerOpportunisticLinksUpdate does seem to be called when accessing the ParserOutput. I have the impression that the problem might be:

cirrusSearchLinksUpdate -> calls getParserOutput() to get the page content on a page with the cascade restriction -> WikiPage::triggerOpportunisticLinksUpdate() -> creates a RefreshLinksPrioritized job -> LinksUpdate -> back to square one with a new cirrusSearchLinksUpdate job on the same page.

You're absolutely right, but it is also deeply wrong. WikiPage::triggerOpportunisticLinksUpdate() is only called from one place: PoolWorkArticleViewCurrent. As the name suggests, this is a variant of the normal ParserOutput generation flow, explicitly and specifically for index.php page views, which go through Article.php/ViewAction.php.

Going bottom-up, I see ParserOutputAccess calling this unconditionally from ParserOutputAccess::getParserOutput, which makes no sense at all. That defeats the purpose of this being a special case, and grossly inflates use of PoolCounter by applying it to various uncontentious calls from the JobQueue, likely including CirrusSearch. It seems like ParserOutputAccess is a good place to fix this. However, looking through the history, this is likely not where the regression began. In change 631264, Daniel introduced ParserOutputAccess by moving code from WikiPage::getParserOutput mostly as-is, and the unconditional use of PoolWorkArticleViewCurrent has been there since at least 2012. I suspect the issue is conflation between WikiPage::getParserOutput (which comes from Article.php) and direct use of ParserCache (which in recent history has often gone via the relatively low-level ContentHandler::getParserOutput).

Going top-down, I see CirrusSearch/LinksUpdate/BuildDocument calling $contentHandler->getParserOutputForIndexing, which does seem like the correct entry point. No issue there. This in turn calls ParserOutputAccess, which is a new pattern. Not "wrong" semantically, but given the previous paragraph, this is likely where the problem began to be exposed for Cirrus. And indeed, in change 766183 Amir fixed a regression in ContentHandler::getParserOutputForIndexing that had caused it to stop using ParserCache, making it use ParserCache again (via ParserOutputAccess). However, that started creating nasty side-effects (given the previous paragraph), because it treats the caller as if it were a real person waiting for a potentially high-traffic article with an empty ParserCache: taking PoolCounter locks, waiting on PoolCounter, allowing "stale cache" fallbacks, displaying error messages if the server is hot, etc. None of that makes sense in a JobQueue context.

My suggestion: Get rid of triggerOpportunisticLinksUpdate completely, it's speculative without proven benefit and has caused issues before.

Based on the above analysis, I believe this is a much wider issue, triggered by any call to ParserOutputAccess; the specific case of triggerOpportunisticLinksUpdate seems a red herring. However, the way triggerOpportunisticLinksUpdate works does stand out as unusual, so here's a quick analysis to see whether it is indeed needed. I'll analyze only the common ~99% use case under ideal scenarios, to roughly categorize it as either essential or as likely specific to edge cases and potentially unneeded.

triggerOpportunisticLinksUpdate has a long comment that talks about how the Main Page works on Wikipedia and its interaction with cascading protection, so I'll go with that example. Objective: we want people not to be able to vandalise the Main Page by editing a page (template) that is included on the Main Page.

In practice, the Main Page is rarely edited directly. It's generally built up from smaller sub-pages, including some that vary with the current date (On this day, Today in the news, Today's Did You Know, Today's Featured Article, etc.). Let's consider the simplest case of the Main Page directly containing something like {{News / $today }}. Tomorrow's page title may or may not exist yet, and is not yet protected. It should be protected tomorrow, when it is live on the Main Page, and may be freely edited until it's locked in and published. Let's say there are no edits in the hour before midnight and no edits on the day itself, anywhere on the wiki. At midnight the parser cache of the Main Page expires and the new template goes live. Let's say we accept a few minutes of abuse exposure for the sake of argument. What's going to take care of protecting this template? No edits, so no cascading update. The inclusion started with the parser cache miss, and that's where triggerOpportunisticLinksUpdate queues a single LinksUpdate job. The job does not apply any backlinks or recursion from what I can see. I think this is essential for page protection to work correctly today; I don't see other components having similar responsibilities or otherwise taking care of this requirement in some or all cases.

However, what it does show is that feeding ParserCache without going through this triggerOpportunisticLinksUpdate check could be a problem. My earlier thought that this should be limited to article views is thus wrong. Historically we've generally not checked ParserCache much in APIs or jobs, and there was a period even relatively recently, from 2019-2022, where we removed much ParserCache involvement to keep its size and demand under tighter control and more reflective of web demand.

Idea: However, while ParserCache remains "unfed" from both RefreshLinks and Search jobs, what has changed is that ParserOutputAccess performs the "post cache-miss check" (triggerOpportunisticLinksUpdate) on the assumption that it's about to save the output, which in fact it doesn't. Search and RefreshLinks both pass ParserOutputAccess::OPT_NO_UPDATE_CACHE, so perhaps we should gate it based on that?

Historically we've generally not checked ParserCache much in APIs or jobs, and there was a period even relatively recently, from 2019-2022, where we removed much ParserCache involvement to keep its size and demand under tighter control and more reflective of web demand.

Seems like there was a disconnect then. When I was working with Petr on ParserCache in 2021, our assumption was that we want to optimize for more cache hits, not reduce cache size. This led us to make more paths use the cache.

Idea: However, while ParserCache remains "unfed" from both RefreshLinks and Search jobs, what has changed is that ParserOutputAccess performs the "post cache-miss check" (triggerOpportunisticLinksUpdate) on the assumption that it's about to save the output, which in fact it doesn't. Search and RefreshLinks both pass ParserOutputAccess::OPT_NO_UPDATE_CACHE, so perhaps we should gate it based on that?

Using a flag to control this behavior sounds like the right approach. I haven't fully grokked the different flows, so I'm unsure whether triggerOpportunisticLinksUpdate() should be called by default, or only if some OPT_DO_LINKS_UPDATES flag is passed. Gating on ParserOutputAccess::OPT_NO_UPDATE_CACHE would probably be an easy quick fix, but I'd prefer a separate flag, OPT_DO_LINKS_UPDATES or OPT_NO_LINKS_UPDATES.
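A minimal sketch of the gating idea (hypothetical; the OPT_NO_LINKS_UPDATE name and the exact call site are assumptions here, the real implementation is in the Gerrit change below):

// Inside the ParserOutput access path, after rendering on a cache miss:
if ( !( $options & ParserOutputAccess::OPT_NO_LINKS_UPDATE ) ) {
    // Only "real" page views speculatively queue a prioritized RefreshLinksJob;
    // RefreshLinks and search-indexing jobs pass the flag and skip this,
    // breaking the job -> job loop described above.
    $wikiPage->triggerOpportunisticLinksUpdate( $parserOutput );
}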

Change 919289 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/core@master] ParserCache: add an option to explicitly disable links update

https://gerrit.wikimedia.org/r/919289

@Krinkle thanks for the detailed analysis of the issue!
@daniel we believe this is causing quite a bit of noise in CirrusSearch update jobs, and since we're working on improving how updates are performed, I went ahead and tried to implement the fix you suggested, opting for a conservative approach with a new OPT_NO_LINKS_UPDATE flag (please feel free to abandon the patch if you strongly disagree with it).

On a related topic, I wonder whether MW & ChangeProp should propagate another field, similar to request_id, that increments a kind of TTL counter; it could help with monitoring and/or serve as a safeguard to kill such runaway job chains.
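A hypothetical sketch of that safeguard ('loopDepth' and the limit are invented names for illustration; nothing like this exists in MediaWiki today):

// When a job enqueues a follow-up job, carry a depth counter along with requestId.
$depth = ( $parentParams['loopDepth'] ?? 0 ) + 1;
if ( $depth > 10 ) {
    // Refuse to re-enqueue and log it, so runaway chains become visible.
    wfDebugLog( 'JobQueue',
        "Dropping follow-up job at loop depth $depth (requestId {$parentParams['requestId']})" );
    return;
}
$childParams['loopDepth'] = $depth;
$childParams['requestId'] = $parentParams['requestId'];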

Change 919289 merged by jenkins-bot:

[mediawiki/core@master] ParserCache: add an option to explicitly trigger links update

https://gerrit.wikimedia.org/r/919289

Closing; the impact on the refreshLinksPrioritized insertion rate is visible after wmf.10 hit group1 & group2 wikis:

refreshLinkPrioritizedJobRate.png (667×1 px, 111 KB)