
Large job queue increase, category updates etc. very slow
Closed, ResolvedPublic

Description

There is a huge lag on categories on commons.
See https://commons.wikimedia.org/wiki/Commons:Village_pump#Cat-a-lot_not_working_well for details.

Event Timeline

Steinsplitter raised the priority of this task from to Unbreak Now!.
Steinsplitter updated the task description. (Show Details)
Steinsplitter added a project: acl*sre-team.
Steinsplitter added a subscriber: Steinsplitter.
Restricted Application added subscribers: Matanya, Aklapper. Oct 20 2015, 12:42 PM
Steinsplitter set Security to None.
Aklapper lowered the priority of this task from Unbreak Now! to High. Oct 20 2015, 1:27 PM
Aklapper removed a project: acl*sre-team.

Quoting Nemo from IRC:

maybe the jobqueue is more backlogged than usual; about 1 million now https://commons.wikimedia.org/w/api.php?action=query&meta=siteinfo&siprop=statistics

Worth investigating; setting high priority.

Jarekt added a subscriber: Jarekt. Oct 20 2015, 1:49 PM

I just removed some images from a category (by editing the file page). When I am on the image page the category does not show up, but when I look at the category the images are still there. No amount of purging and reloading the category helps. Please see for yourself: New_uploads_without_a_license should be empty but shows 10 images at the time of this message.

https://commons.wikimedia.org/wiki/Category:New_uploads_without_a_license

Is this kind of problem something that could be made visible in http://status.wikimedia.org? Everything currently tracked there shows up as green (service operating normally).

https://wikiapiary.com/wiki/Wikimedia_Commons shows a pretty large job queue spike.

legoktm@terbium:~$ mwscript showJobs.php --wiki=commonswiki --group
refreshLinks: 50202 queued; 87963 claimed (12679 active, 75284 abandoned); 0 delayed
deleteLinks: 0 queued; 9334 claimed (0 active, 9334 abandoned); 0 delayed
htmlCacheUpdate: 4 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
recentChangesUpdate: 1 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
enqueue: 1268565 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
cirrusSearchIncomingLinkCount: 0 queued; 0 claimed (0 active, 0 abandoned); 346 delayed
cirrusSearchLinksUpdatePrioritized: 0 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
gwtoolsetUploadMediafileJob: 0 queued; 194 claimed (5 active, 189 abandoned); 0 delayed
gwtoolsetUploadMetadataJob: 0 queued; 0 claimed (0 active, 0 abandoned); 3 delayed
RestbaseUpdateJobOnDependencyChange: 3793 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
webVideoTranscode: 3 queued; 115 claimed (58 active, 57 abandoned); 0 delayed

tailing runJobs.log on fluorine shows commonswiki jobs are being executed.

JobQueueRedis.log has a bunch of "Could not acknowledge refreshLinksPrioritized" type entries.
redis.log has some less frequent connection error messages.

jynus also said there has been an increase in mysql load since 12:23 PM UTC.

And according to wikiapiary, the enwp job queue is also increasing (didn't check any other ones).

Krenair renamed this task from Category lag on commons to Large job queue increase, category updates etc. very slow. Oct 20 2015, 6:29 PM
hoo added a subscriber: hoo. Oct 20 2015, 6:30 PM
ori raised the priority of this task from High to Unbreak Now!. Oct 20 2015, 6:38 PM
ori updated the task description. (Show Details)
ori assigned this task to aaron. Edited Oct 21 2015, 7:47 AM
ori added a subscriber: ori.

Fixed by Aaron with:

There is still a large backlog of jobs, but it is shrinking.
Keeping this open until the queue is back to a manageable size.

ori lowered the priority of this task from Unbreak Now! to High. Oct 21 2015, 7:49 AM

What are the actionables here- why did this happen and how can we avoid/detect this in the future? Was this created after a very specific code deploy or due to external reasons? Should we focus on better queue monitoring? Let's create tasks for pending work.

Fae added a subscriber: Fae. Oct 21 2015, 12:42 PM

Per jcrespo, this failure is worth a more in-depth analysis than the normal operational bug fix. It has had widespread effects, putting off contributors from categorizing images and making it impossible for some long-running maintenance tasks to continue.

If more budget would help ensure this type of operational failure can be avoided in the future, then an understandable report will help push the case to prioritize money in this area.

Ankry added a subscriber: Ankry. Oct 21 2015, 1:13 PM
ori added a comment. Oct 21 2015, 3:10 PM

@Fae, @jcrespo -- agreed. I'll write up my thoughts on this later today.

Kersti added a subscriber: Kersti. Oct 21 2015, 3:10 PM
matmarex added a comment. Edited Oct 21 2015, 4:49 PM

There is still a large backlog of jobs, but it is shrinking.
Keeping this open until the queue is back to a manageable size.

The number of jobs stopped growing, but isn't exactly shrinking either.

(I'm looking at a graph here: https://wikiapiary.com/wiki/Wikimedia_Commons choose "Jobs" from the dropdown under the graphs. Do we have our own graphs for this somewhere?)

Kipod added a subscriber: Kipod. Oct 21 2015, 5:29 PM

There is still a large backlog of jobs, but it is shrinking.
Keeping this open until the queue is back to a manageable size.

The number of jobs stopped growing, but isn't exactly shrinking either.
(I'm looking at a graph here: https://wikiapiary.com/wiki/Wikimedia_Commons choose "Jobs" from the dropdown under the graphs. Do we have our own graphs for this somewhere?)

If I understand correctly, the length of the job queue can be obtained via the API:
{ action: "query", meta: "siteinfo", siprop: "statistics" }
which, as of "now", shows this:

"query": {
    "statistics": {
        "pages": 40360963,
        "articles": 121114,
        "edits": 200884207,
        "images": 28731743,
        "users": 5257152,
        "activeusers": 31654,
        "admins": 249,
        "jobs": 2062146,
        "queued-massmessages": 0
    }
}

(and indeed, in repeated calls, "jobs" fluctuates around 2M, not showing significant shrinkage)
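For reference, the query above can be scripted. A minimal sketch: the API endpoint and parameters are the real ones quoted in this thread, but the helper names are mine.

```python
from urllib.parse import urlencode

API = "https://commons.wikimedia.org/w/api.php"

def siteinfo_url(api=API):
    """Build the siteinfo query URL; the response includes a 'jobs' count."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "statistics",
        "format": "json",
    }
    return api + "?" + urlencode(params)

def job_count(response_json):
    """Extract the job queue length from a parsed siteinfo response."""
    return response_json["query"]["statistics"]["jobs"]

# A response shaped like the one quoted above:
sample = {"query": {"statistics": {"jobs": 2062146, "pages": 40360963}}}
print(job_count(sample))  # 2062146
```

Polling this URL periodically and plotting "jobs" reproduces the wikiapiary graph.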

I would like to suggest that, in addition to the job count, the statistics report the "fall-through" time of the queue.

Maybe this fall-through time is difficult or "impossible" to obtain, or maybe it's actually easy; I do not know.

IMO, absolute precision is not important. Assuming each item on the queue has a creation timestamp, I think it's enough to report the age of the last item removed from the queue at the time of removal.

Maybe the queue is not 100% time-ordered, so this information may not represent the "fall-through time" exactly, but I think it's as useful as, or even more useful than, the "queue length" datum.
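A sketch of what that metric could look like; all names here are hypothetical, and MediaWiki's job queue does not expose this statistic today.

```python
import collections
import time

class InstrumentedQueue:
    """Toy FIFO queue that records the age of the last item popped."""

    def __init__(self):
        self._q = collections.deque()
        self.last_fall_through = None  # age in seconds of last popped job

    def push(self, job, now=None):
        # Store the enqueue timestamp alongside the job.
        self._q.append((now if now is not None else time.time(), job))

    def pop(self, now=None):
        enqueued_at, job = self._q.popleft()
        # "Fall-through" time: how long the job sat in the queue.
        self.last_fall_through = (now if now is not None else time.time()) - enqueued_at
        return job

q = InstrumentedQueue()
q.push("refreshLinks:Foo", now=100.0)
q.push("refreshLinks:Bar", now=105.0)
q.pop(now=160.0)
print(q.last_fall_through)  # 60.0 seconds in the queue
```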

peace.

greg added a subscriber: greg.
Base added a subscriber: Base. Oct 21 2015, 10:06 PM
Bawolff added a subscriber: Bawolff. Edited Oct 22 2015, 4:47 AM

So just to make sure I got this right. This is what happened?

  • (Deployed to commons oct 14): 1846e2dc15e957c557efad7 - Makes default deferred job mode to be enqueue instead of immediately after end of request. This means that when you edit a page, the category changes (is supposed to) happen via the refreshLinksPrioritized queue.
  • (Oct 20 03:30) An edit is made to an extremely high use template https://commons.wikimedia.org/w/index.php?title=Template:Information&curid=576289&diff=176222206&oldid=153048707
  • Based on graphs, fairly immediately after that edit, job queue on commons increases rapidly.
  • Normally, this would have triggered a large number of recursive jobs. But recursive refreshLinks jobs are broken. So what happened is:
    • htmlCacheUpdate updates page_touched
    • As people view these pages, triggerOpportunisticLinksUpdate() is hit, as all these pages have {{CURRENTYEAR}} on them, since {{information}} uses it to determine if a file is > 200 years old.
    • RefreshLinksJob gets queued up as people view these pages.
  • (oct 20 09:00) Users start reporting that categories they add directly to pages don't work
  • 21:11 logmsgbot: ori@tin Synchronized php-1.27.0-wmf.2/includes/page/WikiPage.php: I5d0440588d (7a4348640e09bea): Make triggerOpportunisticLinksUpdate() directly use RefreshLinks (T116001)
    • If we happen to parse a page just randomly that we need to refresh links on, and it happens to be cascade restricted, this causes us to move it to the high-priority queue, since we don't need to reparse and can do the job quickly. (The prioritize parameter wasn't recognized previously because it was using the JobSpecification class, not the RefreshLinksJob class.) It also removes one layer of indirection. I'm not sure how much this would help, as the enqueued job is missing the rootJobTimestamp, so it won't detect that the opportunistic updates already did the work. And most things aren't cascade protected, so this seems kind of irrelevant. But I quite likely could be missing something here. [Update: Removing one layer of indirection (which applies to both the cascade-protected and non-cascade-protected opportunistic paths) may have made a difference, as most of the jobs were from this code path]
  • Growth of job queue starts slowing down about here. At this point at 2.1 million jobs
  • 23:25 logmsgbot: aaron@tin Synchronized php-1.27.0-wmf.3/includes/deferred: 2a1e1d7dd88a62aba9
    • I find the commit message on this patch very confusing. As far as I can tell, the point is to set rootJobTimestamp, so that in many cases the job can know it can safely use an old version from the parser cache instead of having to re-parse the page. This should make those refreshLinks jobs run much quicker and allow duplicate jobs to be removed.
    • Based on the commit message, I think the intention was to also change the job name to refreshLinksPrioritized so it would be in the right queue, but that is not part of the patch (?) [UPDATE: Looks like that part was accidentally lost between PS3 and PS4. See 3339dec7c227]
  • oct 21 02:00 Job queue hits peak of 2.2 million jobs. Starts decreasing very slowly, but mostly holding steady at about 2 million jobs
  • oct 21 20:16 AaronSchulz: Started running 8 threads of commonswiki refreshlinks jobs on terbium
  • Job queue length starts decreasing rapidly at this point (at rate of about 100,000 jobs an hour)
  • oct 22 04:00 - Job queue continues to decrease rapidly. Currently about 1 million entries at commons
  • [Update since original post] oct 22 05:10 logmsgbot: aaron@tin Synchronized php-1.27.0-wmf.3/includes/deferred/LinksUpdate.php: fe323f9b68bbb - Direct category additions now go in correct queue
  • Commons users report that direct category additions now happen immediately. People are happier. Queue length ~800,000
  • oct 22 21:05 logmsgbot: aaron@tin Synchronized php-1.27.0-wmf.3/includes: a6262272c9666d (duration: 00m 23s) - Fix lack of recursiveness in refreshLinksJobs

In summary:

  • direct category additions were changed to use jobqueue
  • Template with like 20 million backlinks was edited
  • Everything after that point got stuck behind the massive job, possibly due to a bug where the deferred update was added to the refreshLinks queue instead of refreshLinksPrioritized
  • The missing-rootJobTimestamp bug prevented some de-duplication from occurring and prevented using the parser cache when possible, which exacerbated the problem

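To illustrate the de-duplication point: a root-job timestamp lets the runner skip a job whose root work has already been completed by an equivalent run. The sketch below is loosely modelled on that idea; the names and structure are illustrative, not the actual MediaWiki JobQueue code.

```python
# Map: root job signature -> timestamp of the newest completed run.
last_completed = {}

def should_run(job):
    """Skip a job if a run for the same root job, at least as new, already completed."""
    sig = job.get("rootJobSignature")
    ts = job.get("rootJobTimestamp")
    if sig is None or ts is None:
        # No dedup info: must run every copy (the failure mode in this task).
        return True
    done = last_completed.get(sig)
    return done is None or ts > done

def mark_done(job):
    sig, ts = job.get("rootJobSignature"), job.get("rootJobTimestamp")
    if sig and ts:
        last_completed[sig] = max(last_completed.get(sig, ""), ts)

a = {"rootJobSignature": "Template:Information", "rootJobTimestamp": "20151020033000"}
b = {"rootJobSignature": "Template:Information", "rootJobTimestamp": "20151020033000"}
mark_done(a)
print(should_run(b))   # False: duplicate of an already-completed root job
print(should_run({}))  # True: a job missing the timestamp can never be skipped
```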
RP88 added a subscriber: RP88. Oct 22 2015, 5:08 AM
aaron added a comment. Oct 22 2015, 5:11 AM

Also note:
When the prioritized jobs just use the parameter and not the full refreshLinksPrioritized job type, the ack() method failed because it looked in the Prioritized queue but the job was in the regular queue. This triggered another 2 retries, which wasted more time.
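Reconstructing that mismatch as a toy model (the queue and field names here are illustrative, not the real JobQueueRedis code): enqueue routed on the job's declared type, while ack() routed on the priority parameter, so the ack searched the wrong queue and the job was retried.

```python
queues = {"refreshLinks": set(), "refreshLinksPrioritized": set()}

def enqueue(job):
    # Buggy routing: a 'prioritize' parameter does not change the job type,
    # so the job lands in the regular queue.
    queues[job["type"]].add(job["id"])

def ack(job):
    # ack() routes on the priority flag, so it searches the prioritized queue...
    name = "refreshLinksPrioritized" if job.get("prioritize") else job["type"]
    try:
        queues[name].remove(job["id"])
        return True
    except KeyError:
        # ...and fails: the job is still sitting in the regular queue,
        # so it will be retried, wasting more time.
        return False

job = {"id": 1, "type": "refreshLinks", "prioritize": True}
enqueue(job)
print(ack(job))  # False: acknowledged against the wrong queue
```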

Thanks for spelling all that out. I had no idea the job queues on Commons could run up that high, that fast. Fascinating stuff and thanks for fixing the problem.

Jane023 removed a subscriber: Jane023. Oct 22 2015, 11:44 AM

Bawolff, thank you for the clear explanation of the events. It seems my poorly timed edit to the Information template started it all. Looking at https://commons.wikimedia.org/wiki/Special:MostTranscludedPages it looks like the Information template is actually the #13 most transcluded page, and we have 32 templates/modules with more than 5 million transclusions and 113 with over 1 million. As the number of files on Commons seems to follow a quadratic curve (see https://commons.wikimedia.org/wiki/Commons:Milestones), we expect a lot more files in the years to come, and templates/modules with a lot more transclusions. If the trend continues, the number of files will double in 3 years. Many of the most transcluded Commons pages deal with translations (see #7, #12 or #26) and are expected to change each time someone wants to alter text in one of the hundreds of languages. Are these types of slow-downs expected to become more common, then?

Hmm. So at least locally, recursive jobs for refreshLinks are not being triggered (see https://gerrit.wikimedia.org/r/#/c/248119/ for fix). But if recursive jobs are not being triggered in production, then I'm rather confused as to what caused the job queue to explode on commons if not the recursive refreshLinks jobs from editing {{information}}.

aaron added a comment. Oct 22 2015, 8:06 PM

Hmm. So at least locally, recursive jobs for refreshLinks are not being triggered (see https://gerrit.wikimedia.org/r/#/c/248119/ for fix). But if recursive jobs are not being triggered in production, then I'm rather confused as to what caused the job queue to explode on commons if not the recursive refreshLinks jobs from editing {{information}}.

Many of the jobs I saw getting run were from triggerOpportunisticLinksUpdate().

Many of the jobs I saw getting run were from triggerOpportunisticLinksUpdate().

Ah. {{information}} contains {{Years since}}, which calls {{CURRENTYEAR}}. So almost all pages on Commons are considered to have dynamic content. After the edit to {{information}}, htmlCacheUpdate updates page_touched. Then triggerOpportunisticLinksUpdate() slowly starts updating all pages. Thus the LinksUpdate essentially ended up being recursive, despite the bug preventing it from actually being recursive.
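A rough sketch of that trigger path. The field names mirror the page table, but the condition is my guess at the shape of triggerOpportunisticLinksUpdate(), not the real code.

```python
def needs_opportunistic_update(page):
    """A page parsed on view gets a links refresh queued if its output is
    dynamic (e.g. uses {{CURRENTYEAR}}) and it was touched after its links
    were last updated."""
    return (page["has_dynamic_content"]
            and page["page_touched"] > page["page_links_updated"])

queued = []

def on_view(page):
    if needs_opportunistic_update(page):
        queued.append(("refreshLinks", page["page_title"]))

# After the {{information}} edit, htmlCacheUpdate bumped page_touched on
# millions of file pages, so nearly every page view queued a job:
page = {
    "page_title": "File:Example.jpg",       # hypothetical page
    "has_dynamic_content": True,            # {{information}} uses {{CURRENTYEAR}}
    "page_touched": "20151020033000",
    "page_links_updated": "20151014120000",
}
on_view(page)
print(queued)  # [('refreshLinks', 'File:Example.jpg')]
```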

An additional thing I noticed:

For a direct category add (adding [[Category:Foo]] to a page), prior to this change we would use the ParserOutput just generated by the edit. However, now, due to the clock-skew allowance from self::CLOCK_FUDGE, we are probably going to be reparsing the page (unless it takes more than 10 seconds to get to the job).

This reparsing is probably slowing these jobs down. If the job is from a direct edit action, a race condition seems less likely to be a concern (Not 100% sure on that. If it is a concern, maybe we could put a random number into the parser cache, and then match against that if the timestamp is within the CLOCK_FUDGE radius)
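A sketch of the staleness check being described. CLOCK_FUDGE is a real constant in the RefreshLinksJob code; the exact condition below is an assumption about its shape, not the actual MediaWiki logic.

```python
CLOCK_FUDGE = 10  # seconds; mirrors self::CLOCK_FUDGE

def can_reuse_cached_output(cache_time, revision_time, fudge=CLOCK_FUDGE):
    """Reuse the cached ParserOutput only if it is clearly newer than the
    revision, allowing `fudge` seconds of clock skew between servers."""
    return cache_time >= revision_time + fudge

# A job running 2 seconds after the edit cannot reuse the ParserOutput that
# the edit itself just generated, so the page gets reparsed:
print(can_reuse_cached_output(cache_time=1002, revision_time=1000))  # False
# Only once the cache entry is outside the fudge window is it trusted:
print(can_reuse_cached_output(cache_time=1015, revision_time=1000))  # True
```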

Jonesey95 added a subscriber: Jonesey95. Edited Oct 29 2015, 12:02 AM

I am seeing similar behavior on en.WP. This change is not limited to Commons.

[added a couple hours later:] I have noticed this lag recently, just in the past couple of weeks, as I regularly clean out Category:Pages with citations using unsupported parameters and similar tracking categories. About one out of every twenty articles will remain in the category after I make a change that should remove it from the category. When I visit the article, the tracking category is not listed at the bottom, but until I do a null edit, the article stays on the category page. This is something new that has never happened to me in tens of thousands of similar edits over the past couple of years.

At this writing, you can see the articles "Fred McBrien" and "Padikkadavan (2009 film)" in the "unsupported parameters" category even though the unsupported parameters have been fixed.

I am seeing similar behavior on en.WP.

Please elaborate what you are seeing.

ori raised the priority of this task from High to Unbreak Now!. Oct 29 2015, 1:46 AM

I am seeing similar behavior on en.WP. This change is not limited to Commons.
[added a couple hours later:] I have noticed this lag recently, just in the past couple of weeks, as I regularly clean out Category:Pages with citations using unsupported parameters and similar tracking categories. About one out of every twenty articles will remain in the category after I make a change that should remove it from the category. When I visit the article, the tracking category is not listed at the bottom, but until I do a null edit, the article stays on the category page. This is something new that has never happened to me in tens of thousands of similar edits over the past couple of years.
At this writing, you can see the articles "Fred McBrien" and "Padikkadavan (2009 film)" in the "unsupported parameters" category even though the unsupported parameters have been fixed.

I think there might be a race condition between the refreshLinks job and fetching from external storage.

So, for the Fred McBrien article, the edit by @Jonesey95 (presumably [[User:Abebenjoe]]) that got rid of the category was:

rev_comment                           rev_timestamp   rev_user_text
/* top */ Fixing style/layout errors  20151029015235  Abebenjoe

And the categorylinks / page entries look like:

*************************** 3. row ***************************
           cl_from: 13526177
             cl_to: Pages_with_citations_using_unsupported_parameters
        cl_sortkey: FRED MCBRIEN
      cl_timestamp: 2015-10-29 01:50:42
 cl_sortkey_prefix: 
      cl_collation: uppercase
           cl_type: page
           page_id: 13526177
    page_namespace: 0
        page_title: Fred_McBrien
 page_restrictions: 
      page_counter: 0
  page_is_redirect: 0
       page_is_new: 0
       page_random: 0.940205091976
      page_touched: 20151029015235
page_links_updated: 20151029015236
       page_latest: 688010736
          page_len: 5959
page_content_model: wikitext

So the edit happened Oct 29 01:52:35. As of right now, page_touched is also 01:52:35, and page_links_updated is 1 second later at 01:52:36. So the refreshLinks job was run successfully. However, the page is clearly still in the categorylinks table, despite the fact that the parser output from 01:52:35 does not have it in that category.

This is close enough to the self::CLOCK_FUDGE 10-second window that the cached parser output was probably not used. Thus the page text was fetched via $revision->getContent( Revision::RAW ). However, external storage can also have slave lag, so maybe $revision->getContent() didn't actually get the right version of the page.

There's code to wait for slaves ( wfGetLB()->waitFor( $this->params['masterPos'] ); ), but that's only for the main DB. Maybe we should do something similar for the ExternalStore cluster.
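A sketch of that idea: the job carries the master's replication position from enqueue time, and the runner blocks until the replica has caught up before reading. Hypothetical names; the real MediaWiki call is wfGetLB()->waitFor( $masterPos ).

```python
import time

class Replica:
    """Toy stand-in for a replicated database connection."""

    def __init__(self):
        self.position = 0  # replication position this replica has applied

    def wait_for(self, master_pos, timeout=10.0, poll=0.01):
        """Block until the replica reaches master_pos, or give up after timeout."""
        deadline = time.monotonic() + timeout
        while self.position < master_pos:
            if time.monotonic() > deadline:
                return False  # gave up: replica still lagging
            time.sleep(poll)
        return True

replica = Replica()
replica.position = 42
# The replica is already past the position recorded at enqueue time,
# so the job can safely read the revision it was enqueued for:
print(replica.wait_for(41))  # True
```

Note that aaron's follow-up below points out this isn't needed for ExternalStore itself, since its blobs are immutable; the race, if any, is on the main DB position.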

aaron added a comment. Oct 29 2015, 7:16 AM

ExternalStore uses immutable blobs, so either the right object is found or nothing is found at all.

aaron added a comment. Oct 29 2015, 7:22 AM

A large increase in run rate and a sharp overall queue size decrease started after https://gerrit.wikimedia.org/r/#/c/249687/ was deployed. I'm waiting to see how low the queue size gets.

aaron closed this task as Resolved. Oct 29 2015, 7:51 AM

From https://grafana.wikimedia.org/dashboard/db/job-queue-health?panelId=6&fullscreen we went from 3.25M to 72K jobs. Seems to have leveled out.

ExternalStore uses immutable blobs, so either the right object is found or nothing is found at all.

Oh, right.

Given that masterPos wouldn't be set from LinksUpdate::getAsJobSpecification(), I guess a race condition from normal slave lag could account for the behaviour that @Jonesey95 saw.

I had a look at the job queue at https://wikiapiary.com/wiki/Wikipedia_%28en%29
There is a recurring pattern: the job queue length goes up to 3-4M approximately every 7 days, then slowly comes down to some 100k, only to go up again almost immediately. Strange. Unless there is a once-a-week power editor or background job pushing the system above its limits, and if the pattern continues to occur, this is worth further examination.

Denniss removed a subscriber: Denniss. Apr 24 2016, 10:59 AM