jobqueue is full of refreshlinks duplicates after the switchover.
Closed, Duplicate (Public)

Description

This is an exact duplicate of what we saw in T129517: recursive refreshLinks jobs keep feeding the queue with more and more jobs, and the jobrunners are processing them at unprecedented rates.

While it isn't very refreshing (pun intended) to see that ONE YEAR LATER no one bothered to look into the original bug, I think we can assume re-doing what @ori did last year is safe, and I will just do it.

Event Timeline

Joe triaged this task as Unbreak Now! priority. Apr 20 2017, 8:38 AM

Mentioned in SAL (#wikimedia-operations) [2017-04-20T08:47:38Z] <_joe_> live-patching ./includes/jobqueue/jobs/RefreshLinksJob.php to drop all recursive jobs, T163418

Change 349177 had a related patch set uploaded (by Giuseppe Lavagetto):
[mediawiki/core@wmf/1.29.0-wmf.19] Temporarily skip recursive refreshLinks jobs

https://gerrit.wikimedia.org/r/349177

Mentioned in SAL (#wikimedia-operations) [2017-04-20T09:14:30Z] <_joe_> scap pull of live hack for T163418 on mw2154

I'm testing this live hack:

diff --git a/includes/jobqueue/jobs/RefreshLinksJob.php b/includes/jobqueue/jobs/RefreshLinksJob.php
index f9284a5..a478301 100644
--- a/includes/jobqueue/jobs/RefreshLinksJob.php
+++ b/includes/jobqueue/jobs/RefreshLinksJob.php
@@ -82,7 +82,7 @@ class RefreshLinksJob extends Job {
                global $wgUpdateRowsPerJob;
 
                // Job to update all (or a range of) backlink pages for a page
-               if ( !empty( $this->params['recursive'] ) ) {
+               if ( false && !empty( $this->params['recursive'] ) ) {
                        // When the base job branches, wait for the replica DBs to catch up to the master.
                        // From then on, we know that any template changes at the time the base job was
                        // enqueued will be reflected in backlink page parses when the leaf jobs run.
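
For readers skimming the diff: the only change is the `false &&` prefix, which short-circuits the condition so the recursive (fan-out) branch can never be entered, whatever the job's parameters say. Below is a minimal Python sketch of that guard. The function name `choose_branch`, the `hack_enabled` flag and the returned labels are hypothetical, and what the real job does after skipping the branch is not shown in the diff, so it is not modelled here.

```python
def choose_branch(params, hack_enabled):
    """Return which branch a simplified refreshLinks-style job would take.

    Hypothetical model of the guard in the diff above, not MediaWiki code.
    """
    # Rough analogue of PHP's !empty( $this->params['recursive'] ):
    # the key must be present and truthy.
    is_recursive = bool(params.get("recursive"))

    # With the hack, the condition is `false && ...`, so the left operand
    # short-circuits and the recursive branch is unreachable.
    if (not hack_enabled) and is_recursive:
        return "fan-out"
    return "skip-fan-out"


if __name__ == "__main__":
    job = {"recursive": True}
    print(choose_branch(job, hack_enabled=False))
    print(choose_branch(job, hack_enabled=True))
```

The point of doing it this way is that the hack needs no queue surgery: recursive jobs already in the queue still get picked up and acknowledged by the runners, they just stop spawning further jobs.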

Mentioned in SAL (#wikimedia-operations) [2017-04-20T09:38:18Z] <_joe_> live-hack redeployed, running scap pull on codfw jobrunners T163418

FTR, the queue is dropping fast, as is the number of processed jobs. I'll de-deploy my hack as soon as I'm confident I killed all the rogue refreshlinks jobs.

Mentioned in SAL (#wikimedia-operations) [2017-04-20T11:32:34Z] <_joe_> removing hack for jobqueue's refreshlinks T163418 from the jobrunners

The queue is down to 250K jobs, and I am confident all the old refreshlinks jobs have been removed. I'm leaving the ticket open at lower priority as I still need to take a look at this.

Joe lowered the priority of this task from Unbreak Now! to High. Apr 20 2017, 12:04 PM

Change 349177 abandoned by Giuseppe Lavagetto:
Temporarily skip recursive refreshLinks jobs

Reason:
this was just to show the live hack I employed.

https://gerrit.wikimedia.org/r/349177

Did you see repeat executions in this case, beyond the initial root-to-leaf job expansion?

@GWicke I have seen the same job being re-executed multiple times (after succeeding) when I ran runJobs.php from the command-line to remove some pressure; I'm sure more cases can be found in the logs.

There seems to be an increase of 1 million items in the last 12 hours or, more importantly, twice as many jobs are being enqueued as are being run. Not sure if it is related to one of the many outstanding issues, but definitely worth mentioning: https://grafana.wikimedia.org/dashboard/db/job-queue-health?refresh=1m&orgId=1&from=1492797815997&to=1493402615997&panelId=12&fullscreen&var-jobType=all