
Some jobs are not being processed / are processed slowly
Closed, ResolvedPublic

Description

Global renames are being really, really slow. See https://logstash.wikimedia.org/goto/576d8bc431184d815976416b3bfc2a3a for an example: there are long delays between renames, while the rename itself seems to be rather quick.

Also see https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1&fullscreen&panelId=15&from=now-7d&to=now - our processing backlog seems to be much higher than it was a few days ago.

Original description

https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/User1692 is stuck even after 3 hours.

Related Objects

Event Timeline


Are failing uploads on Commons the same issue?

Ymblanter: You may want to check T240698 instead.

There is also a problem with pageviews data. For example, there is a two-day delay in hewiki. There was a 1.5-day delay in ruwiki, and a one-day delay in enwiki.

Interactive history shows irrelevant timestamps.

@Pchelolo is unfortunately off today and Tuesday of this week. We'll dig into this now and try to find a resolution.

Looking at the graph "Rate of committed offset increment" from https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1 it seems that only "low_traffic_jobs" are affected:

rate_of_committed_offset_inc.png (screenshot)

dropping from ~40 to ~3.
With one topic (fetchGoogleCloudVisionAnnotations) constantly failing out of the many that should run properly (all the ones consumed by low_traffic_jobs), if ChangeProp does not handle such a scenario properly, I suppose it could lead to this behavior.

One way to verify this could be to either disable fetchGoogleCloudVisionAnnotations completely or assign it to a dedicated consumer.
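
For reference, a quick way to check whether a single topic is holding the group back is to describe the consumer group's per-partition offsets and lag with Kafka's standard tooling. This is a minimal sketch only; the broker and group names below are placeholders, not the actual production values:

    # List committed offset, log-end offset and lag per topic/partition for the group.
    # A partition whose lag keeps growing while its committed offset stays flat
    # points at the stuck consumer.
    kafka-consumer-groups.sh \
        --bootstrap-server <kafka-broker>:9092 \
        --describe \
        --group <cpjobqueue-low-traffic-consumer-group>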

I sometimes see an error like the following:

[2019-12-16T17:14:19.865Z] ERROR: cpjobqueue/8023 on scb1001: Producer error (levelPath=error/producer, err.origin=local)
    Error: Local: All broker connections are down
        at Error (native)

It seems to be nothing new, but it's much more common on scb1001 since the restart.

Change 558141 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[operations/mediawiki-config@master] MachineVision: Disable enqueuing new labeling request jobs

https://gerrit.wikimedia.org/r/558141

Change 558141 merged by Mholloway:
[operations/mediawiki-config@master] MachineVision: Disable enqueuing new labeling request jobs

https://gerrit.wikimedia.org/r/558141

Mentioned in SAL (#wikimedia-operations) [2019-12-16T17:56:03Z] <mholloway-shell@deploy1001> Synchronized php-1.35.0-wmf.10/extensions/MachineVision: Fix: Ignore duplicate entry errors on insertLabels (T240518) (duration: 00m 57s)

Mentioned in SAL (#wikimedia-operations) [2019-12-16T18:03:09Z] <mobrovac@deploy1001> Started restart [cpjobqueue/deploy@deafe56]: Rolling restart of CP4JQ -- T240518

Change 558149 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/extensions/MachineVision@master] Catch duplicate entry errors and log a warning

https://gerrit.wikimedia.org/r/558149

Change 558149 merged by Mholloway:
[mediawiki/extensions/MachineVision@master] Catch duplicate entry errors and log a warning

https://gerrit.wikimedia.org/r/558149

Change 558153 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/extensions/MachineVision@wmf/1.35.0-wmf.10] Catch duplicate entry errors and log a warning

https://gerrit.wikimedia.org/r/558153

Change 558153 merged by Mholloway:
[mediawiki/extensions/MachineVision@wmf/1.35.0-wmf.10] Catch duplicate entry errors and log a warning

https://gerrit.wikimedia.org/r/558153

Mentioned in SAL (#wikimedia-operations) [2019-12-16T18:20:10Z] <mholloway-shell@deploy1001> Synchronized php-1.35.0-wmf.10/extensions/MachineVision: (T240518) (duration: 00m 57s)

Change 558157 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/extensions/MachineVision@master] Catch DB duplicate key errors, cont.

https://gerrit.wikimedia.org/r/558157

Change 558157 merged by Mholloway:
[mediawiki/extensions/MachineVision@master] Catch DB duplicate key errors, cont.

https://gerrit.wikimedia.org/r/558157

Change 558159 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/extensions/MachineVision@wmf/1.35.0-wmf.10] Catch DB duplicate key errors, cont.

https://gerrit.wikimedia.org/r/558159

Change 558159 merged by Mholloway:
[mediawiki/extensions/MachineVision@wmf/1.35.0-wmf.10] Catch DB duplicate key errors, cont.

https://gerrit.wikimedia.org/r/558159

Mentioned in SAL (#wikimedia-operations) [2019-12-16T18:33:07Z] <mholloway-shell@deploy1001> Synchronized php-1.35.0-wmf.10/extensions/MachineVision: Catch DB duplicate key errors, cont. (T240518) (duration: 00m 55s)

Change 558171 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/change-propagation/jobqueue-deploy@master] TEMP: Exclude fetchGoogleCloudVisionAnnotations topic from low_traffic

https://gerrit.wikimedia.org/r/558171

Change 558171 merged by Mobrovac:
[mediawiki/services/change-propagation/jobqueue-deploy@master] TEMP: Exclude fetchGoogleCloudVisionAnnotations topic from low_traffic

https://gerrit.wikimedia.org/r/558171

Mentioned in SAL (#wikimedia-operations) [2019-12-16T19:17:28Z] <mobrovac@deploy1001> Started deploy [cpjobqueue/deploy@0047875]: Do not consume the fetchGoogleCloudVisionAnnotations topic -- T240518

Mentioned in SAL (#wikimedia-operations) [2019-12-16T19:18:25Z] <mobrovac@deploy1001> Finished deploy [cpjobqueue/deploy@0047875]: Do not consume the fetchGoogleCloudVisionAnnotations topic -- T240518 (duration: 01m 00s)

Change 558176 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/change-propagation/jobqueue-deploy@master] TEMP: Increase concurrency of low_traffic_jobs rule

https://gerrit.wikimedia.org/r/558176

Change 558176 merged by Mobrovac:
[mediawiki/services/change-propagation/jobqueue-deploy@master] TEMP: Increase concurrency of low_traffic_jobs rule

https://gerrit.wikimedia.org/r/558176

Mentioned in SAL (#wikimedia-operations) [2019-12-16T19:31:44Z] <mobrovac@deploy1001> Started deploy [cpjobqueue/deploy@1efbc29]: Increase concurrency for low traffic jobs -- T240518

Mentioned in SAL (#wikimedia-operations) [2019-12-16T19:32:30Z] <mobrovac@deploy1001> Finished deploy [cpjobqueue/deploy@1efbc29]: Increase concurrency for low traffic jobs -- T240518 (duration: 00m 46s)

Change 558186 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/change-propagation/jobqueue-deploy@master] TEMP: Further increase the concurrency for low_traffic_jobs

https://gerrit.wikimedia.org/r/558186

Change 558186 merged by Mobrovac:
[mediawiki/services/change-propagation/jobqueue-deploy@master] TEMP: Further increase the concurrency for low_traffic_jobs

https://gerrit.wikimedia.org/r/558186

Mentioned in SAL (#wikimedia-operations) [2019-12-16T19:54:15Z] <mobrovac@deploy1001> Started deploy [cpjobqueue/deploy@9423e7e]: Increase concurrency for low traffic jobs even further -- T240518

Mentioned in SAL (#wikimedia-operations) [2019-12-16T19:55:04Z] <mobrovac@deploy1001> Finished deploy [cpjobqueue/deploy@9423e7e]: Increase concurrency for low traffic jobs even further -- T240518 (duration: 00m 49s)

mobrovac lowered the priority of this task from Unbreak Now! to High. Dec 16 2019, 8:10 PM
mobrovac subscribed.

Lowering the priority as we are out of the woods for the time being. Jobs are being processed now, but it will take a while to catch up with current events (at least 24h from now).

The problem seems to be writing the offset for the mediawiki.job.fetchGoogleCloudVisionAnnotations topic, but I don't know why. Perhaps it was wrongly configured in Kafka? Either way, its contents are currently not being consumed by CP (but that's OK because the jobs are not being emitted either). Before re-enabling CP to consume that topic, its settings should be checked and the backlog cleared (i.e. its commit offset reset to latest). Also note that the concurrency for the low_traffic_jobs CP consumer (which consumes all of the jobs that have been affected by this outage) is currently set to a very high number in order to get through the backlog. Once that is done, it should be reset back to 50.
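
As a rough illustration of the "commit offset reset to latest" step mentioned above, Kafka's standard tooling can do this once the consumer group is stopped. A minimal sketch, again with placeholder broker and group names rather than the actual production values:

    # Dry-run first to see the resulting offsets, then repeat with --execute.
    # The group must have no active consumers on these partitions.
    kafka-consumer-groups.sh \
        --bootstrap-server <kafka-broker>:9092 \
        --group <cpjobqueue-low-traffic-consumer-group> \
        --topic mediawiki.job.fetchGoogleCloudVisionAnnotations \
        --reset-offsets --to-latest --dry-run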

Looks good now: the JobQueue seems to have emptied. I just tried a couple of renames, which worked perfectly.

Thanks to everybody involved in fixing the issue.

@MBq apologies for the problems caused, and for not catching it earlier.

Joe lowered the priority of this task from High to Medium. Dec 17 2019, 5:34 AM

All jobs have recovered. Not closing the task as we need to reduce concurrency again.

(Rename task) I've tried both large and small cases of accounts to rename. It works very well and they are being processed quickly.
Can we now handle the delayed rename requests? They are estimated to number around 100+.

@Sotiale: If the question is whether rename processing can continue as usual, the answer is yes.


Aside from @Joe's followup, an incident report would probably be useful to understand what went wrong here. Based on comment T240518#5745854, maybe @Mholloway or someone on his team understands the origin of the issue better than I do and can get it started, and I will also be able to help complete it with what I know.

Yes, when @Pchelolo is back I'll work with him to figure out what exactly went wrong here and start writing up an incident report.

Thank you, Mholloway, your help here was highly appreciated!


Assuming this should be tagged as Wikimedia-Incident then.

The explanation of what happened is in T241072#5751597. In short, the Kafka-based queue is not great with jobs delayed for 48 hours, and we need an alternative solution for this job.

Now, as to why it affected other jobs and what to do: in the Kafka-based queue, the jobs are organized into groups so that we don't have a whole change-prop process working on a job type that only has one occurrence every month. Higher-traffic jobs each get their own rule, some jobs with special requirements are grouped together, but most of the long-tail jobs with low enough traffic are grouped under 'low_traffic_jobs'. Given that we have a very long tail of such jobs, we didn't want to maintain an explicit list of the jobs that go into the catch-all group, so any newly added job ends up in that group by default; if it misbehaves, like in this instance, it can block the queue for multiple other types of jobs.
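
To make the grouping concrete, here is a simplified, hypothetical YAML sketch of what such rules could look like; the field names and numbers are illustrative only and do not reflect the actual cpjobqueue configuration:

    # High-traffic job types get dedicated rules with their own concurrency.
    refreshLinks:
      topics:
        - mediawiki.job.refreshLinks
      concurrency: 200

    # Everything else falls into the catch-all rule by default, so a single
    # misbehaving topic can stall all other job types in the group.
    low_traffic_jobs:
      topics:
        - mediawiki.job.*        # long tail of low-volume job types
      exclude_topics:
        - mediawiki.job.fetchGoogleCloudVisionAnnotations
      concurrency: 50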

I guess we need to be more protective and either list all low-traffic jobs explicitly (which would be a huge burden), or have some manual step of moving jobs between 'stable' and 'unstable' low-traffic groups (I don't really know how we would do that without an explicit list), or require an additional review, or require new jobs to be placed explicitly into an 'unstable' group.

Mentioned in SAL (#wikimedia-operations) [2019-12-18T21:00:18Z] <ppchelko@deploy1001> Started deploy [cpjobqueue/deploy@7e68510]: Return low_traffic_jobs concurrency to normal after T240518

Mentioned in SAL (#wikimedia-operations) [2019-12-18T21:01:22Z] <ppchelko@deploy1001> Finished deploy [cpjobqueue/deploy@7e68510]: Return low_traffic_jobs concurrency to normal after T240518 (duration: 01m 03s)

To try to close this as resolved: there seems to be an incident report already at https://wikitech.wikimedia.org/wiki/Incident_documentation/20191211-MachineVision%2Bcpjobqueue, even if still in draft. In my understanding, the only thing left here is to review it to make sure it is accurate and/or add further ticket actionables to ensure this doesn't happen again or is mitigated. For example, T241072 seems to be a direct actionable and should be reflected in the document too; not all actionables have to be long-term, ideal projects ;-).

@jcrespo @Pchelolo Could this be related to T241294 somewhat? One job is from January 2019 but the other one is fairly recent, and showJobs.php keeps saying there are no jobs. I'm not sure how to proceed. Thanks.

Mholloway claimed this task.

Incident report is in review.