
Some jobs are not being processed / are processed slowly
Closed, ResolvedPublic

Description

Global renames are being really, really slow. See https://logstash.wikimedia.org/goto/576d8bc431184d815976416b3bfc2a3a for an example: there are long delays between renames, while the rename itself seems to be rather quick.

Also see https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1&fullscreen&panelId=15&from=now-7d&to=now - our processing backlog seems to be much higher than it was a few days ago.

Original description

https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/User1692 is stuck even after 3 hours.

Related Objects

Event Timeline


Are failing uploads on Commons the same issue?

Ymblanter: You may want to check T240698 instead.

There is also a problem with pageviews data. For example, there is a two-day delay in hewiki. There was a 1.5-day delay in ruwiki, and a one-day delay in enwiki.

Interactive history shows irrelevant timestamps.

@Pchelolo is unfortunately off today and Tuesday of this week. We'll dig into this now and try to find a resolution.

Looking at the graph "Rate of committed offset increment" from https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1 it seems that only "low_traffic_jobs" are affected:

rate_of_committed_offset_inc.png (screenshot)

dropping from ~40 to ~3.
With one topic (fetchGoogleCloudVisionAnnotations) constantly failing out of the many that should run properly (all the ones consumed by low_traffic_jobs), if ChangeProp does not handle such a scenario properly, I suppose it could lead to this behavior.

One way to verify this could be to either disable fetchGoogleCloudVisionAnnotations completely or assign it to a dedicated consumer.
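
For reference, a quick way to check whether a single topic is holding the group back is to describe the consumer group's per-partition offsets and lag with Kafka's standard tooling. This is a minimal sketch only; the broker and group names below are placeholders, not the actual production values:

    # List committed offset, log-end offset and lag per topic/partition for the group.
    # A partition whose lag keeps growing while its committed offset stays flat
    # points at the stuck consumer.
    kafka-consumer-groups.sh \
        --bootstrap-server <kafka-broker>:9092 \
        --describe \
        --group <cpjobqueue-low-traffic-consumer-group>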

I sometimes see an error like the following:

[2019-12-16T17:14:19.865Z] ERROR: cpjobqueue/8023 on scb1001: Producer error (levelPath=error/producer, err.origin=local)
    Error: Local: All broker connections are down
        at Error (native)

It seems to be nothing new, but it's much more common on scb1001 since the restart.

Change 558141 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[operations/mediawiki-config@master] MachineVision: Disable enqueuing new labeling request jobs

https://gerrit.wikimedia.org/r/558141

Change 558141 merged by Mholloway:
[operations/mediawiki-config@master] MachineVision: Disable enqueuing new labeling request jobs

https://gerrit.wikimedia.org/r/558141

Mentioned in SAL (#wikimedia-operations) [2019-12-16T17:56:03Z] <mholloway-shell@deploy1001> Synchronized php-1.35.0-wmf.10/extensions/MachineVision: Fix: Ignore duplicate entry errors on insertLabels (T240518) (duration: 00m 57s)

Mentioned in SAL (#wikimedia-operations) [2019-12-16T18:03:09Z] <mobrovac@deploy1001> Started restart [cpjobqueue/deploy@deafe56]: Rolling restart of CP4JQ -- T240518

Change 558149 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/extensions/MachineVision@master] Catch duplicate entry errors and log a warning

https://gerrit.wikimedia.org/r/558149

Change 558149 merged by Mholloway:
[mediawiki/extensions/MachineVision@master] Catch duplicate entry errors and log a warning

https://gerrit.wikimedia.org/r/558149

Change 558153 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/extensions/MachineVision@wmf/1.35.0-wmf.10] Catch duplicate entry errors and log a warning

https://gerrit.wikimedia.org/r/558153

Change 558153 merged by Mholloway:
[mediawiki/extensions/MachineVision@wmf/1.35.0-wmf.10] Catch duplicate entry errors and log a warning

https://gerrit.wikimedia.org/r/558153

Mentioned in SAL (#wikimedia-operations) [2019-12-16T18:20:10Z] <mholloway-shell@deploy1001> Synchronized php-1.35.0-wmf.10/extensions/MachineVision: (T240518) (duration: 00m 57s)

Change 558157 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/extensions/MachineVision@master] Catch DB duplicate key errors, cont.

https://gerrit.wikimedia.org/r/558157

Change 558157 merged by Mholloway:
[mediawiki/extensions/MachineVision@master] Catch DB duplicate key errors, cont.

https://gerrit.wikimedia.org/r/558157

Change 558159 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/extensions/MachineVision@wmf/1.35.0-wmf.10] Catch DB duplicate key errors, cont.

https://gerrit.wikimedia.org/r/558159

Change 558159 merged by Mholloway:
[mediawiki/extensions/MachineVision@wmf/1.35.0-wmf.10] Catch DB duplicate key errors, cont.

https://gerrit.wikimedia.org/r/558159

Mentioned in SAL (#wikimedia-operations) [2019-12-16T18:33:07Z] <mholloway-shell@deploy1001> Synchronized php-1.35.0-wmf.10/extensions/MachineVision: Catch DB duplicate key errors, cont. (T240518) (duration: 00m 55s)

Change 558171 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/change-propagation/jobqueue-deploy@master] TEMP: Exclude fetchGoogleCloudVisionAnnotations topic from low_traffic

https://gerrit.wikimedia.org/r/558171

Change 558171 merged by Mobrovac:
[mediawiki/services/change-propagation/jobqueue-deploy@master] TEMP: Exclude fetchGoogleCloudVisionAnnotations topic from low_traffic

https://gerrit.wikimedia.org/r/558171

Mentioned in SAL (#wikimedia-operations) [2019-12-16T19:17:28Z] <mobrovac@deploy1001> Started deploy [cpjobqueue/deploy@0047875]: Do not consume the fetchGoogleCloudVisionAnnotations topic -- T240518

Mentioned in SAL (#wikimedia-operations) [2019-12-16T19:18:25Z] <mobrovac@deploy1001> Finished deploy [cpjobqueue/deploy@0047875]: Do not consume the fetchGoogleCloudVisionAnnotations topic -- T240518 (duration: 01m 00s)

Change 558176 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/change-propagation/jobqueue-deploy@master] TEMP: Increase concurrency of low_traffic_jobs rule

https://gerrit.wikimedia.org/r/558176

Change 558176 merged by Mobrovac:
[mediawiki/services/change-propagation/jobqueue-deploy@master] TEMP: Increase concurrency of low_traffic_jobs rule

https://gerrit.wikimedia.org/r/558176

Mentioned in SAL (#wikimedia-operations) [2019-12-16T19:31:44Z] <mobrovac@deploy1001> Started deploy [cpjobqueue/deploy@1efbc29]: Increase concurrency for low traffic jobs -- T240518

Mentioned in SAL (#wikimedia-operations) [2019-12-16T19:32:30Z] <mobrovac@deploy1001> Finished deploy [cpjobqueue/deploy@1efbc29]: Increase concurrency for low traffic jobs -- T240518 (duration: 00m 46s)

Change 558186 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/change-propagation/jobqueue-deploy@master] TEMP: Further increase the concurrency for low_traffic_jobs

https://gerrit.wikimedia.org/r/558186

Change 558186 merged by Mobrovac:
[mediawiki/services/change-propagation/jobqueue-deploy@master] TEMP: Further increase the concurrency for low_traffic_jobs

https://gerrit.wikimedia.org/r/558186

Mentioned in SAL (#wikimedia-operations) [2019-12-16T19:54:15Z] <mobrovac@deploy1001> Started deploy [cpjobqueue/deploy@9423e7e]: Increase concurrency for low traffic jobs even further -- T240518

Mentioned in SAL (#wikimedia-operations) [2019-12-16T19:55:04Z] <mobrovac@deploy1001> Finished deploy [cpjobqueue/deploy@9423e7e]: Increase concurrency for low traffic jobs even further -- T240518 (duration: 00m 49s)

mobrovac lowered the priority of this task from Unbreak Now! to High. Dec 16 2019, 8:10 PM
mobrovac subscribed.

Lowering the priority as we are out of the woods for the time being. Jobs are being processed now, but it will take a while to catch up with current events (at least 24h from now).

The problem seems to be writing the offset for the mediawiki.job.fetchGoogleCloudVisionAnnotations topic, but I don't know why. Perhaps it was wrongly configured in Kafka? Either way, its contents are currently not being consumed by CP (but that's OK because the jobs are not being emitted either). Before re-enabling CP to consume that topic, its settings should be checked and the backlog cleared (i.e. its commit offset reset to latest). Also note that the concurrency for the low_traffic_jobs CP consumer (which consumes all of the jobs that have been affected by this outage) is currently set to a very high number in order to get through the backlog. Once that is done, it should be reset back to 50.
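
As a rough illustration of the "commit offset reset to latest" step mentioned above, Kafka's standard tooling can do this once the consumer group is stopped. A minimal sketch, again with placeholder broker and group names rather than the actual production values:

    # Dry-run first to see the resulting offsets, then repeat with --execute.
    # The group must have no active consumers on these partitions.
    kafka-consumer-groups.sh \
        --bootstrap-server <kafka-broker>:9092 \
        --group <cpjobqueue-low-traffic-consumer-group> \
        --topic mediawiki.job.fetchGoogleCloudVisionAnnotations \
        --reset-offsets --to-latest --dry-run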

Looks good now: the JobQueue seems to have emptied. I just tried a couple of renames, which worked perfectly.

Thanks to everybody involved in fixing the issue.

@MBq apologies for the problems caused, and for not catching it earlier.

Joe lowered the priority of this task from High to Medium. Dec 17 2019, 5:34 AM

All jobs have recovered. Not closing the task as we need to reduce concurrency again.

(Rename task) I've tried both large and small cases of accounts to rename. It works very well and they are being processed quickly.
Can we now handle the delayed rename requests? They are estimated to number around 100+.

@Sotiale: If the question is whether rename processing can continue as usual, the answer is yes.


Aside from @Joe's followup, an incident report would probably be useful to understand what went wrong here. Based on comment T240518#5745854, maybe @Mholloway or someone on his team understands the origin of the issue better than I do and can get it started, and I will also be able to help complete it with what I know.

Yes, when @Pchelolo is back I'll work with him to figure out what exactly went wrong here and start writing up an incident report.

Thank you, Mholloway, your help here was highly appreciated!


Assuming this should be tagged as Wikimedia-Incident then.

The explanation of what happened is in T241072#5751597. In short, the Kafka-based queue is not great with jobs delayed for 48 hours, and we need an alternative solution for this job.

Now, as to why it affected other jobs and what to do: in the Kafka-based queue, the jobs are organized into groups so that we don't have a whole change-prop process working on a job type that only has one occurrence every month. Higher-traffic jobs each get their own rule, some jobs with special requirements are grouped together, but most of the long-tail jobs with low enough traffic are grouped under 'low_traffic_jobs'. Given that we have a very long tail of such jobs, we didn't want to maintain an explicit list of the jobs that go into the catch-all group, so any newly added job ends up in that group by default; if it misbehaves, like in this instance, it can block the queue for multiple other types of jobs.
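
To make the grouping concrete, here is a simplified, hypothetical YAML sketch of what such rules could look like; the field names and numbers are illustrative only and do not reflect the actual cpjobqueue configuration:

    # High-traffic job types get dedicated rules with their own concurrency.
    refreshLinks:
      topics:
        - mediawiki.job.refreshLinks
      concurrency: 200

    # Everything else falls into the catch-all rule by default, so a single
    # misbehaving topic can stall all other job types in the group.
    low_traffic_jobs:
      topics:
        - mediawiki.job.*        # long tail of low-volume job types
      exclude_topics:
        - mediawiki.job.fetchGoogleCloudVisionAnnotations
      concurrency: 50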

I guess we need to be more protective and either list all low-traffic jobs explicitly (which would be a huge burden), or have some manual step of moving jobs between 'stable' and 'unstable' low-traffic groups (I don't really know how we would do that without an explicit list), or require an additional review, or require new jobs to be placed explicitly into an 'unstable' group.

Mentioned in SAL (#wikimedia-operations) [2019-12-18T21:00:18Z] <ppchelko@deploy1001> Started deploy [cpjobqueue/deploy@7e68510]: Return low_traffic_jobs concurrency to normal after T240518

Mentioned in SAL (#wikimedia-operations) [2019-12-18T21:01:22Z] <ppchelko@deploy1001> Finished deploy [cpjobqueue/deploy@7e68510]: Return low_traffic_jobs concurrency to normal after T240518 (duration: 01m 03s)

To try to close this as resolved: there seems to be an incident report already at https://wikitech.wikimedia.org/wiki/Incident_documentation/20191211-MachineVision%2Bcpjobqueue, even if still in draft. In my understanding, the only thing left here is to review it to make sure it is accurate and/or add further ticket actionables to ensure this doesn't happen again or is mitigated. For example, T241072 seems to be a direct actionable and should be reflected in the document too; not all actionables have to be long-term, ideal projects ;-).

@jcrespo @Pchelolo Could this be related to T241294 somewhat? One job is from January 2019 but the other one is fairly recent, and showJobs.php keeps saying there are no jobs. I'm not sure how to proceed. Thanks.

Mholloway claimed this task.

Incident report is in review.