Page MenuHomePhabricator

Spike in JobQueue job backlog time (500ms -> 4-8 minutes)
Closed, ResolvedPublic

Description

In discussions with @Dreamy_Jazz today, he pointed out that he observed a spike in job backlog time starting on October 24 at ~19:00.

The spike is hard to spot with all jobs plotted:

image.png (2,492×1,470 px, 1 MB)

But it's easy to see if you filter for a specific job:

image.png (2,492×1,470 px, 513 KB)

e.g. mediaModerationScanFileJob goes from 500ms to 4-8 minutes.

The slower backlog processing time seem to have perhaps started around October 21, but then became persistent after October 24. Here's a screenshot of October 11 to October 25 for mediaModerationScanFileJob:

image.png (3,574×2,204 px, 405 KB)

Note that not all jobs are affected, but it looks like most are.

It also seems that enwiki may be less affected than commonswiki.

@Dreamy_Jazz and I did not see anything obvious in https://sal.toolforge.org/production?p=0&q=&d=2024-10-24 to point to why this slowdown in job backlog is occurring.

Event Timeline

kostajh renamed this task from Spike in JobQueue job backlog time to Spike in JobQueue job backlog time (500ms -> 4-8 minutes).Oct 28 2024, 5:12 PM

Just to note, this increase in job queue waiting time is affecting our scanning rate. It was causing the script to assume that the jobs had failed, clear the internal tracking array (used because at the time there was no way to get a list of queued jobs), and then retry the same images. This brought the actual scanning rate down to a small rate.

I adjusted some of the maintenance script parameters to try to account for this a few days ago. It did improve things but it's not got us back to a normal scanning rate. I did this adjustment a few days ago, and nothing appears to have changed with the job queue times.

Potentially related to T378076, as concurrency was lowered to mitigate ongoing issues.

It looks like there was a large influx of flaggedrevs_CacheUpdate [0] jobs that started around that time, and ended around 2024-10-25 20:30.

That's resulted in a large backlog of jobs that we're continuing to chew through [1] (~ 2/3 of the way there judging by the associated kafka topic).

I believe this particular job is of the "low traffic" type that will share execution throughput in changeprop-jobqueue with other jobs [2], thus leading to a slowdown until the backlog clears.

[0] https://grafana.wikimedia.org/goto/ToEnl5WNR?orgId=1

[1] https://grafana.wikimedia.org/goto/oYcO_cZNg?orgId=1

[2] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/changeprop-jobqueue/values.yaml#216

Given the rate at which the backlog is draining, this should self-resolve in ~ 24h.

We can try to relax the concurrency limit for the low_traffic_jobs rule, but that might not make a significant difference in either clearing the backlog or improving queue times for other jobs handled by that rule.

I'm happy to give it a try, though.

Given the rate at which the backlog is draining, this should self-resolve in ~ 24h.

We can try to relax the concurrency limit for the low_traffic_jobs rule, but that might not make a significant difference in either clearing the backlog or improving queue times for other jobs handled by that rule.

I'm happy to give it a try, though.

I think it self-resolving in ~24 hrs would be okay. The concern was that this was a more long-term change to the wait times.

Thanks for the follow-up, @Dreamy_Jazz. If self-resolving in 24h should be fine, then I might not make any changes to the concurrency limit (particularly since it's unclear how much it will help).

FWIW, if the situation becomes more urgent, a more effective option may be to temporarily lift flaggedrevs_CacheUpdate out into the high_traffic_jobs_config, which would give it a dedicated consumer and processing resources. While that might not clear the backlog much sooner, it would at least isolate it from the other jobs mapped to low_traffic_jobs.

Given the rate at which the backlog is draining, this should self-resolve in ~ 24h.

We can try to relax the concurrency limit for the low_traffic_jobs rule, but that might not make a significant difference in either clearing the backlog or improving queue times for other jobs handled by that rule.

I'm happy to give it a try, though.

I think it self-resolving in ~24 hrs would be okay. The concern was that this was a more long-term change to the wait times.

Agreed, that sounds fine. I was concerned it was related to a software deployment.

It looks like there was a large influx of flaggedrevs_CacheUpdate [0] jobs that started around that time, and ended around 2024-10-25 20:30.

Is there a task tracking how / why this occurred, and should we be worried about it happening again? Is there alerting when job queue backlog time goes above some level?

Looking at the breakdown by wiki for flaggedrevs_CacheUpdate jobs among the last 1M entries in the executor log on mwlog1002:

$ tail -1000000 JobExecutor.log | grep flaggedrevs_CacheUpdate | cut -f 5 -d ' ' | sort | uniq -c | sort -nr
 261208 dewiki
   3026 arwiki
    111 ukwiki
 ...

So yeah, seems to effectively all be targeting dewiki. Example:

2024-10-28 18:36:32.166605 [625b52ea-a275-4d0e-a34c-0758a10b99c0] mw-jobrunner.codfw.main-589f7dcf76-hz7vd dewiki 1.43.0-wmf.28 JobExecutor INFO: Finished job execution {"job":"flaggedrevs_CacheUpdate Kategorie:Wikipedia:Seite,_die_JsonConfig_verwendet type=purge start=7730388 end=7730388 namespace=14 title=Wikipedia:Seite,_die_JsonConfig_verwendet requestId=625b52ea-a275-4d0e-a34c-0758a10b99c0","job_type":"flaggedrevs_CacheUpdate","job_status":true,"job_duration":0.10941386222839355}

In fact, everything in the sample I'm looking at are for the same title, just with different start / end fields. I don't know enough about the flaggedrevs extension (really anything about it, to be fair) to connect this back to a specific trigger workload. However, if anyone does, that would be greatly appreciated.

Looking at the breakdown by wiki for flaggedrevs_CacheUpdate jobs among the last 1M entries in the executor log on mwlog1002:

$ tail -1000000 JobExecutor.log | grep flaggedrevs_CacheUpdate | cut -f 5 -d ' ' | sort | uniq -c | sort -nr
 261208 dewiki
   3026 arwiki
    111 ukwiki
 ...

So yeah, seems to effectively all be targeting dewiki. Example:

I think that's just because FlaggedRevs is used most heavily on dewiki.

2024-10-28 18:36:32.166605 [625b52ea-a275-4d0e-a34c-0758a10b99c0] mw-jobrunner.codfw.main-589f7dcf76-hz7vd dewiki 1.43.0-wmf.28 JobExecutor INFO: Finished job execution {"job":"flaggedrevs_CacheUpdate Kategorie:Wikipedia:Seite,_die_JsonConfig_verwendet type=purge start=7730388 end=7730388 namespace=14 title=Wikipedia:Seite,_die_JsonConfig_verwendet requestId=625b52ea-a275-4d0e-a34c-0758a10b99c0","job_type":"flaggedrevs_CacheUpdate","job_status":true,"job_duration":0.10941386222839355}

In fact, everything in the sample I'm looking at are for the same title, just with different start / end fields. I don't know enough about the flaggedrevs extension (really anything about it, to be fair) to connect this back to a specific trigger workload. However, if anyone does, that would be greatly appreciated.

The code that added this job was introduced in 2008 via rEFLRc856ea8fd8f1: * Track stable-only link use in it's own table (requires schema change) (bug…, in case that helps.

Thanks, @kostajh.

FWIW, it looks like Kategorie:Wikipedia:Seite,_die_JsonConfig_verwendet is a brand new category created on the 24th, with a lot of associated pages.

While it seems(?) to have been created after the influx of jobs started, it's plausible that it's not the only category of this "shape" to have been created on that day. In any case, this seems to result in a large amount of flaggedrevs_CacheUpdate jobs being created.

I think a potentially interesting monitoring follow up here might be to detect when a "low traffic" job no longer is, as clearly there's an isolation failure risk for the other jobs grouped into that consumer configuration.

Thanks, @kostajh.

FWIW, it looks like Kategorie:Wikipedia:Seite,_die_JsonConfig_verwendet is a brand new category created on the 24th, with a lot of associated pages.

Created by code https://gerrit.wikimedia.org/r/c/mediawiki/extensions/JsonConfig/+/1019081.

I don't think they are related. That same tracking category appeared on every wiki, nothing explains why it would have cause a load spike for dewiki only. Unless perhaps flagged revs does something "funny" with categories that would cause much higher load?

MediaModeration jobs have gone back to normal processing.

We can probably close this unless we want to follow-up on comments made above about monitoring for "low traffic" jobs that are no longer low traffic?

Thanks @Ammarpad for the pointer to where this category came from, and thanks @cscott for providing some context. Indeed, this seems likely to be an interaction between the new category and the way flaggedrevs is used on dewiki.

@Dreamy_Jazz - thanks for confirming. I see the same when spot-checking other jobs mapped to the low-traffic config as of late yesterday.

I've opened T378609 to specifically cover the monitoring aspect, so unless anyone has other follow-ups they'd like to pursue here, I'd also vote that we close this out.

kostajh claimed this task.

Thanks everyone!

Just as an aside, I believe PublishStashedFile AssembleUploadChunks are considered low traffic job. Unlike normal jobs these are very latency sensitive, as they don't happen in the background, but the UI actually makes users wait well these jobs complete (See also T378276). It would be really great if somehow these jobs can be prioritized in a job queue backlog situation.

It shouldn't be too hard to give it a dedicated lane with small concurrency

Thanks, @Bawolff - Yes, indeed, those both fan into the low-traffic consumer. While we don't really have a prioritization mechanism in this context that I'm aware of, it would probably be fairly straightforward to at least move them out of low-traffic to a dedicated consumer, as @Ladsgroup points out. I've opened T379035 to look into that.

@Bawolff The problem described in T378276 is rised again. I have recorded a HAR file, please, give me an e-mail where should I send it. I will not clear any cookies because I don't know how to do it, I'll just seng you a raw file.

Now files are waiting ~10 minutes before publishing and doesn't published due to errors "Unknown server error" and "Incorrect CSRF token". Uploading (first UploadWizard step) was very slow too with the same behavior than in previous case: 3 files in queue and all other files waiting, after several minutes this 3 files uploaded and next 3 files in queue.

{A2518AE1-A179-4C24-B15D-0D827809598E}.png (1,163×1,071 px, 43 KB)

@MBH lets open a separate new task to investigate, as the cause could be something different than the job queue thing this task is about. If you want you could email the HAR file to me ( bawolff@gmail.com ).

That said, it does seem like the p99 for AssembleChunkUpload jobs has spiked to ~15 min for the last 2 hours (was fine before that point), so maybe that is just it. Maybe driven by a spike in ChangeDeletionNotification jobs. Sounds like a dedicated queue as Scott suggests would really help.

Subscribing myself. I'm seeing this for a while, and yes, today some mw tags are taking forever to update.

Thanks for flagging, all. Yes, this looks like another isolation failure on the low-traffic consumer, and appears to have largely self-resolved as of ~ 14:50 UTC on the 10th. I'll follow up on T379462 for this particular instance, and aim to prioritize T379035 when I'm back this week.

jijiki triaged this task as High priority.Nov 11 2024, 1:07 PM

Since the three job types critical to uploads have now been moved to dedicated consumers (T379035), the primary follow-up here is monitoring (T378609) to detect when these kinds of isolation failures occur so that we can reactively isolate the "antagonist" job.

I'll hold onto this until we have that in place.

Monitoring for sustained latency impact on low-traffic jobs is now live.