
cpjobqueue not achieving configured concurrency
Open, Needs Triage, Public

Description

CirrusSearchElasticaWrite jobs have been backlogging recently. In T266762 the same issue was addressed and concurrency for the job was increased to 100 per partition, 300 overall. Job concurrency graphs, added to the JobQueue Job dashboard while investigating this issue, show the job achieving a typical concurrency of ~25 per partition with a backlogged queue. We have disabled some important maintenance operations on the cirrus elasticsearch clusters to reduce the backlog, but these cannot stay turned off forever.
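
For context on where this limit lives: per-job concurrency for cpjobqueue is set in the changeprop-jobqueue chart in operations/deployment-charts. A minimal sketch of the kind of stanza involved; key names and numbers here are illustrative, not the exact production configuration:

```yaml
# Hypothetical excerpt of changeprop-jobqueue Helm values.
# Key names and numbers are illustrative, not the production config.
jobqueue:
  jobs:
    cirrusSearchElasticaWrite:
      # Maximum jobs executed at once per partition's rule executor;
      # with three partitions this would allow roughly 300 concurrent jobs overall.
      concurrency: 100
```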

While looking into this I also noticed that several other jobs are backlogged currently, unclear but plausible they are experiencing a similar issue:

| job | normal job backlog time (mean avg, 15 min) |
| --- | --- |
| wikibase-addUsagesforPage | 4.93 days |
| refreshLinks | 3.27 days |
| cirrusSearchElasticaWrite | 12.9 hours |

Event Timeline

We're looking into this at the moment. It seems there might have been some stalled processes in k8s due to OOM kills, but that doesn't explain the larger concurrency issues. Hopefully we can add some instrumentation to get better insight.

Change 761664 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] changeprop-jobqueue: increase concurrency for backlogged jobs

https://gerrit.wikimedia.org/r/761664

Was there a change made to the Cirrus cluster yesterday? I see a significant dropoff in the backlog and I'm trying to find out what might have caused this. I haven't seen an equivalent dropoff in other jobs so I'm concerned it's not a good sign.

Change 761664 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop-jobqueue: increase concurrency for backlogged jobs

https://gerrit.wikimedia.org/r/761664

> Was there a change made to the Cirrus cluster yesterday? I see a significant dropoff in the backlog and I'm trying to find out what might have caused this. I haven't seen an equivalent dropoff in other jobs so I'm concerned it's not a good sign.

For the cirrus backlog, we turned off a process that generates half our jobs on Jan 31 when we first noticed the backlogging. This ticket came later, as we investigated that and it looked like cpjobqueue was having issues. The part that was turned off typically inserts jobs every 2 hours from a systemd timer; I'll kick off one of those to see if it will drain before the next round would happen.

> The part that was turned off typically inserts jobs every 2 hours from a systemd timer; I'll kick off one of those to see if it will drain before the next round would happen.

I started one but it did something rather unexpected (though unrelated to this ticket: a read-only exception). Possibly something else is going wrong with cirrus; I'll look into it.

Change 761677 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] changeprop-jobqueue: increase CPU allocation

https://gerrit.wikimedia.org/r/761677

At this point I am fairly sure that this is a CPU throttling issue - CPU allocation is already a bit low for the jobqueue compared to traditional changeprop, and we're seeing throttling in a way that is reminiscent of historical failures for changeprop. The increase in concurrency has helped some jobs, but it has also increased the amount of throttling: https://grafana-rw.wikimedia.org/d/LSeAShkGz/jobqueue?viewPanel=47&orgId=1&from=1644512117444&to=1644514737979&forceLogin=true
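
For readers unfamiliar with the mechanics: the throttling here is the kernel CFS quota that Kubernetes enforces whenever a container hits the CPU limit set in the chart's resources stanza. A minimal sketch of that stanza, with placeholder numbers rather than the values in the patch:

```yaml
# Illustrative container resources for a cpjobqueue pod; the real values
# live in the changeprop-jobqueue chart and differ from these placeholders.
resources:
  requests:
    cpu: "1"       # amount the scheduler reserves for the container
    memory: 500Mi
  limits:
    cpu: "2"       # CFS quota; hitting this produces the throttling seen in the graphs
    memory: 1Gi    # exceeding this gets the container OOM-killed
```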

Change 761677 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop-jobqueue: increase CPU and memory allocation

https://gerrit.wikimedia.org/r/761677

Change 761938 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] changeprop-jobqueue: decrease CPU use, increase number of nodes

https://gerrit.wikimedia.org/r/761938

Change 761938 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop-jobqueue: decrease CPU use, increase number of nodes

https://gerrit.wikimedia.org/r/761938

Some tweaking and investigation has taught us a few things:

  • Increasing concurrency for some jobs led to a threefold increase in concurrency. This was unexpected and hard to explain initially; ultimately it turned out to be more or less unrelated to the changes, as we were still not achieving the desired overall concurrency.
  • Some jobqueue instances were hit by an issue where high memory usage was causing CPU throttling; we have previously seen this in changeprop (T255975), which uses the same codebase. We've increased memory and CPU allocation, and have since reduced the CPU bump a little as it wasn't the source of the issues. This led to further changes in concurrency, frequently in the wrong direction. In this screenshot we can see a substantial drop in throttling at 10:45, after the resource increase.

Screenshot 2022-02-11 at 17.39.38.png

  • After this, we increased the number of replicas running in each cluster from 12 to 20 - odds are we will have to tune this higher in future, but hopefully this will still have an impact; we can see increased processing happening even if it's not reaching the concurrency we desire. In this screenshot we can see an initial jitter at 10:45 when resources were increased that more or less levels out to be effectively the same, but with a substantial increase at 17:30 when the replicas were increased.

Screenshot 2022-02-11 at 17.41.39.png

However, even with the improved performance, we are still not seeing what we want in terms of throughput. Given the reaction to increased instances and increased resources, this is most likely a side effect of the overall model of changeprop, wherein each instance has consumers for every topic and partitions are assigned to them outside of their control. Some instances of changeprop are being saturated whereas others are almost totally undersubscribed (for example, at present the least used instance is using 397.98 MiB of memory and the most used is using 1.28 GiB). This would also explain why we saw disproportionate jumps in concurrency during the initial configuration changes: they were simply the result of jobs being redistributed when all pods were recreated.

Increasing the number of nodes further is a stopgap that will buy us some breathing room, but it seems like the only way to make changeprop more sensibly suited to k8s is a little rearchitecting, or at least some instrumentation so we can better analyse how each instance is operating. One option that I've raised before that would help us deal with this (but would unfortunately also make configuration and management a *lot* more complicated) is sharding jobs across groups of instances rather than subscribing all instances to all jobs.

Change 762418 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] changeprop-jobqueue: more replicas, less CPU

https://gerrit.wikimedia.org/r/762418

Thanks for looking into this. I can see concurrency increasing after the patches, but unfortunately, as stated, it isn't meeting what we are aiming for. I've made a modification to cirrus that we will roll out next week to reduce isolation between clusters while also reducing the job rate by a few hundred/s. It seems that to get cirrus jobs going again we will have to work from both directions.

Change 762418 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop-jobqueue: more replicas, less CPU

https://gerrit.wikimedia.org/r/762418

Change 767080 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] changeprop: add sampling configuration

https://gerrit.wikimedia.org/r/767080

Change 768760 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] jobqueue: set CPU request

https://gerrit.wikimedia.org/r/768760

Change 768760 merged by jenkins-bot:

[operations/deployment-charts@master] jobqueue: set CPU request

https://gerrit.wikimedia.org/r/768760

Change 769038 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] jobqueue: use guaranteed QoS strategy

https://gerrit.wikimedia.org/r/769038

Change 769038 merged by jenkins-bot:

[operations/deployment-charts@master] jobqueue: use guaranteed QoS strategy

https://gerrit.wikimedia.org/r/769038
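
For context, Kubernetes assigns a pod the Guaranteed QoS class only when every container's resource requests equal its limits; Guaranteed pods are the last to be evicted under node pressure and, with the kubelet's static CPU manager policy, can be given exclusive CPUs. A minimal sketch of a qualifying resources stanza (numbers are placeholders, not the values in the patch):

```yaml
# For Guaranteed QoS, requests must equal limits for both CPU and memory
# on every container in the pod. Numbers below are placeholders.
resources:
  requests:
    cpu: "2"
    memory: 1Gi
  limits:
    cpu: "2"
    memory: 1Gi
```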

Since adjusting our QoS strategy, I think we're seeing better and more consistent performance. Configured concurrency and actual concurrency are potentially still not in alignment, but all of the backlogs on previously problematic jobs have been eliminated. @EBernhardson - are there still changes in place on the Cirrus cluster that limit it? I'd be curious to see how the jobqueue handles real volumes now. I see occasional increases in backlog for cirrusSearchElasticaWrite (possibly as a result of large imports/changes?), but it also appears that those backlogs are dealt with relatively quickly.

> Since adjusting our QoS strategy, I think we're seeing better and more consistent performance. Configured concurrency and actual concurrency are potentially still not in alignment, but all of the backlogs on previously problematic jobs have been eliminated. @EBernhardson - are there still changes in place on the Cirrus cluster that limit it? I'd be curious to see how the jobqueue handles real volumes now. I see occasional increases in backlog for cirrusSearchElasticaWrite (possibly as a result of large imports/changes?), but it also appears that those backlogs are dealt with relatively quickly.

Indeed, things are looking more consistent now. Unfortunately the cirrus backlog increases whenever I manually run the process that we turned off; it usually runs via systemd timers every two hours, but we have yet to see it consistently finish the backlog in less than two hours, so I haven't turned the timers back on. The patch I added last week to reduce the re-indexing rate to 3 times per year hasn't made it to prod yet, though; I'll ship that in a few hours and hopefully that will be enough to call this all complete.

This task could perhaps be closed either way; it seems like you've done what's possible from the SRE side of things, and eking more out of this system might require revisiting fundamental assumptions that we won't be fixing here.

We've been running with everything turned on over a weekend and it looks generally healthy; backlogs don't seem to be building for cirrus jobs. The other jobs mentioned in the ticket also cleared their backlogs about a month ago. Overall we can probably call this complete enough.

akosiaris subscribed.

I am reopening this. Due to https://wikitech.wikimedia.org/wiki/Incidents/2022-03-27_api we had to lower the concurrency by half in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/774462

The backlogs have increased as well; we should probably revisit this.

One actionable item from the outage (mobileapps performance issues) has been addressed, which should decrease the number of retries by cpjobqueue due to failures from that side. Another thing serviceops needs to discuss is the sizing of the MediaWiki API cluster. We are in budgeting season; I'll see what I can do about that.

@hnowlan Do you think we could increase the concurrency limits in a more conservative fashion? My thinking is that we might be able to shave off quite a bit of the backlog without substantially increasing the risk of another of these incidents.

Gehel subscribed.

I'm removing the Search Platform from this ticket as it seems to be handled now purely at the JobQueue level. Ping us or reassign us as needed.

Change 767080 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: add sampling configuration, set num_workers

https://gerrit.wikimedia.org/r/767080

The Search team is seeing an increased JobQueue backlog again; does anyone have any suggestions on how best to deal with this?

Change 820117 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] jobqueue: increase num_workers to 4

https://gerrit.wikimedia.org/r/820117

> The Search team is seeing an increased JobQueue backlog again; does anyone have any suggestions on how best to deal with this?

By August 10th, the backlog decreased abruptly:

image.png
Grafana link

However, SAL has nothing related around that time to explain it.

https://sal.toolforge.org/production?p=3&q=&d=2022-08-10

Any ideas?

We changed the jobs in T314426. We now send what used to be one partition's worth of jobs into three partitions, with the overall goal of having cpjobqueue engage multiple execution contexts to run the same queue. This has some downsides: we can no longer ensure strict ordering, which means a quick succession of an edit and then a delete (quite common) is not guaranteed to be processed in the same order. But since there are other background processes that clean up such mistakes, we decided that was a reasonable course of action versus losing writes due to backlogging.

Change 820117 merged by jenkins-bot:

[operations/deployment-charts@master] jobqueue: increase num_workers to 4

https://gerrit.wikimedia.org/r/820117
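
For reference, cpjobqueue (like changeprop) is a service-runner application, and num_workers is the service-runner setting that controls how many worker processes are forked inside each pod. A sketch of roughly what the rendered service config looks like; the exact placement of these keys in the chart values is an assumption:

```yaml
# Sketch of the top of a service-runner config as rendered for cpjobqueue.
# Only num_workers is the point here; the other key is an illustrative companion.
num_workers: 4            # worker processes forked per pod (the value this patch sets)
worker_heap_limit_mb: 750 # illustrative: service-runner restarts a worker whose heap exceeds this
```

More workers per pod lets a single pod drive more consumers in parallel, at the cost of higher memory usage per pod.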