
cpjobqueue not achieving configured concurrency
Open, Needs Triage, Public

Description

CirrusSearchElasticaWrite jobs have been backlogging recently. In T266762 the same issue was addressed and concurrency for the job was increased to 100 per partition, 300 overall. Job concurrency graphs, added to the JobQueue Job dashboard while investigating this issue, show the job achieving a typical concurrency of ~25 per partition with a backlogged queue. We have disabled some important maintenance operations on the cirrus elasticsearch clusters to reduce the backlog, but these cannot stay turned off forever.
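
For context on where this limit lives: per-job concurrency for cpjobqueue is set in the changeprop-jobqueue chart in operations/deployment-charts. A minimal sketch of the kind of stanza involved; key names and numbers here are illustrative, not the exact production configuration:

```yaml
# Hypothetical excerpt of changeprop-jobqueue Helm values.
# Key names and numbers are illustrative, not the production config.
jobqueue:
  jobs:
    cirrusSearchElasticaWrite:
      # Maximum jobs executed at once per partition's rule executor;
      # with three partitions this would allow roughly 300 concurrent jobs overall.
      concurrency: 100
```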

While looking into this I also noticed that several other jobs are backlogged currently, unclear but plausible they are experiencing a similar issue:

| job | normal job backlog time (mean avg, 15 min) |
| --- | --- |
| wikibase-addUsagesforPage | 4.93 days |
| refreshLinks | 3.27 days |
| cirrusSearchElasticaWrite | 12.9 hours |

Event Timeline

We're looking into this at the moment. It seems there might have been some stalled processes in k8s due to OOM kills, but that doesn't explain the larger concurrency issues. Hopefully we can add some instrumentation to get better insight.

Change 761664 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] changeprop-jobqueue: increase concurrency for backlogged jobs

https://gerrit.wikimedia.org/r/761664

Was there a change made to the Cirrus cluster yesterday? I see a significant dropoff in the backlog and I'm trying to find out what might have caused this. I haven't seen an equivalent dropoff in other jobs so I'm concerned it's not a good sign.

Change 761664 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop-jobqueue: increase concurrency for backlogged jobs

https://gerrit.wikimedia.org/r/761664

> Was there a change made to the Cirrus cluster yesterday? I see a significant dropoff in the backlog and I'm trying to find out what might have caused this. I haven't seen an equivalent dropoff in other jobs so I'm concerned it's not a good sign.

For the cirrus backlog, we turned off a process that generates half our jobs on Jan 31 when we first noticed the backlogging. This ticket came later, as we investigated that and it looked like cpjobqueue was having issues. The part that was turned off typically inserts jobs every 2 hours from a systemd timer; I'll kick off one of those to see if it will drain before the next round would happen.

> The part that was turned off typically inserts jobs every 2 hours from a systemd timer; I'll kick off one of those to see if it will drain before the next round would happen.

I started one but it did something rather unexpected (though unrelated to this ticket: a read-only exception). Possibly something else is going wrong with cirrus; I'll look into it.

Change 761677 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] changeprop-jobqueue: increase CPU allocation

https://gerrit.wikimedia.org/r/761677

At this point I am fairly sure that this is a CPU throttling issue - CPU allocation is already a bit low for the jobqueue compared to traditional changeprop, and we're seeing throttling in a way that is reminiscent of historical failures for changeprop. The increase in concurrency has helped some jobs, but it has also increased the amount of throttling: https://grafana-rw.wikimedia.org/d/LSeAShkGz/jobqueue?viewPanel=47&orgId=1&from=1644512117444&to=1644514737979&forceLogin=true
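
For readers unfamiliar with the mechanics: the throttling here is the kernel CFS quota that Kubernetes enforces whenever a container hits the CPU limit set in the chart's resources stanza. A minimal sketch of that stanza, with placeholder numbers rather than the values in the patch:

```yaml
# Illustrative container resources for a cpjobqueue pod; the real values
# live in the changeprop-jobqueue chart and differ from these placeholders.
resources:
  requests:
    cpu: "1"       # amount the scheduler reserves for the container
    memory: 500Mi
  limits:
    cpu: "2"       # CFS quota; hitting this produces the throttling seen in the graphs
    memory: 1Gi    # exceeding this gets the container OOM-killed
```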

Change 761677 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop-jobqueue: increase CPU and memory allocation

https://gerrit.wikimedia.org/r/761677

Change 761938 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] changeprop-jobqueue: decrease CPU use, increase number of nodes

https://gerrit.wikimedia.org/r/761938

Change 761938 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop-jobqueue: decrease CPU use, increase number of nodes

https://gerrit.wikimedia.org/r/761938

Some tweaking and investigation has taught us a few things:

  • Increasing concurrency for some jobs led to a threefold increase in concurrency. This was unexpected and hard to explain initially; ultimately it turned out to be more or less unrelated to the changes, as we were still not achieving the desired overall concurrency.
  • Some jobqueue instances were hit by an issue where high memory usage was causing CPU throttling; we have previously seen this in changeprop (T255975), which uses the same codebase. We've increased memory and CPU allocation, and have since reduced the CPU bump a little as it wasn't the source of the issues. This led to further changes in concurrency, frequently in the wrong direction. In this screenshot we can see a substantial drop in throttling at 10:45, after the resource increase.

Screenshot 2022-02-11 at 17.39.38.png

  • After this, we increased the number of replicas running in each cluster from 12 to 20 - odds are we will have to tune this higher in future, but hopefully this will still have an impact; we can see increased processing happening even if it's not reaching the concurrency we desire. In this screenshot we can see an initial jitter at 10:45 when resources were increased that more or less levels out to be effectively the same, but with a substantial increase at 17:30 when the replicas were increased.

Screenshot 2022-02-11 at 17.41.39.png

However, even with the improved performance, we are still not seeing what we want in terms of throughput. Given the reaction to increased instances and increased resources, this is most likely a side effect of the overall model of changeprop, wherein each instance has consumers for every topic and partitions are assigned to them outside of their control. Some instances of changeprop are being saturated whereas others are almost totally undersubscribed (for example, at present the least used instance is using 397.98 MiB of memory and the most used is using 1.28 GiB). This would also explain why we saw disproportionate jumps in concurrency during the initial configuration changes: they were simply the result of jobs being redistributed when all pods were recreated.

Increasing the number of nodes further is a stopgap that will buy us some breathing room, but it seems like the only way to make changeprop more sensibly suited to k8s is a little rearchitecting, or at least some instrumentation so we can better analyse how each instance is operating. One option that I've raised before that would help us deal with this (but would unfortunately also make configuration and management a *lot* more complicated) is sharding jobs across groups of instances rather than subscribing all instances to all jobs.

Change 762418 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] changeprop-jobqueue: more replicas, less CPU

https://gerrit.wikimedia.org/r/762418

Thanks for looking into this. I can see concurrency increasing after the patches, but unfortunately, as stated, it isn't meeting what we are aiming for. I've made a modification to cirrus that we will roll out next week to reduce isolation between clusters while also reducing the job rate by a few hundred/s. It seems that to get cirrus jobs going again we will have to work from both directions.

Change 762418 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop-jobqueue: more replicas, less CPU

https://gerrit.wikimedia.org/r/762418

Change 767080 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] changeprop: add sampling configuration

https://gerrit.wikimedia.org/r/767080

Change 768760 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] jobqueue: set CPU request

https://gerrit.wikimedia.org/r/768760

Change 768760 merged by jenkins-bot:

[operations/deployment-charts@master] jobqueue: set CPU request

https://gerrit.wikimedia.org/r/768760

Change 769038 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] jobqueue: use guaranteed QoS strategy

https://gerrit.wikimedia.org/r/769038

Change 769038 merged by jenkins-bot:

[operations/deployment-charts@master] jobqueue: use guaranteed QoS strategy

https://gerrit.wikimedia.org/r/769038
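
For context, Kubernetes assigns a pod the Guaranteed QoS class only when every container's resource requests equal its limits; Guaranteed pods are the last to be evicted under node pressure and, with the kubelet's static CPU manager policy, can be given exclusive CPUs. A minimal sketch of a qualifying resources stanza (numbers are placeholders, not the values in the patch):

```yaml
# For Guaranteed QoS, requests must equal limits for both CPU and memory
# on every container in the pod. Numbers below are placeholders.
resources:
  requests:
    cpu: "2"
    memory: 1Gi
  limits:
    cpu: "2"
    memory: 1Gi
```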

Since adjusting our QoS strategy, I think we're seeing better and more consistent performance. Configured concurrency and actual concurrency are potentially still not in alignment, but all of the backlogs on previously problematic jobs have been eliminated. @EBernhardson - are there still changes in place on the Cirrus cluster that limit it? I'd be curious to see how the jobqueue handles real volumes now. I see occasional increases in backlog for cirrusSearchElasticaWrite (possibly as a result of large imports/changes?), but it also appears that those backlogs are dealt with relatively quickly.

> Since adjusting our QoS strategy, I think we're seeing better and more consistent performance. Configured concurrency and actual concurrency are potentially still not in alignment, but all of the backlogs on previously problematic jobs have been eliminated. @EBernhardson - are there still changes in place on the Cirrus cluster that limit it? I'd be curious to see how the jobqueue handles real volumes now. I see occasional increases in backlog for cirrusSearchElasticaWrite (possibly as a result of large imports/changes?), but it also appears that those backlogs are dealt with relatively quickly.

Indeed, things are looking more consistent now. Unfortunately the cirrus backlog increases whenever I manually run the process that we turned off; it usually runs via systemd timers every two hours, but we have yet to see it consistently finish the backlog in less than two hours, so I haven't turned the timers back on. The patch I added last week to reduce the re-indexing rate to 3 times per year hasn't made it to prod yet, though; I'll ship that in a few hours and hopefully that will be enough to call this all complete.

This task could perhaps be closed either way; it seems like you've done what's possible from the SRE side of things, and eking more out of this system might require revisiting fundamental assumptions that we won't be fixing here.

We've been running with everything turned on over a weekend and it looks generally healthy; backlogs don't seem to be building for cirrus jobs. The other jobs mentioned in the ticket also cleared their backlogs about a month ago. Overall we can probably call this complete enough.

akosiaris subscribed.

I am reopening this. Due to https://wikitech.wikimedia.org/wiki/Incidents/2022-03-27_api we had to lower the concurrency by half in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/774462

The backlogs have increased as well; we should probably revisit this.

One actionable item from the outage (mobileapps performance issues) has been addressed, which should decrease the number of retries by cpjobqueue due to failures from that side. Another thing serviceops needs to discuss is the sizing of the MediaWiki API cluster. We are in budgeting season; I'll see what I can do about that.

@hnowlan Do you think we could increase the concurrency limits in a more conservative fashion? My thinking is that we might be able to shave off quite a bit of the backlog without substantially increasing the risk of another of these incidents.

Gehel subscribed.

I'm removing the Search Platform from this ticket as it seems to be handled now purely at the JobQueue level. Ping us or reassign us as needed.

Change 767080 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: add sampling configuration, set num_workers

https://gerrit.wikimedia.org/r/767080

The Search team is seeing an increased JobQueue backlog again; does anyone have any suggestions on how best to deal with this?

Change 820117 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] jobqueue: increase num_workers to 4

https://gerrit.wikimedia.org/r/820117

> The Search team is seeing an increased JobQueue backlog again; does anyone have any suggestions on how best to deal with this?

By August 10th, the backlog decreased abruptly:

image.png
Grafana link

However, SAL has nothing related around that time to explain it.

https://sal.toolforge.org/production?p=3&q=&d=2022-08-10

Any ideas?

We changed the jobs in T314426. We now send what used to be one partition's worth of jobs into three partitions, with the overall goal of having cpjobqueue engage multiple execution contexts to run the same queue. This has some downsides: we can no longer ensure strict ordering, which means a quick succession of an edit and then a delete (quite common) is not guaranteed to be processed in the same order. But since there are other background processes that clean up such mistakes, we decided that was a reasonable course of action versus losing writes due to backlogging.

Change 820117 merged by jenkins-bot:

[operations/deployment-charts@master] jobqueue: increase num_workers to 4

https://gerrit.wikimedia.org/r/820117
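
For reference, cpjobqueue (like changeprop) is a service-runner application, and num_workers is the service-runner setting that controls how many worker processes are forked inside each pod. A sketch of roughly what the rendered service config looks like; the exact placement of these keys in the chart values is an assumption:

```yaml
# Sketch of the top of a service-runner config as rendered for cpjobqueue.
# Only num_workers is the point here; the other key is an illustrative companion.
num_workers: 4            # worker processes forked per pod (the value this patch sets)
worker_heap_limit_mb: 750 # illustrative: service-runner restarts a worker whose heap exceeds this
```

More workers per pod lets a single pod drive more consumers in parallel, at the cost of higher memory usage per pod.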