
The saneitizer is a lot slower than when running in codfw (oct 27 2020 codfw -> eqiad switchover)
Closed, Resolved · Public · 8 Estimated Story Points

Description

As a CirrusSearch maintainer I want the cirrusSearchCheckerJob jobs to be handled in time so that they don't pile up indefinitely.

Since the switch to eqiad the cirrusSearchCheckerJob seems to be slower; the rate of IDs being checked dropped drastically just after the switch:

saneitizer.png (232×1 px, 78 KB)

The impact is that the cirrusSearchCheckerJob jobs do pile up.

AC:

  • find the cause of the slowdown
  • the backlog of cirrusSearchCheckerJob jobs should remain sane
  • re-enable the saneitizer after resolving the performance issue(s)

Event Timeline

Restricted Application added a subscriber: Aklapper.
Aklapper renamed this task from The saneitizer is a lot slower than when running in codfw (oct 27 2020 codfw -> eqiad switchover) to The sanitizer is a lot slower than when running in codfw (oct 27 2020 codfw -> eqiad switchover). · Oct 29 2020, 11:55 AM
CBogen set the point value for this task to 8. · Nov 2 2020, 6:33 PM
Gehel renamed this task from The sanitizer is a lot slower than when running in codfw (oct 27 2020 codfw -> eqiad switchover) to The saneitizer is a lot slower than when running in codfw (oct 27 2020 codfw -> eqiad switchover). · Nov 2 2020, 6:34 PM

I assume the Grafana panel in the screenshot is Elasticsearch Indexing / Saneitizer, but that graph now shows 0 ops since November 7th? (Except for a brief burst of activity on the 9th/10th, around midnight UTC.)

It had to be turned off again because the job queue was growing. The brief burst in activity is me manually triggering the process that injects jobs into the queue every other hour; unfortunately it needed 3 hours to process 2 hours' worth of jobs. It's not particularly clear what's going on here that suddenly made things slower. The likely next step is to increase the job's concurrency from 20 to something larger (30?).
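As a rough sketch of that sizing, assuming throughput scales roughly linearly with worker concurrency (which only holds if the workers themselves, and not a downstream service, are the bottleneck):

```python
# Back-of-the-envelope sizing for the concurrency increase discussed above.
# Assumption: throughput scales linearly with worker concurrency.

current_concurrency = 20     # current cpjobqueue concurrency for cirrusSearchCheckerJob
work_injected_hours = 2      # the manual run injected 2 hours' worth of jobs...
wall_clock_hours = 3         # ...and needed 3 hours to drain them

# To keep up, jobs must be processed at least as fast as they are injected.
required_concurrency = current_concurrency * wall_clock_hours / work_injected_hours
print(required_concurrency)  # 30.0 -> matches the "20 to something larger (30?)" suggestion
```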

Change 641251 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[operations/deployment-charts@master] cpjobqueue: Increase cirrusSearchCheckerJob concurrency

https://gerrit.wikimedia.org/r/641251

Per the graphs above, we used to see a consistent 400-600 ops; now we see fairly dramatic fluctuations between 50 and 400 ops.

We can get some estimation of what the jobs are spending their time on by looking at the flame graphs. For reference, these come from excimer/arclamp and are constructed by collecting stack traces from running processes. We can estimate, in aggregate, where code is spending its time by seeing which functions show up most often in those stack traces.
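As a minimal sketch of that aggregation idea (this is not the actual excimer/arclamp pipeline; the stack format below is a hypothetical stand-in):

```python
from collections import Counter

# Minimal sketch of how sampled stack traces turn into the per-frame numbers
# in the tables below. Each sample is a hypothetical leaf-last list of frames.
samples = [
    ["JobRunner::run", "CheckerJob::doJob", "CheckerJob::checkExistingPage", "RevisionRecord::getContent"],
    ["JobRunner::run", "CheckerJob::doJob", "CheckerJob::loadPagesFromDB"],
    ["JobRunner::run", "CheckerJob::doJob", "CheckerJob::checkExistingPage", "WikiPage::loadLastEdit"],
]

# Count how many samples contain each frame (set() so a frame is counted once per sample).
frame_counts = Counter(frame for stack in samples for frame in set(stack))
total = len(samples)

for frame, count in frame_counts.most_common():
    # Every sample here contains CheckerJob::doJob, so "share of all samples"
    # corresponds to the "Relative %" column (share relative to doJob) below.
    print(f"{frame}: {count} samples, {100 * count / total:.1f}% of samples")
```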

2020-10-26

| Nesting | Stack frame | Samples | Absolute % | Relative % |
| --- | --- | --- | --- | --- |
|  | CheckerJob::doJob | 1892 | 0.52% | 100% |
| * | CheckerJob::loadPagesFromDB | 358 | 0.10% | 19.2% |
| * | CheckerJob::checkExistingPage | 1157 | 0.32% | 61.5% |
| ** | RevisionRecord::getContent | 781 | 0.21% | 40.3% |
| ** | WikiPage::loadLastEdit | 329 | 0.09% | 17.3% |

2020-10-28

| Nesting | Stack frame | Samples | Absolute % | Relative % |
| --- | --- | --- | --- | --- |
|  | CheckerJob::doJob | 966 | 0.29% | 100% |
| * | CheckerJob::checkExistingPage | 583 | 0.17% | 58.6% |
| ** | RevisionRecord::getContent | 375 | 0.09% | 31.0% |

Notably, we only have about half as many samples on 10-28 as on 10-26. This doesn't match our expectations: if the job queue were running these jobs at full concurrency (20 processes), we would expect more samples on 10-28, since on 10-26 the queue would empty and there would be periods of time when the job wasn't running at all. Jobs running slower would not result in fewer samples, as these samples reflect stack traces of whatever happens to be running.
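To make that concrete, a purely illustrative model; the sampling interval, the per-process sampling assumption, and the worker counts below are assumptions, not measurements:

```python
# Illustrative model: the number of samples attributed to a job class is roughly
# proportional to how many processes are running it at any moment, not to how
# long each individual job takes. All numbers are assumed for illustration.

sampling_interval_s = 60       # assumed wall-clock sampling period per running process
window_s = 24 * 3600           # one day of samples

def expected_samples(avg_concurrent_workers: float) -> float:
    """Expected sample count for a job class over the window."""
    return avg_concurrent_workers * window_s / sampling_interval_s

print(expected_samples(20))    # queue never empties: more samples
print(expected_samples(10))    # queue empty half the time: fewer samples
```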

The overall shape of the flame graph stayed approximately the same, although due to the lower frequency we lose some of the smaller portions. This suggests to me that nothing fundamentally changed in the latency of the services the job communicates with, although that isn't 100% certain.

@Pchelolo I've looked through the cpjobqueue stat collection but I can't seem to pin down anything that reports concurrency of the job buckets. We can get stats on jobs added, jobs run, etc. But they are all counters and nothing that says how many are concurrently running. Might that data be somewhere I'm just not looking?

> @Pchelolo I've looked through the cpjobqueue stat collection but I can't seem to pin down anything that reports concurrency of the job buckets. We can get stats on jobs added, jobs run, etc. But they are all counters and nothing that says how many are concurrently running. Might that data be somewhere I'm just not looking?

Yeah, we lost this graph in the transition to k8s and Prometheus. It used to be constructed by smartly dividing the rate by the average exec time... We have T255684, which prevents us from reconstructing that metric in Prometheus.

As for running at less than full concurrency: I'll deploy your change https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/641251 tomorrow; let's see if we notice a 33% increase in rate, which would indicate we are running at full concurrency. In the meantime we should definitely reconstruct the concurrency graph.
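For reference, one standard way to reconstruct such a concurrency metric is Little's Law (average concurrency ≈ completion rate × average execution time). A sketch, not necessarily the exact formula the old graph used; the values below are placeholders, not actual cpjobqueue/Prometheus series (see T255684 for why the real average exec time is currently hard to get):

```python
# Sketch of reconstructing average concurrency via Little's Law:
#   avg_concurrency ≈ job completion rate * average execution time
# Both inputs here are assumed placeholder values, not real metrics.

jobs_completed_per_second = 5.0  # e.g. a rate() over a completed-jobs counter (assumed)
avg_exec_time_seconds = 4.0      # e.g. exec-time sum / count over the same window (assumed)

avg_concurrency = jobs_completed_per_second * avg_exec_time_seconds
print(avg_concurrency)           # 20.0 -> would mean the 20 configured workers are saturated
```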

Change 641251 merged by jenkins-bot:
[operations/deployment-charts@master] cpjobqueue: Increase cirrusSearchCheckerJob concurrency

https://gerrit.wikimedia.org/r/641251

Change 641854 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] cirrus: re-enable saneitizer

https://gerrit.wikimedia.org/r/641854

The increase in concurrency does seem to have increased throughput, and we are keeping up with the current rate, but this rate naturally varies over two-week intervals, so it will take some time to see whether we can keep up. Historically (180 days) this process has been very consistent; we will want to see how it performs going forward.

I'm a bit undecided on whether further investigation should occur. Something changed, but we've done some investigation and still don't know what; tracking it down would likely require significant effort. Since this is tied to a DC switchover, any number of things done to eqiad during the switchover are potential culprits. We have a concurrency knob we can turn; it's a bit ham-fisted, but as long as turning up concurrency isn't causing problems it seems we could go with it. One potential worry is that the problem is some sort of downstream resource contention, in which case increasing concurrency might simply be giving this job a larger share of that resource, i.e. an arms race.

I'm leaning towards cranking concurrency as necessary, but open to thoughts.

Change 641854 merged by Ryan Kemper:
[operations/puppet@production] cirrus: re-enable saneitizer

https://gerrit.wikimedia.org/r/641854