
The saneitizer is a lot slower than when running in codfw (oct 27 2020 codfw -> eqiad switchover)
Closed, Resolved · Public · 8 Estimated Story Points

Description

As a CirrusSearch maintainer I want the cirrusSearchCheckerJob jobs to be handled in time so that they don't pile up indefinitely.

Since the switch to eqiad the cirrusSearchCheckerJob seems to be slower; the rate of IDs being checked dropped drastically just after the switch:

saneitizer.png (232×1 px, 78 KB)

The impact is that the cirrusSearchCheckerJob jobs do pile up.

AC:

  • find the cause of the slowdown
  • the backlog of cirrusSearchCheckerJob jobs should remain sane
  • re-enable the saneitizer after resolving the performance issue(s)

Event Timeline

Restricted Application added a subscriber: Aklapper.
Aklapper renamed this task from The saneitizer is a lot slower than when running in codfw (oct 27 2020 codfw -> eqiad switchover) to The sanitizer is a lot slower than when running in codfw (oct 27 2020 codfw -> eqiad switchover). · Oct 29 2020, 11:55 AM
CBogen set the point value for this task to 8. · Nov 2 2020, 6:33 PM
Gehel renamed this task from The sanitizer is a lot slower than when running in codfw (oct 27 2020 codfw -> eqiad switchover) to The saneitizer is a lot slower than when running in codfw (oct 27 2020 codfw -> eqiad switchover). · Nov 2 2020, 6:34 PM

I assume the Grafana panel in the screenshot is Elasticsearch Indexing / Saneitizer, but that graph now shows 0 ops since November 7th? (Except for a brief burst of activity on the 9th/10th, around midnight UTC.)

It had to be turned off again because the job queue was growing. The brief burst in activity is me manually triggering the process that injects jobs into the queue every other hour; unfortunately it needed 3 hours to process 2 hours' worth of jobs. It's not particularly clear what's going on here that suddenly made things slower. The likely next step is to increase the job's concurrency from 20 to something larger (30?).
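As a rough sketch of that sizing, assuming throughput scales roughly linearly with worker concurrency (which only holds if the workers themselves, and not a downstream service, are the bottleneck):

```python
# Back-of-the-envelope sizing for the concurrency increase discussed above.
# Assumption: throughput scales linearly with worker concurrency.

current_concurrency = 20     # current cpjobqueue concurrency for cirrusSearchCheckerJob
work_injected_hours = 2      # the manual run injected 2 hours' worth of jobs...
wall_clock_hours = 3         # ...and needed 3 hours to drain them

# To keep up, jobs must be processed at least as fast as they are injected.
required_concurrency = current_concurrency * wall_clock_hours / work_injected_hours
print(required_concurrency)  # 30.0 -> matches the "20 to something larger (30?)" suggestion
```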

Change 641251 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[operations/deployment-charts@master] cpjobqueue: Increase cirrusSearchCheckerJob concurrency

https://gerrit.wikimedia.org/r/641251

Per the graphs above, we used to see a consistent 400-600 ops; now we see fairly dramatic fluctuations between 50 and 400 ops.

We can get some estimation of what the jobs are spending their time on by looking at the flame graphs. For reference, these come from excimer/arclamp and are constructed by collecting stack traces from running processes. We can estimate, in aggregate, where code is spending its time by seeing which functions show up most often in those stack traces.
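As a minimal sketch of that aggregation idea (this is not the actual excimer/arclamp pipeline; the stack format below is a hypothetical stand-in):

```python
from collections import Counter

# Minimal sketch of how sampled stack traces turn into the per-frame numbers
# in the tables below. Each sample is a hypothetical leaf-last list of frames.
samples = [
    ["JobRunner::run", "CheckerJob::doJob", "CheckerJob::checkExistingPage", "RevisionRecord::getContent"],
    ["JobRunner::run", "CheckerJob::doJob", "CheckerJob::loadPagesFromDB"],
    ["JobRunner::run", "CheckerJob::doJob", "CheckerJob::checkExistingPage", "WikiPage::loadLastEdit"],
]

# Count how many samples contain each frame (set() so a frame is counted once per sample).
frame_counts = Counter(frame for stack in samples for frame in set(stack))
total = len(samples)

for frame, count in frame_counts.most_common():
    # Every sample here contains CheckerJob::doJob, so "share of all samples"
    # corresponds to the "Relative %" column (share relative to doJob) below.
    print(f"{frame}: {count} samples, {100 * count / total:.1f}% of samples")
```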

2020-10-26

| Nesting | Stack frame | Samples | Absolute % | Relative % |
| --- | --- | --- | --- | --- |
|  | CheckerJob::doJob | 1892 | 0.52% | 100% |
| * | CheckerJob::loadPagesFromDB | 358 | 0.10% | 19.2% |
| * | CheckerJob::checkExistingPage | 1157 | 0.32% | 61.5% |
| ** | RevisionRecord::getContent | 781 | 0.21% | 40.3% |
| ** | WikiPage::loadLastEdit | 329 | 0.09% | 17.3% |

2020-10-28

| Nesting | Stack frame | Samples | Absolute % | Relative % |
| --- | --- | --- | --- | --- |
|  | CheckerJob::doJob | 966 | 0.29% | 100% |
| * | CheckerJob::checkExistingPage | 583 | 0.17% | 58.6% |
| ** | RevisionRecord::getContent | 375 | 0.09% | 31.0% |

Notably, we only have about half as many samples on 10-28 as on 10-26. This doesn't match our expectations: if the job queue were running these jobs at full concurrency (20 processes), we would expect more samples on 10-28, since on 10-26 the queue would empty and there would be periods of time when the job wasn't running at all. Jobs running slower would not result in fewer samples, as these samples reflect stack traces of whatever happens to be running.
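To make that concrete, a purely illustrative model; the sampling interval, the per-process sampling assumption, and the worker counts below are assumptions, not measurements:

```python
# Illustrative model: the number of samples attributed to a job class is roughly
# proportional to how many processes are running it at any moment, not to how
# long each individual job takes. All numbers are assumed for illustration.

sampling_interval_s = 60       # assumed wall-clock sampling period per running process
window_s = 24 * 3600           # one day of samples

def expected_samples(avg_concurrent_workers: float) -> float:
    """Expected sample count for a job class over the window."""
    return avg_concurrent_workers * window_s / sampling_interval_s

print(expected_samples(20))    # queue never empties: more samples
print(expected_samples(10))    # queue empty half the time: fewer samples
```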

The overall shape of the flame graph stayed approximately the same, although due to the lower frequency we lose some of the smaller portions. This suggests to me that nothing fundamentally changed in the latency of the services the job communicates with, although that isn't 100% certain.

@Pchelolo I've looked through the cpjobqueue stat collection but I can't seem to pin down anything that reports concurrency of the job buckets. We can get stats on jobs added, jobs run, etc. But they are all counters and nothing that says how many are concurrently running. Might that data be somewhere I'm just not looking?

> @Pchelolo I've looked through the cpjobqueue stat collection but I can't seem to pin down anything that reports concurrency of the job buckets. We can get stats on jobs added, jobs run, etc. But they are all counters and nothing that says how many are concurrently running. Might that data be somewhere I'm just not looking?

Yeah, we lost this graph in the transition to k8s and Prometheus. It used to be constructed by smartly dividing the rate by the average exec time... We have T255684, which prevents us from reconstructing that metric in Prometheus.

As for running at less than full concurrency: I'll deploy your change https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/641251 tomorrow; let's see if we notice a 33% increase in rate, which would indicate we are running at full concurrency. In the meantime we should definitely reconstruct the concurrency graph.
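For reference, one standard way to reconstruct such a concurrency metric is Little's Law (average concurrency ≈ completion rate × average execution time). A sketch, not necessarily the exact formula the old graph used; the values below are placeholders, not actual cpjobqueue/Prometheus series (see T255684 for why the real average exec time is currently hard to get):

```python
# Sketch of reconstructing average concurrency via Little's Law:
#   avg_concurrency ≈ job completion rate * average execution time
# Both inputs here are assumed placeholder values, not real metrics.

jobs_completed_per_second = 5.0  # e.g. a rate() over a completed-jobs counter (assumed)
avg_exec_time_seconds = 4.0      # e.g. exec-time sum / count over the same window (assumed)

avg_concurrency = jobs_completed_per_second * avg_exec_time_seconds
print(avg_concurrency)           # 20.0 -> would mean the 20 configured workers are saturated
```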

Change 641251 merged by jenkins-bot:
[operations/deployment-charts@master] cpjobqueue: Increase cirrusSearchCheckerJob concurrency

https://gerrit.wikimedia.org/r/641251

Change 641854 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] cirrus: re-enable saneitizer

https://gerrit.wikimedia.org/r/641854

The increase in concurrency does seem to have increased throughput, and we are keeping up with the current rate, but this rate naturally varies over two-week intervals, so it will take some time to see whether we can keep up. Historically (180 days) this process has been very consistent; we will want to see how it performs going forward.

I'm a bit undecided on whether further investigation should occur. Something changed, but we've done some investigation and still don't know what; tracking it down would likely require significant effort. Since this is tied to a DC switchover, any number of things done to eqiad during the switchover are potential culprits. We have a concurrency knob we can turn; it's a bit ham-fisted, but as long as turning up concurrency isn't causing problems it seems we could go with it. One potential worry is that the problem is some sort of downstream resource contention, in which case increasing concurrency might simply be giving this job a larger share of that resource, i.e. an arms race.

I'm leaning towards cranking concurrency as necessary, but open to thoughts.

Change 641854 merged by Ryan Kemper:
[operations/puppet@production] cirrus: re-enable saneitizer

https://gerrit.wikimedia.org/r/641854