
Restore CirrusSearch saneitizer to production usage
Closed, Resolved · Public

Description

As part of T295705 the saneitizer was turned off to reduce load on the cirrus ingestion pipelines while dealing with a separate issue. When attempting to turn this functionality back on, we found the system unable to keep up with the requested write load.

AC:

  • Saneitizer runs on a regular schedule

Event Timeline

What's been done so far:

  • T300914: cpjobqueue configuration has been adjusted to increase throughput. The increased throughput was enough to stop the backlog from growing, but not enough to clear it.
  • https://gerrit.wikimedia.org/r/765577 : CirrusSearch was adjusted to turn three jobs into one, performing the same amount of work with fewer jobs inserted into the job queue (see the sketch after this list). This was enough to clear the backlog of jobs, but not enough to turn the system back on.
  • T302620: Turning three jobs into one exposed duplicate work between them; that ticket reduces the duplication, cutting the overall time job runners spend in cirrus jobs. While it did reduce time in the job runners, it doesn't seem to have influenced throughput much.
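As an illustration of the batching change, here is a minimal, hypothetical sketch (not CirrusSearch's actual job classes) of how combining three queue inserts into one keeps the total work constant while cutting job-queue volume:

```python
# A minimal, hypothetical sketch of the batching idea behind
# https://gerrit.wikimedia.org/r/765577 -- not CirrusSearch's actual job
# classes. One combined insert carries the payloads that previously needed
# three separate jobs, so the queue handles a third of the inserts for the
# same total work.
from dataclasses import dataclass, field
from typing import List


@dataclass
class CombinedCheckerJob:
    """One queue insert covering work previously spread over three jobs."""
    pages_to_reindex: List[int] = field(default_factory=list)
    pages_to_delete: List[int] = field(default_factory=list)
    pages_to_recheck: List[int] = field(default_factory=list)


def enqueue_old(queue, reindex, delete, recheck):
    # Old shape: three inserts per checker batch.
    queue.append({"type": "reindex", "pages": reindex})
    queue.append({"type": "delete", "pages": delete})
    queue.append({"type": "recheck", "pages": recheck})


def enqueue_new(queue, reindex, delete, recheck):
    # New shape: a single insert carrying the same work.
    queue.append(CombinedCheckerJob(reindex, delete, recheck))


old, new = [], []
enqueue_old(old, [1, 2], [3], [4])
enqueue_new(new, [1, 2], [3], [4])
print(len(old), len(new))  # 3 vs 1 queue inserts for the same pages
```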

Current thoughts:

  • My current suspicion is that we are primarily throughput-limited by job distribution from kafka to the mw job runners, and that the next step to increase throughput would be to split the topic into multiple partitions so that the work can be distributed across multiple runners (see the sketch after this list).
  • @Ottomata If we wanted to partition these jobs, particularly <dc>.mediawiki.job.cirrusSearchLinksUpdate, into random buckets, does anything special need to be done here? We've previously used exact partitioning (deciding the partition number from a value in the event, such as a database name) for <dc>.mediawiki.job.cirrusSearchElasticaWrite, and that had to be done through cpjobqueue, but in this case we are looking for a less specialized partitioning.
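To make the question concrete, here is an illustrative sketch of the difference between exact, key-based partitioning and the random bucketing we are after, assuming the confluent-kafka Python client; it is not how cpjobqueue actually produces these events, and the broker address and payloads are placeholders:

```python
# Illustrative only -- not how cpjobqueue produces these events. Shows the
# difference between "exact" key-based partitioning and keyless (random)
# bucketing with the confluent-kafka client.
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder brokers

TOPIC = "eqiad.mediawiki.job.cirrusSearchLinksUpdate"  # assumes a multi-partition topic

# Exact partitioning (the cirrusSearchElasticaWrite approach): the key is taken
# from a value in the event (e.g. the wiki database name), so all events for
# that wiki hash to the same partition.
producer.produce(TOPIC, key=b"enwiki", value=b'{"page_id": 123}')

# Random bucketing: with no key, librdkafka's default partitioner spreads
# messages across partitions, so any runner can pick up any slice of the work.
producer.produce(TOPIC, value=b'{"page_id": 456}')

producer.flush()
```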

Change 768769 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Cut saneitizer re-indexing rate in half

https://gerrit.wikimedia.org/r/768769

Once the patch to cut the re-indexing rate is merged and deployed, we will need to test some manual invocations again. Based on the performance of previous manual invocations, I suspect this is enough to get us to a workable rate, but it's not certain.
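For a rough sense of why halving the rate should help, here is a back-of-the-envelope sketch; all numbers are illustrative placeholders, not the values in SaneitizeProfiles.config.php:

```python
# Back-of-the-envelope sketch of why halving the re-indexing rate halves the
# write load the job queue must sustain. All numbers are illustrative
# placeholders, not the values in SaneitizeProfiles.config.php.

def jobs_per_hour(pages_total: int, cycle_hours: int, pages_per_job: int) -> float:
    """Jobs that must be enqueued per hour to touch every page once per cycle."""
    pages_per_hour = pages_total / cycle_hours
    return pages_per_hour / pages_per_job


baseline = jobs_per_hour(pages_total=50_000_000, cycle_hours=336, pages_per_job=10)
halved = baseline / 2  # halving the re-index rate halves the required enqueue rate
print(f"baseline ≈ {baseline:.0f} jobs/hour, halved ≈ {halved:.0f} jobs/hour")
```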

Change 768769 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Cut saneitizer re-indexing rate in half

https://gerrit.wikimedia.org/r/768769

Change 770056 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@wmf/1.38.0-wmf.25] Cut saneitizer re-indexing rate in half

https://gerrit.wikimedia.org/r/770056

Change 770056 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@wmf/1.38.0-wmf.25] Cut saneitizer re-indexing rate in half

https://gerrit.wikimedia.org/r/770056

Mentioned in SAL (#wikimedia-operations) [2022-03-14T20:45:44Z] <ebernhardson@deploy1002> Synchronized php-1.38.0-wmf.25/extensions/CirrusSearch/profiles/SaneitizeProfiles.config.php: Backport: [[gerrit:770056|Cut saneitizer re-indexing rate in half (T302733)]] (duration: 00m 49s)

In a test run the backlog stayed under a minute and the run finished well under the two hours we expect. While it wasn't the ideal solution, cutting the indexing rate in half seems to have gotten us to a workable state.

Change 771076 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] cirrus: Reenable saneitizer

https://gerrit.wikimedia.org/r/771076

Change 771076 merged by Bking:

[operations/puppet@production] cirrus: Reenable saneitizer

https://gerrit.wikimedia.org/r/771076

This has been turned on for a few hours and is looking acceptable. If everything still looks good after the weekend, we should be ready to declare success.

The over-weekend graphs look good: backlogs are not building up and the saneitizer is progressing through its regular checks.