Reduce the load of CirrusSearch update jobs on MW jobrunners
Closed, Resolved · Public

Authored By

	dcausse
	May 15 2023, 7:08 PM

Description

Looking at the flame graphs of a jobrunner it appears that CirrusSearch jobs are taking most of the jobrunner resources.

A few ideas to improve the situation:

- verify that ContentHandler::getParserOutputForIndexing() is not asking to render the HTML output on wikidata
  - generate-html is set to false when rendering the output.
- disable the saneitizer for one week and assess the impact
  - if the impact is big, consider lowering the number of parses by making a dedicated profile for wikis like commons and increasing reindex_after_loops from 8 to e.g. 16.
- verify that running the jobs for both eqiad & codfw re-uses the parser output (no double parse)
  - ~~eqiad and codfw writes are done in the same job and they re-use the same documents~~
    - the above statement is wrong: the parser output is actually accessed twice, but I believe that \MediaWiki\Page\ParserOutputAccess::$localCache is being used to avoid a double parse
- consider using memcache (~6 hours TTL) to hold the indexed content so that it can be re-used by subsequent ElasticaWrite jobs running for cloudelastic
  - done in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/920785
  - big impact, see T336698#8890551 and T336698#8891221

AC:

reduce by X% the impact of CirrusSearch jobs on jobrunners

Details

Subject	Repo	Branch	Lines +/-
Help measure the impact of saneitizer jobs	mediawiki/extensions/CirrusSearch	wmf/1.41.0-wmf.11	+18 -1
Help measure the impact of saneitizer jobs	mediawiki/extensions/CirrusSearch	master	+18 -1
Add WANCache to ParserOutputPageProperties::finalize	mediawiki/extensions/CirrusSearch	wmf/1.41.0-wmf.10	+50 -17
Add WANCache to ParserOutputPageProperties::finalize	mediawiki/extensions/CirrusSearch	wmf/1.41.0-wmf.11	+50 -17
Add WANCache to ParserOutputPageProperties::finalize	mediawiki/extensions/CirrusSearch	master	+50 -17

Event Timeline

dcausse created this task. May 15 2023, 7:08 PM

Restricted Application added a subscriber: Aklapper. May 15 2023, 7:08 PM

dcausse updated the task description. May 15 2023, 7:15 PM

Ladsgroup awarded a token. May 16 2023, 9:19 AM

Change 920785 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/CirrusSearch@master] Add WANCache to ParserOutputPageProperties::finalize

https://gerrit.wikimedia.org/r/920785

gerritbot added a project: Patch-For-Review. May 17 2023, 8:25 PM

I don't know the internals of CirrusSearch very well, so the patch might be completely wrong. Sorry if I missed something obvious.

MPhamWMF moved this task from needs triage to Current work on the Discovery-Search board. May 22 2023, 3:27 PM

MPhamWMF edited projects, added Discovery-Search (Current work); removed Discovery-Search.

MPhamWMF moved this task from Incoming to Blocked/Waiting on the Discovery-Search (Current work) board. May 22 2023, 3:30 PM

Change 920785 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Add WANCache to ParserOutputPageProperties::finalize

https://gerrit.wikimedia.org/r/920785

Maintenance_bot removed a project: Patch-For-Review. May 30 2023, 8:30 AM

ReleaseTaggerBot added a project: MW-1.41-notes (1.41.0-wmf.12; 2023-06-06). May 30 2023, 1:01 PM

Change 924568 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/CirrusSearch@wmf/1.41.0-wmf.11] Add WANCache to ParserOutputPageProperties::finalize

https://gerrit.wikimedia.org/r/924568

Change 924569 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/CirrusSearch@wmf/1.41.0-wmf.10] Add WANCache to ParserOutputPageProperties::finalize

https://gerrit.wikimedia.org/r/924569

Change 924568 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@wmf/1.41.0-wmf.11] Add WANCache to ParserOutputPageProperties::finalize

https://gerrit.wikimedia.org/r/924568

Mentioned in SAL (#wikimedia-operations) [2023-05-30T20:30:32Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:924568|Add WANCache to ParserOutputPageProperties::finalize (T336698)]]

Mentioned in SAL (#wikimedia-operations) [2023-05-30T20:32:00Z] <ladsgroup@deploy1002> ladsgroup: Backport for [[gerrit:924568|Add WANCache to ParserOutputPageProperties::finalize (T336698)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-05-30T20:39:59Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:924568|Add WANCache to ParserOutputPageProperties::finalize (T336698)]] (duration: 09m 27s)

Change 924569 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@wmf/1.41.0-wmf.10] Add WANCache to ParserOutputPageProperties::finalize

https://gerrit.wikimedia.org/r/924569

Mentioned in SAL (#wikimedia-operations) [2023-05-30T20:57:02Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:924569|Add WANCache to ParserOutputPageProperties::finalize (T336698)]]

Mentioned in SAL (#wikimedia-operations) [2023-05-30T20:58:36Z] <ladsgroup@deploy1002> ladsgroup: Backport for [[gerrit:924569|Add WANCache to ParserOutputPageProperties::finalize (T336698)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet

ReleaseTaggerBot edited projects, added MW-1.41-notes (1.41.0-wmf.11; 2023-05-30); removed MW-1.41-notes (1.41.0-wmf.12; 2023-06-06). May 30 2023, 9:00 PM

Maintenance_bot removed a project: Patch-For-Review. May 30 2023, 9:10 PM

Awesome: https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=cirrusSearchElasticaWrite&from=1685478996767&to=1685482340256

Concurrency of the job dropped to one third of its baseline.

And CPU usage of the jobrunners:
https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?orgId=1&viewPanel=54&from=1685473013364&to=1685483023742

The flamegraphs, before:
https://performance.wikimedia.org/arclamp/svgs/hourly/2023-05-30_20.excimer.RunSingleJob.svgz
after:
https://performance.wikimedia.org/arclamp/svgs/hourly/2023-05-30_21.excimer.RunSingleJob.svgz

memcached requests related to parsing also went down:
https://grafana.wikimedia.org/d/lqE4lcGWz/wanobjectcache-key-group?orgId=1&from=1685465094984&to=1685486603469&var-kClass=SqlBlobStore_blob
and
https://grafana.wikimedia.org/d/lqE4lcGWz/wanobjectcache-key-group?orgId=1&from=1685465196854&to=1685486722991&var-kClass=revision_row_1_29

dcausse updated the task description. May 31 2023, 8:23 AM

@Ladsgroup the impact is impressive, thanks!
I'm tempted to skip the "disable the saneitizer for one week and assess the impact" idea and to consider the improvements you've made via the use of memcache good enough to achieve the desired outcome on the load of the jobrunners.
Tentatively closing, but please feel free to re-open if you still want us to investigate the impact of the saneitizer on commons.

Thanks. I honestly would like to have some visibility into the load of the saneitizer. If we could find a way to make it show up in the flamegraphs (e.g. by using a dedicated class), that'd be more than enough for me.

In T336698#8891646, @Ladsgroup wrote:

Thanks. I honestly would like to have some visibility into the load of the saneitizer. If we could find a way to make it show up in the flamegraphs (e.g. by using a dedicated class), that'd be more than enough for me.

Mostly to be able to make informed decisions, e.g. if job runners are about to fall apart, we could save x% by turning off saneitizer jobs. Or we might realize that it's only a small portion of the load and that we can safely ignore them, etc.

Change 924904 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/CirrusSearch@master] Help measure the impact of saneitizer jobs

https://gerrit.wikimedia.org/r/924904

gerritbot added a project: Patch-For-Review. May 31 2023, 10:00 AM

In T336698#8891657, @Ladsgroup wrote:

In T336698#8891646, @Ladsgroup wrote:

Thanks. I honestly would like to have some visibility into the load of the saneitizer. If we could find a way to make it show up in the flamegraphs (e.g. by using a dedicated class), that'd be more than enough for me.

Mostly to be able to make informed decisions, e.g. if job runners are about to fall apart, we could save x% by turning off saneitizer jobs. Or we might realize that it's only a small portion of the load and that we can safely ignore them, etc.

Sure, I added a small patch that should create another branch in the flamegraphs when these jobs are processed.
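As a rough illustration of why a dedicated entry point gives these jobs their own branch in a flamegraph, here is a small Python sketch. It is not the merged patch: `elastica_write`, `saneitizer_elastica_write` and `sampled_stack` are hypothetical names, a dedicated function stands in for the dedicated class, and a captured call stack stands in for one profiler sample.

```python
import traceback
from typing import List


def sampled_stack() -> List[str]:
    """Return the function names on the current call stack, oldest first.

    This mimics what a sampling profiler records for one sample.
    The helper's own frame is dropped from the result.
    """
    return [frame.name for frame in traceback.extract_stack()[:-1]]


def elastica_write(doc: dict) -> List[str]:
    # Shared write path: regular updates and saneitizer updates both end up here.
    return sampled_stack()


def saneitizer_elastica_write(doc: dict) -> List[str]:
    # Dedicated entry point: it does no extra work, but it adds its own frame,
    # so samples taken below it group under a separate branch.
    return elastica_write(doc)


regular = elastica_write({"page_id": 1})
saneitizer = saneitizer_elastica_write({"page_id": 1})

print(regular[-1:])      # ['elastica_write']
print(saneitizer[-2:])   # ['saneitizer_elastica_write', 'elastica_write']
```

Because flamegraphs aggregate samples by stack prefix, the extra frame is enough to separate the two kinds of work without changing what the shared code does.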

Awesome. Thanks!

dcausse updated the task description. May 31 2023, 10:32 AM

Change 924904 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Help measure the impact of saneitizer jobs

https://gerrit.wikimedia.org/r/924904

Maintenance_bot removed a project: Patch-For-Review. May 31 2023, 7:10 PM

ReleaseTaggerBot edited projects, added MW-1.41-notes (1.41.0-wmf.12; 2023-06-06); removed MW-1.41-notes (1.41.0-wmf.11; 2023-05-30). May 31 2023, 8:00 PM

Gehel closed this task as Resolved. Jun 2 2023, 9:51 AM

Change 926860 had a related patch set uploaded (by Ladsgroup; author: DCausse):

[mediawiki/extensions/CirrusSearch@wmf/1.41.0-wmf.11] Help measure the impact of saneitizer jobs

https://gerrit.wikimedia.org/r/926860

gerritbot added a project: Patch-For-Review. Jun 5 2023, 10:42 AM

Change 926860 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@wmf/1.41.0-wmf.11] Help measure the impact of saneitizer jobs

https://gerrit.wikimedia.org/r/926860

Mentioned in SAL (#wikimedia-operations) [2023-06-05T22:03:50Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:926860|Help measure the impact of saneitizer jobs (T336698)]]

Mentioned in SAL (#wikimedia-operations) [2023-06-05T22:05:30Z] <ladsgroup@deploy1002> ladsgroup: Backport for [[gerrit:926860|Help measure the impact of saneitizer jobs (T336698)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet

Maintenance_bot removed a project: Patch-For-Review. Jun 5 2023, 10:10 PM

Mentioned in SAL (#wikimedia-operations) [2023-06-05T22:13:39Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:926860|Help measure the impact of saneitizer jobs (T336698)]] (duration: 09m 48s)

ReleaseTaggerBot edited projects, added MW-1.41-notes (1.41.0-wmf.11; 2023-05-30); removed MW-1.41-notes (1.41.0-wmf.12; 2023-06-06). Jun 5 2023, 11:00 PM

