
Reduce the impact of the sanitizer on wikidata
Open, Needs Triage, Public

Description

The sanitizer seems to be a bit aggressive with wikidata, causing significant load on the database (https://phabricator.wikimedia.org/T229407#5635732), because generating the data relies on the ParserOutput.

One immediate solution could be to tune the sanitizer to slow it down for wikidata; this can be achieved by creating a new profile in CirrusSearch/profiles/SaneitizeProfiles.config.php.
Another approach is to refactor and reduce the dependency on the ParserOutput when Cirrus generates the document for Elasticsearch.
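For the first option, a slowed-down profile could be added to the existing profile file. The sketch below is illustrative only: the key names mirror what a saneitizer throttling profile plausibly looks like, but the exact keys and values are assumptions, not a reviewed configuration.

```php
<?php
// Hypothetical "slow" profile for CirrusSearch/profiles/SaneitizeProfiles.config.php.
// Key names and values here are illustrative assumptions for a wiki where
// re-rendering pages is expensive; check the existing 'default' profile for
// the real key set before copying any of this.
return [
	'wikidata-slow' => [
		// Run fewer concurrent checker jobs than the default profile.
		'max_checker_jobs' => 1,
		// Check fewer page ids per job iteration.
		'jobs_chunk_size' => 10,
		// Stretch one full pass over the wiki across a longer window.
		'min_loop_duration' => 16 * 7 * 24 * 3600, // 16 weeks, in seconds
	],
];
```

The trade-off is directly visible in the last key: a longer loop duration means less database pressure per hour, but a longer wait before a field change reaches every document.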
Discussion:

16:15 <Amir1> So regarding writing some code, if there's some documentation, I might dig into it and do it
16:17 <addshore> Which bits of parser output does it need?
16:25 <+dcausse> addshore: it needs it for wikipage properties (categories/external links/...) these are maybe useless for wikidata
16:26 <+dcausse> the problematic interface is \ContentHandler::getDataForSearchIndex that takes the ParserOutput as an argument
16:30 <+dcausse> this would have to be changed to load the ParserOutput just when needed and from EntityHandler stop calling parent::getDataForSearchIndex but feed the base properties needed by cirrus from something else
16:30 <+dcausse> properties needed: https://gerrit.wikimedia.org/g/mediawiki/core/+/2b04ef66576439b9ace37f1f25de7967abcb1356/includes/content/ContentHandler.php#1321
16:31 <addshore> okay!
16:33 <addshore> Indeed, so the bit in ContentHandler::getDataForSearchIndex uses a ParserOutputSearchDataExtractor and thus the parseroutput
16:34 <addshore> i just had a quick look through the wikibase specific index things and nothing there uses parser output
16:34  * addshore looks at what calls getDataForSearchIndex
16:34 <+dcausse> the thing that flattens the entity data into the text field is very important tho
16:34 <+dcausse> but probably a code available directly from wikibase
16:35 <addshore> Yup, that's fine, that doesn't need parser output
16:35 <addshore> So, CirrusSearch/includes/Updater.php calls getDataForSearchIndex
16:35 <+dcausse> yes this one will have to change as well
16:35 <addshore> going back further, it does $output = $contentHandler->getParserOutputForIndexing( $page, $parserCache );
16:36 <+ebernhardson> addshore: you have an old version btw, thats now in CirrusSearch/includes/BuildDocument/something
16:36  * addshore pulls :P
16:38 <addshore> So....
16:38 <addshore> ContentHandler::getParserOutputForIndexing
16:38 <addshore> Calls, $renderer->getRenderedRevision->getRevisionParserOutput
16:38 <addshore> And that has
16:38 <addshore> @param array $hints Hints given as an associative array. Known keys:
16:38 <addshore> 	 *      - 'generate-html' => bool: Whether the caller is interested in output HTML (as opposed
16:38 <addshore> 	 *        to just meta-data). Default is to generate HTML.
16:39 <addshore> that, could, maybe, be something to think about
16:40 <Amir1> Maybe Update.php buildDocument can set "skipParse" for wikibase to true?
16:41 <addshore> Well, i think it still needs a "parse" and the meta data from it, for links and things?
16:41 <addshore> but it probably doesnt care about the actual html output, but i need to verify that
16:42 <addshore> it looks at categories, external links, outgoing links, templates, text, source_text, text_bytes, content_model
16:45 <addshore> This path as far as I can see is the only thing that uses getParserOutputForIndexing too
16:46 <addshore> So, as long as the things listed above don't need any part of the html, we can add that hint and stop generating it probably :)
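The 'generate-html' hint discussed in the log could plausibly be wired in roughly as below. This is PHP-shaped pseudocode against MediaWiki interfaces, not tested code: the method and hint names come from the discussion above, but the exact wiring (how the RevisionRecord and ParserOptions are obtained) is an assumption.

```php
// PHP-shaped pseudocode, untested: an EntityHandler-style override that asks
// the revision renderer to skip HTML generation, keeping only the metadata
// Cirrus reads (categories, external links, outgoing links, templates,
// text_bytes, content_model, ...).
public function getParserOutputForIndexing( WikiPage $page, ParserCache $cache = null ) {
	$renderer = MediaWikiServices::getInstance()->getRevisionRenderer();
	$renderedRevision = $renderer->getRenderedRevision(
		$page->getRevisionRecord(),
		$page->makeParserOptions( 'canonical' )
	);
	// The hint quoted from the docblock above: 'generate-html' => false asks
	// for just the meta-data side of the ParserOutput, no HTML.
	return $renderedRevision->getRevisionParserOutput( [ 'generate-html' => false ] );
}
```

Whether Wikibase's ParserOutput metadata (links, categories, page properties) is still fully populated when HTML generation is skipped would need to be verified, as addshore notes.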

Details

Related Gerrit Patches:
operations/mediawiki-config : master | Disable sanity check cirrus jobs for Wikidata

Event Timeline

dcausse created this task. · Dec 5 2019, 5:01 PM
Restricted Application added a project: Discovery-Search. · Dec 5 2019, 5:01 PM
Restricted Application added a subscriber: Aklapper.
dcausse updated the task description. · Dec 5 2019, 5:02 PM
dcausse added subscribers: Addshore, Ladsgroup.
Addshore moved this task from incoming to monitoring on the Wikidata board.Dec 9 2019, 8:47 AM
Addshore awarded a token.

Change 555941 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[operations/mediawiki-config@master] Disable sanity check cirrus jobs for Wikidata

https://gerrit.wikimedia.org/r/555941

Change 555941 merged by jenkins-bot:
[operations/mediawiki-config@master] Disable sanity check cirrus jobs for Wikidata

https://gerrit.wikimedia.org/r/555941

Mentioned in SAL (#wikimedia-operations) [2019-12-09T14:26:23Z] <ladsgroup@deploy1001> Synchronized wmf-config/InitialiseSettings.php: [[gerrit:555941|Disable sanity check cirrus jobs for Wikidata (T239931 T229407)]] (duration: 00m 57s)

EBernhardson added a subscriber: EBernhardson. · Edited · Dec 9 2019, 11:12 PM

To be clear, we do not have the ability to reindex wikidata from scratch due to its size, so turning off the sanity checker means no new fields can be added to wikidata. Additionally, any change to the way fields are defined (such as changing the list of whitelisted properties in search) will effectively never reach 100% deployment.

Is it really too much to ask wikidata to render all known items once every eight weeks? If it is, that seems like a much deeper problem than the saneitizer.

Hey, thanks for the comment. I'm planning to turn it back on ASAP. Right now we are in the middle of the migration and it puts too much pressure on s8; once the migration of wb_terms is over, we can turn this back on again. That's going to happen in two months (hopefully).

I have two questions:

  • If someone makes an edit or creates a new item, the index still gets updated, but a new feature (let's assume, for example, "number of claims" gets added to the search index) does not propagate into the system. Is that correct?
  • If the above statement is correct, how often do you change the index structure, meaning you need to run reindexing? When was the last time such a change was needed?

If you change it quite often, let us know.

The best solution IMO is not to make this less aggressive; it's to stop rendering the HTML of the items, which is a very heavy job for wikidata (unlike Wikipedia pages). Doing it is not super hard but I'm not super sure where to start. I might pick this up to see what I can do.

> The best solution IMO is not to make this less aggressive, it's to stop rendering html of the items which is very heavy job for wikidata (unlike Wikipedia pages). Doing it is not super hard but I'm not super sure where to start. I might pick this up to see what I can do.

+1, this would probably make the whole process go a lot faster too!

EBernhardson added a comment. · Edited · Dec 10 2019, 4:35 PM

> Hey, Thanks for the comment. I'm planning to turn it back on ASAP, right now, we are at middle of the migration and it puts too much pressure on s8, once the migration of wb_terms is over, we can turn this back on again. That's going to happen in two months (hopefully).
> I have two questions:
>
>   • If someone makes an edit or makes a new item still the index gets updated but adding new feature (let's assume for example "number of claims" gets added to the search index) is not propagating into the system. Is that correct?

Edits all propagate with perhaps a few minutes of delay. So if the way some field is generated is changed, or a new property is added, that gets updated on a standard edit. What happens, though, is that many pages do not get edited. When we first rolled out the automated saneitizer there were millions of pages that still did not have properties added several years prior.

>   • If the above statement is correct, how often do you change the index structure, meaning you need the run reindexing? When was the last time this change was needed?

It's not necessarily the indexing structure, but any change to the way searchable properties are rendered. There was a ticket (T239950) filed about 5 days ago to request changing the way some wikidata properties are rendered. The only way to roll out a change like that is to regenerate the pages and ship them to elasticsearch. You could imagine how much worse the load would be if we had to re-render all wikidata items from a maintenance script instead of an automated process that slowly does it over 8 weeks. I'm not sure when the last change that affected wikidata was, but in general there are probably a few updates a quarter that affect the search document rendering.
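A back-of-the-envelope check makes the point about the 8-week loop concrete. The item count below is an assumption (on the order of 75 million items, roughly Wikidata's size in late 2019); it is not stated anywhere in this task.

```php
<?php
// How fast must the saneitizer loop to cover all of Wikidata in its 8-week
// cycle? The item count is an assumption, not a figure from this task.
$items = 75000000;                  // assumed: ~75M items circa late 2019
$loopSeconds = 8 * 7 * 24 * 3600;   // one full saneitizer pass: 8 weeks
$pagesPerSecond = $items / $loopSeconds;
printf( "~%.1f pages re-rendered per second\n", $pagesPerSecond );
```

At that rate the re-render cost is spread thin across the whole cycle, which is the point of the loop; a one-off maintenance script would concentrate the same total work into a much shorter window.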

> If you change it quite often. let us know.
> The best solution IMO is not to make this less aggressive, it's to stop rendering html of the items which is very heavy job for wikidata (unlike Wikipedia pages). Doing it is not super hard but I'm not super sure where to start. I might pick this up to see what I can do.

Sounds like some kind of parse flag? I'm not too familiar with those interfaces. Separately, CirrusSearch tends to assume that the ParserCache has an anonymously rendered version of all current pages somewhere. Is Cirrus somehow getting a different cache key than anonymous page views?

> Sounds like some kind of parse flag? I'm not too familiar with those interfaces. Separately, CirrusSearch tends to assume that the ParserCache has an anonymously rendered version of all current pages somewhere. Is Cirrus somehow getting a different cache key than anonymous page views?

I don't think we should assume that PC has all pages of wikidata rendered. Given the unique size and concept of wikidata compared to other wikis, direct page views are low but secondary usages (WDQS, client entity usage, API) are high. Combine that with the fact that rendering item HTML is resource-intensive (it needs to load lots of terms from other items), and we have this problem at hand now.

The search index on wikidata doesn't need to render the page, so I think fixing it to not use the HTML part of ParserOutput should not be that hard. I will check.

> I don't think we should assume that PC has all pages of wikidata rendered

Yes that would be a bad assumption.

> is Cirrus somehow getting a different cache key than anonymous page views?

Anon cache keys are for example:

wikidatawiki:pcache:idhash:402-0!termboxVersion=1!wb=3 and timestamp 20191118131034 and revision id 997124603

T229407#5714974 reports the cache key of:

wikidatawiki:pcache:idoptions:55899217

which points to

wikidatawiki:pcache:idhash:10017433-0!termboxVersion=1!wb=3

which looks the same :)

Not adding this to the Wikidata-Campsite during our intake of terms-related tasks as we (wmde) probably won't be doing this.