
Reduce the impact of the sanitizer on wikidata
Closed, Resolved · Public · 2 Estimated Story Points

Description

The sanitizer seems to be a bit aggressive with wikidata, causing significant load on the database (https://phabricator.wikimedia.org/T229407#5635732), because generating the search data relies on the ParserOutput.

One immediate solution could be to tune the sanitizer to slow it down for wikidata; this can be achieved by creating a new profile in CirrusSearch/profiles/SaneitizeProfiles.config.php (a sketch follows below).
Another approach is to refactor and reduce the dependency on the ParserOutput when Cirrus generates the document for Elasticsearch.
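
For illustration, the first option might look something like the entry below, added to the profiles array that SaneitizeProfiles.config.php returns. The key names mirror the existing profiles in that file as far as I can tell; the values are placeholders, not tuned recommendations.

```
// Hypothetical addition to CirrusSearch/profiles/SaneitizeProfiles.config.php;
// illustrative only, merged into the existing profiles array.
'wikidata' => [
	// Enqueue fewer checker jobs per cycle.
	'max_checker_jobs' => 1,
	// Check fewer pages per job.
	'jobs_chunk_size' => 5,
	// Stretch a full pass over the wiki to at least 16 weeks.
	'min_loop_duration' => 16 * 7 * 24 * 3600,
],
```

The wiki would then have to be pointed at the new profile via CirrusSearch configuration; the exact setting for that is not covered here.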
Discussion:

16:15 <Amir1> So regarding writing some code, if there's some documentation, I might dig into it and do it
16:17 <addshore> Which bits of parser output does it need?
16:25 <+dcausse> addshore: it needs it for wikipage properties (categories/external links/...); these are maybe useless for wikidata
16:26 <+dcausse> the problematic interface is \ContentHandler::getDataForSearchIndex that takes the ParserOutput as an argument
16:30 <+dcausse> this would have to be changed to load the ParserOutput just when needed and from EntityHandler stop calling parent::getDataForSearchIndex but feed the base properties needed by cirrus from something else
16:30 <+dcausse> properties needed: https://gerrit.wikimedia.org/g/mediawiki/core/+/2b04ef66576439b9ace37f1f25de7967abcb1356/includes/content/ContentHandler.php#1321
16:31 <addshore> okay!
16:33 <addshore> Indeed, so the bit in ContentHandler::getDataForSearchIndex uses a ParserOutputSearchDataExtractor and thus the ParserOutput
16:34 <addshore> i just had a quick look through the wikibase specific index things and nothing there uses parser output
16:34  * addshore looks at what calls getDataForSearchIndex
16:34 <+dcausse> the thing that flattens the entity data into the text field is very important tho
16:34 <+dcausse> but that's probably code available directly from wikibase
16:35 <addshore> Yup, that's fine, that doesn't need parser output
16:35 <addshore> So, CirrusSearch/includes/Updater.php calls getDataForSearchIndex
16:35 <+dcausse> yes this one will have to change as well
16:35 <addshore> going back further, it does $output = $contentHandler->getParserOutputForIndexing( $page, $parserCache );
16:36 <+ebernhardson> addshore: you have an old version btw, that's now in CirrusSearch/includes/BuildDocument/something
16:36  * addshore pulls :P
16:38 <addshore> So....
16:38 <addshore> ContentHandler::getParserOutputForIndexing
16:38 <addshore> Calls $renderer->getRenderedRevision()->getRevisionParserOutput()
16:38 <addshore> And that has
16:38 <addshore> @param array $hints Hints given as an associative array. Known keys:
16:38 <addshore>     - 'generate-html' => bool: Whether the caller is interested in output HTML (as opposed
16:38 <addshore>       to just meta-data). Default is to generate HTML.
16:39 <addshore> that, could, maybe, be something to think about
16:40 <Amir1> Maybe Updater.php buildDocument can set "skipParse" for wikibase to true?
16:41 <addshore> Well, i think it still needs a "parse" and the metadata from it, for links and things?
16:41 <addshore> but it probably doesn't care about the actual html output, but i need to verify that
16:42 <addshore> it looks at categories, external links, outgoing links, templates, text, source_text, text_bytes, content_model
16:45 <addshore> This path, as far as I can see, is the only thing that uses getParserOutputForIndexing too
16:46 <addshore> So, as long as the things listed above don't need any part of the html, we can add that hint and stop generating it probably :)
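
To make the hint quoted above concrete, here is a minimal sketch, assuming a RevisionRecord $revision and ParserOptions $options are already in hand (the service and method names are from MediaWiki core; the surrounding code is illustrative):

```
use MediaWiki\MediaWikiServices;

$renderer = MediaWikiServices::getInstance()->getRevisionRenderer();
$renderedRevision = $renderer->getRenderedRevision( $revision, $options );

// Ask only for the metadata the indexer consumes (categories, links,
// templates, page properties) and skip the expensive HTML rendering.
$parserOutput = $renderedRevision->getRevisionParserOutput( [
	'generate-html' => false,
] );
```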

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change 555941 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[operations/mediawiki-config@master] Disable sanity check cirrus jobs for Wikidata

https://gerrit.wikimedia.org/r/555941

Change 555941 merged by jenkins-bot:
[operations/mediawiki-config@master] Disable sanity check cirrus jobs for Wikidata

https://gerrit.wikimedia.org/r/555941

Mentioned in SAL (#wikimedia-operations) [2019-12-09T14:26:23Z] <ladsgroup@deploy1001> Synchronized wmf-config/InitialiseSettings.php: [[gerrit:555941|Disable sanity check cirrus jobs for Wikidata (T239931 T229407)]] (duration: 00m 57s)

To be clear, we do not have the ability to reindex wikidata from scratch due to size, so turning off the sanity checker means no new fields can be added to wikidata. Additionally any change to the way fields are defined (such as changing the list of whitelisted properties in search) will effectively never reach 100% deployment.

Is it really too much to ask wikidata to render all known items once every eight weeks? If not, this seems like a much deeper problem than the saneitizer.

Hey, thanks for the comment. I'm planning to turn it back on ASAP. Right now we are in the middle of the migration and it puts too much pressure on s8; once the migration of wb_terms is over, we can turn this back on again. That's going to happen in two months (hopefully).

I have two questions:

  • If someone makes an edit or creates a new item, the index still gets updated, but a new feature (let's assume, for example, "number of claims" gets added to the search index) does not propagate into the system. Is that correct?
  • If the above statement is correct, how often do you change the index structure, meaning you need to run reindexing? When was the last time such a change was needed?

If you change it quite often, let us know.

The best solution IMO is not to make this less aggressive; it's to stop rendering the HTML of the items, which is a very heavy job for wikidata (unlike Wikipedia pages). Doing it is not super hard, but I'm not sure where to start. I might pick this up to see what I can do.

> The best solution IMO is not to make this less aggressive; it's to stop rendering the HTML of the items, which is a very heavy job for wikidata (unlike Wikipedia pages). Doing it is not super hard, but I'm not sure where to start. I might pick this up to see what I can do.

+1, this would probably make the whole process go a lot faster too!

> Hey, thanks for the comment. I'm planning to turn it back on ASAP. Right now we are in the middle of the migration and it puts too much pressure on s8; once the migration of wb_terms is over, we can turn this back on again. That's going to happen in two months (hopefully).

> I have two questions:

> • If someone makes an edit or creates a new item, the index still gets updated, but a new feature (let's assume, for example, "number of claims" gets added to the search index) does not propagate into the system. Is that correct?

Edits all propagate with perhaps a few minutes of delay. So if the way some field is generated is changed, or a new property is added, that gets updated on a standard edit. What happens, though, is that many pages do not get edited. When we first rolled out the automated saneitizer there were millions of pages that were still missing properties added several years earlier.

> • If the above statement is correct, how often do you change the index structure, meaning you need to run reindexing? When was the last time such a change was needed?

It's not necessarily the indexing structure, but any change to the way searchable properties are rendered. There was a ticket (T239950) filed about 5 days ago to request changing the way some wikidata properties are rendered. The only way to roll out a change like that is to regenerate the pages and ship them to elasticsearch. You can imagine how much worse the load would be if we had to re-render all wikidata items from a maintenance script instead of an automated process that slowly does it over 8 weeks. I'm not sure when the last change that affected wikidata was, but in general there are probably a few updates a quarter that affect the search document rendering.

> If you change it quite often, let us know.

> The best solution IMO is not to make this less aggressive; it's to stop rendering the HTML of the items, which is a very heavy job for wikidata (unlike Wikipedia pages). Doing it is not super hard, but I'm not sure where to start. I might pick this up to see what I can do.

Sounds like some kind of parse flag? I'm not too familiar with those interfaces. Separately, CirrusSearch tends to assume that the ParserCache has an anonymously rendered version of all current pages somewhere. Is Cirrus somehow getting a different cache key than anonymous page views?

> Sounds like some kind of parse flag? I'm not too familiar with those interfaces. Separately, CirrusSearch tends to assume that the ParserCache has an anonymously rendered version of all current pages somewhere. Is Cirrus somehow getting a different cache key than anonymous page views?

I don't think we should assume that PC has all pages of wikidata rendered. Given the unique size and concept of wikidata compared to other wikis, direct page views are low but secondary usage (WDQS, client entity usage, API) is high. Combine that with the fact that rendering item HTML is resource-intensive (it needs to load lots of terms from other items) and we have this problem at hand now.

The search index on wikidata doesn't need the rendered page, so I think fixing it to not use the HTML part of ParserOutput should not be that hard. I will check.

> I don't think we should assume that PC has all pages of wikidata rendered

Yes that would be a bad assumption.

> Is Cirrus somehow getting a different cache key than anonymous page views?

Anon cache keys are for example:

wikidatawiki:pcache:idhash:402-0!termboxVersion=1!wb=3 and timestamp 20191118131034 and revision id 997124603

T229407#5714974 reports the cache key of:

wikidatawiki:pcache:idoptions:55899217

which points to

wikidatawiki:pcache:idhash:10017433-0!termboxVersion=1!wb=3

which looks the same :)

Not adding this to the Wikidata-Campsite during our intake of terms-related tasks, as we (WMDE) probably won't be doing this.

@Addshore do you know if the load has been reduced enough that this is no longer an issue?

> @Addshore do you know if the load has been reduced enough that this is no longer an issue?

The load has been reduced (we dropped the old term store and the new one is heavily and properly cached), but I recommend turning this back on at a slower pace than before, or (preferably and highly recommended) making the sanitizer job not depend on the HTML output of the page (e.g. by having a dedicated sanitizer job for Wikibase). It only loads the HTML of the page because the job is the same as for Wikipedia, while it doesn't need it (unlike Wikipedia), and such a redundant process is pretty expensive in terms of CPU and DB queries; almost 30%-50% of ParserCache storage everywhere was due to this job.

CBogen triaged this task as Medium priority. · Aug 27 2020, 9:13 PM

Change 632751 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/Jade@master] Override getParserOutputForIndexing to ensure HTML

https://gerrit.wikimedia.org/r/632751

Change 632752 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/core@master] Don’t generate HTML in getParserOutputForIndexing

https://gerrit.wikimedia.org/r/632752

The above two changes attempt to implement the optimization outlined in the IRC log in the task description: to stop generating HTML for search indexing by setting the 'generate-html' hint. (Wikibase seems to respect this hint, as far as I could tell, so this should indeed speed up indexing for Wikibase pages.) This is a breaking change to two ContentHandler methods, but the only affected extension is Jade, which I fixed (hopefully).
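
As a rough sketch of the shape of the Wikibase-side change (illustrative only, not the exact merged patch; the interfaces used are from MediaWiki core), an EntityHandler override could look like:

```
use MediaWiki\MediaWikiServices;

// In EntityHandler; a sketch, not the merged code.
public function getParserOutputForIndexing( WikiPage $page, ParserCache $cache = null ) {
	$parserOptions = $page->makeParserOptions( 'canonical' );

	// Reuse a cached rendering when one already exists.
	$parserOutput = $cache ? $cache->get( $page, $parserOptions ) : null;
	if ( $parserOutput ) {
		return $parserOutput;
	}

	// On a cache miss, parse without generating HTML: the search document
	// for an entity is built from the entity data, not from its HTML.
	$renderer = MediaWikiServices::getInstance()->getRevisionRenderer();
	$renderedRevision = $renderer->getRenderedRevision(
		$page->getRevisionRecord(),
		$parserOptions
	);
	return $renderedRevision->getRevisionParserOutput( [ 'generate-html' => false ] );
}
```

This would match the behaviour described later in the thread: cached output is still used when available, while uncached parses skip HTML and are not written back to the parser cache.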

Change 633170 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/Wikibase@master] Override EntityHandler::getParserOutputForIndexing

https://gerrit.wikimedia.org/r/633170

Change 632751 abandoned by Lucas Werkmeister (WMDE):
[mediawiki/extensions/Jade@master] Override getParserOutputForIndexing to ensure HTML

Reason:
No longer needed with PS2 of I50f3a530f2, where generating HTML is no longer skipped by default.

https://gerrit.wikimedia.org/r/632751

> @Addshore do you know if the load has been reduced enough that this is no longer an issue?

> The load has been reduced (we dropped the old term store and the new one is heavily and properly cached), but I recommend turning this back on at a slower pace than before, or (preferably and highly recommended) making the sanitizer job not depend on the HTML output of the page (e.g. by having a dedicated sanitizer job for Wikibase). It only loads the HTML of the page because the job is the same as for Wikipedia, while it doesn't need it (unlike Wikipedia), and such a redundant process is pretty expensive in terms of CPU and DB queries; almost 30%-50% of ParserCache storage everywhere was due to this job.

If we no longer use the HTML in the sanitizer (see the above changes), do we still need a gradual ramp-up when reenabling the sanitizer or is it okay to turn it back on immediately? (I’m not sure if a gradual ramp-up would be possible without code changes to CirrusSearch – currently CirrusSearchSanityCheck is just a boolean flag.)

> If we no longer use the HTML in the sanitizer (see the above changes), do we still need a gradual ramp-up when reenabling the sanitizer or is it okay to turn it back on immediately? (I'm not sure if a gradual ramp-up would be possible without code changes to CirrusSearch – currently CirrusSearchSanityCheck is just a boolean flag.)

I do not think we need a gradual ramp-up, but it's best to keep an eye on things like term storage DB load, s8 load, entity parser output renderings, memcached, etc.
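
For context, the on/off switch is the wmf-config flag mentioned above; the stanza below is illustrative only (the variable name and layout are assumptions based on the usual wmf-config pattern, not the actual diff from Gerrit change 555941):

```
// wmf-config/InitialiseSettings.php (illustrative)
'wgCirrusSearchSanityCheck' => [
	'default' => true,
	// T239931: disabled for Wikidata during the wb_terms migration;
	// reverting this entry re-enables the saneitizer there.
	'wikidatawiki' => false,
],
```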

Change 632752 merged by jenkins-bot:
[mediawiki/core@master] Clarify HTML generation for indexing in ContentHandler

https://gerrit.wikimedia.org/r/632752

All seems reasonable to me. Note that the saneitizer is globally turned off as of a week ago due to a separate incident. Expecting to turn that back on this week.

Change 633170 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Override EntityHandler::getParserOutputForIndexing

https://gerrit.wikimedia.org/r/633170

Alright, then we can try turning on the sanitizer on Wikidata either next week (after wmf.14 is rolled out) or the week after that (probably safer).

Saneitizer was turned back on last week; everything there is working well and wikidata can be re-enabled any time.

Lucas_Werkmeister_WMDE changed the task status from Stalled to Open. · Oct 26 2020, 6:47 PM

Alright, unstalling – wmf.14 is also rolled out and a rollback seems unlikely at this point (T263180: 1.36.0-wmf.14 deployment blockers is closed).

To summarize: the sanitizer was turned off on Wikidata because it caused excessive database load; during the wb_terms cleanup hike, we optimized it to not do as much unneeded work (mainly rEWBA01de21628b23: Override EntityHandler::getParserOutputForIndexing); now we want to turn the sanitizer back on (revert rOMWC406cbec2a764: Disable sanity check cirrus jobs for Wikidata) and hope the load stays acceptable.

Moving to blocked until we know what's causing T266762

Gehel subscribed.

Saneitizer is running slower than usual at the moment; this needs to be investigated first. See T266762 and related.

CBogen set the point value for this task to 2. · Dec 14 2020, 4:34 PM

The sanitizer is working OK with increased concurrency (T266762); we might try to enable it again on wikidata and see how it performs.

With the holidays over and everyone back, I think we can turn this on?

> With the holidays over and everyone back, I think we can turn this on?

Sounds good to me!

Agreed. Should we (Wikidata team) do the config change or leave it to you? :)

> Agreed. Should we (Wikidata team) do the config change or leave it to you? :)

We can ship the config change, no worries :)

Change 655389 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/mediawiki-config@master] Revert "Disable sanity check cirrus jobs for Wikidata"

https://gerrit.wikimedia.org/r/655389

Change 655389 merged by jenkins-bot:
[operations/mediawiki-config@master] Revert "Disable sanity check cirrus jobs for Wikidata"

https://gerrit.wikimedia.org/r/655389

Mentioned in SAL (#wikimedia-operations) [2021-01-13T12:09:46Z] <dcausse@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T239931: Revert "Disable sanity check cirrus jobs for Wikidata" (duration: 01m 16s)

No, we still try to look the content up in the parser cache, I think – the change was that if the parser cache isn’t hit, then we generate parser output without HTML and don’t store it in the parser cache anymore.

I see. The impact doesn't show up in the term store and DB, where it matters the most. So it's good to consider this done, IMO.

Alright, great! (Leaving open in case @dcausse isn’t done with this Needs Reporting column yet.)