
Reduce the impact of the sanitizer on wikidata
Open, Needs Triage, Public

Description

The sanitizer seems to be a bit aggressive with wikidata, causing significant load on the database (https://phabricator.wikimedia.org/T229407#5635732), because generating the data relies on the ParserOutput.

One immediate solution could be to tune the sanitizer to slow it down for wikidata; this can be achieved by creating a new profile in CirrusSearch/profiles/SaneitizeProfiles.config.php.
Another approach is to refactor and reduce the dependency on the ParserOutput when Cirrus generates the document for Elasticsearch.
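For the first option, a slowed-down profile could be added to the existing profile file. The sketch below is illustrative only: the key names mirror what a saneitizer throttling profile plausibly looks like, but the exact keys and values are assumptions, not a reviewed configuration.

```php
<?php
// Hypothetical "slow" profile for CirrusSearch/profiles/SaneitizeProfiles.config.php.
// Key names and values here are illustrative assumptions for a wiki where
// re-rendering pages is expensive; check the existing 'default' profile for
// the real key set before copying any of this.
return [
	'wikidata-slow' => [
		// Run fewer concurrent checker jobs than the default profile.
		'max_checker_jobs' => 1,
		// Check fewer page ids per job iteration.
		'jobs_chunk_size' => 10,
		// Stretch one full pass over the wiki across a longer window.
		'min_loop_duration' => 16 * 7 * 24 * 3600, // 16 weeks, in seconds
	],
];
```

The trade-off is directly visible in the last key: a longer loop duration means less database pressure per hour, but a longer wait before a field change reaches every document.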
Discussion:

16:15 <Amir1> So regarding writing some code, if there's some documentation, I might dig into it and do it
16:17 <addshore> Which bits of parser output does it need?
16:25 <+dcausse> addshore: it needs it for wikipage properties (categories/external links/...) these are maybe useless for wikidata
16:26 <+dcausse> the problematic interface is \ContentHandler::getDataForSearchIndex that takes the ParserOutput as an argument
16:30 <+dcausse> this would have to be changed to load the ParserOutput just when needed and from EntityHandler stop calling parent::getDataForSearchIndex but feed the base properties needed by cirrus from something else
16:30 <+dcausse> properties needed: https://gerrit.wikimedia.org/g/mediawiki/core/+/2b04ef66576439b9ace37f1f25de7967abcb1356/includes/content/ContentHandler.php#1321
16:31 <addshore> okay!
16:33 <addshore> Indeed, so the bit in ContentHandler::getDataForSearchIndex uses a ParserOutputSearchDataExtractor and thus the parseroutput
16:34 <addshore> i just had a quick look through the wikibase specific index things and nothing there uses parser output
16:34  * addshore looks at what calls getDataForSearchIndex
16:34 <+dcausse> the thing that flattens the entity data into the text field is very important tho
16:34 <+dcausse> but probably a code available directly from wikibase
16:35 <addshore> Yup, that's fine, that doesn't need parser output
16:35 <addshore> So, CirrusSearch/includes/Updater.php calls getDataForSearchIndex
16:35 <+dcausse> yes this one will have to change as well
16:35 <addshore> going back further, it does $output = $contentHandler->getParserOutputForIndexing( $page, $parserCache );
16:36 <+ebernhardson> addshore: you have an old version btw, thats now in CirrusSearch/includes/BuildDocument/something
16:36  * addshore pulls :P
16:38 <addshore> So....
16:38 <addshore> ContentHandler::getParserOutputForIndexing
16:38 <addshore> Calls, $renderer->getRenderedRevision->getRevisionParserOutput
16:38 <addshore> And that has
16:38 <addshore> @param array $hints Hints given as an associative array. Known keys:
16:38 <addshore> 	 *      - 'generate-html' => bool: Whether the caller is interested in output HTML (as opposed
16:38 <addshore> 	 *        to just meta-data). Default is to generate HTML.
16:39 <addshore> that, could, maybe, be something to think about
16:40 <Amir1> Maybe Update.php buildDocument can set "skipParse" for wikibase to true?
16:41 <addshore> Well, i think it still needs a "parse" and the meta data from it, for links and things?
16:41 <addshore> but it probably doesnt care about the actual html output, but i need to verify that
16:42 <addshore> it looks at categories, external links, outgoing links, templates, text, source_text, text_bytes, content_model
16:45 <addshore> This path as far as I can see is the only thing that uses getParserOutputForIndexing too
16:46 <addshore> So, as long as the things listed above don't need any part of the html, we can add that hint and stop generating it probably :)
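The 'generate-html' hint discussed in the log could plausibly be wired in roughly as below. This is PHP-shaped pseudocode against MediaWiki interfaces, not tested code: the method and hint names come from the discussion above, but the exact wiring (how the RevisionRecord and ParserOptions are obtained) is an assumption.

```php
// PHP-shaped pseudocode, untested: an EntityHandler-style override that asks
// the revision renderer to skip HTML generation, keeping only the metadata
// Cirrus reads (categories, external links, outgoing links, templates,
// text_bytes, content_model, ...).
public function getParserOutputForIndexing( WikiPage $page, ParserCache $cache = null ) {
	$renderer = MediaWikiServices::getInstance()->getRevisionRenderer();
	$renderedRevision = $renderer->getRenderedRevision(
		$page->getRevisionRecord(),
		$page->makeParserOptions( 'canonical' )
	);
	// The hint quoted from the docblock above: 'generate-html' => false asks
	// for just the meta-data side of the ParserOutput, no HTML.
	return $renderedRevision->getRevisionParserOutput( [ 'generate-html' => false ] );
}
```

Whether Wikibase's ParserOutput metadata (links, categories, page properties) is still fully populated when HTML generation is skipped would need to be verified, as addshore notes.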

Details

Related Gerrit Patches:
operations/mediawiki-config : master | Disable sanity check cirrus jobs for Wikidata

Event Timeline

dcausse created this task. · Dec 5 2019, 5:01 PM
Restricted Application added a project: Discovery-Search. · Dec 5 2019, 5:01 PM
Restricted Application added a subscriber: Aklapper.
dcausse updated the task description. · Dec 5 2019, 5:02 PM
dcausse added subscribers: Addshore, Ladsgroup.
Addshore moved this task from incoming to monitoring on the Wikidata board.Dec 9 2019, 8:47 AM
Addshore awarded a token.

Change 555941 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[operations/mediawiki-config@master] Disable sanity check cirrus jobs for Wikidata

https://gerrit.wikimedia.org/r/555941

Change 555941 merged by jenkins-bot:
[operations/mediawiki-config@master] Disable sanity check cirrus jobs for Wikidata

https://gerrit.wikimedia.org/r/555941

Mentioned in SAL (#wikimedia-operations) [2019-12-09T14:26:23Z] <ladsgroup@deploy1001> Synchronized wmf-config/InitialiseSettings.php: [[gerrit:555941|Disable sanity check cirrus jobs for Wikidata (T239931 T229407)]] (duration: 00m 57s)

EBernhardson added a subscriber: EBernhardson. · Edited · Dec 9 2019, 11:12 PM

To be clear, we do not have the ability to reindex wikidata from scratch due to its size, so turning off the sanity checker means no new fields can be added to wikidata. Additionally, any change to the way fields are defined (such as changing the list of whitelisted properties in search) will effectively never reach 100% deployment.

Is it really too much to ask wikidata to render all known items once every eight weeks? If it is, that seems like a much deeper problem than the saneitizer.

Hey, thanks for the comment. I'm planning to turn it back on ASAP. Right now we are in the middle of the migration and it puts too much pressure on s8; once the migration of wb_terms is over, we can turn this back on again. That's going to happen in two months (hopefully).

I have two questions:

  • If someone makes an edit or creates a new item, the index still gets updated, but a new feature (let's assume, for example, "number of claims" gets added to the search index) does not propagate into the system. Is that correct?
  • If the above statement is correct, how often do you change the index structure, meaning you need to run reindexing? When was the last time such a change was needed?

If you change it quite often, let us know.

The best solution IMO is not to make this less aggressive; it's to stop rendering the HTML of the items, which is a very heavy job for wikidata (unlike Wikipedia pages). Doing it is not super hard but I'm not super sure where to start. I might pick this up to see what I can do.

> The best solution IMO is not to make this less aggressive, it's to stop rendering html of the items which is very heavy job for wikidata (unlike Wikipedia pages). Doing it is not super hard but I'm not super sure where to start. I might pick this up to see what I can do.

+1, this would probably make the whole process go a lot faster too!

EBernhardson added a comment. · Edited · Dec 10 2019, 4:35 PM

> Hey, Thanks for the comment. I'm planning to turn it back on ASAP, right now, we are at middle of the migration and it puts too much pressure on s8, once the migration of wb_terms is over, we can turn this back on again. That's going to happen in two months (hopefully).
> I have two questions:
>
>   • If someone makes an edit or makes a new item still the index gets updated but adding new feature (let's assume for example "number of claims" gets added to the search index) is not propagating into the system. Is that correct?

Edits all propagate with perhaps a few minutes of delay. So if the way some field is generated is changed, or a new property is added, that gets updated on a standard edit. What happens, though, is that many pages do not get edited. When we first rolled out the automated saneitizer there were millions of pages that still did not have properties added several years prior.

>   • If the above statement is correct, how often do you change the index structure, meaning you need the run reindexing? When was the last time this change was needed?

It's not necessarily the indexing structure, but any change to the way searchable properties are rendered. There was a ticket (T239950) filed about 5 days ago to request changing the way some wikidata properties are rendered. The only way to roll out a change like that is to regenerate the pages and ship them to elasticsearch. You could imagine how much worse the load would be if we had to re-render all wikidata items from a maintenance script instead of an automated process that slowly does it over 8 weeks. I'm not sure when the last change that affected wikidata was, but in general there are probably a few updates a quarter that affect the search document rendering.
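A back-of-the-envelope check makes the point about the 8-week loop concrete. The item count below is an assumption (on the order of 75 million items, roughly Wikidata's size in late 2019); it is not stated anywhere in this task.

```php
<?php
// How fast must the saneitizer loop to cover all of Wikidata in its 8-week
// cycle? The item count is an assumption, not a figure from this task.
$items = 75000000;                  // assumed: ~75M items circa late 2019
$loopSeconds = 8 * 7 * 24 * 3600;   // one full saneitizer pass: 8 weeks
$pagesPerSecond = $items / $loopSeconds;
printf( "~%.1f pages re-rendered per second\n", $pagesPerSecond );
```

At that rate the re-render cost is spread thin across the whole cycle, which is the point of the loop; a one-off maintenance script would concentrate the same total work into a much shorter window.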

> If you change it quite often. let us know.
> The best solution IMO is not to make this less aggressive, it's to stop rendering html of the items which is very heavy job for wikidata (unlike Wikipedia pages). Doing it is not super hard but I'm not super sure where to start. I might pick this up to see what I can do.

Sounds like some kind of parse flag? I'm not too familiar with those interfaces. Separately, CirrusSearch tends to assume that the ParserCache has an anonymously rendered version of all current pages somewhere. Is Cirrus somehow getting a different cache key than anonymous page views?

> Sounds like some kind of parse flag? I'm not too familiar with those interfaces. Separately, CirrusSearch tends to assume that the ParserCache has an anonymously rendered version of all current pages somewhere. Is Cirrus somehow getting a different cache key than anonymous page views?

I don't think we should assume that PC has all pages of wikidata rendered. Given the unique size and concept of wikidata compared to other wikis, direct page views are low but secondary usages (WDQS, client entity usage, API) are high. Combine that with the fact that rendering item HTML is resource-intensive (it needs to load lots of terms from other items), and we have this problem at hand now.

The search index on wikidata doesn't need to render the page, so I think fixing it to not use the HTML part of ParserOutput should not be that hard. I will check.

> I don't think we should assume that PC has all pages of wikidata rendered

Yes that would be a bad assumption.

> is Cirrus somehow getting a different cache key than anonymous page views?

Anon cache keys are for example:

wikidatawiki:pcache:idhash:402-0!termboxVersion=1!wb=3 and timestamp 20191118131034 and revision id 997124603

T229407#5714974 reports the cache key of:

wikidatawiki:pcache:idoptions:55899217

which points to

wikidatawiki:pcache:idhash:10017433-0!termboxVersion=1!wb=3

which looks the same :)

Not adding this to the Wikidata-Campsite during our intake of terms-related tasks as we (wmde) probably won't be doing this.