The sanitizer appears to be too aggressive on Wikidata: it causes significant load on the database (https://phabricator.wikimedia.org/T229407#5635732) because generating the data relies on the ParserOutput.
One immediate solution could be to slow the sanitizer down for Wikidata. This can be achieved by creating a new profile in CirrusSearch/profiles/SaneitizeProfiles.config.php.
Another approach is to refactor CirrusSearch so that it depends less on the ParserOutput when it generates the document for Elasticsearch.
Discussion:
16:15 <Amir1> So regarding writing some code, if there's some documentation, I might dig into it and do it
16:17 <addshore> Which bits of parser output does it need?
16:25 <+dcausse> addshore: it needs it for wikipage properties (categories/external links/...) these are maybe useless for wikidata
16:26 <+dcausse> the problematic interface is \ContentHandler::getDataForSearchIndex that takes the ParserOutput as an argument
16:30 <+dcausse> this would have to be changed to load the ParserOutput just when needed and from EntityHandler stop calling parent::getDataForSearchIndex but feed the base properties needed by cirrus from something else
16:30 <+dcausse> properties needed: https://gerrit.wikimedia.org/g/mediawiki/core/+/2b04ef66576439b9ace37f1f25de7967abcb1356/includes/content/ContentHandler.php#1321
16:31 <addshore> okay!
16:33 <addshore> Indeed, so the bit in ContentHandler::getDataForSearchIndex uses a ParserOutputSearchDataExtractor and thus the parseroutput
16:34 <addshore> i just had a quick look through the wikibase specific index things and nothing there uses parser output
16:34 * addshore looks at what calls getDataForSearchIndex
16:34 <+dcausse> the thing that flattens the entity data into the text field is very important tho
16:34 <+dcausse> but probably a code available directly from wikibase
16:35 <addshore> Yup, thats fine, that doenst need parser output
16:35 <addshore> So, CirrusSearch/includes/Updater.php calles getDataForSearchIndex
16:35 <+dcausse> yes this one will have to change as well
16:35 <addshore> going back further, it does $output = $contentHandler->getParserOutputForIndexing( $page, $parserCache );
16:36 <+ebernhardson> addshore: you have an old version btw, thats now in CirrusSearch/includes/BuildDocument/something
16:36 * addshore pulls :P
16:38 <addshore> So....
16:38 <addshore> ConrentHandler::getParserOutputForIndexing
16:38 <addshore> Calls, $renderer->getRenderedRevision->getRevisionParserOutput
16:38 <addshore> And that has
16:38 <addshore> @param array $hints Hints given as an associative array. Known keys:
16:38 <addshore> * - 'generate-html' => bool: Whether the caller is interested in output HTML (as opposed
16:38 <addshore> * to just meta-data). Default is to generate HTML.
16:39 <addshore> that, could, maybe, be something to think about
16:40 <Amir1> Maybe Update.php buildDocument can set "skipParse" for wikibase to true?
16:41 <addshore> Well, i think it still needs a "parse" and the meta data form it, for links and things?
16:41 <addshore> but it probably doesnt care about the actual html output, but i need to verify that
16:42 <addshore> it looks at categories, external links, outgoing links, templates, text, source_text, text_bytes, content_model
16:45 <addshore> This path as far as I can see if the only thing that uses getParserOutputForIndexing too
16:46 <addshore> So, as long as the things listed above dont need any part of the html, we can add that hint and stop generating it probably :)
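
Sketch of the "load the ParserOutput just when needed" idea from the discussion. This is a minimal, standalone PHP illustration and not MediaWiki code: `LazyParserOutput`, `buildWikitextFields` and `buildEntityFields` are hypothetical names, and the plain array standing in for the ParserOutput is made up for the example. It only shows the shape of the change: the caller hands over a lazy wrapper instead of an already generated ParserOutput, so a handler that never asks for it never triggers a parse.

```php
<?php
// Hypothetical wrapper (not a MediaWiki class): defers the expensive
// parse until a handler really asks for the output, then caches it.
final class LazyParserOutput {
	/** @var callable */
	private $loader;
	/** @var mixed */
	private $value = null;
	private bool $loaded = false;

	public function __construct( callable $loader ) {
		$this->loader = $loader;
	}

	/** Runs the loader on first use only; later calls reuse the result. */
	public function get() {
		if ( !$this->loaded ) {
			$this->value = ( $this->loader )();
			$this->loaded = true;
		}
		return $this->value;
	}

	public function wasLoaded(): bool {
		return $this->loaded;
	}
}

// Hypothetical wikitext path: needs parse metadata, so it loads the output.
function buildWikitextFields( LazyParserOutput $output ): array {
	$parsed = $output->get();
	return [
		'category' => $parsed['categories'],
		'external_link' => $parsed['externalLinks'],
	];
}

// Hypothetical entity path: builds its fields from the entity data alone
// and never touches the lazy wrapper, so no parse happens.
function buildEntityFields( array $entity, LazyParserOutput $output ): array {
	return [
		'text' => implode( ' ', $entity['labels'] ),
		'content_model' => 'wikibase-item',
	];
}

// Usage: count how often the (expensive) parse actually runs.
$parseCount = 0;
$lazy = new LazyParserOutput( function () use ( &$parseCount ) {
	$parseCount++;
	return [
		'categories' => [ 'Example category' ],
		'externalLinks' => [ 'https://example.org' ],
	];
} );

$entityDoc = buildEntityFields( [ 'labels' => [ 'Douglas Adams', 'English writer' ] ], $lazy );
// $parseCount is still 0 here: the entity path never triggered a parse.

$wikitextDoc = buildWikitextFields( $lazy );
buildWikitextFields( $lazy );
// $parseCount is 1 here: the second call reused the first parse.
```

In the real code the loader would presumably wrap the getParserOutputForIndexing call, possibly combined with the 'generate-html' hint mentioned above; whether the metadata is still complete with HTML generation turned off is the part that still needs verifying.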