The sanitizer seems to be too aggressive on Wikidata and is causing significant load on the database (https://phabricator.wikimedia.org/T229407#5635732), because generating the search index data relies on the ParserOutput.
One immediate solution could be to tune the sanitizer so that it runs more slowly on Wikidata; this can be achieved by creating a new profile in `CirrusSearch/profiles/SaneitizeProfiles.config.php`.
Another approach is to refactor the code and reduce its dependency on the ParserOutput.
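As a rough illustration of the refactoring idea (load the ParserOutput only when something actually needs it), here is a minimal, language-neutral sketch in Python. It is not MediaWiki code: `LazyParserOutput`, `wikitext_fields`, `entity_fields` and `fake_parse` are hypothetical names, and the assumption that an entity handler can leave the page-property fields empty still has to be confirmed.

```python
from typing import Callable, Dict, List, Optional

ParserData = Dict[str, List[str]]


class LazyParserOutput:
    """Defers an expensive parse until a caller actually asks for it."""

    def __init__(self, parse: Callable[[], ParserData]) -> None:
        self._parse = parse
        self._cached: Optional[ParserData] = None
        self.parse_count = 0  # exposed only so the sketch can be checked

    def get(self) -> ParserData:
        # Parse on first access, then reuse the cached result.
        if self._cached is None:
            self.parse_count += 1
            self._cached = self._parse()
        return self._cached


def wikitext_fields(lazy: LazyParserOutput) -> Dict[str, object]:
    """Stand-in for the base handler: it needs the parsed page properties."""
    parsed = lazy.get()
    return {
        "category": parsed["categories"],
        "external_link": parsed["external_links"],
    }


def entity_fields(entity: Dict[str, List[str]], lazy: LazyParserOutput) -> Dict[str, object]:
    """Stand-in for an entity handler that skips the parent implementation."""
    # Assumption: page properties are not useful for entities, so they are
    # left empty and lazy.get() is never called (no parse is triggered).
    return {
        "category": [],
        "external_link": [],
        "text": " ".join(entity["labels"]),
    }


def fake_parse() -> ParserData:
    """Hypothetical parse result used only for this illustration."""
    return {"categories": ["Cats"], "external_links": ["https://example.org"]}
```

With this shape, a wikitext page pays for exactly one parse however many fields read from it, while the entity path never triggers one.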
16:15 <Amir1> So regarding writing some code, if there's some documentation, I might dig into it and do it
16:17 <addshore> Which bits of parser output does it need?
16:25 <+dcausse> addshore: it needs it for wikipage properties (categories/external links/...) these are maybe useless for wikidata
16:26 <+dcausse> the problematic interface is \ContentHandler::getDataForSearchIndex that takes the ParserOutput as an argument
16:30 <+dcausse> this would have to be changed to load the ParserOutput just when needed and from EntityHandler stop calling parent::getDataForSearchIndex but feed the base properties needed by cirrus from something else
16:30 <+dcausse> properties needed: https://gerrit.wikimedia.org/g/mediawiki/core/+/2b04ef66576439b9ace37f1f25de7967abcb1356/includes/content/ContentHandler.php#1321
16:31 <addshore> okay!
16:33 <addshore> Indeed, so the bit in ContentHandler::getDataForSearchIndex uses a ParserOutputSearchDataExtractor and thus the parseroutput
16:34 <addshore> i just had a quick look through the wikibase specific index things and nothing there uses parser output
16:34 * addshore looks at what calls getDataForSearchIndex
16:34 <+dcausse> the thing that flattens the entity data into the text field is very important tho
16:34 <+dcausse> but that's probably code available directly from wikibase
16:35 <addshore> Yup, thats fine, that doesnt need parser output
16:35 <addshore> So, CirrusSearch/includes/Updater.php calls getDataForSearchIndex
16:35 <+dcausse> yes this one will have to change as well
16:35 <addshore> going back further, it does $output = $contentHandler->getParserOutputForIndexing( $page, $parserCache );
16:36 <+ebernhardson> addshore: you have an old version btw, thats now in CirrusSearch/includes/BuildDocument/something
16:36 * addshore pulls :P
16:38 <addshore> So....
16:38 <addshore> ContentHandler::getParserOutputForIndexing
16:38 <addshore> Calls, $renderer->getRenderedRevision->getRevisionParserOutput
16:38 <addshore> And that has
16:38 <addshore> @param array $hints Hints given as an associative array. Known keys:
16:38 <addshore> * - 'generate-html' => bool: Whether the caller is interested in output HTML (as opposed
16:38 <addshore> * to just meta-data). Default is to generate HTML.
16:39 <addshore> that, could, maybe, be something to think about
16:40 <Amir1> Maybe Updater.php buildDocument can set "skipParse" for wikibase to true?
16:41 <addshore> Well, i think it still needs a "parse" and the meta data from it, for links and things?
16:41 <addshore> but it probably doesnt care about the actual html output, but i need to verify that
16:42 <addshore> it looks at categories, external links, outgoing links, templates, text, source_text, text_bytes, content_model
16:45 <addshore> This path, as far as I can see, is the only thing that uses getParserOutputForIndexing too
16:46 <addshore> So, as long as the things listed above dont need any part of the html, we can add that hint and stop generating it probably :)
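One way to check the "nothing here needs the HTML" assumption is to make any HTML access fail loudly when the hint is off. The sketch below models that in Python; it is not the MediaWiki API. `FakeParserOutput`, `render` and `extract_for_index` are hypothetical stand-ins, and only the `'generate-html'` hint key and its default of true come from the docblock quoted above. The text-related fields (text, source_text, text_bytes, content_model) are deliberately left out, since whether they depend on the HTML is the open question.

```python
from typing import Dict, List, Optional


class FakeParserOutput:
    """Holds metadata and, optionally, rendered HTML."""

    def __init__(self, metadata: Dict[str, List[str]], html: Optional[str]) -> None:
        self.metadata = metadata
        self._html = html

    @property
    def html(self) -> str:
        # Fail loudly so any hidden dependency on the HTML shows up in tests.
        if self._html is None:
            raise RuntimeError("HTML was requested but not generated")
        return self._html


def render(page: Dict[str, List[str]],
           hints: Optional[Dict[str, bool]] = None) -> FakeParserOutput:
    """Hypothetical renderer honouring a 'generate-html' hint (default: True)."""
    generate_html = (hints or {}).get("generate-html", True)
    html = "<p>rendered</p>" if generate_html else None
    return FakeParserOutput(metadata=page, html=html)


# Metadata-only fields from the list discussed above.
METADATA_FIELDS = ("categories", "external_links", "outgoing_links", "templates")


def extract_for_index(output: FakeParserOutput) -> Dict[str, List[str]]:
    """Reads only metadata, never output.html."""
    return {name: output.metadata.get(name, []) for name in METADATA_FIELDS}
```

If the real extractor can run against an output produced with the hint turned off without ever hitting the equivalent of that `RuntimeError`, adding the hint should be safe.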