Introduction
During local Wikibase development we noticed something odd and slow:
- Create some new Wikibase items
- Create a final Wikibase item that refers to all of the items you just wrote
- derivedDataUpdater->prepareUpdate and derivedDataUpdater->doUpdates are called, triggering parser output generation at the end of the same edit request (which in Wikibase ends up doing a fair amount of work)
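For context, a simplified, hedged sketch of the code path involved ($wikiPage, $user and $entityContent are assumed to be in scope; the real wiring is more involved):

```
use CommentStoreComment;
use MediaWiki\Revision\SlotRecord;

// Sketch: saving an entity edit through the normal PageUpdater path.
$updater = $wikiPage->newPageUpdater( $user );
$updater->setContent( SlotRecord::MAIN, $entityContent );
$updater->saveRevision( CommentStoreComment::newUnsavedComment( 'example edit' ) );

// As part of saving, the derived data updater's prepareUpdate() and
// doUpdates() run. Per the description above, this is where the ParserOutput
// (including its HTML) ends up being generated for the entity content, still
// within the same edit request.
```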
Problem
Generating the HTML ends up being the expensive part of building the ParserOutput, as it needs to load data (such as labels) from a whole collection of other entities.
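Conceptually (an illustrative sketch only, not actual Wikibase code; it assumes an $item and a LabelDescriptionLookup are already in scope):

```
// Building the HTML for one item needs the labels of every property (and
// entity value) its statements reference - this is where the extra lookups
// during parser output generation come from.
foreach ( $item->getStatements()->toArray() as $statement ) {
	$propertyId = $statement->getPropertyId();
	// One label lookup per referenced entity; batched and cached in practice.
	$label = $labelDescriptionLookup->getLabel( $propertyId );
	// ... the statement group HTML is then rendered using $label ...
}
```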
There is secondary storage, caching, etc. in place, but ideally we would not do this extra work when it is not needed, and when it is needed, we would not do it pre-send in the API edit request.
In many cases the HTML in this ParserOutput is not used immediately after the API request, as most edits on Wikidata.org are made by bots.
Even for edits made by users, the page is not normally reloaded after the edit, as editing generally happens in JavaScript, with on-page elements being updated instead.
For third-party Wikibase users that want to do large bulk imports of data this is an even more pressing issue. They will often have fewer resources, less speed and caching, and may not even have the parser cache enabled, yet after each API request that edits an entity, the application will still generate this possibly unneeded HTML parser output.
Other details
For search indexing we already generate parser output with the generate-html => false hint as part of T239931.
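For reference, a rough sketch of how a caller can request a ParserOutput without HTML via that hint (shown here through RevisionRenderer / RenderedRevision; the exact wiring of the search indexing path is simplified):

```
use MediaWiki\MediaWikiServices;

// Sketch: ask for a ParserOutput while hinting that HTML is not needed.
$renderer = MediaWikiServices::getInstance()->getRevisionRenderer();
$rendered = $renderer->getRenderedRevision( $revisionRecord, ParserOptions::newCanonical( 'canonical' ) );
$output = $rendered->getRevisionParserOutput( [ 'generate-html' => false ] );

// $output still carries links, page properties, search data, etc., but its
// text may be empty, so it must not be stored in the parser cache.
```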
We already have a split parser cache, but the canonical parser output is the one used for English (en) page views, for example:
- canonical: wikidatawiki:pcache:idhash:3369-0!termboxVersion=1!wb=3
- other: wikidatawiki:pcache:idhash:3369-0!termboxVersion=1!userlang=pt!wb=3
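The userlang split shown above comes from the options recorded on the output; a minimal sketch of the general MediaWiki mechanism (not a quote of the Wikibase code):

```
// Recording an option makes ParserCache::getKey() include it in the cache
// key, which is what produces the "!userlang=pt" variants above.
$output->recordOption( 'userlang' );
```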
Possible solution
If MediaWiki could ask the content type whether HTML should be generated for it post edit, then we would be able to say that a generate-html => false hint should be used for Wikibase content in most cases.
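A hypothetical sketch of what that could look like (the method name generateHTMLOnEdit and its placement on Wikibase's EntityHandler are assumptions for illustration, not an agreed interface):

```
// Hypothetical: a ContentHandler-level hint that core could consult before
// rendering ParserOutput as part of saving an edit.
abstract class EntityHandler extends ContentHandler {

	/**
	 * Whether HTML needs to be generated for the ParserOutput produced while
	 * saving an edit. Wikibase entity HTML is rarely viewed straight after an
	 * API edit, so skip it by default.
	 */
	public function generateHTMLOnEdit(): bool {
		return false;
	}
}
```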
When such a hint is used to generate content, we could add a similar hack to EntityContent::getParserOutput for when no HTML is generated: https://gerrit.wikimedia.org/g/mediawiki/extensions/Wikibase/+/c566765494f366c06b56a4ae3a2257d378b93222/repo/includes/Content/EntityContent.php#168
We already have a "hack" or two to change when parser output is or isn't cached, such as https://gerrit.wikimedia.org/g/mediawiki/extensions/Wikibase/+/c566765494f366c06b56a4ae3a2257d378b93222/repo/includes/Content/EntityContent.php#180
if ( $generateHtml === true ) { $out->updateCacheExpiry( 0 ); }
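Putting the pieces together, a hedged sketch of how EntityContent::getParserOutput could behave once a generate-html => false hint reaches it on edit (helper names such as getParserOutputWithoutHtml and getParserOutputFromEntityView are placeholders here, not the final implementation):

```
// Sketch only: a possible shape for EntityContent::getParserOutput().
public function getParserOutput(
	Title $title,
	$revisionId = null,
	ParserOptions $options = null,
	$generateHtml = true
) {
	if ( !$generateHtml ) {
		// Collect links, page properties, search data, etc. without building
		// the expensive entity HTML. (Hypothetical helper.)
		$output = $this->getParserOutputWithoutHtml( $title, $revisionId, $options );

		// A ParserOutput without HTML must never be served from the parser
		// cache, so mark it as uncacheable.
		$output->updateCacheExpiry( 0 );
		return $output;
	}

	return $this->getParserOutputFromEntityView( $title, $revisionId, $options, $generateHtml );
}
```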
We could then either:
- Just not generate this ParserOutput at all and wait for a page view to trigger the render?
- Send a job to the job queue to generate it when we do want it? (See the sketch below.)
We could also do other things, such as only pre-rendering very expensive pages (100+ statements, etc.).
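As a rough illustration of the job queue option combined with such a threshold (the job name, parameters, and the 100-statement cut-off are illustrative assumptions, not an existing job or agreed heuristic):

```
// Sketch only: defer HTML generation instead of doing it pre-send.
if ( $item->getStatements()->count() >= 100 ) {
	JobQueueGroup::singleton()->lazyPush(
		new JobSpecification( 'wikibaseRenderEntityHtml', [
			'pageId' => $title->getArticleID(),
			'revId' => $revisionRecord->getId(),
		] )
	);
}
// Otherwise do nothing here: the first page view (or the job above) will
// trigger HTML generation and populate the parser cache on demand.
```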
Predicted impact
All Wikibase content edits would do less work and be faster.
Memcached (for terms caching) would likely see slightly lower load, as would the term-related tables on s8, both of which are used during parser output generation.
Currently, per https://grafana.wikimedia.org/d/FxKUKqUik/wikibase-parseroutputgenerator?orgId=1, item parser output is generated roughly 2.5-3k times per minute, and roughly 700-1k of those per minute likely come from edits (i.e. somewhere around 25-40% of all generations).
This could also have some impact on Commons, as MediaInfo entity edits would likewise no longer have their parser output generated and cached on edit.
Acceptance Criteria
- HTML parser output is not generated or cached during an edit
- While the change is being deployed, monitor 1) Wikibase Parser Cache Generation, 2) Wikibase & Commons save times, and 3) Wikibase edge entity load times
- Consider backporting the change to the next Wikibase release (REL1_36 branch)