
Do not generate full html parser output at the end of Wikibase edit requests
Open, Medium | Public | 13 Estimated Story Points

Description

Introduction

During Wikibase local development we noticed something odd and slow:

  1. Create some new Wikibase items
  2. Create a final Wikibase item that refers to all of the other items you just wrote
  3. derivedDataUpdater->prepareUpdate and derivedDataUpdater->doUpdates are called, triggering parser output generation at the end of the same edit request (which in Wikibase ends up doing some amount of work)

Problem

Generation of HTML ends up being the "expensive" part of ParserOutput, as it needs to load data from a whole collection of other entities.
There is secondary storage, caching, and so on in place, but ideally we would skip this extra work when it is not needed. And if it is needed for some reason, we would not do it pre-send in the API edit call.

In many cases this HTML for the ParserOutput is not used immediately after the API request, as most edits on Wikidata.org are made by bots.
Even for edits made by users, post edit they will not normally reload the page, as editing generally happens in JS with on page elements changing instead.
For third-party Wikibase users who want to do large bulk imports of data this is an even more pressing issue, as they will often have fewer resources, less speed, less caching, etc., and may not even have parser caching enabled. Yet after an API request to edit an entity, the application will still go and generate this possibly unneeded HTML parser output.

Other details

For search indexing we already generate parser output with the generate-html => false hint as part of T239931.

We already have a split parser cache, but the canonical parser output is used for en page views, for example:

  • canonical wikidatawiki:pcache:idhash:3369-0!termboxVersion=1!wb=3
  • other wikidatawiki:pcache:idhash:3369-0!termboxVersion=1!userlang=pt!wb=3
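To make the split visible, the two keys above can be taken apart on their "!" separators. This is only an illustrative sketch: `splitParserCacheKey` is a hypothetical helper, not a MediaWiki API, and it assumes the key layout is exactly what the two examples show.

```php
<?php
// Hypothetical helper for illustration; not part of MediaWiki or Wikibase.

/**
 * Split a parser cache key into its base part and its "!"-separated options.
 *
 * @return array{base: string, options: array<string, string>}
 */
function splitParserCacheKey( string $key ): array {
    $parts = explode( '!', $key );
    $base = array_shift( $parts );
    $options = [];
    foreach ( $parts as $part ) {
        // Tolerate an option without "=" by defaulting its value to ''.
        [ $name, $value ] = explode( '=', $part, 2 ) + [ 1 => '' ];
        $options[$name] = $value;
    }
    return [ 'base' => $base, 'options' => $options ];
}

$canonical = splitParserCacheKey( 'wikidatawiki:pcache:idhash:3369-0!termboxVersion=1!wb=3' );
$other = splitParserCacheKey( 'wikidatawiki:pcache:idhash:3369-0!termboxVersion=1!userlang=pt!wb=3' );

// Options present only in the non-canonical key.
$extra = array_diff_key( $other['options'], $canonical['options'] );
print_r( $extra );
```

Both keys share the same base; the only option that distinguishes the second one is `userlang`.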

Possible solution

If MediaWiki could ask the content type if html should be generated for it or not post edit, then we would be able to say that a generate-html => false hint should be used for Wikibase content in most cases.
When such a hint is used to generate content, we could add a similar hack to EntityContent::getParserOutput for when no HTML is generated: https://gerrit.wikimedia.org/g/mediawiki/extensions/Wikibase/+/c566765494f366c06b56a4ae3a2257d378b93222/repo/includes/Content/EntityContent.php#168
We already have a "hack" or two to change when parser output is or isn't cached, such as https://gerrit.wikimedia.org/g/mediawiki/extensions/Wikibase/+/c566765494f366c06b56a4ae3a2257d378b93222/repo/includes/Content/EntityContent.php#180

if ( !$generateHtml ) {
    // Output without HTML should not be stored in the parser cache
    $out->updateCacheExpiry( 0 );
}

We could then either:

  1. Just do not generate this ParserOutput and wait for a user to trigger the render on page view?
  2. Send a job to the job queue to generate it if we do want it?

We could also do other things such as only pre render very expensive pages (100+ statements etc)
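The options above could be folded into a single post-edit decision. The following is a minimal sketch under stated assumptions: the function `choosePostEditRenderStrategy`, the `RENDER_*` constants, and the threshold parameter are all hypothetical names invented for illustration, and the 100-statement cut-off simply mirrors the "100+ statements" example.

```php
<?php
// Hypothetical sketch: none of these names exist in MediaWiki or Wikibase.

const RENDER_ON_VIEW = 'render-on-view'; // option 1: wait for a page view
const RENDER_VIA_JOB = 'render-via-job'; // option 2: defer to the job queue

/**
 * Decide how HTML should be produced after an entity edit.
 *
 * @param int $statementCount Number of statements on the edited entity
 * @param int $expensiveThreshold Assumed cut-off for "very expensive" pages
 * @return string One of the RENDER_* constants
 */
function choosePostEditRenderStrategy( int $statementCount, int $expensiveThreshold = 100 ): string {
    // HTML is never rendered inside the edit request itself; this only
    // picks how it will be produced later.
    if ( $statementCount >= $expensiveThreshold ) {
        return RENDER_VIA_JOB;
    }
    return RENDER_ON_VIEW;
}

echo choosePostEditRenderStrategy( 12 ), "\n";
echo choosePostEditRenderStrategy( 250 ), "\n";
```

Whether pre-rendering expensive pages is worth a job at all would depend on how often those pages are actually viewed, which this sketch does not model.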

Predicted impact

All Wikibase content edits would do less work and be faster.
Memcached (for terms caching) would likely have a slightly lower load, as would s8 terms related tables, which are used during parser output generation.

Currently, per https://grafana.wikimedia.org/d/FxKUKqUik/wikibase-parseroutputgenerator?orgId=1, item parser output is generated roughly 2.5-3k times per minute. Roughly 700-1k of those per minute likely come from edits.
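As a rough worked example, the bounds quoted above imply the following share of parser output generation that could be attributed to edits. The variable names are illustrative, and the result is only as precise as the estimates it is built from.

```php
<?php
// Per-minute figures quoted above (lower and upper bounds).
$totalPerMinute = [ 2500, 3000 ];
$fromEditsPerMinute = [ 700, 1000 ];

// Smallest share: fewest edit-driven renders over the most total renders.
$minShare = $fromEditsPerMinute[0] / $totalPerMinute[1] * 100;
// Largest share: most edit-driven renders over the fewest total renders.
$maxShare = $fromEditsPerMinute[1] / $totalPerMinute[0] * 100;

printf( "Edit-driven share: %.0f%% to %.0f%%\n", $minShare, $maxShare );
```

With these inputs the share lands between roughly a quarter and two fifths of all item parser output generation.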

This could also have some impact on Commons as media info entity edits would also not have their parser output generated and cached.

Acceptance Criteria 🏕️🌟

  • HTML parser output is not generated or cached during an edit
  • While the change is being deployed monitor 1) Wikibase Parser Cache Generation 2) Wikibase & Commons save times 3) Wikibase edge entity load times
  • Consider backporting the change to the next Wikibase release (REL1_36 branch)

Event Timeline

Addshore added a project: Platform Engineering.

Adding Platform Engineering to get some input and thoughts on this topic from mediawiki folks that know more about something that might be missing, or thoughts on implementation.

We need to watch out that link tables etc. are still populated even when we don’t generate HTML. (So if I remember the interfaces correctly, we still need to generate ParserOutput with all its metadata, just without HTML inside it.)
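The constraint above (metadata is always produced, HTML only on request) can be sketched roughly as follows. `SketchParserOutput` and `buildEntityOutput` are hypothetical stand-ins written for this illustration; they are not the real MediaWiki or Wikibase classes, and the HTML rendering is a trivial placeholder.

```php
<?php
// Hypothetical stand-ins; these are not the real MediaWiki/Wikibase classes.

final class SketchParserOutput {
    /** @var string[] Entity IDs this page links to (would feed link tables) */
    public array $links = [];
    /** @var string|null Rendered HTML, or null when generation was skipped */
    public ?string $html = null;
    /** @var bool Whether this output may be stored in the parser cache */
    public bool $cacheable = true;
}

/**
 * @param string[] $referencedIds IDs referenced by the entity's statements
 */
function buildEntityOutput( array $referencedIds, bool $generateHtml ): SketchParserOutput {
    $out = new SketchParserOutput();
    // Metadata is always recorded, so link tables stay correct.
    $out->links = array_values( array_unique( $referencedIds ) );

    if ( $generateHtml ) {
        // Placeholder for the expensive part, which in practice needs
        // data from every linked entity.
        $out->html = '<ul><li>' . implode( '</li><li>', $out->links ) . '</li></ul>';
    } else {
        // Output lacking HTML must not be served from the parser cache.
        $out->cacheable = false;
    }
    return $out;
}

$out = buildEntityOutput( [ 'Q1', 'Q2', 'Q1' ], false );
var_dump( $out->links, $out->html, $out->cacheable );
```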

It's more @daniel's area, but for what it's worth, I'd rather review this once there is a patch in Gerrit, since that will clarify things, and it sounds like it will be a pretty small change.

Addshore triaged this task as Medium priority. Wed, Jul 14, 10:05 AM
Addshore updated the task description.

Change 704856 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[mediawiki/extensions/Wikibase@master] [POC] Set $generateHtml to false in EntityContent

https://gerrit.wikimedia.org/r/704856

I just want to say that I think the effect on readers will be negligible: HTML for entity content is much faster to produce than HTML for wikitext.

Also, there is a rather major benefit from this that is not mentioned in the ticket: the ParserCache database is currently under pressure, and as a result its expiry time was reduced from 30 days to 22 days. This work will clearly help with that (cc @Marostegui @LSobanski). My guess would be that this will remove around a couple of tens of millions of PC entries from there (we can check how many Wikidata entries there are; let me know if you want me to check that).

I'd definitely be interested in that number!

I just looked at our ParserCache keys (dumped three random dbs from pc1007):

  • It says pc1007 has 260M entries in total (130M parsercache values)
  • Out of those, 24.4M entries belong to Wikidata (12.2M parsercache entries)

We wouldn't fully remove all Wikidata parsercache entries, but it's safe to assume it'll be most of them, meaning roughly 10% of PC will be cleaned.
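For reference, the percentage follows directly from the two pc1007 figures above. This is just the arithmetic, assuming pc1007 is representative of the other parser cache hosts, which the sample does not establish.

```php
<?php
// Figures quoted above for pc1007, in millions of entries.
$totalEntries = 260.0;
$wikidataEntries = 24.4;

// Upper bound on what could be cleaned if every Wikidata entry went away.
$share = $wikidataEntries / $totalEntries * 100;
printf( "Wikidata share of pc1007: %.1f%%\n", $share );
```

Since not every Wikidata entry would disappear, the real reduction would sit somewhat below this figure.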

Change 704856 abandoned by Ladsgroup:

[mediawiki/extensions/Wikibase@master] [POC] Set $generateHtml to false in EntityContent

Reason:

POC is done :D

https://gerrit.wikimedia.org/r/704856