
Do not generate full html parser output at the end of Wikibase edit requests
Closed, Resolved · Public · 13 Estimated Story Points

Description

Introduction

During Wikibase local development we noticed something odd and slow:

  1. Create some new Wikibase items
  2. Create a final Wikibase item that refers to all of the other items you just wrote
  3. derivedDataUpdater->prepareUpdate and derivedDataUpdater->doUpdates are called, triggering parser output generation at the end of the same edit request (which in Wikibase ends up doing a non-trivial amount of work)

Problem

Generation of HTML ends up being the expensive part of building the ParserOutput, as it needs to load data from a whole collection of other entities.
There is secondary storage, caching etc. in place, but ideally, when it is not needed, we would not do this extra work at all. And if it is needed for some reason, we would not do it pre-send in the API edit call.

In many cases this HTML for the ParserOutput is not used immediately after the API request, as most edits on Wikidata.org are made by bots.
Even for edits made by users, the page will not normally be reloaded after the edit, as editing generally happens in JS with on-page elements changing instead.
For third-party Wikibase users that want to do large bulk imports of data this is an even more pressing issue: they will often have fewer resources, less speed, less caching etc., and may not even have parser caching enabled, yet after each API request that edits an entity, the application will still go and generate this possibly unneeded HTML parser output.

Other details

For search indexing we already generate parser output with the generate-html => false hint as part of T239931.
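
For reference, at the Content level that hint boils down to the last parameter of getParserOutput(); here is a minimal, illustrative sketch of a metadata-only render in a 1.36-era MediaWiki (the actual search indexing code path in T239931 may differ):

// Illustrative sketch only: render a page's content without HTML, so that
// links, page properties and other metadata are still recorded in the
// ParserOutput. Assumes the MediaWiki 1.36-era Content::getParserOutput()
// signature and that $wikiPage is a WikiPage for the entity.
$content = $wikiPage->getContent();
$parserOutput = $content->getParserOutput(
    $wikiPage->getTitle(),
    $wikiPage->getLatest(),
    ParserOptions::newCanonical( 'canonical' ),
    false // $generateHtml = false, i.e. the generate-html => false hint
);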

We already have a split parser cache, but the canonical parser output is used for en (English) page views, for example:

  • canonical wikidatawiki:pcache:idhash:3369-0!termboxVersion=1!wb=3
  • other wikidatawiki:pcache:idhash:3369-0!termboxVersion=1!userlang=pt!wb=3

Possible solution

If MediaWiki could ask the content type whether HTML should be generated for it post-edit, then we would be able to say that a generate-html => false hint should be used for Wikibase content in most cases.
When such a hint is used to generate content, we could add to EntityContent::getParserOutput a similar hack for the case when no HTML is generated: https://gerrit.wikimedia.org/g/mediawiki/extensions/Wikibase/+/c566765494f366c06b56a4ae3a2257d378b93222/repo/includes/Content/EntityContent.php#168
We already have a "hack" or two that change whether the parser output is cached, such as https://gerrit.wikimedia.org/g/mediawiki/extensions/Wikibase/+/c566765494f366c06b56a4ae3a2257d378b93222/repo/includes/Content/EntityContent.php#180

if ( !$generateHtml ) {
    // avoid writing a ParserOutput without HTML into the parser cache
    $out->updateCacheExpiry( 0 );
}
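
For illustration, this is roughly the shape such a hint could take (and, per the timeline below, roughly the shape it eventually took with generateHTMLOnEdit()); method names and signatures here are just a sketch, not the final code:

// Sketch only: a new ContentHandler method that content types can override.
// In core, ContentHandler would keep the current behaviour by default:
class ContentHandler {
    // ...
    public function generateHTMLOnEdit(): bool {
        return true;
    }
}

// In Wikibase, EntityHandler would opt out, so the post-edit render can be
// done with $generateHtml = false:
class EntityHandler extends ContentHandler {
    // ...
    public function generateHTMLOnEdit(): bool {
        return false;
    }
}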

We could then either:

  1. Not generate this ParserOutput at all and wait for a user to trigger the render on page view?
  2. Send a job to the job queue to generate it if we do want it? (sketched below)
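
A hedged sketch of option 2, assuming a hypothetical job type (the job name and parameters are made up for illustration; only the JobQueueGroup/JobSpecification calls are real MediaWiki APIs of that era):

// Hypothetical: queue a low-priority job to (re)generate and cache the entity
// HTML later, instead of doing it pre-send during the edit request. Assumes
// $wikiPage and $revisionRecord are available in the edit code path.
JobQueueGroup::singleton()->lazyPush(
    new JobSpecification( 'wikibaseRegenerateEntityHtml', [
        'pageId' => $wikiPage->getId(),
        'revId' => $revisionRecord->getId(),
    ] )
);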

We could also do other things, such as only pre-rendering very expensive pages (100+ statements etc.).

Predicted impact

All Wikibase content edits would do less work and be faster.
Memcached (for terms caching) would likely have a slightly lower load, as would s8 terms related tables, which are used during parser output generation.

Currently, per https://grafana.wikimedia.org/d/FxKUKqUik/wikibase-parseroutputgenerator?orgId=1, item parser output is generated roughly 2.5-3k times per minute. Roughly 700-1k of those per minute likely come from edits.

This could also have some impact on Commons as media info entity edits would also not have their parser output generated and cached.

Acceptance Criteria 🏕️🌟

  • HTML parser output is not generated or cached during an edit
  • While the change is being deployed monitor 1) Wikibase Parser Cache Generation 2) Wikibase & Commons save times 3) Wikibase edge entity load times
  • Consider backporting the change to the next Wikibase release (REL1_36 branch)

Event Timeline

Addshore added a project: Platform Engineering.

Adding Platform Engineering to get some input and thoughts on this topic from MediaWiki folks who know more, in case there is something we might be missing, or for thoughts on the implementation.

We need to watch out that link tables etc. are still populated even when we don’t generate HTML. (So if I remember the interfaces correctly, we still need to generate ParserOutput with all its metadata, just without HTML inside it.)

It's more @daniel's area, but for what it's worth, I'd rather review this once there is a patch in Gerrit, since that will clarify things, and it sounds like it will be a pretty small change.

Addshore triaged this task as Medium priority. Jul 14 2021, 10:05 AM
Addshore updated the task description.

Change 704856 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[mediawiki/extensions/Wikibase@master] [POC] Set $generateHtml to false in EntityContent

https://gerrit.wikimedia.org/r/704856

I just want to say that I think the effect on readers will be negligible; HTML of entity content is much faster to produce compared to wikitext.

Also, there is a rather major benefit from this that is not mentioned in the ticket: the ParserCache database is currently under pressure, and as a result its expiry time was reduced from 30 days to 22 days; this work will clearly help with that. (cc @Marostegui @LSobanski) My guess would be that this will remove around a couple of tens of millions of PC entries from there (we can check how many wikidata entries are in there; let me know if you want me to check that).

I'd definitely be interested in that number!

I just looked at our ParserCache keys (dumped three random dbs from pc1007):

  • It says pc1007 has 260M entries in total (130M parsercache values)
  • Out of those, 24.4M entries belong to wikidata (12.2M parsercache entries)

We wouldn't fully remove all wikidata parsercache entries, but it's safe to assume it'll be most of them, meaning around 10% of PC will be cleaned.

Change 704856 abandoned by Ladsgroup:

[mediawiki/extensions/Wikibase@master] [POC] Set $generateHtml to false in EntityContent

Reason:

POC is done :D

https://gerrit.wikimedia.org/r/704856

This is waaaay more complicated than it looks. I'll try to summarize my findings here:

  • First, if we fix this, taking an edit to the sandbox as an example, the wall time will be reduced by 33% and CPU time by 40% (data)
  • The code for saving edits in MediaWiki grew organically and it's quite convoluted. The part that helps in figuring out what's going on is MediaWiki\Storage\DerivedPageDataUpdater::doParserCacheUpdate
  • I did an example edit and checked it in XHGui so I could follow the path of function calls. The calls for rendering the revision come from four places. Two are from DerivedPageDataUpdater, which needs a refactor to make this work, and two are from MediaWiki\Edit\PreparedEdit::getOutput, which is not deprecated itself, but its callers are. All usages seem to boil down to the SpamBlacklist extension. I don't know how hard it would be to fix that.
  • There is a twist in refactoring DerivedPageDataUpdater. We basically can't tell it to skip adding the updates just because it's an EntityContent, because it operates on a page, and a WikiPage can have multiple slots with different contents (and in the case of Commons we actually have this case). I can think of checking all contents and, if even one of them isn't set to skip HTML generation, queueing the updates (sketched below).
  • This might cause issues on edits being done through special pages (e.g. ChronologyProtector). If we can avoid that and e.g. generate HTML in these cases, it'd be amazing. Will see.

I'll try to do more work and create sub-tickets. Any ideas are welcome.
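
A rough sketch of that multi-slot check (names are illustrative and this is only a fragment, not the final core implementation):

// Rough sketch of a method inside DerivedPageDataUpdater: only skip HTML
// generation if every slot's content handler opts out, since a WikiPage can
// carry several slots with different content models (e.g. MediaInfo alongside
// wikitext on Commons).
private function shouldGenerateHTMLOnEdit( RevisionRecord $revision ): bool {
    foreach ( $revision->getSlots()->getSlots() as $slot ) {
        $handler = $this->contentHandlerFactory
            ->getContentHandler( $slot->getModel() );
        if ( $handler->generateHTMLOnEdit() ) {
            return true;
        }
    }
    return false;
}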

Change 713268 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] [WIP] Introduce concept of generateHTMLOnEdit() for Content

https://gerrit.wikimedia.org/r/713268

Change 713649 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/Wikibase@master] Set EntityHandler::generateHTMLOnEdit to false

https://gerrit.wikimedia.org/r/713649

Change 714671 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.37.0-wmf.20] Introduce concept of generateHTMLOnEdit() for ContentHandler

https://gerrit.wikimedia.org/r/714671

Change 714672 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.37.0-wmf.19] Introduce concept of generateHTMLOnEdit() for ContentHandler

https://gerrit.wikimedia.org/r/714672

Change 714671 merged by jenkins-bot:

[mediawiki/core@wmf/1.37.0-wmf.20] Introduce concept of generateHTMLOnEdit() for ContentHandler

https://gerrit.wikimedia.org/r/714671

Mentioned in SAL (#wikimedia-operations) [2021-08-25T10:03:19Z] <ladsgroup@deploy1002> Synchronized php-1.37.0-wmf.20/includes: Backport: [[gerrit:714671|Introduce concept of generateHTMLOnEdit() for ContentHandler (T285987)]] (duration: 02m 17s)

Change 713268 merged by jenkins-bot:

[mediawiki/core@master] Introduce concept of generateHTMLOnEdit() for ContentHandler

https://gerrit.wikimedia.org/r/713268

Change 714672 merged by jenkins-bot:

[mediawiki/core@wmf/1.37.0-wmf.19] Introduce concept of generateHTMLOnEdit() for ContentHandler

https://gerrit.wikimedia.org/r/714672

Mentioned in SAL (#wikimedia-operations) [2021-08-25T10:47:48Z] <ladsgroup@deploy1002> Synchronized php-1.37.0-wmf.19/includes/content/ContentHandler.php: Backport: [[gerrit:714672|Introduce concept of generateHTMLOnEdit() for ContentHandler (T285987)]], Part I (duration: 01m 08s)

Mentioned in SAL (#wikimedia-operations) [2021-08-25T10:49:33Z] <ladsgroup@deploy1002> Synchronized php-1.37.0-wmf.19/includes/Storage/DerivedPageDataUpdater.php: Backport: [[gerrit:714672|Introduce concept of generateHTMLOnEdit() for ContentHandler (T285987)]], Part II (duration: 01m 04s)

Change 714674 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/Wikibase@wmf/1.37.0-wmf.19] Set EntityHandler::generateHTMLOnEdit to false

https://gerrit.wikimedia.org/r/714674

Change 714675 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/Wikibase@wmf/1.37.0-wmf.20] Set EntityHandler::generateHTMLOnEdit to false

https://gerrit.wikimedia.org/r/714675

Change 713649 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Set EntityHandler::generateHTMLOnEdit to false

https://gerrit.wikimedia.org/r/713649

Change 714675 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@wmf/1.37.0-wmf.20] Set EntityHandler::generateHTMLOnEdit to false

https://gerrit.wikimedia.org/r/714675

Mentioned in SAL (#wikimedia-operations) [2021-08-25T17:14:25Z] <ladsgroup@deploy1002> Synchronized php-1.37.0-wmf.20/extensions/Wikibase/repo/includes/Content/EntityHandler.php: Backport: [[gerrit:714675|Set EntityHandler::generateHTMLOnEdit to false (T285987)]] (duration: 01m 18s)

Change 714674 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@wmf/1.37.0-wmf.19] Set EntityHandler::generateHTMLOnEdit to false

https://gerrit.wikimedia.org/r/714674

Mentioned in SAL (#wikimedia-operations) [2021-08-25T17:46:34Z] <lucaswerkmeister-wmde@deploy1002> Synchronized php-1.37.0-wmf.19/extensions/Wikibase/repo/includes/Content/EntityHandler.php: Backport: [[gerrit:714674|Set EntityHandler::generateHTMLOnEdit to false (T285987)]] (duration: 01m 06s)

It helped with the size of the binlog on ParserCache servers (https://grafana.wikimedia.org/d/000000106/parser-cache?viewPanel=6&orgId=1&from=1629900681358&to=1629987081359)

image.png (968×1 px, 113 KB)

The actual impact on the DB is a bit hard to measure since it won't show up unless you optimize this monster of a table. We might still do that before the DC switchover (since it doesn't require depooling and repooling), but we need to wait 20 days for the old data to be properly cleaned up.

The effect on rendering and edit time savings is highly dependent on 701074 and the rest; also, since these are post-edit actions, it wouldn't have much impact on the edit saving time, but it would help with imports and batch edits.

At least when the testing on local Wikibase was done, this should have an impact on API response times for users, as the parser cache generation was done post-edit but pre-send of the API response.
I wonder if we can see that anywhere.

I was half expecting to see some sort of impact on https://grafana.wikimedia.org/d/FxKUKqUik/wikibase-parseroutputgenerator, but we should not, as ParserOutput objects are still generated, just without HTML?

On the whole I am very happy that this task led us to T288639: SpamBlacklistHooks::onEditFilterMergedContent causes every edit to be rendered twice too, which brought very significant improvements

image.png (1×3 px, 831 KB)

Also, looking at average API response times for August, we can see the impact there for Wikibase editing API calls.

SELECT day, hour, avg(backend_time_ms) as avg_response, count(backend_time_ms) as num_requests
FROM event.mediawiki_api_request
WHERE `database` = "wikidatawiki"
AND params["action"] RLIKE '^wbl?(create|edit|set|add|remove|link|merge)'
AND api_error_codes IS NULL
AND YEAR = 2021 AND MONTH = 8
GROUP BY DAY, HOUR
LIMIT 1000

image.png (742×2 px, 116 KB)

This currently only has data up to 25 August, but we may be able to see more of the effect of the changes that happened on the 25th in the coming days.
(So I'm going to leave this open in verification for a bit longer.)