
Client pages of Wikidata should be able to track their usage of WD data using Parsoid
Open, Needs Triage, Public

Description

Notes from a meeting between @MSantos and the Wikidata for Wikimedia Projects team (aka Integrations). We're looking forward to more collaboration and knowledge sharing : )

We at WMDE are concerned that Parsoid cannot currently manage usage tracking when Wikipedia (client) pages use Wikidata (repo) data. Although the legacy parser can handle this work until Parsoid becomes the sole default parser, we can see from changes by the Abstract Wikipedia team that Parsoid will need an intervention to handle the following tasks:

  1. Purging cache to remove stale data in client pages after it is updated in WD
  2. Notifying users of changes to WD items (vandalism moderation)
  3. Maintaining backlinks to client pages in WD and sitelinks between the projects

This is currently managed in ParserOutput by storing an array of wikibase-entity-usage entries in the extension data (see extensions/Wikibase/client/includes/Usage/ParserOutputUsageAccumulator.php for more). Some questions we found useful in this research are:

  • Can a "top-level context" be managed? It seems collectMetaData is the new equivalent of ParserOutput, which is what the AbWp team uses to create their usage accumulator
    • But, as they also noted, collectMetaData is likely different for each fragment → each usage accumulator would also be separate and could record duplicate usages
  • How does the merge strategy in appendExtensionData for ParserOutput work? With MW_MERGE_STRATEGY_UNION, could fragments that append to the same extension data key have their values overwritten? Could they duplicate the exact same usages? (e.g. if fragment 1 adds ["wbc_entity_usage" => "L.en"] and fragment 2 appends ["wbc_entity_usage" => "L.fr"], can we be sure that the ParserOutput would know to track both L.en and L.fr?)
    • If fragments could overwrite each other, we could lose tracking for the majority of a page's usages and risk displaying stale / low-quality data
  • How can we enforce usage limits? For performance reasons, we have limits such as 'entityAccessLimit' => 500, 'referencedEntityIdAccessLimit' => 3, and $wgExpensiveParserFunctionLimit
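To reason about the union-merge question above, here is a minimal standalone sketch in Python (not MediaWiki code; the merge helper and key layout are assumptions) of what union semantics would mean for fragments appending to the same extension-data key:

```python
# Hypothetical model of per-fragment extension data merged with a
# union strategy. This is NOT MediaWiki code, only a sketch of the
# semantics being asked about.

def merge_union(base, other):
    """Combine two extension-data values as a union of their keys."""
    merged = dict(base)
    merged.update(other)
    return merged

# Fragment 1 records a usage of the English label, fragment 2 the French one.
fragment1 = {"wbc_entity_usage": {"Q42#L.en": True}}
fragment2 = {"wbc_entity_usage": {"Q42#L.fr": True}}

page = {}
for fragment in (fragment1, fragment2):
    for key, value in fragment.items():
        page[key] = merge_union(page.get(key, {}), value)

print(sorted(page["wbc_entity_usage"]))  # ['Q42#L.en', 'Q42#L.fr']
```

Under these semantics, distinct usages from different fragments are both preserved, and identical usages collapse into one entry rather than overwriting or double-counting each other.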


Event Timeline

Some updates from a meeting this week with @JoelyRooke-WMDE . Briefly: WMDE expects to be able to work on this in Q4 WMF (Q2 WMDE), which ought to be OK since Parsoid will not be tackling T393716: [EPIC] RefreshLinksJob should use Parsoid-generated metadata until the next (WMF) fiscal year.

It is indeed expected that the metadata collected by each fragment is more or less independent, with the ::appendExtensionData() method allowing it to be combined with metadata from other fragments. We're adding the "SUM" merge strategy to handle numeric usage data (T403621).
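As a rough illustration, a "SUM" merge strategy would combine per-fragment numeric values like the following Python sketch (the counts and the helper are hypothetical, not the MediaWiki implementation):

```python
# Hedged sketch of a "SUM" merge strategy for numeric extension data:
# each fragment reports a count, and the page total is their sum.

def merge_sum(values):
    """Combine per-fragment numeric metadata by summation."""
    return sum(values)

# Illustrative per-fragment access counts.
fragment_counts = [120, 45, 300]
page_total = merge_sum(fragment_counts)
print(page_total)  # 465
```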

Once this data is in the ParserOutput extension data, there is a "new"(ish!) mechanism for "secondary data updates" used to transfer it to a database so that Wikibase can use it to drive purging, etc. I don't know the details, but anecdotally I believe Wikibase is currently using a mechanism driven directly by the legacy parser instead. I've opened T403902: Update Wikibase to use Secondary Data Update mechanism for the task of updating Wikibase to use the Secondary Data Update mechanism introduced by MCR, and will try to facilitate some meetings to fill out that task description better.

There are also some issues with enforcing usage limits in a selective-update scenario, which are described in T354877: Track usage limits in ParserOutput (or elsewhere). It is probably worth introducing new fine-grained usage limits: for example, if the total usage limit for the entire page is 1,000,000, you might also introduce an additional limit that each individual transclusion/fragment cannot exceed, say 10,000. This would allow selective update to reject excessive usage earlier, instead of (conservatively) allowing a given fragment to use the entire 1,000,000 limit and only determining that the page limit was exceeded when results are combined.
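The two-level limit idea can be sketched as follows (a standalone Python model with illustrative limits, not actual MediaWiki configuration):

```python
# Sketch of two-level usage limits under selective update: a hypothetical
# per-fragment cap lets a single fragment fail fast, while the page-level
# cap is only checkable once all fragments are merged. Both limits are
# illustrative numbers from the discussion, not real configuration.

PAGE_LIMIT = 1_000_000
FRAGMENT_LIMIT = 10_000

def check_fragment(usage_count):
    """Reject a single fragment early if it exceeds the per-fragment cap."""
    return usage_count <= FRAGMENT_LIMIT

def check_page(fragment_counts):
    """Page-level check, only possible after merging all fragments."""
    return sum(fragment_counts) <= PAGE_LIMIT

fragments = [9_500, 400, 12_000]
print([check_fragment(c) for c in fragments])  # [True, True, False]
print(check_page(fragments))                   # True
```

Note the asymmetry: the third fragment can be rejected during its own (selective) re-parse, even though the page-level total would still have been within bounds.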

Finally (and this is related to Parsoid read views on Wikidata, not metadata update), there's also T403904: HtmlPageLinkRendererEnd hook is not (will not be) supported by Parsoid, although (again, anecdotally) I was told that Wikibase "hardly" uses this anymore and instead uses templates to customize link titles, which is also the path WikiLambda took.

I've been asked to comment on this. I see two different discussions. One is on the storage of the usage in ExtensionData in ParserOutput, on which I don't have any opinions; if you want to store it in ParserCache, it could technically work and we do have the capacity. But there is also the long-term storage of this data, which currently happens in wbc_entity_usage. That table has grown quite large over the years, and I would like to move it to either x1 or Cassandra (granted we build a better integration of Cassandra and MW), but that discussion seems orthogonal to this.

@Ladsgroup Could you elaborate a bit on the wbc_entity_usage size issue? It would be nice to more tightly integrate Wikibase's link tables with core's, rather than have a separate Cassandra database for this. Is it feasible to expect that x1 or Cassandra storage could be used by the LinksUpdate jobs in the time frame of this task (before the end of the next fiscal year at the latest)? Is wbc_entity_usage size going to be an issue within that time frame?

Framed another way, can we separate these tasks, or should we (eg) rule out using the 'normal' LinksUpdate mechanism for wikibase because of the size of the tables involved?

@Ladsgroup Could you elaborate a bit on the wbc_entity_usage size issue?

It's one of the biggest tables in every core section, and it's a scalability problem in general. We have sections that are too large and growing without bound, and we can't vertically scale our databases anymore. Core databases are for MediaWiki core, and extension clusters are for extension tables. In some cases, when the data is small, it's fine to keep it in core DBs, but the default should be the extension clusters, not the other way around (the linter tables are similar: quite big and also quite heavy in terms of writes; both should go to x1).

I have written a long essay about this: https://docs.google.com/document/u/1/d/15JdZMWjOAzd00wR7aeBmLHrSG2sfn9S_I11S0qFAYjI/edit

It would be nice to more tightly integrate wikibase's link tables with cores, rather than have a separate cassandra database for this.

What are the usecases? Maybe they are already supported.

Is it feasible to expect that x1 or cassandra storage could be used by the LinksUpdate jobs in the time frame of this task (before the end of next fiscal year at the latest)?

x1 is already mostly ready; virtual domains have nicely abstracted away the distinction between x1 and core DBs, and many extensions already use that. The only problem is the lack of replication to WMCS, which we are fixing and have already bought the hardware for.

Cassandra support requires building a better integration of it with MW, which is going to take a while, and I have no control over that.

Is wbc_entity_usage size going to be an issue within that time frame?

We have knobs to make sure the table won't be an issue (by changing the threshold for collapsing too many aspects into one general aspect), but one big reason to migrate this table away from core is to allow us to bump that threshold and track Wikidata usage more granularly.
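The collapsing knob described above might be modeled like this (a hedged Python sketch; the threshold value and the use of "X" as the general "all aspects" code are illustrative, not Wikibase's actual configuration):

```python
# Sketch of aspect collapsing: once a page uses more than N distinct
# aspects of one entity, the individual per-aspect rows are replaced by
# a single general-aspect row ("X" here), trading granularity for table
# size. Threshold and aspect codes are illustrative.

COLLAPSE_THRESHOLD = 3

def collapse_aspects(aspects):
    """Return the set of usage aspects to record for one entity."""
    if len(aspects) > COLLAPSE_THRESHOLD:
        return {"X"}  # one general row instead of many fine-grained ones
    return set(aspects)

print(sorted(collapse_aspects(["L.en", "L.fr", "D.en"])))       # within threshold
print(sorted(collapse_aspects(["L.en", "L.fr", "D.en", "S"])))  # collapsed
```

Raising the threshold keeps more fine-grained rows (better purging precision) at the cost of a bigger table, which is why the storage migration and the granularity question are linked.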

Framed another way, can we separate these tasks, or should we (eg) rule out using the 'normal' LinksUpdate mechanism for wikibase because of the size of the tables involved?

I don't follow this question. LinksUpdate is an updating mechanism; what we have are storage issues and tables that are too big. I don't see how the two are connected. Sorry if I'm missing something obvious.