Notes from a meeting between @MSantos and the Wikidata for Wikimedia Projects team (aka Integrations). We're looking forward to more collaboration and knowledge sharing : )
We at WMDE are concerned that Parsoid does not currently have the capability to manage usage tracking when Wikipedia (client) pages use Wikidata (repo) data. The legacy parser can handle this work for now, but judging from changes made by the Abstract Wikipedia team, Parsoid will need an intervention before it becomes the sole default parser in order to handle the following tasks:
- Purging the cache so that stale data is removed from client pages after the data is updated on Wikidata
- Notifying users of changes to WD items (vandalism moderation)
- Maintaining backlinks to client pages in WD and sitelinks between the projects
This is currently managed in ParserOutput by storing an array of wikibase-entity-usage entries in the extension data (see extensions/Wikibase/client/includes/Usage/ParserOutputUsageAccumulator.php for details). Some questions we found useful in this research are:
- Is there / can there be a "top-level context"? It seems collectMetaData is the new equivalent of writing to ParserOutput, and it is what the AbWp team use to create their usage accumulator
  - But as they also noted, collectMetaData is likely separate for each fragment → each fragment would get its own usage accumulator, which could record duplicate usages
- How does the merge strategy in appendExtensionData for ParserOutput work? With MW_MERGE_STRATEGY_UNION, could fragments that append to the same extension data key overwrite each other's values? Could they duplicate the exact same usages? (e.g. if fragment 1 appends "L.en" under "wbc_entity_usage" and fragment 2 appends "L.fr" under the same key, can we be sure that the ParserOutput would know to track both L.en and L.fr?)
  - If fragments could overwrite each other, we could lose usage tracking for the majority of a page's usages and risk displaying stale / low-quality data
- How can we enforce usage limits? For performance reasons, we have limits such as `'entityAccessLimit' => 500`, `'referencedEntityIdAccessLimit' => 3`, and `$wgExpensiveParserFunctionLimit`
  - This issue is detailed in https://phabricator.wikimedia.org/T354877
  - If we can't enforce these limits, some pages will load slowly or time out after 30 seconds, though the limits are probably not often reached
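To make the merge-strategy question above concrete: the behavior we would *hope* MW_MERGE_STRATEGY_UNION provides is that appending to the same extension data key from different fragments unions the values rather than overwriting them. Here is a minimal sketch (in Python, purely illustrative; the real code is PHP in ParserOutput, and `union_merge` is a hypothetical helper, not a MediaWiki API) of those assumed semantics, reusing the `wbc_entity_usage` / L.en / L.fr example from the question:

```python
# Sketch of the union-merge behaviour we would HOPE to get from
# MW_MERGE_STRATEGY_UNION when two fragments append to the same
# extension data key. This is an assumption about the semantics to be
# verified against ParserOutput, not MediaWiki's actual implementation.

def union_merge(a: dict, b: dict) -> dict:
    """Merge two extension-data maps, unioning the values per key."""
    merged = {}
    for key in a.keys() | b.keys():
        # Union collapses exact duplicates and loses neither fragment's data.
        merged[key] = sorted(set(a.get(key, [])) | set(b.get(key, [])))
    return merged

fragment1 = {"wbc_entity_usage": ["L.en"]}
fragment2 = {"wbc_entity_usage": ["L.fr", "L.en"]}  # L.en duplicated across fragments

print(union_merge(fragment1, fragment2))
# {'wbc_entity_usage': ['L.en', 'L.fr']}
```

If the real strategy behaves like this, fragments cannot silently drop each other's usages and exact duplicates collapse into a single entry; confirming (or refuting) exactly this is the open question for Parsoid.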
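On the usage-limits question, the core difficulty is where the counter lives: if each Parsoid fragment has its own accumulator, per-fragment counters could let a page exceed a page-wide limit like `'entityAccessLimit' => 500`. A hypothetical sketch (class and method names are illustrative, not the real ParserOutputUsageAccumulator API) of an accumulator that refuses usages once a distinct-entity limit is hit:

```python
# Hypothetical accumulator enforcing a limit analogous to Wikibase's
# 'entityAccessLimit' => 500. Illustrative only: the real accumulator
# is PHP and its API differs. The point is that the counter must be
# page-wide; one instance per fragment would defeat the limit.

class LimitedUsageAccumulator:
    def __init__(self, entity_access_limit: int = 500):
        self.entity_access_limit = entity_access_limit
        self.usages: set[str] = set()  # entries like "Q42#L.en"

    def add_usage(self, entity_id: str, aspect: str) -> bool:
        """Record a usage; refuse once the distinct-entity limit is reached."""
        distinct_entities = {u.split("#")[0] for u in self.usages}
        if entity_id not in distinct_entities and \
                len(distinct_entities) >= self.entity_access_limit:
            return False  # over limit: stop accessing further entities
        self.usages.add(f"{entity_id}#{aspect}")
        return True

acc = LimitedUsageAccumulator(entity_access_limit=2)
print(acc.add_usage("Q1", "L.en"))  # True
print(acc.add_usage("Q2", "L.fr"))  # True
print(acc.add_usage("Q3", "L.de"))  # False: third distinct entity exceeds limit
```

With one shared instance the limit holds across all accesses; with one instance per fragment, each fragment would independently count up to the limit, which is exactly the risk we want to raise for the fragment-based design.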