
What proportion of a Wikipedia article's edit history might reasonably be changes via Wikidata transclusion?
Open, Needs Triage, Public

Description

Goal: empirically understand gaps in patrolling, specifically how often editors cannot see that a Wikipedia article has changed because a Wikidata item transcluded in it was edited.

Background:

General approach:

  • Use wbc_entity_usage MariaDB table to determine what items/properties are transcluded in which Wikipedia articles
  • For each Wikipedia article, compute the ratio of edits to the article : edits to the Wikidata properties transcluded.
    • This is slightly tricky because wbc_entity_usage does not track when the property started to be transcluded. Potentially one could grab the current snapshot of the table and then wait a month and gather the data for that month only.
    • Also we would need to understand the potential for new properties to be added to a Wikidata item that would automatically be transcluded -- e.g., via infobox templates.
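The counting step above could be sketched roughly as follows. This is only an illustration under stated assumptions: usage rows are given as (eu_page_id, eu_entity_id, eu_aspect) tuples already pulled from wbc_entity_usage, Wikidata edits in the observation window are given as (entity_id, property_id) pairs, and the function names are made up for this sketch. It returns both counts rather than a single ratio so that pages with zero edits on either side are not a problem.

```python
from collections import defaultdict


def transcluded_properties(usage_rows):
    """Map page_id -> set of (entity_id, property_id) pairs.

    usage_rows: iterable of (eu_page_id, eu_entity_id, eu_aspect) tuples.
    Only aspects of the form 'C.P####' name a specific property; a bare 'C'
    (or any other aspect) is ignored here because it names no property.
    """
    used = defaultdict(set)
    for page_id, entity_id, aspect in usage_rows:
        if aspect.startswith("C.P"):
            used[page_id].add((entity_id, aspect[2:]))  # 'C.P569' -> 'P569'
    return used


def edit_counts(page_id, article_edit_count, used, wikidata_edits):
    """Return (article edits, Wikidata edits touching a transcluded property).

    wikidata_edits: iterable of (entity_id, property_id) pairs observed in
    the same time window as article_edit_count (hypothetical input shape).
    """
    tracked = used.get(page_id, set())
    relevant = sum(1 for edit in wikidata_edits if edit in tracked)
    return article_edit_count, relevant
```

Because the table has no start date per row, the snapshot passed in as usage_rows would need to be taken at the beginning of the observation window, as noted above.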

Event Timeline

The RC injection process for wikidata edits into client edits is fairly simple.

RC entries are enabled for all wikis, except commons.
The setting to look out for in Wikibase is "injectRecentChanges", which is set in production by the following:

'wmgWikibaseClientInjectRecentChanges' => [
	'default' => true,
	'commonswiki' => false, // T171027
	'testcommonswiki' => false, // T171027
],

Whenever a change occurs on wikidata that impacts a client site, change propagation happens.
See the docs at https://doc.wikimedia.org/Wikibase/master/php/md_docs_topics_change-propagation.html

The process of recent changes injection is done in a job and goes roughly as follows:

  • Loop through a batch of affected titles
  • Look in recent changes to see whether an entry already exists for the title we are working with that was caused by the same edit on Wikidata (skip the title if this is the case)
  • Commit the RC entry to DB

The approach described in the description is likely to get you some fairly unreliable data.

Thanks for the additional details @Addshore !

Some context: this task isn't being worked right now. I just created it as a potential future analysis because I had just become aware that Wikidata item properties were tracked specifically in wbc_entity_usage and think some good numbers on this would be valuable to track.

Based on what you said, I think I should probably rename this task to focus on the edit history as that's closer to what I'm actually interested in (I now see the confusion that the current title causes) -- i.e. edits that originate on Wikidata and actually change the content of an associated Wikipedia article. The Wikidata tracking in Recent Changes feed seems to be still far too noisy for this purpose. Looking at English Wikipedia for example, the challenge with the Recent Changes feed of Wikidata edits is that almost none of those changes (I couldn't actually find any in my quick checking) actually affected the content of the page. Sitelinks obviously matter but I'm not considering them at the moment. Almost all the property changes in that feed are surfaced because the associated Wikipedia article has a generic C property on the wbc_entity_usage table (as opposed to C.P#### indicating that specific properties are being transcluded). I don't actually understand why they have that C property listed on wbc_entity_usage.
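To make the difference between a generic C entry and a C.P#### entry concrete, here is a simplified sketch of the matching as I understand it from the above. It only models statement aspects and ignores every other aspect type, so treat it as an illustration rather than the actual Wikibase logic; the function name is made up.

```python
def change_surfaces_on_page(page_aspects, changed_property):
    """Return True if a change to changed_property would surface for a page.

    page_aspects: the eu_aspect values recorded for one page and one item,
    e.g. {"C"} or {"C.P569", "C.P19"}.
    A bare "C" matches a change to any statement; "C.P####" matches only
    a change to that specific property.
    """
    for aspect in page_aspects:
        if aspect == "C":
            return True
        if aspect == f"C.{changed_property}":
            return True
    return False
```

With a generic C entry every property change matches, which is consistent with the noise observed in the feed.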

The approach described in the description is likely to get you some fairly unreliable data.

Could you provide some more details here?

Isaac renamed this task from What percentage of edits via Wikidata transclusion are missing on Recent Changes? to What proportion of a Wikipedia article's edit history might reasonably be changes via Wikidata transclusion?. Mar 2 2020, 10:04 PM

The Wikidata tracking in Recent Changes feed seems to be still far too noisy for this purpose

Yup. See T90435: [Epic] Wikidata watchlist improvements (client) and all subtasks.

I don't actually understand why they have that C property listed on wbc_entity_usage.

If we have a concrete example to look at I can try to figure that out :)
But indeed, the usage tracking for C.P### only means that a change to that statement might trigger a change to the page rendering.
Being more certain is hard due to the flexibility of Lua and parser functions.

A 'better' way to dispatch recent changes notifications might be to parse both the old and new versions of the page, compare them, and only add an RC row if the content seems to have changed.
Still not foolproof, as some other magic on the page could cause the content to change, but it would likely result in fewer RC entries.
And of course this would mean double page parsing (for cases where there is no parser cache entry) which would be undesirable.
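A minimal sketch of that render-and-compare idea, assuming a render function is available that turns page source plus entity data into output. Here toy_render is a stand-in that only substitutes {{P####}} placeholders; the real parser, Lua modules, and parser cache are not modelled, and all names are hypothetical.

```python
import re


def toy_render(wikitext, entity):
    """Stand-in renderer: replace {{P####}} with the entity's value for it."""
    return re.sub(
        r"\{\{(P\d+)\}\}",
        lambda m: entity.get(m.group(1), ""),
        wikitext,
    )


def should_add_rc_entry(render, wikitext, old_entity, new_entity):
    """Add an RC row only if the rendered output actually differs.

    This costs two renders per change, which is the double-parsing
    downside mentioned above.
    """
    return render(wikitext, old_entity) != render(wikitext, new_entity)
```

For a page that only uses {{P569}}, a change to P106 on the item would produce identical output and so no RC row under this scheme.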

If we have a concrete example to look at I can try to figure that out :)

Actually, I think I found the reason for most of the pages: https://en.wikipedia.org/wiki/Template:Authority_control
It's generic because it pulls any external identifiers, so which ones will be transcluded can't be defined in advance. And while "noise" from this example could be reduced by somehow indicating in wbc_entity_usage whether the properties used are identifiers or statements, I recognize that that doesn't solve the larger challenge.

@Isaac I suggest we have a call together with Adam to talk this through before we dive deeper into it? It seems like a worthy research area so let's make sure we look at the right things :)

@Lydia_Pintscher that makes sense and thanks for reaching out. I'm not going to schedule the meeting right now because I don't want to use up your time if we don't end up prioritizing this work, but when we do, I'll reach out!

The results reported in T249654#6352573 have some potential insight into how we think about supporting patrolling of Wikidata transclusion within Wikipedia articles so I wanted to record some of my initial thoughts here. We would want to talk with patrollers before actually thinking about implementing any of these and unfortunately I'm not actually working on this aspect of the project at the moment. However: the recent changes feed for a given article likely has many more Wikidata-related changes than are actually pertinent to an article from a patrolling standpoint. Some thoughts on reducing this noise:

  • Many entries to wbc_entity_usage are from transclusion that only generates tracking categories (e.g., Category:Coordinates on Wikidata) so arguably there should be a way to mark events on Recent Changes caused by these as tracking-only so patrollers could easily ignore them.
  • Many entries to wbc_entity_usage are from metadata templates like Authority Control and Taxonbar that are very valuable from a linked-data perspective but less so from a reader's perspective, and that have a very low potential for harmful vandalism. Because of the way both of these templates are written, they also trigger a general "statements" aspect usage, so any change to statements on the Wikidata item triggers an event on Recent Changes. This adds a lot of noise to the Recent Changes feed from Wikidata wherever these templates are used. In reality, changes to Wikidata identifiers that impact Authority Control and Taxonbar have a very low likelihood of being problematic from a reader's perspective: the external links generated via these templates go to well-curated repositories of information, so the reader should quickly realize the link is incorrect and probably won't end up viewing offensive material. Ideally these templates would be rewritten to only trigger the specific properties they transclude, but in practice I could see that being difficult, inefficient, or causing the wbc_entity_usage table to become far too large to be practical (as each usage of Authority Control would trigger close to 100 rows, 1 for each property that can be transcluded). Instead, maybe wbc_entity_usage could be expanded to distinguish between general statements (C.S?) and identifiers (C.I?)? This would make filtering out changes to identifiers far easier, and metadata templates could then still be recorded simply without causing every change to date of birth, occupation, etc. to also trigger a change. Unfortunately, I suspect this would require making non-trivial changes to the Lua modules and then convincing template coders to adapt the code.
  • Some entries to wbc_entity_usage go to generating external links that could more clearly generate harm if vandalized and probably do warrant focus from patrollers. For instance, Wikidata templates that generate links to Commons categories or external links to IMDb etc. could more clearly be abused to link to offensive material. Thankfully, given the specific nature of these templates, they generally are recorded with their specific property and so don't generate noise for patrollers. That said, a not insignificant amount of their usage (on enwiki) is only for tracking categories, so any changes that would distinguish between actual transclusion and tracking categories would serve to reduce noise for this.
  • Finally, infobox transclusion has probably the greatest potential for harm (e.g., falsifying someone's age or where they were born). This seems to be tracked pretty well for most infoboxes (the specific properties each get their own row and labels for each item that was actually transcluded) so I think it's more about reducing the noise from the above so that patrollers can more easily see these changes.
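To illustrate how the statements/identifiers split floated above might be used for filtering, here is a rough sketch. The C.S and C.I aspects do not exist today; they are the hypothetical values from the second bullet, and the function name, the labels it returns, and the small identifier set are all assumptions made for the example.

```python
# Illustrative subset only: P214 is VIAF ID, P244 is Library of Congress authority ID.
IDENTIFIER_PROPERTIES = {"P214", "P244"}


def classify_change(page_aspects, changed_property,
                    identifier_properties=IDENTIFIER_PROPERTIES):
    """Label a Wikidata change for one page, or return None if it is unused.

    "specific":   the page records C.<property> for exactly this property
    "identifier": the page records the hypothetical C.I and the property is an identifier
    "statement":  the page records the hypothetical C.S and the property is not an identifier
    """
    if f"C.{changed_property}" in page_aspects:
        return "specific"
    if changed_property in identifier_properties:
        return "identifier" if "C.I" in page_aspects else None
    return "statement" if "C.S" in page_aspects else None
```

A patrolling filter could then hide "identifier" events by default while still showing "specific" ones, such as infobox properties. A page that only uses Authority Control would record C.I alone, so a date-of-birth change would no longer surface for it.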