
Categorize different types of Wikidata re-use within Wikimedia projects
Closed, Resolved · Public

Description

Overview

Wikidata is transcluded in other Wikimedia projects in a wide variety of ways. On Wikipedia, this ranges from populating infobox values such as an individual's date of birth, date of death, or nationality (quite visible to readers and important to get correct; see BLP), to providing descriptions that show up only on mobile, to auto-generating metadata tables (e.g., a Library of Congress identifier) that are displayed at the end of the article on desktop, to simply indicating that Wikidata contains coordinates for a given article even if that article is not using them.
The goal of this task is to categorize the range of different ways in which Wikidata is transcluded within Wikipedia and begin to identify how to measure the prevalence of each type of transclusion, to add more nuance to the counts that can be derived from the wbc_entity_usage MediaWiki table. If this initial work goes well, I intend to expand it to Commons and the other projects as well.

See Also

T247099 (SQL definition for wikidata metrics for tuning session)
T246709 (What proportion of a Wikipedia article's edit history might reasonably be changes via Wikidata transclusion?)
Wikidata metrics from FY20 Q2 Tuning Session

Event Timeline

Isaac created this task. · Apr 7 2020, 7:33 PM
Restricted Application added a subscriber: Liuxinyu970226. · Apr 7 2020, 7:33 PM
Isaac added a comment. · Apr 14 2020, 3:24 PM

Interesting example: article in Portuguese Wikipedia that has reached Featured List status and is generated by ListeriaBot via Wikidata: https://pt.wikipedia.org/wiki/Lista_de_mortos_e_desaparecidos_pol%C3%ADticos_na_ditadura_militar_brasileira

Isaac added a comment. · Apr 20 2020, 3:46 PM

Weekly update: wrote up a quick summary and started collecting my thoughts on this meta page: https://meta.wikimedia.org/wiki/Research:External_Reuse_of_Wikimedia_Content/Wikidata_Transclusion. I'll continue to report weekly updates here but will attempt to document the actual process there. My intent is to do some continued searching / exploring myself for the next few weeks and then begin to reach out to others for feedback / guidance when I have some (hopefully) clear ideas / plans to share and more detailed questions to ask.

Isaac added a comment. · May 4 2020, 2:43 PM

Weekly updates: no progress.

Isaac added a comment. · May 8 2020, 7:09 PM

Weekly update: no progress

Isaac added a comment. · May 15 2020, 6:34 PM

Weekly update: no progress.

Isaac added a comment. · May 22 2020, 4:25 PM

Weekly update: no progress.

Isaac added a comment. · May 29 2020, 7:58 PM

Weekly update: no progress

Isaac added a comment. · Jun 5 2020, 6:08 PM

Weekly update: no progress

Isaac added a comment. · Jun 12 2020, 6:32 PM

Weekly update: no progress

Isaac added a comment. · Jun 19 2020, 4:07 PM

Weekly update: no progress. End date pushed to August 31st (Betterworks updated).

Weekly update: began process of systematically identifying main ways in which Wikidata is transcluded in enwiki and determining how they affect the wbc_entity_usage table. Had been inspecting the table for various examples to identify patterns but I just realized that I could probably use a sandbox page to actually verify without being disruptive. Also coding each instance with these criteria.

@Nuria: following up on T247099#6346344 here as this seems a more relevant task. I provide high-level details below regarding the nature of Wikidata transclusion on English Wikipedia. Here is a more thorough description of how I came to my conclusions regarding the importance of the different types of Wikidata transclusion that occur. @Addshore @GoranSMilovanovic @Lydia_Pintscher FYI in case you're interested, as I know you're well aware of the limits of wbc_entity_usage for measuring Wikidata transclusion in articles. I'm very open to feedback so let me know if you see any mistaken assumptions etc.

Methods

I looked at 100 random English Wikipedia articles (eventually I'd like to expand this but English was easiest for me to figure out in this initial analysis) and evaluated them as described below. I also attempted to automate the evaluations that I was making (not always possible / easy) and apply them to all of enwiki to make sure my tiny 100-article sample was representative of the whole wiki.

For each example of Wikidata transclusion I evaluated, I scored it based upon three criteria:

  • how many platforms it showed up on -- i.e. desktop, mobile, apps
  • how salient it was on the page -- i.e. infobox vs. article body vs. end boxes
  • what impact it presumably has (or could have) on the reader -- i.e. fact vs. link vs. maintenance category.

I then placed each type of transclusion into high-, medium-, and low-importance categories depending on these criteria.
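
As a rough illustration of this coding scheme (not the actual coding sheet), each instance can be recorded with the three criteria and then bucketed. The field names, the example instances, and the bucketing rule are all assumptions; the rule is inferred from how the Results group things (infobox facts as high, reader-facing links as medium, everything else as low).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransclusionExample:
    """One coded instance of Wikidata transclusion (hypothetical field names)."""
    name: str
    platforms: frozenset  # subset of {"desktop", "mobile", "apps"}
    salience: str         # "infobox", "body", or "end_box"
    impact: str           # "fact", "link", or "maintenance"


def importance(example: TransclusionExample) -> str:
    """Assumed rule: infobox facts are high, links are medium, the rest is low."""
    if example.impact == "fact" and example.salience == "infobox":
        return "high"
    if example.impact == "link":
        return "medium"
    return "low"


# Illustrative instances only; these are not rows from the 100-article sample.
examples = [
    TransclusionExample("infobox date of birth",
                        frozenset({"desktop", "mobile", "apps"}), "infobox", "fact"),
    TransclusionExample("external link template",
                        frozenset({"desktop", "mobile"}), "end_box", "link"),
    TransclusionExample("tracking category",
                        frozenset({"desktop"}), "end_box", "maintenance"),
]

for ex in examples:
    print(f"{ex.name}: {importance(ex)}")
```

The platform count is stored but not used by this simplified rule; a fuller version would presumably weigh it too.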

Results

The estimate currently used based on wbc_entity_usage for Wikidata transclusion on enwiki is that 61.95% of articles transclude Wikidata (and 99.9% have associated Wikidata items). I found that the 61.95% statistic is far too optimistic. Specifically:

  • High-importance transclusion: only 2% of enwiki articles have content inserted from Wikidata into infoboxes.
    • This could not be verified easily in the automated script because each infobox varies in which parameters have to be left unnamed for them to be filled in via Wikidata but the verification of the other statistics leaves me confident that this number is pretty accurate.
  • Medium-importance transclusion: an additional 3% of enwiki articles have other Wikidata-based content -- specifically links to external sites and Commons. Overlapping with the other categories, 54% of articles additionally use Wikidata descriptions in Search on the mobile apps (the 61.95% statistic does not track this type of usage).
    • I don't have an exact estimate for transclusion from external link templates for all of enwiki because of how I chose to count it (gonna go back and fix this) but it looks like it would be close to 3%. Of note, about half of the time that an external link template is used, it pulls the Wikidata data. The other half of the time, the appropriate link was already provided by the editor in the template.
    • A count of articles in the shortdesc category indicates that 65.5% of articles don't overwrite the Wikidata description, which is slightly higher than the 54% from my sample, so I'll have to look into that discrepancy.
  • Low-importance transclusion: an additional 21% of enwiki articles just have metadata templates and 33% just have Wikidata-based tracking categories.
    • I call these "low-importance" not because I don't find them valuable but because they have minimal impact on the average reader.
    • The full analysis of enwiki suggests that 29% of articles have metadata templates. This supports the sample figure: I actually found that 24% of sampled articles had metadata templates, but 3% also had higher-importance templates and so aren't counted here.
    • The full analysis of enwiki suggests that the transclusion on 27% of articles is just a tracking category, which is pretty close to 33%. For instance, out of 1,841,505 usages of the coordinates template, it pulled the coordinates from Wikidata only 2,730 times (the rest was just tracking) even though the wbc_entity_usage table records all of these as P625 transclusion.
  • No transclusion: in line with current estimates, 41% of articles had no templates that support transclusion whatsoever.
    • This is in line with the full analysis of wbc_entity_usage (39.05%) and my analysis of enwiki (43%).
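
A quick arithmetic check on the sample figures above, as a minimal sketch. The variable names and category labels are illustrative; the numbers are the ones reported in this comment. The 54% mobile-app description figure overlaps the other categories, so it is left out of the breakdown.

```python
# Sample-based shares of enwiki articles, in percent, as reported above.
sample_shares = {
    "high (infobox content)": 2,
    "medium (external site / Commons links)": 3,
    "low (metadata templates only)": 21,
    "low (tracking categories only)": 33,
    "none": 41,
}

total = sum(sample_shares.values())
any_transclusion = total - sample_shares["none"]
reader_facing = (sample_shares["high (infobox content)"]
                 + sample_shares["medium (external site / Commons links)"])

print(f"categories sum to {total}%")
print(f"any template-based transclusion in the sample: {any_transclusion}%")
print(f"high- or medium-importance transclusion: {reader_facing}%")

# Coordinates template: usages recorded as P625 vs. usages that pulled a value.
coord_usages = 1_841_505
coord_from_wikidata = 2_730
coord_share = 100 * coord_from_wikidata / coord_usages
print(f"coordinates actually pulled from Wikidata: {coord_share:.2f}% of usages")
```

The five mutually exclusive categories account for the whole sample. 59% of sampled articles show some template-based transclusion, close to the 61.95% estimate, but only 5% fall into the high- or medium-importance groups.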

Generalization

Enwiki is famously cautious about using Wikidata transclusion, so presumably wikis like ruwiki or cawiki have substantially more high- and medium-importance transclusion. Still, this analysis at least demonstrates that if we want to more carefully track and categorize Wikidata transclusion for the purpose of metrics, we need something much more nuanced than the wbc_entity_usage table (which is what powers the 61.95% statistic).

A few notes regarding wbc_entity_usage

This table tells MediaWiki when to update a given Wikipedia article because a Wikidata item has been edited. We've been using it for statistics as it's the best indicator available at this point of when Wikidata content is transcluded in a Wikipedia article. It is based on parser functions and Lua calls made by modules/templates that are in an article and thus has no connection to what is actually done with the Wikidata content. Code that pulls much more data than it uses (e.g., retrieving the full object is not uncommon) bloats the data from a statistics perspective. As I found, much of the Wikidata transclusion is used to generate either tracking categories or metadata templates. Only a very small proportion actually pulls Wikidata values for statements (not identifiers) and displays them in the page. I don't have any suggestions for how to better analyze the wbc_entity_usage table: connecting the aspects recorded there with what actually happens on the page requires going through the template/module code and following what happens with the data, and it would differ for every language because of the wiki-specific nature of templates. That said, using the wbc_entity_usage table coupled with this sort of more focused analysis to better contextualize the data in the table is probably sufficient for understanding the general overview of Wikidata transclusion in Wikipedia.
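
To make the limitation concrete, here is a minimal sketch of what can be read off a usage row. The rows are invented, the tuple layout (page id, entity id, aspect) is an assumption about the table's shape, and the aspect strings (e.g. "C" for statements, "C.P625" for a single property, "L.en" for a label) reflect my understanding of the format rather than a checked specification.

```python
from collections import defaultdict

# Hypothetical usage rows: (page_id, entity_id, aspect). Not real data.
rows = [
    (1, "Q42", "S"),
    (1, "Q42", "C.P625"),
    (2, "Q64", "C"),
    (2, "Q64", "L.en"),
    (3, "Q90", "O"),
]


def split_aspect(aspect):
    """Split an aspect string into (code, modifier); modifier is None if absent."""
    code, _, modifier = aspect.partition(".")
    return code, modifier or None


def pages_by_aspect(rows):
    """Map each aspect code to the set of page ids that record it."""
    pages = defaultdict(set)
    for page_id, _entity_id, aspect in rows:
        code, _modifier = split_aspect(aspect)
        pages[code].add(page_id)
    return dict(pages)


print(split_aspect("C.P625"))
print(pages_by_aspect(rows))
```

Even the most specific row here, the one for P625, only says that some template or module asked for the property. Nothing in the row indicates whether the value was displayed, compared against a local value, or used to assign a tracking category.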

Nuria added a subscriber: calbon. · Jul 31 2020, 7:11 PM

cc @calbon so he is aware of this research

Akuckartz added a subscriber: Akuckartz.
Nuria added a comment. · Jul 31 2020, 8:03 PM

> No transclusion: in line with current estimates, 41% of articles had no templates that support transclusion whatsoever

This is "overall articles for all projects", correct?

> Overlapping with the other categories, 54% of articles additionally use Wikidata descriptions in Search on the mobile apps

How is this calculated?

Do you have your selects to group by "importance" using wbc_entity_usage?

Nuria added a comment. · Jul 31 2020, 8:04 PM

And, forgot to say, THIS IS SUPER USEFUL, thanks!

Isaac added a comment. · Jul 31 2020, 8:26 PM

> This is "overall articles for all projects", correct?

It's actually just for English Wikipedia. The number from the WMDE dashboard for all Wikipedia projects is 31.99% (i.e. the complement of the 68.01% number provided under "% of Articles that use Wikidata" in the smaller table that aggregates each project family). It varies a lot by wiki too -- vecwiki seems to have almost every article with some form of Wikidata transclusion whereas 62% of articles on Japanese Wikipedia don't have a single Wikidata-based template. This data was only recently added there (see T257962).

> How is this calculated?
> Do you have your selects to group by "importance" using wbc_entity_usage?

Wikidata description usage isn't tracked on wbc_entity_usage as far as I can tell so can't be queried in any straightforward way. The way I reached the 54% number is that I checked each article in my sample to see whether the description was from Wikidata using the gadget mentioned here. On enwiki at least, Wikidata is the default unless there is a short description provided on the page, which supposedly is tracked by this category (which is how I verified this number -- you can see that the category has 2.1M pages in it, so about 1/3 of articles overwrite the Wikidata description). That said, maybe 10-20% of articles that used Wikidata didn't actually show a description because it hadn't been added in Wikidata yet.

> And, forgot to say, THIS IS SUPER USEFUL, thanks!

Thanks!!

@Isaac Thank you for this analysis - really useful!

Isaac added a comment. · Aug 6 2020, 8:01 PM

> Thank you for this analysis - really useful!

Thanks! Glad to hear :)

Additionally, I made some notes here about how these findings may inform patrolling of Wikidata transclusion (T246709#6367012) and am working on hopefully writing this up for the Wikidata Workshop.

@Isaac Let me know if you need any help. I will take a look at your notes now. Very, very useful work.

Isaac added a comment. · Aug 13 2020, 2:03 PM

@GoranSMilovanovic thanks! I'm pretty open on next steps. This work was done in part to help guide interpretation of potential WMF metrics around measuring transclusion, but I would love to see some improvements made to the way we monitor transclusion if possible too. You'll have to let me know what you see as feasible / reasonable changes though and what I can do to help make them happen. In T246709#6367012 I noted two potential improvements I could see being made, based on my very limited knowledge of how Lua / these tables work:

  • Distinguishing between standard statements and identifiers in Lua calls. If this then was reflected in wbc_entity_usage, it would be much easier to distinguish between transclusion that is part of linked open data and transclusion that is facts like birthday etc. It would also substantially reduce noise in Recent Changes because, at least in English Wikipedia, the very common metadata templates like Authority Control and Taxonbar trigger a general C aspect and so changes to any part of the Wikidata item show up in Recent Changes even when they have no impact on the article. In theory, a filter could be added to Recent Changes then to change how changes to identifiers show up in the feed.
  • I'm not sure if it's possible to distinguish between transclusion and tracking in the wbc_entity_usage table -- e.g., a parameter that can be passed with Lua calls that indicates that the property is only being used for tracking. This might just be a hacky change that isn't useful long-term, but tracking categories generate a lot of the entries in the wbc_entity_usage table and are quite different in impact from transclusion.
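
To illustrate the first suggestion, here is a minimal sketch of how usage could be split into identifiers and standard statements if the property were always recorded. The datatype lookup is a small hand-written table for illustration, not something queried from Wikidata, and `usage_kind` is a hypothetical helper rather than existing Wikibase functionality.

```python
# Hand-written, illustrative lookup of property datatypes (an assumption,
# not fetched from Wikidata).
PROPERTY_DATATYPES = {
    "P569": "time",              # date of birth
    "P625": "globe-coordinate",  # coordinate location
    "P214": "external-id",       # VIAF ID
    "P244": "external-id",       # Library of Congress authority ID
}


def usage_kind(aspect: str) -> str:
    """Classify an aspect string as identifier, statement, unknown, or non-statement."""
    code, _, prop = aspect.partition(".")
    if code != "C":
        return "non-statement"
    if not prop:
        # A general C aspect does not say which property was used.
        return "unknown"
    datatype = PROPERTY_DATATYPES.get(prop)
    if datatype is None:
        return "unknown"
    return "identifier" if datatype == "external-id" else "statement"


for aspect in ["C.P214", "C.P569", "C.P625", "C", "L.en"]:
    print(aspect, "->", usage_kind(aspect))
```

The general C aspect is the problem case: it falls into "unknown" here, which is why templates that trigger it cannot be separated from fact-bearing transclusion using the table alone.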
Isaac added a comment. · Aug 21 2020, 6:07 PM

Weekly update:

  • cleaned up the meta page a little: https://meta.wikimedia.org/wiki/Research:External_Reuse_of_Wikimedia_Content/Wikidata_Transclusion
  • this task is essentially done but I'm going to leave the task open at least another week to allow for continued discussion
  • further research steps in this space would be:
    • repeating the analysis in a language like Japanese which shows very little transclusion and a language like Catalan that presumably has much more.
    • automating the infobox portion of this analysis for enwiki
    • moving to the question raised in T246709 which is essentially applying the same taxonomy (high-, medium-, low-, no-importance) to Wikidata changes that show up in RecentChanges feeds to see how much "noise" appears in them and the best ways to provide better filters there.
Isaac updated the task description. · Aug 21 2020, 6:08 PM
Isaac closed this task as Resolved. · Fri, Aug 28, 3:00 PM