
List of most used Wikidata entities
Closed, Resolved · Public

Description

As a Wikidata editor I want to regularly check which Wikidata entities are the most used on the Wikimedia projects (measured as the number of Wikimedia pages containing data from them, similar to Special:MostTranscludedPages) so that I am aware of Wikidata's weaknesses, know how much damage a vandal can cause by modifying these entities, and can make decisions about them (watching, protecting, etc.).

Problem: Vandalism on a single Item can cause thousands of pages in several projects to appear vandalized.

Acceptance criteria:

  • I would expect to get a list of entities ordered by the number of Wikimedia pages on which they are used (most used first), with at least the following information about each of them:
    • identifier (QX, LY, etc.),
    • label in my language,
    • current protection status, and
    • number of uses (Wikimedia pages).
  • The list should be as long as technically reasonable.

Open questions:

  • Should only administrators be able to access the list?

Event Timeline

@GoranSMilovanovic that'd be lovely. Can you password-protect it for now? We should make sure this isn't used as an easy way to find out how to cause a lot of harm.

For the record: I've got some data (thanks!) and, based on it, I'm going to run a script to semi-protect the 500 most used Items (a somewhat arbitrary number) in random order, although several of these Items are already semi-protected. That represents 0.0009% of our total number of Items. Despite what the chart may suggest, the 500th most used Item still has more than 45,000 uses. This measure isn't a panacea, and many thousands of Items will remain unprotected with several thousand uses each, but I do think this is a reasonable middle ground for now.

xy.png (874×878 px, 13 KB)

Some Items can enter or leave the top-500 ranking as a result of small changes in Wikipedia templates, so I'll keep this task open while we look for an official, probably lower-priority, monitoring solution. Thanks again!

@abian @Lydia_Pintscher It would be very difficult to define a rational criterion for how many of the most frequently used WD items to protect.

But maybe there is a way. The distribution of item usage, as you can observe, almost certainly follows a power law (Zipf). There are maximum-likelihood estimation methods that can determine both the scaling parameter ('alpha', which essentially determines whether the distribution has infinite moments) and the cut-off parameter ('x-min'). Only beyond a certain cut-off value does the given probability density (or mass) function really begin to exhibit power-law behavior; see https://arxiv.org/abs/0706.1062.

Now, this is my idea: (1) fit the power law to our empirical data on WD usage; (2) protect only those observations (i.e. items) that are not found in the long tail of rarely used items, i.e. those whose observed usage frequency falls at or beyond the estimated x-min value. As a consequence, we would (a) protect only the most frequently used WD items, and (b) those items would at the same time be part of the "stable", power-law region of the distribution of WD item usage.
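
A minimal sketch of what this estimation could look like with the {poweRlaw} R package, assuming `usage` is an integer vector with one entry per item (the number of pages using that item); loading the actual dataset is out of scope here:

```
library(poweRlaw)

m <- displ$new(usage)      # discrete power-law model of item usage
est <- estimate_xmin(m)    # joint MLE of xmin and alpha (Clauset et al. 2009)
m$setXmin(est)

est$xmin                   # cut-off: power law plausibly holds for x >= xmin
est$pars                   # scaling parameter alpha

# Optional, and the computationally expensive part: a bootstrap
# goodness-of-fit test; p > 0.1 means the power law cannot be ruled out.
# bs <- bootstrap_p(m, no_of_sims = 100, threads = 4)
# bs$p
```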

Empirical drawback: if the value of x-min turns out to be large, and that is not impossible, we would need to protect a large number of WD items.
Methodological drawback: we have already cut off the tail of the distribution: you are observing only the top 100,000 most frequently used WD items. If we estimate x-min from this sample alone, it will differ from the x-min estimated from all WD items. If we choose to go for all items, the estimation procedure will take quite some time to complete.

Resources: the statistical estimation procedures I was referring to are already available in R and MATLAB (and possibly Python), so it would take me only a minimal amount of time to develop a script for this. The estimation procedure itself is computationally demanding, but it's not as if our number-crunching servers couldn't deal with it.

Looks good to me; if this isn't actually going to take you much effort, let's see what the results are and we'll be able to make a more informed decision.

@Lydia_Pintscher Hey, what is your take on this ticket and especially T210664#4860427? Thanks.

Update:

  • we have the dataset (WD items and, for each item, the number of pages that make use of it), and
  • we're running the power-law estimation procedures now; as predicted, this is going to take a while.

N.B. I am running this on my private server because the R estimation packages that I need will not work with the current version of R on our statboxes; see T214598.

@abian @Lydia_Pintscher We have the results.

Method

  • The power law was estimated from the 27,394,027 WD items that are currently used across the Wikimedia websites;
  • that is approximately 50% of the items now present in WD (54,195,898 as of today);
  • the statistic from which the power law was estimated is the number of pages that make use of a particular item;
  • estimation procedures from the {poweRlaw} R package were used.

Results

  • Power-law behavior cannot be excluded,
  • with an estimated scaling parameter (alpha) of 2.050451 (which implies infinite distribution variance), and
  • an estimated xmin parameter of 9 (in effect, this means that the distribution of all items with usage frequency >= 9 does exhibit power-law behavior).
  • The following is the log(Rank) vs log(Pages) plot for all WD items with usage frequency >= 9 across the pages in our projects:

logRank-logPages.png (520×600 px, 14 KB)
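
For reference, a hedged sketch of how a plot like this could be drawn in base R, reusing `usage` and the fitted `est` from the earlier sketch:

```
# Rank-frequency plot on log-log axes for the power-law region only.
tail_usage <- sort(usage[usage >= est$xmin], decreasing = TRUE)
plot(log10(seq_along(tail_usage)), log10(tail_usage),
     type = "l", xlab = "log(Rank)", ylab = "log(Pages)")
```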

Recommendation

  • Protect all items that are used on 9 or more pages across the Wikimedia websites.
  • There are 1,656,137 such items, which is only 3.06% of the total number of items in WD and only 6.05% of the WD items currently in use (see the sanity check below).
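
As a sanity check of these figures, assuming `usage` is the same vector used for the fit (one entry per item in use):

```
# 54,195,898 is the total item count quoted above; length(usage)
# corresponds to the 27,394,027 items currently in use.
n_protect <- sum(usage >= 9)   # reported: 1,656,137
n_protect / 54195898           # ~0.0306, i.e. 3.06% of all items
n_protect / length(usage)      # ~0.0605, i.e. 6.05% of items in use
```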

Discussion

  • If you can automate this, protecting 1,656,137 Items should not be a problem, I guess.
  • Currently, the list of items recommended for protection includes only item IDs and the number of pages that make use of them;
  • the list will be shared with @Lydia_Pintscher;
  • it would take some time/engineering to get the English labels in, and
  • also some time for me to figure out where to fetch their current protection status from (see the sketch after this list).
  • The procedure to regenerate this list on a regular daily basis would take approx. 3-4 hours per run, but
  • it cannot be set up on our infrastructure before we have R upgraded; see my request in T214598.
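
A hedged sketch of where the labels and protection status could come from, using the public MediaWiki/Wikidata APIs (wbgetentities for labels, prop=info&inprop=protection for protection status); the function name `item_info` and the single-item, unbatched calls are illustrative only:

```
library(jsonlite)

item_info <- function(qid) {
  # English label via wbgetentities
  lab <- fromJSON(paste0(
    "https://www.wikidata.org/w/api.php?action=wbgetentities&ids=", qid,
    "&props=labels&languages=en&format=json"))
  label <- lab$entities[[qid]]$labels$en$value
  # Protection status via prop=info&inprop=protection
  inf <- fromJSON(paste0(
    "https://www.wikidata.org/w/api.php?action=query&titles=", qid,
    "&prop=info&inprop=protection&format=json"))
  prot <- inf$query$pages[[1]]$protection
  # protected is TRUE if any protection entry is present
  list(id = qid, label = label,
       protected = is.data.frame(prot) && nrow(prot) > 0)
}

item_info("Q42")  # e.g. Douglas Adams
```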

So, until we have R upgraded on our systems, I recommend you ask for an updated list whenever you need one.

Thank you for such an accurate analysis! And for offering your own computing resources.

About the results

I would read the results in a negative way; I would say the conclusion is that Items used on fewer than 9 pages aren't used enough to be protected for this reason, while the rest of the Items might or might not be protected. More variables than the number of uses are significant in deciding whether or not to protect an Item, and some of them aren't easily quantifiable, for example:

  • the opportunity cost of each potential good edit prevented because of a semi-protection,
  • the value we give to preventing a bad edit,
  • the ratio of bad edits to total edits by non-confirmed users,
  • the ratio of edits by non-confirmed users to total edits,
  • the visibility that vandalism on an Item has per Item use,
  • the completeness and timelessness of an Item (or the opposite, the potential of an Item to be improved),
  • etc.

Taking these other variables (subjectively) into account, I wouldn't feel comfortable protecting all those Items. In any case, that would be a decision I would have to agree on with many users. The important thing is that we now objectively know more than before and can make better decisions. Thanks again, @GoranSMilovanovic!

@abian You're welcome! Your analysis is awesome and could be used to exemplify how a Client/Manager/Editor/Owner should introduce the problem to a Data Scientist/Analyst!

Thank you, Goran! That's useful information for discussing this further.
And I agree with abian that protecting more than 3% of all items is too costly when taking into account all the other factors mentioned.
I think we can close this task now?

> I think we can close this task now?

Okay for me. Since it doesn't look like the "stable area" is going to change often, and having a dynamic list is currently impossible, we'll ask you, @GoranSMilovanovic, for an updated list at some point in the future. What do you think?

@abian The code is in place. Just ping me somewhere whenever you need an update.

Mahir256 subscribed.

Concern has been raised about semi-protection of these items here.