
Expand the Percentage of articles making use of data from Wikidata Dashboard to include (S)itelinks
Closed, Resolved · Public

Description

Expand the Percentage of articles making use of data from Wikidata dashboard to include (S)itelinks.

This dashboard is updated daily and reports the % of articles in each WMF project that re-use any Wikidata items, subject to the following constraints:

  • namespace = 0,
  • no redirects,
  • (S)itelinks are excluded.
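
The constraints above can be sketched in plain Python. This is a minimal illustration, not the dashboard's actual ETL: the in-memory rows and the helper name `pct_articles_reusing_wikidata` are hypothetical, and the field names assume the MediaWiki `page` table (`page_id`, `page_namespace`, `page_is_redirect`) joined to `wbc_entity_usage` (`eu_page_id`, `eu_aspect`).

```python
def pct_articles_reusing_wikidata(pages, usages):
    """Percent of articles (namespace 0, non-redirect) with any non-sitelink usage."""
    # Keep only main-namespace pages that are not redirects.
    articles = {
        p["page_id"]
        for p in pages
        if p["page_namespace"] == 0 and not p["page_is_redirect"]
    }
    # Aspects may carry a modifier (e.g. "L.en"), so compare the leading code only.
    reusing = {
        u["eu_page_id"]
        for u in usages
        if u["eu_aspect"].split(".")[0] != "S" and u["eu_page_id"] in articles
    }
    return 100.0 * len(reusing) / len(articles) if articles else 0.0


# Hypothetical sample data for a single project.
pages = [
    {"page_id": 1, "page_namespace": 0, "page_is_redirect": False},
    {"page_id": 2, "page_namespace": 0, "page_is_redirect": False},
    {"page_id": 3, "page_namespace": 0, "page_is_redirect": True},   # redirect: excluded
    {"page_id": 4, "page_namespace": 1, "page_is_redirect": False},  # not namespace 0: excluded
]
usages = [
    {"eu_page_id": 1, "eu_aspect": "S"},     # sitelink only: not counted
    {"eu_page_id": 2, "eu_aspect": "L.en"},  # label usage: counted
    {"eu_page_id": 3, "eu_aspect": "C"},     # usage on a redirect: ignored
]
print(pct_articles_reusing_wikidata(pages, usages))  # 50.0
```

Of the two qualifying articles, only page 2 has a non-sitelink usage row, hence 50%.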

Motivation: the current Percentage of articles making use of data from Wikidata dashboard reports on Wikidata re-use statistics from the wbc_entity_usage schema but excludes the (S)itelinks aspect (see eu_aspect field in the respective schema).

Why is the (S)itelinks eu_aspect currently excluded? Simply because it represents a somewhat trivial case of Wikidata re-use in the WMF projects: any page is expected to have a corresponding Wikidata item eventually, so having a (S)itelink is only a matter of time. However, it was reported that some people were confused by the current Wikidata re-use statistics, expecting to observe higher figures than those reported. The cause of the confusion was identified as the question of whether (S)itelinks count as valid Wikidata re-use or not.

In a discussion with the Research team (@MGerlach and @diego), @Lydia_Pintscher and @GoranSMilovanovic came to the conclusion that it makes sense to conceptually separate:

  • Wikidata usage (i.e. everything excluding (S)itelinks) from
  • Wikidata coverage (only (S)itelinks).

The existing dashboard will be re-designed to encompass two tabs: Wikidata usage, showing the usage statistics that are currently represented on the dashboard, and Wikidata coverage, introducing (S)itelinks usage only.
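
The usage/coverage separation can be read as a partition of usage rows by aspect. The sketch below is only an illustration under assumptions: the function name `split_usage_and_coverage` and the sample rows are hypothetical, and it assumes each row is an `(eu_page_id, eu_aspect)` pair where the sitelinks aspect is coded `S`.

```python
def split_usage_and_coverage(usage_rows):
    """Return (usage_pages, coverage_pages) as sets of page IDs.

    usage_rows: iterable of (eu_page_id, eu_aspect) tuples.
    """
    usage_pages, coverage_pages = set(), set()
    for page_id, aspect in usage_rows:
        # Aspects may carry a modifier (e.g. "C.P31"); only the leading code matters here.
        if aspect.split(".")[0] == "S":
            coverage_pages.add(page_id)  # "Wikidata coverage" tab: sitelinks only
        else:
            usage_pages.add(page_id)     # "Wikidata usage" tab: every other aspect
    return usage_pages, coverage_pages


# Hypothetical sample rows.
rows = [(10, "S"), (10, "C.P31"), (11, "S"), (12, "T")]
usage_pages, coverage_pages = split_usage_and_coverage(rows)
print(sorted(usage_pages), sorted(coverage_pages))  # [10, 12] [10, 11]
```

A page can appear in both sets (page 10 here), so the two tabs report overlapping populations rather than a strict either/or.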

Event Timeline

  • Bug discovered (and fixed): HiveQL queries failing due to incorrect Kerberos Auth.
  • Essential re-factoring completed;
  • Testing: tonight, following the regular update from stat1004.
  • Until review is done:
    • expanding the scope of the dataset so as to include all projects that make any use of (S)itelinks at all;
    • XML parameterization for production purposes;
    • switching ETL from local filesystem (slow, cluster throughput high) to hdfs.

Done: expanding the scope of the dataset so as to include all projects that make any use of (S)itelinks at all.

Done:

  • XML parameterization for production purposes
  • switching ETL from local filesystem (slow, cluster throughput high) to hdfs: switched to PySpark indeed.

Testing.

  • Test successful, done.
  • Next: XML parameterization for the dashboard itself + html components.

@Lydia_Pintscher Thank you!

An additional touch here and there on the dashboard's back-end and I am closing the ticket if @diego and @MGerlach also think that the dashboard now presents the previously missing indicators in a satisfactory way.

Hi @GoranSMilovanovic ,
First of all, thanks for the work. Looks really good. My only doubt is about this sentence:

"Wikidata Sitelinks are found in 81.5853% of articles across the WMF projects considered." Is this the average per project, or the sum of all sitelinks divided by the total number of articles?

@diego

"Wikidata Sitelinks are found in 81.5853% of articles across the WMF projects considered." Is this the average per project, or the sum of all sitelinks divided by the total number of articles?

It is the sum of all sitelinks divided by the total number of articles. Do you think the formulation should be changed? Thanks.
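
To make the distinction concrete, here is a small sketch contrasting the pooled figure (sum of sitelinked articles over the sum of all articles) with the mean of per-project percentages. The project names and counts are invented for illustration and are not dashboard data; the function names are hypothetical.

```python
def pooled_percentage(counts):
    """counts: dict of project -> (articles_with_sitelinks, total_articles)."""
    with_sitelinks = sum(s for s, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return 100.0 * with_sitelinks / total


def mean_of_project_percentages(counts):
    """Unweighted average of each project's own percentage."""
    per_project = [100.0 * s / t for s, t in counts.values()]
    return sum(per_project) / len(per_project)


# Invented counts: one large project with high coverage, one small with low coverage.
counts = {"bigwiki": (900, 1000), "smallwiki": (10, 100)}
print(round(pooled_percentage(counts), 4))   # 82.7273
print(mean_of_project_percentages(counts))   # 50.0
```

The pooled figure is dominated by large projects, while the unweighted mean treats every project equally, so the two can differ substantially.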

It is the sum of all sitelinks divided by the total number of articles. Do you think the formulation should be changed? Thanks.

No, I think it is good like this.
What would be great is to additionally have the statistic only for Wikipedias (or by project type, e.g. wikipedia, wiktionary, etc.).

@diego Great then.

What would be great is to additionally have the statistic only for Wikipedias (or by project type, e.g. wikipedia, wiktionary, etc.).

Good, we'll have it.
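
A per-project-type aggregate could be computed along these lines. This is a sketch under assumptions, not the dashboard's implementation: the row layout, the `project_type` labels, and the counts are hypothetical, and the figure is pooled within each type.

```python
from collections import defaultdict


def pooled_by_project_type(rows):
    """rows: iterable of (project, project_type, with_sitelinks, total_articles)."""
    sums = defaultdict(lambda: [0, 0])
    for _project, project_type, with_sitelinks, total in rows:
        sums[project_type][0] += with_sitelinks
        sums[project_type][1] += total
    # Skip types with no articles to avoid division by zero.
    return {ptype: 100.0 * s / t for ptype, (s, t) in sums.items() if t}


# Invented counts for illustration only.
rows = [
    ("enwiki", "wikipedia", 80, 100),
    ("dewiki", "wikipedia", 40, 100),
    ("enwiktionary", "wiktionary", 5, 50),
]
by_type = pooled_by_project_type(rows)
print(by_type)  # {'wikipedia': 60.0, 'wiktionary': 10.0}
```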

@GoranSMilovanovic looks great.

In my opinion this task is complete.

Thank you very much.

@diego Thank you.
@Lydia_Pintscher Since you have already reviewed the dashboard, and the latest change includes only one aggregate data table, I will close the ticket.