track percentage of articles making use of data from Wikidata
Closed, Resolved (Public)

Description

We'd like to know what percentage of articles make use of data from Wikidata. It should be a new graph on https://grafana.wikimedia.org/dashboard/db/wikidata-entity-usage?refresh=5m&orgId=1

Notes:

  • only main namespace
  • a page counts if it has at least one usage aspect other than sitelink usage
  • group by project (Wikipedia, Commons, Wikivoyage, etc.)
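
The last note, grouping by project, could be sketched as follows. This is only an illustration: the helper names `project_of` and `group_by_project` are hypothetical, and the suffix mapping is an assumption based on the usual Wikimedia database names (e.g. `dewiki`, `dewikivoyage`, `commonswiki`), not the mapping used by any existing script.

```python
from collections import defaultdict

# Assumed mapping from database-name suffix to project family.
# The generic "wiki" suffix comes last so that longer suffixes win.
SUFFIX_TO_PROJECT = [
    ("wikivoyage", "Wikivoyage"),
    ("wiktionary", "Wiktionary"),
    ("wikibooks", "Wikibooks"),
    ("wikiquote", "Wikiquote"),
    ("wikisource", "Wikisource"),
    ("wikinews", "Wikinews"),
    ("wikiversity", "Wikiversity"),
    ("wiki", "Wikipedia"),
]

# Databases that end in "wiki" but are not Wikipedias (incomplete, assumed list).
SPECIAL_DATABASES = {"commonswiki": "Commons", "wikidatawiki": "Wikidata"}


def project_of(dbname):
    """Return the project family for a database name such as 'dewiki'."""
    if dbname in SPECIAL_DATABASES:
        return SPECIAL_DATABASES[dbname]
    for suffix, project in SUFFIX_TO_PROJECT:
        if dbname.endswith(suffix):
            return project
    return "Other"


def group_by_project(per_database_counts):
    """Sum per-database page counts into per-project totals."""
    totals = defaultdict(int)
    for dbname, count in per_database_counts.items():
        totals[project_of(dbname)] += count
    return dict(totals)
```

For example, `group_by_project({"dewiki": 10, "enwiki": 5, "commonswiki": 3})` sums the two Wikipedias into one total and keeps Commons separate.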

Event Timeline

@Addshore I need a second opinion on the following, please. One of your generating scripts for this Grafana dashboard iterates across the project databases and counts the pages that make use of any aspects except 'S', in the following manner:

SELECT COUNT(DISTINCT eu_page_id) AS pages FROM dewiki.wbc_entity_usage WHERE eu_aspect != 'S';

The dewiki query above alone takes 3 minutes and 20 seconds to complete on analytics-store.eqiad.wmnet (run from stat1005), and there are >800 projects to assess in this manner.
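
To make the semantics of that query concrete, here is a small sketch that runs it against an in-memory SQLite table. The toy rows and the function names are made up for illustration, the `dewiki.` database prefix is dropped because SQLite has no such schema, and only three of the table's columns are modelled.

```python
import sqlite3


def demo_connection():
    """Build an in-memory stand-in for wbc_entity_usage with toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE wbc_entity_usage "
        "(eu_entity_id TEXT, eu_aspect TEXT, eu_page_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO wbc_entity_usage VALUES (?, ?, ?)",
        [
            ("Q1", "S", 1),  # page 1: sitelink usage ...
            ("Q1", "L", 1),  # ... plus label usage, so it counts
            ("Q2", "C", 1),  # page 1 again: still counted only once
            ("Q3", "S", 2),  # page 2: sitelink usage only, not counted
            ("Q4", "T", 3),  # page 3: title usage, counts
        ],
    )
    return conn


def count_pages_using_wikidata(conn):
    """Count distinct pages with at least one aspect other than 'S'."""
    (pages,) = conn.execute(
        "SELECT COUNT(DISTINCT eu_page_id) AS pages "
        "FROM wbc_entity_usage WHERE eu_aspect != 'S'"
    ).fetchone()
    return pages


print(count_pages_using_wikidata(demo_connection()))
```

Note that the query by itself does not restrict pages to the main namespace; that restriction would need an additional join or filter.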

Suggestion: I already have all of these SQL data sqooped into a Hive table, goransm.wdcm_clients_wb_entity_usage. The table is updated weekly (i.e. a weekly Apache Sqoop run collects all SQL wbc_entity_usage tables into Hadoop), but let's say I can make that daily. If you can help me with the Graphite metrics, only in terms of understanding the proper conventions for metric names, I think I could start sending these data to Graphite on a daily basis from R in production. That would (a) save some resources on our SQL servers, and (b) align us with the general policy of bypassing SQL wherever we deal with real Big Data sets. Let me know what you think.
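
For reference, Graphite's plaintext protocol expects one line per data point in the form "metric.path value timestamp". The sketch below, written in Python rather than R, only formats such a line; the function name and the metric path in the usage note are placeholders, not the naming convention asked about above.

```python
import time


def graphite_line(metric_path, value, timestamp=None):
    """Format one data point for Graphite's plaintext protocol.

    metric_path is a dot-separated name; the caller chooses it.
    timestamp is in Unix seconds and defaults to the current time.
    """
    if timestamp is None:
        timestamp = time.time()
    return f"{metric_path} {value} {int(timestamp)}\n"
```

A call such as `graphite_line("some.hypothetical.prefix.dewiki", 12.5, 1530000000)` returns `"some.hypothetical.prefix.dewiki 12.5 1530000000\n"`, which could then be written to a Graphite plaintext listener.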

@Lydia_Pintscher I think it is natural to approach this task within the WDCM framework, but that would imply presenting the outputs on an RStudio Shiny Server (like all WDCM dashboards), not Grafana. Let me know if this has to be Grafana for some reason. Otherwise I can easily build you a dashboard for this and any other similar, related statistics in WDCM.

The existing grafana board is where people would look for it. If it is significantly easier to do it in WDCM then let's do it there and at least link it from the grafana board.

Vvjjkkii renamed this task from "track percentage of articles making use of data from Wikidata" to "ajdaaaaaaa". Jul 1 2018, 1:11 AM
Vvjjkkii removed GoranSMilovanovic as the assignee of this task.
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii edited subscribers, added: GoranSMilovanovic; removed: Aklapper.
CommunityTechBot renamed this task from "ajdaaaaaaa" to "track percentage of articles making use of data from Wikidata". Jul 2 2018, 4:25 PM
CommunityTechBot assigned this task to GoranSMilovanovic.
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description.
CommunityTechBot edited subscribers, added: Aklapper; removed: GoranSMilovanovic.

@Lydia_Pintscher Here's a table with the desired statistics.

Columns: numPages = the number of pages in namespace 0, excluding redirects; wdUsePages = the number of those pages that make any use of Wikidata other than (S)itelinks; percentWDuse = the corresponding percentage.
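
As a worked illustration of how the third column follows from the first two, here is a minimal sketch. The function name and the handling of empty projects are assumptions, not taken from the dashboard code.

```python
def percent_wd_use(num_pages, wd_use_pages):
    """Return wdUsePages as a percentage of numPages.

    Returns 0.0 for a project with no qualifying pages (an assumed
    convention) and rejects counts that cannot occur.
    """
    if num_pages < 0 or wd_use_pages < 0 or wd_use_pages > num_pages:
        raise ValueError("expected 0 <= wd_use_pages <= num_pages")
    if num_pages == 0:
        return 0.0
    return 100.0 * wd_use_pages / num_pages
```

For instance, a project with numPages = 200 and wdUsePages = 50 has percentWDuse = 25.0.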

This will soon be deployed as a simple, one-page Shiny Dashboard and linked from the https://grafana.wikimedia.org/dashboard/db/wikidata-entity-usage?refresh=5m&orgId=1 Grafana dashboard.

The data can be updated daily.

@Lydia_Pintscher Added some charts to the dashboard as previously agreed. Please let me know if any other data aggregates would be useful.

@Lydia_Pintscher Please see T206214#4690482 from @Daniel_Mietchen (my bad that the suggestion didn't get here; I shared the wrong Phab ticket with Daniel).

Please expand in the report what's meant by "Wikidata usage"; it's ambiguous and could be interpreted as items linked vs statement-level data reused via templates.

Also: do we have the complementary chart? How much of Wikidata, by class, is reused?

@DarTar

Please expand in the report what's meant by "Wikidata usage"; it's ambiguous and could be interpreted as items linked vs statement-level data reused via templates.

The exact definition of what "Wikidata usage" refers to on this dashboard is provided above the table, almost at the top of the page, right below the Wikidata logo:

Wikidata (WD) usage upon which the reported data are based excludes Sitelinks
[see: S usage aspect, wbc_entity_usage table in the Wikibase schema].

The definition of Wikidata usage depends on the exact meaning of the item usage aspects as given in the documentation of the wbc_entity_usage Wikibase table (also cf. this Diffusion page). The wbc_entity_usage table lacks functional documentation, in the sense that it does not explicitly state which user actions map onto which usage aspects. This prevents analytics from providing more meaningful, interpretable insights (e.g. going beyond expressions like "... a (S)itelink usage aspect would be registered following an application of this or that Lua module on the page...", which are not very informative to our readers or editors, nor to us in analytics, who take a behavioral, user-centered perspective on the data). A separate ticket will be opened soon to ask for more thorough documentation of this table. In the meantime: this dashboard excludes only the (S)itelink usage aspect, so yes, it does take into account the usage of statements (the so-called (C) usage aspect; to quote the documentation: "statements (C): certain statements (identified by their property id) from the entity are used").

Also: do we have the complementary chart?

Complementary in what sense?

How much of Wikidata, by class, is reused?

This might take a while, but I certainly agree that it is a question worth addressing.

@Lydia_Pintscher Please let me know if either of these two questions should be addressed immediately, so that the respective features can be incorporated into our Percentage of articles making use of data from Wikidata page.

Great. Thank you!
I'm closing this ticket since what I wanted is covered. Let's do other things in new tickets.