
Track orphan Wikidata items or entities on Grafana
Open, Needs Triage, Public

Description

It would be great to be able to check the ratio orphan/total and its evolution on a Grafana dashboard, where orphan is the number of Wikidata items or entities with no links from other items or entities, and total is the total number of Wikidata items or entities.

Context/motivation: https://en.wikipedia.org/wiki/Wikipedia:Orphan.

Event Timeline

It might make sense to load the Wikidata entities into Hadoop before working on this task.

@JAllemandou already went ahead and worked some magic on the data that is already in Hadoop.
So in January 2018 there were:

  • Entities: 42,336,942
  • Entities with inward links (to the entity): 10,312,826
  • Entities with outward links (from the entity): 40,096,647
  • Entities with some kind of link (in or out): 40,115,806
  • Therefore, totally orphan entities (no links in or out) number roughly 2,221,136

This currently only uses the links within the main snaks of statements.
So qualifier and reference links are missed out.
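
As a rough sanity check, the orphan/total ratio requested in the description can be derived from the counts listed above. This is only a sketch: the variable names are made up for illustration, and the numbers are the January 2018 main-snak figures quoted in this comment, not fresh data.

```python
# Counts quoted above for January 2018 (main-snak links only).
total_entities = 42_336_942
with_inward_links = 10_312_826
with_any_link = 40_115_806

# "Totally orphan": no link in either direction.
totally_orphan = total_entities - with_any_link

# Orphan in the sense of the task description: no links from other entities.
no_inward_links = total_entities - with_inward_links

for label, count in [
    ("totally orphan", totally_orphan),
    ("no inward links", no_inward_links),
]:
    # Print the absolute count and its share of all entities.
    print(f"{label}: {count:,} ({count / total_entities:.2%} of all entities)")
```

The two ratios differ widely because most entities have outward links but no inward ones, so the choice of definition matters for whatever ends up on the dashboard.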

It would probably also be worth counting sitelinks as links for the tracking of orphan entities (at least in my opinion).
(This requires changing the graph construction that we did for the first numbers, so I won't bother posting the number here for 2018-01.)

This analysis was done with the following gist: https://gist.githubusercontent.com/jobar/ec44542614c0fe261a23cc3b4acf8e00/raw/6018e5d62401a2ca86f46580a547cb025932b8ca/degrees-analysis
For reference, most of it comes from https://wikitech.wikimedia.org/wiki/User:Joal/Wikidata_Graph#Playing_with_GraphFrames_-_v2

This is the sort of thing that we will want to run on a regular basis once Wikidata is regularly loaded into Hadoop.

Cool! Is there any Phabricator task for the pending deployment on Hadoop?

According to the definition for Wikipedia articles, orphan pages are those with no incoming links. It's true that, from an RDF-based perspective, it's relevant to consider both incoming and outgoing links. However, from the perspective of web search engines, only incoming links matter; my original intention was to track these unreachable entities, which in a sense belong to the deep Web. So we could say that orphan entities are those with neither sitelinks nor incoming links from other entities. Anyway, all the metrics you mention seem relevant.
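
To make that definition concrete, here is a minimal sketch on toy data. The function name, the input shapes, and the entity IDs are placeholders invented for illustration, not the actual Hadoop job; treating a self-link as not counting is my own assumption, based on "links from other entities".

```python
def find_orphans(entities, links, sitelink_counts):
    """Return the entities with neither sitelinks nor incoming links.

    entities: iterable of entity IDs
    links: iterable of (source, target) entity-ID pairs
    sitelink_counts: dict mapping entity ID to its number of sitelinks
    """
    # Targets of links coming from a *different* entity.
    # Assumption: a self-link does not make an entity reachable.
    linked_to = {target for source, target in links if source != target}
    return {
        entity
        for entity in entities
        if entity not in linked_to and sitelink_counts.get(entity, 0) == 0
    }


# Toy data; the IDs are placeholders, not statements about real items.
entities = ["Q1", "Q2", "Q3", "Q4"]
links = [("Q1", "Q2"), ("Q3", "Q3")]
sitelink_counts = {"Q4": 2}

orphans = find_orphans(entities, links, sitelink_counts)
print(sorted(orphans), len(orphans) / len(entities))
```

In this toy graph Q2 is reachable through the link from Q1, and Q4 through its sitelinks, so only Q1 and Q3 count as orphans.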

Thank you!