
Collect statistics about unlinked pages from Wikimedia Wikis
Closed, Resolved · Public

Description

For almost every Wikimedia wiki, we want the number of main-namespace pages without a sitelink to Wikidata, both as an absolute number and as a fraction of the total number of pages in the main namespace.

Example SQL code to get the number of pages without a sitelink in the main namespace on a given wiki:

SELECT COUNT(*)
FROM `page_props`
WHERE pp_propname = 'unexpectedUnconnectedPage'
  AND pp_value = 0;
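
The task asks for a fraction as well as the absolute number, so here is a rough sketch of how the two could be combined. It runs against a toy in-memory SQLite database, not a real wiki replica. The rows, the helper names `build_toy_db` and `unconnected_stats`, and the choice to count every main-namespace row in `page` as the denominator (redirects included) are assumptions for illustration only.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory stand-in for a wiki database (illustrative rows only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER);
        CREATE TABLE page_props (
            pp_page INTEGER, pp_propname TEXT, pp_value INTEGER,
            PRIMARY KEY (pp_page, pp_propname)
        );
        """
    )
    # Four main-namespace pages and one page in namespace 14.
    conn.executemany(
        "INSERT INTO page VALUES (?, ?)",
        [(1, 0), (2, 0), (3, 0), (4, 0), (5, 14)],
    )
    # Two pages flagged as unconnected, plus one unrelated (made-up) property.
    conn.executemany(
        "INSERT INTO page_props VALUES (?, ?, ?)",
        [
            (1, "unexpectedUnconnectedPage", 0),
            (2, "unexpectedUnconnectedPage", 0),
            (3, "some_other_prop", 0),
        ],
    )
    return conn


def unconnected_stats(conn):
    """Return (unconnected, total, fraction) for the main namespace."""
    # Same query as in the task description.
    unconnected = conn.execute(
        "SELECT COUNT(*) FROM page_props "
        "WHERE pp_propname = 'unexpectedUnconnectedPage' AND pp_value = 0"
    ).fetchone()[0]
    # Assumption: the denominator is every main-namespace row in `page`.
    total = conn.execute(
        "SELECT COUNT(*) FROM page WHERE page_namespace = 0"
    ).fetchone()[0]
    fraction = unconnected / total if total else 0.0
    return unconnected, total, fraction


if __name__ == "__main__":
    print(unconnected_stats(build_toy_db()))
```

Whether redirects should be excluded from the denominator is a decision this sketch does not make. If they should be, a filter on `page_is_redirect` would be the place to apply it.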

Acceptance criteria:

  • An initial collection of the data has been made
    • in the future, we may want the data to be collected continuously and displayed, for example on Grafana

Open Questions:

  • Which wikis do we want to exclude?
  • Are there wikis where namespaces other than the main namespace are interesting?
  • How much more work would it be to collect the data for all namespaces for all wikis? (and drop the ones with 0 sitelinks)
  • Wiktionaries don't usually use sitelinks in the classic sense and use Cognate instead. Do we still want to include them here?
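
To get a feel for the all-namespaces question, here is a sketch of a per-namespace breakdown for a single wiki. It joins `page_props` to `page` on `pp_page = page_id` rather than relying on `pp_value`, and it reads "drop the ones with 0 sitelinks" as "drop namespaces in which no page has a `wikibase_item` page prop". That reading, the toy rows, and the helper names are all assumptions, and the in-memory SQLite database is only a stand-in for a real replica.

```python
import sqlite3

PER_NAMESPACE_SQL = """
SELECT p.page_namespace,
       COUNT(*)          AS total,
       COUNT(u.pp_page)  AS unconnected,
       COUNT(w.pp_page)  AS connected
FROM page p
LEFT JOIN page_props u
       ON u.pp_page = p.page_id
      AND u.pp_propname = 'unexpectedUnconnectedPage'
LEFT JOIN page_props w
       ON w.pp_page = p.page_id
      AND w.pp_propname = 'wikibase_item'
GROUP BY p.page_namespace
HAVING COUNT(w.pp_page) > 0   -- drop namespaces without any sitelinked page
ORDER BY p.page_namespace
"""


def build_toy_db():
    """In-memory stand-in for one wiki; all rows are made up for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER);
        CREATE TABLE page_props (
            pp_page INTEGER, pp_propname TEXT, pp_value TEXT,
            PRIMARY KEY (pp_page, pp_propname)
        );
        """
    )
    conn.executemany(
        "INSERT INTO page VALUES (?, ?)",
        [(1, 0), (2, 0), (3, 0), (4, 14), (5, 14), (6, 2)],
    )
    # pp_value is left NULL because the query below never reads it.
    conn.executemany(
        "INSERT INTO page_props (pp_page, pp_propname) VALUES (?, ?)",
        [
            (1, "wikibase_item"),
            (2, "unexpectedUnconnectedPage"),
            (3, "wikibase_item"),
            (4, "wikibase_item"),
            (5, "unexpectedUnconnectedPage"),
            (6, "unexpectedUnconnectedPage"),
        ],
    )
    return conn


def per_namespace_stats(conn):
    """Return rows of (namespace, total, unconnected, connected)."""
    return conn.execute(PER_NAMESPACE_SQL).fetchall()


if __name__ == "__main__":
    for row in per_namespace_stats(build_toy_db()):
        print(row)
```

In the toy data, namespace 2 has one unconnected page and no connected ones, so it is dropped from the result. Running this across all wikis would additionally need a list of wiki databases to loop over, which this sketch leaves out.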

Event Timeline

Some statistics are provided at https://wikidata-todo.toolforge.org/duplicity/.

Ah, there's some interesting data there! Thank you 🙏

Michael claimed this task.
Michael added a subscriber: Manuel.

The data in https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/Wikidata/WD_percentUsage/ seems to be more-or-less exactly what we were looking for in this task. Thank you for pointing us to it, @Manuel!

I'm resolving this for now. We can open up a new task when we have more specific needs.