Collect WDumper subset dumps data
Open, Needs Triage, Public

Description

To better understand which subsets of the entity data dumps our users are interested in, we want to take a closer look at how WDumper is being used. Under "recent dumps", WDumper lists previously generated subsets, each with a JSON representation of the filters that were used to generate the dump.

We want to scrape these dumps and turn the filter data into a human-readable form. The outcome should be a CSV file with one row per dump. Columns:

  • dump name
  • URL
  • filter (in human-readable form including labels for any items and properties used)
  • statements included in the dump (in human-readable form)
  • labels (yes/no)
  • descriptions (yes/no)
  • aliases (yes/no)
  • sitelinks (yes/no)
  • languages
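As a rough illustration of the conversion described above, here is a minimal Python sketch that turns one dump's filter JSON into a CSV row with these columns. It makes several assumptions that are not confirmed by this task: the key names in the spec (`entities`, `properties`, `property`, `value`, `statements`, `labels`, `descriptions`, `aliases`, `sitelinks`, `languages`) are a guess at the WDumper JSON layout, the helper names are hypothetical, and the `labels` dictionary stands in for a label lookup that would have to be done separately (for example through the Wikidata web API). Scraping the list of dumps itself is left out.

```python
import csv
import io

# Column order taken from the list in the task description.
COLUMNS = ["dump name", "URL", "filter", "statements", "labels",
           "descriptions", "aliases", "sitelinks", "languages"]


def with_label(entity_id, labels):
    """Render an ID as 'label (ID)', falling back to the bare ID."""
    label = labels.get(entity_id)
    return f"{label} ({entity_id})" if label else entity_id


def yes_no(flag):
    """Map a truthy/falsy spec value to the 'yes'/'no' used in the CSV."""
    return "yes" if flag else "no"


def describe_filter(spec, labels):
    """Summarise the entity filters (assumed key names, see lead-in)."""
    parts = []
    for entity in spec.get("entities", []):
        for cond in entity.get("properties", []):
            prop = with_label(cond["property"], labels)
            value = cond.get("value")
            shown = with_label(value, labels) if value else "any value"
            parts.append(f"{prop} = {shown}")
    return "; ".join(parts)


def describe_statements(spec, labels):
    """List the properties whose statements are included (assumed layout)."""
    props = []
    for statement in spec.get("statements", []):
        for prop in statement.get("properties", []):
            props.append(with_label(prop, labels))
    return ", ".join(props)


def spec_to_row(name, url, spec, labels):
    """Build one CSV row (as a dict) for a single dump."""
    return {
        "dump name": name,
        "URL": url,
        "filter": describe_filter(spec, labels),
        "statements": describe_statements(spec, labels),
        "labels": yes_no(spec.get("labels")),
        "descriptions": yes_no(spec.get("descriptions")),
        "aliases": yes_no(spec.get("aliases")),
        "sitelinks": yes_no(spec.get("sitelinks")),
        "languages": ", ".join(spec.get("languages", [])),
    }


def rows_to_csv(rows):
    """Serialise rows to CSV text, one row per dump, header first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Passing the label lookup in as a plain dictionary keeps the formatting logic testable without network access; the real label source can be swapped in later without touching the row-building code.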

Event Timeline

Closing the parent task as declined: we won't be picking up this work, since the WMF is still debating the future strategy for Wikidata dumps.

Does this mean Mona and I shouldn't work on this anymore? I thought we agreed with @WMDE-leszek that we will still continue this.

Reopening, as the work should continue. This is not really a task about Wikidata dumps. It involves a community-built tool, and we won't even touch it. It is a task for a WMDE intern to gain familiarity with the Wikidata web APIs and some general object-oriented design and programming topics.