
Update and fix wiki segmentation dataset
Open, Normal, Public


Since we released the wiki segmentation dataset in July 2018, it has been used by a wide variety of people for exploration and project targeting.

We should make sure the dataset is in good shape for continued use by doing the following:

  • Update with current data (at the moment, the data is from June 2018)
  • Make some high-impact, low-effort improvements (add pageviews, fix the count of active administrators, fix some errors in wiki names)
  • Extract the code that generates the dataset of project names, families, and languages into a reusable form, so it can be used in other data projects (T184576)
  • Change the spreadsheet to pull data from a CSV file to eliminate copy-paste errors and ease updating
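
As a rough illustration of the last item, here is a minimal Python sketch of writing the dataset to a CSV file with a fixed column order, so the spreadsheet could import the file instead of receiving pasted values. The column names in `FIELDS` and the function names are hypothetical, chosen only for this example; the real dataset's columns and generation code may differ.

```python
import csv
import io

# Hypothetical column names, for illustration only; the real dataset may differ.
FIELDS = ["wiki", "project_family", "language", "monthly_pageviews", "active_admins"]


def write_segmentation_csv(rows, out):
    """Write an iterable of dict rows to a file-like object as CSV.

    The header order is fixed by FIELDS so that a spreadsheet importing
    the file always sees the same columns in the same positions.
    """
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


def read_segmentation_csv(src):
    """Read the CSV back into a list of dicts (all values are strings)."""
    return list(csv.DictReader(src))


if __name__ == "__main__":
    # Made-up sample row, not real data.
    sample = [{
        "wiki": "examplewiki",
        "project_family": "wikipedia",
        "language": "Example",
        "monthly_pageviews": 1000,
        "active_admins": 3,
    }]
    buf = io.StringIO()
    write_segmentation_csv(sample, buf)
    print(buf.getvalue())
```

Writing through a single function with a fixed header is one way to keep repeated updates consistent; whether the spreadsheet then imports the file automatically or by hand is a separate choice.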