Since we released the wiki segmentation dataset in July 2018, it has been used by a wide variety of people for exploration and project targeting.
We should make sure the dataset is in good shape for continued use by doing the following:
- Update with current data (at the moment, the data is from June 2018)
- Make some high-impact, low-effort improvements (add pageviews, fix the count of active administrators, fix some errors in wiki names)
- Extract the code that generates the dataset of project names, families, and languages into a reusable form, so it can be used in other data projects (T184576)
- Change the spreadsheet to pull its data from a CSV file, to eliminate copy-paste errors and make updating easier