@CMyrick-WMF has done excellent work cleaning and analyzing data on incubating wikis, which also involved substantial work with language data.
We should incorporate this work into canonical data. Proposed work sequence:
- T346855: Provide ISO 639 language codes in canonical wiki dataset
- T392951: Create a first version of the canonical language dataset
- T393075: Create a canonical dataset for incubating wikis
- T393076: Add Glottolog fields to canonical language dataset
@CMyrick-WMF has some initial drafts for the table schemas in this doc.
Potential additional work:
- Add the MediaWiki autonym to the language dataset if it would be useful
- Consider how to incorporate data from Language Data and Utilities