In preparation for adding a Wikidata-based topic classification model to ORES, we will need utilities to preprocess Wikidata items into the format expected by the ML model. This is roughly analogous to the utilities for processing wikitext dumps for model training in the mwtext library: https://github.com/mediawiki-utilities/python-mwtext/blob/master/mwtext/utilities/preprocess_text.py
The envisioned process would:
- Process the Wikidata JSON entity dump: https://www.wikidata.org/wiki/Wikidata:Database_download#JSON_dumps_(recommended)
- Do appropriate item-level filtering -- e.g., retaining only Wikidata items with the relevant sitelinks
- Convert each retained item into an ML-ready feature list -- e.g., a list of property + value pairs that meet the above criteria (see the sketch after this list)
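
A minimal sketch of those three steps, assuming the line-per-entity layout of the JSON array dump. The function names (`preprocess_dump`, `extract_features`), the Wikipedia-sitelink heuristic, and the restriction to entity-valued claims are all illustrative assumptions, not settled design:

```python
import bz2
import json


def has_wikipedia_sitelink(entity):
    """Crude item-level filter: keep items linked to at least one wiki.

    Illustrative heuristic only -- keys ending in "wiki" also match
    non-Wikipedia sites such as commonswiki, so a real filter would
    check against an explicit list of project database names.
    """
    return any(key.endswith("wiki") for key in entity.get("sitelinks", {}))


def extract_features(entity):
    """Convert an item's claims into a flat list of property=value features.

    Only entity-valued claims (datatype wikibase-entityid) are kept in
    this sketch; strings, quantities, and dates are skipped.
    """
    features = []
    for prop, claims in entity.get("claims", {}).items():
        for claim in claims:
            snak = claim.get("mainsnak", {})
            if snak.get("snaktype") != "value":
                continue
            datavalue = snak.get("datavalue", {})
            if datavalue.get("type") == "wikibase-entityid":
                features.append(f"{prop}={datavalue['value']['id']}")
    return features


def preprocess_dump(path):
    """Yield (item ID, feature list) pairs from a wikidata-*-all.json.bz2 dump.

    The dump is one large JSON array with one entity per line, so each
    line (minus its trailing comma) can be parsed independently.
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if line in ("[", "]", ""):
                continue
            entity = json.loads(line)
            if entity.get("type") != "item":  # skip property entities etc.
                continue
            if not has_wikipedia_sitelink(entity):
                continue
            yield entity["id"], extract_features(entity)
```

Streaming one entity at a time keeps memory flat regardless of dump size, which matters given that the full dump is tens of gigabytes compressed.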
Out of scope:
- Statement-level filtering -- e.g., checking the references or rank associated with individual claims
Additional notes:
- GitHub repo where this code will likely live: python-mwtext
- A name change -- e.g., mwembeddings -- is also under consideration given the expansion from wikitext to other types of features (in this case, Wikidata claims).