Write Python util for converting Wikidata claims to features for ML models
Closed, ResolvedPublic

Description

In preparation for adding a Wikidata-based topic classification model to ORES, we will need utilities to preprocess Wikidata items so that they are in the correct format for the ML model. This is roughly analogous to the utilities for processing wikitext dumps for model training in the mwtext library: https://github.com/mediawiki-utilities/python-mwtext/blob/master/mwtext/utilities/preprocess_text.py

The envisioned process would:

Out of scope:

  • Do appropriate statement-level filtering -- e.g., checking references/ranking associated with claims

Additional notes:

  • Github repo where this code will likely live: python-mwtext
  • A name change -- e.g., mwembeddings -- is also under consideration given the expansion from wikitext to other types of features (in this case, Wikidata claims).

Event Timeline

Halfak triaged this task as High priority. May 18 2020, 5:02 PM
Halfak moved this task from Unsorted to New development on the Machine-Learning-Team board.

I wanted to preserve this info somewhere. We have discussed whether or not the Wikidata statements should be ordered by mwtext (see Examples section here). Here's my current thinking:

When does order matter:

  • Order does not matter when making predictions because the model just averages everything together.
  • Order does matter when training the model because the model (like many similar machine learning models) doesn't actually train on the full item. It repeatedly takes a window of, say, 20 tokens (i.e. properties/values) and uses that to train, so the model might begin to learn something about the order of properties/values if they aren't randomized.
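The distinction above can be sketched as a small helper that randomizes claim order only when requested. This is a hypothetical illustration, not the mwtext implementation; the claim pairs and token format are assumed for the example.

```python
import random

def claims_to_tokens(claims, shuffle=False, seed=None):
    """Flatten (property, value) claim pairs into tokens.

    At training time, pass shuffle=True so the model cannot pick up
    spurious order information from the source dumps; at prediction
    time shuffling has no effect because the model averages tokens.
    """
    tokens = ["{}={}".format(prop, val) for prop, val in claims]
    if shuffle:
        rng = random.Random(seed)  # seed only to make runs reproducible
        rng.shuffle(tokens)
    return tokens

claims = [("P31", "Q5"), ("P21", "Q6581097"), ("P106", "Q82955")]
print(claims_to_tokens(claims))                          # stable order
print(claims_to_tokens(claims, shuffle=True, seed=0))    # randomized
```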

There are a couple of things to know about ordering Wikidata statements:

So that leaves us with a few options:

  • For prediction time, randomization is very quick but has no effect, so it's extraneous; we probably shouldn't bake it in, but it's okay if we do.
  • For training time, we can either enforce the order that Wikidata sets or randomize. I wouldn't use what the API/dumps provide, as it's not clear to me that it's a stable ordering, and it could potentially affect model training.

At the moment, it looks like we are leaning towards always randomizing for simplicity, but @Dibyaaaaax if you have additional time while waiting for us to review code etc., you might want to look into how to extract the order from this page and apply it to the claims that are extracted from the dumps. This is the API call that would get you the page text: https://www.wikidata.org/w/api.php?action=parse&page=MediaWiki:Wikibase-SortedProperties&prop=wikitext&format=json
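Extracting the property order from that page might look like the sketch below. To keep it self-contained it parses a small made-up sample of the page's wikitext rather than calling the API; the actual page's markup may differ, so the regex is an assumption.

```python
import re

# The sorted-property list lives at MediaWiki:Wikibase-SortedProperties
# on Wikidata; its wikitext can be fetched with the action=parse API call
# linked above. SAMPLE_WIKITEXT is a stand-in for that response.
SAMPLE_WIKITEXT = """
* P31
* P279
* P21
"""

def extract_property_order(wikitext):
    """Return property IDs in the order they appear in the page text."""
    return re.findall(r"\bP\d+\b", wikitext)

print(extract_property_order(SAMPLE_WIKITEXT))  # ['P31', 'P279', 'P21']
```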

There are some Wikidata items that represent Wikipedia pages that aren't articles. E.g., Q8207058 sitelinks to Portal:Earth Sciences in English Wikipedia (and similarly for other languages).

For this task, we would like to filter out these items and keep only those items that have sitelink(s) to Wikipedia articles.

Method-1: To do this, check if the item is an instance of (P31) Wikimedia Internal Item (Q17442446) or one of its subclasses. The list of all the subclasses of Wikimedia Internal Item can be found here.

Method-2: We also considered using the sitelink names to filter out these items. For example, the sitelinks for Q8207058 all begin with Portal: in different languages. Collecting all the namespace names across the Wikipedia languages would be enough to apply this filter.

Both methods perform more or less the same for most items. One place where they differ: items that are an instance of Wikimedia disambiguation page (e.g., Q5506257) are filtered out by Method-1 but not by Method-2.

For now, we stick with Method-1 because it is more structured and relies on Wikidata's data model rather than string matching on sitelink names.
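Method-1 could be sketched roughly as below. The subclass set here is an assumed, abbreviated stand-in (the real set would come from the subclass list mentioned above), and the dict shape follows the entity JSON used in Wikidata dumps.

```python
# Hypothetical Method-1 filter: drop any item whose P31 claim points at
# Q17442446 (Wikimedia internal item) or one of its subclasses.
# ASSUMPTION: this set is a small sample; the full set would be
# collected separately (e.g. via SPARQL).
WIKIMEDIA_INTERNAL = {"Q17442446", "Q4167836", "Q4167410"}

def is_article_item(item):
    """item is a parsed Wikidata entity dict (as in the JSON dumps)."""
    for claim in item.get("claims", {}).get("P31", []):
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
        if value.get("id") in WIKIMEDIA_INTERNAL:
            return False
    return True

disambig = {"claims": {"P31": [{"mainsnak": {"datavalue": {"value": {"id": "Q4167410"}}}}]}}
human = {"claims": {"P31": [{"mainsnak": {"datavalue": {"value": {"id": "Q5"}}}}]}}
print(is_article_item(disambig))  # False
print(is_article_item(human))     # True
```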

Sort order of Wikidata statements
We decided to order the Wikidata statements instead of randomizing them, for the reasons @Isaac mentioned above. For that, the order of properties is collected from here (SortedProperties) using the API. We then sort the statements based on their properties, using the list of SortedProperties as a reference.
There were some statements with properties that do not appear in the SortedProperties list. Those statements are simply sent to the end of the list, so they appear after all the statements that are in their correct positions.
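The sort described above can be sketched as a stable sort keyed on each property's position in the SortedProperties list, with unknown properties ranked last. Function and variable names here are illustrative, not the actual mwtext code.

```python
def sort_by_property_order(properties, sorted_properties):
    """Sort property IDs by their position in sorted_properties.

    Properties not in the list get a rank past the end, so Python's
    stable sort sends them to the back while keeping their relative order.
    """
    rank = {pid: i for i, pid in enumerate(sorted_properties)}
    return sorted(properties, key=lambda pid: rank.get(pid, len(rank)))

sorted_props = ["P31", "P279", "P21"]
print(sort_by_property_order(["P21", "P9999", "P31"], sorted_props))
# ['P31', 'P21', 'P9999']
```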


Filtering non-articles
We've filtered out the Wikidata items that represent Wikipedia pages that aren't articles. We used SPARQL to query for the list of subclasses of Wikimedia Internal Item and discarded all the Wikidata items that either belong to the list or are an instance of (P31) one of the items on it.
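The subclass query could look like the sketch below. It only builds the query string (running it would require a network call to the Wikidata Query Service at https://query.wikidata.org/sparql); the exact query used by the task may differ.

```python
def build_subclass_query(root_qid):
    """SPARQL for all transitive subclasses of root_qid (including itself),
    using the wdt:P279 (subclass of) property path."""
    return (
        "SELECT ?class WHERE {{ "
        "?class wdt:P279* wd:{qid} . "
        "}}"
    ).format(qid=root_qid)

query = build_subclass_query("Q17442446")
print(query)
# SELECT ?class WHERE { ?class wdt:P279* wd:Q17442446 . }
```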