Write Python util for converting Wikidata claims to features for ML models
Closed, ResolvedPublic

Description

In preparation for adding a Wikidata-based topic classification model to ORES, we will need utilities to preprocess Wikidata items so that they are in the correct format for the ML model. This is roughly analogous to the utilities for processing wikitext dumps for model training in the mwtext library: https://github.com/mediawiki-utilities/python-mwtext/blob/master/mwtext/utilities/preprocess_text.py

The envisioned process would:

Out of scope:

  • Do appropriate statement-level filtering -- e.g., checking references/ranking associated with claims

Additional notes:

  • Github repo where this code will likely live: python-mwtext
  • A name change -- e.g., mwembeddings -- is also under consideration given the expansion from wikitext to other types of features (in this case, Wikidata claims).

Event Timeline

Halfak triaged this task as High priority. May 18 2020, 5:02 PM
Halfak moved this task from Unsorted to New development on the Machine-Learning-Team board.

I wanted to preserve this info somewhere. We have discussed whether or not the Wikidata statements should be ordered by mwtext (see Examples section here). Here's my current thinking:

When does order matter:

  • Order does not matter when making predictions because the model just averages everything together.
  • Order does matter when training the model because the model (like many similar machine learning models) doesn't actually train on the full item. It repeatedly takes a window of, say, 20 tokens (i.e. properties/values) and uses that to train, so the model might begin to learn something about the order of properties/values if they aren't randomized.
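The distinction above can be sketched as a small helper that randomizes claim order only when requested. This is a hypothetical illustration, not the mwtext implementation; the claim pairs and token format are assumed for the example.

```python
import random

def claims_to_tokens(claims, shuffle=False, seed=None):
    """Flatten (property, value) claim pairs into tokens.

    At training time, pass shuffle=True so the model cannot pick up
    spurious order information from the source dumps; at prediction
    time shuffling has no effect because the model averages tokens.
    """
    tokens = ["{}={}".format(prop, val) for prop, val in claims]
    if shuffle:
        rng = random.Random(seed)  # seed only to make runs reproducible
        rng.shuffle(tokens)
    return tokens

claims = [("P31", "Q5"), ("P21", "Q6581097"), ("P106", "Q82955")]
print(claims_to_tokens(claims))                          # stable order
print(claims_to_tokens(claims, shuffle=True, seed=0))    # randomized
```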

There are a couple of things to know about ordering Wikidata statements:

So that leaves us with a few options:

  • For prediction time, randomization is very quick but has no effect, so it's extraneous; we probably shouldn't bake it in, but it's okay if we do.
  • For training time, we can either enforce the order that Wikidata sets or randomize. I wouldn't use what the API/dumps provide, as it's not clear to me that it's a stable ordering, and it could potentially affect model training.

At the moment, it looks like we are leaning towards always randomizing for simplicity, but @Dibyaaaaax if you have additional time while waiting for us to review code etc., you might want to look into how to extract the order from this page and apply it to the claims that are extracted from the dumps. This is the API call that would get you the page text: https://www.wikidata.org/w/api.php?action=parse&page=MediaWiki:Wikibase-SortedProperties&prop=wikitext&format=json
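Extracting the property order from that page might look like the sketch below. To keep it self-contained it parses a small made-up sample of the page's wikitext rather than calling the API; the actual page's markup may differ, so the regex is an assumption.

```python
import re

# The sorted-property list lives at MediaWiki:Wikibase-SortedProperties
# on Wikidata; its wikitext can be fetched with the action=parse API call
# linked above. SAMPLE_WIKITEXT is a stand-in for that response.
SAMPLE_WIKITEXT = """
* P31
* P279
* P21
"""

def extract_property_order(wikitext):
    """Return property IDs in the order they appear in the page text."""
    return re.findall(r"\bP\d+\b", wikitext)

print(extract_property_order(SAMPLE_WIKITEXT))  # ['P31', 'P279', 'P21']
```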

There are some Wikidata items that represent Wikipedia pages that aren't articles. E.g., Q8207058 sitelinks to Portal:Earth Sciences in English Wikipedia (and similarly for other languages).

For this task, we would like to filter out these items and keep only those items that have sitelink(s) to Wikipedia articles.

Method-1: To do this, check if the item is an instance of (P31) Wikimedia Internal Item (Q17442446) or one of its subclasses. The list of all the subclasses of Wikimedia Internal Item can be found here.

Method-2: We also considered using the sitelink names to filter out these items. For example, the sitelinks for Q8207058 all begin with Portal: in different languages. Collecting all the namespace names across the Wikipedia languages would be enough to apply this filter.

Both methods perform more or less the same for most items. One place where they differ: items that are an instance of Wikimedia disambiguation page (e.g., Q5506257) are filtered out by Method-1 but not by Method-2.

For now, we stick with Method-1 because it is more structured and relies on Wikidata's data model rather than string matching on sitelink names.
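Method-1 could be sketched roughly as below. The subclass set here is an assumed, abbreviated stand-in (the real set would come from the subclass list mentioned above), and the dict shape follows the entity JSON used in Wikidata dumps.

```python
# Hypothetical Method-1 filter: drop any item whose P31 claim points at
# Q17442446 (Wikimedia internal item) or one of its subclasses.
# ASSUMPTION: this set is a small sample; the full set would be
# collected separately (e.g. via SPARQL).
WIKIMEDIA_INTERNAL = {"Q17442446", "Q4167836", "Q4167410"}

def is_article_item(item):
    """item is a parsed Wikidata entity dict (as in the JSON dumps)."""
    for claim in item.get("claims", {}).get("P31", []):
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
        if value.get("id") in WIKIMEDIA_INTERNAL:
            return False
    return True

disambig = {"claims": {"P31": [{"mainsnak": {"datavalue": {"value": {"id": "Q4167410"}}}}]}}
human = {"claims": {"P31": [{"mainsnak": {"datavalue": {"value": {"id": "Q5"}}}}]}}
print(is_article_item(disambig))  # False
print(is_article_item(human))     # True
```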

Sort order of Wikidata statements
We decided to order the Wikidata statements instead of randomizing them, for the reasons @Isaac mentioned above. For that, the order of properties is collected from here (SortedProperties) using the API. We then sort the statements based on their properties, using the list of SortedProperties as a reference.
There were some statements with properties that do not appear in the SortedProperties list. Those statements are simply sent to the end of the list, so they appear after all the statements that are in their correct positions.
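The sort described above can be sketched as a stable sort keyed on each property's position in the SortedProperties list, with unknown properties ranked last. Function and variable names here are illustrative, not the actual mwtext code.

```python
def sort_by_property_order(properties, sorted_properties):
    """Sort property IDs by their position in sorted_properties.

    Properties not in the list get a rank past the end, so Python's
    stable sort sends them to the back while keeping their relative order.
    """
    rank = {pid: i for i, pid in enumerate(sorted_properties)}
    return sorted(properties, key=lambda pid: rank.get(pid, len(rank)))

sorted_props = ["P31", "P279", "P21"]
print(sort_by_property_order(["P21", "P9999", "P31"], sorted_props))
# ['P31', 'P21', 'P9999']
```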


Filtering non-articles
We've filtered out the Wikidata items that represent Wikipedia pages that aren't articles. We used SPARQL to query for the list of subclasses of Wikimedia Internal Item and discarded all the Wikidata items that either belong to the list or are an instance of (P31) one of the items on it.
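The subclass query could look like the sketch below. It only builds the query string (running it would require a network call to the Wikidata Query Service at https://query.wikidata.org/sparql); the exact query used by the task may differ.

```python
def build_subclass_query(root_qid):
    """SPARQL for all transitive subclasses of root_qid (including itself),
    using the wdt:P279 (subclass of) property path."""
    return (
        "SELECT ?class WHERE {{ "
        "?class wdt:P279* wd:{qid} . "
        "}}"
    ).format(qid=root_qid)

query = build_subclass_query("Q17442446")
print(query)
# SELECT ?class WHERE { ?class wdt:P279* wd:Q17442446 . }
```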