
Add wikidata features to topic models
Closed, Resolved, Public

Description

See T236713#5745643 for a discussion:

Some follow-up to a conversation with @Halfak and @dr0ptp4kt :

This is how I have been adjusting the model outputs based on Wikidata properties in my Wikidata-based topic model, but the same adjustments would apply to ORES:

  • I create an additional topic "Compilation.List_Disambig" that includes List / Disambiguation articles. This is based on either the instance-of property (P31:Q4167410 for disambiguation pages; P31:Q13406463 for lists) or presence of the "is a list of" property (P360, which largely duplicates P31:Q13406463 but is kept for completeness).
  • If P625 (coordinate location) does not exist for a Wikidata item, I lowered the output confidence of any geography prediction by 0.5 (effectively removing it if the threshold is 0.5). How to handle articles that are tagged with Geography WikiProjects but that most people don't think of as Geography (e.g., famous people being tagged by the WikiProject for the state where they were born) is still a big open question.
  • I took any item with instance-of human (P31:Q5) and moved it into a Person topic and downgraded Culture.Language and Literature. This will no longer be necessary given the in-progress changes to the taxonomy!
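To make the three adjustments above concrete, here is a minimal sketch. It is not the linked implementation: the input shapes (a dict of property ID to list of values, and a dict of topic to score), the "Geography" prefix test, the score of 1.0 for added topics, and the 0.5 penalty for Culture.Language and Literature are all assumptions for illustration.

```python
from typing import Dict, List

DISAMBIGUATION_PAGE = "Q4167410"
LIST_ARTICLE = "Q13406463"
HUMAN = "Q5"


def adjust_scores(claims: Dict[str, List[str]],
                  scores: Dict[str, float]) -> Dict[str, float]:
    """Return a copy of `scores` adjusted by the three rules above.

    `claims` maps property IDs to value lists (hypothetical input shape).
    """
    adjusted = dict(scores)
    instance_of = set(claims.get("P31", []))

    # Rule 1: lists and disambiguation pages get their own topic.
    if instance_of & {DISAMBIGUATION_PAGE, LIST_ARTICLE} or "P360" in claims:
        adjusted["Compilation.List_Disambig"] = 1.0  # assumed score

    # Rule 2: no coordinate location, so lower geography scores by 0.5.
    if "P625" not in claims:
        for topic in list(adjusted):
            if topic.startswith("Geography"):  # assumed naming scheme
                adjusted[topic] = max(0.0, adjusted[topic] - 0.5)

    # Rule 3: humans move to a Person topic; the literature topic is
    # downgraded (the size of the downgrade is an assumption).
    if HUMAN in instance_of:
        adjusted["Person"] = 1.0  # assumed score
        literature = "Culture.Language and Literature"
        if literature in adjusted:
            adjusted[literature] = max(0.0, adjusted[literature] - 0.5)

    return adjusted
```

With a threshold of 0.5, a geography score of 0.8 on an item without P625 drops to about 0.3 and falls out of the predictions.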

The actual code that I use is here: https://github.com/geohci/wikidata-topic-model/blob/master/app/app.py#L30

On a related topic, if we're considering outputting Wikidata properties as part of an ORES prediction, I'd argue for at least the following:

  • Instance-of (P31) to cover the List/Disambiguation use-case and provide further data to help evaluate whether an article is about a person or not
  • Occupation (P106) to help further break down the Biography topic
  • Gender (P21) to help with filtering predictions around women scientists etc. where false positives can be particularly problematic
  • Coordinate location (P625) to help with evaluating geography predictions.
  • We could potentially consider an aggregator for country too but its design is not obvious -- e.g., outputting any countries listed under country (P17) for non-people, country-of-citizenship (P27) for people, or something more complex with place of birth (P19) or place of death (P20) for people as well.

Event Timeline

So, we won't be able to output the properties directly. If we're going to output anything, it'll need to be in "feature" format. A feature can be a boolean, a float, or an int. Nulls are not allowed.

So, I could output property values as the number portion of Wikidata Qids as features. Coordinates can be pairs of features. I'm not quite sure what to do with the coordinate values if they don't exist. We could output zero if a property doesn't exist or has a non-Qid value.

It sounds like we want:

  • P31 (Instance-of): Q5=5 for humans; Q4167410=4167410 for disambiguation pages; Q13406463=13406463 for list pages; Q16521=16521 for organisms; Q3624078=3624078 for sovereign states; etc. 0 if not set.
  • P106 (Occupation): Q82594=82594 for computer scientist, Q18844224=18844224 for science fiction writer, etc. In this case, we often have multiple values. We might rather have Q36180=36180 for writers even if the top value is something like Q18844224 for science fiction writer. Is there a set of preferred values we can return even if there are multiple values for a property? 0 if not set.
  • P21 (Sex or gender): This one is pretty straightforward, I think. Q6581072=6581072 for female, and Q6581097=6581097 for male would be pretty common with a few others. 0 if not set.
  • P625 (Coordinate location): This will need to be split into two values -- at least. Maybe three. I'm thinking something like this:
    • P625-exists: Boolean. True if it exists, False otherwise
    • P625-latitude: Float. The latitude value if it exists and zero otherwise
    • P625-longitude: Float. The longitude if it exists and zero otherwise
    • P625-altitude: Float. The altitude if it exists and zero otherwise
  • P17-P27: Q30=30 for the United States. We'd pick the value P17. If it doesn't exist, pick the value of P27. If nothing is specified, zero.
  • P19-P20: Same as above.
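As a rough sketch of the encoding proposed in this list (not ORES code): Qids become their integer part, missing or non-Qid values become 0, and the coordinate is split into an exists flag plus floats. The input shape is an assumption: a dict of property ID to list of values, where P625 values are dicts with latitude, longitude and altitude keys. Taking the first listed value is also an assumption, since the list leaves open how to pick among multiple values.

```python
import re
from typing import Any, Dict, List, Union

Feature = Union[bool, int, float]


def qid_to_int(value: Any) -> int:
    """Numeric part of a Qid such as 'Q5'; 0 for anything else."""
    if isinstance(value, str):
        match = re.fullmatch(r"Q([0-9]+)", value)
        if match:
            return int(match.group(1))
    return 0


def first_value(claims: Dict[str, List[Any]], *props: str) -> int:
    """Encode the first value of the first property in `props` that is set."""
    for prop in props:
        values = claims.get(prop)
        if values:
            return qid_to_int(values[0])
    return 0


def extract_features(claims: Dict[str, List[Any]]) -> Dict[str, Feature]:
    """Build the null-free feature dict from a simplified claims mapping."""
    coords = claims.get("P625") or []
    coord = coords[0] if coords else {}
    return {
        "P31": first_value(claims, "P31"),
        "P106": first_value(claims, "P106"),
        "P21": first_value(claims, "P21"),
        "P625-exists": bool(coords),
        "P625-latitude": float(coord.get("latitude") or 0.0),
        "P625-longitude": float(coord.get("longitude") or 0.0),
        "P625-altitude": float(coord.get("altitude") or 0.0),
        "P17-P27": first_value(claims, "P17", "P27"),  # P17, else P27
        "P19-P20": first_value(claims, "P19", "P20"),  # P19, else P20
    }
```

Every value is a boolean, an int or a float, so the "no nulls" constraint holds, at the cost of 0 doing double duty as "not set".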

Anything else?

From my conversation with @dr0ptp4kt:

  • Infobox matches "settlement" (Look for all redirects to 'Infobox settlement')
  • We could also get geo coords from Wikipedia directly to supplement what we have in Wikidata

So, I could output property values as the number portion of Wikidata Qids as features.

Hmmm...I'm torn. On one hand, I like this approach because most properties have too many values to make one-hot encoding realistic. On the other hand, I'm concerned that this requires choosing a single value for each property, and many items (especially high-traffic / higher-quality items) are going to have multiple instance-of or occupation values. In my experience, there is no obvious way to do this (order on Wikidata is at best a weak proxy and it's very difficult to automatically determine the level of detail that's most useful from the Wikidata taxonomy). It might be that just labeling the White House as a mansion or Douglas Adams as a playwright is acceptable, but I'm hesitant to say that there's a good process for reducing these properties down to a single value. And if a single value isn't good enough, then the end user might just have been better off querying Wikidata themselves. So if there's no way to return an array, we might consider identifying a few static properties that we care about like List/Disambiguation and just return those as booleans rather than returning incomplete data.
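A minimal sketch of that boolean alternative, assuming the caller already has the full list of instance-of (P31) values for an item. The feature names and the choice of classes are hypothetical. Because each feature is a membership test over all values, nothing has to be reduced to a single value first.

```python
from typing import Dict, Iterable

# Hypothetical feature names mapped to the P31 values they test for.
STATIC_CLASSES = {
    "is-human": "Q5",
    "is-disambiguation": "Q4167410",
    "is-list": "Q13406463",
}


def static_boolean_features(instance_of: Iterable[str]) -> Dict[str, bool]:
    """One boolean per class of interest, checked against every P31 value."""
    values = set(instance_of)
    return {name: qid in values for name, qid in STATIC_CLASSES.items()}
```

An item with several instance-of values still yields a complete answer for each flag, whatever order the values are listed in on Wikidata.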

P625-latitude: Float. The latitude value if it exists and zero otherwise

So latitude will vary from -90 to +90 and longitude will vary from -180 to +180. I guess the idea is that someone should first check P625-exists before interpreting the latitude/longitude results. I think I'm okay with that. It will probably cause some headaches for someone who forgets to, but there's no obvious better solution.
I don't see much data at all for altitude -- I think most items must be using height above sea level (https://www.wikidata.org/wiki/Property:P2044) instead. Given that I've never used this for anything, don't think there is anything it would tell you about topic, and it has a very wide range of acceptable values (making choosing an obvious "no data" value difficult), I'd argue for not including altitude.
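For whoever consumes these features, the P625-exists check might look like the sketch below. The function name and the feature dict shape are assumptions. The point is that (0.0, 0.0) is a valid coordinate, so the zero placeholder cannot be told apart from real data without the flag.

```python
from typing import Dict, Optional, Tuple, Union

Feature = Union[bool, int, float]


def read_coordinates(
    features: Dict[str, Feature],
) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude), or None when P625 is not set."""
    if not features.get("P625-exists", False):
        # The zeros in the latitude/longitude slots are placeholders here.
        return None
    return (float(features["P625-latitude"]),
            float(features["P625-longitude"]))
```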

P17-P27: Q30=30 for the United States. We'd pick the value P17. If it doesn't exist, pick the value of P27. If nothing is specified, zero.

Seems reasonable to me though FYI country-of-citizenship is going to be multivalued for a not insignificant number of cases (e.g., Albert Einstein) and raises the same question as above with instance-of / occupation.

P19-P20: Same as above.

The more I think about this, the more I think I might advocate for not including these. Most often place of birth and place of death seem to be cities as opposed to countries, so the end user would probably need to query Wikidata to figure out what region the city is associated with and, at that point, there isn't much use in providing the feature data with the prediction.

Halfak triaged this task as Medium priority. Jan 15 2020, 9:44 PM

@dr0ptp4kt, trying to figure out how to prioritize this. We're wondering if there's a straightforward way that you might test out the new models (specifically articletopic) to see if they address most/all of the issues you were hoping to solve with these Wikidata properties.

@Halfak Is it possible to enable this model in production, but dark launch it or mark that part of the API as unstable?

From there I could compare the results against notebook1003:~dr0ptp4kt/venv/topic_predictions_201912_mediawiki_page_dump_enriched_20191201_through_20191219.tsv.gz , the generation of which is described at https://github.com/dr0ptp4kt/dr0ptp4kt.github.io/blob/master/topic-20191211.ipynb .

Preferably I'd be able to run the POSIX ores score_revisions command to produce the articletopic scores with a file of revision IDs just like before.

I may have missed the point here, though - were you actually asking about the aspect of returning the feature inputs back in the response such that I can mimic the strategy employed in the notebook, just against the encoded Wikidata properties?

@Halfak, got code for the new articletopic model generation? I was hoping to take a look and make sure I understand it.

For the sake of knowledge sharing for those following along...and our future selves...the following is an email I sent that might be helpful later on, too.

Kaldari reminded me of PageAssessments, which saves a few steps for getting assigned wikiprojects. Then I looked around a bit in Special:ApiSandbox to find the geocoordinate stuff.

Something like this isn't bad.

https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&prop=pageassessments%7Ccoordinates%7Ccategories%7Ctemplates%7Crevisions&continue=&titles=Stockholm&palimit=500&coprop=globe%7Ccountry&coprimary=primary&clshow=&cllimit=500&clcategories=&tllimit=500&tltemplates=&rvprop=content&rvslots=main

The Action API does go through the trouble of following redirects / transclusions of templates, so in this case it finds Infobox settlement. However, the actual wikitext on the page uses the template Infobox U.S. state. All the country extrapolation and cleanup I'm doing in the notebook and scripts might be nice (and it's easy to get a list of redirects and transclusions of a template of course, just not always so easy to canonicalize inferred country names), but then again it seems like much of the time if Infobox settlement is present so too are the geocoordinates, and thus the country can be figured out directly via the geocoordinates.
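For reference, the ApiSandbox link above corresponds roughly to the request built below. This is a sketch: the parameter values are copied from that link with the empty parameters dropped, the two helper names are hypothetical, and no request is actually sent. `has_template` assumes the default JSON response layout, where `query.pages` is keyed by page ID and each page carries a `templates` list.

```python
from typing import Any, Dict
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_query_url(title: str) -> str:
    """Build the Action API URL for assessments, coordinates and templates."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "pageassessments|coordinates|categories|templates|revisions",
        "titles": title,
        "palimit": 500,
        "coprop": "globe|country",
        "coprimary": "primary",
        "cllimit": 500,
        "tllimit": 500,
        "rvprop": "content",
        "rvslots": "main",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def has_template(response: Dict[str, Any], template: str) -> bool:
    """True if any page in a parsed response transcludes `template`."""
    pages = response.get("query", {}).get("pages", {})
    return any(
        entry.get("title") == template
        for page in pages.values()
        for entry in page.get("templates", [])
    )
```

Calling `has_template(parsed_json, "Template:Infobox settlement")` on the parsed response would then give the settlement signal mentioned earlier, including pages that reach the infobox through a wrapper template.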

Isaac claimed this task.

Closing this -- the general idea is still pertinent (for some topics we might want a different modeling approach than a single, multi-label classification model), but this will be worked on under new tasks if need be since the taxonomy, for example, has shifted since this task was initially opened.