See T236713#5745643 for a discussion:
In T236713#5745643, @Isaac wrote:
Some follow-up to a conversation with @Halfak and @dr0ptp4kt:
This is how I have been adjusting the model outputs based on Wikidata properties in my Wikidata-based topic model, but the same adjustments would apply to ORES:
- I create an additional topic "Compilation.List_Disambig" that includes List / Disambiguation articles. This is based on either the instance-of property (P31:Q4167410 for disambiguation pages; P31:Q13406463 for lists) or presence of the "is a list of" property (P360, which largely duplicates P31:Q13406463 but is kept for completeness).
- If P625 (coordinate location) does not exist for a Wikidata item, I lower the output confidence of any geography prediction by 0.5 (effectively removing it if the threshold is 0.5). It is still a big open question how to handle articles that are tagged with Geography WikiProjects but that most people don't think of as Geography (e.g., famous people tagged by the WikiProject for the state where they were born).
- I take any item with instance-of human (P31:Q5), move it into a Person topic, and downgrade Culture.Language and Literature. This will no longer be necessary given the in-progress changes to the taxonomy!
The actual code that I use is here: https://github.com/geohci/wikidata-topic-model/blob/master/app/app.py#L30
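For readers who do not want to open the repository, the three adjustments above can be sketched roughly as follows. This is an illustration only, not the linked code: the function name `adjust_topics`, the input shapes (a topic-to-confidence dict and a property-to-values dict), the `Geography` topic prefix, the `Person` label, setting a forced topic to confidence 1.0, and the size of the Culture.Language and Literature downgrade are all assumptions.

```python
# Wikidata item IDs taken from the description above.
DISAMBIGUATION = "Q4167410"
LIST_ARTICLE = "Q13406463"
HUMAN = "Q5"


def adjust_topics(scores, claims, geo_penalty=0.5, lang_lit_penalty=0.5):
    """Return a copy of `scores` adjusted by Wikidata properties.

    scores: dict mapping topic label -> confidence in [0, 1] (assumed shape).
    claims: dict mapping property ID -> list of values (assumed shape).
    lang_lit_penalty is a hypothetical value; the source only says "downgraded".
    """
    adjusted = dict(scores)
    instance_of = set(claims.get("P31", []))

    # 1. Lists and disambiguation pages get their own topic.
    if instance_of & {DISAMBIGUATION, LIST_ARTICLE} or claims.get("P360"):
        adjusted["Compilation.List_Disambig"] = 1.0

    # 2. Without coordinates, lower every geography prediction.
    if not claims.get("P625"):
        for topic in list(adjusted):
            if topic.startswith("Geography"):  # assumed naming convention
                adjusted[topic] = max(0.0, adjusted[topic] - geo_penalty)

    # 3. Humans go into a Person topic; Language and Literature is downgraded.
    if HUMAN in instance_of:
        adjusted["Person"] = 1.0
        key = "Culture.Language and Literature"
        if key in adjusted:
            adjusted[key] = max(0.0, adjusted[key] - lang_lit_penalty)

    return adjusted
```

With a threshold of 0.5, a human item without coordinates that scored 0.8 on a geography topic would end up near 0.3 for that topic and so drop out of the predictions.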
On a related topic, if we're considering outputting Wikidata properties as part of an ORES prediction, I'd argue for at least the following:
- Instance-of (P31) to cover the List/Disambiguation use-case and provide further data to help evaluate whether an article is about a person or not
- Occupation (P106) to help further break down the Biography topic
- Gender (P21) to help with filtering predictions around women scientists etc. where false positives can be particularly problematic
- Coordinate location (P625) to help with evaluating geography predictions.
- We could potentially consider an aggregator for country too, but its design is not obvious -- e.g., outputting any countries listed under country (P17) for non-people, country-of-citizenship (P27) for people, or something more complex with place of birth (P19) or place of death (P20) for people as well.
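To make the proposal concrete, here is a minimal sketch of selecting the listed properties and of the simplest country aggregator mentioned above (P17 for non-people, P27 for people). The function names, the `claims` shape (property ID to list of values), and the output format are hypothetical and not part of ORES. The more complex variant using place of birth (P19) or place of death (P20) is left out because it would require resolving each place to a country.

```python
# Properties proposed above for inclusion alongside a prediction.
OUTPUT_PROPERTIES = ("P31", "P106", "P21", "P625")
HUMAN = "Q5"


def select_properties(claims):
    """Keep only the proposed properties that are present and non-empty."""
    return {pid: claims[pid] for pid in OUTPUT_PROPERTIES if claims.get(pid)}


def aggregate_countries(claims):
    """Simplest aggregator: P27 for people, P17 for everything else."""
    is_human = HUMAN in claims.get("P31", [])
    source = "P27" if is_human else "P17"
    countries = []
    for value in claims.get(source, []):
        if value not in countries:  # drop duplicates, keep original order
            countries.append(value)
    return countries
```

For example, an item with `P31` containing `Q5` would report only its `P27` values, even if it also happens to carry a `P17` statement.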