Article features matter in two ways: they are an important component of debiasing and analyzing the reader surveys, and they inform the larger question of how readership gaps (i.e., populations that are less well represented among Wikipedia readers) relate to content gaps (i.e., biases, lower quality, or missing content in Wikipedia articles). This subtask focuses on gathering the article features that we believe may be important and testing whether they actually help in understanding reader behavior and demographic gaps.
Features under consideration:
- Topic:
- Wikidata properties:
- P21 (sex or gender)
- P625 (coordinate location)
- P585 (point in time)
- P569 (date of birth)
- P31 (instance of): whether the item is a disambiguation page or list article
- WikiProject mid-level categories as operationalized via ORES drafttopic labels
- Currently there is only a model for enwiki, so I will likely build a model that maps Wikidata claims to these labels
- Quality (language-independent features that drive the ORES wp10 quality labels):
- article length
- # of templates, infonoise (ratio of parsed text to wikitext)
- # of ref tags, # of wikilinks, # of external links
- # of level-two sections, # of level-three-plus sections
- Demand:
- article page views in language
- # of sitelinks (language editions with an article for the Wikidata item) and whether the article exists in a respondent's native language
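As a rough illustration of how the Wikidata-based features above could be derived, here is a minimal Python sketch that reads one entity in the standard Wikidata JSON layout (`claims`, `mainsnak`, `datavalue`, `sitelinks`). The function names, the output keys, and the `NON_WIKIPEDIA` exclusion set are hypothetical choices for this example, not an existing pipeline, and the exclusion set is probably incomplete.

```python
# Values of P31 (instance of) that mark non-content pages.
DISAMBIGUATION = "Q4167410"  # Wikimedia disambiguation page
LIST_ARTICLE = "Q13406463"   # Wikimedia list article

# Sitelink keys that end in "wiki" but are not Wikipedia language editions.
# Assumption: illustrative only; a real pipeline would need a vetted list.
NON_WIKIPEDIA = {
    "commonswiki", "specieswiki", "metawiki",
    "wikidatawiki", "mediawikiwiki", "sourceswiki",
}


def item_values(entity, pid):
    """Return the Q-ids that the entity's statements for `pid` point to."""
    qids = []
    for statement in entity.get("claims", {}).get(pid, []):
        datavalue = statement.get("mainsnak", {}).get("datavalue")
        # "novalue" / "somevalue" snaks carry no datavalue, so skip them.
        if datavalue and datavalue.get("type") == "wikibase-entityid":
            qids.append(datavalue["value"]["id"])
    return qids


def wikidata_features(entity, native_wiki=None):
    """Build a flat feature dict for one Wikidata entity.

    `native_wiki` is a sitelink key such as "dewiki" for the respondent's
    native language (hypothetical parameter for this sketch).
    """
    claims = entity.get("claims", {})
    instance_of = set(item_values(entity, "P31"))
    wikipedias = {
        key for key in entity.get("sitelinks", {})
        if key.endswith("wiki") and key not in NON_WIKIPEDIA
    }
    return {
        "gender": item_values(entity, "P21"),
        "has_coordinates": "P625" in claims,
        "has_point_in_time": "P585" in claims,
        "has_date_of_birth": "P569" in claims,
        "is_disambiguation": DISAMBIGUATION in instance_of,
        "is_list": LIST_ARTICLE in instance_of,
        "n_wikipedia_sitelinks": len(wikipedias),
        "in_native_language": native_wiki in wikipedias if native_wiki else None,
    }
```

Calling `wikidata_features(entity, native_wiki="dewiki")` on an entity parsed from a Wikidata JSON dump or from `Special:EntityData` returns a plain dict that can be joined to survey responses by item ID. Presence flags are used for P625, P585, and P569 because the features above care about whether the property exists rather than its exact value.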
Features currently leaning against:
- Outside taxonomies for topics (e.g., YAGO, DBpedia). In many cases these are excellent taxonomies, but they bring additional engineering challenges given the scale of the data analysis, whereas Wikidata properties and similar data are often much easier to join to our data. We also hope that this project can bring insight into how Wikidata properties might help us further understand and categorize knowledge gaps.
- Pure Wikidata taxonomy based on instance of (P31) / subclass of (P279) properties (explained in comments below)
- Article or Wikidata item embeddings: these would be less interpretable but might have more predictive power for debiasing and be useful for further projects as well. They are likely on hold, though, while we consider a more general solution to building embeddings of Wikipedia articles.
Additional resources:
- T221891#5309010 and follow-up response by Halfak regarding important elements of articles
- T219660 and T221891 regarding topic analysis with ORES drafttopic
- drafttopic models: https://github.com/wikimedia/drafttopic
- quality models: https://github.com/wikimedia/articlequality