
Determine important article features with respect to readership
Closed, ResolvedPublic


Article features are both an important component of the debiasing/analysis of the reader surveys and also inform the larger question of the relationship between readership gaps (i.e., populations that are less well-represented among Wikipedia readers) and content gaps (i.e., bias, lower quality, or missing content in Wikipedia articles). This subtask focuses on gathering article features we believe to be potentially important and testing whether they help us understand reader behavior and demographic gaps.

Features under consideration:

  • Topic:
    • Wikidata properties:
    • WikiProject mid-level categories as operationalized via ORES drafttopic labels
      • Currently there is only a model for enwiki, so I will likely build a model that maps Wikidata claims to these labels
  • Quality (language-independent features that drive the ORES wp10 quality labels):
    • article length
    • # of templates, infonoise (ratio of parsed text to wikitext)
    • # of ref tags, # of wikilinks, # external links
    • # of level-two sections, # of level-three-plus sections
  • Demand:
    • article page views in language
    • # of sitelinks (language editions with articles for Wikidata item) and whether article exists in a respondent's native language
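The quality features above are all computable directly from raw wikitext. A minimal sketch of how they might be approximated with simple regex heuristics is below; the regexes and sample wikitext are illustrative assumptions, and the real ORES feature extraction is considerably more careful:

```python
import re

def quality_features(wikitext):
    """Rough, regex-based approximations of the language-independent
    quality features listed above (illustrative only)."""
    # Crude markup-stripping for the infonoise ratio (parsed text : wikitext)
    stripped = re.sub(r"\{\{[^{}]*\}\}|\[\[|\]\]|<[^>]+>", "", wikitext)
    return {
        "length": len(wikitext),
        "num_templates": len(re.findall(r"\{\{", wikitext)),
        "infonoise": len(stripped) / max(len(wikitext), 1),
        "num_refs": len(re.findall(r"<ref[ >]", wikitext)),
        "num_wikilinks": len(re.findall(r"\[\[", wikitext)),
        "num_external_links": len(re.findall(r"(?<!\[)\[http", wikitext)),
        "num_l2_headings": len(re.findall(r"(?m)^==[^=].*==\s*$", wikitext)),
        "num_l3plus_headings": len(re.findall(r"(?m)^===+.*$", wikitext)),
    }

sample = ("==History==\nThe [[car]] was built.<ref>src</ref>\n"
          "===Design===\n{{Infobox}}\n[http://example.com site]")
print(quality_features(sample))
```

In practice these counts would be aggregated per article over a dump rather than computed on a snippet.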

Features currently leaning against:

  • Outside taxonomies for topics (e.g., YAGO, DBPedia). In many cases, these are excellent taxonomies, but they bring additional engineering challenges given the scale of the data analysis (whereas Wikidata properties etc. are often much easier to join into our data), and we hope that this project can bring insight into how Wikidata properties might help us further understand/categorize knowledge gaps.
  • Pure Wikidata taxonomy based on instance of (P31) / subclass of (P279) properties (explained in comments below)
  • Article or Wikidata item embeddings -- these would be less interpretable but might have more predictive power for debiasing and be useful for further projects as well. They are likely on hold though while we consider a more general solution to building embeddings of Wikipedia articles.

Additional resources:

Event Timeline

Isaac triaged this task as High priority.Jul 17 2019, 8:15 PM

My first attempt at building a language-independent means of representing article topic -- i.e., grouping article page views into categories regardless of which Wikipedia language edition the article was read in -- was to map each page view to its Wikidata item and then represent the item based on its instance-of / subclass-of properties. An example of how this might work is below [1]. The goal is to build a set of higher-level categories to which any Wikidata item with an instance-of property can be mapped. This is similar to the 14 categories in the Wikidata Concepts Monitor but ideally with no overlap and full coverage -- i.e., all items map deterministically to a single category.
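The "deterministic, single category per item" goal can be sketched as a priority-ordered rule table over instance-of values. The category names and class sets below are hypothetical placeholders, not the actual mapping:

```python
# Hypothetical priority-ordered rules: (high-level category,
# set of instance-of classes it covers). First match wins, which
# makes the mapping deterministic even for items with many classes.
CATEGORY_RULES = [
    ("People", {"human", "fictional human"}),
    ("Places", {"city", "country", "mountain"}),
    ("Creative Works", {"film", "book", "album"}),
]

def categorize(instance_of_values):
    """Map an item's instance-of values to exactly one category,
    with 'Other' as the full-coverage fallback."""
    for category, classes in CATEGORY_RULES:
        if classes & set(instance_of_values):
            return category
    return "Other"

print(categorize(["human", "film"]))  # priority order breaks ties
```

The fallback bucket guarantees full coverage; the fixed rule order guarantees no overlap.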

In practice, a few challenges led me to abandon this pure Wikidata instance-of approach: many items have numerous instance-of properties and superclasses that make sense from an ontological perspective but not from a "what is this article about" perspective. This makes it challenging to weight the relative importance of a given instance-of property, and the resulting "topics" do not always make sense. For example, Tesla Model 3 is an instance of automobile model, electric car, sedan, compact car, sports car, and battery electric vehicle. "Automobile model" is then a subclass of "vehicle model", which is a subclass of "model", a class shared with unrelated items such as Rubik's cubes. Using page views to determine how far to aggregate up this subclass network generally leads to overly general categories like these, which tell us little about the likely more pertinent topics (i.e., car-related).
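The over-generalization problem can be reproduced on a toy fragment of the graph. The sketch below (labels used in place of QIDs, and only a few illustrative edges) walks up instance-of (P31) then subclass-of (P279) links and shows that "Tesla Model 3" quickly reaches the uninformative class "model":

```python
from collections import deque

# Toy, hand-picked fragment of the Wikidata class graph (illustrative only).
INSTANCE_OF = {  # P31 edges
    "Tesla Model 3": ["automobile model", "electric car", "sedan"],
}
SUBCLASS_OF = {  # P279 edges
    "automobile model": ["vehicle model"],
    "vehicle model": ["model"],  # "model" is also reached from e.g. toy models
}

def superclasses(item, max_depth=5):
    """Collect all ancestor classes reachable from an item's
    instance-of values within max_depth subclass-of hops."""
    seen = set()
    queue = deque((cls, 1) for cls in INSTANCE_OF.get(item, []))
    while queue:
        cls, depth = queue.popleft()
        if cls in seen or depth > max_depth:
            continue
        seen.add(cls)
        for parent in SUBCLASS_OF.get(cls, []):
            queue.append((parent, depth + 1))
    return seen

# The specific classes ("automobile model") and the overly general
# ones ("model") arrive in the same unweighted set.
print(superclasses("Tesla Model 3"))
```

With no principled way to weight these ancestors, page-view-driven aggregation tends to settle on the most general shared classes.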

[1] Example Wikidata taxonomy:

Wikidata Taxonomy.png (540×960 px, 59 KB)

I'm currently exploring how to expand the ORES drafttopic model to languages beyond English by building a model that predicts the same ~40 mid-level categories used by the drafttopic model.


See an early exploration of this:

Overview of approach:

  • Iterate through recent dump of English Wikipedia
  • Gather all templates from article talk pages that include "wikiproject" or "wp" in their name (case-insensitive)
  • For articles whose talk pages had potential WikiProject templates, look through Wikidata JSON dump and gather their QIDs and claims (based on exact matching of title)
  • Map templates to mid-level categories via the existing drafttopic code. More precisely, I made a suite of small adjustments along the way:
    • I cleaned up the Geographical part of the hierarchy (e.g., WikiProjects about the Americas were mapping to "Cities" instead of "Americas")
    • I removed some duplicates (all WikiProjects related to Music were also mapped to Performing Arts, so I set them to map only to Music, the more specific category)
    • I added a few WikiProjects that were previously skipped due to inconsistencies in the WikiProjects Directory
    • I ignored the Assistance category (which largely covers Wikipedia maintenance)
    • I also hand-mapped template names to WikiProjects for templates that did not automatically match and had at least 5000 occurrences. For instance, while WikiProject Medicine has a template of the same name, the template WPMED is also widely used, so I hardcoded it as equivalent to WikiProject Medicine.
  • Use fastText to train a model that predicts mid-level categories (multilabel) from a bag-of-words representation of the properties and values in a Wikidata item
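The last step can be sketched as follows: a hypothetical, simplified serializer that turns an item's claims into a fastText supervised training line, treating both property IDs and value QIDs as bag-of-words tokens, with one `__label__` prefix per mid-level category. The label and the QIDs in the example are illustrative, not taken from the actual training data:

```python
def to_fasttext_line(labels, claims):
    """Serialize one Wikidata item into a fastText supervised training
    line, e.g. '__label__X P31 Q123 P279 Q456'. Claims are (property,
    value) pairs; both sides become plain tokens."""
    label_tokens = ["__label__" + l.replace(" ", "_") for l in labels]
    claim_tokens = [tok for prop, value in claims for tok in (prop, value)]
    return " ".join(label_tokens + claim_tokens)

# Hypothetical item tagged with one mid-level category and two claims.
line = to_fasttext_line(
    ["STEM.Medicine"],
    [("P31", "Q12140"), ("P279", "Q28885102")],
)
print(line)
```

With lines like these written out for every labeled item, a multilabel model could then be trained with the fastText library's supervised mode (e.g., `fasttext.train_supervised` with one-vs-all loss).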

I wrote up a more complete report on topic modeling (for assigning topics to page views) and some of my recommendations / takeaways:

I will likely continue to add to this as things occur to me, but it's a good summary for now. I'm going to resolve the task as any future iteration on topic modeling should be under a new task.

Notably, there is already work going on here: T233646