
Determine important article features with respect to readership
Closed, ResolvedPublic


Article features are both an important component of the debiasing/analysis of the reader surveys and also inform the larger question of the relationship between readership gaps (i.e., populations that are less well-represented among Wikipedia readers) and content gaps (i.e., bias, lower quality, or missing content in Wikipedia articles). This subtask focuses on gathering article features we believe to be potentially important and testing whether they help us understand reader behavior and demographic gaps.

Features under consideration:

  • Topic:
    • Wikidata properties:
    • WikiProject mid-level categories as operationalized via ORES drafttopic labels
      • Currently there is only a model for enwiki, so I will likely build a model that maps Wikidata claims to these labels
  • Quality (language-independent features that drive the ORES wp10 quality labels):
    • article length
    • # of templates, infonoise (ratio of parsed text to wikitext)
    • # of ref tags, # of wikilinks, # external links
    • # of level-two sections, # of level-three-plus sections
  • Demand:
    • article page views in language
    • # of sitelinks (language editions with articles for Wikidata item) and whether article exists in a respondent's native language
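The quality features above are all computable directly from raw wikitext. A minimal sketch of how they might be approximated with simple regex heuristics is below; the regexes and sample wikitext are illustrative assumptions, and the real ORES feature extraction is considerably more careful:

```python
import re

def quality_features(wikitext):
    """Rough, regex-based approximations of the language-independent
    quality features listed above (illustrative only)."""
    # Crude markup-stripping for the infonoise ratio (parsed text : wikitext)
    stripped = re.sub(r"\{\{[^{}]*\}\}|\[\[|\]\]|<[^>]+>", "", wikitext)
    return {
        "length": len(wikitext),
        "num_templates": len(re.findall(r"\{\{", wikitext)),
        "infonoise": len(stripped) / max(len(wikitext), 1),
        "num_refs": len(re.findall(r"<ref[ >]", wikitext)),
        "num_wikilinks": len(re.findall(r"\[\[", wikitext)),
        "num_external_links": len(re.findall(r"(?<!\[)\[http", wikitext)),
        "num_l2_headings": len(re.findall(r"(?m)^==[^=].*==\s*$", wikitext)),
        "num_l3plus_headings": len(re.findall(r"(?m)^===+.*$", wikitext)),
    }

sample = ("==History==\nThe [[car]] was built.<ref>src</ref>\n"
          "===Design===\n{{Infobox}}\n[http://example.com site]")
print(quality_features(sample))
```

In practice these counts would be aggregated per article over a dump rather than computed on a snippet.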

Features currently leaning against:

  • Outside taxonomies for topics (e.g., YAGO, DBPedia). In many cases, these are excellent taxonomies, but they bring additional engineering challenges given the scale of the data analysis (whereas Wikidata properties etc. are often much easier to join into our data), and we hope that this project can bring insight into how Wikidata properties might help us further understand/categorize knowledge gaps.
  • Pure Wikidata taxonomy based on instance of (P31) / subclass of (P279) properties (explained in comments below)
  • Article or Wikidata item embeddings -- these would be less interpretable but might have more predictive power for debiasing and be useful for further projects as well. They are likely on hold though while we consider a more general solution to building embeddings of Wikipedia articles.

Additional resources:

Event Timeline

Isaac triaged this task as High priority.Jul 17 2019, 8:15 PM

My first attempt at building a language-independent means of representing article topic -- i.e., grouping article page views into categories regardless of which Wikipedia language edition the article was read in -- was to map each page view to its Wikidata item and then represent the item based on its instance-of / subclass-of properties. An example of how this might work is below [1]. The goal is to build a set of higher-level categories to which any Wikidata item with an instance-of property can be mapped. This is similar to the 14 categories in the Wikidata Concepts Monitor but ideally with no overlap and full coverage -- i.e., all items map deterministically to a single category.
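The "deterministic, single category per item" goal can be sketched as a priority-ordered rule table over instance-of values. The category names and class sets below are hypothetical placeholders, not the actual mapping:

```python
# Hypothetical priority-ordered rules: (high-level category,
# set of instance-of classes it covers). First match wins, which
# makes the mapping deterministic even for items with many classes.
CATEGORY_RULES = [
    ("People", {"human", "fictional human"}),
    ("Places", {"city", "country", "mountain"}),
    ("Creative Works", {"film", "book", "album"}),
]

def categorize(instance_of_values):
    """Map an item's instance-of values to exactly one category,
    with 'Other' as the full-coverage fallback."""
    for category, classes in CATEGORY_RULES:
        if classes & set(instance_of_values):
            return category
    return "Other"

print(categorize(["human", "film"]))  # priority order breaks ties
```

The fallback bucket guarantees full coverage; the fixed rule order guarantees no overlap.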

In practice, a few challenges led me to abandon this pure Wikidata instance-of approach: many items have numerous instance-of properties and superclasses that make sense from an ontological perspective but not from a "what is this article about" perspective. This makes it challenging to weight the relative importance of a given instance-of property, and the resulting "topics" do not always make sense. For example, Tesla Model 3 is an instance of automobile model, electric car, sedan, compact car, sports car, and battery electric vehicle. "Automobile model" is then a subclass of "vehicle model", which is a subclass of "model", a class shared with unrelated items such as Rubik's cubes. Using page views to determine how far to aggregate up this subclass network generally leads to overly general categories like these, which tell us little about the likely more pertinent topics (i.e., car-related).
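The over-generalization problem can be reproduced on a toy fragment of the graph. The sketch below (labels used in place of QIDs, and only a few illustrative edges) walks up instance-of (P31) then subclass-of (P279) links and shows that "Tesla Model 3" quickly reaches the uninformative class "model":

```python
from collections import deque

# Toy, hand-picked fragment of the Wikidata class graph (illustrative only).
INSTANCE_OF = {  # P31 edges
    "Tesla Model 3": ["automobile model", "electric car", "sedan"],
}
SUBCLASS_OF = {  # P279 edges
    "automobile model": ["vehicle model"],
    "vehicle model": ["model"],  # "model" is also reached from e.g. toy models
}

def superclasses(item, max_depth=5):
    """Collect all ancestor classes reachable from an item's
    instance-of values within max_depth subclass-of hops."""
    seen = set()
    queue = deque((cls, 1) for cls in INSTANCE_OF.get(item, []))
    while queue:
        cls, depth = queue.popleft()
        if cls in seen or depth > max_depth:
            continue
        seen.add(cls)
        for parent in SUBCLASS_OF.get(cls, []):
            queue.append((parent, depth + 1))
    return seen

# The specific classes ("automobile model") and the overly general
# ones ("model") arrive in the same unweighted set.
print(superclasses("Tesla Model 3"))
```

With no principled way to weight these ancestors, page-view-driven aggregation tends to settle on the most general shared classes.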

[1] Example Wikidata taxonomy:

Wikidata Taxonomy.png (540×960 px, 59 KB)

I'm currently exploring how to expand the ORES drafttopic model to languages beyond English by building a model that predicts the same ~40 mid-level categories used by the drafttopic model.


See an early exploration of this:

Overview of approach:

  • Iterate through recent dump of English Wikipedia
  • Gather all templates from article talk pages that include "wikiproject" or "wp" in their name (case-insensitive)
  • For articles whose talk pages had potential WikiProject templates, look through Wikidata JSON dump and gather their QIDs and claims (based on exact matching of title)
  • Map templates to mid-level categories via the existing drafttopic code. More precisely, I made a suite of small adjustments along the way:
    • I cleaned up the Geographical part of the hierarchy (e.g., WikiProjects about the Americas were mapping to "Cities" instead of "Americas")
    • I removed some duplicates (all WikiProjects related to Music were also mapped to Performing Arts, so I set them to map only to Music, the more specific category)
    • I added a few WikiProjects that were previously skipped due to inconsistencies in the WikiProjects Directory
    • I ignored the Assistance category (which largely covers Wikipedia maintenance)
    • I also hand-mapped template names to WikiProjects for templates that did not automatically match and had at least 5000 occurrences. For instance, while WikiProject Medicine has a template of the same name, the template WPMED is also widely used, so I hardcoded it as equivalent to WikiProject Medicine.
  • Use fastText to train a model that predicts mid-level categories (multilabel) from a bag-of-words representation of the properties and values in a Wikidata item
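The last step can be sketched as follows: a hypothetical, simplified serializer that turns an item's claims into a fastText supervised training line, treating both property IDs and value QIDs as bag-of-words tokens, with one `__label__` prefix per mid-level category. The label and the QIDs in the example are illustrative, not taken from the actual training data:

```python
def to_fasttext_line(labels, claims):
    """Serialize one Wikidata item into a fastText supervised training
    line, e.g. '__label__X P31 Q123 P279 Q456'. Claims are (property,
    value) pairs; both sides become plain tokens."""
    label_tokens = ["__label__" + l.replace(" ", "_") for l in labels]
    claim_tokens = [tok for prop, value in claims for tok in (prop, value)]
    return " ".join(label_tokens + claim_tokens)

# Hypothetical item tagged with one mid-level category and two claims.
line = to_fasttext_line(
    ["STEM.Medicine"],
    [("P31", "Q12140"), ("P279", "Q28885102")],
)
print(line)
```

With lines like these written out for every labeled item, a multilabel model could then be trained with the fastText library's supervised mode (e.g., `fasttext.train_supervised` with one-vs-all loss).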

I wrote up a more complete report on topic modeling (for assigning topics to page views) and some of my recommendations / takeaways:

I will likely continue to add to this as things occur to me, but it's a good summary for now. I'm going to resolve the task as any future iteration on topic modeling should be under a new task.

Notably, there is already work going on here: T233646