
Review Adam's Topic Dataset
Closed, Resolved · Public

Description

Adam Baso has generated a list of about 10K records of articles with "best" topic predictions for pages accessed in July.

https://dr0ptp4kt.github.io/topics-6.html?go

Request: review the list and let Adam know whether we are interested in a dataset in Hive against which to join for the fuller list of namespace 0 articles on enwiki.

Notes from Adam about his approach:
I have a few glitchy things to clean up in my scripts and will be doing that anyway, but wanted to make sure the fuller dataset would be useful for your exploration. During Q2 I'm interested in drafting a plan to move this stuff into a pipeline if we think it would be useful.

This builds on the approach Chelsy used (https://analytics.wikimedia.org/datasets/one-off/English%20Wikipedia%20Page%20Views%20by%20Topics.html#Top-50-articles-read-in-March-2019-on-English-Wikipedia), adding in a bunch of heuristics to improve on the ORES drafttopic machine learning model's predictions so that "Geography.*" and "Culture.Language and Literature" aren't so often assigned as the probable topic for a given article*, plus taking a few small liberties for clarity. Here's what the columns mean. Extra columns can be useful for data cubing purposes, the point being we can derive all kinds of interesting insights if we simply create new columns of derived knowledge.

page title x (Talk): page title and talk page. Note that you can hover on the page title, and clicking either link will open it in a new tab.
predicted: predicted topic after the heuristics are applied
is_human: strong indicator the subject is human
has_geo: the article bears geocoordinates
has_list: strong indication this is a list page
country_association: heuristically derived country associated with this article. Notice this can be further tuned, but I've tried to keep it with obvious matches on country names from the available data.
topic: from the talk page of the article, the "best" wikiproject
topic_first_encountered: from the talk page of the article, the first encountered wikiproject
best1: drafttopic's highest-scoring topic assignment
best1_score: drafttopic's estimated probability for its highest-scoring topic assignment

  * That something is in Geography.* is an interesting note, but Toby suggested trying to extrapolate the country if it's not too hard. You'll see that this was doable in a number of cases and actually wouldn't be hard to make even stronger, but I've tried to stick to the low-hanging fruit for country mapping. The Culture.Language and Literature assignment frequently showing up has to do with biographies being subtopics of Culture, but that's not terribly helpful - we usually care more about why a human subject was interesting than strictly that the subject was a human and thus the article was biographical in nature.

Event Timeline

kzimmerman updated the task description.

Note from Toby: "Product analytics folks - I’d like to revisit the project Chelsey did about finding the most read topics on enwiki. No hurry but this would be very useful. "

kzimmerman triaged this task as Medium priority. Oct 7 2019, 5:22 PM
kzimmerman moved this task from Triage to Backlog on the Product-Analytics board.
kzimmerman added subscribers: dr0ptp4kt, Tnegrin.

@dr0ptp4kt To make sure I understand the entirety of the request: this is a request for review of the dataset and consultation on whether it should be implemented in Hive, correct? Or are you also looking for support in refining the logic? And is this a one-time thing, or is there a plan to productionize this?

This kind of topic modeling could be useful for a Content data cube (T234701).

Thanks @kzimmerman - yeah, what would be most appreciated is a cursory review of whether the "predicted" values are approximately sensible, and whether adding data of this nature to Hive as a starting point might be of use!

My tentative goal in Q2, pending interest, is to create a plan for getting this into some sort of pipeline, so that in Q3 it would be possible to stand up the pipeline.

As @Isaac notes and @Halfak & @diego discuss, there are risks to assigning a primary salient topic to articles. However, we also need to be able to understand, even if just retrospectively, how new content and the knowledge it emphasizes influence the flywheel, how knowledge equity is shaping up in practice in simplified terms, and so on.

As a thought experiment, consider that we will have net new articles as well as articles that change; deltas in the most salient classification assignment, multiplied by pageviews, can still be telling for explanatory purposes.

One thing I've been considering is whether it would make sense to retain both this more targeted "prediction" and the top three drafttopic categories exceeding some threshold (along with their numeric probabilities, or probabilities adjusted for those thresholds), so that we can take a more nuanced view of fluctuation in content and consumption. One can imagine summing pageviews times each predicted category's numeric score to arrive at the theoretical impression impact of certain types of content. If an article hits in only one category, then pageviews in that category are primarily how its consumption is weighted; if an article has strong linkage to three categories, then its consumption can be reflected in all three (we might divvy up the consumption if we want parity for pageviews + previews).

Articles can and will have changes in their topic assignments, and that's okay; in fact it is useful to be able to do retrospective analysis and to have the rough revision-at-month-x adjustments (better yet would be all revisions, but that's a different matter for a different day). There are both simple and complicated reporting approaches here that could help fill certain needs or give us more perspective in interpretation.
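
To make the thought experiment concrete, here's a minimal Python sketch of that weighting scheme. The best1/best1_score column names follow the dataset described above, but the threshold value and the choice to normalize scores to sum to 1 (for pageview parity) are assumptions for illustration, not settled design choices.

```python
# Minimal sketch: distribute each article's pageviews across its
# above-threshold topics, proportionally to the topic scores.
# The 0.1 threshold and sum-to-1 normalization are illustrative assumptions.
from collections import defaultdict

def topic_weighted_views(rows, n_best=3, threshold=0.1):
    """Sum pageviews per topic, divvied up across each article's topics."""
    totals = defaultdict(float)
    for row in rows:
        topics = []
        for i in range(1, n_best + 1):
            topic = row.get(f"best{i}")
            score = row.get(f"best{i}_score", 0.0)
            if topic and score >= threshold:
                topics.append((topic, score))
        if not topics:
            continue
        norm = sum(score for _, score in topics)  # normalize for parity
        for topic, score in topics:
            totals[topic] += row["pageviews"] * (score / norm)
    return dict(totals)

# An article hitting only one category keeps all its views there; an article
# strongly linked to three categories spreads its views across all three.
rows = [
    {"best1": "STEM.Physics", "best1_score": 0.9, "pageviews": 1000},
    {"best1": "Culture.Media", "best1_score": 0.6,
     "best2": "History_And_Society.History and society", "best2_score": 0.3,
     "best3": "Geography.Europe", "best3_score": 0.3, "pageviews": 900},
]
print(topic_weighted_views(rows))
```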

I do see opportunities to improve the logic, to be sure. There's also the matter of upstreaming selected logic into the ORES drafttopic ML model build so that less post-processing is necessary (particularly if we want to add extra columns for cubing, or make analysis judgment calls like using a country assignment as the topic so as to neutralize explicitly risky topic assignments; there are other small tweaks, like those mentioned for T229401, that are trivial too), and then projecting across all languages in a language-neutral way in the fashion of the work of Isaac, Aaron, and Diego. But first things first: I'm interested in whether the enwiki viewed pages here look sensible.

@dr0ptp4kt any chance you could rerun this for pages accessed on English Wikipedia in September? We saw a spike in traffic in North America (specifically the US) and were hoping to look at content to see if there are any hints. I think this would be a great way for us to test the model, since it has a tangible application.

also looping @cchen in

@kzimmerman possibly, depending on time pressure. I just set up a meeting for tomorrow to discuss with you and Connie.

Pasting in an email I sent:

Hi there. You can find a data set at stat1007.eqiad.wmnet:~dr0ptp4kt/topic_predictions.tsv.gz. It's about half a gig. Connie and Kate, I'll leave it to you now to work your magic! I'm available for Hangout Meet if you need me during any mutually available time the next few days.

TL;DR This is all non-redirect titles from enwiki from the wmf_raw.mediawiki_page Hive view for the 2019-09 cut, the 2019-09 cut being the latest available in Hive as of Friday night. You should be able to join on the page_id field or the page_title_x field with other tables, so long as you understand those things are somewhat fungible over time and you account for edge cases.
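
For anyone following along, here's a rough sketch of what such a join could look like from PySpark, assuming the TSV were loaded into a Hive table. The tmp.topic_predictions name is hypothetical, and the wmf.pageview_hourly field names reflect my understanding of that schema and should be verified before relying on this.

```python
# Hypothetical join sketch: aggregate September 2019 enwiki pageviews by
# predicted topic. tmp.topic_predictions is an assumed name for a Hive table
# loaded from the TSV; check the wmf.pageview_hourly fields against the
# current schema.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

views_by_topic = spark.sql("""
    SELECT tp.predicted,
           SUM(pv.view_count) AS views
    FROM tmp.topic_predictions tp
    JOIN wmf.pageview_hourly pv
      ON pv.page_id = tp.page_id
    WHERE pv.project = 'en.wikipedia'
      AND pv.namespace_id = 0
      AND pv.agent_type = 'user'
      AND pv.year = 2019 AND pv.month = 9
    GROUP BY tp.predicted
    ORDER BY views DESC
""")
views_by_topic.show(50, truncate=False)
```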

More context:

There's a bunch of enrichment on that data with newer data that ultimately feeds into the predicted field (see an HTML view of the first 10K non-randomized rows at https://dr0ptp4kt.github.io/topics-7.html). Please note that in some cases pages have taken on a deleted or redirected status since the 2019-09 cut that's available in Hive, and in a very small minority of cases the drafttopic API scores are unavailable (because of deletion, redirect, or an API error code). You'll notice this as the effectively empty right-hand side of a left join in the columns after the 'predicted' field. I suspect this modest drift, which is symptomatic of how the wiki works, will have little bearing on any aggregation of pageview joins, but one can never be certain until actually digging in.

...as discussed, the 'predicted' field is the post-enrichment best guess for a so-called "best" (that is to say, probable most salient) topic, and I've also placed the top 5 (best1, best2...) drafttopic values on the right-hand side of the TSV. The enrichment largely tries to compensate so that articles about people don't all end up with a 'predicted' value of [Culture.]Language and literature, and so that if the article looks like a settlement within a country (or a country itself) or has similar telltale signs, the apparent country name is used instead of the less utilitarian [Geography.]* mid-level topics. I mention this because if you're going to divvy up pageviews by the top 5 drafttopic values, you'll need to dampen pages with these characteristics in order to avoid improperly skewed data (for a derived ML model you may need to tune some as well).
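
To illustrate the dampening idea, a small sketch; the 0.25 factor and the exact topic label strings are illustrative assumptions, not tuned values.

```python
# Illustrative sketch of dampening biography/geography scores before divvying
# pageviews across the top-5 drafttopic values. The 0.25 factor and topic
# label strings are assumptions for illustration only.
def dampened_scores(best_topics, is_human, has_geo, factor=0.25):
    """best_topics: list of (topic, score) pairs from best1..best5."""
    adjusted = []
    for topic, score in best_topics:
        if is_human and topic == "Culture.Language and literature":
            score *= factor  # the person, not the biography genre, is salient
        if has_geo and topic.startswith("Geography."):
            score *= factor  # the country mapping already covers this signal
        adjusted.append((topic, score))
    return adjusted

# A human subject with geocoordinates gets its bio/geo scores down-weighted:
print(dampened_scores(
    [("Culture.Language and literature", 0.72), ("Geography.Europe", 0.41)],
    is_human=True, has_geo=True))
```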

This is minimum viable grade stuff, so please do note scripts are scattered about notebook1003 in my venv and topicmodeling directories, as well as my GitHub Pages repo. I'm happy to walk anyone through how it all fits together. Working out a pipeline approach with AE in Q2 so that a pipeline can be constructed in Q3 is one of my tentative goals.

Isaac and Diego have a Wikidata-claims-based mapping file for all Wikidata entities as of a recent dump that could be used to try to derive similar things in a more language-neutral way (Isaac, Aaron, and I talk somewhat regularly ... and we all talk with Diego to varying degrees). At present their approach is intentionally general, and Isaac has taken some of my concepts to dampen variables accordingly [n.b., concepts I built based on inspiration from him!]. We've discussed the notion that we should upstream my enhancements to drafttopic, so that their modeling strategy becomes a beneficiary of my work, since their strategy trains atop drafttopic, too. This line of thinking is the route I'd like to go, but as I say, there's more planning to be done. As a short-run thing, if we feel we need to generate a similar but less data-rich dataset for Spanish or some handful of wikis in the next two days, I can crank on that, but understand that the results will be unreliable and I honestly might not finish it with enough lead time for you [for any quarterly reporting on the MVP dataset].

Thanks Adam, assigning to Connie to provide comments on the dataset as she uses it with our September pageview data.

I did another run, and pointed to the details of forming the data set at https://github.com/dr0ptp4kt/dr0ptp4kt.github.io/blob/master/topic-20191211.ipynb. Some of the scripts that run out of band, which are referenced in the notebook commentary, have been copied into the same directory as the notebook in the repo.

@dr0ptp4kt Connie has her hands full with digging into MTP/board-related data and pushing out the editors dataset. I'm moving this back to our backlog to be picked up later in Q3.

@dr0ptp4kt - Sorry for the late feedback. The only topic predictions that don't make sense to me are the "Regional society" and "Regional geography" ones. No one is ever going to choose an interest in "Regional society" or "Regional geography". These either need to be refined to topics like "European society" and "African geography", for example, or just generalized to "Society" and "Geography", although FWIW, "Society" doesn't seem like an especially useful topic either.

In our current taxonomy, here are the projects related to "society":

  • WikiProject Awards
  • WikiProject Gender Studies
  • WikiProject LGBT studies
  • WikiProject Modern Western Europe
  • WikiProject Pakistani history
  • WikiProject Russian history
  • WikiProject Sexology and sexuality
  • WikiProject Ageing and culture
  • WikiProject Agriculture
  • WikiProject Alternative views
  • WikiProject Animal rights
  • WikiProject Arab world
  • WikiProject Corruption
  • WikiProject Cultural Evolution
  • WikiProject Disability
  • WikiProject Environment
  • WikiProject Fisheries and Fishing
  • WikiProject Forestry
  • WikiProject Globalization
  • WikiProject Home Living
  • WikiProject Human rights
  • WikiProject Human Rights in Sri Lanka
  • WikiProject Nonviolence
  • WikiProject Ethnic groups
  • WikiProject African diaspora
  • WikiProject Asian Americans
  • WikiProject Anthropology
  • WikiProject Assyria
  • WikiProject Azerbaijan
  • WikiProject Basque
  • WikiProject Berbers
  • WikiProject Clans of Scotland
  • WikiProject Igbo
  • WikiProject Indian caste system
  • WikiProject Franco-Americans
  • WikiProject Pashtun
  • WikiProject Taiwan
  • WikiProject Tamil civilization
  • WikiProject Israel Palestine Collaboration
  • WikiProject Sociology
  • WikiProject Feminism

If you wanted "African Society" I would look for the intersection between Geography.Regions.Africa.Africa* and History & Society.Society.
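
A quick sketch of that intersection test; the label strings here are illustrative and should be checked against the actual taxonomy labels.

```python
# Sketch of the set-intersection idea: call an article "African society" when
# its topic set includes both a Geography.Regions.Africa.* label and
# History_And_Society.Society. Label strings are illustrative assumptions.
def is_african_society(topics):
    in_africa = any(t.startswith("Geography.Regions.Africa") for t in topics)
    in_society = any(t.startswith("History_And_Society.Society") for t in topics)
    return in_africa and in_society

print(is_african_society({"Geography.Regions.Africa.Eastern Africa",
                          "History_And_Society.Society"}))              # True
print(is_african_society({"Geography.Regions.Europe.Western Europe"}))  # False
```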

Thanks @kaldari. Yeah, "Regional society", "Regional geography", and "Regional interest" were intentionally general, sort of as last-ditch categories for when there weren't higher-confidence topic assignments. Part of this was to avoid false-positive topic assignments. I like the set intersection notion @Halfak suggests. For analytic needs, I agree in general that it would be ideal to have a little more precision on the probable most salient topic cluster, if a small enough set of certain clusters emerges regularly (and I'm pretty sure they do).

Looking at https://en.wikipedia.org/wiki/Yuri_Gagarin, the signals indicated it was a person and that there was a geographic relationship, but the five highest-scoring mid-level categories weren't quite as telling as to the non-biographical, non-geographic aspects of topic assignment. In the fuller data set one could do rollup queries on stuff like country_association plus the fact that the entity appeared to be a person, though (a sketch follows the record below)!

page_title_x: Yuri_Gagarin
page_id: 34226
rev_id: 918350670
predicted: Regional society
is_human: 1.0
has_geo: 1.0
has_list: (empty)
country_association: Russia
topic: WikiProject Russia
topic_first_encountered: WikiProject Biography
best1: Culture.Language and literature (score 0.7221317911199624)
best2: Geography.Europe (score 0.40902686380799524)
best3: Geography.Countries (score 0.27393608048351675)
best4: Assistance.Maintenance (score 0.12838172886084948)
best5: History_And_Society.History and society (score 0.11529839248636366)
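
For instance, a hypothetical pandas rollup over the TSV along those lines, assuming a local copy of the file:

```python
# Hypothetical rollup: count pages flagged as human per country_association,
# so a page like the one above counts toward Russia rather than
# Culture.Language and literature. Assumes topic_predictions.tsv.gz is local.
import pandas as pd

df = pd.read_csv("topic_predictions.tsv.gz", sep="\t")
human_by_country = (
    df[(df["is_human"] == 1.0) & df["country_association"].notna()]
    .groupby("country_association")["page_id"]
    .count()
    .sort_values(ascending=False)
)
print(human_by_country.head(20))
```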

In the case of "Regional geography", that was typically where it was evident there was a geographic component but it was difficult to extrapolate the country, country being a neutral "most salient" topic mapping for settlements when it could be determined. This is where fusing together Wikidata, geolookups when geodata are available, and Parsoid output content extraction (instead of mwparserfromhell with edge-case handling) could be useful.

I'm going to be looking at the newer articletopic model output relatively soon, as I understand it produces even more relevant topic assignments. Looking at https://ores.wikimedia.org/ui/ for revision 918350670 shown above, the newer articletopic output does a nice job of pinpointing, at higher probabilities, some very relevant topic assignments in addition to the bio/geo ones the older drafttopic suggested. (For those following along, there's a newer version of drafttopic, and you can see at that same UI how it also produces seemingly more relevant output nowadays.)

Research team will provide guidelines for how to use the topic models and post processing. Updates will be in the parent task.