Review Adam's Topic Dataset
Closed, Resolved, Public

Assigned To

Authored By

	kzimmerman
	Oct 7 2019, 5:17 PM

Description

Adam Baso has generated a list of about 10K records of articles with "best" topic predictions for pages accessed in July.

https://dr0ptp4kt.github.io/topics-6.html?go

Request: Review the list and let Adam know whether we are interested in a dataset in Hive to join against for the fuller list of namespace 0 articles on enwiki.

Notes from Adam about his approach:
I have a few glitchy things to clean up in my scripts and will be doing that anyway, but wanted to make sure the fuller dataset would be useful for your exploration. During Q2 I'm interested in drafting a plan to move this stuff into a pipeline if we think it would be useful.

This builds on the approach Chelsy used (https://analytics.wikimedia.org/datasets/one-off/English%20Wikipedia%20Page%20Views%20by%20Topics.html#Top-50-articles-read-in-March-2019-on-English-Wikipedia), adding in a bunch of heuristics to improve on the ORES drafttopic machine learning model's predictions so that "Geography.*" and "Culture.Language and Literature" aren't so often assigned as the probable topic for a given article*, plus taking a few small liberties for clarity. Here's what the columns mean. Extra columns can be useful for data cubing purposes, the point being we can derive all kinds of interesting insights if we simply create new columns of derived knowledge.

page title x (Talk): page title and talk page. Note that you can hover over the page title, and clicking either link opens it in a new tab.
predicted: predicted topic after the heuristics are applied
is_human: strong indicator the subject is human
has_geo: the article bears geocoordinates
has_list: strong indication this is a list page
country_association: heuristically derived country associated with this article. Note that this can be tuned further, but I've tried to stick to obvious matches on country names from the available data.
topic: from the talk page of the article, the "best" wikiproject
topic_first_encountered: from the talk page of the article, the first encountered wikiproject
best1: drafttopic's highest-scoring topic assignment
best1_score: drafttopic's estimated probability for its highest-scoring topic assignment

That something is in Geography.* is an interesting note, but Toby suggested trying to extrapolate the country if it's not too hard. You'll see that this was doable in a number of cases and actually wouldn't be hard to make even stronger, but I've tried to stick to the low hanging fruit for country mapping. The Culture.Language and Literature assignment frequently showing up has to do with biographies being subtopics of Culture, but that's not terribly helpful - we usually care more about why a human subject was interesting than strictly that the subject was a human and thus the article was biographical in nature.

Related Objects

Status	Assigned	Task
Declined	None	T298924 Superset - Product Analytics Canonical Dashboards, Reports, and Datasets
Open	kzimmerman	T234701 "Content" equivalent of pageviews daily or edits_hourly available to use in Turnilo and Superset
Duplicate	Mayakp.wiki	T255496 Identify stakeholders, gather requirements, and determine maintenance and ownership responsibilities for the content dataset
Open	None	T257636 Technical Requirements for Content dataset
Open	None	T257638 Topic Dataset : Model, Threshold, Post-processing
Resolved	cchen	T234839 Review Adam's Topic Dataset

Event Timeline

Note from Toby: "Product analytics folks - I’d like to revisit the project Chelsey did about finding the most read topics on enwiki. No hurry but this would be very useful. "

@dr0ptp4kt To make sure I understand the entirety of the request: This is a request for review of the dataset and consultation on whether it should be implemented in Hive, correct? Or are you also looking for support in refining the logic? And is this a one-time thing, or is there a plan to productionize this?

This kind of topic modeling could be useful for a Content data cube (T234701).

Thanks @kzimmerman - yeah, some cursory review of the "predicted" values and whether they're approximately sensible and whether addition of stuff of this nature to Hive for a starting point might be of use would be most appreciated!

My tentative goal in Q2 is, pending interest, to create a plan for getting this into some sort of pipeline, so that in Q3 it would be possible to stand up the pipeline.

As @Isaac notes and @Halfak & @diego discuss there are risks to assigning a primary salient topic to articles. However, we also need to be able to understand, even if just retrospectively, how new content, and its emphasized knowledge, influence the flywheel, how knowledge equity is shaping up in practice in simplified terms, and so on.

As a thought experiment, consider that we may have net new articles and we may have articles that change, and deltas in the most salient classification assignment times pageviews can still be telling for explanatory purposes.

One thing I've been considering is if perhaps it would make sense to retain both this more targeted "prediction" and perhaps the top three drafttopic categories exceeding some threshold (along with their numeric probabilities, or probabilities taking into account thresholds) such that we can take a more nuanced view of content and consumption fluctuation. One can imagine how we might sum calculated pageviews times each predicted category's numeric score to arrive at the theoretical impression impact of certain types of content. So if we have an article that only hits in one category, then pageviews in that category are primarily how the content consumption is weighted. And if we have an article with strong linkage to three categories, then its consumption can be reflected in all three categories (we might divvy up the consumption if we want parity for pageviews + previews). It's the case that articles can and will have changes in their topic(s) assignment(s), and that's okay - in fact it is useful to be able to do retrospective analysis and to have the rough revision-at-month-x based adjustments (better yet would be all revisions, but that's a different matter for a different day). There are both simple and complicated reporting approaches here that could all help fill certain needs or give us a bit more perspective in interpretation.
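The weighting idea above can be sketched roughly as follows. This is a minimal illustration, not the logic used for the dataset: the function names, the 0.3 threshold, and the choice of three topics are all assumptions made for the example. The `normalize` flag covers the "divvy up the consumption" variant, where an article's pageviews are split across its kept topics so they sum back to the original count.

```python
from collections import defaultdict


def attribute_pageviews(pageviews, topic_scores, threshold=0.3, top_n=3, normalize=False):
    """Spread one article's pageviews across its highest-scoring topics.

    topic_scores maps topic name -> probability. threshold and top_n are
    illustrative values, not values taken from this task.
    """
    kept = sorted(
        ((topic, score) for topic, score in topic_scores.items() if score >= threshold),
        key=lambda item: item[1],
        reverse=True,
    )[:top_n]
    if not kept:
        return {}
    total = sum(score for _, score in kept)
    result = {}
    for topic, score in kept:
        # normalize=True divides the views among topics; otherwise each topic
        # receives pageviews times its own score.
        weight = score / total if normalize else score
        result[topic] = pageviews * weight
    return result


def topic_impressions(articles, **kwargs):
    """Sum weighted pageviews per topic over (pageviews, topic_scores) pairs."""
    totals = defaultdict(float)
    for pageviews, topic_scores in articles:
        for topic, views in attribute_pageviews(pageviews, topic_scores, **kwargs).items():
            totals[topic] += views
    return dict(totals)
```

An article that clears the threshold in only one category contributes to that category alone, while an article with three strong categories is reflected in all three.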

I do see opportunities to improve the logic, to be sure. There's also the matter of upstreaming selected logic to the ORES drafttopic ML model building so that less post-processing is necessary (particularly if we want to add extra columns for cubing, or make analysis judgment calls like using a country assignment as the topic so as to neutralize explicitly risky topic assignment). There are some other small tweaks, like those mentioned for T229401, that are trivial, too. After that comes projecting across all languages in a language-neutral way, in the fashion of the work of Isaac, Aaron, and Diego. But first things first: I'm interested in whether the predictions look sensible on the viewed enwiki pages.

kzimmerman added a subscriber: Iflorez.Oct 8 2019, 10:22 PM

kzimmerman added a subscriber: kaldari.Oct 11 2019, 12:31 AM

• Tnegrin added a subscriber: Amire80.Oct 15 2019, 6:03 PM

• mmodell edited projects, added Product-Analytics (Kanban); removed Product-Analytics.Oct 16 2019, 5:46 PM

• mmodell edited projects, added Product-Analytics; removed Product-Analytics (Kanban).Oct 16 2019, 5:51 PM

@dr0ptp4kt any chance you could rerun this for pages accessed in English Wikipedia in September? We saw a spike in traffic in North America (specifically the US) and were hoping to look at content to see if there are any hints. I think this would be a great way for us to test the model, since it has a tangible application.

also looping @cchen in

@kzimmerman possibly, depending on time pressure. I just set up a meeting for tomorrow to discuss with you and Connie.

Pasting in an email I sent:

Hi there. You can find a data set at stat1007.eqiad.wmnet:~dr0ptp4kt/topic_predictions.tsv.gz. It's about half a gig. Connie and Kate, I'll leave it to you now to work your magic! I'm available for Hangout Meet if you need me during any mutually available time the next few days.

TL;DR This is all non-redirect titles from enwiki from the wmf_raw.mediawiki_page Hive view for the 2019-09 cut, the 2019-09 cut being the latest available in Hive as of Friday night. You should be able to join on the page_id field or the page_title_x field with other tables, so long as you understand those things are somewhat fungible over time and you account for edge cases.
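As a rough illustration of the join described above, the sketch below indexes prediction rows by page_id and rolls pageviews up by the predicted field. The column names come from the TSV header shown later in this task; the function names, the pageview counts, and the unmatched page ID are made up for the example. Page IDs without a match are counted separately, as one way of accounting for the edge cases.

```python
import csv
import io


def load_predictions(tsv_file):
    """Index prediction rows by integer page_id.

    QUOTE_NONE is an assumption: page titles can contain double quotes,
    so quote characters are treated as ordinary text.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {int(row["page_id"]): row for row in reader}


def pageviews_by_predicted(predictions, pageviews):
    """Sum views per predicted topic; return (totals, unmatched_views)."""
    totals = {}
    unmatched = 0
    for page_id, views in pageviews.items():
        row = predictions.get(page_id)
        if row is None or not row["predicted"]:
            # Page was deleted, redirected, or otherwise absent from the cut.
            unmatched += views
            continue
        totals[row["predicted"]] = totals.get(row["predicted"], 0) + views
    return totals, unmatched


# In-memory stand-in for the real file; pageview numbers are hypothetical.
sample = io.StringIO(
    "page_title_x\tpage_id\tpredicted\n"
    "Yuri_Gagarin\t34226\tRegional society\n"
)
predictions = load_predictions(sample)
totals, unmatched = pageviews_by_predicted(predictions, {34226: 1200, 999: 30})
```

For the gzip-compressed file itself, a handle from `gzip.open(path, "rt", encoding="utf-8", newline="")` could be passed to `load_predictions` in place of the `StringIO` object.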

More context:

There's a bunch of enrichment on that data with newer data that ultimately feeds into the predicted field (see example of first 10K non-randomized rows at https://dr0ptp4kt.github.io/topics-7.html for an HTML view). Please note that in some cases pages have, since the 2019-09 cut that's available in Hive, taken on a deleted or redirected status, and in a very small minority of cases the drafttopic API scores are unavailable (because of deletion or redirect or API error code). You'll notice this because the columns after the 'predicted' field are empty in those rows, effectively the unmatched right-hand side of a left join. I suspect this modest drift, which is symptomatic of how the wiki works, will have little bearing on any aggregation of pageview joins, but one can never be certain until actually digging in.
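One simple way to gauge that drift is to separate rows whose enrichment columns came back empty from those that were scored. The sketch below assumes each row is a dict keyed by column name and uses best1 as the marker column; both the function name and that choice of marker are assumptions for illustration.

```python
def split_by_enrichment(rows, marker_column="best1"):
    """Split rows into (enriched, missing) by whether marker_column is filled.

    Using best1 as the marker is an assumption; any column populated only
    when the drafttopic lookup succeeded would serve the same purpose.
    """
    enriched, missing = [], []
    for row in rows:
        value = (row.get(marker_column) or "").strip()
        (enriched if value else missing).append(row)
    return enriched, missing
```

Comparing `len(missing)` with the total row count gives a quick sense of how much of the 2019-09 cut lacks scores before any pageview aggregation is attempted.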

...as discussed, the 'predicted' field is the post-enrichment best guess for a so-called "best" (that is to say, the probable most salient) topic, and I've also placed the top 5 (best1, best2...) drafttopic values in the righthand side of the TSV. The enrichment largely tries to compensate so that articles about people don't all end up with a 'prediction' value of [Culture.]Language and literature and also so that if the article looks like a settlement within a country (or a country) or has similar telltale signs the apparent country name is used instead of the less utilitarian [Geography.]* mid-level topics. I mention this because if you're going to divvy up pageviews by the top 5 drafttopic values you'll need to dampen pages with these characteristics in order to avoid improperly skewed data (for a derived ML model you may need to tune some as well).
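The dampening mentioned above might look something like the following. This is only a sketch under stated assumptions: the 0.5 factor is arbitrary, the helper names are hypothetical, and the matching rules (the is_human flag for Language and literature, a non-empty country_association for Geography.* topics) are a simplified reading of the description rather than the actual enrichment scripts.

```python
def is_flagged(value):
    """Interpret TSV flags such as "1.0", "0.0" or "" as booleans."""
    try:
        return float(value or 0) > 0
    except ValueError:
        return False


def dampen_scores(row, topic_scores, factor=0.5):
    """Return a copy of topic_scores with likely over-assigned topics scaled down.

    factor is a hypothetical dampening weight, not a tuned value.
    """
    human = is_flagged(row.get("is_human"))
    has_country = bool(row.get("country_association"))
    damped = {}
    for topic, score in topic_scores.items():
        bio_default = human and topic.endswith("Language and literature")
        geo_default = has_country and topic.startswith("Geography.")
        damped[topic] = score * factor if (bio_default or geo_default) else score
    return damped
```

The dampened scores could then be used wherever the raw top 5 scores would otherwise weight pageviews.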

This is minimum viable grade stuff, so please do note scripts are scattered about notebook1003 in my venv and topicmodeling directories, as well as my GitHub Pages repo. I'm happy to walk anyone through how it all fits together. Working out a pipeline approach with AE in Q2 so that a pipeline can be constructed in Q3 is one of my tentative goals.

Isaac and Diego have a Wikidata claims based mapping file for all Wikidata entities as of a recent dump that could be used to try to derive similar things in a more language neutral way (Isaac, Aaron, and I talk somewhat regularly ... and we all talk with Diego to varying degrees). At present their approach is intentionally general, and Isaac has taken some of my concepts to dampen variables accordingly [n.b., which I built based on inspiration from him!]. We've discussed the notion that we should upstream my enhancements to drafttopic, so that their modeling strategy will become a beneficiary of my work, as their strategy trains atop drafttopic, too. This line of thinking is the route I'd like to go, but as I say there's more planning to be done. As a short run thing, if we feel we need to generate some sort of similar, but less data rich, dataset on Spanish or some handful of wikis in the next two days, I can crank on that, but understand the results will be unreliable and I honestly might not finish it with enough lead time for you [for any quarterly reporting on the MVP dataset].

Thanks Adam, assigning to Connie to provide comments on the dataset as she uses it with our September pageview data.

I did another run, and pointed to the details of forming the data set at https://github.com/dr0ptp4kt/dr0ptp4kt.github.io/blob/master/topic-20191211.ipynb. Some of the scripts run out of band, which are referenced in the notebook commentary, have been copied into the same directory as this notebook in the repo.

@dr0ptp4kt Connie has her hands full with digging into MTP/board-related data and pushing out the editors dataset. I'm moving this back to our backlog to be picked up later in Q3.

kzimmerman moved this task from Backlog to Current Quarter on the Product-Analytics board.Feb 12 2020, 6:09 PM

@dr0ptp4kt - Sorry for the late feedback. The only topic predictions that don't make sense to me are the "Regional society" and "Regional geography" ones. No one is ever going to choose an interest in "Regional society" or "Regional geography". These either need to be refined to topics like "European society" and "African geography", for example, or just generalized to "Society" and "Geography", although FWIW, "Society" doesn't seem like an especially useful topic either.

In our current taxonomy, here are the projects related to "society":

WikiProject Awards
WikiProject Gender Studies
WikiProject LGBT studies
WikiProject Modern Western Europe
WikiProject Pakistani history
WikiProject Russian history
WikiProject Sexology and sexuality
WikiProject Ageing and culture
WikiProject Agriculture
WikiProject Alternative views
WikiProject Animal rights
WikiProject Arab world
WikiProject Corruption
WikiProject Cultural Evolution
WikiProject Disability
WikiProject Environment
WikiProject Fisheries and Fishing
WikiProject Forestry
WikiProject Globalization
WikiProject Home Living
WikiProject Human rights
WikiProject Human Rights in Sri Lanka
WikiProject Nonviolence
WikiProject Ethnic groups
WikiProject African diaspora
WikiProject Asian Americans
WikiProject Anthropology
WikiProject Assyria
WikiProject Azerbaijan
WikiProject Basque
WikiProject Berbers
WikiProject Clans of Scotland
WikiProject Igbo
WikiProject Indian caste system
WikiProject Franco-Americans
WikiProject Pashtun
WikiProject Taiwan
WikiProject Tamil civilization
WikiProject Israel Palestine Collaboration
WikiProject Sociology
WikiProject Feminism

If you wanted "African Society" I would look for the intersection between Geography.Regions.Africa.Africa* and History & Society.Society.
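A minimal sketch of that intersection, assuming per-article topic scores are available as a dict: an article counts as a match only when every requested label prefix has at least one topic at or above a threshold. The function name and the 0.5 threshold are assumptions, and the label strings are copied from the comment above, so their spelling would need to match the model's actual output.

```python
def matches_intersection(topic_scores, prefixes, threshold=0.5):
    """True when every prefix has at least one topic scoring >= threshold."""
    return all(
        any(
            topic.startswith(prefix) and score >= threshold
            for topic, score in topic_scores.items()
        )
        for prefix in prefixes
    )


# Label prefixes as written in the comment above; exact spelling is an assumption.
AFRICAN_SOCIETY = ("Geography.Regions.Africa.", "History & Society.Society")
```

Matching on prefixes rather than exact labels lets one rule cover the sub-regions under a continent as well.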

Thanks @kaldari. Yeah, "Regional society" and "Regional geography" and "Regional interest" were intentionally general, sort of as last ditch categories when there weren't higher confidence topic assignments. Part of this was to avoid false positive topic assignment. I like the set intersection notion @Halfak suggests. For analytic needs, agreed in general on how it would be ideal to have a little more precision on the probable most salient topic cluster if a small enough set of certain clusters emerge regularly (and I'm pretty sure they do).

Looking at https://en.wikipedia.org/wiki/Yuri_Gagarin the signals indicated it was a person and there was a geographic relationship, but the five highest scoring mid-level categories weren't quite as telling as to the non-biographical, non-geographic aspect of topic assignment. In the fuller data set one could do some rollup queries on stuff like country_association and whether the entity appeared to be a person, though!

Row for Yuri_Gagarin (column: value):
page_title_x: Yuri_Gagarin
page_id: 34226
rev_id: 918350670
predicted: Regional society
is_human: 1.0
has_geo: 1.0
has_list: (empty)
country_association: Russia
topic: WikiProject Russia
topic_first_encountered: WikiProject Biography
best1: Culture.Language and literature
best1_score: 0.7221317911199624
best2: Geography.Europe
best2_score: 0.40902686380799524
best3: Geography.Countries
best3_score: 0.27393608048351675
best4: Assistance.Maintenance
best4_score: 0.12838172886084948
best5: History_And_Society.History and society
best5_score: 0.11529839248636366
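The kind of rollup mentioned above could be sketched as follows, assuming rows are dicts keyed by the column names in the header. The function name is hypothetical, and the flag parsing (treating "1.0" as true and empty or "0.0" as false) is an assumption based on the sample row.

```python
from collections import Counter


def rollup_people_by_country(rows):
    """Count rows flagged is_human, grouped by country_association."""
    counts = Counter()
    for row in rows:
        try:
            human = float(row.get("is_human") or 0) > 0
        except ValueError:
            human = False
        country = row.get("country_association")
        if human and country:
            counts[country] += 1
    return counts
```

Joining the same grouping to pageviews, rather than counting rows, would give a view of readership of biographies by associated country.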

In the case of "Regional geography" that typically was the case where it was evident there was an apparent geographic component but it was difficult to extrapolate country - country being a neutral "most salient" topic mapping for settlements when it could be determined. This is where fusing together of Wikidata, geolookups when geodata are available, and Parsoid output content extraction (instead of mwparserfromhell with edge case handling) could be useful.

I'm going to be looking at the newer articletopic model output relatively soon, as I understand it produces even more relevant topic assignments. Looking at https://ores.wikimedia.org/ui/ for revision 918350670 shown above, the newer articletopic output does a nice job of pinpointing at higher probabilities some very relevant topic assignments in addition to the bio/geo ones the older drafttopic suggested. (For those following along, there's a newer version of drafttopic and you can see how it also produces seemingly more relevant output nowadays at that same UI.)

LGoto moved this task from Current Quarter to Upcoming Quarter on the Product-Analytics board.Mar 18 2020, 6:41 PM

cchen moved this task from Upcoming Quarter to Current Quarter on the Product-Analytics board.Apr 17 2020, 6:01 PM

cchen moved this task from Current Quarter to Upcoming Quarter on the Product-Analytics board.Jun 29 2020, 3:43 AM

cchen moved this task from Upcoming Quarter to Kanban on the Product-Analytics board.Jul 20 2020, 4:15 PM

cchen edited projects, added Product-Analytics (Kanban); removed Product-Analytics.

Research team will provide guidelines for how to use the topic models and post processing. Updates will be in the parent task.
