
Review Adam's Topic Dataset
Open, Medium, Public


Adam Baso has generated a list of about 10K records of articles with "best" topic predictions for pages accessed in July.

Request: Review the list and let Adam know whether we are interested in a dataset in Hive against which to join for the fuller list of namespace 0 articles on enwiki.

Notes from Adam about his approach:
I have a few glitchy things to clean up in my scripts and will be doing that anyway, but wanted to make sure the fuller dataset would be useful for your exploration. During Q2 I'm interested in drafting a plan to move this stuff into a pipeline if we think it would be useful.

This builds on the approach Chelsy used, adding in a bunch of heuristics to improve on the ORES drafttopic machine learning model's predictions so that "Geography.*" and "Culture.Language and Literature" aren't so often assigned as the probable topic for a given article*, plus taking a few small liberties for clarity. Here's what the columns mean. Extra columns can be useful for data cubing purposes, the point being that we can derive all kinds of interesting insights if we simply create new columns of derived knowledge.

page title x (Talk): the page title and its talk page. Note you can hover on the page title, and clicking either link will open it in a new tab.
predicted: predicted topic after the heuristics are applied
is_human: strong indicator the subject is human
has_geo: the article bears geocoordinates
has_list: strong indication this is a list page
country_association: heuristically derived country associated with this article. Note that this can be further tuned, but I've tried to keep it to obvious matches on country names from the available data.
topic: from the talk page of the article, the "best" wikiproject
topic_first_encountered: from the talk page of the article, the first encountered wikiproject
best1: drafttopic's highest-scoring topic assignment
best1_score: drafttopic's estimated probability for its highest-scoring topic assignment
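
For illustration, here is a toy sketch of the schema above (assuming pandas is available; these rows and values are made up for the example, not drawn from the real dataset):

```python
import pandas as pd

# Toy rows mimicking the column schema above (values are invented for illustration).
df = pd.DataFrame([
    {"page_title_x": "Ada Lovelace", "predicted": "STEM.Mathematics",
     "is_human": 1, "has_geo": 0, "country_association": "",
     "best1": "Culture.Language and Literature", "best1_score": 0.81},
    {"page_title_x": "Nairobi", "predicted": "Kenya",
     "is_human": 0, "has_geo": 1, "country_association": "Kenya",
     "best1": "Geography.Africa", "best1_score": 0.93},
])

# The heuristics replace generic Geography.* / biography-driven topics where possible,
# so 'predicted' can differ from the raw drafttopic 'best1'.
print(df[["page_title_x", "best1", "predicted"]])
```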

  • The fact that something is in Geography.* is an interesting note, but Toby suggested trying to extrapolate the country if it's not too hard. You'll see that this was doable in a number of cases and actually wouldn't be hard to make even stronger, but I've tried to stick to the low-hanging fruit for country mapping. The Culture.Language and Literature assignment frequently showing up has to do with biographies being subtopics of Culture, but that's not terribly helpful - we usually care more about why a human subject was interesting than strictly that the subject was a human and the article therefore biographical in nature.
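
A hypothetical, heavily simplified sketch of this kind of country mapping (the country list and matching rule here are illustrative only, not Adam's actual logic):

```python
# Toy version of the "low-hanging fruit" country mapping: if a Geography.*
# prediction's page title contains an obvious country name, use the country
# as the topic instead. COUNTRIES is an invented stand-in list.
COUNTRIES = ["Kenya", "France", "Japan", "Brazil"]

def refine_topic(predicted: str, title: str) -> str:
    if not predicted.startswith("Geography."):
        return predicted  # only retarget the generic Geography.* topics
    for country in COUNTRIES:
        if country.lower() in title.lower():
            return country
    return predicted  # no obvious match: keep the Geography.* topic

print(refine_topic("Geography.Africa", "Transport in Kenya"))  # -> Kenya
print(refine_topic("Geography.Europe", "List of rivers"))      # unchanged
```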

Event Timeline

kzimmerman updated the task description.

Note from Toby: "Product analytics folks - I’d like to revisit the project Chelsey did about finding the most read topics on enwiki. No hurry but this would be very useful. "

kzimmerman triaged this task as Medium priority. Oct 7 2019, 5:22 PM
kzimmerman moved this task from Triage to Backlog on the Product-Analytics board.
kzimmerman added subscribers: dr0ptp4kt, Tnegrin.

@dr0ptp4kt To make sure I understand the entirety of the request: This is a request for review of the dataset and consultation on whether it should be implemented on Hive, correct? Or are you also looking for support in refining the logic? And is this a one-time thing, or is there a plan to productionize this?

This kind of topic modeling could be useful for a Content data cube (T234701).

dr0ptp4kt added subscribers: Isaac, diego, Halfak. Edited Oct 8 2019, 11:26 AM

Thanks @kzimmerman - yeah, some cursory review of the "predicted" values (whether they're approximately sensible, and whether adding data of this nature to Hive as a starting point might be of use) would be most appreciated!

My tentative goal for Q2 is, pending interest, to create a plan for getting this into some sort of pipeline, so that in Q3 it would be possible to stand up the pipeline.

As @Isaac notes and @Halfak & @diego discuss, there are risks to assigning a single primary salient topic to articles. However, we also need to be able to understand, even if just retrospectively, how new content and the knowledge it emphasizes influence the flywheel, how knowledge equity is shaping up in practice in simplified terms, and so on.

As a thought experiment: we may have net-new articles and we may have articles that change, and deltas in the most salient classification assignment times pageviews can still be telling for explanatory purposes.

One thing I've been considering is whether it would make sense to retain both this more targeted "prediction" and the top three drafttopic categories exceeding some threshold (along with their numeric probabilities, or probabilities adjusted for thresholds), so that we can take a more nuanced view of content and consumption fluctuation.

One can imagine summing calculated pageviews times each predicted category's numeric score to arrive at the theoretical impression impact of certain types of content. If an article hits in only one category, then pageviews in that category are primarily how the content consumption is weighted; if an article has strong linkage to three categories, then its consumption can be reflected in all three categories (we might divvy up the consumption if we want parity for pageviews + previews).

Articles can and will have changes in their topic assignment(s), and that's okay - in fact it is useful to be able to do retrospective analysis and to have the rough revision-at-month-x based adjustments (better yet would be all revisions, but that's a different matter for a different day). There are both simple and complicated reporting approaches here that could all help fill certain needs or give us a bit more perspective in interpretation.
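
The divvying-up idea can be sketched as follows (the threshold, topic names, and pageview counts here are invented for illustration; scores below the threshold are dropped and the remainder is renormalized so per-article totals still sum to the raw pageview count):

```python
from collections import defaultdict

THRESHOLD = 0.15  # hypothetical minimum score for a category to count

def attribute_pageviews(pageviews: int, topic_scores: dict) -> dict:
    """Split an article's pageviews across its topics, proportional to score."""
    kept = {t: s for t, s in topic_scores.items() if s >= THRESHOLD}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {t: pageviews * s / total for t, s in kept.items()}

# Two toy articles: one spread over multiple categories, one single-category.
articles = [
    (1000, {"STEM.Medicine": 0.7, "History.History": 0.2, "Culture.Sports": 0.05}),
    (500,  {"Culture.Sports": 0.9}),
]

totals = defaultdict(float)
for views, scores in articles:
    for topic, share in attribute_pageviews(views, scores).items():
        totals[topic] += share
print(dict(totals))  # grand total across topics equals total raw pageviews
```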

I do see opportunities to improve the logic, to be sure. There's also the matter of upstreaming selected logic into the ORES drafttopic ML model building so that less post-processing is necessary (particularly if we want to add extra columns for cubing, or make analysis judgment calls like using a country assignment as the topic so as to neutralize explicitly risky topic assignments; there are some other small tweaks, like those mentioned for T229401, that are trivial too), and then projecting across all languages in a language-neutral way in the fashion of the work of Isaac, Aaron, and Diego. But first things first: I'm interested in whether the predictions look sensible for the enwiki viewed pages.

@dr0ptp4kt any chance you could rerun this for pages accessed in English Wikipedia in September? We saw a spike in traffic in North America (specifically the US) and were hoping to look at content to see if there are any hints. I think this would be a great way for us to test the model, since it has a tangible application.

also looping @cchen in

@kzimmerman possibly, depending on time pressure. I just set up a meeting for tomorrow to discuss with you and Connie.

Pasting in an email I sent:

Hi there. You can find a data set at stat1007.eqiad.wmnet:~dr0ptp4kt/topic_predictions.tsv.gz. It's about half a gig. Connie and Kate, I'll leave it to you now to work your magic! I'm available for Hangout Meet if you need me during any mutually available time the next few days.
TL;DR: This is all non-redirect titles from enwiki from the wmf_raw.mediawiki_page Hive view for the 2019-09 cut, the 2019-09 cut being the latest available in Hive as of Friday night. You should be able to join on the page_id field or the page_title_x field with other tables, so long as you understand that those fields are somewhat fungible over time and you account for edge cases.
More context:
There's a bunch of enrichment on that data with newer data that ultimately feeds into the predicted field (see the example of the first 10K non-randomized rows for an HTML view). Please note that in some cases pages have, since the 2019-09 cut that's available in Hive, taken on a deleted or redirected status, and in a very small minority of cases the drafttopic API scores are unavailable (because of deletion, redirect, or an API error code). You'll notice this by virtue of what is effectively the empty righthand side of a left join for the columns after the 'predicted' field. I suspect this modest drift, which is symptomatic of how the wiki works, will have little bearing on any aggregation of pageview joins, but one can never be certain until actually digging in.

As discussed, the 'predicted' field is the post-enrichment best guess for a so-called "best" (that is to say, probably the most salient) topic, and I've also placed the top 5 (best1, best2, ...) drafttopic values on the righthand side of the TSV. The enrichment largely tries to compensate so that articles about people don't all end up with a 'predicted' value of [Culture.]Language and literature, and so that if an article looks like a settlement within a country (or a country itself), or has similar telltale signs, the apparent country name is used instead of the less utilitarian [Geography.]* mid-level topics. I mention this because if you're going to divvy up pageviews by the top 5 drafttopic values, you'll need to dampen pages with these characteristics in order to avoid improperly skewed data (for a derived ML model you may need to tune some as well).
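
The left-join drift described here can be illustrated with a toy pandas join (the table contents and column names are invented stand-ins, not the real Hive tables):

```python
import pandas as pd

# Toy stand-ins: a pageviews table and a predictions table keyed on page_id.
# page_id 3 has no prediction row, simulating a page deleted/redirected since
# the 2019-09 cut or one whose drafttopic API score was unavailable.
pageviews = pd.DataFrame({"page_id": [1, 2, 3], "views": [100, 250, 40]})
predictions = pd.DataFrame({"page_id": [1, 2],
                            "predicted": ["Kenya", "STEM.Medicine"]})

# Left join keeps every pageview row; missing predictions surface as NaN
# in the righthand-side columns, which is the drift to watch for.
joined = pageviews.merge(predictions, on="page_id", how="left")
missing = joined["predicted"].isna().sum()
print(joined)
print(f"{missing} page(s) lacked a prediction")
```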
This is minimum-viable-grade stuff, so please do note that scripts are scattered about notebook1003 in my venv and topicmodeling directories, as well as my GitHub Pages repo. I'm happy to walk anyone through how it all fits together. Working out a pipeline approach with AE in Q2, so that a pipeline can be constructed in Q3, is one of my tentative goals.
Isaac and Diego have a Wikidata-claims-based mapping file for all Wikidata entities as of a recent dump that could be used to derive similar things in a more language-neutral way (Isaac, Aaron, and I talk somewhat regularly, and we all talk with Diego to varying degrees). At present their approach is intentionally general, and Isaac has taken some of my concepts to dampen variables accordingly [n.b., which I built based on inspiration from him!]. We've discussed the notion that we should upstream my enhancements to drafttopic, so that their modeling strategy becomes a beneficiary of my work, since their strategy trains atop drafttopic, too. This line of thinking is the route I'd like to go, but as I say there's more planning to be done. As a short-run thing, if we feel we need to generate a similar but less data-rich dataset on Spanish or some handful of wikis in the next two days, I can crank on that, but understand the results will be unreliable and I might honestly not finish it with enough lead time for you [for any quarterly reporting on the MVP dataset].

kzimmerman assigned this task to cchen. Oct 30 2019, 10:18 PM

Thanks Adam. Assigning to Connie to provide comments on the dataset as she uses it with our September pageview data.