Adam Baso has generated a list of about 10K records of articles with "best" topic predictions for pages accessed in July.
Request: Review the list and let Adam know whether we are interested in a dataset in Hive against which to join for the fuller list of namespace 0 articles on enwiki.
Notes from Adam about his approach:
I have a few glitchy things to clean up in my scripts and will be doing that anyway, but wanted to make sure the fuller dataset would be useful for your exploration. During Q2 I'm interested in drafting a plan to move this stuff into a pipeline if we think it would be useful.
This builds on the approach Chelsy used (https://analytics.wikimedia.org/datasets/one-off/English%20Wikipedia%20Page%20Views%20by%20Topics.html#Top-50-articles-read-in-March-2019-on-English-Wikipedia), adding a number of heuristics to improve on the ORES drafttopic machine learning model's predictions so that "Geography.*" and "Culture.Language and Literature" aren't so often assigned as the probable topic for a given article*, plus taking a few small liberties for clarity. Extra columns can be useful for data cubing purposes: we can derive all kinds of interesting insights simply by creating new columns of derived knowledge. Here's what the columns mean.
page title x (Talk): page title and talk page. Note that you can hover over the page title, and clicking either link opens it in a new tab.
predicted: predicted topic after the heuristics are applied
is_human: strong indicator the subject is human
has_geo: the article bears geocoordinates
has_list: strong indication this is a list page
country_association: heuristically derived country associated with this article. Note that this can be further tuned, but I've tried to keep it to obvious matches on country names from the available data.
topic: from the talk page of the article, the "best" wikiproject
topic_first_encountered: from the talk page of the article, the first encountered wikiproject
best1: drafttopic's highest-scoring topic assignment
best1_score: drafttopic's estimated probability for its highest-scoring topic assignment
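The data cubing idea mentioned above can be sketched roughly as follows: once each record carries derived columns, counting records per combination of columns is a simple roll-up. This is only an illustration in plain Python. The column names come from the list above, but the sample rows, the topic strings, and the `rollup` helper are invented for the example and are not taken from the actual dataset or scripts.

```python
from collections import Counter

# Hypothetical sample rows: the values are made up for illustration only
# and do not come from the real list of about 10K records.
rows = [
    {"predicted": "Culture.Sports", "is_human": True, "has_geo": False,
     "has_list": False, "country_association": "Brazil"},
    {"predicted": "Culture.Sports", "is_human": True, "has_geo": False,
     "has_list": False, "country_association": "France"},
    {"predicted": "Geography.Europe", "is_human": False, "has_geo": True,
     "has_list": False, "country_association": "France"},
    {"predicted": "Culture.Sports", "is_human": False, "has_geo": False,
     "has_list": True, "country_association": None},
]


def rollup(records, *dimensions):
    """Count records for each combination of the given column names."""
    return Counter(tuple(record[d] for d in dimensions) for record in records)


# One "slice" of the cube: record counts by predicted topic and is_human.
by_topic_and_human = rollup(rows, "predicted", "is_human")

# Another slice: record counts by associated country.
by_country = rollup(rows, "country_association")
```

Each new derived column becomes another dimension that can be passed to `rollup`, which is the sense in which extra columns pay off for this kind of exploration.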
* That something is in Geography.* is an interesting note, but Toby suggested trying to extrapolate the country if it's not too hard. You'll see that this was doable in a number of cases and actually wouldn't be hard to make even stronger, but I've tried to stick to the low-hanging fruit for country mapping. The Culture.Language and Literature assignment frequently showing up has to do with biographies being subtopics of Culture, but that's not terribly helpful: we usually care more about why a human subject was interesting than strictly that the subject was a human and thus the article was biographical in nature.
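As a rough illustration of what a post-prediction heuristic of this kind might look like, the sketch below prefers the talk page WikiProject topic whenever drafttopic's top assignment is one of the over-assigned categories. The rule itself, the `refine_topic` function, and the example topic strings are assumptions made for this example; the notes do not spell out the actual heuristics used in the scripts.

```python
from typing import Optional

# Categories the notes describe as too frequently assigned by drafttopic.
OVER_ASSIGNED_EXACT = {"Culture.Language and Literature"}
OVER_ASSIGNED_PREFIXES = ("Geography.",)


def is_over_assigned(best1: str) -> bool:
    """True when best1 falls in one of the over-assigned categories."""
    return best1 in OVER_ASSIGNED_EXACT or best1.startswith(OVER_ASSIGNED_PREFIXES)


def refine_topic(best1: str, talk_topic: Optional[str]) -> str:
    """Assumed rule, not the actual one: fall back to the talk page
    WikiProject topic when best1 is over-assigned and a talk page
    topic is available; otherwise keep drafttopic's assignment."""
    if is_over_assigned(best1) and talk_topic:
        return talk_topic
    return best1
```

For example, `refine_topic("Culture.Language and Literature", "Culture.Sports")` would yield the talk page topic, while a record with no talk page topic keeps its `best1` value unchanged.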