Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Mayakp.wiki | T297631 Update Topic Dataset with data upto Nov 2021
Resolved | | cchen | T297749 Update Pageview Topic dataset and Dashboard
Resolved | | jwang | T297751 Update Edit Topic dataset and Dashboard
Event Timeline
Ok, all complete! Let me know if the data looks reasonable and then I will close this out. Documentation below:
Final output is in this Hive table: isaacj.article_topics_outlinks_2021_11
Schema:
+-----------------------+---------+---------------------------------------------------------------+
|col_name               |data_type|comment                                                        |
+-----------------------+---------+---------------------------------------------------------------+
|pageid                 |bigint   |Page ID of article -- e.g., 8091                               |
|qid                    |string   |Wikidata item ID associated with article -- e.g., Q42         |
|topic                  |string   |See: https://www.mediawiki.org/wiki/ORES/Articletopic#Taxonomy |
|score                  |float    |Model confidence [0-1] that topic applies to article.         |
|wiki_db                |string   |Wikipedia version -- e.g., enwiki for English Wikipedia       |
|# Partition Information|         |                                                               |
|# col_name             |data_type|comment                                                        |
|wiki_db                |string   |Wikipedia version -- e.g., enwiki for English Wikipedia       |
+-----------------------+---------+---------------------------------------------------------------+
EDIT: suggestions for usage:
- The data that is included covers topics with a confidence score above 0.15. In practice, I recommend limiting to only topics with a confidence of at least 0.5 (WHERE score >= 0.5); see the example query after this list.
- The data covers every Wikipedia article, but an article can have anywhere from zero topics (in which case it won't appear in the table) to many topics (one row per topic).
- Confidence isn't a proxy for importance, so while it's simple to just assign each article its single highest-confidence topic, doing so will bias the results towards easy-to-predict topics like Biography/Geography and away from hard-to-predict topics like Culture/Society.
- Generally, I recommend tracking an article in every topic under which it appears. This will potentially inflate your numbers but give the best sense of the distribution of attention etc.
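For illustration, a minimal query along these lines (using the 0.5 threshold suggested above) counts articles per topic on English Wikipedia:

-- Count English Wikipedia articles per topic, keeping only
-- high-confidence (>= 0.5) topic assignments. An article with
-- several high-confidence topics is counted once per topic.
SELECT
  topic,
  COUNT(DISTINCT pageid) AS num_articles
FROM isaacj.article_topics_outlinks_2021_11
WHERE
  wiki_db = 'enwiki'
  AND score >= 0.5
GROUP BY topic
ORDER BY num_articles DESC;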
Steps:
- Hive SQL for generating outlinks data that is used for making topic predictions (slight tweak of this -- see below)
- Dump link data to TSVs and coalesce on stat1007 (code; an illustrative sketch of this export step follows this list)
- Apply trained model (in this case: trained from 2020-12 as described here) to all the data to get topic predictions for all articles (code)
- Note: though this model is now almost a year old, in practice there is very little drift in the data, and even though lots of new articles are created, they still tend to link to lots of old articles and so have plenty of data to make predictions with. I'll train a new model for the next snapshot update though (it's not a lot of work for me, but it does take several hours to preprocess the data and train the model).
- Format data for Hive (see below)
- Load data into Hive (see below)
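For illustration only (the actual export/coalesce code is the script on stat1007 linked above), the export step amounts to something like the following, where the HDFS output directory is a placeholder, not the real path:

-- Illustrative sketch: dump the outlinks table to tab-separated text files.
-- '/user/isaacj/outlinks_202111_tsv' is a hypothetical output directory.
INSERT OVERWRITE DIRECTORY '/user/isaacj/outlinks_202111_tsv'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT qid_from, pid_from, qid_to, pid_to, wiki_db
FROM isaacj.outlinks_allwikis
WHERE snapshot = '2021-11';
-- The resulting per-reducer part files can then be merged into a single TSV
-- (e.g., with hdfs dfs -getmerge), which is the "coalesce" part of this step.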
Outlinks query:
WITH wikipedia_projects AS (
    SELECT DISTINCT
      dbname
    FROM wmf_raw.mediawiki_project_namespace_map
    WHERE
      snapshot = '2021-11'
      AND hostname LIKE '%wikipedia%'
),
title_to_id AS (
    SELECT
      page_id,
      page_title,
      wiki_db
    FROM wmf_raw.mediawiki_page mp
    INNER JOIN wikipedia_projects wp
      ON (mp.wiki_db = wp.dbname)
    WHERE
      page_namespace = 0
      AND snapshot = '2021-11'
),
redirects AS (
    SELECT
      mr.rd_from AS rd_from,
      tti.page_id AS rd_to,
      mr.wiki_db AS wiki_db
    FROM wmf_raw.mediawiki_redirect mr
    INNER JOIN title_to_id tti
      ON (mr.rd_title = tti.page_title
          AND mr.wiki_db = tti.wiki_db)
    WHERE
      mr.snapshot = '2021-11'
      AND mr.rd_namespace = 0
),
pagelinks_reformatted AS (
    SELECT
      pl.pl_from AS pl_from,
      tti.page_id AS pl_to,
      pl.wiki_db AS wiki_db
    FROM wmf_raw.mediawiki_pagelinks pl
    INNER JOIN title_to_id tti
      ON (pl.pl_title = tti.page_title
          AND pl.wiki_db = tti.wiki_db)
    LEFT ANTI JOIN redirects r
      ON (pl.pl_from = r.rd_from
          AND pl.wiki_db = r.wiki_db)
    WHERE
      snapshot = '2021-11'
      AND pl_from_namespace = 0
      AND pl_namespace = 0
),
pagelinks_redirects_resolved AS (
    SELECT DISTINCT
      pl.pl_from AS pl_from,
      COALESCE(r.rd_to, pl.pl_to) AS pl_to,
      pl.wiki_db AS wiki_db
    FROM pagelinks_reformatted pl
    LEFT JOIN redirects r
      ON (pl.pl_to = r.rd_from
          AND pl.wiki_db = r.wiki_db)
),
wikidata_ids AS (
    SELECT
      wiki_db,
      page_id,
      item_id
    FROM wmf.wikidata_item_page_link wd
    INNER JOIN wikipedia_projects p
      ON (wd.wiki_db = p.dbname)
    WHERE
      wd.snapshot = '2021-11-29'
      AND wd.page_namespace = 0
)
INSERT OVERWRITE TABLE isaacj.outlinks_allwikis
PARTITION(snapshot='2021-11')
SELECT
  wf.item_id AS qid_from,
  p.pl_from AS pid_from,
  wt.item_id AS qid_to,
  p.pl_to AS pid_to,
  p.wiki_db AS wiki_db
FROM pagelinks_redirects_resolved p
LEFT JOIN wikidata_ids wf
  ON (p.pl_from = wf.page_id
      AND p.wiki_db = wf.wiki_db)
LEFT JOIN wikidata_ids wt
  ON (p.pl_to = wt.page_id
      AND p.wiki_db = wt.wiki_db)
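Once the INSERT completes, a quick sanity check on the new partition can be run, e.g.:

-- Rough sanity check (illustrative): link counts per wiki for the new snapshot.
SELECT
  wiki_db,
  COUNT(*) AS num_links
FROM isaacj.outlinks_allwikis
WHERE snapshot = '2021-11'
GROUP BY wiki_db
ORDER BY num_links DESC
LIMIT 20;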
Predictions -> Hive format:
import bz2
import csv

# Output TSV to be loaded into Hive: pageid, qid, topic, score, wiki_db
for_hive_tsv = './fasttext_data/202111/model_alloutlinks_202012_predictions_202111_hive.tsv'

# Expected column order of the predictions TSV: metadata columns followed by
# one column per topic in the taxonomy.
expected_header = [
    'wiki_db', 'qid', 'pid', 'num_outlinks',
    'Culture.Biography.Biography*', 'Culture.Biography.Women', 'Culture.Food_and_drink',
    'Culture.Internet_culture', 'Culture.Linguistics', 'Culture.Literature',
    'Culture.Media.Books', 'Culture.Media.Entertainment', 'Culture.Media.Films',
    'Culture.Media.Media*', 'Culture.Media.Music', 'Culture.Media.Radio',
    'Culture.Media.Software', 'Culture.Media.Television', 'Culture.Media.Video_games',
    'Culture.Performing_arts', 'Culture.Philosophy_and_religion', 'Culture.Sports',
    'Culture.Visual_arts.Architecture', 'Culture.Visual_arts.Comics_and_Anime',
    'Culture.Visual_arts.Fashion', 'Culture.Visual_arts.Visual_arts*',
    'Geography.Geographical',
    'Geography.Regions.Africa.Africa*', 'Geography.Regions.Africa.Central_Africa',
    'Geography.Regions.Africa.Eastern_Africa', 'Geography.Regions.Africa.Northern_Africa',
    'Geography.Regions.Africa.Southern_Africa', 'Geography.Regions.Africa.Western_Africa',
    'Geography.Regions.Americas.Central_America', 'Geography.Regions.Americas.North_America',
    'Geography.Regions.Americas.South_America',
    'Geography.Regions.Asia.Asia*', 'Geography.Regions.Asia.Central_Asia',
    'Geography.Regions.Asia.East_Asia', 'Geography.Regions.Asia.North_Asia',
    'Geography.Regions.Asia.South_Asia', 'Geography.Regions.Asia.Southeast_Asia',
    'Geography.Regions.Asia.West_Asia',
    'Geography.Regions.Europe.Eastern_Europe', 'Geography.Regions.Europe.Europe*',
    'Geography.Regions.Europe.Northern_Europe', 'Geography.Regions.Europe.Southern_Europe',
    'Geography.Regions.Europe.Western_Europe', 'Geography.Regions.Oceania',
    'History_and_Society.Business_and_economics', 'History_and_Society.Education',
    'History_and_Society.History', 'History_and_Society.Military_and_warfare',
    'History_and_Society.Politics_and_government', 'History_and_Society.Society',
    'History_and_Society.Transportation',
    'STEM.Biology', 'STEM.Chemistry', 'STEM.Computing', 'STEM.Earth_and_environment',
    'STEM.Engineering', 'STEM.Libraries_&_Information', 'STEM.Mathematics',
    'STEM.Medicine_&_Health', 'STEM.Physics', 'STEM.STEM*', 'STEM.Space', 'STEM.Technology']

topic_start_idx = expected_header.index('Culture.Biography.Biography*')
topic_end_idx = expected_header.index('STEM.Technology')
wikidb_idx = expected_header.index('wiki_db')
pid_idx = expected_header.index('pid')
qid_idx = expected_header.index('qid')
threshold = 0.15  # minimum confidence for an (article, topic) pair to be retained

# NOTE: PREDICTIONS_TSV (the bzip2-compressed model-output TSV) is defined
# earlier in the notebook this snippet comes from.
with open(for_hive_tsv, 'w') as fout:
    # fout.write('pageid\tqid\ttopic\tscore\twikidb\n')
    with bz2.open(PREDICTIONS_TSV, 'rt') as fin:
        tsvreader = csv.reader(fin, delimiter='\t')
        assert next(tsvreader) == expected_header
        for i, line in enumerate(tsvreader, start=1):
            wiki = line[wikidb_idx]
            pid = int(line[pid_idx])
            qid = line[qid_idx]
            # write one output row per topic whose score clears the threshold
            for idx in range(topic_start_idx, topic_end_idx + 1):
                if float(line[idx]) >= threshold:
                    topic = expected_header[idx]
                    fout.write('{0}\t{1}\t{2}\t{3}\t{4}\n'.format(pid, qid, topic, line[idx], wiki))
Loading data into Hive is actually vaguely tricky because it involves loading the TSV into a staging table and then inserting it into the final, partitioned table (there may be an easier way to do this, but I don't know it). Code can be found on stat1007 at /home/isaacj/lang_agnostic_topic_modeling/topics/load_predictions_into_hive.ipynb. The final table is created as such though:
CREATE TABLE isaacj.article_topics_outlinks_2021_11 (
    pageid BIGINT COMMENT 'Page ID of article -- e.g., 8091',
    qid    STRING COMMENT 'Wikidata item ID associated with article -- e.g., Q42',
    topic  STRING COMMENT 'See: https://www.mediawiki.org/wiki/ORES/Articletopic#Taxonomy',
    score  FLOAT  COMMENT 'Model confidence [0-1] that topic applies to article.'
)
PARTITIONED BY (
    wiki_db STRING COMMENT 'Wikipedia version -- e.g., enwiki for English Wikipedia'
)
STORED AS PARQUET
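For reference, the staging-table pattern described above boils down to roughly the following sketch (the staging table name and HDFS path are placeholders, not the actual values used in the notebook):

-- 1) Stage the TSV produced by the Python step as a plain-text table.
--    Table name and input path below are hypothetical.
CREATE TABLE isaacj.article_topics_staging (
    pageid  BIGINT,
    qid     STRING,
    topic   STRING,
    score   FLOAT,
    wiki_db STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA INPATH '/user/isaacj/article_topics_202111.tsv'
INTO TABLE isaacj.article_topics_staging;

-- 2) Insert into the final Parquet table, letting Hive create the
--    wiki_db partitions dynamically.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE isaacj.article_topics_outlinks_2021_11
PARTITION (wiki_db)
SELECT pageid, qid, topic, score, wiki_db
FROM isaacj.article_topics_staging;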
Oh haha, thanks for reminding me that I had already partially documented the process for this in T290042#7326209 :)
Now that it's relatively easy to update the dataset, I have held off on additional improvements to the workflow because I'd like to see the model training/prediction moved to a more stable place (ML platform) that removes me as a dependency. This sort of use-case where we make bulk predictions on articles and upload to Hive on a monthly cadence is a very common pattern for which I know we're hoping to build some general solutions. Happy to discuss further when you're ready.