
Update Topic Dataset with data up to Nov 2021
Closed, Resolved · Public

Description

As part of investigating the decline in pageviews (T296875), it would be useful to look into topic data.

To update the Topic dataset, @jwang will first need the article_topics_outlinks data to be updated to the latest snapshot by @Isaac.

Event Timeline

Ok, all complete! Let me know if the data looks reasonable and then I will close this out. Documentation below:

Final output is in this Hive table: isaacj.article_topics_outlinks_2021_11
Schema:

+-----------------------+---------+--------------------------------------------------------------+
|col_name               |data_type|comment                                                       |
+-----------------------+---------+--------------------------------------------------------------+
|pageid                 |bigint   |Page ID of article -- e.g., 8091                              |
|qid                    |string   |Wikidata item ID associated with article -- e.g., Q42         |
|topic                  |string   |See: https://www.mediawiki.org/wiki/ORES/Articletopic#Taxonomy|
|score                  |float    |Model confidence [0-1] that topic applies to article.         |
|wiki_db                |string   |Wikipedia version -- e.g., enwiki for English Wikipedia       |
|# Partition Information|         |                                                              |
|# col_name             |data_type|comment                                                       |
|wiki_db                |string   |Wikipedia version -- e.g., enwiki for English Wikipedia       |
+-----------------------+---------+--------------------------------------------------------------+

EDIT: suggestions for usage:

  • The data included covers topics with a confidence score above 0.15. In practice, I recommend limiting to topics with a confidence of at least 0.5 (WHERE score >= 0.5); see the example query after this list.
  • The data covers every Wikipedia article, but an article can have anywhere from zero topics (in which case it won't show up in the table) to many topics (many rows) associated with it.
  • Confidence isn't a proxy for importance, so while it's simple to just assign each article its highest-confidence topic, doing so will bias towards easy-to-predict topics like Biography/Geography and away from hard-to-predict topics like Culture/Society.
  • Generally, I recommend tracking an article under every topic it appears in. This will potentially inflate your numbers but gives the best sense of the distribution of attention etc.
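
For example, applying those recommendations to the table above, a per-topic count of English Wikipedia articles (an article with several high-confidence topics is counted once per topic) might look like:

SELECT topic,
       COUNT(DISTINCT pageid) AS num_articles
  FROM isaacj.article_topics_outlinks_2021_11
 WHERE wiki_db = 'enwiki'
       AND score >= 0.5
 GROUP BY topic
 ORDER BY num_articles DESC;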

Steps:

  1. Hive SQL for generating outlinks data that is used for making topic predictions (slight tweak of this -- see below)
  2. Dump link data to TSVs and coalesce on stat1007 (code)
  3. Apply trained model (in this case: trained from 2020-12 as described here) to all the data to get topic predictions for all articles (code)
    1. Note: though this model is now almost a year old, in practice there is very little drift in the data, and even though lots of new articles are created, they still tend to link to lots of old articles and so have plenty of data to make predictions with. I'll train a new model for the next snapshot update, though (it's not a lot of work for me but does take several hours for preprocessing the data / training the model).
  4. Format data for Hive (see below)
  5. Load data into Hive (see below)

Outlinks query:

WITH wikipedia_projects AS (
        SELECT DISTINCT dbname
          FROM wmf_raw.mediawiki_project_namespace_map
         WHERE snapshot = '2021-11'
               AND hostname LIKE '%wikipedia%'
        ),
title_to_id AS (
    SELECT page_id,
           page_title,
           wiki_db
      FROM wmf_raw.mediawiki_page mp
     INNER JOIN wikipedia_projects wp
           ON (mp.wiki_db = wp.dbname)
     WHERE page_namespace = 0
           AND snapshot = '2021-11'
),
redirects AS (
    SELECT mr.rd_from AS rd_from,
           tti.page_id AS rd_to,
           mr.wiki_db AS wiki_db
      FROM wmf_raw.mediawiki_redirect mr
     INNER JOIN title_to_id tti
           ON (mr.rd_title = tti.page_title
               AND mr.wiki_db = tti.wiki_db)
     WHERE mr.snapshot = '2021-11'
           AND mr.rd_namespace = 0
),
pagelinks_reformatted AS (
    SELECT pl.pl_from AS pl_from,
           tti.page_id AS pl_to,
           pl.wiki_db AS wiki_db
      FROM wmf_raw.mediawiki_pagelinks pl
     INNER JOIN title_to_id tti
           ON (pl.pl_title = tti.page_title
               AND pl.wiki_db = tti.wiki_db)
      LEFT ANTI JOIN redirects r
           ON (pl.pl_from = r.rd_from
               AND pl.wiki_db = r.wiki_db)
     WHERE snapshot = '2021-11'
           AND pl_from_namespace = 0
           AND pl_namespace = 0
),
pagelinks_redirects_resolved AS (
    SELECT DISTINCT pl.pl_from AS pl_from,
           COALESCE(r.rd_to, pl.pl_to) AS pl_to,
           pl.wiki_db AS wiki_db
      FROM pagelinks_reformatted pl
      LEFT JOIN redirects r
           ON (pl.pl_to = r.rd_from
               AND pl.wiki_db = r.wiki_db)
),
wikidata_ids AS (
    SELECT wiki_db,
           page_id,
           item_id
      FROM wmf.wikidata_item_page_link wd
     INNER JOIN wikipedia_projects p
           ON (wd.wiki_db = p.dbname)
     WHERE wd.snapshot = '2021-11-29'
           AND wd.page_namespace = 0
    )
INSERT OVERWRITE TABLE isaacj.outlinks_allwikis
PARTITION(snapshot='2021-11')
SELECT wf.item_id AS qid_from,
       p.pl_from AS pid_from,
       wt.item_id AS qid_to,
       p.pl_to AS pid_to,
       p.wiki_db as wiki_db
  FROM pagelinks_redirects_resolved p
  LEFT JOIN wikidata_ids wf
       ON (p.pl_from = wf.page_id
           AND p.wiki_db = wf.wiki_db)
  LEFT JOIN wikidata_ids wt
       ON (p.pl_to = wt.page_id
           AND p.wiki_db = wt.wiki_db)
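
As a quick sanity check (not one of the documented steps, just one way to confirm the new snapshot partition got populated), row counts per wiki can be pulled from the output table:

SELECT wiki_db,
       COUNT(*) AS num_links
  FROM isaacj.outlinks_allwikis
 WHERE snapshot = '2021-11'
 GROUP BY wiki_db
 ORDER BY num_links DESC
 LIMIT 20;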

Predictions -> Hive format:

import bz2
import csv

# PREDICTIONS_TSV (defined earlier in the full script) is the bz2-compressed
# TSV of per-article topic predictions produced in step 3.
for_hive_tsv = './fasttext_data/202111/model_alloutlinks_202012_predictions_202111_hive.tsv'
expected_header = ['wiki_db', 'qid', 'pid', 'num_outlinks',
                   'Culture.Biography.Biography*', 'Culture.Biography.Women',
                   'Culture.Food_and_drink',
                   'Culture.Internet_culture',
                   'Culture.Linguistics',
                   'Culture.Literature',
                   'Culture.Media.Books', 'Culture.Media.Entertainment', 'Culture.Media.Films', 'Culture.Media.Media*', 'Culture.Media.Music', 'Culture.Media.Radio', 'Culture.Media.Software', 'Culture.Media.Television', 'Culture.Media.Video_games',
                   'Culture.Performing_arts',
                   'Culture.Philosophy_and_religion',
                   'Culture.Sports',
                   'Culture.Visual_arts.Architecture', 'Culture.Visual_arts.Comics_and_Anime', 'Culture.Visual_arts.Fashion', 'Culture.Visual_arts.Visual_arts*',
                   'Geography.Geographical',
                   'Geography.Regions.Africa.Africa*', 'Geography.Regions.Africa.Central_Africa', 'Geography.Regions.Africa.Eastern_Africa', 'Geography.Regions.Africa.Northern_Africa', 'Geography.Regions.Africa.Southern_Africa', 'Geography.Regions.Africa.Western_Africa',
                   'Geography.Regions.Americas.Central_America', 'Geography.Regions.Americas.North_America', 'Geography.Regions.Americas.South_America',
                   'Geography.Regions.Asia.Asia*', 'Geography.Regions.Asia.Central_Asia', 'Geography.Regions.Asia.East_Asia', 'Geography.Regions.Asia.North_Asia', 'Geography.Regions.Asia.South_Asia', 'Geography.Regions.Asia.Southeast_Asia', 'Geography.Regions.Asia.West_Asia',
                   'Geography.Regions.Europe.Eastern_Europe', 'Geography.Regions.Europe.Europe*', 'Geography.Regions.Europe.Northern_Europe', 'Geography.Regions.Europe.Southern_Europe', 'Geography.Regions.Europe.Western_Europe',
                   'Geography.Regions.Oceania',
                   'History_and_Society.Business_and_economics',
                   'History_and_Society.Education',
                   'History_and_Society.History',
                   'History_and_Society.Military_and_warfare',
                   'History_and_Society.Politics_and_government',
                   'History_and_Society.Society',
                   'History_and_Society.Transportation',
                   'STEM.Biology',
                   'STEM.Chemistry',
                   'STEM.Computing',
                   'STEM.Earth_and_environment',
                   'STEM.Engineering',
                   'STEM.Libraries_&_Information',
                   'STEM.Mathematics',
                   'STEM.Medicine_&_Health',
                   'STEM.Physics',
                   'STEM.STEM*',
                   'STEM.Space',
                   'STEM.Technology']
topic_start_idx = expected_header.index('Culture.Biography.Biography*')
topic_end_idx = expected_header.index('STEM.Technology')
wikidb_idx = expected_header.index('wiki_db')
pid_idx = expected_header.index('pid')
qid_idx = expected_header.index('qid')
threshold = 0.15
with open(for_hive_tsv, 'w') as fout:
    # fout.write('pageid\tqid\ttopic\tscore\twikidb\n')
    with bz2.open(PREDICTIONS_TSV, 'rt') as fin:
        tsvreader = csv.reader(fin, delimiter='\t')
        assert next(tsvreader) == expected_header
        for i, line in enumerate(tsvreader, start=1):
            wiki = line[wikidb_idx]
            pid = int(line[pid_idx])
            qid = line[qid_idx]
            for idx in range(topic_start_idx, topic_end_idx + 1):
                if float(line[idx]) >= threshold:
                    topic = expected_header[idx]
                    fout.write('{0}\t{1}\t{2}\t{3}\t{4}\n'.format(pid, qid, topic, line[idx], wiki))

Loading data into Hive is actually vaguely tricky because it involves loading the data into a staging table and then inserting it into the final, partitioned table (there may be an easier way to do this, but I don't know it). Code can be found on stat1007 at /home/isaacj/lang_agnostic_topic_modeling/topics/load_predictions_into_hive.ipynb. The final table is created as follows:

CREATE TABLE isaacj.article_topics_outlinks_2021_11 (
    pageid  BIGINT COMMENT 'Page ID of article -- e.g., 8091',
    qid     STRING COMMENT 'Wikidata item ID associated with article -- e.g., Q42',
    topic   STRING COMMENT 'See: https://www.mediawiki.org/wiki/ORES/Articletopic#Taxonomy',  
    score   FLOAT  COMMENT 'Model confidence [0-1] that topic applies to article.'
    )
PARTITIONED BY (
    wiki_db STRING  COMMENT 'Wikipedia version -- e.g., enwiki for English Wikipedia'
)
STORED AS PARQUET
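
For reference, the staging-table approach looks roughly like the following sketch (this is not the notebook's code; the staging table name and load path are placeholders, and dynamic partitioning must be enabled for the final INSERT):

-- Hypothetical staging table matching the TSV layout (unpartitioned).
CREATE TABLE isaacj.article_topics_staging (
    pageid  BIGINT,
    qid     STRING,
    topic   STRING,
    score   FLOAT,
    wiki_db STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Load the TSV produced by the Python snippet above
-- (placeholder path; assumes the file has been copied to HDFS).
LOAD DATA INPATH '/user/isaacj/article_topics_2021_11_hive.tsv'
INTO TABLE isaacj.article_topics_staging;

-- Copy into the partitioned Parquet table, letting Hive create one
-- partition per wiki_db via dynamic partitioning.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE isaacj.article_topics_outlinks_2021_11
PARTITION (wiki_db)
SELECT pageid,
       qid,
       topic,
       score,
       wiki_db
  FROM isaacj.article_topics_staging;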

Maya will create subtasks for @jwang and @cchen to update the tables

Thank you so much @Isaac for your help with this! I would love to discuss automating this update using the functionality we've built to automate ETLs for the monthly Key Product Metrics calculations (T288677).

But I will wait until we pick up T290042 to discuss further :-)

Oh haha, thanks for reminding me that I had already partially documented the process for this in T290042#7326209 :)

Now that it's relatively easy to update the dataset, I have held off on additional improvements to the workflow because I'd like to see the model training/prediction moved to a more stable place (the ML Platform) that removes me as a dependency. This sort of use case, where we make bulk predictions on articles and upload them to Hive on a monthly cadence, is a very common pattern for which I know we're hoping to build some general solutions. Happy to discuss further when you're ready.