Page MenuHomePhabricator

Proposed adjustment to wmf.wikidata_item_page_link to better handle page moves
Closed, ResolvedPublic

Description

@MGerlach and I were working with the wmf.wikidata_item_page_link table in Hive and MGerlach pointed out that some of the Covid-19 articles were missing from the 03-23 snapshot (but did exist in the 03-02 snapshot). Specifically, the 2019-2020 Coronavirus Pandemic (Wikidata; enwiki) is largely missing with only 16 sitelinks as opposed to the >100 sitelinks that should exist (and largely do exist in the 03-02 snapshot; see query results below). This is almost certainly because of a page move on March 11th that presumably happened a lot of other wikis too when the outbreak became an official pandemic: https://en.wikipedia.org/w/index.php?title=2019%E2%80%9320_coronavirus_outbreak&action=history

Going through the documentation for the item_page_link table, you mention that snapshots later in the month will be more likely to have incorrect links. And based on the query, it looks like the joining of item IDs with page IDs / wikidb has to happen on page titles because that is what Wikidata tracks. So if a page is moved midway through the month, the join with the item IDs / sitelinks fails because only the most recent title is retained for a given page ID. Or in this specific case, the Wikidata side had "2019–20 coronavirus outbreak" as the English sitelink but the page history side didn't have that page because it had been superseded by "2019–20 coronavirus pandemic", so the join failed to find an English article for Q81068910. I'm wondering what you think about the following proposal to maybe address this issue (and I only started digging into the mediwiki_page_history data so I easily could have misunderstood something):

In the current_page_titles SELECT statement, make the following changes:

  • use page_title_historical instead of page_title (to capture what a page was called at a given point in time, not just what it is called currently)
  • also include page_title_historical in the partition for the windowing function instead of just the first value so if a page ID has been associated with say 5 titles, they will each be represented as separate rows.
  • filter on page_is_redirect != true so you only include page_id, page_title pairs that were actually articles (and therefore could have a Wikidata ID).

When it comes time for the join with the sitelinks then, it won't matter if the sitelinks dataset is out of date because it should still match with one of the old titles and pull in the correct page ID. If there is a concern that a title might be associated with multiple page IDs that weren't redirects (and therefore would have multiple entries in the final table), you could probably take just the most current page ID.

Here's the history for pageid 62750956, which has been the stable identifier for the 2019-20 coronavirus pandemic article.

SELECT page_id, caused_by_event_type, start_timestamp, page_title_historical FROM mediawiki_page_history WHERE snapshot = '2020-03' AND page_id = 62750956 AND wiki_db = 'enwiki' AND page_namespace = 0 AND page_is_redirect != true ORDER BY start_timestamp LIMIT 100;

page_id	        caused_by_event_type	start_timestamp		page_title_historical
62750956	create			2020-01-05 15:58:49.0	2019-2020_China_pneumonia_outbreak
62750956	create-page		2020-01-05 15:58:49.0	2019-2020_China_pneumonia_outbreak
62750956	move			2020-01-05 19:37:35.0	2019–20_China_pneumonia_outbreak
62750956	move			2020-01-15 23:48:49.0	2019-20_outbreak_of_novel_coronavirus_(2019-nCoV)_in_Wuhan,_China
62750956	move			2020-01-16 00:55:43.0	2019-20_outbreak_of_novel_coronavirus_(2019-nCoV)
62750956	move			2020-01-16 04:27:51.0	2019–20_outbreak_of_novel_coronavirus_(2019-nCoV)
62750956	move			2020-01-23 17:52:56.0	2019–20_outbreak_of_novel_coronavirus_2019-nCoV
62750956	move			2020-01-23 22:43:01.0	2019–20_Wuhan_coronavirus_outbreak
62750956	move			2020-02-06 15:44:16.0	2019-20_novel_coronavirus_outbreak_(2019-nCoV)
62750956	move			2020-02-06 15:52:33.0	2019–20_Wuhan_coronavirus_outbreak
62750956	move			2020-02-18 22:04:15.0	2019–20_coronavirus_outbreak
62750956	move			2020-03-11 17:23:57.0	2019–20_coronavirus_pandemic

And so long as you filter on page_is_redirect != true, each of those titles shouldn't be associated with other page IDs.

SELECT a.page_id, a.caused_by_event_type, a.start_timestamp, a.page_title_historical, a.page_is_redirect FROM mediawiki_page_history a WHERE a.page_title_historical IN (SELECT page_title_historical FROM mediawiki_page_history p WHERE p.snapshot = '2020-03' AND p.page_id = 62750956 AND p.wiki_db = 'enwiki' AND p.page_namespace = 0) AND a.snapshot = '2020-03' AND a.wiki_db = 'enwiki' AND a.page_namespace = 0 ORDER BY a.page_title_historical, a.start_timestamp LIMIT 1000;

page_id		caused_by_event_type	start_timestamp		page_title_historical							page_is_redirect
62750956	create			2020-01-05 15:58:49.0	2019-2020_China_pneumonia_outbreak					false
62750956	create-page		2020-01-05 15:58:49.0	2019-2020_China_pneumonia_outbreak					false
62752361	create			2020-01-05 19:37:35.0	2019-2020_China_pneumonia_outbreak					true
62750956	move			2020-02-06 15:44:16.0	2019-20_novel_coronavirus_outbreak_(2019-nCoV)				false
63040557	create			2020-02-06 15:52:33.0	2019-20_novel_coronavirus_outbreak_(2019-nCoV)				true
62750956	move			2020-01-16 00:55:43.0	2019-20_outbreak_of_novel_coronavirus_(2019-nCoV)			false
62841093	create			2020-01-16 04:27:51.0	2019-20_outbreak_of_novel_coronavirus_(2019-nCoV)			true
62750956	move			2020-01-15 23:48:49.0	2019-20_outbreak_of_novel_coronavirus_(2019-nCoV)_in_Wuhan,_China	false
62839906	create			2020-01-16 00:55:43.0	2019-20_outbreak_of_novel_coronavirus_(2019-nCoV)_in_Wuhan,_China	true
62750956	move			2020-01-05 19:37:35.0	2019–20_China_pneumonia_outbreak					false
62839522	create			2020-01-15 23:48:49.0	2019–20_China_pneumonia_outbreak					true
62750956	move			2020-01-23 22:43:01.0	2019–20_Wuhan_coronavirus_outbreak					false
63040489	create			2020-02-06 15:44:16.0	2019–20_Wuhan_coronavirus_outbreak					NULL
63040489	delete			2020-02-06 15:52:23.0	2019–20_Wuhan_coronavirus_outbreak					NULL
62750956	move			2020-02-06 15:52:33.0	2019–20_Wuhan_coronavirus_outbreak					false
63147709	create			2020-02-18 22:04:15.0	2019–20_Wuhan_coronavirus_outbreak					true
63000665	create			NULL			2019–20_coronavirus_outbreak						NULL
63000665	create-page		2020-02-02 03:43:31.0	2019–20_coronavirus_outbreak						NULL
63000665	delete			2020-02-18 22:04:07.0	2019–20_coronavirus_outbreak						NULL
62750956	move			2020-02-18 22:04:15.0	2019–20_coronavirus_outbreak						false
63347546	create			2020-03-11 17:23:57.0	2019–20_coronavirus_outbreak						true
63347350	create			NULL			2019–20_coronavirus_pandemic						NULL
63347350	create-page		2020-03-11 16:54:23.0	2019–20_coronavirus_pandemic						NULL
63347350	delete			2020-03-11 17:23:51.0	2019–20_coronavirus_pandemic						NULL
62750956	move			2020-03-11 17:23:57.0	2019–20_coronavirus_pandemic						false
62750956	move			2020-01-16 04:27:51.0	2019–20_outbreak_of_novel_coronavirus_(2019-nCoV)			false
62907619	create			2020-01-23 17:52:56.0	2019–20_outbreak_of_novel_coronavirus_(2019-nCoV)			true
62750956	move			2020-01-23 17:52:56.0	2019–20_outbreak_of_novel_coronavirus_2019-nCoV				false
62911066	create			2020-01-23 22:43:01.0	2019–20_outbreak_of_novel_coronavirus_2019-nCoV				true

And here is the current state of the different snapshots:

03-23 Snapshot
SELECT item_id, wiki_db, page_id, page_title FROM wmf.wikidata_item_page_link WHERE snapshot = '2020-03-23' AND item_id = 'Q81068910' ORDER BY wiki_db ASC LIMIT 1000;

item_id	wiki_db	page_id	page_title
Q81068910	bat_smgwiki	36133	2019-2020_m._Kuoruonas_vėrosa_epėdemėjė
Q81068910	bclwiki	20503	Coronavirus-_Pagbungkaras_2019-20
Q81068910	gawiki	91784	Ráig_coróinvíris_2019–2020
Q81068910	hakwiki	20094	2019-ngièn_chṳ_2020-ngièn_COVID-19_liù-hàng
Q81068910	iswiki	156201	Kórónaveirufaraldur_2019-2020
Q81068910	kywiki	120364	Ухань_коронавирусы_чыгышы
Q81068910	mrwiki	252115	२०१९-२०२०_वुहान_कोरोना_व्हायरसचा_उद्रेक
Q81068910	mywiki	128482	၂၀၁၉-၂၀_ကိုရိုနာဗိုင်းရပ်စ်_ဖြစ်ပွားမှု
Q81068910	ruwiki	8210005	Пандемия_COVID-19
Q81068910	sahwiki	44813	Ухань_куоратка_коронавирус_тарҕаныыта
Q81068910	ugwiki	18498	2019-2020_ۋۇخەن_تاجىسىمان_ۋىرۇسنىڭ_شىددەتلىك_تارقىلىشى_ۋەقەسى
Q81068910	ukwikinews	9201	Коронавірус
Q81068910	wuuwiki	58758	2019新型冠状病毒肺炎事体
Q81068910	zh_min_nanwiki	776538	2019_nî–2020_nî_COVID-19_liû-hêng
Q81068910	zh_yuewiki	213932	武漢肺炎大爆發
Q81068910	zhwikinews	42610	2020年新型冠状病毒肺炎
03-02 Snapshot
SELECT item_id, wiki_db, page_id, page_title FROM wmf.wikidata_item_page_link WHERE snapshot = '2020-03-02' AND item_id = 'Q81068910' ORDER BY wiki_db ASC LIMIT 1000;

item_id	wiki_db	page_id	page_title
Q81068910	afwiki	287752	Koronavirusepidemie_van_2020
Q81068910	alswiki	79617	Coronavirus-Epidemie_2019/2020
Q81068910	bat_smgwiki	36133	2019-2020_m._Kuoruonas_vėrosa_epėdemėjė
Q81068910	bclwiki	20503	Coronavirus-_Pagbungkaras_2019-20
Q81068910	bewiki	629170	Успышка_каронавіруса_2019-nCoV
Q81068910	bgwiki	741190	Епидемия_от_коронавирус_2019-20
Q81068910	bnwiki	830337	২০১৯_সালের_করোনাভাইরাসঘটিত_ব্যাধির_প্রাদুর্ভাব_(২০১৯-২০২০)
Q81068910	brwiki	142318	Kleñved-red_Wuhan
Q81068910	cawiki	1731391	Epidèmia_per_coronavirus_de_2019-2020
Q81068910	cswiki	1559185	Epidemie_koronaviru_SARS-CoV-2
Q81068910	cswiki	1565100	Epidemie_koronaviru_SARS-CoV-2
Q81068910	cvwiki	81068	2019-nCoV_коронавирус_хӗлхемӗ
Q81068910	dawiki	1021283	Udbrud_af_COVID-19_i_Wuhan_2019-2020
Q81068910	dawiki	1022999	Udbrud_af_COVID-19_i_Wuhan_2019-2020
Q81068910	dewiki	11183087	Coronavirus-Epidemie_2019/2020
Q81068910	dewiki	11131846	Coronavirus-Epidemie_2019/2020
Q81068910	enwiki	63000665	2019–20_coronavirus_outbreak
Q81068910	enwiki	62750956	2019–20_coronavirus_outbreak
Q81068910	enwikivoyage	177486	2019–2020_coronavirus_outbreak
Q81068910	eowiki	679250	Epidemio_de_COVID-19
Q81068910	eowikinews	6850	Epidemio_de_COVID-19
Q81068910	etwiki	575417	2019.–2020._aasta_koroonaviirushaiguse_puhang
Q81068910	euwiki	931360	COVID-19_gaixotasunaren_2019-2020ko_agerraldia
Q81068910	fawiki	5157569	شیوع_۲۰–۲۰۱۹_کروناویروس
Q81068910	fiwiki	1560152	Koronavirusepidemia_2019–2020
Q81068910	frwiki	13125712	Épidémie_de_maladie_à_coronavirus_de_2019-2020
Q81068910	frwiki	13069866	Épidémie_de_maladie_à_coronavirus_de_2019-2020
Q81068910	frwikinews	84138	Épidémie_de_maladie_à_coronavirus_de_2019-2020
Q81068910	gawiki	91784	Ráig_coróinvíris_2019–2020
Q81068910	hakwiki	20094	2019-ngièn_chṳ_2020-ngièn_COVID-19_liù-hàng
Q81068910	hewiki	1790961	התפרצות_נגיף_קורונה-ווהאן
Q81068910	hewiki	1794362	התפרצות_נגיף_קורונה-ווהאן
Q81068910	hiwiki	1131137	2019–20_वुहान_कोरोना_वायरस_प्रकोप
Q81068910	htwiki	73699	Epidemi_kowonaviris_an_2019-2020
Q81068910	huwiki	1679312	COVID-19_koronavírus-járvány
Q81068910	hywiki	961996	Նոր_կորոնավիրուսի_բռնկում_(2019-2020)
Q81068910	idwiki	3001939	Wabah_penyakit_koronavirus_2019–2020
Q81068910	iswiki	156201	Kórónaveirufaraldur_2019-2020
Q81068910	itwiki	8303012	Epidemia_di_COVID-19_del_2019-2020
Q81068910	itwikiquote	166969	Epidemia_di_SARS-CoV-2_del_2019-2020
Q81068910	kawiki	458509	კორონავირუსის_ეპიდემია_2019-2020
Q81068910	kkwiki	633976	COVID-19_індеті
Q81068910	kowiki	2643632	코로나바이러스감염증-19_유행
Q81068910	kowiki	2624467	코로나바이러스감염증-19_유행
Q81068910	kowiki	2642027	코로나바이러스감염증-19_유행
Q81068910	kuwiki	83579	Belavbûna_vîrûsa_Korona_2019
Q81068910	kywiki	120364	Ухань_коронавирусы_чыгышы
Q81068910	lijwiki	18072	Premonîa_de_Wuhan
Q81068910	ltwiki	570568	Koronaviruso_2019-nCoV_protrūkis
Q81068910	lvwiki	450232	COVID-19_epidēmija
Q81068910	mrwiki	252115	२०१९-२०२०_वुहान_कोरोना_व्हायरसचा_उद्रेक
Q81068910	mswiki	995498	Wabak_COVID-19_(2019-20)
Q81068910	mywiki	128482	၂၀၁၉-၂၀_ကိုရိုနာဗိုင်းရပ်စ်_ဖြစ်ပွားမှု
Q81068910	nlwiki	5312604	Uitbraak_coronavirus_(SARS-CoV-2)_in_2019
Q81068910	nlwikinews	22633	2019-nCoV
Q81068910	plwiki	4701039	Szerzenie_się_zakażeń_wirusem_SARS-CoV-2
Q81068910	ptwiki	6175224	Surto_de_COVID-19
Q81068910	rowiki	2472772	Epidemia_de_coronavirus_(2019-nCoV)
Q81068910	rowiki	2475440	Epidemia_de_coronavirus_(2019-nCoV)
Q81068910	ruwiki	8173519	Вспышка_COVID-19
Q81068910	sahwiki	44813	Ухань_куоратка_коронавирус_тарҕаныыта
Q81068910	scowiki	214894	2019–20_coronavirus_ootbrak
Q81068910	scwiki	19311	Epidemia_de_Pneumònia_peri_Coronavirus_de_2019-2020
Q81068910	simplewiki	733960	2019–20_coronavirus_outbreak
Q81068910	siwiki	84157	2019–20_වූහාන්_කොරෝනාවෛරසයෙහි_ව්‍යාප්තිය
Q81068910	sqwiki	276736	Koronavirusi_2019/2020
Q81068910	srwiki	4121385	Епидемија_вируса_корона_2019/20.
Q81068910	svwiki	8414277	Coronavirusutbrottet_2019–2020
Q81068910	svwiki	8415214	Coronavirusutbrottet_2019–2020
Q81068910	swwiki	121175	Mlipuko_wa_virusi_vya_korona_Wuhan_2019-20
Q81068910	thwiki	1076225	การระบาดของโคโรนาไวรัส_พ.ศ._2562–2563
Q81068910	trwiki	2536507	2019-2020_koronavirüs_salgını
Q81068910	ugwiki	18498	2019-2020_ۋۇخەن_تاجىسىمان_ۋىرۇسنىڭ_شىددەتلىك_تارقىلىشى_ۋەقەسى
Q81068910	ukwikinews	9201	Коронавірус
Q81068910	ukwikiquote	33240	Спалах_коронавірусу_2019-nCoV
Q81068910	urwiki	884409	کورونا_وائرس_کی_وبا،_2019ء_-_2020ء
Q81068910	vecwiki	51867	Epidemia_de_SARS-CoV-2_del_2019-2020
Q81068910	viwiki	17814719	Dịch_COVID-19
Q81068910	wuuwiki	58758	2019新型冠状病毒肺炎事体
Q81068910	zh_classicalwiki	92198	己亥-庚子武漢病毒事
Q81068910	zh_min_nanwiki	776538	2019_nî–2020_nî_COVID-19_liû-hêng
Q81068910	zh_yuewiki	213932	武漢肺炎大爆發
Q81068910	zhwiki	6861061	2019冠狀病毒病疫情
Q81068910	zhwikinews	42610	2020年新型冠状病毒肺炎

Related Objects

Event Timeline

Milimetric triaged this task as Medium priority.Apr 13 2020, 3:38 PM
Milimetric moved this task from Incoming to Data Quality on the Analytics board.
Milimetric subscribed.

Not sure when we can get to this, but do let us know if it's more urgent than we think.

Not sure when we can get to this, but do let us know if it's more urgent than we think.

@Milimetric totally understandable. We currently have the workaround of only using the snapshots from the very start of the month to reduce the issues mentioned in the task but I do think (hope) that the suggested change would substantially improve the robustness of the data without changing the schema etc. for more breaking-news-type content like Covid-19. I recognize though that the change is not just the work of one hour as it likely requires some analysis to make sure it doesn't introduce new bugs. Let me know if I can be of any assistance there.

Milimetric raised the priority of this task from Medium to High.Apr 16 2020, 4:04 PM

marking high so we get to it before everyone forgets the context (still might be a while)

Thanks @Milimetric !

@MGerlach proposed that we build a query that gathers all of the data across all of the snapshots to help temporarily address this problem in the meantime too. I've put the Spark query for that below. Did a bit more testing as part of it and just some more details. Note that this doesn't tell us about how much more data'd have if we implemented my suggested changes (this is just about this more temporary hack):

  • The query below led to 53,084,491 rows from all the snapshots -- i.e. unique triplets of (wiki_db, page_id, item_id) like (enwiki, 62750956, Q81068910). The goal is to use this table to match a given wiki_db and page_id with its associated item_id.
    • This is in comparison to 52,960,356 rows if the snapshot is just 2020-04-06 (most recent), so we recover 124,135 triplets which I'm pretty sure correspond to page moves.
  • 51,668,591 (98.7%) of wiki_db-item_id pairs have only a single associated page_id. An additional 595,388 (1.1%) have just two page_ids.
  • For the outliers that have many associated page_ids (up to 80), this seems to come from pages that were created and then deleted many times over history. I'm not sure if this was actually happening to these pages or some sort of bug in the mediawiki_page_history table. I show an example below.
  • The goal is to get most recent QID for any given wiki-pageID pair -- i.e. if the 2020-04-06 snapshot has data on the same wiki_db-page_id as the 2020-03-01 snapshot, we want the QID from the 2020-04-06 snapshot. This preserves the ability to use this table to 1:1 match wiki_db-page_ids to Wikidata item_ids. It also highlights that because a single item_id may have multiple associated page_ids from the same wiki in even a single snapshot, this table is not generally useful for mapping item_ids to page_ids for a given wiki as that can be 1:many and it wouldn't be clear which page_id is the most current.
# Query for combining all item_page_link snapshots to be as complete as possible
# Filtering: namespace = 0; only Wikipedia sitelinks
spark.sql("""
CREATE TABLE isaacj.tmp_wikidata_allsnapshots AS
    WITH wikipedia_projects AS (
        SELECT DISTINCT dbname
          FROM wmf_raw.mediawiki_project_namespace_map
         WHERE snapshot = '2020-01'
               AND hostname LIKE '%wikipedia%'
        ),
    wd AS (
        SELECT wiki_db,
               page_id,
               FIRST_VALUE(item_id, true) OVER (PARTITION BY wiki_db, page_id ORDER BY snapshot DESC) as item_id
          FROM wmf.wikidata_item_page_link
         WHERE snapshot >= '2020-01-06'
               AND page_namespace = 0
    )
    SELECT DISTINCT wd.page_id,
           wd.item_id,
           wd.wiki_db
      FROM wd
     INNER JOIN wikipedia_projects p
           ON (wd.wiki_db = p.dbname)
           """)
# Examination of an outlier (wiki_db-item_id with many associated page_ids)
SELECT page_id, page_title, start_timestamp, end_timestamp, page_is_redirect, page_title_historical FROM wmf.mediawiki_page_history WHERE wiki_db = 'simplewiki' AND page_title = 'Jabuticaba' and snapshot = '2020-03' and page_namespace = 0 ORDER BY start_timestamp LIMIT 1000;

page_id    page_title    start_timestamp    end_timestamp    page_is_redirect    page_title_historical
524909    Jabuticaba    NULL    2015-12-10 18:12:39.0    NULL    Jabuticaba
525798    Jabuticaba    2015-12-10 18:12:39.0    2015-12-18 17:39:27.0    NULL    Jabuticaba
524909    Jabuticaba    2015-12-10 18:12:39.0    2015-12-10 18:12:39.0    NULL    Jabuticaba
525798    Jabuticaba    2015-12-18 17:39:27.0    2015-12-18 17:39:27.0    NULL    Jabuticaba
527319    Jabuticaba    2015-12-18 17:39:27.0    2016-01-06 17:49:15.0    NULL    Jabuticaba
527319    Jabuticaba    2016-01-06 17:49:15.0    2016-01-06 17:49:15.0    NULL    Jabuticaba
532849    Jabuticaba    2016-01-06 17:49:15.0    2016-02-22 04:53:49.0    NULL    Jabuticaba
536895    Jabuticaba    2016-02-22 04:53:49.0    2016-03-29 21:31:09.0    NULL    Jabuticaba
532849    Jabuticaba    2016-02-22 04:53:49.0    2016-02-22 04:53:49.0    NULL    Jabuticaba
536895    Jabuticaba    2016-03-29 21:31:09.0    2016-03-29 21:31:09.0    NULL    Jabuticaba
537251    Jabuticaba    2016-03-29 21:31:09.0    2016-04-03 03:00:04.0    NULL    Jabuticaba
537950    Jabuticaba    2016-04-03 03:00:04.0    2016-04-11 01:55:39.0    NULL    Jabuticaba
537251    Jabuticaba    2016-04-03 03:00:04.0    2016-04-03 03:00:04.0    NULL    Jabuticaba
537950    Jabuticaba    2016-04-11 01:55:39.0    2016-04-11 01:55:39.0    NULL    Jabuticaba
540909    Jabuticaba    2016-04-11 01:55:39.0    2016-05-05 21:22:13.0    NULL    Jabuticaba
542888    Jabuticaba    2016-05-05 21:22:13.0    2016-05-21 22:22:12.0    NULL    Jabuticaba
540909    Jabuticaba    2016-05-05 21:22:13.0    2016-05-05 21:22:13.0    NULL    Jabuticaba
543214    Jabuticaba    2016-05-21 22:22:12.0    2016-05-24 05:42:58.0    NULL    Jabuticaba
542888    Jabuticaba    2016-05-21 22:22:12.0    2016-05-21 22:22:12.0    NULL    Jabuticaba
543214    Jabuticaba    2016-05-24 05:42:58.0    2016-05-24 05:42:58.0    NULL    Jabuticaba
543266    Jabuticaba    2016-05-24 05:42:58.0    2016-05-24 11:58:47.0    NULL    Jabuticaba
562973    Jabuticaba    2016-05-24 11:58:47.0    2016-11-30 13:29:00.0    NULL    Jabuticaba
543266    Jabuticaba    2016-05-24 11:58:47.0    2016-05-24 11:58:47.0    NULL    Jabuticaba
567854    Jabuticaba    2016-11-30 13:29:00.0    2017-01-16 11:59:56.0    NULL    Jabuticaba
562973    Jabuticaba    2016-11-30 13:29:00.0    2016-11-30 13:29:00.0    NULL    Jabuticaba
568197    Jabuticaba    2017-01-16 11:59:56.0    2017-01-18 13:00:07.0    NULL    Jabuticaba
567854    Jabuticaba    2017-01-16 11:59:56.0    2017-01-16 11:59:56.0    NULL    Jabuticaba
568197    Jabuticaba    2017-01-18 13:00:07.0    2017-01-18 13:00:07.0    NULL    Jabuticaba
574108    Jabuticaba    2017-01-18 13:00:07.0    2017-03-04 15:24:29.0    NULL    Jabuticaba
578649    Jabuticaba    2017-03-04 15:24:29.0    2017-03-29 19:23:09.0    NULL    Jabuticaba
574108    Jabuticaba    2017-03-04 15:24:29.0    2017-03-04 15:24:29.0    NULL    Jabuticaba
578649    Jabuticaba    2017-03-29 19:23:09.0    2017-03-29 19:23:09.0    NULL    Jabuticaba
579599    Jabuticaba    2017-03-29 19:23:09.0    2017-04-05 20:23:46.0    NULL    Jabuticaba
579599    Jabuticaba    2017-04-05 20:23:46.0    2017-04-05 20:23:46.0    NULL    Jabuticaba
581102    Jabuticaba    2017-04-05 20:23:46.0    2017-04-18 09:19:39.0    NULL    Jabuticaba
581971    Jabuticaba    2017-04-18 09:19:39.0    2017-04-26 02:14:58.0    NULL    Jabuticaba
581102    Jabuticaba    2017-04-18 09:19:39.0    2017-04-18 09:19:39.0    NULL    Jabuticaba
583652    Jabuticaba    2017-04-26 02:14:58.0    2017-05-08 19:45:19.0    NULL    Jabuticaba
581971    Jabuticaba    2017-04-26 02:14:58.0    2017-04-26 02:14:58.0    NULL    Jabuticaba
583652    Jabuticaba    2017-05-08 19:45:19.0    2017-05-08 19:45:19.0    NULL    Jabuticaba
584953    Jabuticaba    2017-05-08 19:45:19.0    2017-05-19 01:04:03.0    NULL    Jabuticaba
584953    Jabuticaba    2017-05-19 01:04:03.0    2017-05-19 01:04:03.0    NULL    Jabuticaba
586487    Jabuticaba    2017-05-19 01:04:03.0    2017-06-01 20:26:20.0    NULL    Jabuticaba
586487    Jabuticaba    2017-06-01 20:26:20.0    2017-06-01 20:26:20.0    NULL    Jabuticaba
586553    Jabuticaba    2017-06-01 20:26:20.0    NULL    false    Jabuticaba

I've never seen an example like Jabuticaba, I'll have to look into that (indeed fascinating)! I know there are lots of outliers where page titles actually change many times, get deleted and restored, etc. So long page histories are there, though like you found, not the norm.

I think the approach makes sense, I need some time to figure out how to prioritize between all the snapshots and multiple matches with all the historical titles. We need to not duplicate those joins.

Long term, we really want to incrementally update page_history and the other history tables, that would solve this problem in a much cleaner way.

@Milimetric yeah, while doing the research around this task, I've come to super appreciate the mediawiki_page_history table given how difficult it is to figure out page move history from from an article's edit history. It'd be great to eventually make that table public too so people could quickly query a page's history for longitudinal research, but I'll keep my sights on this first :)

I think the approach makes sense, I need some time to figure out how to prioritize between all the snapshots and multiple matches with all the historical titles. We need to not duplicate those joins.

Sounds good. Don't hesitate to let me know if any of my thoughts didn't come out clearly. The temporary hack I mentioned in T249773#6065879 is hopefully only temporary too -- the original task description is what I'm hoping actually happens so we can stick with querying a single snapshot and avoiding all those joins :)

Long term, we really want to incrementally update page_history and the other history tables, that would solve this problem in a much cleaner way.

Ahh yes, that would be nice and I assume deal with a lot of the challenges with reconstructing this history. I won't pretend to understand how hard that is though.

@Milimetric is to look at the work but just setting expectations that if we need to modify mediaiwki_history to do changes we will probably not be able to do it in the near term as we want to reduce our high risk changes.

@Nuria understood -- my proposed change doesn't modify any underlying tables, just the query to produce the item_page_link table. But I understand that if it seems like the correct solution is actually something much bigger, it could be a while.

Ok... been thinking, @Isaac / @MGerlach, I may have a simpler and more accurate approach. We could use the stream of page move data to augment the page history. I think all we would need to do is get the last page move for a given page id and use that title as the correct current title. Or ... I guess the last page move that still happened before the last import of wikidata sitelinks. Looking through the example you provided, it looks like this works:

 SELECT month, day, hour,
        page_title,
        new_redirect_page
   FROM event.mediawiki_page_move
  WHERE database = 'enwiki'
    AND page_id = 62750956
    AND year > 2018
;

Sorry to be such a slowpoke these days, but what do you think? I could update the job and this would be our first little foray into incrementally updating the history dataset with real time-ish data. The nice thing about this approach is we could technically update as frequently as we want, as long as we can get the sitelinks imported.

Or ... I guess the last page move that still happened before the last import of wikidata sitelinks.

@Milimetric yeah, that's really the challenge. Making sure the page title -> page ID mapping for a given wiki is from the same point in time as the wikidata sitelinks dump that is being used. Correct me if I'm misunderstanding, but your idea is to replace the query of mediawiki_page_history with mediawiki_page_move? Certainly feels more direct and I think would accomplish what I want, so I'm onboard. So if the sitelinks snapshot is 2020-04-13, then you'd want something like what's below (I think the other fields like namespace can be extracted too)?

 SELECT FIRST_VALUE(page_title, true) OVER (PARTITION BY page_id, database ORDER BY meta.dt DESC) as page_title,
        page_id,
        database
   FROM event.mediawiki_page_move
  WHERE (year < 2020 OR month < 4 OR day <= 13)
;

Note: Hive gets mad at me everytime I include database as a field in that query. I assume because it's a protected keyword? Not sure if this is something that should be fixed or if there's an easy workaround. I've been getting around this by actually using the meta.domain field (which would be e.g., en.wikipedia.org). Querying of this table seems to also take a very long time (and failed due to Java heap space when I tested the above for our running example of page_id = 62750956 and database = 'enwiki'), so I couldn't test out that query.

Not sure if this is something that should be fixed or if there's an easy workaround.

@Isaac: quoting "database" should work. Now seeing the select spans all partitions for table I doubt it would work as written, there are 22,000 partitions and thus the GC fail (you can do >show partitions <table> to see number of partitions) .

quoting "database" should work.

Thanks @Nuria !

Now seeing the select spans all partitions for table I doubt it would work as written, there are 22,000 partitions and thus the GC fail (you can do >show partitions <table> to see number of partitions) .

Yeah, agreed. I'll leave to @Milimetric to figure that one out as it's definitely outside of my expertise :) Perhaps my version of the query can be simplified to avoid the issue because I like this idea of directly using the page move table. But explains why the query for mediawiki_page_history seemed to be much faster despite having (presumably) much more data associated with it.

Yes, Isaac, that's the idea. Though the result of your query would be joined with the page titles from mediawiki_page_history, because most pages would not be renamed. So the pseudocode would be like:

SITELINKS_IMPORT = 'some date';
LAST_HISTORY_SNAPSHOT = 'some date';

for every page_id
  get latest page_title from mediawiki_page_history where snapshot = LAST_HISTORY_SNAPSHOT
  get latest page_title from event.mediawiki_page_move where *date* between LAST_HISTORY_SNAPSHOT and SITELINKS_IMPORT

I'll have this working sometime soon and we can look at the details together.

@Milimetric :thumbs up: looks like your approach then will help reduce the number of partitions that have to be searched for mediawiki_page_move and make this actually feasible

@Isaac btw regarding your query troubles before, I should've said but I run my queries in the superset SQL lab, with presto, and it's AMAZING. It auto-quotes things like database and is just generally faster and better. https://superset.wikimedia.org/superset/sqllab in case you haven't used it yet.

Change 594428 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[analytics/refinery/source@master] [WIP] Use page move events to improve joining to entity

https://gerrit.wikimedia.org/r/594428

Quick update, the patch I'm about to send shows some promise. I am compiling and testing on the full dataset, but for the example we're looking at, it seems to do the trick, the following is just the query from the spark job instantiated with the parameters. The updated_page_titles CTE is the addition as we described above:

-- assuming 2020-02 as the snapshot we're looking at
-- and 2020-03-02 as the sitelinks, we get '..._outbreak' in both
-- and if we use 2020-03-23, we should get the pagemove and the new title '... pandemic' (renamed on 3-11)

WITH

snapshot_page_titles AS (

  SELECT DISTINCT
    wiki_db,
    page_id,
    first_value(page_title) OVER w AS page_title,
    first_value(page_namespace) OVER w AS page_namespace
  FROM wmf.mediawiki_page_history
  WHERE snapshot = '2020-02'
    AND page_id IS NOT NULL AND page_id > 0 AND page_id = 62750956 AND wiki_db = 'enwiki'
    AND page_title IS NOT NULL and LENGTH(page_title) > 0
  WINDOW w AS (
    PARTITION BY
      wiki_db,
      page_id
    ORDER BY
      start_timestamp DESC, -- If events have the same timestamp
      source_log_id DESC,   -- Use biggest source_log_id. If same source_log_id
      caused_by_event_type  -- then use create instead of delete.
  )
),

updated_page_titles AS (
  SELECT DISTINCT
    `database` AS wiki_db,
    page_id,
    first_value(page_title) OVER w AS page_title,
    first_value(page_namespace) OVER w AS page_namespace
  FROM event.mediawiki_page_move
  WHERE page_id IS NOT NULL AND page_id > 0
    AND page_title IS NOT NULL and LENGTH(page_title) > 0
    AND concat(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0'))
        between '2020-02' and '2020-03-23T00:00:00Z'
  WINDOW w AS (
    PARTITION BY
      `database`,
      page_id
    ORDER BY
      meta.dt DESC
  )
),

current_page_titles AS (
  SELECT s.wiki_db,
    s.page_id,
    coalesce(u.page_title, s.page_title) as page_title,
    coalesce(u.page_namespace, s.page_namespace) as page_namespace
  FROM snapshot_page_titles s
    LEFT JOIN updated_page_titles u
      ON (
        s.wiki_db = u.wiki_db
        AND s.page_id = u.page_id
      )
),

localized_namespace_titles AS (

  SELECT
    wiki_db,
    page_id,
    page_title,
    page_namespace,
    CASE WHEN (LENGTH(namespace_localized_name) > 0)
      THEN CONCAT(namespace_localized_name, ':', page_title)
      ELSE page_title
    END AS page_title_localized_namespace
  FROM current_page_titles cpt
    INNER JOIN wmf_raw.mediawiki_project_namespace_map nsm
      ON (
        cpt.wiki_db = nsm.dbname
        AND cpt.page_namespace = nsm.namespace
        AND nsm.snapshot = '2020-02'
      )
),

wikidata_sitelinks AS (
  SELECT
    id as item_id,
    EXPLODE(siteLinks) AS sitelink
  FROM wmf.wikidata_entity
  WHERE snapshot = '2020-03-23'
    AND size(siteLinks) > 0
)

SELECT
  item_id,
  wiki_db,
  page_id,
  page_title,
  page_namespace,
  page_title_localized_namespace
FROM wikidata_sitelinks ws
  INNER JOIN localized_namespace_titles lnt
    ON (
      ws.sitelink.site = lnt.wiki_db
      AND REPLACE(ws.sitelink.title, ' ', '_') = page_title_localized_namespace
    )

limit 10000
;

@Milimetric that looks great! Only comment: I assume that the wmf.mediawiki_page_history snapshot will always be earlier than the wmf.wikidata_entity snapshot? If not, probably want to enforce that because otherwise I think that the join will return the wrong page ID for pages that were moved after the wikidata_sitelinks snapshot date.

Change 594719 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[analytics/refinery@master] Use new page move incremental updates

https://gerrit.wikimedia.org/r/594719

@Isaac, I was finally able to run this successfully. I'm vetting the data a little bit now, basically just spot checking. If you want to play along, here's where it is:

 select *
   from wmf.wikidata_item_page_link
  where snapshot='2020-03-02'
;

(I unfortunately accidentally overwrote one of the official snapshots, in wmf.wikidata_item_page_link, because of a misconfiguration of the test job (it shares config between input and output and I didn't realize that))

Question: is that ok? I can easily regenerate the 2020-03-02 snapshot as if it was generated by the old logic. Let me know either way.

Question: is that ok? I can easily regenerate the 2020-03-02 snapshot as if it was generated by the old logic. Let me know either way.

@Milimetric no problem from my end but thanks for checking. My session generation job that uses these snapshots is still using the full set of snapshots (the hack I mentioned in T249773#6065879), so I doubt much will change.

I'm vetting the data a little bit now, basically just spot checking. If you want to play along, here's where it is:

Thanks! I'll do my best to take a quick look.

One other thing that I thought of that might speed up the query: I can never remember how snapshot dates work but figured it was best to verify. For a 2020-03-23 sitelinks snapshot, I'm pretty sure that is the Wikidata sitelinks on 2020-03-23, so using 2020-03-23 as a date makes sense. For a 2020-02 mediawiki_page_history snapshot, though, I had thought that was for all of February, so you really only need page moves since 2020-03-01?

Quick context for snapshot ranges -- I checked via this query (I assume the spillover to April is unpredictable and just depends on when the job actually gets scheduled):

SELECT MIN(page_creation_timestamp), MAX(page_creation_timestamp) FROM wmf.mediawiki_page_history WHERE snapshot = '2020-03';

_c0	                _c1
1999-11-30 14:03:30.0	2020-04-02 10:13:09.0

One other thing that I thought of that might speed up the query: I can never remember how snapshot dates work but figured it was best to verify. For a 2020-03-23 sitelinks snapshot, I'm pretty sure that is the Wikidata sitelinks on 2020-03-23, so using 2020-03-23 as a date makes sense. For a 2020-02 mediawiki_page_history snapshot, though, I had thought that was for all of February, so you really only need page moves since 2020-03-01?

They're actually both labeled on the "left" of the time interval. So the 2020-02 history snapshot includes 2020-02-29T23:59:59... and the 2020-03-23 sitelinks snapshot includes 2020-03-30... So in the candidate query I have this hairball:

between
    -- Since history snapshot includes that month we start looking 1 month later
    TO_DATE(concat('${params.historySnapshot}', '-01')) + INTERVAL 1 MONTH
    and
    -- wikidata snapshot includes that week, so look 1 week later
    TO_DATE('${params.wikidataSnapshot}') + INTERVAL 7 DAYS

Quick context for snapshot ranges -- I checked via this query (I assume the spillover to April is unpredictable and just depends on when the job actually gets scheduled):

SELECT MIN(page_creation_timestamp), MAX(page_creation_timestamp) FROM wmf.mediawiki_page_history WHERE snapshot = '2020-03';

_c0	                _c1
1999-11-30 14:03:30.0	2020-04-02 10:13:09.0

Yes, I am actually not sure where that weird min date comes from, must just be bad source data in mediawiki dbs. The max is like that because we start sqooping on 04-01T00:00:00 and finish at some point over the next 24 hours. But those dbs are being written to. We specify some time ranges on things like revisions, that have timestamps, but we can't know whether a record in the page table was updated or added or what, it has no timestamp. We could guess and use the creation timestamp but it's safer not to. So, in lieu of sourcing this on events like we probably will in the future, we just take the spillover, like you guessed.

Vetting. (Note: the new partition wrote bigint for page_namespace, according to the new code, so I created milimetric.wikidata_item_page_link pointing at the same data so we could query it more easily.)

The coronavirus page looks even better than the previous 03-02 snapshot, 98 links compared to 84. I looked into some of those duplicate entries with different page ids for the same title and wiki_db, they're all archived now in the db, but it's weird that they had the same title. Maybe the title normalization squished two different titles into one... in any case, look out for that as you're using the data, I didn't dig too much.

 select *
   from milimetric.wikidata_item_page_link
  where snapshot='2020-03-02'
    and item_id = 'Q81068910'
;

example dupes:
Q81068910	enwiki	62750956	2019–20_coronavirus_outbreak	0	2019–20_coronavirus_outbreak	2020-03-02
Q81068910	enwiki	63000665	2019–20_coronavirus_outbreak	0	2019–20_coronavirus_outbreak	2020-03-02

Q81068910	dewiki	11131846	Coronavirus-Epidemie_2019/2020	0	Coronavirus-Epidemie_2019/2020	2020-03-02
Q81068910	dewiki	11183087	Coronavirus-Epidemie_2019/2020	0	Coronavirus-Epidemie_2019/2020	2020-03-02

Q81068910	viwiki	16945751	Dịch_COVID-19	0	Dịch_COVID-19	2020-03-02
Q81068910	viwiki	17814719	Dịch_COVID-19	0	Dịch_COVID-19	2020-03-02

The other articles I looked at all had decent links. Barak Obama 311, boat 119, coal 135, Mikhail Tal 69, Judit Polgar 65 (there is no God), all in line with how many links wikidata itself shows (though higher because of the dupes, especially for some reason in Obama's case).

So I donno, I'd say this looks good, anecdotally.

@Milimetric thanks for actually verifying my conjecturing around snapshot dates :)

Based on the history for 2019-20 Coronavirus Pandemic in the task description (copied below), it looks like the duplicates come from deleted pages. I'm not sure how to filter this out from event.mediawiki_page_move though... Maybe have to do another join to catch these or maybe I'm missing something?

SELECT a.page_id, a.caused_by_event_type, a.start_timestamp, a.page_title_historical, a.page_is_redirect FROM mediawiki_page_history a WHERE a.page_title_historical IN (SELECT page_title_historical FROM mediawiki_page_history p WHERE p.snapshot = '2020-03' AND p.page_id = 62750956 AND p.wiki_db = 'enwiki' AND p.page_namespace = 0) AND a.snapshot = '2020-03' AND a.wiki_db = 'enwiki' AND a.page_namespace = 0 ORDER BY a.page_title_historical, a.start_timestamp LIMIT 1000;

page_id		caused_by_event_type	start_timestamp		page_title_historical							page_is_redirect
62750956	create			2020-01-05 15:58:49.0	2019-2020_China_pneumonia_outbreak					false
62750956	create-page		2020-01-05 15:58:49.0	2019-2020_China_pneumonia_outbreak					false
62752361	create			2020-01-05 19:37:35.0	2019-2020_China_pneumonia_outbreak					true
62750956	move			2020-02-06 15:44:16.0	2019-20_novel_coronavirus_outbreak_(2019-nCoV)				false
63040557	create			2020-02-06 15:52:33.0	2019-20_novel_coronavirus_outbreak_(2019-nCoV)				true
62750956	move			2020-01-16 00:55:43.0	2019-20_outbreak_of_novel_coronavirus_(2019-nCoV)			false
62841093	create			2020-01-16 04:27:51.0	2019-20_outbreak_of_novel_coronavirus_(2019-nCoV)			true
62750956	move			2020-01-15 23:48:49.0	2019-20_outbreak_of_novel_coronavirus_(2019-nCoV)_in_Wuhan,_China	false
62839906	create			2020-01-16 00:55:43.0	2019-20_outbreak_of_novel_coronavirus_(2019-nCoV)_in_Wuhan,_China	true
62750956	move			2020-01-05 19:37:35.0	2019–20_China_pneumonia_outbreak					false
62839522	create			2020-01-15 23:48:49.0	2019–20_China_pneumonia_outbreak					true
62750956	move			2020-01-23 22:43:01.0	2019–20_Wuhan_coronavirus_outbreak					false
63040489	create			2020-02-06 15:44:16.0	2019–20_Wuhan_coronavirus_outbreak					NULL
63040489	delete			2020-02-06 15:52:23.0	2019–20_Wuhan_coronavirus_outbreak					NULL
62750956	move			2020-02-06 15:52:33.0	2019–20_Wuhan_coronavirus_outbreak					false
63147709	create			2020-02-18 22:04:15.0	2019–20_Wuhan_coronavirus_outbreak					true
63000665	create			NULL			2019–20_coronavirus_outbreak						NULL
63000665	create-page		2020-02-02 03:43:31.0	2019–20_coronavirus_outbreak						NULL
63000665	delete			2020-02-18 22:04:07.0	2019–20_coronavirus_outbreak						NULL
62750956	move			2020-02-18 22:04:15.0	2019–20_coronavirus_outbreak						false
63347546	create			2020-03-11 17:23:57.0	2019–20_coronavirus_outbreak						true
63347350	create			NULL			2019–20_coronavirus_pandemic						NULL
63347350	create-page		2020-03-11 16:54:23.0	2019–20_coronavirus_pandemic						NULL
63347350	delete			2020-03-11 17:23:51.0	2019–20_coronavirus_pandemic						NULL
62750956	move			2020-03-11 17:23:57.0	2019–20_coronavirus_pandemic						false
62750956	move			2020-01-16 04:27:51.0	2019–20_outbreak_of_novel_coronavirus_(2019-nCoV)			false
62907619	create			2020-01-23 17:52:56.0	2019–20_outbreak_of_novel_coronavirus_(2019-nCoV)			true
62750956	move			2020-01-23 17:52:56.0	2019–20_outbreak_of_novel_coronavirus_2019-nCoV				false
62911066	create			2020-01-23 22:43:01.0	2019–20_outbreak_of_novel_coronavirus_2019-nCoV				true

Oh that makes sense, the query in use right now has the same problem because it's not filtering out pages that have their last status as "delete". We could filter it out from history, but that would still get us pages that have been deleted since then. I guess the clever thing to do would be to also look at the mediawiki_page_delete events and exclude any page ids that show up there.

In both cases, we'd run into trouble when the page is restored. The safe thing is to just include the duplicates and filter them out later. What would be preferable for the jobs that use this on your side?

So a simple improvement is adding a AND NOT page_is_deleted to the where clause for the initial snapshot_page_titles query. This would remove at least some of the redundant page titles, but that still leaves us with the time period between the snapshot for mediawiki_page_history and the snapshot for the Wikidata sitelinks as you note. Joining in the page moves like you're doing (and probably the event.mediawiki_page_create table as well) should get the rest of the cases. I am starting to think though that we essentially have to create the mediawiki_page_history table for the duration between the history snapshot and wikidata snapshot if we want to capture all the page moves, creates, deletions, and undeletes (and in the right order). If that's "simple" to do, maybe that's the right solution (and should be guaranteed to be complete and hopefully it actually wouldn't be too computationally intensive given that it will only ever be for ~1 month).

I am starting to think though that we essentially have to create the mediawiki_page_history table for the duration between the history snapshot and wikidata snapshot if we want to capture all the page moves, creates, deletions, and undeletes (and in the right order). If that's "simple" to do, maybe that's the right solution

Agreed it's the right solution, it's not simple, but we are investigating that longer-term. Now, for our purposes here, do you think you can live with the duplicates (they were there before this change)? If not, we probably have to defer the fix until we can so the right thing (TM) that you mention above.

Yeah, I can live with the duplicates until the more long-term fix for missing data is made. There are three ways I can think of for how we use this table:

  • Start with QID and map to page title / ID
  • Start with title and map to QID
  • Start with page ID and map to QID

Only the first two will be broken because of the duplicates, but thankfully almost all the work (if not all) that we do with this table is mapping page IDs -> QIDs. I personally never use the titles if I can avoid it and first map them to page IDs.

Still would be nice to fix the duplicates because, for instance, I can imagine wanting to join page IDs / titles into the wbc_entity_usage table (wmf_raw.mediawiki_wbc_entity_usage) as part of some work I'm doing around Wikidata transclusion. I'll avoid doing that until we sort this out though :)

Change 594428 merged by Milimetric:
[analytics/refinery/source@master] Use page move events to improve joining to entity

https://gerrit.wikimedia.org/r/594428

Change 594719 merged by Milimetric:
[analytics/refinery@master] Use new page move incremental updates

https://gerrit.wikimedia.org/r/594719

Milimetric moved this task from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Milimetric moved this task from Ready to Deploy to Done on the Analytics-Kanban board.

This is deployed, should be processing the 5/18 snapshot sometime soon.

Ok, dumps took longer than expected but it's done now. There's a little snag: new schema / spark produce page_namespace with type int. So queries that select page_namespace in the new snapshots will fail. I'm going to fix the table but that will break querying of old snapshots. Just a note in case this causes trouble, so you're all aware.

NOTE: that caveat is no longer an issue since we have since deleted those old snapshots.
Nuria set Final Story Points to 8.