@MGerlach and I were working with the wmf.wikidata_item_page_link table in Hive and MGerlach pointed out that some of the Covid-19 articles were missing from the 03-23 snapshot (but did exist in the 03-02 snapshot). Specifically, the 2019-2020 Coronavirus Pandemic (Wikidata; enwiki) is largely missing with only 16 sitelinks as opposed to the >100 sitelinks that should exist (and largely do exist in the 03-02 snapshot; see query results below). This is almost certainly because of a page move on March 11th that presumably happened a lot of other wikis too when the outbreak became an official pandemic: https://en.wikipedia.org/w/index.php?title=2019%E2%80%9320_coronavirus_outbreak&action=history
Going through the documentation for the item_page_link table, you mention that snapshots later in the month will be more likely to have incorrect links. And based on the query, it looks like the joining of item IDs with page IDs / wikidb has to happen on page titles because that is what Wikidata tracks. So if a page is moved midway through the month, the join with the item IDs / sitelinks fails because only the most recent title is retained for a given page ID. Or in this specific case, the Wikidata side had "2019–20 coronavirus outbreak" as the English sitelink but the page history side didn't have that page because it had been superseded by "2019–20 coronavirus pandemic", so the join failed to find an English article for Q81068910. I'm wondering what you think about the following proposal to maybe address this issue (and I only started digging into the mediwiki_page_history data so I easily could have misunderstood something):
In the current_page_titles SELECT statement, make the following changes:
- use page_title_historical instead of page_title (to capture what a page was called at a given point in time, not just what it is called currently)
- also include page_title_historical in the partition for the windowing function instead of just the first value so if a page ID has been associated with say 5 titles, they will each be represented as separate rows.
- filter on page_is_redirect != true so you only include page_id, page_title pairs that were actually articles (and therefore could have a Wikidata ID).
When it comes time for the join with the sitelinks then, it won't matter if the sitelinks dataset is out of date because it should still match with one of the old titles and pull in the correct page ID. If there is a concern that a title might be associated with multiple page IDs that weren't redirects (and therefore would have multiple entries in the final table), you could probably take just the most current page ID.
Here's the history for pageid 62750956, which has been the stable identifier for the 2019-20 coronavirus pandemic article.
SELECT page_id, caused_by_event_type, start_timestamp, page_title_historical FROM mediawiki_page_history WHERE snapshot = '2020-03' AND page_id = 62750956 AND wiki_db = 'enwiki' AND page_namespace = 0 AND page_is_redirect != true ORDER BY start_timestamp LIMIT 100; page_id caused_by_event_type start_timestamp page_title_historical 62750956 create 2020-01-05 15:58:49.0 2019-2020_China_pneumonia_outbreak 62750956 create-page 2020-01-05 15:58:49.0 2019-2020_China_pneumonia_outbreak 62750956 move 2020-01-05 19:37:35.0 2019–20_China_pneumonia_outbreak 62750956 move 2020-01-15 23:48:49.0 2019-20_outbreak_of_novel_coronavirus_(2019-nCoV)_in_Wuhan,_China 62750956 move 2020-01-16 00:55:43.0 2019-20_outbreak_of_novel_coronavirus_(2019-nCoV) 62750956 move 2020-01-16 04:27:51.0 2019–20_outbreak_of_novel_coronavirus_(2019-nCoV) 62750956 move 2020-01-23 17:52:56.0 2019–20_outbreak_of_novel_coronavirus_2019-nCoV 62750956 move 2020-01-23 22:43:01.0 2019–20_Wuhan_coronavirus_outbreak 62750956 move 2020-02-06 15:44:16.0 2019-20_novel_coronavirus_outbreak_(2019-nCoV) 62750956 move 2020-02-06 15:52:33.0 2019–20_Wuhan_coronavirus_outbreak 62750956 move 2020-02-18 22:04:15.0 2019–20_coronavirus_outbreak 62750956 move 2020-03-11 17:23:57.0 2019–20_coronavirus_pandemic
And so long as you filter on page_is_redirect != true, each of those titles shouldn't be associated with other page IDs.
SELECT a.page_id, a.caused_by_event_type, a.start_timestamp, a.page_title_historical, a.page_is_redirect FROM mediawiki_page_history a WHERE a.page_title_historical IN (SELECT page_title_historical FROM mediawiki_page_history p WHERE p.snapshot = '2020-03' AND p.page_id = 62750956 AND p.wiki_db = 'enwiki' AND p.page_namespace = 0) AND a.snapshot = '2020-03' AND a.wiki_db = 'enwiki' AND a.page_namespace = 0 ORDER BY a.page_title_historical, a.start_timestamp LIMIT 1000; page_id caused_by_event_type start_timestamp page_title_historical page_is_redirect 62750956 create 2020-01-05 15:58:49.0 2019-2020_China_pneumonia_outbreak false 62750956 create-page 2020-01-05 15:58:49.0 2019-2020_China_pneumonia_outbreak false 62752361 create 2020-01-05 19:37:35.0 2019-2020_China_pneumonia_outbreak true 62750956 move 2020-02-06 15:44:16.0 2019-20_novel_coronavirus_outbreak_(2019-nCoV) false 63040557 create 2020-02-06 15:52:33.0 2019-20_novel_coronavirus_outbreak_(2019-nCoV) true 62750956 move 2020-01-16 00:55:43.0 2019-20_outbreak_of_novel_coronavirus_(2019-nCoV) false 62841093 create 2020-01-16 04:27:51.0 2019-20_outbreak_of_novel_coronavirus_(2019-nCoV) true 62750956 move 2020-01-15 23:48:49.0 2019-20_outbreak_of_novel_coronavirus_(2019-nCoV)_in_Wuhan,_China false 62839906 create 2020-01-16 00:55:43.0 2019-20_outbreak_of_novel_coronavirus_(2019-nCoV)_in_Wuhan,_China true 62750956 move 2020-01-05 19:37:35.0 2019–20_China_pneumonia_outbreak false 62839522 create 2020-01-15 23:48:49.0 2019–20_China_pneumonia_outbreak true 62750956 move 2020-01-23 22:43:01.0 2019–20_Wuhan_coronavirus_outbreak false 63040489 create 2020-02-06 15:44:16.0 2019–20_Wuhan_coronavirus_outbreak NULL 63040489 delete 2020-02-06 15:52:23.0 2019–20_Wuhan_coronavirus_outbreak NULL 62750956 move 2020-02-06 15:52:33.0 2019–20_Wuhan_coronavirus_outbreak false 63147709 create 2020-02-18 22:04:15.0 2019–20_Wuhan_coronavirus_outbreak true 63000665 create NULL 2019–20_coronavirus_outbreak NULL 63000665 create-page 2020-02-02 03:43:31.0 2019–20_coronavirus_outbreak NULL 63000665 delete 2020-02-18 22:04:07.0 2019–20_coronavirus_outbreak NULL 62750956 move 2020-02-18 22:04:15.0 2019–20_coronavirus_outbreak false 63347546 create 2020-03-11 17:23:57.0 2019–20_coronavirus_outbreak true 63347350 create NULL 2019–20_coronavirus_pandemic NULL 63347350 create-page 2020-03-11 16:54:23.0 2019–20_coronavirus_pandemic NULL 63347350 delete 2020-03-11 17:23:51.0 2019–20_coronavirus_pandemic NULL 62750956 move 2020-03-11 17:23:57.0 2019–20_coronavirus_pandemic false 62750956 move 2020-01-16 04:27:51.0 2019–20_outbreak_of_novel_coronavirus_(2019-nCoV) false 62907619 create 2020-01-23 17:52:56.0 2019–20_outbreak_of_novel_coronavirus_(2019-nCoV) true 62750956 move 2020-01-23 17:52:56.0 2019–20_outbreak_of_novel_coronavirus_2019-nCoV false 62911066 create 2020-01-23 22:43:01.0 2019–20_outbreak_of_novel_coronavirus_2019-nCoV true
And here is the current state of the different snapshots:
03-23 Snapshot
SELECT item_id, wiki_db, page_id, page_title FROM wmf.wikidata_item_page_link WHERE snapshot = '2020-03-23' AND item_id = 'Q81068910' ORDER BY wiki_db ASC LIMIT 1000; item_id wiki_db page_id page_title Q81068910 bat_smgwiki 36133 2019-2020_m._Kuoruonas_vėrosa_epėdemėjė Q81068910 bclwiki 20503 Coronavirus-_Pagbungkaras_2019-20 Q81068910 gawiki 91784 Ráig_coróinvíris_2019–2020 Q81068910 hakwiki 20094 2019-ngièn_chṳ_2020-ngièn_COVID-19_liù-hàng Q81068910 iswiki 156201 Kórónaveirufaraldur_2019-2020 Q81068910 kywiki 120364 Ухань_коронавирусы_чыгышы Q81068910 mrwiki 252115 २०१९-२०२०_वुहान_कोरोना_व्हायरसचा_उद्रेक Q81068910 mywiki 128482 ၂၀၁၉-၂၀_ကိုရိုနာဗိုင်းရပ်စ်_ဖြစ်ပွားမှု Q81068910 ruwiki 8210005 Пандемия_COVID-19 Q81068910 sahwiki 44813 Ухань_куоратка_коронавирус_тарҕаныыта Q81068910 ugwiki 18498 2019-2020_ۋۇخەن_تاجىسىمان_ۋىرۇسنىڭ_شىددەتلىك_تارقىلىشى_ۋەقەسى Q81068910 ukwikinews 9201 Коронавірус Q81068910 wuuwiki 58758 2019新型冠状病毒肺炎事体 Q81068910 zh_min_nanwiki 776538 2019_nî–2020_nî_COVID-19_liû-hêng Q81068910 zh_yuewiki 213932 武漢肺炎大爆發 Q81068910 zhwikinews 42610 2020年新型冠状病毒肺炎
03-02 Snapshot
SELECT item_id, wiki_db, page_id, page_title FROM wmf.wikidata_item_page_link WHERE snapshot = '2020-03-02' AND item_id = 'Q81068910' ORDER BY wiki_db ASC LIMIT 1000; item_id wiki_db page_id page_title Q81068910 afwiki 287752 Koronavirusepidemie_van_2020 Q81068910 alswiki 79617 Coronavirus-Epidemie_2019/2020 Q81068910 bat_smgwiki 36133 2019-2020_m._Kuoruonas_vėrosa_epėdemėjė Q81068910 bclwiki 20503 Coronavirus-_Pagbungkaras_2019-20 Q81068910 bewiki 629170 Успышка_каронавіруса_2019-nCoV Q81068910 bgwiki 741190 Епидемия_от_коронавирус_2019-20 Q81068910 bnwiki 830337 ২০১৯_সালের_করোনাভাইরাসঘটিত_ব্যাধির_প্রাদুর্ভাব_(২০১৯-২০২০) Q81068910 brwiki 142318 Kleñved-red_Wuhan Q81068910 cawiki 1731391 Epidèmia_per_coronavirus_de_2019-2020 Q81068910 cswiki 1559185 Epidemie_koronaviru_SARS-CoV-2 Q81068910 cswiki 1565100 Epidemie_koronaviru_SARS-CoV-2 Q81068910 cvwiki 81068 2019-nCoV_коронавирус_хӗлхемӗ Q81068910 dawiki 1021283 Udbrud_af_COVID-19_i_Wuhan_2019-2020 Q81068910 dawiki 1022999 Udbrud_af_COVID-19_i_Wuhan_2019-2020 Q81068910 dewiki 11183087 Coronavirus-Epidemie_2019/2020 Q81068910 dewiki 11131846 Coronavirus-Epidemie_2019/2020 Q81068910 enwiki 63000665 2019–20_coronavirus_outbreak Q81068910 enwiki 62750956 2019–20_coronavirus_outbreak Q81068910 enwikivoyage 177486 2019–2020_coronavirus_outbreak Q81068910 eowiki 679250 Epidemio_de_COVID-19 Q81068910 eowikinews 6850 Epidemio_de_COVID-19 Q81068910 etwiki 575417 2019.–2020._aasta_koroonaviirushaiguse_puhang Q81068910 euwiki 931360 COVID-19_gaixotasunaren_2019-2020ko_agerraldia Q81068910 fawiki 5157569 شیوع_۲۰–۲۰۱۹_کروناویروس Q81068910 fiwiki 1560152 Koronavirusepidemia_2019–2020 Q81068910 frwiki 13125712 Épidémie_de_maladie_à_coronavirus_de_2019-2020 Q81068910 frwiki 13069866 Épidémie_de_maladie_à_coronavirus_de_2019-2020 Q81068910 frwikinews 84138 Épidémie_de_maladie_à_coronavirus_de_2019-2020 Q81068910 gawiki 91784 Ráig_coróinvíris_2019–2020 Q81068910 hakwiki 20094 2019-ngièn_chṳ_2020-ngièn_COVID-19_liù-hàng Q81068910 hewiki 1790961 התפרצות_נגיף_קורונה-ווהאן Q81068910 hewiki 1794362 התפרצות_נגיף_קורונה-ווהאן Q81068910 hiwiki 1131137 2019–20_वुहान_कोरोना_वायरस_प्रकोप Q81068910 htwiki 73699 Epidemi_kowonaviris_an_2019-2020 Q81068910 huwiki 1679312 COVID-19_koronavírus-járvány Q81068910 hywiki 961996 Նոր_կորոնավիրուսի_բռնկում_(2019-2020) Q81068910 idwiki 3001939 Wabah_penyakit_koronavirus_2019–2020 Q81068910 iswiki 156201 Kórónaveirufaraldur_2019-2020 Q81068910 itwiki 8303012 Epidemia_di_COVID-19_del_2019-2020 Q81068910 itwikiquote 166969 Epidemia_di_SARS-CoV-2_del_2019-2020 Q81068910 kawiki 458509 კორონავირუსის_ეპიდემია_2019-2020 Q81068910 kkwiki 633976 COVID-19_індеті Q81068910 kowiki 2643632 코로나바이러스감염증-19_유행 Q81068910 kowiki 2624467 코로나바이러스감염증-19_유행 Q81068910 kowiki 2642027 코로나바이러스감염증-19_유행 Q81068910 kuwiki 83579 Belavbûna_vîrûsa_Korona_2019 Q81068910 kywiki 120364 Ухань_коронавирусы_чыгышы Q81068910 lijwiki 18072 Premonîa_de_Wuhan Q81068910 ltwiki 570568 Koronaviruso_2019-nCoV_protrūkis Q81068910 lvwiki 450232 COVID-19_epidēmija Q81068910 mrwiki 252115 २०१९-२०२०_वुहान_कोरोना_व्हायरसचा_उद्रेक Q81068910 mswiki 995498 Wabak_COVID-19_(2019-20) Q81068910 mywiki 128482 ၂၀၁၉-၂၀_ကိုရိုနာဗိုင်းရပ်စ်_ဖြစ်ပွားမှု Q81068910 nlwiki 5312604 Uitbraak_coronavirus_(SARS-CoV-2)_in_2019 Q81068910 nlwikinews 22633 2019-nCoV Q81068910 plwiki 4701039 Szerzenie_się_zakażeń_wirusem_SARS-CoV-2 Q81068910 ptwiki 6175224 Surto_de_COVID-19 Q81068910 rowiki 2472772 Epidemia_de_coronavirus_(2019-nCoV) Q81068910 rowiki 2475440 Epidemia_de_coronavirus_(2019-nCoV) Q81068910 ruwiki 8173519 Вспышка_COVID-19 Q81068910 sahwiki 44813 Ухань_куоратка_коронавирус_тарҕаныыта Q81068910 scowiki 214894 2019–20_coronavirus_ootbrak Q81068910 scwiki 19311 Epidemia_de_Pneumònia_peri_Coronavirus_de_2019-2020 Q81068910 simplewiki 733960 2019–20_coronavirus_outbreak Q81068910 siwiki 84157 2019–20_වූහාන්_කොරෝනාවෛරසයෙහි_ව්යාප්තිය Q81068910 sqwiki 276736 Koronavirusi_2019/2020 Q81068910 srwiki 4121385 Епидемија_вируса_корона_2019/20. Q81068910 svwiki 8414277 Coronavirusutbrottet_2019–2020 Q81068910 svwiki 8415214 Coronavirusutbrottet_2019–2020 Q81068910 swwiki 121175 Mlipuko_wa_virusi_vya_korona_Wuhan_2019-20 Q81068910 thwiki 1076225 การระบาดของโคโรนาไวรัส_พ.ศ._2562–2563 Q81068910 trwiki 2536507 2019-2020_koronavirüs_salgını Q81068910 ugwiki 18498 2019-2020_ۋۇخەن_تاجىسىمان_ۋىرۇسنىڭ_شىددەتلىك_تارقىلىشى_ۋەقەسى Q81068910 ukwikinews 9201 Коронавірус Q81068910 ukwikiquote 33240 Спалах_коронавірусу_2019-nCoV Q81068910 urwiki 884409 کورونا_وائرس_کی_وبا،_2019ء_-_2020ء Q81068910 vecwiki 51867 Epidemia_de_SARS-CoV-2_del_2019-2020 Q81068910 viwiki 17814719 Dịch_COVID-19 Q81068910 wuuwiki 58758 2019新型冠状病毒肺炎事体 Q81068910 zh_classicalwiki 92198 己亥-庚子武漢病毒事 Q81068910 zh_min_nanwiki 776538 2019_nî–2020_nî_COVID-19_liû-hêng Q81068910 zh_yuewiki 213932 武漢肺炎大爆發 Q81068910 zhwiki 6861061 2019冠狀病毒病疫情 Q81068910 zhwikinews 42610 2020年新型冠状病毒肺炎