For example, according to the on-wiki logs, the page "Jeff Caldwell (soccer)" on enwiki was deleted three times, restored once, and then moved.
But mediawiki_history only records the last two, and the move is actually marked as a creation. It also doesn't include any of the initial creations.
select
    event_type,
    event_timestamp,
    event_user_text,
    page_title,
    page_title_historical,
    page_id
from wmf.mediawiki_history
where
    event_entity = "page" and
    wiki_db = "enwiki" and
    (page_title_historical = "Jeff_Caldwell_(soccer)" or page_title = "Jeff_Caldwell_(soccer)") and
    snapshot = "2018-08"

   event_type  event_timestamp        event_user_text  page_title                             page_title_historical   page_id
0  create      2018-07-19 13:00:57.0  Freefalling660   Freefalling660/Jeff_Caldwell_(soccer)  Jeff_Caldwell_(soccer)  57939448
1  restore     2018-07-31 17:33:57.0  Hut 8.5          Freefalling660/Jeff_Caldwell_(soccer)  Jeff_Caldwell_(soccer)  57939448

mediawiki_page_history records a lot more events, but there are several duplicates and I find the schema much more confusing. I'm including only the query because the result is too long to print.
select
    page_id,
    page_id_artificial,
    page_title,
    page_title_historical,
    start_timestamp,
    end_timestamp,
    caused_by_event_type,
    caused_by_user_id
from wmf.mediawiki_page_history
where
    wiki_db = "enwiki" and
    (page_title_historical = "Jeff_Caldwell_(soccer)" or page_title = "Jeff_Caldwell_(soccer)") and
    snapshot = "2018-08"
order by start_timestamp asc
limit 1000

As another example, the page "Accidente ferroviario de Cerrillos de 1956" on eswiki has had quite a few events, but has no page events at all in mediawiki_history (the same is true of mediawiki_page_history).
select
    event_type,
    event_timestamp,
    event_user_text,
    page_id
from wmf.mediawiki_history
where
    event_entity = "page" and
    wiki_db = "eswiki" and
    (page_title_historical = "Accidente ferroviario de Cerrillos de 1956" or page_title = "Accidente ferroviario de Cerrillos de 1956") and
    snapshot = "2018-08"

Is the data supposed to be this unreliable? Shouldn't mediawiki_history and mediawiki_page_history be consistent with each other?
On the wiki page, I see a note from almost a year ago saying that "History of pages with complex delete/restore patterns is on purpose not yet corretly worked. Will happen after Wikistats-2 release", but I feel like these issues are bigger than that implies.
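
To put a number on the duplicates in mediawiki_page_history mentioned above, something like the following might help. It is only a rough sketch that has not been run against the cluster: it assumes "duplicate" means rows that are identical across the same columns selected in the earlier query, and `copies` is just an illustrative alias, not a real column.

```sql
-- Sketch: list rows that appear more than once for this page,
-- treating rows as duplicates when every selected column matches.
select
    page_id,
    page_id_artificial,
    page_title,
    page_title_historical,
    start_timestamp,
    end_timestamp,
    caused_by_event_type,
    caused_by_user_id,
    count(*) as copies  -- illustrative alias: number of identical rows
from wmf.mediawiki_page_history
where
    wiki_db = "enwiki" and
    (page_title_historical = "Jeff_Caldwell_(soccer)" or page_title = "Jeff_Caldwell_(soccer)") and
    snapshot = "2018-08"
group by
    page_id,
    page_id_artificial,
    page_title,
    page_title_historical,
    start_timestamp,
    end_timestamp,
    caused_by_event_type,
    caused_by_user_id
having count(*) > 1
order by start_timestamp asc
```

If this returns nothing, the rows that look duplicated presumably differ in some column not selected here, which would itself be worth knowing.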