Page MenuHomePhabricator

page_creation_timestamp not always correct in mediawiki_history
Closed, ResolvedPublic

Description

Not sure if this is related to T205594 but in my work on T213597 I found that page_creation_timestamp in wmf.mediawiki_history is not always correct:

USE wmf;
SELECT
  wiki_db,
  IF(event_timestamp = page_creation_timestamp, 'matches', 'does not match') AS initial_rev_page_creation_ts,
  COUNT(1) AS n_pages
FROM mediawiki_history
WHERE snapshot = '2018-12'
  AND wiki_db IN('enwiki', 'commonswiki')
  AND event_entity = 'revision'
  AND revision_parent_id = 0 -- initial rev
  AND NOT revision_is_deleted
GROUP BY wiki_db, IF(event_timestamp = page_creation_timestamp, 'matches', 'does not match');

So looks like it affects ~0.2% of pages on Commons and 1.2% of English Wikipedia pages (0.7% of articles specifically):

wiki_dbdoes not matchmatchesproportion of total
commonswiki134788692741590.194%
enwiki572631464578941.218%

Examples

USE wmf;
SELECT
  page_id, page_title,
  event_timestamp, page_creation_timestamp
FROM mediawiki_history
WHERE snapshot = '2018-12'
  AND wiki_db = 'commonswiki'
  AND event_entity = 'revision'
  AND revision_parent_id = 0 -- initial rev
  AND page_namespace = 6
  AND NOT revision_is_deleted
  AND event_timestamp != page_creation_timestamp
LIMIT 100;
page_idpage_titleevent_timestamppage_creation_timestamprevision history linkfirst entry in revision history
2721713U.S._Territorial_Acquisitions.en.alt1.jpg2007-09-10 16:27:25.02012-06-16 12:04:19.0revision history16:27, 10 September 2007
30269399UN_General_Assembly_Resolution_66_(1).pdf2013-12-21 08:19:34.02013-12-23 17:36:24.0revision history08:19, 21 December 2013

Event Timeline

mpopov moved this task from Triage to Tracking on the Product-Analytics board.
fdans moved this task from Incoming to Data Quality on the Analytics board.

This is solved in snapshot 2019-05 onward.
Some explanation:

  • The page_first_edit_timestamp is the field containing the interesting value, not page_creation_timestamp as this one should reflect the timestamp of the first create event. Most of the time, they are equal, but they can differ for pages having complicated histories with deletes and restores.
  • The page_first_edit_timestamp is not always equal to the timestamp of revision having parent_page_id = 0, as the dataset also use archive revision (therefore the first revision can be an archived one), and because complex histories can also lead to multiple revisions having parent_page_id = 0 in their history.

Check:

spark.sql("""
WITH first_edits AS (
  SELECT
    wiki_db,
    page_id,
    page_first_edit_timestamp,
    MIN(event_timestamp) as first_edit
  FROM wmf.mediawiki_history
  WHERE snapshot = '2019-05'
    AND event_entity = 'revision'
    AND page_id IS NOT NULL
    AND page_id > 0
    AND page_first_edit_timestamp is not null
  GROUP BY
    wiki_db, page_id, page_is_deleted, page_first_edit_timestamp
)

SELECT
  IF(first_edit = page_first_edit_timestamp, 'matches', 'does not match') as initial_rev_page_creation_ts,
  COUNT(1) AS n_pages
FROM first_edits f
GROUP BY IF(first_edit = page_first_edit_timestamp, 'matches', 'does not match')
""").show(100, false)


+----------------------------+---------+                                        
|initial_rev_page_creation_ts|n_pages  |
+----------------------------+---------+
|matches                     |446279533|
|does not match              |240      |
+----------------------------+---------+

Ping @Milimetric do we need to update docs in wikitech after last refactor? https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history cause page_first_edit_timestamp field does not appear on docs