
Coordinate work on minor changes for Edit Data Quality
Closed, Resolved, Public

Description

The following is a list of changes we want to try and test on the MediaWiki history reconstruction pipeline. If you're going to grab an item, cross it off and add your name next to it. There are too many of these to make a separate task for each one, and some may turn out not to be useful or not to be real issues.

  • On move events, use the page id extracted from log_page, which we currently ignore:

https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/mediawikihistory/page/PageHistoryBuilder.scala#L481

  • Refactor Page Events to include "create" events from the logging table, and build artificial "create" events with the current logic that gets the first revision of a page. As part of that, consider the trick Joseph found: look in the archive table for the first revision when it happens to be archived for a restored page.
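
A minimal sketch of the second item, in plain Scala rather than Spark, to make the intended precedence concrete. It is not the PageHistoryBuilder implementation: the names Revision, CreateEvent, PageCreateSketch and createEvent are hypothetical, and it assumes timestamps are fixed-width MediaWiki-style strings (yyyyMMddHHmmss) so that string ordering matches time ordering.

```scala
// Hypothetical, simplified records; the real pipeline works on richer Spark rows.
final case class Revision(pageId: Long, timestamp: String, fromArchive: Boolean)
final case class CreateEvent(pageId: Long, timestamp: String, artificial: Boolean)

object PageCreateSketch {
  // Prefer an explicit "create" event from the logging table.
  // Otherwise fall back to the earliest known revision of the page,
  // whether it comes from the revision table or from the archive table.
  def createEvent(
      pageId: Long,
      loggedCreateTimestamp: Option[String],
      revisions: Seq[Revision]
  ): Option[CreateEvent] =
    loggedCreateTimestamp
      .map(ts => CreateEvent(pageId, ts, artificial = false))
      .orElse {
        revisions
          .filter(_.pageId == pageId)
          .map(_.timestamp)
          .sorted // assumption: fixed-width timestamps sort chronologically
          .headOption
          .map(ts => CreateEvent(pageId, ts, artificial = true))
      }
}
```

With one live revision and one older archived revision for the same page, the fallback picks the archived timestamp, which is the case the archive-table trick is meant to cover.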

Event Timeline

Milimetric created this task.
Milimetric moved this task from Next Up to In Progress on the Analytics-Kanban board.

Change 485710 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Update delete/restore in mediawiki-history

https://gerrit.wikimedia.org/r/485710

Change 485710 merged by jenkins-bot:
[analytics/refinery/source@master] Update delete/restore in mediawiki-history

https://gerrit.wikimedia.org/r/485710

Data check details:

  • I ran mediawiki-history-check on data generated by this patch, and the failures all come from expected changes:
    • No failures for user-data
    • Failures for page-data due to far fewer page_artificial_id values (2 impacted metrics: growth_distinct_all_page_id and growth_distinct_page_artificial_id)
    • Failures for denorm-data due to the presence of page-delete events that we previously filtered out.
  • I validated that page_id and page_artificial_id don't overlap, using the following queries:
val df = spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/history/snapshot=2019-02")
val dfp = spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/page_history/snapshot=2019-02")
val dfu = spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/user_history/snapshot=2019-02")

// Denormalized history, page events: rows with both a positive page_id and a non-empty page_artificial_id
df.where("event_entity = 'page' and page_id IS NOT NULL and page_id > 0 and page_artificial_id IS NOT NULL AND LENGTH(page_artificial_id) > 0").count()
// res1: Long = 0
// Page events with a non-positive page_id
df.where("event_entity = 'page' and page_id IS NOT NULL and page_id <= 0").count()
// res2: Long = 0
// Page events without a page_id
df.where("event_entity = 'page' and page_id IS NULL").count()
// res3: Long = 73663394
// Page events without a page_id but with a non-empty page_artificial_id (same count as res3)
df.where("event_entity = 'page' and page_id IS NULL and page_artificial_id IS NOT NULL AND LENGTH(page_artificial_id) > 0").count()
// res4: Long = 73663394

// Same four checks on the page_history dataset
dfp.where("page_id IS NOT NULL and page_id > 0 and page_artificial_id IS NOT NULL AND LENGTH(page_artificial_id) > 0").count()
// res5: Long = 0
dfp.where("page_id IS NOT NULL and page_id <= 0").count()
// res6: Long = 0
dfp.where("page_id IS NULL").count()
// res7: Long = 73663394
dfp.where("page_id IS NULL and page_artificial_id IS NOT NULL AND LENGTH(page_artificial_id) > 0").count()
// res8: Long = 73663394
  • I validated that timestamps are as expected:
    • Many null timestamps for page-create events, because I have unlinked page creation from first edit. We need to discuss this for the creation events of deleted pages (we still need to reconcile archived revisions to deleted pages by title; not done yet)
    • A small number of timestamps before 1990-01-01 for user alterblocks events (a refactor is needed for data correctness and for various other improvements)
    • Nothing after 2019-04
df.where("event_timestamp IS NULL").groupBy("event_entity", "event_type").count().show(10, false)
//+------------+----------+---------+
//|event_entity|event_type|count    |
//+------------+----------+---------+
//|page        |create    |44056737 |
//+------------+----------+---------+

df.where("substr(event_timestamp, 0, 10) < '1990-01-01'").groupBy("event_entity", "event_type").count().show(10, false)
//+------------+-----------+-----+                                                
//|event_entity|event_type |count|
//+------------+-----------+-----+
//|user        |alterblocks|67   |
//+------------+-----------+-----+

df.where("substr(event_timestamp, 0, 10) > '2019-04-01'").groupBy("event_entity", "event_type").count().show(10, false)
//+------------+----------+-----+                                                 
//|event_entity|event_type|count|
//+------------+----------+-----+
//+------------+----------+-----+
  • I validated that live revisions (not revision_is_deleted) have a defined page_id:
df.where("event_entity = 'revision' and (page_id is NULL OR page_id <= 0) and not revision_is_deleted").count()
//res14: Long = 0
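
The page_id / page_artificial_id queries above can be read as one invariant per page row: a row carries either a positive page_id or a non-empty page_artificial_id, never both and never neither. A minimal plain-Scala sketch of that classification follows; PageIdCheckSketch, IdState and classify are hypothetical names used only for illustration, and the real checks run as Spark SQL filters as shown above.

```scala
object PageIdCheckSketch {
  sealed trait IdState
  case object RealId extends IdState        // positive page_id, no artificial id
  case object ArtificialId extends IdState  // no page_id, non-empty artificial id
  case object Overlap extends IdState       // both defined (res1/res5 count these)
  case object NonPositiveId extends IdState // page_id <= 0 (res2/res6 count these)
  case object NoId extends IdState          // neither id defined

  // None stands in for SQL NULL in this sketch.
  def classify(pageId: Option[Long], artificialId: Option[String]): IdState = {
    val hasArtificial = artificialId.exists(_.nonEmpty)
    pageId match {
      case Some(id) if id <= 0      => NonPositiveId
      case Some(_) if hasArtificial => Overlap
      case Some(_)                  => RealId
      case None if hasArtificial    => ArtificialId
      case None                     => NoId
    }
  }
}
```

In these terms, the results above say that no row falls into Overlap or NonPositiveId, and that every row without a page_id is an ArtificialId row rather than a NoId row.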

The validation checks above were run without the patch for the page-history refactor (https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/493390), because that patch still generates erroneous data.
Not tested: joining page events by id, and using explicit page-create events.

Good checking. The user events from before the 90s are funny and weird, but I am hopeful the new patch fixes the event_timestamp for the page create events.

Change 485710 merged by Fdans:
[analytics/refinery/source@master] Update delete/restore in mediawiki-history

https://gerrit.wikimedia.org/r/485710