Page MenuHomePhabricator

Check whether mediawiki production event data is equivalent to mediawiki-history data over a month
Open, HighPublic

Description

To mimic mediawiki-history using events, we need to check that the underlying data provides the same value.

Check whether mediawiki production event data is equivalent to mediawiki-history data over a month such an incremental update can be built.

Event Timeline

JAllemandou created this task.
JAllemandou set Final Story Points to 5.

Bug found in mediawiki_revision_create events: some revision-create have multiple events (same revision-id, multiple event-requests) - Tracked in T262203.

Missing events for user_create and user_rename. Tracked in T262205.

CODE (For simplewiki only)

  • Prepare snapshot data from hudi-saved dataset.
import org.apache.spark.sql.SaveMode
import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.config.HoodieWriteConfig

val tableNameSimplewiki = "joal.test_hudi_mwh_simplewiki"
val basePathSimplewiki = "hdfs:///user/joal/test_hudi/mwh_simplewiki"

-- Load hudi dataframe
val hfs = spark.read.format("hudi").
     // Set table name
     option(HoodieWriteConfig.TABLE_NAME, tableNameSimplewiki).
     // Read snapshot
     option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
     option("hoodie.index.type", "BLOOM").
     //The number of wildcard asterisks here must be one greater than the number of partition
     load(basePathSimplewiki + "/*/*")

-- restrict data to 2020-07 and keep only id-related fields
val hfsn = hfs.where("snapshot='2020-07' AND year(event_timestamp) = 2020 and month(event_timestamp) = 7").
  select(
    "event_entity",
    "event_type",
    "event_timestamp",
    "event_user_id",
    "event_user_text_historical",
    "page_id",
    "page_artificial_id",
    "revision_id",
    "user_id",
    "row_key_hash").
  cache

-- Rename columns of the dataframe with prefix 'hudi_' to faciliate joining
val hfsnr = hfsn.columns.foldLeft(hfsn){ (acc, names) => acc.withColumnRenamed(names, "hudi_".concat(names))}.cache()

hfsn.groupBy(col("event_entity"), col("event_type")).count.sort("event_entity", "event_type").show(100, false)
/*
+------------+-----------+-----+
|event_entity|event_type |count|
+------------+-----------+-----+
|page        |create     |5775 |
|page        |create-page|5801 |
|page        |delete     |903  |
|page        |move       |261  |
|page        |restore    |8    |
|revision    |create     |38906|
|user        |alterblocks|16   |
|user        |altergroups|12   |
|user        |create     |5656 |
|user        |rename     |258  |
+------------+-----------+-----+
*/
  • Check revision-create events
val eventsRevCreate = spark.sql("""
SELECT
  md5(concat(
    database,                                 '|',
    'revision',                               '|',
    'create',                                 '|',
    -- timestamps of events needs special parsing as they are not sql-timestamp formatted
    -- Also, using rev_timestamp instead of meta.dt here as it is available
    CAST(unix_timestamp(rev_timestamp, 'yyyy-MM-dd\'T\'HH:mm:ss\'Z\'') AS STRING), '|',
    coalesce(performer.user_id, 0),           '|',
    coalesce(performer.user_text, ''),        '|',
    coalesce(page_id, 0),                     '|',
    '',                                       '|', -- no page_artificial_id
    coalesce(rev_id, 0),                      '|',
    0                                              -- no user_id
  )) as row_key_hash,
  'revision' AS event_entity,
  'create' AS event_type,
  rev_timestamp AS event_timestamp,
  performer.user_id AS event_user_id,
  performer.user_text as event_user_text_historical,
  page_id,
  NULL AS page_artificial_id,
  rev_id as revision_id,
  NULL as user_id
from event.mediawiki_revision_create
where year = 2020 and (
    (month = 7)
    -- Expanding boundaries in case there are 
    OR (month = 6 AND day = 30)
    OR (month = 8 AND day = 1))
  and database = 'simplewiki'
  and rev_timestamp >= '2020-07'
  and rev_timestamp < '2020-08'
""").
  distinct. // This distinct is important as there are duplicates
  cache()

eventsRevCreate.count
// 38784 (vs 38906)

-- Now joining on row-key-hash and checking
val jrev = hfsnr.where("hudi_event_entity = 'revision'").join(eventsRevCreate, col("hudi_row_key_hash") === col("row_key_hash"), "outer").cache()
jrev.count
// 39542
jrev.where("hudi_row_key_hash IS NULL").count
// 636
jrev.where("row_key_hash IS NULL").count
// 758
jrev.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL").count
// 38148

// Most rows share the same hash-key - Looks good
  • Check page-create events
val eventsPageCreate = spark.sql("""
SELECT
  md5(concat(
    database,                                 '|',
    'page',                                   '|',
    'create',                                 '|',
    -- timestamps of events needs special parsing as they are not sql-timestamp formatted
    -- Also, using rev_timestamp instead of meta.dt here as it is available
    CAST(unix_timestamp(rev_timestamp, 'yyyy-MM-dd\'T\'HH:mm:ss\'Z\'') AS STRING), '|',
    coalesce(performer.user_id, 0),           '|',
    coalesce(performer.user_text, ''),        '|',
    coalesce(page_id, 0),                     '|',
    '',                                       '|', -- page_artificial_id
    0,                                        '|', -- no rev_id for page-create in mwh while there is one in event!
    0                                              -- user_id
  )) as row_key_hash,
  'page' AS event_entity,
  'create' AS event_type,
  rev_timestamp AS event_timestamp,
  performer.user_id AS event_user_id,
  performer.user_text as event_user_text_historical,
  page_id,
  NULL AS page_artificial_id,
  rev_id as revision_id,
  NULL as user_id
from event.mediawiki_page_create
where year = 2020 and  month = 7
  and database = 'simplewiki'
""").cache()

eventsPageCreate.count
// 5918 (vs 5801)

-- Now joining on row-key-hash and checking
val jpc = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'create'").join(eventsPageCreate, col("hudi_row_key_hash") === col("row_key_hash"), "outer").cache()
jpc.count
// 6238
jpc.where("hudi_row_key_hash IS NULL").count
// 463
jpc.where("row_key_hash IS NULL").count
// 320
jpc.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL").count
// 5455

// Most rows also share the same hash-key, but the proportion of mismatch is high - investigation is important for that one
  • Check page-delete events
val eventsPageDelete = spark.sql("""
SELECT
  md5(concat(
    database,                                 '|',
    'page',                                   '|',
    'delete',                                 '|',
    -- timestamps of events needs special parsing as they are not sql-timestamp formatted
    CAST(unix_timestamp(meta.dt, 'yyyy-MM-dd\'T\'HH:mm:ss\'Z\'') AS STRING),'|',
    coalesce(performer.user_id, 0),           '|',
    coalesce(performer.user_text, ''),        '|',
    coalesce(page_id, 0),                     '|',
    '',                                       '|', -- page_artificial_id
    0,                                        '|', -- no rev_id in mwh
    0                                              -- user_id
  )) as row_key_hash,
  'page' AS event_entity,
  'delete' AS event_type,
  meta.dt AS event_timestamp,
  performer.user_id AS event_user_id,
  performer.user_text as event_user_text_historical,
  page_id,
  NULL AS page_artificial_id,
  rev_id as revision_id,
  NULL as user_id
from event.mediawiki_page_delete
where year = 2020 and month = 7
  and database = 'simplewiki'
""").cache()

eventsPageDelete.count
// 915 (vs 903)

-- Now joining on row-key-hash and checking
val jpd = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'delete'").join(eventsPageDelete, col("hudi_row_key_hash") === col("row_key_hash"), "outer").cache()
jpd.count
// 1021
jpd.where("hudi_row_key_hash IS NULL").count
// 118
jpd.where("row_key_hash IS NULL").count
// 106
jpd.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL").count
// 797

// Most rows also share the same hash-key, but the proportion of mismatch is relatively high - investigation is important for that one

// NOTE: I checked the mediawiki_page_suppress table and it was empty - maybe we'll need it for other wikis?
  • Check page-restore events
val eventsPageRestore = spark.sql("""
SELECT
  md5(concat(
    database,                                 '|',
    'page',                                   '|',
    'restore',                                '|',
    -- timestamps of events needs special parsing as they are not sql-timestamp formatted
    CAST(unix_timestamp(meta.dt, 'yyyy-MM-dd\'T\'HH:mm:ss\'Z\'') AS STRING),'|',
    coalesce(performer.user_id, 0),           '|',
    coalesce(performer.user_text, ''),        '|',
    coalesce(page_id, 0),                     '|',
    '',                                       '|', -- page_artificial_id
    0,                                        '|', -- no revId in mwh
    0                                              -- user_id
  )) as row_key_hash,
  'page' AS event_entity,
  'restore' AS event_type,
  meta.dt AS event_timestamp,
  performer.user_id AS event_user_id,
  performer.user_text as event_user_text_historical,
  page_id,
  NULL AS page_artificial_id,
  rev_id as revision_id,
  NULL as user_id
from event.mediawiki_page_undelete
where year = 2020 and month = 7
  and database = 'simplewiki'
""").cache()

eventsPageRestore.count
// 8 (vs 8)

-- Now joining on row-key-hash and checking
val jpr = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'restore'").join(eventsPageRestore, col("hudi_row_key_hash") === col("row_key_hash"), "outer").cache()
jpr.count
// 8
jpr.where("hudi_row_key_hash IS NULL").count
// 0
jpr.where("row_key_hash IS NULL").count
// 0
jpr.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL").count
// 8

// Full match - \o/
  • Check page-move events
val eventsPageMove = spark.sql("""
SELECT
  md5(concat(
    database,                                 '|',
    'page',                                   '|',
    'move',                                   '|',
    -- timestamps of events needs special parsing as they are not sql-timestamp formatted
    CAST(unix_timestamp(meta.dt, 'yyyy-MM-dd\'T\'HH:mm:ss\'Z\'') AS STRING),'|',
    coalesce(performer.user_id, 0),           '|',
    coalesce(performer.user_text, ''),        '|',
    coalesce(page_id, 0),                     '|',
    '',                                       '|', -- page_artificial_id
    0,                                        '|', -- no rev_id in mwh
    0                                              -- user_id
  )) as row_key_hash,
  'page' AS event_entity,
  'move' AS event_type,
  meta.dt AS event_timestamp,
  performer.user_id AS event_user_id,
  performer.user_text as event_user_text_historical,
  page_id,
  NULL AS page_artificial_id,
  rev_id as revision_id,
  NULL as user_id
from event.mediawiki_page_move
where year = 2020 and month = 7
  and database = 'simplewiki'
""").cache()

eventsPageMove.count
// 263 (vs 261)

val jpm = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'move'").join(eventsPageMove, col("hudi_row_key_hash") === col("row_key_hash"), "outer").cache()
jpm.count
// 476
jpm.where("hudi_row_key_hash IS NULL").count
// 215
jpm.where("row_key_hash IS NULL").count
// 213
jpm.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL").count
// 48

// More not-join than joined - investiagtion

val jpmts = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'move'").join(eventsPageMove, col("hudi_page_id") === col("page_id"), "outer").
  where("hudi_page_id IS NULL OR page_id IS NULL OR (abs(unix_timestamp(hudi_event_timestamp) - unix_timestamp(event_timestamp, 'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'')) < 5)").
  cache()
jpmts.count
// 263
jpmts.where("hudi_row_key_hash IS NULL").count
// 2
jpmts.where("row_key_hash IS NULL").count
// 0
jpmts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL").count
// 261

// The problem comes from timestamps - the event-meta.dt timestamp is different from the event_timestamp we get from the logging table...
// It will be interesting to check the other mismatches for this case

// Anyway - how can we define a primary-key????

Checks for discrenpencies (using previous comment setup):

  • page-delete
// Try to match events using timestamps
val jpdts = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'delete'").join(eventsPageDelete, col("hudi_page_id") === col("page_id"), "outer").
  selectExpr("hudi_row_key_hash", "row_key_hash", "(hudi_page_id IS NULL OR page_id IS NULL OR (abs(unix_timestamp(hudi_event_timestamp) - unix_timestamp(event_timestamp, 'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'')) < 5)) AS ts_match").
  cache()
jpdts.count
// 919
jpdts.where("hudi_row_key_hash IS NULL").count
// 12
jpdts.where("row_key_hash IS NULL").count
// 0
jpdts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL AND ts_match").count
// 901 -- Almost all match within 5s
jpdts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL AND not ts_match").count
// 6

// Looking at the 6 events that don't match with a timestamp within 5s (not very reasable sorry :():
hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'delete'").join(eventsPageDelete, col("hudi_page_id") === col("page_id"), "outer").
  where("hudi_page_id IS NOT NULL AND page_id IS NOT NULL AND (abs(unix_timestamp(hudi_event_timestamp) - unix_timestamp(event_timestamp, 'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'')) >= 5)").
  show(100, false)

+-----------------+---------------+---------------------+------------------+-------------------------------+------------+-----------------------+----------------+------------+--------------------------------+--------------------------------+------------+----------+--------------------+-------------+--------------------------+-------+------------------+-----------+-------+
|hudi_event_entity|hudi_event_type|hudi_event_timestamp |hudi_event_user_id|hudi_event_user_text_historical|hudi_page_id|hudi_page_artificial_id|hudi_revision_id|hudi_user_id|hudi_row_key_hash               |row_key_hash                    |event_entity|event_type|event_timestamp     |event_user_id|event_user_text_historical|page_id|page_artificial_id|revision_id|user_id|
+-----------------+---------------+---------------------+------------------+-------------------------------+------------+-----------------------+----------------+------------+--------------------------------+--------------------------------+------------+----------+--------------------+-------------+--------------------------+-------+------------------+-----------+-------+
|page             |delete         |2020-07-30 16:49:10.0|8630              |Chenzw                         |776311      |null                   |null            |null        |603321d3c0b792c959095627067d71ac|609894b5f351ef92eb2ec0e0d1c6dc5c|page        |delete    |2020-07-30T16:49:32Z|8630         |Chenzw                    |776311 |null              |7042637    |null   |
|page             |delete         |2020-07-06 17:06:54.0|2133              |Eptalon                        |676812      |null                   |null            |null        |9c21a46ffa5d77f131e03e129421e5e9|7c0b08708a3affb76173cd3a888be8bc|page        |delete    |2020-07-06T22:32:27Z|18539        |Djsasso                   |676812 |null              |7021769    |null   |
|page             |delete         |2020-07-03 07:54:39.0|22027             |Auntof6                        |285032      |null                   |null            |null        |503fa76a1763451b1d6b9472436738f8|3387c53369dc7af5e68b2377f4a3fda8|page        |delete    |2020-07-03T07:54:49Z|22027        |Auntof6                   |285032 |null              |7013107    |null   |
|page             |delete         |2020-07-06 22:32:27.0|18539             |Djsasso                        |676812      |null                   |null            |null        |7c0b08708a3affb76173cd3a888be8bc|6d244a642406d9852d186a124322ccd6|page        |delete    |2020-07-06T17:06:55Z|2133         |Eptalon                   |676812 |null              |7021769    |null   |
|page             |delete         |2020-07-27 08:15:16.0|22027             |Auntof6                        |700118      |null                   |null            |null        |1eaec820d3aa53746824410803aeb70b|27ae3b48edab900f71c18cdc1023a72e|page        |delete    |2020-07-27T08:15:26Z|22027        |Auntof6                   |700118 |null              |6577725    |null   |
|page             |delete         |2020-07-30 16:49:31.0|8630              |Chenzw                         |776311      |null                   |null            |null        |cdc521063916ffe15f3deecfcf362b32|603321d3c0b792c959095627067d71ac|page        |delete    |2020-07-30T16:49:10Z|8630         |Chenzw                    |776311 |null              |7042637    |null   |
+-----------------+---------------+---------------------+------------------+-------------------------------+------------+-----------------------+----------------+------------+--------------------------------+--------------------------------+------------+----------+--------------------+-------------+--------------------------+-------+------------------+-----------+-------+
  • page_create
val jpcts = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'create'").join(eventsPageCreate, col("hudi_page_id") === col("page_id"), "outer").
  selectExpr("hudi_row_key_hash", "row_key_hash", "(hudi_page_id IS NULL OR page_id IS NULL OR (abs(unix_timestamp(hudi_event_timestamp) - unix_timestamp(event_timestamp, 'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'')) < 5)) AS ts_match").
  cache()
jpcts.count
// 6170
jpcts.where("hudi_row_key_hash IS NULL").count
// 395 -- Still a lot of mismatch :(
jpcts.where("row_key_hash IS NULL").count
// 252 -- Here as well :( Maybe page_is_deleted?
jpcts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL AND ts_match").count
// 5455
jpcts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL AND NOT ts_match").count
// 68

hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'create'").join(eventsPageCreate, col("hudi_page_id") === col("page_id"), "outer").
  where("hudi_page_id IS NOT NULL AND page_id IS NOT NULL AND (abs(unix_timestamp(hudi_event_timestamp) - unix_timestamp(event_timestamp, 'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'')) >= 5)").
  show(100, false)
+-----------------+---------------+---------------------+------------------+---------------------------------------+------------+-----------------------+----------------+------------+----------------------------
----+--------------------------------+------------+----------+--------------------+-------------+---------------------------------------+-------+------------------+-----------+-------+
|hudi_event_entity|hudi_event_type|hudi_event_timestamp |hudi_event_user_id|hudi_event_user_text_historical        |hudi_page_id|hudi_page_artificial_id|hudi_revision_id|hudi_user_id|hudi_row_key_hash           
    |row_key_hash                    |event_entity|event_type|event_timestamp     |event_user_id|event_user_text_historical             |page_id|page_artificial_id|revision_id|user_id|
+-----------------+---------------+---------------------+------------------+---------------------------------------+------------+-----------------------+----------------+------------+----------------------------
----+--------------------------------+------------+----------+--------------------+-------------+---------------------------------------+-------+------------------+-----------+-------+
|page             |create         |2020-07-01 06:35:03.0|null              |106.205.11.40                          |771740      |null                   |null            |null        |1af5a34c7f74f606eefaed5e3078
749c|85dd947dd13bcf553ad007384f4f533e|page        |create    |2020-07-01T06:42:17Z|null         |106.205.11.40                          |771740 |null              |7013623    |null   |
|page             |create         |2020-07-01 15:30:29.0|null              |69.115.47.9                            |771875      |null                   |null            |null        |5e23915d9d3cd0b3916475aa6ef9
4608|e8ddf68dcaa93e3d63e9a35e75d1944e|page        |create    |2020-07-01T17:45:08Z|null         |69.115.47.9                            |771875 |null              |7014279    |null   |
|page             |create         |2020-07-23 14:35:05.0|null              |null                                   |777039      |null                   |null            |null        |9d7ef58cf29caa3a0775fd15bb2e
f744|2399cb71073a5ccbab4dcd13c50026a8|page        |create    |2020-07-26T19:07:46Z|null         |2604:2000:1481:8600:B905:1CA2:9C84:CD55|777039 |null              |7046945    |null   |
|page             |create         |2020-07-10 12:26:50.0|2133              |Eptalon                                |773914      |null                   |null            |null        |b8624c1f11a13cc7f35f0e8bbfc8
8c92|523bae9ad3e22bfbc56e09266ab4ce82|page        |create    |2020-07-10T21:32:45Z|2133         |Eptalon                                |773914 |null              |7027926    |null   |
|page             |create         |2020-07-21 12:45:22.0|null              |216.11.189.203                         |776157      |null                   |null            |null        |f22116e5cb8a679c78d9c2b29a95
9956|9ec0a18f0df0e5644297af9ee3825488|page        |create    |2020-07-21T19:17:35Z|null         |216.11.189.203                         |776157 |null              |7041575    |null   |
|page             |create         |2020-07-06 21:51:08.0|18539             |Djsasso                                |772897      |null                   |null            |null        |2488132ed5a03f92f3dd0f96c869
038a|fbce1e48d22c49395f0d016a5a7c0041|page        |create    |2020-07-06T22:07:14Z|18539        |Djsasso                                |772897 |null              |7022504    |null   |
|page             |create         |2020-07-23 15:45:08.0|1016980           |Kiagus Muhammad Hanif Sirua            |776615      |null                   |null            |null        |4ec5d85bd29325b10cd41472b279
c787|fc073744a9e20d81b253dcd77a737999|page        |create    |2020-07-24T00:26:12Z|1016980      |Kiagus Muhammad Hanif Sirua            |776615 |null              |7044338    |null   |
|page             |create         |2020-07-23 03:30:41.0|null              |null                                   |776508      |null                   |null            |null        |0548a8babae133796f3d3868d87a
5e2a|a9c1bedc654332524b39154394b90929|page        |create    |2020-07-23T14:34:26Z|null         |2604:2000:1481:8600:1859:D4A4:D541:479 |776508 |null              |7043672    |null   |
|page             |create         |2020-07-11 17:19:58.0|1015017           |Ruchi1403                              |774338      |null                   |null            |null        |3878527da8b548ce60bbce55e8af
b7ae|6225e236ffb2807d6ab449a3a9cfc3d0|page        |create    |2020-07-13T03:47:27Z|1015017      |Ruchi1403                              |774338 |null              |7030827    |null   |
|page             |create         |2020-07-26 12:10:11.0|null              |113.185.46.239                         |777020      |null                   |null            |null        |8f7447a20c361c8565e664eeb745
1ba0|096c16c74e9eba676f1b64fc2c267352|page        |create    |2020-07-26T15:34:36Z|null         |113.185.46.239                         |777020 |null              |7046740    |null   |
|page             |create         |2020-07-23 14:46:00.0|null              |2804:431:9705:123A:44AD:402D:CC1A:7CBD |776936      |null                   |null            |null        |8d7a19c81b2c5f5cdf7f2d765312
529d|3538d42cac905e750bf90e97d1edcf4c|page        |create    |2020-07-25T21:54:43Z|null         |2804:431:9705:123A:44AD:402D:CC1A:7CBD |776936 |null              |7046241    |null   |
|page             |create         |2020-07-03 12:06:50.0|null              |188.28.164.70                          |773753      |null                   |null            |null        |ee98d96c75bb00dd135eac9a9ca3
1eb3|4ca8c4707f307eea229ac4ba0e416f6e|page        |create    |2020-07-09T21:04:04Z|null         |188.28.164.70                          |773753 |null              |7026991    |null   |
|page             |create         |2020-07-08 14:58:40.0|null              |2600:100B:B01F:5D1D:8581:E76B:C057:E33E|773293      |null                   |null            |null        |387146ff9c469610217f15d7f096
c1d3|30049538a0307d04b81ff5e7e74964ae|page        |create    |2020-07-08T14:59:02Z|null         |2600:100B:B01F:5D1D:8581:E76B:C057:E33E|773293 |null              |7024926    |null   |
|page             |create         |2020-07-27 07:31:36.0|null              |96.232.204.207                         |777217      |null                   |null            |null        |ff801bb43ae1937581a9aab7c850
a9e4|b96e012225dddafc1aec916db50321fe|page        |create    |2020-07-27T18:44:23Z|null         |96.232.204.207                         |777217 |null              |7047883    |null   |
|page             |create         |2020-07-24 19:34:57.0|null              |103.133.200.3                          |776712      |null                   |null            |null        |7f2b86db57831d0d0aa21a2e48f5
6591|b74a026b77098ede7dd0808112191551|page        |create    |2020-07-24T19:36:47Z|null         |103.133.200.3                          |776712 |null              |7044995    |null   |
|page             |create         |2020-07-05 06:25:43.0|731605            |ImprovedWikiImprovment                 |777841      |null                   |null            |null        |f4d73b303deb6471e7a369d55ee2
9b1d|86f99e09d43584f9fe21b991d0b9d603|page        |create    |2020-07-31T07:35:54Z|731605       |ImprovedWikiImprovment                 |777841 |null              |7052181    |null   |
|page             |create         |2020-07-11 21:11:25.0|null              |183.171.158.2                          |774113      |null                   |null            |null        |064c14ba51b7ee801702ae6e5b4b
28ca|cdc7c074d18446e33aa2155ab7f0c120|page        |create    |2020-07-11T21:12:34Z|null         |183.171.158.2                          |774113 |null              |7029279    |null   |
|page             |create         |2020-07-18 12:46:14.0|null              |91.186.250.14                          |776269      |null                   |null            |null        |6bb0a417d47f4b460d912da3d806
8d21|2b16d0fabfbe0b87cff878c7e4443930|page        |create    |2020-07-22T10:43:02Z|null         |91.186.250.14                          |776269 |null              |7042178    |null   |
|page             |create         |2020-07-08 14:59:16.0|null              |2600:100B:B01F:5D1D:8581:E76B:C057:E33E|773299      |null                   |null            |null        |e5fd3a0068ab47d04e081c11b6a6
7175|beb557012d2168f665b4296a909545cf|page        |create    |2020-07-08T15:04:12Z|null         |2600:100B:B01F:5D1D:8581:E76B:C057:E33E|773299 |null              |7024934    |null   |
|page             |create         |2020-07-27 16:46:11.0|null              |2A02:C7D:5D63:1300:99F0:483B:4333:E8B3 |777199      |null                   |null            |null        |2b9a9f02cb903ff6dbb3b7f190c9
4ab5|95d89d5c9301d2ee9578966eb28e1044|page        |create    |2020-07-27T17:15:36Z|null         |2A02:C7D:5D63:1300:99F0:483B:4333:E8B3 |777199 |null              |7047744    |null   |
|page             |create         |2020-07-13 13:18:43.0|null              |82.132.238.246                         |774405      |null                   |null            |null        |e4cce9e33d963daf2bc87abb3ec9
82ce|58bcc7da0759e051dc145553229e800d|page        |create    |2020-07-13T13:21:06Z|null         |82.132.238.246                         |774405 |null              |7031240    |null   |
|page             |create         |2020-07-27 19:58:00.0|731605            |ImprovedWikiImprovment                 |777745      |null                   |null            |null        |dcc26c5db9f510a4ba159f250499
12fa|7952f4febe7aeb2bc7cd9d20a643884e|page        |create    |2020-07-30T19:32:59Z|731605       |ImprovedWikiImprovment                 |777745 |null              |7051600    |null   |
|page             |create         |2020-07-05 00:49:56.0|null              |null                                   |775790      |null                   |null            |null        |3be45c24b4244016246db70f73c1
b0bc|8a8c0b95fc5c9ee889ff5871ef0d2e98|page        |create    |2020-07-19T17:19:30Z|null         |104.230.23.164                         |775790 |null              |7039385    |null   |
|page             |create         |2020-07-01 18:14:17.0|2133              |Eptalon                                |771890      |null                   |null            |null        |de31de815585d8b3771760a668d2
78df|771dae02392f7693b57d1cb0bf805046|page        |create    |2020-07-01T18:20:26Z|2133         |Eptalon                                |771890 |null              |7014335    |null   |
|page             |create         |2020-07-16 12:18:19.0|null              |121.209.165.23                         |775164      |null                   |null            |null        |3ce07de39056c2bed378ba30095debe5|c5c32f73848033043dca2d4cf9e24440|page        |create    |2020-07-16T12:27:46Z|null         |121.209.165.23                         |775164 |null              |7035848    |null   |
|page             |create         |2020-07-21 19:21:59.0|null              |216.11.189.203                         |776701      |null                   |null            |null        |90b0516b0a8d8c7a4f70f17b4eb24c48|bb858788bdff445f7492e1efcc7757e4|page        |create    |2020-07-24T19:09:43Z|null         |216.11.189.203                         |776701 |null              |7044924    |null   |
|page             |create         |2020-07-05 06:25:09.0|22027             |Auntof6                                |777114      |null                   |null            |null        |eb372e8ade0219df45adc9823263baa0|5c6d1cc2785a4c1d3a68ebc1946668c0|page        |create    |2020-07-27T08:26:10Z|22027        |Auntof6                                |777114 |null              |7047405    |null   |
|page             |create         |2020-07-10 21:07:38.0|2133              |Eptalon                                |773910      |null                   |null            |null        |89c5d3759bec217e4c62dae85ca33914|d42083b8f1333681c153d4a776de5b9e|page        |create    |2020-07-10T21:10:57Z|2133         |Eptalon                                |773910 |null              |7027906    |null   |
|page             |create         |2020-07-15 10:47:35.0|null              |27.109.115.145                         |777500      |null                   |null            |null        |3244fb79ba82df3b29d5b0f9514aeaa6|e018d8ea407c13320d805b9538b59074|page        |create    |2020-07-29T05:51:15Z|null         |27.109.115.145                         |777500 |null              |7049825    |null   |
|page             |create         |2020-07-05 06:25:21.0|22027             |Auntof6                                |777112      |null                   |null            |null        |92843e6178c454d81a4053b060278cd5|42087db116876f0ba84a7b4ea97d532c|page        |create    |2020-07-27T08:22:07Z|22027        |Auntof6                                |777112 |null              |7047398    |null   |
|page             |create         |2020-07-27 19:19:53.0|null              |2604:2000:1481:8600:3873:EF4F:FFFE:8E43|777377      |null                   |null            |null        |f2cde5c18ef90e9a9a5006bee0844fae|5119076e8337686d73961a52f1e07ebc|page        |create    |2020-07-28T12:20:18Z|null         |2604:2000:1481:8600:3873:EF4F:FFFE:8E43|777377 |null              |7048861    |null   |
|page             |create         |2020-07-03 13:20:00.0|2133              |Eptalon                                |772205      |null                   |null            |null        |54c49ca4e2f02a60af6661f07de22315|d19e94fb5fb4c3484fd820eb7b88163c|page        |create    |2020-07-03T19:37:05Z|2133         |Eptalon                                |772205 |null              |7017062    |null   |
|page             |create         |2020-07-13 03:49:22.0|null              |2409:4064:2E26:2F1C:7C9B:ABB2:48B9:AC3F|776024      |null                   |null            |null        |0f8e38d897b49882966c6ee162d81bbc|c72659d87c8361ebd9af02476b696be7|page        |create    |2020-07-21T04:02:19Z|null         |2409:4064:2E26:2F1C:7C9B:ABB2:48B9:AC3F|776024 |null              |7040796    |null   |
|page             |create         |2020-07-24 18:03:36.0|null              |2604:2000:1481:8600:B905:1CA2:9C84:CD55|777023      |null                   |null            |null        |dbe0bc54000a59d181c297fa376be779|6454bc4b3e6e35f8a3e2c98263b86326|page        |create    |2020-07-26T16:27:26Z|null         |2604:2000:1481:8600:B905:1CA2:9C84:CD55|777023 |null              |7046779    |null   |
|page             |create         |2020-07-11 18:32:37.0|2133              |Eptalon                                |774095      |null                   |null            |null        |3f3baae6338375f179bad1e316cac423|953b40ec5179d4da19c5790c73013535|page        |create    |2020-07-11T19:28:59Z|2133         |Eptalon                                |774095 |null              |7029003    |null   |
|page             |create         |2020-07-04 18:06:57.0|null              |2409:4062:2E07:BB27:931C:7845:371D:3337|772352      |null                   |null            |null        |2d28ce12a7bf37d7398fc4acc3cb4dd1|458be40bbc3bef5134d7e5b67dcc9547|page        |create    |2020-07-04T18:20:54Z|null         |2409:4062:2E07:BB27:931C:7845:371D:3337|772352 |null              |7018319    |null   |
|page             |create         |2020-07-19 19:32:09.0|2133              |Eptalon                                |775814      |null                   |null            |null        |d4271d9c478f57be07302df4958c9727|4d2a77953e760afc6e29aa41e2d753e5|page        |create    |2020-07-19T20:38:50Z|2133         |Eptalon                                |775814 |null              |7039515    |null   |
|page             |create         |2020-07-17 17:05:51.0|null              |2A02:C7F:3E57:2000:8494:F116:FC3C:A6B  |775448      |null                   |null            |null        |a1a778d6124e23a4e25cb02b4ba130ff|10f069f3c43d297fe4a07cd07ea1782e|page        |create    |2020-07-17T22:26:57Z|null         |2A02:C7F:3E57:2000:8494:F116:FC3C:A6B  |775448 |null              |7037669    |null   |
|page             |create         |2020-07-05 06:25:33.0|935234            |Naddruf                                |774441      |null                   |null            |null        |4b6f6bbfa64e598f62e291f3d203514e|60054cfcab3bef1978619aa4b017615b|page        |create    |2020-07-13T17:12:31Z|935234       |Naddruf                                |774441 |null              |7031620    |null   |
|page             |create         |2020-07-14 12:22:38.0|null              |86.130.163.98                          |776469      |null                   |null            |null        |4c8a1c6116a6acafe823a3a84a62f1f1|d08c6286f86f20c219e6ee570cdf7944|page        |create    |2020-07-23T11:15:05Z|null         |86.130.163.98                          |776469 |null              |7043443    |null   |
|page             |create         |2020-07-05 00:49:56.0|null              |104.230.23.164                         |775787      |null                   |null            |null        |bb14ab1d8aac2a4d05f5dfc758b386a4|40b87eaf4120d8ffef03fcdd65b836af|page        |create    |2020-07-19T17:13:16Z|null         |104.230.23.164                         |775787 |null              |7039379    |null   |
|page             |create         |2020-07-25 20:31:08.0|51686             |حُميد                                  |777078      |null                   |null            |null        |7b19e3fdddeadb3efc6b6ddb6f57cb23|f718dfe205bd2bb881be443b60ef0f35|page        |create    |2020-07-26T23:31:04Z|51686        |حُميد                                  |777078 |null              |7047146    |null   |
|page             |create         |2020-07-03 16:54:49.0|1011679           |UN175                                  |776333      |null                   |null            |null        |fec95017672c25711166f1643b6c14ab|786f2bc64b7507aa285c60bdc79a9135|page        |create    |2020-07-22T20:44:13Z|1011679      |UN175                                  |776333 |null              |7042802    |null   |
|page             |create         |2020-07-24 19:16:27.0|null              |68.55.103.35                           |777233      |null                   |null            |null        |f84ad45d616c80733e978d6feb6615f6|38f75c81a012ba5c62c1613afe82f210|page        |create    |2020-07-27T20:33:43Z|null         |68.55.103.35                           |777233 |null              |7048052    |null   |
|page             |create         |2020-07-14 10:59:59.0|null              |178.120.18.16                          |774651      |null                   |null            |null        |cc7db8f3fce3707cdbda4d9e0f0b774c|5a984bca92c73aff3a8aa8179b1aabd0|page        |create    |2020-07-14T11:53:20Z|null         |178.120.18.16                          |774651 |null              |7032708    |null   |
|page             |create         |2020-07-28 12:07:01.0|68157             |Macdonald-ross                         |777376      |null                   |null            |null        |2d6cdb9b2aae179044da6e72ea98cb9b|7213b1f50d69c2dbcfb129f66791f808|page        |create    |2020-07-28T12:10:21Z|68157        |Macdonald-ross                         |777376 |null              |7048853    |null   |
|page             |create         |2020-07-11 19:45:56.0|1014967           |TypoPedia                              |774283      |null                   |null            |null        |16f5f74dbf37205b0a1723a368b91c54|c1eb5bf01bbf9bfe2b2a74121928a91b|page        |create    |2020-07-12T19:38:44Z|1014967      |TypoPedia                              |774283 |null              |7030466    |null   |
|page             |create         |2020-07-15 15:37:24.0|18539             |Djsasso                                |775021      |null                   |null            |null        |38a5226435d779d254b1208142b440b6|4384c2cda0cd7c4cc3f8bad1f2e0b075|page        |create    |2020-07-15T15:37:44Z|18539        |Djsasso                                |775021 |null              |7034776    |null   |
|page             |create         |2020-07-14 11:11:39.0|null              |178.120.18.16                          |774653      |null                   |null            |null        |61f1d9dca6f194a431d42e3192dbca93|bcbef864ef00fd5034c4ca0e7846f27c|page        |create    |2020-07-14T11:55:53Z|null         |178.120.18.16                          |774653 |null              |7032713    |null   |
|page             |create         |2020-07-27 07:44:43.0|null              |123.201.126.221                        |777608      |null                   |null            |null        |625e93c88c218d734b329937890de1b7|98b11a9905f9691ea08c3eb1e5966f91|page        |create    |2020-07-29T17:51:04Z|null         |123.201.126.221                        |777608 |null              |7050507    |null   |
|page             |create         |2020-07-19 20:46:00.0|null              |203.192.209.135                        |775940      |null                   |null            |null        |d6c424e68482a92c5c2436309e29a0be|4eed541b85ca5a69896ded93b92595c0|page        |create    |2020-07-20T13:18:17Z|null         |203.192.209.135                        |775940 |null              |7040222    |null   |
|page             |create         |2020-07-22 20:06:23.0|1016571           |Herɋuartda                             |776364      |null                   |null            |null        |7e0ca3f7a4d97ab2ada3c3f2ca0f8cdc|810498f91baa5f6a2455281e965b2804|page        |create    |2020-07-23T01:22:12Z|1016571      |Herɋuartda                             |776364 |null              |7043020    |null   |
|page             |create         |2020-07-08 09:51:55.0|2133              |Eptalon                                |773254      |null                   |null            |null        |3c71f0a12c3a5367e01178545f90e87e|4ae09ce7d1d5455122e491e6686235f8|page        |create    |2020-07-08T11:02:46Z|2133         |Eptalon                                |773254 |null              |7024657    |null   |
|page             |create         |2020-07-11 17:34:34.0|731605            |ImprovedWikiImprovment                 |776086      |null                   |null            |null        |0739197732ceac5e27b527d2508c96d8|b283af74182c17c8c80f9020282e7635|page        |create    |2020-07-21T11:27:37Z|731605       |ImprovedWikiImprovment                 |776086 |null              |7041129    |null   |
|page             |create         |2020-07-15 10:48:43.0|null              |122.174.248.82                         |774933      |null                   |null            |null        |8360b580582005dde01cddc82abf7f0c|8d4e1b08b1d2b0e89ca42cb4c69aa233|page        |create    |2020-07-15T10:50:59Z|null         |122.174.248.82                         |774933 |null              |7034337    |null   |
|page             |create         |2020-07-20 18:10:08.0|null              |182.187.89.237                         |776291      |null                   |null            |null        |6c17439fc6811a1db3d007d83eea6cdd|cec1b9e526c5ccc921f5fe6dddf6a919|page        |create    |2020-07-22T16:24:21Z|null         |182.187.89.237                         |776291 |null              |7042459    |null   |
|page             |create         |2020-07-03 22:26:35.0|null              |104.230.23.164                         |775789      |null                   |null            |null        |f83d39c944fa8705a4d03878c145151e|1a03bc918d8eaae6b8699e511d45a580|page        |create    |2020-07-19T17:18:12Z|null         |104.230.23.164                         |775789 |null              |7039384    |null   |
|page             |create         |2020-07-16 08:58:00.0|1014250           |SimpleBoi                              |775138      |null                   |null            |null        |94731f976067bfa4aaf55bfec3797f3d|3e048f04e91383fac587c15acef6fbeb|page        |create    |2020-07-16T09:00:07Z|1014250      |SimpleBoi                              |775138 |null              |7035698    |null   |
|page             |create         |2020-07-02 14:54:45.0|null              |185.67.177.137                         |772010      |null                   |null            |null        |906cd1def6bbbf7a750016586581e621|7c73b9c571089cb7fc34699a66cc2ad0|page        |create    |2020-07-02T14:55:36Z|null         |185.67.177.137                         |772010 |null              |7015423    |null   |
|page             |create         |2020-07-23 12:26:42.0|null              |2409:4073:407:CCC5:A29D:AE4A:46C9:E628 |776790      |null                   |null            |null        |00ccd5bf3f68f41a50f7f8d4aaa134fb|57ccae11f0c3205a25b3c5427aba8d2d|page        |create    |2020-07-25T06:07:04Z|null         |2409:4073:407:CCC5:A29D:AE4A:46C9:E628 |776790 |null              |7045424    |null   |
|page             |create         |2020-07-05 00:49:56.0|null              |104.230.23.164                         |775798      |null                   |null            |null        |e08687276bdb7f488ea269fe7a168601|384253c82036282e23c8e19a7eedba5d|page        |create    |2020-07-19T17:40:06Z|null         |104.230.23.164                         |775798 |null              |7039401    |null   |
|page             |create         |2020-07-19 20:46:18.0|null              |98.204.187.175                         |777223      |null                   |null            |null        |211a025c5cb013f4f6d8d43d29b22bdf|cef2acd16426fb5152f2359ff4cc78e1|page        |create    |2020-07-27T19:22:37Z|null         |98.204.187.175                         |777223 |null              |7047934    |null   |
|page             |create         |2020-07-04 06:40:35.0|null              |2409:4062:2E0B:CD10:49BB:E6CF:F9C5:68E9|772348      |null                   |null            |null        |9882235dc34668977e5fd53a3b1b6f5b|fbc1b16ae7539d290e507df344acc0c6|page        |create    |2020-07-04T17:41:59Z|null         |2409:4062:2E0B:CD10:49BB:E6CF:F9C5:68E9|772348 |null              |7018286    |null   |
|page             |create         |2020-07-23 16:43:41.0|null              |201.157.252.176                        |777198      |null                   |null            |null        |2560e393bc7c05c13e40e9401b9939c7|d747e9c48f63fdc4351ec28c2824a588|page        |create    |2020-07-27T17:15:15Z|null         |201.157.252.176                        |777198 |null              |7047743    |null   |
|page             |create         |2020-07-02 23:27:01.0|null              |103.78.225.18                          |772668      |null                   |null            |null        |bef4a9cb9dbd66b046ee3ae0c87a9ef4|45e87713cec3b4566456aa609302476d|page        |create    |2020-07-06T00:35:23Z|null         |103.78.225.18                          |772668 |null              |7021010    |null   |
|page             |create         |2020-07-13 13:09:06.0|null              |82.132.238.246                         |774401      |null                   |null            |null        |fe334a53259e9f372781a795a7a09f50|85a334491245ac70bcc452ed6fb1ec33|page        |create    |2020-07-13T13:16:14Z|null         |82.132.238.246                         |774401 |null              |7031224    |null   |
|page             |create         |2020-07-03 22:26:35.0|null              |2600:1017:B40C:71D1:70CA:95AE:5487:87C7|772648      |null                   |null            |null        |8f61a6e607f594069b21890675ce390a|f28288f13bca005a53afc9e8cc22830c|page        |create    |2020-07-05T23:21:44Z|null         |2600:1017:B40C:71D1:70CA:95AE:5487:87C7|772648 |null              |7020944    |null   |
|page             |create         |2020-07-17 17:04:12.0|null              |203.192.244.66                         |775707      |null                   |null            |null        |7da9908ccae63b9f76c31e150d4692ce|3b0495f56c02882d22a3f1ae866f4718|page        |create    |2020-07-19T11:13:40Z|null         |203.192.244.66                         |775707 |null              |7038973    |null   |
+-----------------+---------------+---------------------+------------------+---------------------------------------+------------+-----------------------+----------------+------------+--------------------------------+--------------------------------+------------+----------+--------------------+-------------+---------------------------------------+-------+------------------+-----------+-------+
  • revision-create
val jrevts = hfsnr.where("hudi_event_entity = 'revision'").join(eventsRevCreate, col("hudi_revision_id") === col("revision_id"), "outer").
  selectExpr("hudi_row_key_hash", "row_key_hash", "(hudi_revision_id IS NULL OR revision_id IS NULL OR (abs(unix_timestamp(hudi_event_timestamp) - unix_timestamp(event_timestamp, 'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'')) < 5)) AS ts_match").
  cache()
jrevts.count
// 39532
jrevts.where("hudi_row_key_hash IS NULL").count
// 613
jrevts.where("row_key_hash IS NULL").count
// 748
jrevts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL AND ts_match").count
// 38171
jrevts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL AND NOT ts_match").count
// 0

// No timestamp difference here - need to find a different for mismatch.
Milimetric triaged this task as High priority.Sep 10 2020, 3:42 PM
Milimetric moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

Checks for discrenpencies (using previous comment setup):

  • page-delete

...

I noticed here that events to the same page were criss-crossing and they're actually all really close in time, here they are split into two tables and re-ordered to match (btw if you keep tables outside of the code formatting they format nicely in phab):

hudi_event_entityhudi_event_typehudi_event_timestamphudi_event_user_idhudi_event_user_text_historicalhudi_page_id
pagedelete2020-07-30 16:49:10.08630Chenzw776311
pagedelete2020-07-06 17:06:54.02133Eptalon676812
pagedelete2020-07-03 07:54:39.022027Auntof6285032
pagedelete2020-07-06 22:32:27.018539Djsasso676812
pagedelete2020-07-27 08:15:16.022027Auntof6700118
pagedelete2020-07-30 16:49:31.08630Chenzw776311
event_entityevent_typeevent_timestampevent_user_idevent_user_text_historicalpage_idrevision_id
pagedelete2020-07-30T16:49:32Z8630Chenzw7763117042637
pagedelete2020-07-06T17:06:55Z2133Eptalon6768127021769
pagedelete2020-07-03T07:54:49Z22027Auntof62850327013107
pagedelete2020-07-06T22:32:27Z18539Djsasso6768127021769
pagedelete2020-07-27T08:15:26Z22027Auntof67001186577725
pagedelete2020-07-30T16:49:10Z8630Chenzw7763117042637

So the main problem seems to be that hudi_revision_id is null and the times are off by up to 30 seconds or so.

I had no other ideas on the other two checks yet.

Last news on revision_create for simplewiki 2020-07:

  • All kafka-events match mediawiki-history except the ones with deleted-parts (mostly because performer_id is hidden, leading to a mismatch in row_hash_key) - Should be fixable with revision-visibility-change events.
  • Some events of mediawiki-history are missing kafka-events - see T215001
  • Some kafka-events are duplicated - see T262203
Nuria renamed this task from Check that mediawiki-events match mediawiki-history changes over a month to Check whether mediawiki production event data is equivalent to mediawiki-history data over a month .Sep 23 2020, 6:21 PM
Nuria updated the task description. (Show Details)

Final note.
I have found 2 problems with mediawiki-events (over simplewiki only, using mediawiki-history as a baseline):

  • Some events are lost (seen in revision-create, page-create, page-move) - documented in T215001.
  • Some events are duplicated (seen in revision-create only) - documented in T262203.

I have found new features to be added to events to allow to recreate mediawiki-history using events:

  • Add user-create and user-rename events - documented in T262205
  • Add logging events (corresponding to an addition to the logging table) and a reference to those events in the main related events. For instance every page-create, page-move, user-create etc events would point to their corresponding logging-event - Documented in T263055