Page MenuHomePhabricator

Check whether mediawiki production event data is equivalent to mediawiki-history data over a month
Closed, ResolvedPublic

Description

To mimic mediawiki-history using events, we need to check that the underlying data provides the same value.

Check whether mediawiki production event data is equivalent to mediawiki-history data over a month such an incremental update can be built.

Event Timeline

JAllemandou created this task.
JAllemandou set Final Story Points to 5.

Bug found in mediawiki_revision_create events: some revision-create have multiple events (same revision-id, multiple event-requests) - Tracked in T262203.

Missing events for user_create and user_rename. Tracked in T262205.

CODE (For simplewiki only)

  • Prepare snapshot data from hudi-saved dataset.
import org.apache.spark.sql.SaveMode
import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.config.HoodieWriteConfig

val tableNameSimplewiki = "joal.test_hudi_mwh_simplewiki"
val basePathSimplewiki = "hdfs:///user/joal/test_hudi/mwh_simplewiki"

-- Load hudi dataframe
val hfs = spark.read.format("hudi").
     // Set table name
     option(HoodieWriteConfig.TABLE_NAME, tableNameSimplewiki).
     // Read snapshot
     option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
     option("hoodie.index.type", "BLOOM").
     //The number of wildcard asterisks here must be one greater than the number of partition
     load(basePathSimplewiki + "/*/*")

-- restrict data to 2020-07 and keep only id-related fields
val hfsn = hfs.where("snapshot='2020-07' AND year(event_timestamp) = 2020 and month(event_timestamp) = 7").
  select(
    "event_entity",
    "event_type",
    "event_timestamp",
    "event_user_id",
    "event_user_text_historical",
    "page_id",
    "page_artificial_id",
    "revision_id",
    "user_id",
    "row_key_hash").
  cache

-- Rename columns of the dataframe with prefix 'hudi_' to faciliate joining
val hfsnr = hfsn.columns.foldLeft(hfsn){ (acc, names) => acc.withColumnRenamed(names, "hudi_".concat(names))}.cache()

hfsn.groupBy(col("event_entity"), col("event_type")).count.sort("event_entity", "event_type").show(100, false)
/*
+------------+-----------+-----+
|event_entity|event_type |count|
+------------+-----------+-----+
|page        |create     |5775 |
|page        |create-page|5801 |
|page        |delete     |903  |
|page        |move       |261  |
|page        |restore    |8    |
|revision    |create     |38906|
|user        |alterblocks|16   |
|user        |altergroups|12   |
|user        |create     |5656 |
|user        |rename     |258  |
+------------+-----------+-----+
*/
  • Check revision-create events
val eventsRevCreate = spark.sql("""
SELECT
  md5(concat(
    database,                                 '|',
    'revision',                               '|',
    'create',                                 '|',
    -- timestamps of events needs special parsing as they are not sql-timestamp formatted
    -- Also, using rev_timestamp instead of meta.dt here as it is available
    CAST(unix_timestamp(rev_timestamp, 'yyyy-MM-dd\'T\'HH:mm:ss\'Z\'') AS STRING), '|',
    coalesce(performer.user_id, 0),           '|',
    coalesce(performer.user_text, ''),        '|',
    coalesce(page_id, 0),                     '|',
    '',                                       '|', -- no page_artificial_id
    coalesce(rev_id, 0),                      '|',
    0                                              -- no user_id
  )) as row_key_hash,
  'revision' AS event_entity,
  'create' AS event_type,
  rev_timestamp AS event_timestamp,
  performer.user_id AS event_user_id,
  performer.user_text as event_user_text_historical,
  page_id,
  NULL AS page_artificial_id,
  rev_id as revision_id,
  NULL as user_id
from event.mediawiki_revision_create
where year = 2020 and (
    (month = 7)
    -- Expanding boundaries in case there are 
    OR (month = 6 AND day = 30)
    OR (month = 8 AND day = 1))
  and database = 'simplewiki'
  and rev_timestamp >= '2020-07'
  and rev_timestamp < '2020-08'
""").
  distinct. // This distinct is important as there are duplicates
  cache()

eventsRevCreate.count
// 38784 (vs 38906)

-- Now joining on row-key-hash and checking
val jrev = hfsnr.where("hudi_event_entity = 'revision'").join(eventsRevCreate, col("hudi_row_key_hash") === col("row_key_hash"), "outer").cache()
jrev.count
// 39542
jrev.where("hudi_row_key_hash IS NULL").count
// 636
jrev.where("row_key_hash IS NULL").count
// 758
jrev.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL").count
// 38148

// Most rows share the same hash-key - Looks good
  • Check page-create events
val eventsPageCreate = spark.sql("""
SELECT
  md5(concat(
    database,                                 '|',
    'page',                                   '|',
    'create',                                 '|',
    -- timestamps of events needs special parsing as they are not sql-timestamp formatted
    -- Also, using rev_timestamp instead of meta.dt here as it is available
    CAST(unix_timestamp(rev_timestamp, 'yyyy-MM-dd\'T\'HH:mm:ss\'Z\'') AS STRING), '|',
    coalesce(performer.user_id, 0),           '|',
    coalesce(performer.user_text, ''),        '|',
    coalesce(page_id, 0),                     '|',
    '',                                       '|', -- page_artificial_id
    0,                                        '|', -- no rev_id for page-create in mwh while there is one in event!
    0                                              -- user_id
  )) as row_key_hash,
  'page' AS event_entity,
  'create' AS event_type,
  rev_timestamp AS event_timestamp,
  performer.user_id AS event_user_id,
  performer.user_text as event_user_text_historical,
  page_id,
  NULL AS page_artificial_id,
  rev_id as revision_id,
  NULL as user_id
from event.mediawiki_page_create
where year = 2020 and  month = 7
  and database = 'simplewiki'
""").cache()

eventsPageCreate.count
// 5918 (vs 5801)

-- Now joining on row-key-hash and checking
val jpc = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'create'").join(eventsPageCreate, col("hudi_row_key_hash") === col("row_key_hash"), "outer").cache()
jpc.count
// 6238
jpc.where("hudi_row_key_hash IS NULL").count
// 463
jpc.where("row_key_hash IS NULL").count
// 320
jpc.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL").count
// 5455

// Most rows also share the same hash-key, but the proportion of mismatch is high - investigation is important for that one
  • Check page-delete events
val eventsPageDelete = spark.sql("""
SELECT
  md5(concat(
    database,                                 '|',
    'page',                                   '|',
    'delete',                                 '|',
    -- timestamps of events needs special parsing as they are not sql-timestamp formatted
    CAST(unix_timestamp(meta.dt, 'yyyy-MM-dd\'T\'HH:mm:ss\'Z\'') AS STRING),'|',
    coalesce(performer.user_id, 0),           '|',
    coalesce(performer.user_text, ''),        '|',
    coalesce(page_id, 0),                     '|',
    '',                                       '|', -- page_artificial_id
    0,                                        '|', -- no rev_id in mwh
    0                                              -- user_id
  )) as row_key_hash,
  'page' AS event_entity,
  'delete' AS event_type,
  meta.dt AS event_timestamp,
  performer.user_id AS event_user_id,
  performer.user_text as event_user_text_historical,
  page_id,
  NULL AS page_artificial_id,
  rev_id as revision_id,
  NULL as user_id
from event.mediawiki_page_delete
where year = 2020 and month = 7
  and database = 'simplewiki'
""").cache()

eventsPageDelete.count
// 915 (vs 903)

-- Now joining on row-key-hash and checking
val jpd = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'delete'").join(eventsPageDelete, col("hudi_row_key_hash") === col("row_key_hash"), "outer").cache()
jpd.count
// 1021
jpd.where("hudi_row_key_hash IS NULL").count
// 118
jpd.where("row_key_hash IS NULL").count
// 106
jpd.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL").count
// 797

// Most rows also share the same hash-key, but the proportion of mismatch is relatively high - investigation is important for that one

// NOTE: I checked the mediawiki_page_suppress table and it was empty - maybe we'll need it for other wikis?
  • Check page-restore events
val eventsPageRestore = spark.sql("""
SELECT
  md5(concat(
    database,                                 '|',
    'page',                                   '|',
    'restore',                                '|',
    -- timestamps of events needs special parsing as they are not sql-timestamp formatted
    CAST(unix_timestamp(meta.dt, 'yyyy-MM-dd\'T\'HH:mm:ss\'Z\'') AS STRING),'|',
    coalesce(performer.user_id, 0),           '|',
    coalesce(performer.user_text, ''),        '|',
    coalesce(page_id, 0),                     '|',
    '',                                       '|', -- page_artificial_id
    0,                                        '|', -- no revId in mwh
    0                                              -- user_id
  )) as row_key_hash,
  'page' AS event_entity,
  'restore' AS event_type,
  meta.dt AS event_timestamp,
  performer.user_id AS event_user_id,
  performer.user_text as event_user_text_historical,
  page_id,
  NULL AS page_artificial_id,
  rev_id as revision_id,
  NULL as user_id
from event.mediawiki_page_undelete
where year = 2020 and month = 7
  and database = 'simplewiki'
""").cache()

eventsPageRestore.count
// 8 (vs 8)

-- Now joining on row-key-hash and checking
val jpr = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'restore'").join(eventsPageRestore, col("hudi_row_key_hash") === col("row_key_hash"), "outer").cache()
jpr.count
// 8
jpr.where("hudi_row_key_hash IS NULL").count
// 0
jpr.where("row_key_hash IS NULL").count
// 0
jpr.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL").count
// 8

// Full match - \o/
  • Check page-move events
val eventsPageMove = spark.sql("""
SELECT
  md5(concat(
    database,                                 '|',
    'page',                                   '|',
    'move',                                   '|',
    -- timestamps of events needs special parsing as they are not sql-timestamp formatted
    CAST(unix_timestamp(meta.dt, 'yyyy-MM-dd\'T\'HH:mm:ss\'Z\'') AS STRING),'|',
    coalesce(performer.user_id, 0),           '|',
    coalesce(performer.user_text, ''),        '|',
    coalesce(page_id, 0),                     '|',
    '',                                       '|', -- page_artificial_id
    0,                                        '|', -- no rev_id in mwh
    0                                              -- user_id
  )) as row_key_hash,
  'page' AS event_entity,
  'move' AS event_type,
  meta.dt AS event_timestamp,
  performer.user_id AS event_user_id,
  performer.user_text as event_user_text_historical,
  page_id,
  NULL AS page_artificial_id,
  rev_id as revision_id,
  NULL as user_id
from event.mediawiki_page_move
where year = 2020 and month = 7
  and database = 'simplewiki'
""").cache()

eventsPageMove.count
// 263 (vs 261)

val jpm = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'move'").join(eventsPageMove, col("hudi_row_key_hash") === col("row_key_hash"), "outer").cache()
jpm.count
// 476
jpm.where("hudi_row_key_hash IS NULL").count
// 215
jpm.where("row_key_hash IS NULL").count
// 213
jpm.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL").count
// 48

// More not-join than joined - investiagtion

val jpmts = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'move'").join(eventsPageMove, col("hudi_page_id") === col("page_id"), "outer").
  where("hudi_page_id IS NULL OR page_id IS NULL OR (abs(unix_timestamp(hudi_event_timestamp) - unix_timestamp(event_timestamp, 'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'')) < 5)").
  cache()
jpmts.count
// 263
jpmts.where("hudi_row_key_hash IS NULL").count
// 2
jpmts.where("row_key_hash IS NULL").count
// 0
jpmts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL").count
// 261

// The problem comes from timestamps - the event-meta.dt timestamp is different from the event_timestamp we get from the logging table...
// It will be interesting to check the other mismatches for this case

// Anyway - how can we define a primary-key????

Checks for discrenpencies (using previous comment setup):

  • page-delete
// Try to match events using timestamps
val jpdts = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'delete'").join(eventsPageDelete, col("hudi_page_id") === col("page_id"), "outer").
  selectExpr("hudi_row_key_hash", "row_key_hash", "(hudi_page_id IS NULL OR page_id IS NULL OR (abs(unix_timestamp(hudi_event_timestamp) - unix_timestamp(event_timestamp, 'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'')) < 5)) AS ts_match").
  cache()
jpdts.count
// 919
jpdts.where("hudi_row_key_hash IS NULL").count
// 12
jpdts.where("row_key_hash IS NULL").count
// 0
jpdts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL AND ts_match").count
// 901 -- Almost all match within 5s
jpdts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL AND not ts_match").count
// 6

// Looking at the 6 events that don't match with a timestamp within 5s (not very reasable sorry :():
hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'delete'").join(eventsPageDelete, col("hudi_page_id") === col("page_id"), "outer").
  where("hudi_page_id IS NOT NULL AND page_id IS NOT NULL AND (abs(unix_timestamp(hudi_event_timestamp) - unix_timestamp(event_timestamp, 'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'')) >= 5)").
  show(100, false)

+-----------------+---------------+---------------------+------------------+-------------------------------+------------+-----------------------+----------------+------------+--------------------------------+--------------------------------+------------+----------+--------------------+-------------+--------------------------+-------+------------------+-----------+-------+
|hudi_event_entity|hudi_event_type|hudi_event_timestamp |hudi_event_user_id|hudi_event_user_text_historical|hudi_page_id|hudi_page_artificial_id|hudi_revision_id|hudi_user_id|hudi_row_key_hash               |row_key_hash                    |event_entity|event_type|event_timestamp     |event_user_id|event_user_text_historical|page_id|page_artificial_id|revision_id|user_id|
+-----------------+---------------+---------------------+------------------+-------------------------------+------------+-----------------------+----------------+------------+--------------------------------+--------------------------------+------------+----------+--------------------+-------------+--------------------------+-------+------------------+-----------+-------+
|page             |delete         |2020-07-30 16:49:10.0|8630              |Chenzw                         |776311      |null                   |null            |null        |603321d3c0b792c959095627067d71ac|609894b5f351ef92eb2ec0e0d1c6dc5c|page        |delete    |2020-07-30T16:49:32Z|8630         |Chenzw                    |776311 |null              |7042637    |null   |
|page             |delete         |2020-07-06 17:06:54.0|2133              |Eptalon                        |676812      |null                   |null            |null        |9c21a46ffa5d77f131e03e129421e5e9|7c0b08708a3affb76173cd3a888be8bc|page        |delete    |2020-07-06T22:32:27Z|18539        |Djsasso                   |676812 |null              |7021769    |null   |
|page             |delete         |2020-07-03 07:54:39.0|22027             |Auntof6                        |285032      |null                   |null            |null        |503fa76a1763451b1d6b9472436738f8|3387c53369dc7af5e68b2377f4a3fda8|page        |delete    |2020-07-03T07:54:49Z|22027        |Auntof6                   |285032 |null              |7013107    |null   |
|page             |delete         |2020-07-06 22:32:27.0|18539             |Djsasso                        |676812      |null                   |null            |null        |7c0b08708a3affb76173cd3a888be8bc|6d244a642406d9852d186a124322ccd6|page        |delete    |2020-07-06T17:06:55Z|2133         |Eptalon                   |676812 |null              |7021769    |null   |
|page             |delete         |2020-07-27 08:15:16.0|22027             |Auntof6                        |700118      |null                   |null            |null        |1eaec820d3aa53746824410803aeb70b|27ae3b48edab900f71c18cdc1023a72e|page        |delete    |2020-07-27T08:15:26Z|22027        |Auntof6                   |700118 |null              |6577725    |null   |
|page             |delete         |2020-07-30 16:49:31.0|8630              |Chenzw                         |776311      |null                   |null            |null        |cdc521063916ffe15f3deecfcf362b32|603321d3c0b792c959095627067d71ac|page        |delete    |2020-07-30T16:49:10Z|8630         |Chenzw                    |776311 |null              |7042637    |null   |
+-----------------+---------------+---------------------+------------------+-------------------------------+------------+-----------------------+----------------+------------+--------------------------------+--------------------------------+------------+----------+--------------------+-------------+--------------------------+-------+------------------+-----------+-------+
  • page_create
val jpcts = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'create'").join(eventsPageCreate, col("hudi_page_id") === col("page_id"), "outer").
  selectExpr("hudi_row_key_hash", "row_key_hash", "(hudi_page_id IS NULL OR page_id IS NULL OR (abs(unix_timestamp(hudi_event_timestamp) - unix_timestamp(event_timestamp, 'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'')) < 5)) AS ts_match").
  cache()
jpcts.count
// 6170
jpcts.where("hudi_row_key_hash IS NULL").count
// 395 -- Still a lot of mismatch :(
jpcts.where("row_key_hash IS NULL").count
// 252 -- Here as well :( Maybe page_is_deleted?
jpcts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL AND ts_match").count
// 5455
jpcts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL AND NOT ts_match").count
// 68

hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'create'").join(eventsPageCreate, col("hudi_page_id") === col("page_id"), "outer").
  where("hudi_page_id IS NOT NULL AND page_id IS NOT NULL AND (abs(unix_timestamp(hudi_event_timestamp) - unix_timestamp(event_timestamp, 'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'')) >= 5)").
  show(100, false)
+-----------------+---------------+---------------------+------------------+---------------------------------------+------------+-----------------------+----------------+------------+----------------------------
----+--------------------------------+------------+----------+--------------------+-------------+---------------------------------------+-------+------------------+-----------+-------+
|hudi_event_entity|hudi_event_type|hudi_event_timestamp |hudi_event_user_id|hudi_event_user_text_historical        |hudi_page_id|hudi_page_artificial_id|hudi_revision_id|hudi_user_id|hudi_row_key_hash           
    |row_key_hash                    |event_entity|event_type|event_timestamp     |event_user_id|event_user_text_historical             |page_id|page_artificial_id|revision_id|user_id|
+-----------------+---------------+---------------------+------------------+---------------------------------------+------------+-----------------------+----------------+------------+----------------------------
----+--------------------------------+------------+----------+--------------------+-------------+---------------------------------------+-------+------------------+-----------+-------+
|page             |create         |2020-07-01 06:35:03.0|null              |106.205.11.40                          |771740      |null                   |null            |null        |1af5a34c7f74f606eefaed5e3078
749c|85dd947dd13bcf553ad007384f4f533e|page        |create    |2020-07-01T06:42:17Z|null         |106.205.11.40                          |771740 |null              |7013623    |null   |
|page             |create         |2020-07-01 15:30:29.0|null              |69.115.47.9                            |771875      |null                   |null            |null        |5e23915d9d3cd0b3916475aa6ef9
4608|e8ddf68dcaa93e3d63e9a35e75d1944e|page        |create    |2020-07-01T17:45:08Z|null         |69.115.47.9                            |771875 |null              |7014279    |null   |
|page             |create         |2020-07-23 14:35:05.0|null              |null                                   |777039      |null                   |null            |null        |9d7ef58cf29caa3a0775fd15bb2e
f744|2399cb71073a5ccbab4dcd13c50026a8|page        |create    |2020-07-26T19:07:46Z|null         |2604:2000:1481:8600:B905:1CA2:9C84:CD55|777039 |null              |7046945    |null   |
|page             |create         |2020-07-10 12:26:50.0|2133              |Eptalon                                |773914      |null                   |null            |null        |b8624c1f11a13cc7f35f0e8bbfc8
8c92|523bae9ad3e22bfbc56e09266ab4ce82|page        |create    |2020-07-10T21:32:45Z|2133         |Eptalon                                |773914 |null              |7027926    |null   |
|page             |create         |2020-07-21 12:45:22.0|null              |216.11.189.203                         |776157      |null                   |null            |null        |f22116e5cb8a679c78d9c2b29a95
9956|9ec0a18f0df0e5644297af9ee3825488|page        |create    |2020-07-21T19:17:35Z|null         |216.11.189.203                         |776157 |null              |7041575    |null   |
|page             |create         |2020-07-06 21:51:08.0|18539             |Djsasso                                |772897      |null                   |null            |null        |2488132ed5a03f92f3dd0f96c869
038a|fbce1e48d22c49395f0d016a5a7c0041|page        |create    |2020-07-06T22:07:14Z|18539        |Djsasso                                |772897 |null              |7022504    |null   |
|page             |create         |2020-07-23 15:45:08.0|1016980           |Kiagus Muhammad Hanif Sirua            |776615      |null                   |null            |null        |4ec5d85bd29325b10cd41472b279
c787|fc073744a9e20d81b253dcd77a737999|page        |create    |2020-07-24T00:26:12Z|1016980      |Kiagus Muhammad Hanif Sirua            |776615 |null              |7044338    |null   |
|page             |create         |2020-07-23 03:30:41.0|null              |null                                   |776508      |null                   |null            |null        |0548a8babae133796f3d3868d87a
5e2a|a9c1bedc654332524b39154394b90929|page        |create    |2020-07-23T14:34:26Z|null         |2604:2000:1481:8600:1859:D4A4:D541:479 |776508 |null              |7043672    |null   |
|page             |create         |2020-07-11 17:19:58.0|1015017           |Ruchi1403                              |774338      |null                   |null            |null        |3878527da8b548ce60bbce55e8af
b7ae|6225e236ffb2807d6ab449a3a9cfc3d0|page        |create    |2020-07-13T03:47:27Z|1015017      |Ruchi1403                              |774338 |null              |7030827    |null   |
|page             |create         |2020-07-26 12:10:11.0|null              |113.185.46.239                         |777020      |null                   |null            |null        |8f7447a20c361c8565e664eeb745
1ba0|096c16c74e9eba676f1b64fc2c267352|page        |create    |2020-07-26T15:34:36Z|null         |113.185.46.239                         |777020 |null              |7046740    |null   |
|page             |create         |2020-07-23 14:46:00.0|null              |2804:431:9705:123A:44AD:402D:CC1A:7CBD |776936      |null                   |null            |null        |8d7a19c81b2c5f5cdf7f2d765312
529d|3538d42cac905e750bf90e97d1edcf4c|page        |create    |2020-07-25T21:54:43Z|null         |2804:431:9705:123A:44AD:402D:CC1A:7CBD |776936 |null              |7046241    |null   |
|page             |create         |2020-07-03 12:06:50.0|null              |188.28.164.70                          |773753      |null                   |null            |null        |ee98d96c75bb00dd135eac9a9ca3
1eb3|4ca8c4707f307eea229ac4ba0e416f6e|page        |create    |2020-07-09T21:04:04Z|null         |188.28.164.70                          |773753 |null              |7026991    |null   |
|page             |create         |2020-07-08 14:58:40.0|null              |2600:100B:B01F:5D1D:8581:E76B:C057:E33E|773293      |null                   |null            |null        |387146ff9c469610217f15d7f096
c1d3|30049538a0307d04b81ff5e7e74964ae|page        |create    |2020-07-08T14:59:02Z|null         |2600:100B:B01F:5D1D:8581:E76B:C057:E33E|773293 |null              |7024926    |null   |
|page             |create         |2020-07-27 07:31:36.0|null              |96.232.204.207                         |777217      |null                   |null            |null        |ff801bb43ae1937581a9aab7c850
a9e4|b96e012225dddafc1aec916db50321fe|page        |create    |2020-07-27T18:44:23Z|null         |96.232.204.207                         |777217 |null              |7047883    |null   |
|page             |create         |2020-07-24 19:34:57.0|null              |103.133.200.3                          |776712      |null                   |null            |null        |7f2b86db57831d0d0aa21a2e48f5
6591|b74a026b77098ede7dd0808112191551|page        |create    |2020-07-24T19:36:47Z|null         |103.133.200.3                          |776712 |null              |7044995    |null   |
|page             |create         |2020-07-05 06:25:43.0|731605            |ImprovedWikiImprovment                 |777841      |null                   |null            |null        |f4d73b303deb6471e7a369d55ee2
9b1d|86f99e09d43584f9fe21b991d0b9d603|page        |create    |2020-07-31T07:35:54Z|731605       |ImprovedWikiImprovment                 |777841 |null              |7052181    |null   |
|page             |create         |2020-07-11 21:11:25.0|null              |183.171.158.2                          |774113      |null                   |null            |null        |064c14ba51b7ee801702ae6e5b4b
28ca|cdc7c074d18446e33aa2155ab7f0c120|page        |create    |2020-07-11T21:12:34Z|null         |183.171.158.2                          |774113 |null              |7029279    |null   |
|page             |create         |2020-07-18 12:46:14.0|null              |91.186.250.14                          |776269      |null                   |null            |null        |6bb0a417d47f4b460d912da3d806
8d21|2b16d0fabfbe0b87cff878c7e4443930|page        |create    |2020-07-22T10:43:02Z|null         |91.186.250.14                          |776269 |null              |7042178    |null   |
|page             |create         |2020-07-08 14:59:16.0|null              |2600:100B:B01F:5D1D:8581:E76B:C057:E33E|773299      |null                   |null            |null        |e5fd3a0068ab47d04e081c11b6a6
7175|beb557012d2168f665b4296a909545cf|page        |create    |2020-07-08T15:04:12Z|null         |2600:100B:B01F:5D1D:8581:E76B:C057:E33E|773299 |null              |7024934    |null   |
|page             |create         |2020-07-27 16:46:11.0|null              |2A02:C7D:5D63:1300:99F0:483B:4333:E8B3 |777199      |null                   |null            |null        |2b9a9f02cb903ff6dbb3b7f190c9
4ab5|95d89d5c9301d2ee9578966eb28e1044|page        |create    |2020-07-27T17:15:36Z|null         |2A02:C7D:5D63:1300:99F0:483B:4333:E8B3 |777199 |null              |7047744    |null   |
|page             |create         |2020-07-13 13:18:43.0|null              |82.132.238.246                         |774405      |null                   |null            |null        |e4cce9e33d963daf2bc87abb3ec9
82ce|58bcc7da0759e051dc145553229e800d|page        |create    |2020-07-13T13:21:06Z|null         |82.132.238.246                         |774405 |null              |7031240    |null   |
|page             |create         |2020-07-27 19:58:00.0|731605            |ImprovedWikiImprovment                 |777745      |null                   |null            |null        |dcc26c5db9f510a4ba159f250499
12fa|7952f4febe7aeb2bc7cd9d20a643884e|page        |create    |2020-07-30T19:32:59Z|731605       |ImprovedWikiImprovment                 |777745 |null              |7051600    |null   |
|page             |create         |2020-07-05 00:49:56.0|null              |null                                   |775790      |null                   |null            |null        |3be45c24b4244016246db70f73c1
b0bc|8a8c0b95fc5c9ee889ff5871ef0d2e98|page        |create    |2020-07-19T17:19:30Z|null         |104.230.23.164                         |775790 |null              |7039385    |null   |
|page             |create         |2020-07-01 18:14:17.0|2133              |Eptalon                                |771890      |null                   |null            |null        |de31de815585d8b3771760a668d2
78df|771dae02392f7693b57d1cb0bf805046|page        |create    |2020-07-01T18:20:26Z|2133         |Eptalon                                |771890 |null              |7014335    |null   |
|page             |create         |2020-07-16 12:18:19.0|null              |121.209.165.23                         |775164      |null                   |null            |null        |3ce07de39056c2bed378ba30095debe5|c5c32f73848033043dca2d4cf9e24440|page        |create    |2020-07-16T12:27:46Z|null         |121.209.165.23                         |775164 |null              |7035848    |null   |
|page             |create         |2020-07-21 19:21:59.0|null              |216.11.189.203                         |776701      |null                   |null            |null        |90b0516b0a8d8c7a4f70f17b4eb24c48|bb858788bdff445f7492e1efcc7757e4|page        |create    |2020-07-24T19:09:43Z|null         |216.11.189.203                         |776701 |null              |7044924    |null   |
|page             |create         |2020-07-05 06:25:09.0|22027             |Auntof6                                |777114      |null                   |null            |null        |eb372e8ade0219df45adc9823263baa0|5c6d1cc2785a4c1d3a68ebc1946668c0|page        |create    |2020-07-27T08:26:10Z|22027        |Auntof6                                |777114 |null              |7047405    |null   |
|page             |create         |2020-07-10 21:07:38.0|2133              |Eptalon                                |773910      |null                   |null            |null        |89c5d3759bec217e4c62dae85ca33914|d42083b8f1333681c153d4a776de5b9e|page        |create    |2020-07-10T21:10:57Z|2133         |Eptalon                                |773910 |null              |7027906    |null   |
|page             |create         |2020-07-15 10:47:35.0|null              |27.109.115.145                         |777500      |null                   |null            |null        |3244fb79ba82df3b29d5b0f9514aeaa6|e018d8ea407c13320d805b9538b59074|page        |create    |2020-07-29T05:51:15Z|null         |27.109.115.145                         |777500 |null              |7049825    |null   |
|page             |create         |2020-07-05 06:25:21.0|22027             |Auntof6                                |777112      |null                   |null            |null        |92843e6178c454d81a4053b060278cd5|42087db116876f0ba84a7b4ea97d532c|page        |create    |2020-07-27T08:22:07Z|22027        |Auntof6                                |777112 |null              |7047398    |null   |
|page             |create         |2020-07-27 19:19:53.0|null              |2604:2000:1481:8600:3873:EF4F:FFFE:8E43|777377      |null                   |null            |null        |f2cde5c18ef90e9a9a5006bee0844fae|5119076e8337686d73961a52f1e07ebc|page        |create    |2020-07-28T12:20:18Z|null         |2604:2000:1481:8600:3873:EF4F:FFFE:8E43|777377 |null              |7048861    |null   |
|page             |create         |2020-07-03 13:20:00.0|2133              |Eptalon                                |772205      |null                   |null            |null        |54c49ca4e2f02a60af6661f07de22315|d19e94fb5fb4c3484fd820eb7b88163c|page        |create    |2020-07-03T19:37:05Z|2133         |Eptalon                                |772205 |null              |7017062    |null   |
|page             |create         |2020-07-13 03:49:22.0|null              |2409:4064:2E26:2F1C:7C9B:ABB2:48B9:AC3F|776024      |null                   |null            |null        |0f8e38d897b49882966c6ee162d81bbc|c72659d87c8361ebd9af02476b696be7|page        |create    |2020-07-21T04:02:19Z|null         |2409:4064:2E26:2F1C:7C9B:ABB2:48B9:AC3F|776024 |null              |7040796    |null   |
|page             |create         |2020-07-24 18:03:36.0|null              |2604:2000:1481:8600:B905:1CA2:9C84:CD55|777023      |null                   |null            |null        |dbe0bc54000a59d181c297fa376be779|6454bc4b3e6e35f8a3e2c98263b86326|page        |create    |2020-07-26T16:27:26Z|null         |2604:2000:1481:8600:B905:1CA2:9C84:CD55|777023 |null              |7046779    |null   |
|page             |create         |2020-07-11 18:32:37.0|2133              |Eptalon                                |774095      |null                   |null            |null        |3f3baae6338375f179bad1e316cac423|953b40ec5179d4da19c5790c73013535|page        |create    |2020-07-11T19:28:59Z|2133         |Eptalon                                |774095 |null              |7029003    |null   |
|page             |create         |2020-07-04 18:06:57.0|null              |2409:4062:2E07:BB27:931C:7845:371D:3337|772352      |null                   |null            |null        |2d28ce12a7bf37d7398fc4acc3cb4dd1|458be40bbc3bef5134d7e5b67dcc9547|page        |create    |2020-07-04T18:20:54Z|null         |2409:4062:2E07:BB27:931C:7845:371D:3337|772352 |null              |7018319    |null   |
|page             |create         |2020-07-19 19:32:09.0|2133              |Eptalon                                |775814      |null                   |null            |null        |d4271d9c478f57be07302df4958c9727|4d2a77953e760afc6e29aa41e2d753e5|page        |create    |2020-07-19T20:38:50Z|2133         |Eptalon                                |775814 |null              |7039515    |null   |
|page             |create         |2020-07-17 17:05:51.0|null              |2A02:C7F:3E57:2000:8494:F116:FC3C:A6B  |775448      |null                   |null            |null        |a1a778d6124e23a4e25cb02b4ba130ff|10f069f3c43d297fe4a07cd07ea1782e|page        |create    |2020-07-17T22:26:57Z|null         |2A02:C7F:3E57:2000:8494:F116:FC3C:A6B  |775448 |null              |7037669    |null   |
|page             |create         |2020-07-05 06:25:33.0|935234            |Naddruf                                |774441      |null                   |null            |null        |4b6f6bbfa64e598f62e291f3d203514e|60054cfcab3bef1978619aa4b017615b|page        |create    |2020-07-13T17:12:31Z|935234       |Naddruf                                |774441 |null              |7031620    |null   |
|page             |create         |2020-07-14 12:22:38.0|null              |86.130.163.98                          |776469      |null                   |null            |null        |4c8a1c6116a6acafe823a3a84a62f1f1|d08c6286f86f20c219e6ee570cdf7944|page        |create    |2020-07-23T11:15:05Z|null         |86.130.163.98                          |776469 |null              |7043443    |null   |
|page             |create         |2020-07-05 00:49:56.0|null              |104.230.23.164                         |775787      |null                   |null            |null        |bb14ab1d8aac2a4d05f5dfc758b386a4|40b87eaf4120d8ffef03fcdd65b836af|page        |create    |2020-07-19T17:13:16Z|null         |104.230.23.164                         |775787 |null              |7039379    |null   |
|page             |create         |2020-07-25 20:31:08.0|51686             |حُميد                                  |777078      |null                   |null            |null        |7b19e3fdddeadb3efc6b6ddb6f57cb23|f718dfe205bd2bb881be443b60ef0f35|page        |create    |2020-07-26T23:31:04Z|51686        |حُميد                                  |777078 |null              |7047146    |null   |
|page             |create         |2020-07-03 16:54:49.0|1011679           |UN175                                  |776333      |null                   |null            |null        |fec95017672c25711166f1643b6c14ab|786f2bc64b7507aa285c60bdc79a9135|page        |create    |2020-07-22T20:44:13Z|1011679      |UN175                                  |776333 |null              |7042802    |null   |
|page             |create         |2020-07-24 19:16:27.0|null              |68.55.103.35                           |777233      |null                   |null            |null        |f84ad45d616c80733e978d6feb6615f6|38f75c81a012ba5c62c1613afe82f210|page        |create    |2020-07-27T20:33:43Z|null         |68.55.103.35                           |777233 |null              |7048052    |null   |
|page             |create         |2020-07-14 10:59:59.0|null              |178.120.18.16                          |774651      |null                   |null            |null        |cc7db8f3fce3707cdbda4d9e0f0b774c|5a984bca92c73aff3a8aa8179b1aabd0|page        |create    |2020-07-14T11:53:20Z|null         |178.120.18.16                          |774651 |null              |7032708    |null   |
|page             |create         |2020-07-28 12:07:01.0|68157             |Macdonald-ross                         |777376      |null                   |null            |null        |2d6cdb9b2aae179044da6e72ea98cb9b|7213b1f50d69c2dbcfb129f66791f808|page        |create    |2020-07-28T12:10:21Z|68157        |Macdonald-ross                         |777376 |null              |7048853    |null   |
|page             |create         |2020-07-11 19:45:56.0|1014967           |TypoPedia                              |774283      |null                   |null            |null        |16f5f74dbf37205b0a1723a368b91c54|c1eb5bf01bbf9bfe2b2a74121928a91b|page        |create    |2020-07-12T19:38:44Z|1014967      |TypoPedia                              |774283 |null              |7030466    |null   |
|page             |create         |2020-07-15 15:37:24.0|18539             |Djsasso                                |775021      |null                   |null            |null        |38a5226435d779d254b1208142b440b6|4384c2cda0cd7c4cc3f8bad1f2e0b075|page        |create    |2020-07-15T15:37:44Z|18539        |Djsasso                                |775021 |null              |7034776    |null   |
|page             |create         |2020-07-14 11:11:39.0|null              |178.120.18.16                          |774653      |null                   |null            |null        |61f1d9dca6f194a431d42e3192dbca93|bcbef864ef00fd5034c4ca0e7846f27c|page        |create    |2020-07-14T11:55:53Z|null         |178.120.18.16                          |774653 |null              |7032713    |null   |
|page             |create         |2020-07-27 07:44:43.0|null              |123.201.126.221                        |777608      |null                   |null            |null        |625e93c88c218d734b329937890de1b7|98b11a9905f9691ea08c3eb1e5966f91|page        |create    |2020-07-29T17:51:04Z|null         |123.201.126.221                        |777608 |null              |7050507    |null   |
|page             |create         |2020-07-19 20:46:00.0|null              |203.192.209.135                        |775940      |null                   |null            |null        |d6c424e68482a92c5c2436309e29a0be|4eed541b85ca5a69896ded93b92595c0|page        |create    |2020-07-20T13:18:17Z|null         |203.192.209.135                        |775940 |null              |7040222    |null   |
|page             |create         |2020-07-22 20:06:23.0|1016571           |Herɋuartda                             |776364      |null                   |null            |null        |7e0ca3f7a4d97ab2ada3c3f2ca0f8cdc|810498f91baa5f6a2455281e965b2804|page        |create    |2020-07-23T01:22:12Z|1016571      |Herɋuartda                             |776364 |null              |7043020    |null   |
|page             |create         |2020-07-08 09:51:55.0|2133              |Eptalon                                |773254      |null                   |null            |null        |3c71f0a12c3a5367e01178545f90e87e|4ae09ce7d1d5455122e491e6686235f8|page        |create    |2020-07-08T11:02:46Z|2133         |Eptalon                                |773254 |null              |7024657    |null   |
|page             |create         |2020-07-11 17:34:34.0|731605            |ImprovedWikiImprovment                 |776086      |null                   |null            |null        |0739197732ceac5e27b527d2508c96d8|b283af74182c17c8c80f9020282e7635|page        |create    |2020-07-21T11:27:37Z|731605       |ImprovedWikiImprovment                 |776086 |null              |7041129    |null   |
|page             |create         |2020-07-15 10:48:43.0|null              |122.174.248.82                         |774933      |null                   |null            |null        |8360b580582005dde01cddc82abf7f0c|8d4e1b08b1d2b0e89ca42cb4c69aa233|page        |create    |2020-07-15T10:50:59Z|null         |122.174.248.82                         |774933 |null              |7034337    |null   |
|page             |create         |2020-07-20 18:10:08.0|null              |182.187.89.237                         |776291      |null                   |null            |null        |6c17439fc6811a1db3d007d83eea6cdd|cec1b9e526c5ccc921f5fe6dddf6a919|page        |create    |2020-07-22T16:24:21Z|null         |182.187.89.237                         |776291 |null              |7042459    |null   |
|page             |create         |2020-07-03 22:26:35.0|null              |104.230.23.164                         |775789      |null                   |null            |null        |f83d39c944fa8705a4d03878c145151e|1a03bc918d8eaae6b8699e511d45a580|page        |create    |2020-07-19T17:18:12Z|null         |104.230.23.164                         |775789 |null              |7039384    |null   |
|page             |create         |2020-07-16 08:58:00.0|1014250           |SimpleBoi                              |775138      |null                   |null            |null        |94731f976067bfa4aaf55bfec3797f3d|3e048f04e91383fac587c15acef6fbeb|page        |create    |2020-07-16T09:00:07Z|1014250      |SimpleBoi                              |775138 |null              |7035698    |null   |
|page             |create         |2020-07-02 14:54:45.0|null              |185.67.177.137                         |772010      |null                   |null            |null        |906cd1def6bbbf7a750016586581e621|7c73b9c571089cb7fc34699a66cc2ad0|page        |create    |2020-07-02T14:55:36Z|null         |185.67.177.137                         |772010 |null              |7015423    |null   |
|page             |create         |2020-07-23 12:26:42.0|null              |2409:4073:407:CCC5:A29D:AE4A:46C9:E628 |776790      |null                   |null            |null        |00ccd5bf3f68f41a50f7f8d4aaa134fb|57ccae11f0c3205a25b3c5427aba8d2d|page        |create    |2020-07-25T06:07:04Z|null         |2409:4073:407:CCC5:A29D:AE4A:46C9:E628 |776790 |null              |7045424    |null   |
|page             |create         |2020-07-05 00:49:56.0|null              |104.230.23.164                         |775798      |null                   |null            |null        |e08687276bdb7f488ea269fe7a168601|384253c82036282e23c8e19a7eedba5d|page        |create    |2020-07-19T17:40:06Z|null         |104.230.23.164                         |775798 |null              |7039401    |null   |
|page             |create         |2020-07-19 20:46:18.0|null              |98.204.187.175                         |777223      |null                   |null            |null        |211a025c5cb013f4f6d8d43d29b22bdf|cef2acd16426fb5152f2359ff4cc78e1|page        |create    |2020-07-27T19:22:37Z|null         |98.204.187.175                         |777223 |null              |7047934    |null   |
|page             |create         |2020-07-04 06:40:35.0|null              |2409:4062:2E0B:CD10:49BB:E6CF:F9C5:68E9|772348      |null                   |null            |null        |9882235dc34668977e5fd53a3b1b6f5b|fbc1b16ae7539d290e507df344acc0c6|page        |create    |2020-07-04T17:41:59Z|null         |2409:4062:2E0B:CD10:49BB:E6CF:F9C5:68E9|772348 |null              |7018286    |null   |
|page             |create         |2020-07-23 16:43:41.0|null              |201.157.252.176                        |777198      |null                   |null            |null        |2560e393bc7c05c13e40e9401b9939c7|d747e9c48f63fdc4351ec28c2824a588|page        |create    |2020-07-27T17:15:15Z|null         |201.157.252.176                        |777198 |null              |7047743    |null   |
|page             |create         |2020-07-02 23:27:01.0|null              |103.78.225.18                          |772668      |null                   |null            |null        |bef4a9cb9dbd66b046ee3ae0c87a9ef4|45e87713cec3b4566456aa609302476d|page        |create    |2020-07-06T00:35:23Z|null         |103.78.225.18                          |772668 |null              |7021010    |null   |
|page             |create         |2020-07-13 13:09:06.0|null              |82.132.238.246                         |774401      |null                   |null            |null        |fe334a53259e9f372781a795a7a09f50|85a334491245ac70bcc452ed6fb1ec33|page        |create    |2020-07-13T13:16:14Z|null         |82.132.238.246                         |774401 |null              |7031224    |null   |
|page             |create         |2020-07-03 22:26:35.0|null              |2600:1017:B40C:71D1:70CA:95AE:5487:87C7|772648      |null                   |null            |null        |8f61a6e607f594069b21890675ce390a|f28288f13bca005a53afc9e8cc22830c|page        |create    |2020-07-05T23:21:44Z|null         |2600:1017:B40C:71D1:70CA:95AE:5487:87C7|772648 |null              |7020944    |null   |
|page             |create         |2020-07-17 17:04:12.0|null              |203.192.244.66                         |775707      |null                   |null            |null        |7da9908ccae63b9f76c31e150d4692ce|3b0495f56c02882d22a3f1ae866f4718|page        |create    |2020-07-19T11:13:40Z|null         |203.192.244.66                         |775707 |null              |7038973    |null   |
+-----------------+---------------+---------------------+------------------+---------------------------------------+------------+-----------------------+----------------+------------+--------------------------------+--------------------------------+------------+----------+--------------------+-------------+---------------------------------------+-------+------------------+-----------+-------+
  • revision-create
val jrevts = hfsnr.where("hudi_event_entity = 'revision'").join(eventsRevCreate, col("hudi_revision_id") === col("revision_id"), "outer").
  selectExpr("hudi_row_key_hash", "row_key_hash", "(hudi_revision_id IS NULL OR revision_id IS NULL OR (abs(unix_timestamp(hudi_event_timestamp) - unix_timestamp(event_timestamp, 'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'')) < 5)) AS ts_match").
  cache()
jrevts.count
// 39532
jrevts.where("hudi_row_key_hash IS NULL").count
// 613
jrevts.where("row_key_hash IS NULL").count
// 748
jrevts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL AND ts_match").count
// 38171
jrevts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL AND NOT ts_match").count
// 0

// No timestamp difference here - need to find a different for mismatch.
Milimetric moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

Checks for discrenpencies (using previous comment setup):

  • page-delete

...

I noticed here that events to the same page were criss-crossing and they're actually all really close in time, here they are split into two tables and re-ordered to match (btw if you keep tables outside of the code formatting they format nicely in phab):

hudi_event_entityhudi_event_typehudi_event_timestamphudi_event_user_idhudi_event_user_text_historicalhudi_page_id
pagedelete2020-07-30 16:49:10.08630Chenzw776311
pagedelete2020-07-06 17:06:54.02133Eptalon676812
pagedelete2020-07-03 07:54:39.022027Auntof6285032
pagedelete2020-07-06 22:32:27.018539Djsasso676812
pagedelete2020-07-27 08:15:16.022027Auntof6700118
pagedelete2020-07-30 16:49:31.08630Chenzw776311
event_entityevent_typeevent_timestampevent_user_idevent_user_text_historicalpage_idrevision_id
pagedelete2020-07-30T16:49:32Z8630Chenzw7763117042637
pagedelete2020-07-06T17:06:55Z2133Eptalon6768127021769
pagedelete2020-07-03T07:54:49Z22027Auntof62850327013107
pagedelete2020-07-06T22:32:27Z18539Djsasso6768127021769
pagedelete2020-07-27T08:15:26Z22027Auntof67001186577725
pagedelete2020-07-30T16:49:10Z8630Chenzw7763117042637

So the main problem seems to be that hudi_revision_id is null and the times are off by up to 30 seconds or so.

I had no other ideas on the other two checks yet.

Last news on revision_create for simplewiki 2020-07:

  • All kafka-events match mediawiki-history except the ones with deleted-parts (mostly because performer_id is hidden, leading to a mismatch in row_hash_key) - Should be fixable with revision-visibility-change events.
  • Some events of mediawiki-history are missing kafka-events - see T215001
  • Some kafka-events are duplicated - see T262203
Nuria renamed this task from Check that mediawiki-events match mediawiki-history changes over a month to Check whether mediawiki production event data is equivalent to mediawiki-history data over a month .Sep 23 2020, 6:21 PM
Nuria updated the task description. (Show Details)

Final note.
I have found 2 problems with mediawiki-events (over simplewiki only, using mediawiki-history as a baseline):

  • Some events are lost (seen in revision-create, page-create, page-move) - documented in T215001.
  • Some events are duplicated (seen in revision-create only) - documented in T262203.

I have found new features to be added to events to allow to recreate mediawiki-history using events:

  • Add user-create and user-rename events - documented in T262205
  • Add logging events (corresponding to an addition to the logging table) and a reference to those events in the main related events. For instance every page-create, page-move, user-create etc events would point to their corresponding logging-event - Documented in T263055

Re-read and re-thought. All sounds good, it's a solid plan for the next couple of quarters. And I especially like the insight from T262261#6478831 that some failed joins are just due to hidden performer information. This will be good to keep in mind and add an event for, because the log events won't cover it.