To mimic mediawiki-history using events, we need to check that the underlying data provides the same value.
Check whether mediawiki production event data is equivalent to mediawiki-history data over a month such an incremental update can be built.
To mimic mediawiki-history using events, we need to check that the underlying data provides the same value.
Check whether mediawiki production event data is equivalent to mediawiki-history data over a month such an incremental update can be built.
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Declined | None | T258511 Data Lake incremental Data Updates | |||
Open | None | T231938 Get "edits hourly" on a daily basis | |||
Resolved | Milimetric | T258532 [SPIKE] Prototype of incremental updates for mediawiki history for simplewiki , including reverts using apache hudi | |||
Resolved | JAllemandou | T262261 Check whether mediawiki production event data is equivalent to mediawiki-history data over a month | |||
Resolved | Milimetric | T215001 Revisions missing from mediawiki_revision_create | |||
Resolved | None | T280538 Capture rev_is_revert event data in a stream different than mediawiki.revision-create |
Bug found in mediawiki_revision_create events: some revision-create have multiple events (same revision-id, multiple event-requests) - Tracked in T262203.
CODE (For simplewiki only)
import org.apache.spark.sql.SaveMode import org.apache.hudi.DataSourceReadOptions import org.apache.hudi.config.HoodieWriteConfig val tableNameSimplewiki = "joal.test_hudi_mwh_simplewiki" val basePathSimplewiki = "hdfs:///user/joal/test_hudi/mwh_simplewiki" -- Load hudi dataframe val hfs = spark.read.format("hudi"). // Set table name option(HoodieWriteConfig.TABLE_NAME, tableNameSimplewiki). // Read snapshot option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL). option("hoodie.index.type", "BLOOM"). //The number of wildcard asterisks here must be one greater than the number of partition load(basePathSimplewiki + "/*/*") -- restrict data to 2020-07 and keep only id-related fields val hfsn = hfs.where("snapshot='2020-07' AND year(event_timestamp) = 2020 and month(event_timestamp) = 7"). select( "event_entity", "event_type", "event_timestamp", "event_user_id", "event_user_text_historical", "page_id", "page_artificial_id", "revision_id", "user_id", "row_key_hash"). cache -- Rename columns of the dataframe with prefix 'hudi_' to faciliate joining val hfsnr = hfsn.columns.foldLeft(hfsn){ (acc, names) => acc.withColumnRenamed(names, "hudi_".concat(names))}.cache() hfsn.groupBy(col("event_entity"), col("event_type")).count.sort("event_entity", "event_type").show(100, false) /* +------------+-----------+-----+ |event_entity|event_type |count| +------------+-----------+-----+ |page |create |5775 | |page |create-page|5801 | |page |delete |903 | |page |move |261 | |page |restore |8 | |revision |create |38906| |user |alterblocks|16 | |user |altergroups|12 | |user |create |5656 | |user |rename |258 | +------------+-----------+-----+ */
val eventsRevCreate = spark.sql(""" SELECT md5(concat( database, '|', 'revision', '|', 'create', '|', -- timestamps of events needs special parsing as they are not sql-timestamp formatted -- Also, using rev_timestamp instead of meta.dt here as it is available CAST(unix_timestamp(rev_timestamp, 'yyyy-MM-dd\'T\'HH:mm:ss\'Z\'') AS STRING), '|', coalesce(performer.user_id, 0), '|', coalesce(performer.user_text, ''), '|', coalesce(page_id, 0), '|', '', '|', -- no page_artificial_id coalesce(rev_id, 0), '|', 0 -- no user_id )) as row_key_hash, 'revision' AS event_entity, 'create' AS event_type, rev_timestamp AS event_timestamp, performer.user_id AS event_user_id, performer.user_text as event_user_text_historical, page_id, NULL AS page_artificial_id, rev_id as revision_id, NULL as user_id from event.mediawiki_revision_create where year = 2020 and ( (month = 7) -- Expanding boundaries in case there are OR (month = 6 AND day = 30) OR (month = 8 AND day = 1)) and database = 'simplewiki' and rev_timestamp >= '2020-07' and rev_timestamp < '2020-08' """). distinct. // This distinct is important as there are duplicates cache() eventsRevCreate.count // 38784 (vs 38906) -- Now joining on row-key-hash and checking val jrev = hfsnr.where("hudi_event_entity = 'revision'").join(eventsRevCreate, col("hudi_row_key_hash") === col("row_key_hash"), "outer").cache() jrev.count // 39542 jrev.where("hudi_row_key_hash IS NULL").count // 636 jrev.where("row_key_hash IS NULL").count // 758 jrev.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL").count // 38148 // Most rows share the same hash-key - Looks good
val eventsPageCreate = spark.sql(""" SELECT md5(concat( database, '|', 'page', '|', 'create', '|', -- timestamps of events needs special parsing as they are not sql-timestamp formatted -- Also, using rev_timestamp instead of meta.dt here as it is available CAST(unix_timestamp(rev_timestamp, 'yyyy-MM-dd\'T\'HH:mm:ss\'Z\'') AS STRING), '|', coalesce(performer.user_id, 0), '|', coalesce(performer.user_text, ''), '|', coalesce(page_id, 0), '|', '', '|', -- page_artificial_id 0, '|', -- no rev_id for page-create in mwh while there is one in event! 0 -- user_id )) as row_key_hash, 'page' AS event_entity, 'create' AS event_type, rev_timestamp AS event_timestamp, performer.user_id AS event_user_id, performer.user_text as event_user_text_historical, page_id, NULL AS page_artificial_id, rev_id as revision_id, NULL as user_id from event.mediawiki_page_create where year = 2020 and month = 7 and database = 'simplewiki' """).cache() eventsPageCreate.count // 5918 (vs 5801) -- Now joining on row-key-hash and checking val jpc = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'create'").join(eventsPageCreate, col("hudi_row_key_hash") === col("row_key_hash"), "outer").cache() jpc.count // 6238 jpc.where("hudi_row_key_hash IS NULL").count // 463 jpc.where("row_key_hash IS NULL").count // 320 jpc.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL").count // 5455 // Most rows also share the same hash-key, but the proportion of mismatch is high - investigation is important for that one
val eventsPageDelete = spark.sql(""" SELECT md5(concat( database, '|', 'page', '|', 'delete', '|', -- timestamps of events needs special parsing as they are not sql-timestamp formatted CAST(unix_timestamp(meta.dt, 'yyyy-MM-dd\'T\'HH:mm:ss\'Z\'') AS STRING),'|', coalesce(performer.user_id, 0), '|', coalesce(performer.user_text, ''), '|', coalesce(page_id, 0), '|', '', '|', -- page_artificial_id 0, '|', -- no rev_id in mwh 0 -- user_id )) as row_key_hash, 'page' AS event_entity, 'delete' AS event_type, meta.dt AS event_timestamp, performer.user_id AS event_user_id, performer.user_text as event_user_text_historical, page_id, NULL AS page_artificial_id, rev_id as revision_id, NULL as user_id from event.mediawiki_page_delete where year = 2020 and month = 7 and database = 'simplewiki' """).cache() eventsPageDelete.count // 915 (vs 903) -- Now joining on row-key-hash and checking val jpd = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'delete'").join(eventsPageDelete, col("hudi_row_key_hash") === col("row_key_hash"), "outer").cache() jpd.count // 1021 jpd.where("hudi_row_key_hash IS NULL").count // 118 jpd.where("row_key_hash IS NULL").count // 106 jpd.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL").count // 797 // Most rows also share the same hash-key, but the proportion of mismatch is relatively high - investigation is important for that one // NOTE: I checked the mediawiki_page_suppress table and it was empty - maybe we'll need it for other wikis?
val eventsPageRestore = spark.sql(""" SELECT md5(concat( database, '|', 'page', '|', 'restore', '|', -- timestamps of events needs special parsing as they are not sql-timestamp formatted CAST(unix_timestamp(meta.dt, 'yyyy-MM-dd\'T\'HH:mm:ss\'Z\'') AS STRING),'|', coalesce(performer.user_id, 0), '|', coalesce(performer.user_text, ''), '|', coalesce(page_id, 0), '|', '', '|', -- page_artificial_id 0, '|', -- no revId in mwh 0 -- user_id )) as row_key_hash, 'page' AS event_entity, 'restore' AS event_type, meta.dt AS event_timestamp, performer.user_id AS event_user_id, performer.user_text as event_user_text_historical, page_id, NULL AS page_artificial_id, rev_id as revision_id, NULL as user_id from event.mediawiki_page_undelete where year = 2020 and month = 7 and database = 'simplewiki' """).cache() eventsPageRestore.count // 8 (vs 8) -- Now joining on row-key-hash and checking val jpr = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'restore'").join(eventsPageRestore, col("hudi_row_key_hash") === col("row_key_hash"), "outer").cache() jpr.count // 8 jpr.where("hudi_row_key_hash IS NULL").count // 0 jpr.where("row_key_hash IS NULL").count // 0 jpr.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL").count // 8 // Full match - \o/
val eventsPageMove = spark.sql(""" SELECT md5(concat( database, '|', 'page', '|', 'move', '|', -- timestamps of events needs special parsing as they are not sql-timestamp formatted CAST(unix_timestamp(meta.dt, 'yyyy-MM-dd\'T\'HH:mm:ss\'Z\'') AS STRING),'|', coalesce(performer.user_id, 0), '|', coalesce(performer.user_text, ''), '|', coalesce(page_id, 0), '|', '', '|', -- page_artificial_id 0, '|', -- no rev_id in mwh 0 -- user_id )) as row_key_hash, 'page' AS event_entity, 'move' AS event_type, meta.dt AS event_timestamp, performer.user_id AS event_user_id, performer.user_text as event_user_text_historical, page_id, NULL AS page_artificial_id, rev_id as revision_id, NULL as user_id from event.mediawiki_page_move where year = 2020 and month = 7 and database = 'simplewiki' """).cache() eventsPageMove.count // 263 (vs 261) val jpm = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'move'").join(eventsPageMove, col("hudi_row_key_hash") === col("row_key_hash"), "outer").cache() jpm.count // 476 jpm.where("hudi_row_key_hash IS NULL").count // 215 jpm.where("row_key_hash IS NULL").count // 213 jpm.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL").count // 48 // More not-join than joined - investiagtion val jpmts = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'move'").join(eventsPageMove, col("hudi_page_id") === col("page_id"), "outer"). where("hudi_page_id IS NULL OR page_id IS NULL OR (abs(unix_timestamp(hudi_event_timestamp) - unix_timestamp(event_timestamp, 'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'')) < 5)"). cache() jpmts.count // 263 jpmts.where("hudi_row_key_hash IS NULL").count // 2 jpmts.where("row_key_hash IS NULL").count // 0 jpmts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL").count // 261 // The problem comes from timestamps - the event-meta.dt timestamp is different from the event_timestamp we get from the logging table... // It will be interesting to check the other mismatches for this case // Anyway - how can we define a primary-key????
Checks for discrenpencies (using previous comment setup):
// Try to match events using timestamps val jpdts = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'delete'").join(eventsPageDelete, col("hudi_page_id") === col("page_id"), "outer"). selectExpr("hudi_row_key_hash", "row_key_hash", "(hudi_page_id IS NULL OR page_id IS NULL OR (abs(unix_timestamp(hudi_event_timestamp) - unix_timestamp(event_timestamp, 'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'')) < 5)) AS ts_match"). cache() jpdts.count // 919 jpdts.where("hudi_row_key_hash IS NULL").count // 12 jpdts.where("row_key_hash IS NULL").count // 0 jpdts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL AND ts_match").count // 901 -- Almost all match within 5s jpdts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL AND not ts_match").count // 6 // Looking at the 6 events that don't match with a timestamp within 5s (not very reasable sorry :(): hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'delete'").join(eventsPageDelete, col("hudi_page_id") === col("page_id"), "outer"). where("hudi_page_id IS NOT NULL AND page_id IS NOT NULL AND (abs(unix_timestamp(hudi_event_timestamp) - unix_timestamp(event_timestamp, 'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'')) >= 5)"). show(100, false) +-----------------+---------------+---------------------+------------------+-------------------------------+------------+-----------------------+----------------+------------+--------------------------------+--------------------------------+------------+----------+--------------------+-------------+--------------------------+-------+------------------+-----------+-------+ |hudi_event_entity|hudi_event_type|hudi_event_timestamp |hudi_event_user_id|hudi_event_user_text_historical|hudi_page_id|hudi_page_artificial_id|hudi_revision_id|hudi_user_id|hudi_row_key_hash |row_key_hash |event_entity|event_type|event_timestamp |event_user_id|event_user_text_historical|page_id|page_artificial_id|revision_id|user_id| +-----------------+---------------+---------------------+------------------+-------------------------------+------------+-----------------------+----------------+------------+--------------------------------+--------------------------------+------------+----------+--------------------+-------------+--------------------------+-------+------------------+-----------+-------+ |page |delete |2020-07-30 16:49:10.0|8630 |Chenzw |776311 |null |null |null |603321d3c0b792c959095627067d71ac|609894b5f351ef92eb2ec0e0d1c6dc5c|page |delete |2020-07-30T16:49:32Z|8630 |Chenzw |776311 |null |7042637 |null | |page |delete |2020-07-06 17:06:54.0|2133 |Eptalon |676812 |null |null |null |9c21a46ffa5d77f131e03e129421e5e9|7c0b08708a3affb76173cd3a888be8bc|page |delete |2020-07-06T22:32:27Z|18539 |Djsasso |676812 |null |7021769 |null | |page |delete |2020-07-03 07:54:39.0|22027 |Auntof6 |285032 |null |null |null |503fa76a1763451b1d6b9472436738f8|3387c53369dc7af5e68b2377f4a3fda8|page |delete |2020-07-03T07:54:49Z|22027 |Auntof6 |285032 |null |7013107 |null | |page |delete |2020-07-06 22:32:27.0|18539 |Djsasso |676812 |null |null |null |7c0b08708a3affb76173cd3a888be8bc|6d244a642406d9852d186a124322ccd6|page |delete |2020-07-06T17:06:55Z|2133 |Eptalon |676812 |null |7021769 |null | |page |delete |2020-07-27 08:15:16.0|22027 |Auntof6 |700118 |null |null |null |1eaec820d3aa53746824410803aeb70b|27ae3b48edab900f71c18cdc1023a72e|page |delete |2020-07-27T08:15:26Z|22027 |Auntof6 |700118 |null |6577725 |null | |page |delete |2020-07-30 16:49:31.0|8630 |Chenzw |776311 |null |null |null |cdc521063916ffe15f3deecfcf362b32|603321d3c0b792c959095627067d71ac|page |delete |2020-07-30T16:49:10Z|8630 |Chenzw |776311 |null |7042637 |null | +-----------------+---------------+---------------------+------------------+-------------------------------+------------+-----------------------+----------------+------------+--------------------------------+--------------------------------+------------+----------+--------------------+-------------+--------------------------+-------+------------------+-----------+-------+
val jpcts = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'create'").join(eventsPageCreate, col("hudi_page_id") === col("page_id"), "outer"). selectExpr("hudi_row_key_hash", "row_key_hash", "(hudi_page_id IS NULL OR page_id IS NULL OR (abs(unix_timestamp(hudi_event_timestamp) - unix_timestamp(event_timestamp, 'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'')) < 5)) AS ts_match"). cache() jpcts.count // 6170 jpcts.where("hudi_row_key_hash IS NULL").count // 395 -- Still a lot of mismatch :( jpcts.where("row_key_hash IS NULL").count // 252 -- Here as well :( Maybe page_is_deleted? jpcts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL AND ts_match").count // 5455 jpcts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL AND NOT ts_match").count // 68 hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'create'").join(eventsPageCreate, col("hudi_page_id") === col("page_id"), "outer"). where("hudi_page_id IS NOT NULL AND page_id IS NOT NULL AND (abs(unix_timestamp(hudi_event_timestamp) - unix_timestamp(event_timestamp, 'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'')) >= 5)"). show(100, false) +-----------------+---------------+---------------------+------------------+---------------------------------------+------------+-----------------------+----------------+------------+---------------------------- ----+--------------------------------+------------+----------+--------------------+-------------+---------------------------------------+-------+------------------+-----------+-------+ |hudi_event_entity|hudi_event_type|hudi_event_timestamp |hudi_event_user_id|hudi_event_user_text_historical |hudi_page_id|hudi_page_artificial_id|hudi_revision_id|hudi_user_id|hudi_row_key_hash |row_key_hash |event_entity|event_type|event_timestamp |event_user_id|event_user_text_historical |page_id|page_artificial_id|revision_id|user_id| +-----------------+---------------+---------------------+------------------+---------------------------------------+------------+-----------------------+----------------+------------+---------------------------- ----+--------------------------------+------------+----------+--------------------+-------------+---------------------------------------+-------+------------------+-----------+-------+ |page |create |2020-07-01 06:35:03.0|null |106.205.11.40 |771740 |null |null |null |1af5a34c7f74f606eefaed5e3078 749c|85dd947dd13bcf553ad007384f4f533e|page |create |2020-07-01T06:42:17Z|null |106.205.11.40 |771740 |null |7013623 |null | |page |create |2020-07-01 15:30:29.0|null |69.115.47.9 |771875 |null |null |null |5e23915d9d3cd0b3916475aa6ef9 4608|e8ddf68dcaa93e3d63e9a35e75d1944e|page |create |2020-07-01T17:45:08Z|null |69.115.47.9 |771875 |null |7014279 |null | |page |create |2020-07-23 14:35:05.0|null |null |777039 |null |null |null |9d7ef58cf29caa3a0775fd15bb2e f744|2399cb71073a5ccbab4dcd13c50026a8|page |create |2020-07-26T19:07:46Z|null |2604:2000:1481:8600:B905:1CA2:9C84:CD55|777039 |null |7046945 |null | |page |create |2020-07-10 12:26:50.0|2133 |Eptalon |773914 |null |null |null |b8624c1f11a13cc7f35f0e8bbfc8 8c92|523bae9ad3e22bfbc56e09266ab4ce82|page |create |2020-07-10T21:32:45Z|2133 |Eptalon |773914 |null |7027926 |null | |page |create |2020-07-21 12:45:22.0|null |216.11.189.203 |776157 |null |null |null |f22116e5cb8a679c78d9c2b29a95 9956|9ec0a18f0df0e5644297af9ee3825488|page |create |2020-07-21T19:17:35Z|null |216.11.189.203 |776157 |null |7041575 |null | |page |create |2020-07-06 21:51:08.0|18539 |Djsasso |772897 |null |null |null |2488132ed5a03f92f3dd0f96c869 038a|fbce1e48d22c49395f0d016a5a7c0041|page |create |2020-07-06T22:07:14Z|18539 |Djsasso |772897 |null |7022504 |null | |page |create |2020-07-23 15:45:08.0|1016980 |Kiagus Muhammad Hanif Sirua |776615 |null |null |null |4ec5d85bd29325b10cd41472b279 c787|fc073744a9e20d81b253dcd77a737999|page |create |2020-07-24T00:26:12Z|1016980 |Kiagus Muhammad Hanif Sirua |776615 |null |7044338 |null | |page |create |2020-07-23 03:30:41.0|null |null |776508 |null |null |null |0548a8babae133796f3d3868d87a 5e2a|a9c1bedc654332524b39154394b90929|page |create |2020-07-23T14:34:26Z|null |2604:2000:1481:8600:1859:D4A4:D541:479 |776508 |null |7043672 |null | |page |create |2020-07-11 17:19:58.0|1015017 |Ruchi1403 |774338 |null |null |null |3878527da8b548ce60bbce55e8af b7ae|6225e236ffb2807d6ab449a3a9cfc3d0|page |create |2020-07-13T03:47:27Z|1015017 |Ruchi1403 |774338 |null |7030827 |null | |page |create |2020-07-26 12:10:11.0|null |113.185.46.239 |777020 |null |null |null |8f7447a20c361c8565e664eeb745 1ba0|096c16c74e9eba676f1b64fc2c267352|page |create |2020-07-26T15:34:36Z|null |113.185.46.239 |777020 |null |7046740 |null | |page |create |2020-07-23 14:46:00.0|null |2804:431:9705:123A:44AD:402D:CC1A:7CBD |776936 |null |null |null |8d7a19c81b2c5f5cdf7f2d765312 529d|3538d42cac905e750bf90e97d1edcf4c|page |create |2020-07-25T21:54:43Z|null |2804:431:9705:123A:44AD:402D:CC1A:7CBD |776936 |null |7046241 |null | |page |create |2020-07-03 12:06:50.0|null |188.28.164.70 |773753 |null |null |null |ee98d96c75bb00dd135eac9a9ca3 1eb3|4ca8c4707f307eea229ac4ba0e416f6e|page |create |2020-07-09T21:04:04Z|null |188.28.164.70 |773753 |null |7026991 |null | |page |create |2020-07-08 14:58:40.0|null |2600:100B:B01F:5D1D:8581:E76B:C057:E33E|773293 |null |null |null |387146ff9c469610217f15d7f096 c1d3|30049538a0307d04b81ff5e7e74964ae|page |create |2020-07-08T14:59:02Z|null |2600:100B:B01F:5D1D:8581:E76B:C057:E33E|773293 |null |7024926 |null | |page |create |2020-07-27 07:31:36.0|null |96.232.204.207 |777217 |null |null |null |ff801bb43ae1937581a9aab7c850 a9e4|b96e012225dddafc1aec916db50321fe|page |create |2020-07-27T18:44:23Z|null |96.232.204.207 |777217 |null |7047883 |null | |page |create |2020-07-24 19:34:57.0|null |103.133.200.3 |776712 |null |null |null |7f2b86db57831d0d0aa21a2e48f5 6591|b74a026b77098ede7dd0808112191551|page |create |2020-07-24T19:36:47Z|null |103.133.200.3 |776712 |null |7044995 |null | |page |create |2020-07-05 06:25:43.0|731605 |ImprovedWikiImprovment |777841 |null |null |null |f4d73b303deb6471e7a369d55ee2 9b1d|86f99e09d43584f9fe21b991d0b9d603|page |create |2020-07-31T07:35:54Z|731605 |ImprovedWikiImprovment |777841 |null |7052181 |null | |page |create |2020-07-11 21:11:25.0|null |183.171.158.2 |774113 |null |null |null |064c14ba51b7ee801702ae6e5b4b 28ca|cdc7c074d18446e33aa2155ab7f0c120|page |create |2020-07-11T21:12:34Z|null |183.171.158.2 |774113 |null |7029279 |null | |page |create |2020-07-18 12:46:14.0|null |91.186.250.14 |776269 |null |null |null |6bb0a417d47f4b460d912da3d806 8d21|2b16d0fabfbe0b87cff878c7e4443930|page |create |2020-07-22T10:43:02Z|null |91.186.250.14 |776269 |null |7042178 |null | |page |create |2020-07-08 14:59:16.0|null |2600:100B:B01F:5D1D:8581:E76B:C057:E33E|773299 |null |null |null |e5fd3a0068ab47d04e081c11b6a6 7175|beb557012d2168f665b4296a909545cf|page |create |2020-07-08T15:04:12Z|null |2600:100B:B01F:5D1D:8581:E76B:C057:E33E|773299 |null |7024934 |null | |page |create |2020-07-27 16:46:11.0|null |2A02:C7D:5D63:1300:99F0:483B:4333:E8B3 |777199 |null |null |null |2b9a9f02cb903ff6dbb3b7f190c9 4ab5|95d89d5c9301d2ee9578966eb28e1044|page |create |2020-07-27T17:15:36Z|null |2A02:C7D:5D63:1300:99F0:483B:4333:E8B3 |777199 |null |7047744 |null | |page |create |2020-07-13 13:18:43.0|null |82.132.238.246 |774405 |null |null |null |e4cce9e33d963daf2bc87abb3ec9 82ce|58bcc7da0759e051dc145553229e800d|page |create |2020-07-13T13:21:06Z|null |82.132.238.246 |774405 |null |7031240 |null | |page |create |2020-07-27 19:58:00.0|731605 |ImprovedWikiImprovment |777745 |null |null |null |dcc26c5db9f510a4ba159f250499 12fa|7952f4febe7aeb2bc7cd9d20a643884e|page |create |2020-07-30T19:32:59Z|731605 |ImprovedWikiImprovment |777745 |null |7051600 |null | |page |create |2020-07-05 00:49:56.0|null |null |775790 |null |null |null |3be45c24b4244016246db70f73c1 b0bc|8a8c0b95fc5c9ee889ff5871ef0d2e98|page |create |2020-07-19T17:19:30Z|null |104.230.23.164 |775790 |null |7039385 |null | |page |create |2020-07-01 18:14:17.0|2133 |Eptalon |771890 |null |null |null |de31de815585d8b3771760a668d2 78df|771dae02392f7693b57d1cb0bf805046|page |create |2020-07-01T18:20:26Z|2133 |Eptalon |771890 |null |7014335 |null | |page |create |2020-07-16 12:18:19.0|null |121.209.165.23 |775164 |null |null |null |3ce07de39056c2bed378ba30095debe5|c5c32f73848033043dca2d4cf9e24440|page |create |2020-07-16T12:27:46Z|null |121.209.165.23 |775164 |null |7035848 |null | |page |create |2020-07-21 19:21:59.0|null |216.11.189.203 |776701 |null |null |null |90b0516b0a8d8c7a4f70f17b4eb24c48|bb858788bdff445f7492e1efcc7757e4|page |create |2020-07-24T19:09:43Z|null |216.11.189.203 |776701 |null |7044924 |null | |page |create |2020-07-05 06:25:09.0|22027 |Auntof6 |777114 |null |null |null |eb372e8ade0219df45adc9823263baa0|5c6d1cc2785a4c1d3a68ebc1946668c0|page |create |2020-07-27T08:26:10Z|22027 |Auntof6 |777114 |null |7047405 |null | |page |create |2020-07-10 21:07:38.0|2133 |Eptalon |773910 |null |null |null |89c5d3759bec217e4c62dae85ca33914|d42083b8f1333681c153d4a776de5b9e|page |create |2020-07-10T21:10:57Z|2133 |Eptalon |773910 |null |7027906 |null | |page |create |2020-07-15 10:47:35.0|null |27.109.115.145 |777500 |null |null |null |3244fb79ba82df3b29d5b0f9514aeaa6|e018d8ea407c13320d805b9538b59074|page |create |2020-07-29T05:51:15Z|null |27.109.115.145 |777500 |null |7049825 |null | |page |create |2020-07-05 06:25:21.0|22027 |Auntof6 |777112 |null |null |null |92843e6178c454d81a4053b060278cd5|42087db116876f0ba84a7b4ea97d532c|page |create |2020-07-27T08:22:07Z|22027 |Auntof6 |777112 |null |7047398 |null | |page |create |2020-07-27 19:19:53.0|null |2604:2000:1481:8600:3873:EF4F:FFFE:8E43|777377 |null |null |null |f2cde5c18ef90e9a9a5006bee0844fae|5119076e8337686d73961a52f1e07ebc|page |create |2020-07-28T12:20:18Z|null |2604:2000:1481:8600:3873:EF4F:FFFE:8E43|777377 |null |7048861 |null | |page |create |2020-07-03 13:20:00.0|2133 |Eptalon |772205 |null |null |null |54c49ca4e2f02a60af6661f07de22315|d19e94fb5fb4c3484fd820eb7b88163c|page |create |2020-07-03T19:37:05Z|2133 |Eptalon |772205 |null |7017062 |null | |page |create |2020-07-13 03:49:22.0|null |2409:4064:2E26:2F1C:7C9B:ABB2:48B9:AC3F|776024 |null |null |null |0f8e38d897b49882966c6ee162d81bbc|c72659d87c8361ebd9af02476b696be7|page |create |2020-07-21T04:02:19Z|null |2409:4064:2E26:2F1C:7C9B:ABB2:48B9:AC3F|776024 |null |7040796 |null | |page |create |2020-07-24 18:03:36.0|null |2604:2000:1481:8600:B905:1CA2:9C84:CD55|777023 |null |null |null |dbe0bc54000a59d181c297fa376be779|6454bc4b3e6e35f8a3e2c98263b86326|page |create |2020-07-26T16:27:26Z|null |2604:2000:1481:8600:B905:1CA2:9C84:CD55|777023 |null |7046779 |null | |page |create |2020-07-11 18:32:37.0|2133 |Eptalon |774095 |null |null |null |3f3baae6338375f179bad1e316cac423|953b40ec5179d4da19c5790c73013535|page |create |2020-07-11T19:28:59Z|2133 |Eptalon |774095 |null |7029003 |null | |page |create |2020-07-04 18:06:57.0|null |2409:4062:2E07:BB27:931C:7845:371D:3337|772352 |null |null |null |2d28ce12a7bf37d7398fc4acc3cb4dd1|458be40bbc3bef5134d7e5b67dcc9547|page |create |2020-07-04T18:20:54Z|null |2409:4062:2E07:BB27:931C:7845:371D:3337|772352 |null |7018319 |null | |page |create |2020-07-19 19:32:09.0|2133 |Eptalon |775814 |null |null |null |d4271d9c478f57be07302df4958c9727|4d2a77953e760afc6e29aa41e2d753e5|page |create |2020-07-19T20:38:50Z|2133 |Eptalon |775814 |null |7039515 |null | |page |create |2020-07-17 17:05:51.0|null |2A02:C7F:3E57:2000:8494:F116:FC3C:A6B |775448 |null |null |null |a1a778d6124e23a4e25cb02b4ba130ff|10f069f3c43d297fe4a07cd07ea1782e|page |create |2020-07-17T22:26:57Z|null |2A02:C7F:3E57:2000:8494:F116:FC3C:A6B |775448 |null |7037669 |null | |page |create |2020-07-05 06:25:33.0|935234 |Naddruf |774441 |null |null |null |4b6f6bbfa64e598f62e291f3d203514e|60054cfcab3bef1978619aa4b017615b|page |create |2020-07-13T17:12:31Z|935234 |Naddruf |774441 |null |7031620 |null | |page |create |2020-07-14 12:22:38.0|null |86.130.163.98 |776469 |null |null |null |4c8a1c6116a6acafe823a3a84a62f1f1|d08c6286f86f20c219e6ee570cdf7944|page |create |2020-07-23T11:15:05Z|null |86.130.163.98 |776469 |null |7043443 |null | |page |create |2020-07-05 00:49:56.0|null |104.230.23.164 |775787 |null |null |null |bb14ab1d8aac2a4d05f5dfc758b386a4|40b87eaf4120d8ffef03fcdd65b836af|page |create |2020-07-19T17:13:16Z|null |104.230.23.164 |775787 |null |7039379 |null | |page |create |2020-07-25 20:31:08.0|51686 |حُميد |777078 |null |null |null |7b19e3fdddeadb3efc6b6ddb6f57cb23|f718dfe205bd2bb881be443b60ef0f35|page |create |2020-07-26T23:31:04Z|51686 |حُميد |777078 |null |7047146 |null | |page |create |2020-07-03 16:54:49.0|1011679 |UN175 |776333 |null |null |null |fec95017672c25711166f1643b6c14ab|786f2bc64b7507aa285c60bdc79a9135|page |create |2020-07-22T20:44:13Z|1011679 |UN175 |776333 |null |7042802 |null | |page |create |2020-07-24 19:16:27.0|null |68.55.103.35 |777233 |null |null |null |f84ad45d616c80733e978d6feb6615f6|38f75c81a012ba5c62c1613afe82f210|page |create |2020-07-27T20:33:43Z|null |68.55.103.35 |777233 |null |7048052 |null | |page |create |2020-07-14 10:59:59.0|null |178.120.18.16 |774651 |null |null |null |cc7db8f3fce3707cdbda4d9e0f0b774c|5a984bca92c73aff3a8aa8179b1aabd0|page |create |2020-07-14T11:53:20Z|null |178.120.18.16 |774651 |null |7032708 |null | |page |create |2020-07-28 12:07:01.0|68157 |Macdonald-ross |777376 |null |null |null |2d6cdb9b2aae179044da6e72ea98cb9b|7213b1f50d69c2dbcfb129f66791f808|page |create |2020-07-28T12:10:21Z|68157 |Macdonald-ross |777376 |null |7048853 |null | |page |create |2020-07-11 19:45:56.0|1014967 |TypoPedia |774283 |null |null |null |16f5f74dbf37205b0a1723a368b91c54|c1eb5bf01bbf9bfe2b2a74121928a91b|page |create |2020-07-12T19:38:44Z|1014967 |TypoPedia |774283 |null |7030466 |null | |page |create |2020-07-15 15:37:24.0|18539 |Djsasso |775021 |null |null |null |38a5226435d779d254b1208142b440b6|4384c2cda0cd7c4cc3f8bad1f2e0b075|page |create |2020-07-15T15:37:44Z|18539 |Djsasso |775021 |null |7034776 |null | |page |create |2020-07-14 11:11:39.0|null |178.120.18.16 |774653 |null |null |null |61f1d9dca6f194a431d42e3192dbca93|bcbef864ef00fd5034c4ca0e7846f27c|page |create |2020-07-14T11:55:53Z|null |178.120.18.16 |774653 |null |7032713 |null | |page |create |2020-07-27 07:44:43.0|null |123.201.126.221 |777608 |null |null |null |625e93c88c218d734b329937890de1b7|98b11a9905f9691ea08c3eb1e5966f91|page |create |2020-07-29T17:51:04Z|null |123.201.126.221 |777608 |null |7050507 |null | |page |create |2020-07-19 20:46:00.0|null |203.192.209.135 |775940 |null |null |null |d6c424e68482a92c5c2436309e29a0be|4eed541b85ca5a69896ded93b92595c0|page |create |2020-07-20T13:18:17Z|null |203.192.209.135 |775940 |null |7040222 |null | |page |create |2020-07-22 20:06:23.0|1016571 |Herɋuartda |776364 |null |null |null |7e0ca3f7a4d97ab2ada3c3f2ca0f8cdc|810498f91baa5f6a2455281e965b2804|page |create |2020-07-23T01:22:12Z|1016571 |Herɋuartda |776364 |null |7043020 |null | |page |create |2020-07-08 09:51:55.0|2133 |Eptalon |773254 |null |null |null |3c71f0a12c3a5367e01178545f90e87e|4ae09ce7d1d5455122e491e6686235f8|page |create |2020-07-08T11:02:46Z|2133 |Eptalon |773254 |null |7024657 |null | |page |create |2020-07-11 17:34:34.0|731605 |ImprovedWikiImprovment |776086 |null |null |null |0739197732ceac5e27b527d2508c96d8|b283af74182c17c8c80f9020282e7635|page |create |2020-07-21T11:27:37Z|731605 |ImprovedWikiImprovment |776086 |null |7041129 |null | |page |create |2020-07-15 10:48:43.0|null |122.174.248.82 |774933 |null |null |null |8360b580582005dde01cddc82abf7f0c|8d4e1b08b1d2b0e89ca42cb4c69aa233|page |create |2020-07-15T10:50:59Z|null |122.174.248.82 |774933 |null |7034337 |null | |page |create |2020-07-20 18:10:08.0|null |182.187.89.237 |776291 |null |null |null |6c17439fc6811a1db3d007d83eea6cdd|cec1b9e526c5ccc921f5fe6dddf6a919|page |create |2020-07-22T16:24:21Z|null |182.187.89.237 |776291 |null |7042459 |null | |page |create |2020-07-03 22:26:35.0|null |104.230.23.164 |775789 |null |null |null |f83d39c944fa8705a4d03878c145151e|1a03bc918d8eaae6b8699e511d45a580|page |create |2020-07-19T17:18:12Z|null |104.230.23.164 |775789 |null |7039384 |null | |page |create |2020-07-16 08:58:00.0|1014250 |SimpleBoi |775138 |null |null |null |94731f976067bfa4aaf55bfec3797f3d|3e048f04e91383fac587c15acef6fbeb|page |create |2020-07-16T09:00:07Z|1014250 |SimpleBoi |775138 |null |7035698 |null | |page |create |2020-07-02 14:54:45.0|null |185.67.177.137 |772010 |null |null |null |906cd1def6bbbf7a750016586581e621|7c73b9c571089cb7fc34699a66cc2ad0|page |create |2020-07-02T14:55:36Z|null |185.67.177.137 |772010 |null |7015423 |null | |page |create |2020-07-23 12:26:42.0|null |2409:4073:407:CCC5:A29D:AE4A:46C9:E628 |776790 |null |null |null |00ccd5bf3f68f41a50f7f8d4aaa134fb|57ccae11f0c3205a25b3c5427aba8d2d|page |create |2020-07-25T06:07:04Z|null |2409:4073:407:CCC5:A29D:AE4A:46C9:E628 |776790 |null |7045424 |null | |page |create |2020-07-05 00:49:56.0|null |104.230.23.164 |775798 |null |null |null |e08687276bdb7f488ea269fe7a168601|384253c82036282e23c8e19a7eedba5d|page |create |2020-07-19T17:40:06Z|null |104.230.23.164 |775798 |null |7039401 |null | |page |create |2020-07-19 20:46:18.0|null |98.204.187.175 |777223 |null |null |null |211a025c5cb013f4f6d8d43d29b22bdf|cef2acd16426fb5152f2359ff4cc78e1|page |create |2020-07-27T19:22:37Z|null |98.204.187.175 |777223 |null |7047934 |null | |page |create |2020-07-04 06:40:35.0|null |2409:4062:2E0B:CD10:49BB:E6CF:F9C5:68E9|772348 |null |null |null |9882235dc34668977e5fd53a3b1b6f5b|fbc1b16ae7539d290e507df344acc0c6|page |create |2020-07-04T17:41:59Z|null |2409:4062:2E0B:CD10:49BB:E6CF:F9C5:68E9|772348 |null |7018286 |null | |page |create |2020-07-23 16:43:41.0|null |201.157.252.176 |777198 |null |null |null |2560e393bc7c05c13e40e9401b9939c7|d747e9c48f63fdc4351ec28c2824a588|page |create |2020-07-27T17:15:15Z|null |201.157.252.176 |777198 |null |7047743 |null | |page |create |2020-07-02 23:27:01.0|null |103.78.225.18 |772668 |null |null |null |bef4a9cb9dbd66b046ee3ae0c87a9ef4|45e87713cec3b4566456aa609302476d|page |create |2020-07-06T00:35:23Z|null |103.78.225.18 |772668 |null |7021010 |null | |page |create |2020-07-13 13:09:06.0|null |82.132.238.246 |774401 |null |null |null |fe334a53259e9f372781a795a7a09f50|85a334491245ac70bcc452ed6fb1ec33|page |create |2020-07-13T13:16:14Z|null |82.132.238.246 |774401 |null |7031224 |null | |page |create |2020-07-03 22:26:35.0|null |2600:1017:B40C:71D1:70CA:95AE:5487:87C7|772648 |null |null |null |8f61a6e607f594069b21890675ce390a|f28288f13bca005a53afc9e8cc22830c|page |create |2020-07-05T23:21:44Z|null |2600:1017:B40C:71D1:70CA:95AE:5487:87C7|772648 |null |7020944 |null | |page |create |2020-07-17 17:04:12.0|null |203.192.244.66 |775707 |null |null |null |7da9908ccae63b9f76c31e150d4692ce|3b0495f56c02882d22a3f1ae866f4718|page |create |2020-07-19T11:13:40Z|null |203.192.244.66 |775707 |null |7038973 |null | +-----------------+---------------+---------------------+------------------+---------------------------------------+------------+-----------------------+----------------+------------+--------------------------------+--------------------------------+------------+----------+--------------------+-------------+---------------------------------------+-------+------------------+-----------+-------+
val jrevts = hfsnr.where("hudi_event_entity = 'revision'").join(eventsRevCreate, col("hudi_revision_id") === col("revision_id"), "outer"). selectExpr("hudi_row_key_hash", "row_key_hash", "(hudi_revision_id IS NULL OR revision_id IS NULL OR (abs(unix_timestamp(hudi_event_timestamp) - unix_timestamp(event_timestamp, 'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'')) < 5)) AS ts_match"). cache() jrevts.count // 39532 jrevts.where("hudi_row_key_hash IS NULL").count // 613 jrevts.where("row_key_hash IS NULL").count // 748 jrevts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL AND ts_match").count // 38171 jrevts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL AND NOT ts_match").count // 0 // No timestamp difference here - need to find a different for mismatch.
I noticed here that events to the same page were criss-crossing and they're actually all really close in time, here they are split into two tables and re-ordered to match (btw if you keep tables outside of the code formatting they format nicely in phab):
hudi_event_entity | hudi_event_type | hudi_event_timestamp | hudi_event_user_id | hudi_event_user_text_historical | hudi_page_id |
page | delete | 2020-07-30 16:49:10.0 | 8630 | Chenzw | 776311 |
page | delete | 2020-07-06 17:06:54.0 | 2133 | Eptalon | 676812 |
page | delete | 2020-07-03 07:54:39.0 | 22027 | Auntof6 | 285032 |
page | delete | 2020-07-06 22:32:27.0 | 18539 | Djsasso | 676812 |
page | delete | 2020-07-27 08:15:16.0 | 22027 | Auntof6 | 700118 |
page | delete | 2020-07-30 16:49:31.0 | 8630 | Chenzw | 776311 |
event_entity | event_type | event_timestamp | event_user_id | event_user_text_historical | page_id | revision_id |
page | delete | 2020-07-30T16:49:32Z | 8630 | Chenzw | 776311 | 7042637 |
page | delete | 2020-07-06T17:06:55Z | 2133 | Eptalon | 676812 | 7021769 |
page | delete | 2020-07-03T07:54:49Z | 22027 | Auntof6 | 285032 | 7013107 |
page | delete | 2020-07-06T22:32:27Z | 18539 | Djsasso | 676812 | 7021769 |
page | delete | 2020-07-27T08:15:26Z | 22027 | Auntof6 | 700118 | 6577725 |
page | delete | 2020-07-30T16:49:10Z | 8630 | Chenzw | 776311 | 7042637 |
So the main problem seems to be that hudi_revision_id is null and the times are off by up to 30 seconds or so.
I had no other ideas on the other two checks yet.
Last news on revision_create for simplewiki 2020-07:
Final note.
I have found 2 problems with mediawiki-events (over simplewiki only, using mediawiki-history as a baseline):
I have found new features to be added to events to allow to recreate mediawiki-history using events:
Re-read and re-thought. All sounds good, it's a solid plan for the next couple of quarters. And I especially like the insight from T262261#6478831 that some failed joins are just due to hidden performer information. This will be good to keep in mind and add an event for, because the log events won't cover it.