To mimic mediawiki-history using events, we need to check that the underlying data provides the same value.
Check whether mediawiki production event data is equivalent to mediawiki-history data over a month such an incremental update can be built.
To mimic mediawiki-history using events, we need to check that the underlying data provides the same value.
Check whether mediawiki production event data is equivalent to mediawiki-history data over a month such an incremental update can be built.
| Status | Subtype | Assigned | Task | ||
|---|---|---|---|---|---|
| Open | None | T204950 Public Edit Data Lake: Mediawiki history snapshots available in SQL data store to cloud (labs) users | |||
| Open | None | T258511 Data Lake incremental Data Updates | |||
| Resolved | None | T231938 Get "edits hourly" on a daily basis | |||
| Resolved | Milimetric | T258532 [SPIKE] Prototype of incremental updates for mediawiki history for simplewiki , including reverts using apache hudi | |||
| Resolved | JAllemandou | T262261 Check whether mediawiki production event data is equivalent to mediawiki-history data over a month | |||
| Resolved | Milimetric | T215001 Revisions missing from mediawiki_revision_create | |||
| Resolved | None | T280538 Capture rev_is_revert event data in a stream different than mediawiki.revision-create |
Bug found in mediawiki_revision_create events: some revision-create have multiple events (same revision-id, multiple event-requests) - Tracked in T262203.
CODE (For simplewiki only)
import org.apache.spark.sql.SaveMode
import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.config.HoodieWriteConfig
val tableNameSimplewiki = "joal.test_hudi_mwh_simplewiki"
val basePathSimplewiki = "hdfs:///user/joal/test_hudi/mwh_simplewiki"
-- Load hudi dataframe
val hfs = spark.read.format("hudi").
// Set table name
option(HoodieWriteConfig.TABLE_NAME, tableNameSimplewiki).
// Read snapshot
option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
option("hoodie.index.type", "BLOOM").
//The number of wildcard asterisks here must be one greater than the number of partition
load(basePathSimplewiki + "/*/*")
-- restrict data to 2020-07 and keep only id-related fields
val hfsn = hfs.where("snapshot='2020-07' AND year(event_timestamp) = 2020 and month(event_timestamp) = 7").
select(
"event_entity",
"event_type",
"event_timestamp",
"event_user_id",
"event_user_text_historical",
"page_id",
"page_artificial_id",
"revision_id",
"user_id",
"row_key_hash").
cache
-- Rename columns of the dataframe with prefix 'hudi_' to faciliate joining
val hfsnr = hfsn.columns.foldLeft(hfsn){ (acc, names) => acc.withColumnRenamed(names, "hudi_".concat(names))}.cache()
hfsn.groupBy(col("event_entity"), col("event_type")).count.sort("event_entity", "event_type").show(100, false)
/*
+------------+-----------+-----+
|event_entity|event_type |count|
+------------+-----------+-----+
|page |create |5775 |
|page |create-page|5801 |
|page |delete |903 |
|page |move |261 |
|page |restore |8 |
|revision |create |38906|
|user |alterblocks|16 |
|user |altergroups|12 |
|user |create |5656 |
|user |rename |258 |
+------------+-----------+-----+
*/val eventsRevCreate = spark.sql("""
SELECT
md5(concat(
database, '|',
'revision', '|',
'create', '|',
-- timestamps of events needs special parsing as they are not sql-timestamp formatted
-- Also, using rev_timestamp instead of meta.dt here as it is available
CAST(unix_timestamp(rev_timestamp, 'yyyy-MM-dd\'T\'HH:mm:ss\'Z\'') AS STRING), '|',
coalesce(performer.user_id, 0), '|',
coalesce(performer.user_text, ''), '|',
coalesce(page_id, 0), '|',
'', '|', -- no page_artificial_id
coalesce(rev_id, 0), '|',
0 -- no user_id
)) as row_key_hash,
'revision' AS event_entity,
'create' AS event_type,
rev_timestamp AS event_timestamp,
performer.user_id AS event_user_id,
performer.user_text as event_user_text_historical,
page_id,
NULL AS page_artificial_id,
rev_id as revision_id,
NULL as user_id
from event.mediawiki_revision_create
where year = 2020 and (
(month = 7)
-- Expanding boundaries in case there are
OR (month = 6 AND day = 30)
OR (month = 8 AND day = 1))
and database = 'simplewiki'
and rev_timestamp >= '2020-07'
and rev_timestamp < '2020-08'
""").
distinct. // This distinct is important as there are duplicates
cache()
eventsRevCreate.count
// 38784 (vs 38906)
-- Now joining on row-key-hash and checking
val jrev = hfsnr.where("hudi_event_entity = 'revision'").join(eventsRevCreate, col("hudi_row_key_hash") === col("row_key_hash"), "outer").cache()
jrev.count
// 39542
jrev.where("hudi_row_key_hash IS NULL").count
// 636
jrev.where("row_key_hash IS NULL").count
// 758
jrev.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL").count
// 38148
// Most rows share the same hash-key - Looks goodval eventsPageCreate = spark.sql("""
SELECT
md5(concat(
database, '|',
'page', '|',
'create', '|',
-- timestamps of events needs special parsing as they are not sql-timestamp formatted
-- Also, using rev_timestamp instead of meta.dt here as it is available
CAST(unix_timestamp(rev_timestamp, 'yyyy-MM-dd\'T\'HH:mm:ss\'Z\'') AS STRING), '|',
coalesce(performer.user_id, 0), '|',
coalesce(performer.user_text, ''), '|',
coalesce(page_id, 0), '|',
'', '|', -- page_artificial_id
0, '|', -- no rev_id for page-create in mwh while there is one in event!
0 -- user_id
)) as row_key_hash,
'page' AS event_entity,
'create' AS event_type,
rev_timestamp AS event_timestamp,
performer.user_id AS event_user_id,
performer.user_text as event_user_text_historical,
page_id,
NULL AS page_artificial_id,
rev_id as revision_id,
NULL as user_id
from event.mediawiki_page_create
where year = 2020 and month = 7
and database = 'simplewiki'
""").cache()
eventsPageCreate.count
// 5918 (vs 5801)
-- Now joining on row-key-hash and checking
val jpc = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'create'").join(eventsPageCreate, col("hudi_row_key_hash") === col("row_key_hash"), "outer").cache()
jpc.count
// 6238
jpc.where("hudi_row_key_hash IS NULL").count
// 463
jpc.where("row_key_hash IS NULL").count
// 320
jpc.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL").count
// 5455
// Most rows also share the same hash-key, but the proportion of mismatch is high - investigation is important for that oneval eventsPageDelete = spark.sql("""
SELECT
md5(concat(
database, '|',
'page', '|',
'delete', '|',
-- timestamps of events needs special parsing as they are not sql-timestamp formatted
CAST(unix_timestamp(meta.dt, 'yyyy-MM-dd\'T\'HH:mm:ss\'Z\'') AS STRING),'|',
coalesce(performer.user_id, 0), '|',
coalesce(performer.user_text, ''), '|',
coalesce(page_id, 0), '|',
'', '|', -- page_artificial_id
0, '|', -- no rev_id in mwh
0 -- user_id
)) as row_key_hash,
'page' AS event_entity,
'delete' AS event_type,
meta.dt AS event_timestamp,
performer.user_id AS event_user_id,
performer.user_text as event_user_text_historical,
page_id,
NULL AS page_artificial_id,
rev_id as revision_id,
NULL as user_id
from event.mediawiki_page_delete
where year = 2020 and month = 7
and database = 'simplewiki'
""").cache()
eventsPageDelete.count
// 915 (vs 903)
-- Now joining on row-key-hash and checking
val jpd = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'delete'").join(eventsPageDelete, col("hudi_row_key_hash") === col("row_key_hash"), "outer").cache()
jpd.count
// 1021
jpd.where("hudi_row_key_hash IS NULL").count
// 118
jpd.where("row_key_hash IS NULL").count
// 106
jpd.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL").count
// 797
// Most rows also share the same hash-key, but the proportion of mismatch is relatively high - investigation is important for that one
// NOTE: I checked the mediawiki_page_suppress table and it was empty - maybe we'll need it for other wikis?val eventsPageRestore = spark.sql("""
SELECT
md5(concat(
database, '|',
'page', '|',
'restore', '|',
-- timestamps of events needs special parsing as they are not sql-timestamp formatted
CAST(unix_timestamp(meta.dt, 'yyyy-MM-dd\'T\'HH:mm:ss\'Z\'') AS STRING),'|',
coalesce(performer.user_id, 0), '|',
coalesce(performer.user_text, ''), '|',
coalesce(page_id, 0), '|',
'', '|', -- page_artificial_id
0, '|', -- no revId in mwh
0 -- user_id
)) as row_key_hash,
'page' AS event_entity,
'restore' AS event_type,
meta.dt AS event_timestamp,
performer.user_id AS event_user_id,
performer.user_text as event_user_text_historical,
page_id,
NULL AS page_artificial_id,
rev_id as revision_id,
NULL as user_id
from event.mediawiki_page_undelete
where year = 2020 and month = 7
and database = 'simplewiki'
""").cache()
eventsPageRestore.count
// 8 (vs 8)
-- Now joining on row-key-hash and checking
val jpr = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'restore'").join(eventsPageRestore, col("hudi_row_key_hash") === col("row_key_hash"), "outer").cache()
jpr.count
// 8
jpr.where("hudi_row_key_hash IS NULL").count
// 0
jpr.where("row_key_hash IS NULL").count
// 0
jpr.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL").count
// 8
// Full match - \o/val eventsPageMove = spark.sql("""
SELECT
md5(concat(
database, '|',
'page', '|',
'move', '|',
-- timestamps of events needs special parsing as they are not sql-timestamp formatted
CAST(unix_timestamp(meta.dt, 'yyyy-MM-dd\'T\'HH:mm:ss\'Z\'') AS STRING),'|',
coalesce(performer.user_id, 0), '|',
coalesce(performer.user_text, ''), '|',
coalesce(page_id, 0), '|',
'', '|', -- page_artificial_id
0, '|', -- no rev_id in mwh
0 -- user_id
)) as row_key_hash,
'page' AS event_entity,
'move' AS event_type,
meta.dt AS event_timestamp,
performer.user_id AS event_user_id,
performer.user_text as event_user_text_historical,
page_id,
NULL AS page_artificial_id,
rev_id as revision_id,
NULL as user_id
from event.mediawiki_page_move
where year = 2020 and month = 7
and database = 'simplewiki'
""").cache()
eventsPageMove.count
// 263 (vs 261)
val jpm = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'move'").join(eventsPageMove, col("hudi_row_key_hash") === col("row_key_hash"), "outer").cache()
jpm.count
// 476
jpm.where("hudi_row_key_hash IS NULL").count
// 215
jpm.where("row_key_hash IS NULL").count
// 213
jpm.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL").count
// 48
// More not-join than joined - investiagtion
val jpmts = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'move'").join(eventsPageMove, col("hudi_page_id") === col("page_id"), "outer").
where("hudi_page_id IS NULL OR page_id IS NULL OR (abs(unix_timestamp(hudi_event_timestamp) - unix_timestamp(event_timestamp, 'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'')) < 5)").
cache()
jpmts.count
// 263
jpmts.where("hudi_row_key_hash IS NULL").count
// 2
jpmts.where("row_key_hash IS NULL").count
// 0
jpmts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL").count
// 261
// The problem comes from timestamps - the event-meta.dt timestamp is different from the event_timestamp we get from the logging table...
// It will be interesting to check the other mismatches for this case
// Anyway - how can we define a primary-key????Checks for discrenpencies (using previous comment setup):
// Try to match events using timestamps
val jpdts = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'delete'").join(eventsPageDelete, col("hudi_page_id") === col("page_id"), "outer").
selectExpr("hudi_row_key_hash", "row_key_hash", "(hudi_page_id IS NULL OR page_id IS NULL OR (abs(unix_timestamp(hudi_event_timestamp) - unix_timestamp(event_timestamp, 'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'')) < 5)) AS ts_match").
cache()
jpdts.count
// 919
jpdts.where("hudi_row_key_hash IS NULL").count
// 12
jpdts.where("row_key_hash IS NULL").count
// 0
jpdts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL AND ts_match").count
// 901 -- Almost all match within 5s
jpdts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL AND not ts_match").count
// 6
// Looking at the 6 events that don't match with a timestamp within 5s (not very reasable sorry :():
hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'delete'").join(eventsPageDelete, col("hudi_page_id") === col("page_id"), "outer").
where("hudi_page_id IS NOT NULL AND page_id IS NOT NULL AND (abs(unix_timestamp(hudi_event_timestamp) - unix_timestamp(event_timestamp, 'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'')) >= 5)").
show(100, false)
+-----------------+---------------+---------------------+------------------+-------------------------------+------------+-----------------------+----------------+------------+--------------------------------+--------------------------------+------------+----------+--------------------+-------------+--------------------------+-------+------------------+-----------+-------+
|hudi_event_entity|hudi_event_type|hudi_event_timestamp |hudi_event_user_id|hudi_event_user_text_historical|hudi_page_id|hudi_page_artificial_id|hudi_revision_id|hudi_user_id|hudi_row_key_hash |row_key_hash |event_entity|event_type|event_timestamp |event_user_id|event_user_text_historical|page_id|page_artificial_id|revision_id|user_id|
+-----------------+---------------+---------------------+------------------+-------------------------------+------------+-----------------------+----------------+------------+--------------------------------+--------------------------------+------------+----------+--------------------+-------------+--------------------------+-------+------------------+-----------+-------+
|page |delete |2020-07-30 16:49:10.0|8630 |Chenzw |776311 |null |null |null |603321d3c0b792c959095627067d71ac|609894b5f351ef92eb2ec0e0d1c6dc5c|page |delete |2020-07-30T16:49:32Z|8630 |Chenzw |776311 |null |7042637 |null |
|page |delete |2020-07-06 17:06:54.0|2133 |Eptalon |676812 |null |null |null |9c21a46ffa5d77f131e03e129421e5e9|7c0b08708a3affb76173cd3a888be8bc|page |delete |2020-07-06T22:32:27Z|18539 |Djsasso |676812 |null |7021769 |null |
|page |delete |2020-07-03 07:54:39.0|22027 |Auntof6 |285032 |null |null |null |503fa76a1763451b1d6b9472436738f8|3387c53369dc7af5e68b2377f4a3fda8|page |delete |2020-07-03T07:54:49Z|22027 |Auntof6 |285032 |null |7013107 |null |
|page |delete |2020-07-06 22:32:27.0|18539 |Djsasso |676812 |null |null |null |7c0b08708a3affb76173cd3a888be8bc|6d244a642406d9852d186a124322ccd6|page |delete |2020-07-06T17:06:55Z|2133 |Eptalon |676812 |null |7021769 |null |
|page |delete |2020-07-27 08:15:16.0|22027 |Auntof6 |700118 |null |null |null |1eaec820d3aa53746824410803aeb70b|27ae3b48edab900f71c18cdc1023a72e|page |delete |2020-07-27T08:15:26Z|22027 |Auntof6 |700118 |null |6577725 |null |
|page |delete |2020-07-30 16:49:31.0|8630 |Chenzw |776311 |null |null |null |cdc521063916ffe15f3deecfcf362b32|603321d3c0b792c959095627067d71ac|page |delete |2020-07-30T16:49:10Z|8630 |Chenzw |776311 |null |7042637 |null |
+-----------------+---------------+---------------------+------------------+-------------------------------+------------+-----------------------+----------------+------------+--------------------------------+--------------------------------+------------+----------+--------------------+-------------+--------------------------+-------+------------------+-----------+-------+val jpcts = hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'create'").join(eventsPageCreate, col("hudi_page_id") === col("page_id"), "outer").
selectExpr("hudi_row_key_hash", "row_key_hash", "(hudi_page_id IS NULL OR page_id IS NULL OR (abs(unix_timestamp(hudi_event_timestamp) - unix_timestamp(event_timestamp, 'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'')) < 5)) AS ts_match").
cache()
jpcts.count
// 6170
jpcts.where("hudi_row_key_hash IS NULL").count
// 395 -- Still a lot of mismatch :(
jpcts.where("row_key_hash IS NULL").count
// 252 -- Here as well :( Maybe page_is_deleted?
jpcts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL AND ts_match").count
// 5455
jpcts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL AND NOT ts_match").count
// 68
hfsnr.where("hudi_event_entity = 'page' and hudi_event_type = 'create'").join(eventsPageCreate, col("hudi_page_id") === col("page_id"), "outer").
where("hudi_page_id IS NOT NULL AND page_id IS NOT NULL AND (abs(unix_timestamp(hudi_event_timestamp) - unix_timestamp(event_timestamp, 'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'')) >= 5)").
show(100, false)
+-----------------+---------------+---------------------+------------------+---------------------------------------+------------+-----------------------+----------------+------------+----------------------------
----+--------------------------------+------------+----------+--------------------+-------------+---------------------------------------+-------+------------------+-----------+-------+
|hudi_event_entity|hudi_event_type|hudi_event_timestamp |hudi_event_user_id|hudi_event_user_text_historical |hudi_page_id|hudi_page_artificial_id|hudi_revision_id|hudi_user_id|hudi_row_key_hash
|row_key_hash |event_entity|event_type|event_timestamp |event_user_id|event_user_text_historical |page_id|page_artificial_id|revision_id|user_id|
+-----------------+---------------+---------------------+------------------+---------------------------------------+------------+-----------------------+----------------+------------+----------------------------
----+--------------------------------+------------+----------+--------------------+-------------+---------------------------------------+-------+------------------+-----------+-------+
|page |create |2020-07-01 06:35:03.0|null |106.205.11.40 |771740 |null |null |null |1af5a34c7f74f606eefaed5e3078
749c|85dd947dd13bcf553ad007384f4f533e|page |create |2020-07-01T06:42:17Z|null |106.205.11.40 |771740 |null |7013623 |null |
|page |create |2020-07-01 15:30:29.0|null |69.115.47.9 |771875 |null |null |null |5e23915d9d3cd0b3916475aa6ef9
4608|e8ddf68dcaa93e3d63e9a35e75d1944e|page |create |2020-07-01T17:45:08Z|null |69.115.47.9 |771875 |null |7014279 |null |
|page |create |2020-07-23 14:35:05.0|null |null |777039 |null |null |null |9d7ef58cf29caa3a0775fd15bb2e
f744|2399cb71073a5ccbab4dcd13c50026a8|page |create |2020-07-26T19:07:46Z|null |2604:2000:1481:8600:B905:1CA2:9C84:CD55|777039 |null |7046945 |null |
|page |create |2020-07-10 12:26:50.0|2133 |Eptalon |773914 |null |null |null |b8624c1f11a13cc7f35f0e8bbfc8
8c92|523bae9ad3e22bfbc56e09266ab4ce82|page |create |2020-07-10T21:32:45Z|2133 |Eptalon |773914 |null |7027926 |null |
|page |create |2020-07-21 12:45:22.0|null |216.11.189.203 |776157 |null |null |null |f22116e5cb8a679c78d9c2b29a95
9956|9ec0a18f0df0e5644297af9ee3825488|page |create |2020-07-21T19:17:35Z|null |216.11.189.203 |776157 |null |7041575 |null |
|page |create |2020-07-06 21:51:08.0|18539 |Djsasso |772897 |null |null |null |2488132ed5a03f92f3dd0f96c869
038a|fbce1e48d22c49395f0d016a5a7c0041|page |create |2020-07-06T22:07:14Z|18539 |Djsasso |772897 |null |7022504 |null |
|page |create |2020-07-23 15:45:08.0|1016980 |Kiagus Muhammad Hanif Sirua |776615 |null |null |null |4ec5d85bd29325b10cd41472b279
c787|fc073744a9e20d81b253dcd77a737999|page |create |2020-07-24T00:26:12Z|1016980 |Kiagus Muhammad Hanif Sirua |776615 |null |7044338 |null |
|page |create |2020-07-23 03:30:41.0|null |null |776508 |null |null |null |0548a8babae133796f3d3868d87a
5e2a|a9c1bedc654332524b39154394b90929|page |create |2020-07-23T14:34:26Z|null |2604:2000:1481:8600:1859:D4A4:D541:479 |776508 |null |7043672 |null |
|page |create |2020-07-11 17:19:58.0|1015017 |Ruchi1403 |774338 |null |null |null |3878527da8b548ce60bbce55e8af
b7ae|6225e236ffb2807d6ab449a3a9cfc3d0|page |create |2020-07-13T03:47:27Z|1015017 |Ruchi1403 |774338 |null |7030827 |null |
|page |create |2020-07-26 12:10:11.0|null |113.185.46.239 |777020 |null |null |null |8f7447a20c361c8565e664eeb745
1ba0|096c16c74e9eba676f1b64fc2c267352|page |create |2020-07-26T15:34:36Z|null |113.185.46.239 |777020 |null |7046740 |null |
|page |create |2020-07-23 14:46:00.0|null |2804:431:9705:123A:44AD:402D:CC1A:7CBD |776936 |null |null |null |8d7a19c81b2c5f5cdf7f2d765312
529d|3538d42cac905e750bf90e97d1edcf4c|page |create |2020-07-25T21:54:43Z|null |2804:431:9705:123A:44AD:402D:CC1A:7CBD |776936 |null |7046241 |null |
|page |create |2020-07-03 12:06:50.0|null |188.28.164.70 |773753 |null |null |null |ee98d96c75bb00dd135eac9a9ca3
1eb3|4ca8c4707f307eea229ac4ba0e416f6e|page |create |2020-07-09T21:04:04Z|null |188.28.164.70 |773753 |null |7026991 |null |
|page |create |2020-07-08 14:58:40.0|null |2600:100B:B01F:5D1D:8581:E76B:C057:E33E|773293 |null |null |null |387146ff9c469610217f15d7f096
c1d3|30049538a0307d04b81ff5e7e74964ae|page |create |2020-07-08T14:59:02Z|null |2600:100B:B01F:5D1D:8581:E76B:C057:E33E|773293 |null |7024926 |null |
|page |create |2020-07-27 07:31:36.0|null |96.232.204.207 |777217 |null |null |null |ff801bb43ae1937581a9aab7c850
a9e4|b96e012225dddafc1aec916db50321fe|page |create |2020-07-27T18:44:23Z|null |96.232.204.207 |777217 |null |7047883 |null |
|page |create |2020-07-24 19:34:57.0|null |103.133.200.3 |776712 |null |null |null |7f2b86db57831d0d0aa21a2e48f5
6591|b74a026b77098ede7dd0808112191551|page |create |2020-07-24T19:36:47Z|null |103.133.200.3 |776712 |null |7044995 |null |
|page |create |2020-07-05 06:25:43.0|731605 |ImprovedWikiImprovment |777841 |null |null |null |f4d73b303deb6471e7a369d55ee2
9b1d|86f99e09d43584f9fe21b991d0b9d603|page |create |2020-07-31T07:35:54Z|731605 |ImprovedWikiImprovment |777841 |null |7052181 |null |
|page |create |2020-07-11 21:11:25.0|null |183.171.158.2 |774113 |null |null |null |064c14ba51b7ee801702ae6e5b4b
28ca|cdc7c074d18446e33aa2155ab7f0c120|page |create |2020-07-11T21:12:34Z|null |183.171.158.2 |774113 |null |7029279 |null |
|page |create |2020-07-18 12:46:14.0|null |91.186.250.14 |776269 |null |null |null |6bb0a417d47f4b460d912da3d806
8d21|2b16d0fabfbe0b87cff878c7e4443930|page |create |2020-07-22T10:43:02Z|null |91.186.250.14 |776269 |null |7042178 |null |
|page |create |2020-07-08 14:59:16.0|null |2600:100B:B01F:5D1D:8581:E76B:C057:E33E|773299 |null |null |null |e5fd3a0068ab47d04e081c11b6a6
7175|beb557012d2168f665b4296a909545cf|page |create |2020-07-08T15:04:12Z|null |2600:100B:B01F:5D1D:8581:E76B:C057:E33E|773299 |null |7024934 |null |
|page |create |2020-07-27 16:46:11.0|null |2A02:C7D:5D63:1300:99F0:483B:4333:E8B3 |777199 |null |null |null |2b9a9f02cb903ff6dbb3b7f190c9
4ab5|95d89d5c9301d2ee9578966eb28e1044|page |create |2020-07-27T17:15:36Z|null |2A02:C7D:5D63:1300:99F0:483B:4333:E8B3 |777199 |null |7047744 |null |
|page |create |2020-07-13 13:18:43.0|null |82.132.238.246 |774405 |null |null |null |e4cce9e33d963daf2bc87abb3ec9
82ce|58bcc7da0759e051dc145553229e800d|page |create |2020-07-13T13:21:06Z|null |82.132.238.246 |774405 |null |7031240 |null |
|page |create |2020-07-27 19:58:00.0|731605 |ImprovedWikiImprovment |777745 |null |null |null |dcc26c5db9f510a4ba159f250499
12fa|7952f4febe7aeb2bc7cd9d20a643884e|page |create |2020-07-30T19:32:59Z|731605 |ImprovedWikiImprovment |777745 |null |7051600 |null |
|page |create |2020-07-05 00:49:56.0|null |null |775790 |null |null |null |3be45c24b4244016246db70f73c1
b0bc|8a8c0b95fc5c9ee889ff5871ef0d2e98|page |create |2020-07-19T17:19:30Z|null |104.230.23.164 |775790 |null |7039385 |null |
|page |create |2020-07-01 18:14:17.0|2133 |Eptalon |771890 |null |null |null |de31de815585d8b3771760a668d2
78df|771dae02392f7693b57d1cb0bf805046|page |create |2020-07-01T18:20:26Z|2133 |Eptalon |771890 |null |7014335 |null |
|page |create |2020-07-16 12:18:19.0|null |121.209.165.23 |775164 |null |null |null |3ce07de39056c2bed378ba30095debe5|c5c32f73848033043dca2d4cf9e24440|page |create |2020-07-16T12:27:46Z|null |121.209.165.23 |775164 |null |7035848 |null |
|page |create |2020-07-21 19:21:59.0|null |216.11.189.203 |776701 |null |null |null |90b0516b0a8d8c7a4f70f17b4eb24c48|bb858788bdff445f7492e1efcc7757e4|page |create |2020-07-24T19:09:43Z|null |216.11.189.203 |776701 |null |7044924 |null |
|page |create |2020-07-05 06:25:09.0|22027 |Auntof6 |777114 |null |null |null |eb372e8ade0219df45adc9823263baa0|5c6d1cc2785a4c1d3a68ebc1946668c0|page |create |2020-07-27T08:26:10Z|22027 |Auntof6 |777114 |null |7047405 |null |
|page |create |2020-07-10 21:07:38.0|2133 |Eptalon |773910 |null |null |null |89c5d3759bec217e4c62dae85ca33914|d42083b8f1333681c153d4a776de5b9e|page |create |2020-07-10T21:10:57Z|2133 |Eptalon |773910 |null |7027906 |null |
|page |create |2020-07-15 10:47:35.0|null |27.109.115.145 |777500 |null |null |null |3244fb79ba82df3b29d5b0f9514aeaa6|e018d8ea407c13320d805b9538b59074|page |create |2020-07-29T05:51:15Z|null |27.109.115.145 |777500 |null |7049825 |null |
|page |create |2020-07-05 06:25:21.0|22027 |Auntof6 |777112 |null |null |null |92843e6178c454d81a4053b060278cd5|42087db116876f0ba84a7b4ea97d532c|page |create |2020-07-27T08:22:07Z|22027 |Auntof6 |777112 |null |7047398 |null |
|page |create |2020-07-27 19:19:53.0|null |2604:2000:1481:8600:3873:EF4F:FFFE:8E43|777377 |null |null |null |f2cde5c18ef90e9a9a5006bee0844fae|5119076e8337686d73961a52f1e07ebc|page |create |2020-07-28T12:20:18Z|null |2604:2000:1481:8600:3873:EF4F:FFFE:8E43|777377 |null |7048861 |null |
|page |create |2020-07-03 13:20:00.0|2133 |Eptalon |772205 |null |null |null |54c49ca4e2f02a60af6661f07de22315|d19e94fb5fb4c3484fd820eb7b88163c|page |create |2020-07-03T19:37:05Z|2133 |Eptalon |772205 |null |7017062 |null |
|page |create |2020-07-13 03:49:22.0|null |2409:4064:2E26:2F1C:7C9B:ABB2:48B9:AC3F|776024 |null |null |null |0f8e38d897b49882966c6ee162d81bbc|c72659d87c8361ebd9af02476b696be7|page |create |2020-07-21T04:02:19Z|null |2409:4064:2E26:2F1C:7C9B:ABB2:48B9:AC3F|776024 |null |7040796 |null |
|page |create |2020-07-24 18:03:36.0|null |2604:2000:1481:8600:B905:1CA2:9C84:CD55|777023 |null |null |null |dbe0bc54000a59d181c297fa376be779|6454bc4b3e6e35f8a3e2c98263b86326|page |create |2020-07-26T16:27:26Z|null |2604:2000:1481:8600:B905:1CA2:9C84:CD55|777023 |null |7046779 |null |
|page |create |2020-07-11 18:32:37.0|2133 |Eptalon |774095 |null |null |null |3f3baae6338375f179bad1e316cac423|953b40ec5179d4da19c5790c73013535|page |create |2020-07-11T19:28:59Z|2133 |Eptalon |774095 |null |7029003 |null |
|page |create |2020-07-04 18:06:57.0|null |2409:4062:2E07:BB27:931C:7845:371D:3337|772352 |null |null |null |2d28ce12a7bf37d7398fc4acc3cb4dd1|458be40bbc3bef5134d7e5b67dcc9547|page |create |2020-07-04T18:20:54Z|null |2409:4062:2E07:BB27:931C:7845:371D:3337|772352 |null |7018319 |null |
|page |create |2020-07-19 19:32:09.0|2133 |Eptalon |775814 |null |null |null |d4271d9c478f57be07302df4958c9727|4d2a77953e760afc6e29aa41e2d753e5|page |create |2020-07-19T20:38:50Z|2133 |Eptalon |775814 |null |7039515 |null |
|page |create |2020-07-17 17:05:51.0|null |2A02:C7F:3E57:2000:8494:F116:FC3C:A6B |775448 |null |null |null |a1a778d6124e23a4e25cb02b4ba130ff|10f069f3c43d297fe4a07cd07ea1782e|page |create |2020-07-17T22:26:57Z|null |2A02:C7F:3E57:2000:8494:F116:FC3C:A6B |775448 |null |7037669 |null |
|page |create |2020-07-05 06:25:33.0|935234 |Naddruf |774441 |null |null |null |4b6f6bbfa64e598f62e291f3d203514e|60054cfcab3bef1978619aa4b017615b|page |create |2020-07-13T17:12:31Z|935234 |Naddruf |774441 |null |7031620 |null |
|page |create |2020-07-14 12:22:38.0|null |86.130.163.98 |776469 |null |null |null |4c8a1c6116a6acafe823a3a84a62f1f1|d08c6286f86f20c219e6ee570cdf7944|page |create |2020-07-23T11:15:05Z|null |86.130.163.98 |776469 |null |7043443 |null |
|page |create |2020-07-05 00:49:56.0|null |104.230.23.164 |775787 |null |null |null |bb14ab1d8aac2a4d05f5dfc758b386a4|40b87eaf4120d8ffef03fcdd65b836af|page |create |2020-07-19T17:13:16Z|null |104.230.23.164 |775787 |null |7039379 |null |
|page |create |2020-07-25 20:31:08.0|51686 |حُميد |777078 |null |null |null |7b19e3fdddeadb3efc6b6ddb6f57cb23|f718dfe205bd2bb881be443b60ef0f35|page |create |2020-07-26T23:31:04Z|51686 |حُميد |777078 |null |7047146 |null |
|page |create |2020-07-03 16:54:49.0|1011679 |UN175 |776333 |null |null |null |fec95017672c25711166f1643b6c14ab|786f2bc64b7507aa285c60bdc79a9135|page |create |2020-07-22T20:44:13Z|1011679 |UN175 |776333 |null |7042802 |null |
|page |create |2020-07-24 19:16:27.0|null |68.55.103.35 |777233 |null |null |null |f84ad45d616c80733e978d6feb6615f6|38f75c81a012ba5c62c1613afe82f210|page |create |2020-07-27T20:33:43Z|null |68.55.103.35 |777233 |null |7048052 |null |
|page |create |2020-07-14 10:59:59.0|null |178.120.18.16 |774651 |null |null |null |cc7db8f3fce3707cdbda4d9e0f0b774c|5a984bca92c73aff3a8aa8179b1aabd0|page |create |2020-07-14T11:53:20Z|null |178.120.18.16 |774651 |null |7032708 |null |
|page |create |2020-07-28 12:07:01.0|68157 |Macdonald-ross |777376 |null |null |null |2d6cdb9b2aae179044da6e72ea98cb9b|7213b1f50d69c2dbcfb129f66791f808|page |create |2020-07-28T12:10:21Z|68157 |Macdonald-ross |777376 |null |7048853 |null |
|page |create |2020-07-11 19:45:56.0|1014967 |TypoPedia |774283 |null |null |null |16f5f74dbf37205b0a1723a368b91c54|c1eb5bf01bbf9bfe2b2a74121928a91b|page |create |2020-07-12T19:38:44Z|1014967 |TypoPedia |774283 |null |7030466 |null |
|page |create |2020-07-15 15:37:24.0|18539 |Djsasso |775021 |null |null |null |38a5226435d779d254b1208142b440b6|4384c2cda0cd7c4cc3f8bad1f2e0b075|page |create |2020-07-15T15:37:44Z|18539 |Djsasso |775021 |null |7034776 |null |
|page |create |2020-07-14 11:11:39.0|null |178.120.18.16 |774653 |null |null |null |61f1d9dca6f194a431d42e3192dbca93|bcbef864ef00fd5034c4ca0e7846f27c|page |create |2020-07-14T11:55:53Z|null |178.120.18.16 |774653 |null |7032713 |null |
|page |create |2020-07-27 07:44:43.0|null |123.201.126.221 |777608 |null |null |null |625e93c88c218d734b329937890de1b7|98b11a9905f9691ea08c3eb1e5966f91|page |create |2020-07-29T17:51:04Z|null |123.201.126.221 |777608 |null |7050507 |null |
|page |create |2020-07-19 20:46:00.0|null |203.192.209.135 |775940 |null |null |null |d6c424e68482a92c5c2436309e29a0be|4eed541b85ca5a69896ded93b92595c0|page |create |2020-07-20T13:18:17Z|null |203.192.209.135 |775940 |null |7040222 |null |
|page |create |2020-07-22 20:06:23.0|1016571 |Herɋuartda |776364 |null |null |null |7e0ca3f7a4d97ab2ada3c3f2ca0f8cdc|810498f91baa5f6a2455281e965b2804|page |create |2020-07-23T01:22:12Z|1016571 |Herɋuartda |776364 |null |7043020 |null |
|page |create |2020-07-08 09:51:55.0|2133 |Eptalon |773254 |null |null |null |3c71f0a12c3a5367e01178545f90e87e|4ae09ce7d1d5455122e491e6686235f8|page |create |2020-07-08T11:02:46Z|2133 |Eptalon |773254 |null |7024657 |null |
|page |create |2020-07-11 17:34:34.0|731605 |ImprovedWikiImprovment |776086 |null |null |null |0739197732ceac5e27b527d2508c96d8|b283af74182c17c8c80f9020282e7635|page |create |2020-07-21T11:27:37Z|731605 |ImprovedWikiImprovment |776086 |null |7041129 |null |
|page |create |2020-07-15 10:48:43.0|null |122.174.248.82 |774933 |null |null |null |8360b580582005dde01cddc82abf7f0c|8d4e1b08b1d2b0e89ca42cb4c69aa233|page |create |2020-07-15T10:50:59Z|null |122.174.248.82 |774933 |null |7034337 |null |
|page |create |2020-07-20 18:10:08.0|null |182.187.89.237 |776291 |null |null |null |6c17439fc6811a1db3d007d83eea6cdd|cec1b9e526c5ccc921f5fe6dddf6a919|page |create |2020-07-22T16:24:21Z|null |182.187.89.237 |776291 |null |7042459 |null |
|page |create |2020-07-03 22:26:35.0|null |104.230.23.164 |775789 |null |null |null |f83d39c944fa8705a4d03878c145151e|1a03bc918d8eaae6b8699e511d45a580|page |create |2020-07-19T17:18:12Z|null |104.230.23.164 |775789 |null |7039384 |null |
|page |create |2020-07-16 08:58:00.0|1014250 |SimpleBoi |775138 |null |null |null |94731f976067bfa4aaf55bfec3797f3d|3e048f04e91383fac587c15acef6fbeb|page |create |2020-07-16T09:00:07Z|1014250 |SimpleBoi |775138 |null |7035698 |null |
|page |create |2020-07-02 14:54:45.0|null |185.67.177.137 |772010 |null |null |null |906cd1def6bbbf7a750016586581e621|7c73b9c571089cb7fc34699a66cc2ad0|page |create |2020-07-02T14:55:36Z|null |185.67.177.137 |772010 |null |7015423 |null |
|page |create |2020-07-23 12:26:42.0|null |2409:4073:407:CCC5:A29D:AE4A:46C9:E628 |776790 |null |null |null |00ccd5bf3f68f41a50f7f8d4aaa134fb|57ccae11f0c3205a25b3c5427aba8d2d|page |create |2020-07-25T06:07:04Z|null |2409:4073:407:CCC5:A29D:AE4A:46C9:E628 |776790 |null |7045424 |null |
|page |create |2020-07-05 00:49:56.0|null |104.230.23.164 |775798 |null |null |null |e08687276bdb7f488ea269fe7a168601|384253c82036282e23c8e19a7eedba5d|page |create |2020-07-19T17:40:06Z|null |104.230.23.164 |775798 |null |7039401 |null |
|page |create |2020-07-19 20:46:18.0|null |98.204.187.175 |777223 |null |null |null |211a025c5cb013f4f6d8d43d29b22bdf|cef2acd16426fb5152f2359ff4cc78e1|page |create |2020-07-27T19:22:37Z|null |98.204.187.175 |777223 |null |7047934 |null |
|page |create |2020-07-04 06:40:35.0|null |2409:4062:2E0B:CD10:49BB:E6CF:F9C5:68E9|772348 |null |null |null |9882235dc34668977e5fd53a3b1b6f5b|fbc1b16ae7539d290e507df344acc0c6|page |create |2020-07-04T17:41:59Z|null |2409:4062:2E0B:CD10:49BB:E6CF:F9C5:68E9|772348 |null |7018286 |null |
|page |create |2020-07-23 16:43:41.0|null |201.157.252.176 |777198 |null |null |null |2560e393bc7c05c13e40e9401b9939c7|d747e9c48f63fdc4351ec28c2824a588|page |create |2020-07-27T17:15:15Z|null |201.157.252.176 |777198 |null |7047743 |null |
|page |create |2020-07-02 23:27:01.0|null |103.78.225.18 |772668 |null |null |null |bef4a9cb9dbd66b046ee3ae0c87a9ef4|45e87713cec3b4566456aa609302476d|page |create |2020-07-06T00:35:23Z|null |103.78.225.18 |772668 |null |7021010 |null |
|page |create |2020-07-13 13:09:06.0|null |82.132.238.246 |774401 |null |null |null |fe334a53259e9f372781a795a7a09f50|85a334491245ac70bcc452ed6fb1ec33|page |create |2020-07-13T13:16:14Z|null |82.132.238.246 |774401 |null |7031224 |null |
|page |create |2020-07-03 22:26:35.0|null |2600:1017:B40C:71D1:70CA:95AE:5487:87C7|772648 |null |null |null |8f61a6e607f594069b21890675ce390a|f28288f13bca005a53afc9e8cc22830c|page |create |2020-07-05T23:21:44Z|null |2600:1017:B40C:71D1:70CA:95AE:5487:87C7|772648 |null |7020944 |null |
|page |create |2020-07-17 17:04:12.0|null |203.192.244.66 |775707 |null |null |null |7da9908ccae63b9f76c31e150d4692ce|3b0495f56c02882d22a3f1ae866f4718|page |create |2020-07-19T11:13:40Z|null |203.192.244.66 |775707 |null |7038973 |null |
+-----------------+---------------+---------------------+------------------+---------------------------------------+------------+-----------------------+----------------+------------+--------------------------------+--------------------------------+------------+----------+--------------------+-------------+---------------------------------------+-------+------------------+-----------+-------+val jrevts = hfsnr.where("hudi_event_entity = 'revision'").join(eventsRevCreate, col("hudi_revision_id") === col("revision_id"), "outer").
selectExpr("hudi_row_key_hash", "row_key_hash", "(hudi_revision_id IS NULL OR revision_id IS NULL OR (abs(unix_timestamp(hudi_event_timestamp) - unix_timestamp(event_timestamp, 'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'')) < 5)) AS ts_match").
cache()
jrevts.count
// 39532
jrevts.where("hudi_row_key_hash IS NULL").count
// 613
jrevts.where("row_key_hash IS NULL").count
// 748
jrevts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL AND ts_match").count
// 38171
jrevts.where("hudi_row_key_hash IS NOT NULL AND row_key_hash IS NOT NULL AND NOT ts_match").count
// 0
// No timestamp difference here - need to find a different for mismatch.I noticed here that events to the same page were criss-crossing and they're actually all really close in time, here they are split into two tables and re-ordered to match (btw if you keep tables outside of the code formatting they format nicely in phab):
| hudi_event_entity | hudi_event_type | hudi_event_timestamp | hudi_event_user_id | hudi_event_user_text_historical | hudi_page_id |
| page | delete | 2020-07-30 16:49:10.0 | 8630 | Chenzw | 776311 |
| page | delete | 2020-07-06 17:06:54.0 | 2133 | Eptalon | 676812 |
| page | delete | 2020-07-03 07:54:39.0 | 22027 | Auntof6 | 285032 |
| page | delete | 2020-07-06 22:32:27.0 | 18539 | Djsasso | 676812 |
| page | delete | 2020-07-27 08:15:16.0 | 22027 | Auntof6 | 700118 |
| page | delete | 2020-07-30 16:49:31.0 | 8630 | Chenzw | 776311 |
| event_entity | event_type | event_timestamp | event_user_id | event_user_text_historical | page_id | revision_id |
| page | delete | 2020-07-30T16:49:32Z | 8630 | Chenzw | 776311 | 7042637 |
| page | delete | 2020-07-06T17:06:55Z | 2133 | Eptalon | 676812 | 7021769 |
| page | delete | 2020-07-03T07:54:49Z | 22027 | Auntof6 | 285032 | 7013107 |
| page | delete | 2020-07-06T22:32:27Z | 18539 | Djsasso | 676812 | 7021769 |
| page | delete | 2020-07-27T08:15:26Z | 22027 | Auntof6 | 700118 | 6577725 |
| page | delete | 2020-07-30T16:49:10Z | 8630 | Chenzw | 776311 | 7042637 |
So the main problem seems to be that hudi_revision_id is null and the times are off by up to 30 seconds or so.
I had no other ideas on the other two checks yet.
Last news on revision_create for simplewiki 2020-07:
Final note.
I have found 2 problems with mediawiki-events (over simplewiki only, using mediawiki-history as a baseline):
I have found new features to be added to events to allow to recreate mediawiki-history using events:
Re-read and re-thought. All sounds good, it's a solid plan for the next couple of quarters. And I especially like the insight from T262261#6478831 that some failed joins are just due to hidden performer information. This will be good to keep in mind and add an event for, because the log events won't cover it.