In mediawiki_history we went from (nothing) and _latest to _historical and (nothing) for column name suffixes. We should do the same in the other two tables that have these types of columns.
Description
Event Timeline
Due to https://gerrit.wikimedia.org/r/#/c/388265/ not having been reflected on wmf.mediawiki_user_history and wmf.mediawiki_page_history, we expected field definition issues.
Check from hive and spark over user table/parquet:
- Hive:
describe wmf.mediawiki_user_history; OK col_name data_type comment wiki_db string enwiki, dewiki, eswiktionary, etc. user_id bigint ID of the user, as in the user table. user_name string Historical user name. user_name_latest string User name as of today. user_groups array<string> Historical user groups. user_groups_latest array<string> User groups as of today. user_blocks array<string> Historical user blocks. user_blocks_latest array<string> User blocks as of today. user_registration_timestamp string When the user accoung was registered, in YYYYMMDDHHmmss format. created_by_self boolean Whether the user created their own account created_by_system boolean Whether the user account was created by mediawiki (eg. centralauth) created_by_peer boolean Whether the user account was created by another user anonymous boolean Whether the user is not registered is_bot_by_name boolean Whether the user's name matches patterns we use to identify bots start_timestamp string Timestamp from where this state applies (inclusive). end_timestamp string Timestamp to where this state applies (exclusive). caused_by_event_type string Event that caused this state (create, move, delete or restore). caused_by_user_id bigint ID from the user that caused this state. caused_by_block_expiration string Block expiration timestamp, if any. inferred_from string If non-NULL, indicates that some of this state's fields have been inferred after an inconsistency in the source data. snapshot string Versioning information to keep multiple datasets (YYYY-MM for regular labs imports) # Partition Information # col_name data_type comment snapshot string Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)
- Spark:
val df = spark.read.parquet("/wmf/data/wmf/mediawiki/user_history/snapshot=2018-03") df.printSchema root |-- wiki_db: string (nullable = true) |-- user_id: long (nullable = true) |-- user_name_historical: string (nullable = true) |-- user_name: string (nullable = true) |-- user_groups_historical: array (nullable = true) | |-- element: string (containsNull = true) |-- user_groups: array (nullable = true) | |-- element: string (containsNull = true) |-- user_blocks_historical: array (nullable = true) | |-- element: string (containsNull = true) |-- user_blocks: array (nullable = true) | |-- element: string (containsNull = true) |-- user_registration_timestamp: string (nullable = true) |-- created_by_self: boolean (nullable = true) |-- created_by_system: boolean (nullable = true) |-- created_by_peer: boolean (nullable = true) |-- anonymous: boolean (nullable = true) |-- is_bot_by_name: boolean (nullable = true) |-- start_timestamp: string (nullable = true) |-- end_timestamp: string (nullable = true) |-- caused_by_event_type: string (nullable = true) |-- caused_by_user_id: long (nullable = true) |-- caused_by_block_expiration: string (nullable = true) |-- inferred_from: string (nullable = true)
This leads to:
- NULL values for any _latest field in hive
- Incorrect data when querying an expected-historical field in hive (actually a non-historical one).
This is broken since snapshot 2017-11.
Proposed solution:
- Manually delete snapshots still existing with the wrong format (2017-09, 2017-10), and also delete old private ones (2016-12_private and 2017-07_private)
- Drop tables, recreate them with the correct schemas and repair them.
@JAllemandou thanks for following up on this, the solution looks good to me. The only thing is then we won't have any private partitions, but people will come to us if they need one and we can make a new one. The data's wrong, so it's already broken, really.
Mentioned in SAL (#wikimedia-analytics) [2018-04-20T18:23:30Z] <joal> Drop/recreate wmf.mediawiki_user_history andwmf.mediawiki_page_history for T188669
@TheDragonFire we keep tasks in the Done column until our manager can review them and close them, but I can see that this causes some friction when the task is unbreak now. We'll lower its priority when it gets to the Done column from now on.