Page MenuHomePhabricator

Update user_history and page_history column naming convention
Closed, ResolvedPublic3 Estimated Story Points

Description

In mediawiki_history we went from (nothing) and _latest to _historical and (nothing) for column name suffixes. We should do the same in the other two tables that have these types of columns.

Event Timeline

mforns triaged this task as Unbreak Now! priority.Apr 19 2018, 4:51 PM
JAllemandou set the point value for this task to 3.

Due to https://gerrit.wikimedia.org/r/#/c/388265/ not having been reflected on wmf.mediawiki_user_history and wmf.mediawiki_page_history, we expected field definition issues.

Check from hive and spark over user table/parquet:

  • Hive:
describe wmf.mediawiki_user_history;
OK
col_name	data_type	comment
wiki_db             	string              	enwiki, dewiki, eswiktionary, etc.
user_id             	bigint              	ID of the user, as in the user table.
user_name           	string              	Historical user name.
user_name_latest    	string              	User name as of today.
user_groups         	array<string>       	Historical user groups.
user_groups_latest  	array<string>       	User groups as of today.
user_blocks         	array<string>       	Historical user blocks.
user_blocks_latest  	array<string>       	User blocks as of today.
user_registration_timestamp	string              	When the user accoung was registered, in YYYYMMDDHHmmss format.
created_by_self     	boolean             	Whether the user created their own account
created_by_system   	boolean             	Whether the user account was created by mediawiki (eg. centralauth)
created_by_peer     	boolean             	Whether the user account was created by another user
anonymous           	boolean             	Whether the user is not registered
is_bot_by_name      	boolean             	Whether the user's name matches patterns we use to identify bots
start_timestamp     	string              	Timestamp from where this state applies (inclusive).
end_timestamp       	string              	Timestamp to where this state applies (exclusive).
caused_by_event_type	string              	Event that caused this state (create, move, delete or restore).
caused_by_user_id   	bigint              	ID from the user that caused this state.
caused_by_block_expiration	string              	Block expiration timestamp, if any.
inferred_from       	string              	If non-NULL, indicates that some of this state's fields have been inferred after an inconsistency in the source data.
snapshot            	string              	Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment             
	 	 
snapshot            	string              	Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)
  • Spark:
val df = spark.read.parquet("/wmf/data/wmf/mediawiki/user_history/snapshot=2018-03")
df.printSchema
root
 |-- wiki_db: string (nullable = true)
 |-- user_id: long (nullable = true)
 |-- user_name_historical: string (nullable = true)
 |-- user_name: string (nullable = true)
 |-- user_groups_historical: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- user_groups: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- user_blocks_historical: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- user_blocks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- user_registration_timestamp: string (nullable = true)
 |-- created_by_self: boolean (nullable = true)
 |-- created_by_system: boolean (nullable = true)
 |-- created_by_peer: boolean (nullable = true)
 |-- anonymous: boolean (nullable = true)
 |-- is_bot_by_name: boolean (nullable = true)
 |-- start_timestamp: string (nullable = true)
 |-- end_timestamp: string (nullable = true)
 |-- caused_by_event_type: string (nullable = true)
 |-- caused_by_user_id: long (nullable = true)
 |-- caused_by_block_expiration: string (nullable = true)
 |-- inferred_from: string (nullable = true)

This leads to:

  • NULL values for any _latest field in hive
  • Incorrect data when querying an expected-historical field in hive (actually a non-historical one).

This is broken since snapshot 2017-11.

Proposed solution:

  • Manually delete snapshots still existing with the wrong format (2017-09, 2017-10), and also delete old private ones (2016-12_private and 2017-07_private)
  • Drop tables, recreate them with the correct schemas and repair them.

@JAllemandou thanks for following up on this, the solution looks good to me. The only thing is then we won't have any private partitions, but people will come to us if they need one and we can make a new one. The data's wrong, so it's already broken, really.

Mentioned in SAL (#wikimedia-analytics) [2018-04-20T18:23:30Z] <joal> Drop/recreate wmf.mediawiki_user_history andwmf.mediawiki_page_history for T188669

TheDragonFire lowered the priority of this task from Unbreak Now! to Needs Triage.May 12 2018, 4:17 AM
TheDragonFire subscribed.

Downgrading from UBN per last comment. Is this closable now?

@TheDragonFire we keep tasks in the Done column until our manager can review them and close them, but I can see that this causes some friction when the task is unbreak now. We'll lower its priority when it gets to the Done column from now on.