To serve Global Editor Metrics, we need user_central_id in Druid mediawiki_history_reduced (see T406263: mediawiki_history_reduced - add page_id and user_central_id fields).
As part of T405039: Global Editor Metrics - Data Pipeline we are going to incrementally update the Druid mediawiki_history_reduced dataset.
For the monthly snapshots, this will be done in T365648: Add user_central_id to mediawiki_history and mediawiki_history_reduced Hive tables.
To update this dataset daily, we need an incremental datasource. Our options are:
- event.mediawiki_page_change_v1
- mediawiki_content_history_v1
T403664: EventBus - Add central user id to MediaWiki events is done, so we could use event.mediawiki_page_change_v1.
However, mediawiki_content_history_v1 is reconciled daily, so it will be more accurate. We'd prefer to use mediawiki_content_history_v1.
We could join mediawiki_content_history_v1 with event.mediawiki_page_change_v1 to look up the relevant user_central_id.
But it would be much better, and less work, if mediawiki_content_history_v1 had user_central_id itself.
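For comparison, the join-based lookup could look roughly like the sketch below. It is only an illustration in plain Python: the row shape and the key fields wiki_id and revision_id are assumptions, not the actual schemas of either table, and the function name is hypothetical.

```python
def add_user_central_id(history_rows, event_rows):
    """Left-join history rows to page change events to pick up user_central_id.

    Assumption: both inputs are lists of dicts that share the hypothetical
    key fields wiki_id and revision_id.
    """
    # Index the events by (wiki_id, revision_id) for constant-time lookup.
    central_id_by_revision = {
        (event["wiki_id"], event["revision_id"]): event["user_central_id"]
        for event in event_rows
    }
    # Keep every history row; rows without a matching event get None.
    return [
        {
            **row,
            "user_central_id": central_id_by_revision.get(
                (row["wiki_id"], row["revision_id"])
            ),
        }
        for row in history_rows
    ]
```

In this sketch a history row with no matching event ends up with user_central_id set to None, so the join would still leave gaps to fill by other means.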
This field will be very useful for things other than Global Editor Metrics, so it makes sense to add this field to mediawiki_content_history_v1. Along the way, we should also add it to mediawiki_content_current_v1.
Done is:
- user_central_id added to mediawiki_content_history_v1, populated on an ongoing basis from mediawiki.page_content_change.v1
- user_central_id added to mediawiki_content_current_v1, populated on an ongoing basis because it is downstream of mediawiki_content_history_v1
- mediawiki_content_history_v1 backfilled from either centralauth_localuser or the MariaDB centralauth.localuser table
- mediawiki_content_current_v1 backfilled.
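
The backfill step could be sketched as a lookup from (wiki, local user id) to central user id. This is a minimal plain-Python illustration, not the actual job: the localuser column names lu_wiki, lu_local_id and lu_global_id, and the history fields wiki_id and user_id, are assumptions that should be checked against the real schemas, and the function name is hypothetical.

```python
def backfill_user_central_id(history_rows, localuser_rows):
    """Fill in a missing user_central_id from localuser-style rows.

    Assumptions: localuser rows carry lu_wiki, lu_local_id and lu_global_id;
    history rows carry wiki_id and user_id. All inputs are lists of dicts.
    """
    # Map each (wiki, local user id) pair to its central (global) user id.
    central_id_by_local_user = {
        (lu["lu_wiki"], lu["lu_local_id"]): lu["lu_global_id"]
        for lu in localuser_rows
    }
    backfilled = []
    for row in history_rows:
        needs_backfill = (
            row.get("user_central_id") is None
            and row.get("user_id") is not None
        )
        if needs_backfill:
            # Unattached local accounts have no entry and stay None.
            central_id = central_id_by_local_user.get(
                (row["wiki_id"], row["user_id"])
            )
            row = {**row, "user_central_id": central_id}
        backfilled.append(row)
    return backfilled
```

Rows that already have a user_central_id are left untouched, so the same function could be rerun safely over partially backfilled data.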