
Add user_central_id to mediawiki_content_history_v1 (and mediawiki_content_current_v1)
Open, Needs Triage, Public

Description

To serve Global Editor Metrics, we need user_central_id in Druid mediawiki_history_reduced: T406263: mediawiki_history_reduced - add page_id and user_central_id fields

As part of T405039: Global Editor Metrics - Data Pipeline we are going to incrementally update the Druid mediawiki_history_reduced dataset.

T365648: Add user_central_id to mediawiki_history and mediawiki_history_reduced Hive tables will be done for monthly snapshots.

To update this dataset daily, we need an incremental datasource. Our options are:

  • event.mediawiki_page_change_v1
  • mediawiki_content_history_v1

T403664: EventBus - Add central user id to MediaWiki events is done, so we could use event.mediawiki_page_change_v1.

However, mediawiki_content_history_v1 is reconciled daily, so it will be more accurate. We'd prefer to use mediawiki_content_history_v1.

We could join mediawiki_content_history_v1 with event.mediawiki_page_change_v1 to look up the relevant user_central_id.

But, it would be much better and less work if mediawiki_content_history_v1 had user_central_id.
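The join option described above could be sketched like this (a pure-Python illustration; the real pipeline would express this as a Spark join, and the field names here are assumptions based on the tables mentioned in this task):

```python
# Hypothetical sketch: map each (wiki_id, revision_id) in content history
# to the user_central_id carried by the corresponding page_change event.
def enrich_with_central_id(history_rows, page_change_events):
    """Return history rows with user_central_id looked up from events."""
    central_ids = {
        (e["wiki_id"], e["revision_id"]): e["user_central_id"]
        for e in page_change_events
    }
    return [
        {**row, "user_central_id": central_ids.get((row["wiki_id"], row["revision_id"]))}
        for row in history_rows
    ]
```

Rows with no matching event come back with user_central_id = None, which is exactly the gap that makes carrying the field natively in mediawiki_content_history_v1 preferable.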

This field will be very useful for things other than Global Editor Metrics, so it makes sense to add this field to mediawiki_content_history_v1. Along the way, we should also add it to mediawiki_content_current_v1.

Done is:

  • user_central_id added to mediawiki_content_history_v1, populated on an ongoing basis from mediawiki.page_content_change.v1
  • user_central_id added to mediawiki_content_current_v1, populated on an ongoing basis since it is downstream of mediawiki_content_history_v1
  • mediawiki_content_history_v1 backfilled, from either centralauth_localuser or the MariaDB centralauth.localuser table
  • mediawiki_content_current_v1 backfilled.
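The backfill in the third bullet boils down to a lookup from (wiki, local user id) to the global id via the centralauth localuser data. A minimal sketch, assuming the lu_wiki, lu_local_id, and lu_global_id column names (an assumption; verify against the actual sqooped schema):

```python
# Illustrative backfill: resolve user_central_id for historical rows via
# the centralauth localuser table, keyed by (wiki, local user id).
# Column names lu_wiki / lu_local_id / lu_global_id are assumptions.
def backfill_central_ids(history_rows, localuser_rows):
    lookup = {
        (lu["lu_wiki"], lu["lu_local_id"]): lu["lu_global_id"]
        for lu in localuser_rows
    }
    for row in history_rows:
        if row.get("user_central_id") is None:
            row["user_central_id"] = lookup.get((row["wiki_id"], row["user_id"]))
    return history_rows
```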

Details

Related Changes in GitLab:
  • main: bump mw content to pickup schema changes. (repos/data-engineering/airflow-dags!1758, author xcollazo, drop-sha1-add-central-id -> main)
  • Drop revision_sha1, add user_central_id (repos/data-engineering/dumps/mediawiki-content-dump!81, author xcollazo, drop-sha1-add-central-id -> main)

Event Timeline

@xcollazo, @Antoine_Quhen, in a meeting today, @Milimetric suggested that this task be done rather than doing joins with page_change.

This would have a tight turnaround: we'd need this ASAP, as we need to start building the data pipeline now.

What are our chances of getting it done? :D CC also @Ahoelzl

A summary of the decision from the meeting is at {T405039#11247302}

wmf_raw.centralauth_localuser appears empty. Perhaps the sqoop job is not in yet?

spark-sql (default)> select count(1) as count from centralauth_localuser;
count
0
Time taken: 7.217 seconds, Fetched 1 row(s)
spark-sql (default)> show partitions centralauth_localuser;
partition
Time taken: 0.27 seconds

wmf_raw.centralauth_localuser appears empty. Perhaps the sqoop job is not in yet?

Oh I see the first sqoop is happening as part of T389666. Got it.

Ya apparently I did it wrong. Trying to fix it asap.

Also, I think we'll need user_central_id in page_content_change events, eh? That needs https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/merge_requests/96, but errrr that might be currently blocked on the Flink 1.20 upgrade eek.

wmf_raw.centralauth_localuser appears empty. Perhaps the sqoop job is not in yet?

In the meantime, I did a manual sqoop; the data is available at otto.centralauth_localuser

Thanks to @tchin, user_central_id is now in mediawiki.page_content_change.v1, so we should be able to add it to mediawiki_content_history_v1 ongoing, yes?

As agreed with @amastilovic, I'm taking over as I am making similar changes for T405641.

Thanks to @tchin, user_central_id is now in mediawiki.page_content_change.v1, so we should be able to add it to mediawiki_content_history_v1 ongoing, yes?

Yes, but event emission for reconcile is looking difficult as user_central_id is not part of a regular wiki's DB schema. Let's talk.

It doesn't look like the MW Action API for revisions will expose a user's central id.

But, if we have the local user id, we can fetch the central id by hitting the userinfo API:

https://www.mediawiki.org/w/api.php?action=query&format=json&meta=userinfo&formatversion=2&uiprop=centralids

{
    "batchcomplete": true,
    "query": {
        "userinfo": {
            "id": 17637304,
            "name": "XCollazo-WMF",
            "centralids": {
                "CentralAuth": 69912009,
                "local": 17637304
            },
            "attachedlocal": {
                "CentralAuth": true,
                "local": true
            }
        }
    }
}

Thus perhaps we can modify mediawiki_event_enrichment to enrich the reconcile events with this data.
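Extracting the central id from that response is straightforward. A small sketch using the sample payload above (the function name is illustrative; "CentralAuth" is the provider key as seen in the response):

```python
import json

# Parse the central user id out of a meta=userinfo&uiprop=centralids
# response like the one shown above.
def central_id_from_userinfo(payload):
    data = json.loads(payload)
    return data["query"]["userinfo"]["centralids"].get("CentralAuth")

sample = """
{"batchcomplete": true,
 "query": {"userinfo": {"id": 17637304, "name": "XCollazo-WMF",
   "centralids": {"CentralAuth": 69912009, "local": 17637304},
   "attachedlocal": {"CentralAuth": true, "local": true}}}}
"""
```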

Oh hm yes indeed.

reconcile is looking difficult as user_central_id is not part of a regular wiki's DB schema

and/or, is it possible to include user_central_id in reconcile events at emission time by looking them up in centralauth localuser?

Let's chat!

Ok, from chat with @Ottomata:

We figured that user_central_id can be fetched from the centralauth database, and that this database is on section s7 as per refinery code:

elif dbname == 'centralauth':
    # The 'centralauth' db is a special case, not currently
    # listed among the mediawiki-config's dblists. The more automated
    # solution would be to parse db-production.php in mediawiki-config, but it
    # would add more complexity than what's necessary.
    shard = 's7'

Still, today's reconcile event emission code only connects to the target DB; we would now also need a connection to centralauth, and to join both dataframes efficiently. This is all possible, but it will take a bit of time to get right.
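The two-connection shape described above could look roughly like this (pure Python standing in for the dataframe join; the function and column names are illustrative, not the actual reconcile code):

```python
# Sketch: read the target wiki's revisions from its own DB connection,
# read the (small) centralauth localuser slice for that wiki from the
# s7/centralauth connection, and join in memory on the local user id.
def emit_reconcile_events(read_target_revisions, read_centralauth_slice, wiki):
    revisions = read_target_revisions(wiki)    # per-wiki connection
    localuser = read_centralauth_slice(wiki)   # centralauth (s7) connection
    central = {lu["lu_local_id"]: lu["lu_global_id"] for lu in localuser}
    return [
        {**rev, "user_central_id": central.get(rev["user_id"])}
        for rev in revisions
    ]
```

In Spark terms the localuser slice is the small side, so a broadcast join would likely be the efficient choice.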

Thus we have decided that, as a first step, we will just start ingesting the user_central_id column into the mediawiki tables, but we will not reconcile or backfill it yet. We will mark it as such in the documentation, and of course finish the reconcile and backfill work later on.

Thanks! This will allow us to use mw_content_history for daily updates to Druid mediawiki_history_reduced. This way, we don't have to migrate to mw_content_history later.

@xcollazo Marcel and I just discussed more complications with using mediawiki_content_history_v1 for daily updates for our stuff: T405039#11290461

I think we are just going to use mediawiki_page_change_v1 for now. I'm sorry for the extra urgent work.

But! Getting user_central_id into mediawiki_content_history_v1 is still very valuable so please proceed if you are almost there!

Ran the following in production:

$ hostname -f
an-launcher1002.eqiad.wmnet
$ whoami
analytics
$ kerberos-run-command analytics spark3-sql

spark-sql (default)> ALTER TABLE wmf_content.mediawiki_content_history_v1 DROP COLUMN revision_sha1;
25/10/20 19:13:21 WARN BaseTransaction: Failed to load metadata for a committed snapshot, skipping clean-up
Response code
Time taken: 2.315 seconds

spark-sql (default)> ALTER TABLE wmf_content.mediawiki_content_current_v1 DROP COLUMN revision_sha1;
25/10/20 19:13:28 WARN BaseTransaction: Failed to load metadata for a committed snapshot, skipping clean-up
Response code
Time taken: 0.211 seconds

spark-sql (default)> ALTER TABLE wmf_content.mediawiki_content_history_v1 ADD COLUMN user_central_id BIGINT COMMENT 'Global cross-wiki user ID. See: https://www.mediawiki.org/wiki/Manual:Central_ID' AFTER user_id;
25/10/20 19:13:36 WARN BaseTransaction: Failed to load metadata for a committed snapshot, skipping clean-up
Response code
Time taken: 0.237 seconds

spark-sql (default)> ALTER TABLE wmf_content.mediawiki_content_current_v1 ADD COLUMN user_central_id BIGINT COMMENT 'Global cross-wiki user ID. See: https://www.mediawiki.org/wiki/Manual:Central_ID' AFTER user_id;
25/10/20 19:13:42 WARN BaseTransaction: Failed to load metadata for a committed snapshot, skipping clean-up
Response code
Time taken: 0.181 seconds

Sanity checks:

spark-sql (default)> select wiki_id, revision_id, user_id, user_central_id from wmf_content.mediawiki_content_history_v1 where wiki_id='simplewiki' limit 10;
wiki_id	revision_id	user_id	user_central_id
simplewiki	59	0	NULL
simplewiki	60	20	NULL
simplewiki	1480	20	NULL
simplewiki	4210	2	NULL
simplewiki	4522	11	NULL
simplewiki	5619	121	NULL
simplewiki	8877	11	NULL
simplewiki	10140	0	NULL
simplewiki	22416	11	NULL
simplewiki	23660	793	NULL
Time taken: 1.236 seconds, Fetched 10 row(s)

spark-sql (default)> select wiki_id, revision_id, user_id, user_central_id from wmf_content.mediawiki_content_current_v1 where wiki_id='simplewiki' limit 10;
wiki_id	revision_id	user_id	user_central_id
simplewiki	10580813	1678721	NULL
simplewiki	10580720	1678706	NULL
simplewiki	10579440	430706	NULL
simplewiki	10579698	1673561	NULL
simplewiki	10579097	1595360	NULL
simplewiki	10579676	1677895	NULL
simplewiki	10580818	805501	NULL
simplewiki	10580800	1011873	NULL
simplewiki	10579686	1673561	NULL
simplewiki	10579603	1150185	NULL
Time taken: 0.921 seconds, Fetched 10 row(s)