
NEW/CHANGE FEATURE REQUEST: make available the centralauth.globaluser table in Data Lake
Closed, ResolvedPublicFeature

Description

Data Platform Request Form

Is this a request for a:

  • Dataset
  • Data Pipeline
  • Data Feature

Is this a change to something existing:

  • Yes - please provide details of existing datasets/data pipelines (wiki links, Git URL, names of jobs, etc)
  • No

Please provide the description of your request:
Please make available the centralauth.globaluser table in Data Lake.

This request would cover the same use case for me as this older ticket, but more broadly, so I think solving this request would be the better approach.

Use Case: (Please briefly explain what this feature will be used for):
If we have this table in the Data Lake, then we can easily join it onto tables like the mediawiki_history table, so that:

  1. We can use the unchangeable global user ids as global identifiers, instead of the changeable usernames. This will make analyses that rely on unique global users more reliable.
  2. We can join data that is keyed by global user id onto tables that are keyed by username and lack global user ids, like the mediawiki_history table.
    • Example use case: I frequently need to join Event Registration data, which is keyed by global user id, onto the mediawiki_history table, which does not have it. For my current workaround, see the "Add local user id and wiki per global user id" section in this notebook
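For illustration, the kind of join this request enables could look like the following Spark SQL sketch. The centralauth column names follow the CentralAuth schema; the wmf_raw table names and the snapshot value are assumptions about how the sqooped data might be exposed, not confirmed names:

```sql
-- Sketch only: attach a stable global user id to mediawiki_history rows by
-- joining through localuser (per-wiki username -> global id) and globaluser.
-- Table names under wmf_raw and the snapshot value are assumed.
SELECT
  mh.wiki_db,
  mh.event_user_text,
  lu.lu_global_id AS global_user_id,
  gu.gu_name      AS global_user_name
FROM wmf.mediawiki_history mh
JOIN wmf_raw.centralauth_localuser lu
  ON  lu.lu_wiki = mh.wiki_db
  AND lu.lu_name = mh.event_user_text
JOIN wmf_raw.centralauth_globaluser gu
  ON  gu.gu_id = lu.lu_global_id
WHERE mh.snapshot = '2025-09'
  AND lu.snapshot = '2025-09'
  AND gu.snapshot = '2025-09'
```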

Ideal Delivery Date:
No specific delivery date, but sooner would be better.

Event Timeline

This ticket could override the need for T365648.

The work requested in this ticket would support the following:

  • my current work:
    • Strategic events impact reporting for internal WMF audience
  • my work scheduled for the upcoming quarter:
    • Open events impact report Toolforge tool
  • the work of other data practitioners:

Event Registration not being instrumented / not treating its data as a product continues to be a major source of friction.
Some Campaigns-Product metrics depend on Event Registration data, and its storage in x1 makes it incompatible with our sqoop process.

Timing:
Sooner would be better. No specific deadline for this, because I have a potentially brittle but working workaround for this.

For context on the timing:
For my current work, I came up with a workaround (example code, see the "Add local user id and wiki project" section). This workaround is likely to break eventually, because it relies on a SQL IN clause, which probably has a limit on how many ids can go into it. It is also very slow. For now, the volume of users that my code processes is fairly small due to adoption of a new tool (the Event Registration Tool), but as more users start using the new tool, the list of ids I'd have to put in that IN clause will grow, and maybe it'll reach that limit. Before that happens, I'd like to have a more robust solution, and the work requested in this ticket will support the more robust solution of joining on the centralauth.globaluser table in Data Lake and not having to use an IN clause.

For YiR, monthly is fine. However, for the Growth Impact Module and the app Activity Tab use cases, if we had daily snapshots we might be able to generate editor metrics daily too.

I will try to do a one-off sqoop of globaluser and localuser to see how long it takes, and whether we can do it daily.

Change #1184925 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery@master] Sqoop CentralAuth's globaluser and localuser tables

https://gerrit.wikimedia.org/r/1184925

I was able to test my sqoop patch and get globaluser and localuser avro data sqooped. The test command I ran was:

PYTHONPATH=/home/otto/refinery/python python3 /home/otto/refinery/bin/sqoop-mediawiki-tables \
--output-dir /user/otto/wmf/data/raw/mediawiki/tables \
--user research \
--password-file /user/milimetric/mysql-analytics-research-client-pw.txt \
--wiki-file "/home/otto/refinery/static_data/mediawiki/grouped_wikis/centralauth.csv" \
--tables globaluser,localuser \
--partition-name snapshot \
--partition-value otto1 \
--output-format avrodata \
--sample-wiki-for-jar centralauth \
--processors=32 \
--mappers=32 \
--job-name=sqoop-centralauth-tables-otto1 \
--log-file /tmp/sqoop-info-and-above.log \
--verbose

This resulted in the following dirs and sizes:

16:08:15 [@stat1011:/home/otto] $ hdfs dfs -du -s -h  /user/otto/wmf/data/raw/mediawiki/tables/globaluser/snapshot=otto1/wiki_db=centralauth
1.5 G  /user/otto/wmf/data/raw/mediawiki/tables/globaluser/snapshot=otto1/wiki_db=centralauth

16:08:27 [@stat1011:/home/otto] $ hdfs dfs -du -s -h  /user/otto/wmf/data/raw/mediawiki/tables/localuser/snapshot=otto1/wiki_db=centralauth
7.1 G  /user/otto/wmf/data/raw/mediawiki/tables/localuser/snapshot=otto1/wiki_db=centralauth

I was able to count records with pyspark:

pyspark3 --jars /srv/deployment/analytics/refinery/artifacts/refinery-job-shaded.jar
In [1]: globaluser = spark.read.format("avro").load("hdfs:///user/otto/wmf/data/raw/mediawiki/tables/globaluser/snapshot=otto1")
In [2]: localuser = spark.read.format("avro").load("hdfs:///user/otto/wmf/data/raw/mediawiki/tables/localuser/snapshot=otto1")

In [3]: globaluser.count()
Out[3]: 79491989

In [4]: localuser.count()
Out[4]: 288787606

Just tested with Hive tables too. Looks good.


spark-sql (default)> select * from centralauth_globaluser where snapshot='otto1' limit 1;
gu_id	gu_name	gu_home_db	gu_email	gu_email_authenticated	gu_password	gu_locked	gu_hidden_level	gu_registration	gu_password_reset_key	gu_password_reset_expiration	gu_auth_token	gu_cas_token	snapshot	wiki_db

1	[xxxx]	enwiki	NULL	20121231002451	NULL	false	0	20080313041656	NULL	NULL	NULL	NULL	otto1	centralauth


spark-sql (default)> select * from centralauth_localuser where snapshot='otto1' limit 1;
lu_wiki	lu_name	lu_attached_timestamp	lu_attached_method	lu_attachment_method	lu_local_id	lu_global_id	snapshot	wiki_db
aawiki	[xxxx]	20130411192301	primary	NULL	34	16773808	otto1	centralauth

Change #1184925 merged by Ottomata:

[analytics/refinery@master] Sqoop CentralAuth's globaluser and localuser tables

https://gerrit.wikimedia.org/r/1184925

Merged patch. centralauth tables should be available after the next monthly sqoop.

Change #1193886 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] sqoop - Add centralauth globaluser and localuser tables

https://gerrit.wikimedia.org/r/1193886

Change #1193886 merged by Ottomata:

[operations/puppet@production] sqoop - Add centralauth globaluser and localuser tables

https://gerrit.wikimedia.org/r/1193886

Change #1193897 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery@master] hql/centralauth DDL - Fix location path for tables

https://gerrit.wikimedia.org/r/1193897

Change #1193901 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] sqoop - fix name of hive centralauth tables

https://gerrit.wikimedia.org/r/1193901

Change #1193901 merged by Ottomata:

[operations/puppet@production] sqoop - fix name of hive centralauth tables

https://gerrit.wikimedia.org/r/1193901

Change #1193897 merged by Ottomata:

[analytics/refinery@master] sqoop centralauth - Fix location path for tables and name of sqoop tables

https://gerrit.wikimedia.org/r/1193897

Mentioned in SAL (#wikimedia-analytics) [2025-10-06T16:44:18Z] <ottomata> deploying analytics/refinery to an-launcher1002 to pick up change for T389666

Mentioned in SAL (#wikimedia-operations) [2025-10-06T16:44:43Z] <otto@deploy2002> Started deploy [analytics/refinery@21fe78f]: deploying analytics/refinery to an-launcher1002 to pick up change for T389666

Mentioned in SAL (#wikimedia-operations) [2025-10-06T16:46:35Z] <otto@deploy2002> Finished deploy [analytics/refinery@21fe78f]: deploying analytics/refinery to an-launcher1002 to pick up change for T389666 (duration: 02m 11s)

Created tables.

sudo -u analytics kerberos-run-command analytics spark3-sql --database wmf_raw -f hql/centralauth/create_centralauth_localuser_table.hql

sudo -u analytics kerberos-run-command analytics spark3-sql --database wmf_raw -f hql/centralauth/create_centralauth_globaluser_table.hql
hive (wmf_raw)> show tables;
...
centralauth_globaluser
centralauth_localuser
...

Change #1193926 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] sqoop - fix centralauth job by using seperate script and adding it to sqoop-whole-mediawiki

https://gerrit.wikimedia.org/r/1193926

Change #1193926 merged by Ottomata:

[operations/puppet@production] sqoop - fix centralauth - use seperate script and add to sqoop-whole-mediawiki

https://gerrit.wikimedia.org/r/1193926

Mentioned in SAL (#wikimedia-analytics) [2025-10-08T13:27:51Z] <ottomata> running manual sqoop of centralauth tables - T389666

> MSCK REPAIR TABLE wmf_raw.centralauth_globaluser;
> MSCK REPAIR TABLE wmf_raw.centralauth_localuser;

> show partitions wmf_raw.centralauth_globaluser;
partition
snapshot=2025-09/wiki_db=centralauth

spark-sql (default)> show partitions wmf_raw.centralauth_globaluser;
partition
snapshot=2025-09/wiki_db=centralauth
select lu_local_id, lu_name, lu_global_id from wmf_raw.centralauth_localuser where snapshot="2025-09" and lu_global_id > 20000000 limit 3;
lu_local_id     lu_name lu_global_id
3576    - - - - - - -testinternetu      61381154
3812    - gr22treqfgqd  62588654
3688    07      45024500