Page MenuHomePhabricator

CentralAuth's localuser table contains many nulls and duplicate mappings
Open, Needs TriagePublic

Description

While working on T410688 to integrate lu_global_id into a data engineering pipeline, we found out that data on CentralAuth's localuser table contains both NULLs as well as duplicate mappings.

Here are examples from a snapshot of the table taken on 2025-10 that we have available in the datalake. Note that you should ignore any references below to wiki_db and snapshot, as these columns are added when ingesting the data into the datalake and are not to be found on MariaDB.

Funnily enough, now there are now 4.3M rows more on the target table than the source

I have investigated @xcollazo finding, and it's not great: the centralauth.local_user table contains rows with NULL values for local_user_id for many projects, and for other projects (ocwiki and outreachwiki for the least) it has multiple rows for the same local_user_id and global_user_id...
This explains the row duplication :(
I have been trying the MERGE approach on my test table removing corrupted data, but the job still fails. I'll continue my investigations in that direction.

Looked at this as well. Sharing issues with the table for completeness:

spark.sql("""
SELECT count(1) as count
FROM wmf_raw.centralauth_localuser 
WHERE snapshot='2025-10'
  AND wiki_db='centralauth'
  AND lu_local_id IS NULL
""").show(100, truncate=False)

+-----+
|count|
+-----+
|6365 |
+-----+

spark.sql("""
SELECT count(1) as count FROM (
  SELECT count(1) as count,
         lu_wiki,
         lu_local_id
  FROM wmf_raw.centralauth_localuser
  WHERE snapshot='2025-10'
    AND wiki_db='centralauth'
  GROUP BY lu_wiki, lu_local_id
  HAVING count > 1
)
""").show(100, truncate=False)

+------+
|count |
+------+
|221780|
+------+

spark.sql("""
SELECT count(1) as count FROM (
  SELECT count(1) as count,
         lu_wiki,
         lu_local_id,
         lu_global_id
  FROM wmf_raw.centralauth_localuser
  WHERE snapshot='2025-10'
    AND wiki_db='centralauth'
  GROUP BY lu_wiki, lu_local_id, lu_global_id
  HAVING count > 1
)
""").show(100, truncate=False)

+------+
|count |
+------+
|221647|
+------+

spark.sql("""
  SELECT count(1) as count,
         lu_wiki,
         lu_local_id,
         lu_global_id
  FROM wmf_raw.centralauth_localuser
  WHERE snapshot='2025-10'
    AND wiki_db='centralauth'
  GROUP BY lu_wiki, lu_local_id, lu_global_id
  HAVING count > 1
  ORDER BY count DESC
""").show(100, truncate=False)

+-----+-------------+-----------+------------+
|count|lu_wiki      |lu_local_id|lu_global_id|
+-----+-------------+-----------+------------+
|2573 |enwiki       |null       |null        |
|536  |eswiki       |null       |null        |
|164  |ptwiki       |null       |null        |
|118  |jawiki       |null       |null        |
|78   |frwiki       |null       |null        |
|68   |idwiki       |null       |null        |
|49   |ruwiki       |null       |null        |
|45   |metawiki     |null       |null        |
|44   |viwiki       |null       |null        |
|36   |dewiki       |null       |null        |
|35   |trwiki       |null       |null        |
|19   |mediawikiwiki|null       |null        |
|15   |itwiki       |null       |null        |
|15   |fawiki       |null       |null        |
|12   |enwiki       |null       |0           |
|9    |commonswiki  |null       |null        |
|8    |simplewiki   |null       |null        |
|7    |hewiki       |null       |null        |
|7    |svwiki       |null       |null        |
|6    |mswiki       |null       |null        |
|6    |nlwiki       |null       |null        |
|6    |zhwiki       |null       |null        |
|6    |plwiki       |null       |null        |
|5    |srwiki       |null       |null        |
|5    |azwiki       |null       |null        |
|5    |mkwiki       |null       |null        |
|4    |sqwiki       |null       |null        |
|4    |arzwiki      |null       |null        |
|4    |cawiki       |null       |null        |
|3    |cswiki       |null       |null        |
|3    |enwiktionary |null       |null        |
|3    |arwiki       |null       |null        |
|3    |fiwiki       |null       |null        |
|3    |metawiki     |null       |0           |
|2    |ocwiki       |30104      |43352052    |
|2    |ocwiki       |44967      |64797735    |
|2    |ocwiki       |33093      |1469800     |
|2    |ocwiki       |33359      |48463092    |
|2    |ocwiki       |15703      |573928      |
|2    |ocwiki       |55770      |62357577    |
|2    |ocwiki       |34908      |6532839     |
|2    |ocwiki       |1782       |417258      |
|2    |ocwiki       |25686      |13967321    |
|2    |ocwiki       |53445      |74506123    |
|2    |ocwiki       |31591      |44800838    |
|2    |ocwiki       |33969      |1038620     |
|2    |olowiki      |6501       |72655905    |
|2    |ocwiki       |23757      |16898251    |
|2    |ocwiki       |745        |5731        |
|2    |ocwiki       |27884      |6090122     |
|2    |ocwiki       |51960      |72164030    |
|2    |ocwiki       |57836      |62111448    |
|2    |ocwiki       |46918      |68928761    |
...

Seems that, other than nulls, the repeats on ocwiki and friends are duplicates of the same mapping, so perhaps all we need to do is filter the table for lu_local_id IS NULL before joining on it.

Event Timeline

Restricted Application added subscribers: hubaishan, Nemoralis, Huji and 2 others. · View Herald Transcript

So the cases where lu_global_id IS null (3913 accounts) would be rows where the user was never actually globalised.

Looking at migrateAccount.php:

		$this->addDescription( <<<'TEXT'
			Migrates the specified usernames to a global account if email matches
			and there are no conflicts. Assumes the localuser and globaluser tables
			are up to date (e.g. migratePass0 has been run).
			TEXT );

Which is reasonable... If we can't work out if a user is the same as on another wiki.. We can't globalise them.

However, where the account only exists on one wiki (ie it's unique), and for example has no email on either... Well, I think we can globalise those, which will set a lu_global_id and because it has a wiki account, that lu_local_id will not be null either.

This is attachbroken in migrateAccount.php:

			if ( $this->getOption( 'attachbroken', false ) ) {
				// This option is for T63876 / T41996 where the account has
				// an empty password and email set, and became unattached.
				// Since there is no way an account can have an empty password manually
				// it has to be due to a CentralAuth bug. So just attach it then.
				// But just to be on the safe side, check that it also has 0 edits.
				foreach ( $unattached as $wiki => $local ) {
					if ( $local['email'] === '' && $local['password'] === ''
						&& $local['editCount'] === '0'
					) {
						$this->output( "ATTACHING: $username@$wiki\n" );
						// Ironically, the attachment is made due to lack of a password.
						$central->attach(
							$wiki, 'password', /** $sendToRC = */ !$this->suppressRC
						);
					}
				}
			}

There are slightly less accounts with no lu_global_id (3913) than those with no lu_local_id (4042)...

wikiadmin2023@10.192.48.205(centralauth)> select lu_name, count(*) as cnt from localuser where lu_global_id is null group by lu_name having cnt > 1;
+------------------------------+-----+
| lu_name                      | cnt |
+------------------------------+-----+
| Besir.arifi                  |   2 |
| Fabriciotg                   |   2 |
| Mazzikamadura                |   2 |
| SELFEDITION                  |   2 |
| Trần Nguyễn Hạo Thiên        |   2 |
+------------------------------+-----+
5 rows in set (9 min 55.617 sec)

So basically we can globalise most of those accounts...

I don't know if my other script runs have tidied all of those things up... But I doubt it, as they don't delete rows... But it will have fixed some of the lu_local_id is null entries as per T303590#11411628

Are those ocwiki dupes legit? One picked at random...

wikiadmin2023@10.192.48.205(centralauth)> select * from localuser where lu_wiki='ocwiki' and (lu_global_id = 16898251 or lu_local_id = 23757 );
+---------+---------+-----------------------+--------------------+-------------+--------------+----------------------+
| lu_wiki | lu_name | lu_attached_timestamp | lu_attached_method | lu_local_id | lu_global_id | lu_attachment_method |
+---------+---------+-----------------------+--------------------+-------------+--------------+----------------------+
| ocwiki  | IIM 78  | 20150726154950        | login              |       23757 |     16898251 |                 NULL |
+---------+---------+-----------------------+--------------------+-------------+--------------+----------------------+
1 row in set (0.035 sec)

I'd expect to see 2 rows?

wikiadmin2023@10.192.48.205(centralauth)> show indexes from localuser;
+-----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+
| Table     | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Ignored |
+-----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+
| localuser |          0 | PRIMARY  |            1 | lu_wiki     | A         |      162658 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| localuser |          0 | PRIMARY  |            2 | lu_name     | A         |   227233813 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| localuser |          1 | lu_name  |            1 | lu_name     | A         |   227233813 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| localuser |          1 | lu_name  |            2 | lu_wiki     | A         |   227233813 |     NULL | NULL   |      | BTREE      |         |               | NO      |
+-----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+
4 rows in set (0.001 sec)

I don't know if my other script runs have tidied all of those things up... But I doubt it, as they don't delete rows... But it will have fixed some of the lu_local_id is null entries as per T303590#11411628

Are those ocwiki dupes legit? ...

Once the next snapshot of the localuser table is available in the datalake I will rerun the query, see if we still see the issues. It could very well be a datalake ingest issue for ocwiki and olowiki.