While working on T410688 to integrate lu_global_id into a data engineering pipeline, we found that CentralAuth's localuser table contains both NULL values and duplicate mappings.
Here are examples from a snapshot of the table taken in 2025-10 that we have available in the datalake. Note that you should ignore the references below to wiki_db and snapshot: these columns are added when ingesting the data into the datalake and do not exist in MariaDB.
In T410688#11409702, @xcollazo wrote:
> In T410688#11408750, @JAllemandou wrote:
>> Funnily enough, there are now 4.3M more rows on the target table than on the source.
I have investigated @xcollazo's finding, and it's not great: the centralauth.localuser table contains rows with a NULL lu_local_id for many projects, and for other projects (ocwiki and outreachwiki at least) it has multiple rows for the same lu_local_id and lu_global_id...
This explains the row duplication :(
I have been trying the MERGE approach on my test table with the corrupted data removed, but the job still fails. I'll continue my investigations in that direction.

Looked at this as well. Sharing the issues with the table for completeness:
spark.sql(""" SELECT count(1) as count FROM wmf_raw.centralauth_localuser WHERE snapshot='2025-10' AND wiki_db='centralauth' AND lu_local_id IS NULL """).show(100, truncate=False) +-----+ |count| +-----+ |6365 | +-----+ spark.sql(""" SELECT count(1) as count FROM ( SELECT count(1) as count, lu_wiki, lu_local_id FROM wmf_raw.centralauth_localuser WHERE snapshot='2025-10' AND wiki_db='centralauth' GROUP BY lu_wiki, lu_local_id HAVING count > 1 ) """).show(100, truncate=False) +------+ |count | +------+ |221780| +------+ spark.sql(""" SELECT count(1) as count FROM ( SELECT count(1) as count, lu_wiki, lu_local_id, lu_global_id FROM wmf_raw.centralauth_localuser WHERE snapshot='2025-10' AND wiki_db='centralauth' GROUP BY lu_wiki, lu_local_id, lu_global_id HAVING count > 1 ) """).show(100, truncate=False) +------+ |count | +------+ |221647| +------+ spark.sql(""" SELECT count(1) as count, lu_wiki, lu_local_id, lu_global_id FROM wmf_raw.centralauth_localuser WHERE snapshot='2025-10' AND wiki_db='centralauth' GROUP BY lu_wiki, lu_local_id, lu_global_id HAVING count > 1 ORDER BY count DESC """).show(100, truncate=False) +-----+-------------+-----------+------------+ |count|lu_wiki |lu_local_id|lu_global_id| +-----+-------------+-----------+------------+ |2573 |enwiki |null |null | |536 |eswiki |null |null | |164 |ptwiki |null |null | |118 |jawiki |null |null | |78 |frwiki |null |null | |68 |idwiki |null |null | |49 |ruwiki |null |null | |45 |metawiki |null |null | |44 |viwiki |null |null | |36 |dewiki |null |null | |35 |trwiki |null |null | |19 |mediawikiwiki|null |null | |15 |itwiki |null |null | |15 |fawiki |null |null | |12 |enwiki |null |0 | |9 |commonswiki |null |null | |8 |simplewiki |null |null | |7 |hewiki |null |null | |7 |svwiki |null |null | |6 |mswiki |null |null | |6 |nlwiki |null |null | |6 |zhwiki |null |null | |6 |plwiki |null |null | |5 |srwiki |null |null | |5 |azwiki |null |null | |5 |mkwiki |null |null | |4 |sqwiki |null |null | |4 |arzwiki |null |null | 
|4 |cawiki |null |null | |3 |cswiki |null |null | |3 |enwiktionary |null |null | |3 |arwiki |null |null | |3 |fiwiki |null |null | |3 |metawiki |null |0 | |2 |ocwiki |30104 |43352052 | |2 |ocwiki |44967 |64797735 | |2 |ocwiki |33093 |1469800 | |2 |ocwiki |33359 |48463092 | |2 |ocwiki |15703 |573928 | |2 |ocwiki |55770 |62357577 | |2 |ocwiki |34908 |6532839 | |2 |ocwiki |1782 |417258 | |2 |ocwiki |25686 |13967321 | |2 |ocwiki |53445 |74506123 | |2 |ocwiki |31591 |44800838 | |2 |ocwiki |33969 |1038620 | |2 |olowiki |6501 |72655905 | |2 |ocwiki |23757 |16898251 | |2 |ocwiki |745 |5731 | |2 |ocwiki |27884 |6090122 | |2 |ocwiki |51960 |72164030 | |2 |ocwiki |57836 |62111448 | |2 |ocwiki |46918 |68928761 | ...Seems that, other than nulls, the repeats on ocwiki and friends are duplicates of the same mapping, so perhaps all we need to do is filter the table for lu_local_id IS NULL before joining on it.
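A minimal pure-Python sketch of that cleanup, using hypothetical in-memory rows shaped like (lu_wiki, lu_local_id, lu_global_id) to stand in for the table; in a real Spark job this would be the equivalent of a `WHERE lu_local_id IS NOT NULL` filter plus a dropDuplicates over the three columns:

```python
# Hypothetical sample rows illustrating the two issues seen above:
# a NULL lu_local_id, and an exact repeat of the same mapping.
rows = [
    ("enwiki", None, None),       # corrupt: NULL local id
    ("ocwiki", 30104, 43352052),  # duplicated mapping...
    ("ocwiki", 30104, 43352052),  # ...exact repeat of the row above
    ("frwiki", 123, 456),         # clean row
]

def clean(rows):
    """Drop NULL-local-id rows and collapse exact duplicate mappings."""
    seen = set()
    out = []
    for wiki, local_id, global_id in rows:
        if local_id is None:       # filter the lu_local_id IS NULL rows
            continue
        key = (wiki, local_id, global_id)
        if key in seen:            # drop exact repeats of the same mapping
            continue
        seen.add(key)
        out.append(key)
    return out

cleaned = clean(rows)
# After cleanup, (lu_wiki, lu_local_id) is a unique join key,
# so joining on it can no longer multiply rows.
assert len(cleaned) == len({(w, l) for w, l, _ in cleaned})
```

This is only a sketch of the filtering logic under the assumption (from the query results above) that the non-NULL repeats really are byte-identical mappings; if any repeated key ever carried two different lu_global_id values, a dedup would silently pick one and that case would need separate handling.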