Page MenuHomePhabricator

Import of MediaWiki tables into the Data Lakes mangles usernames
Closed, DuplicatePublic

Description

Something in the sqoop process apparently causes wmf_raw.mediawiki_private_actor(and therefore wmf.mediawiki_history as well) to mangle usernames that use unusual characters.

I found this looking for duplicate actor IDs for the same user:

select
    actor_name, 
    count(actor_id) as n_actor_ids,
    collect_list(actor_user) as user_ids
from wmf_raw.mediawiki_private_actor
where
    snapshot = "2019-07" and
    wiki_db = "enwiki"
group by actor_name
having n_actor_ids > 1
actor_namen_actor_idsuser_ids
????28[7971454,7630487,29548314,169932,1241395,14754180,16163829,36027965,36919251,37020397,28526886,28303085,30943519,30775112,32487125,30943557,31584194,29929605,30943142,32547336,31767650,29548814,31082738,30747253,30989504,32838619,35052220,35310020]
????????4[13080937,555876,7977532,19937304]
????????????2[1481183,29076045]
????????????????????2[1956565,34551626]
????????????????????????2[31469439,30293052]
???????????????????????? ????????????????????????2[33162048,33061987]
????????????????????????????3[36177882,36489581,36633928]
????????????????????????????????5[28277776,26501398,34237464,29655875,29774585]
????????????????????????????????????2[36874166,36135233]
????????????????????????????????????????????3[36932906,33559373,34844422]

But all of these users have different usernames (although they all use unusual characters):

select *
from enwiki.actor
where actor_user in (
    7971454, 7630487, 29548314, 169932,
    1241395, 14754180, 16163829, 36027965,
    36919251, 37020397, 28526886, 28303085,
    30943519, 30775112, 32487125, 30943557,
    31584194, 29929605, 30943142, 32547336,
    31767650, 29548814, 31082738, 30747253,
    30989504, 32838619, 35052220, 35310020,
    13080937, 555876, 7977532, 19937304,
    1481183, 29076045, 1956565, 34551626,
    31469439, 30293052, 33162048, 33061987,
    36489581, 36633928, 28277776, 26501398,
    34237464, 29655875, 29774585, 36874166,
    36135233, 36932906, 33559373, 34844422
)
actor_idactor_useractor_name
784003169932๐Ÿ˜‚
1755400555876????????
24387611241395๔ฟฟ
26784231481183????????????
31534211956565????????????????????
88253477630487๐ฅฃ
91656587971454๐กšด
91717347977532๐ชšฅ๐ชšฅ
1427635313080937๐ ”ป๐ชšฅ
1594954414754180๐›ฅ
1735912616163829๐ŸŒˆ
2112062119937304๐Ÿ‡บ๐Ÿ‡ธ
2766005826501398๐Œผ๐Œฐ๐Œป๐Œพ๐Œฐ๐‚๐Œฝ๐Œน
2941387428277776๐•€๐““๐’†๐“ณ๐“ช๐“ฟ๐“Š๐•€
2943862628303085๐œ†
2965743928526886๐”Š
3019183929076045๐จ€๐จจ๐จช
24050829548314๐Ÿฑ
3065542329548814๐Ÿค–
3076049729655875๐•ฎ๐–—๐–š๐–˜๐–†๐–‰๐–Š๐–—
3087716929774585๐Š๐จ๐ฐ๐š๐ฅ๐ฌ๐ค๐ข
3102989529929605๐“ผ
3138883230293052๐•น๐–†๐–•๐–™๐–”๐–—
3183509630747253๐Ÿ’ค
3186244530775112๐Ÿฃ
3202696130943142๐Ÿ”
3202733430943519๐Ÿ˜Š
3202737230943557๐Ÿ˜Ž
3207232530989504๐Ÿ’Š
3216368831082738๐Ÿ’ฉ
3254401631469439๐Œด๐Œน๐ƒ๐Œท๐‰๐‚
3265720631584194๐ŸŒš
3283781931767650๐Ÿœ
3354687632487125๐Ÿ™ƒ
3360617032547336๐Ÿ 
3389352232838619๐Ÿ”ฅ
3411353333061987๐•ฒ๐•บ๐•ท๐•ฏ๐•ฐ๐•น ๐Ÿ„บ๐Ÿ„ธ๐Ÿ„ป๐Ÿ„ป๐Ÿ„ด๐Ÿ…
3421198033162048๐•€๐•ฆ๐•๐•š๐•’๐•Ÿ ๐•‹๐•’๐•Ÿ๐•’๐•ค๐•–
3460336833559373๐“˜๐“ฝ๐“ผ๐“™๐“ฒ๐“พ๐“™๐“ฒ๐“ฝ๐“ผ๐“พ
3527259634237464๐•ฎ๐–๐•ป๐–‘๐–†๐–™๐–Š๐–‘
3558369634551626๐’ฝ๐“Ž๐‘’๐“ˆ๐“Š
3587320634844422๐–‡๐–š๐–Ž๐–‘๐–Š๐–‰๐–š๐–ˆ๐–†๐–“๐–
3607782635052220๐ˆ
3633164435310020๐ฉน‰
19008355636027965๐ŸฆŠ
19053964436135233๐•ฏ๐–Ž๐–†๐–“๐–†๐Ÿ ๐Ÿš๐Ÿก๐Ÿ›
19229462936489581๐–๐–†๐–๐––๐–š๐–Ž๐–“
19298896936633928๐•ฒ๐–‘๐–†๐–˜๐–˜๐•ธ๐–†
19410141236874166๐“›๐“ฒ๐“ท๐“พ๐“›๐“ฒ๐“ท๐“พ๐“ผ
19431597136919251๐Ÿ‘€
19438444636932906๐ค๐š๐ญ๐ข๐ž๐›๐š๐›๐ข๐œ๐ณ
19480396837020397๐Ÿฆ

This propagates into mediawiki_history too. The following query returns only null values for event_user_text:

select event_user_text
from wmf.mediawiki_history
where
    event_user_id in (
        7971454, 7630487, 29548314, 169932,
        1241395, 14754180, 16163829, 36027965,
        36919251, 37020397, 28526886, 28303085,
        30943519, 30775112, 32487125, 30943557,
        31584194, 29929605, 30943142, 32547336,
        31767650, 29548814, 31082738, 30747253,
        30989504, 32838619, 35052220, 35310020,
        13080937, 555876, 7977532, 19937304,
        1481183, 29076045, 1956565, 34551626,
        31469439, 30293052, 33162048, 33061987,
        36489581, 36633928, 28277776, 26501398,
        34237464, 29655875, 29774585, 36874166,
        36135233, 36932906, 33559373, 34844422
    ) and
    wiki_db = "enwiki" and
    snapshot = "2019-07"
limit 20

Event Timeline

Restricted Application added a subscriber: Aklapper. ยท View Herald TranscriptAug 21 2019, 11:55 AM
nshahquinn-wmf renamed this task from Actor table copy in the Data Lake contains mangled usernames to Import of MediaWiki tables into the Data Lakes mangles usernames.Aug 21 2019, 12:01 PM
nshahquinn-wmf updated the task description. (Show Details)
โ€ข fdans moved this task from Incoming to Data Quality on the Analytics board.
Milimetric moved this task from Data Quality to Incoming on the Analytics board.
Milimetric added a subscriber: lexnasser.
โ€ข fdans raised the priority of this task from Medium to High.Apr 15 2021, 5:04 PM
โ€ข fdans moved this task from Incoming to Data Quality on the Analytics board.
Milimetric moved this task from Next Up to In Progress on the Analytics-Kanban board.
Milimetric subscribed.

seems related to T284623, so I'm going to look at both.

Indeed this seems to be the same issue as in the other task, namely wrong charset used in sqooping. We should be using utf8mb4, I think, but I have to do some more reading because of course it's not simple.

select actor_name, convert(actor_name using utf8), convert(actor_name using utf8mb4) from actor where actor_user in (7971454,7630487,29548314,169932,1241395,14754180,16163829,36027965,36919251,37020397,28526886,28303085,30943519,30775112,32487125,30943557,31584194,29929605,30943142,32547336,31767650,29548814,31082738,30747253,30989504,32838619,35052220,35310020)

I'm going to merge this into the other task, even though it's a totally different symptom the root cause is the same and is fixed by that patch.