Page MenuHomePhabricator

Import of MediaWiki tables into the Data Lakes mangles usernames
Open, NormalPublic

Description

Something in the sqoop process apparently causes wmf_raw.mediawiki_private_actor(and therefore wmf.mediawiki_history as well) to mangle usernames that use unusual characters.

I found this looking for duplicate actor IDs for the same user:

select
    actor_name, 
    count(actor_id) as n_actor_ids,
    collect_list(actor_user) as user_ids
from wmf_raw.mediawiki_private_actor
where
    snapshot = "2019-07" and
    wiki_db = "enwiki"
group by actor_name
having n_actor_ids > 1
actor_namen_actor_idsuser_ids
????28[7971454,7630487,29548314,169932,1241395,14754180,16163829,36027965,36919251,37020397,28526886,28303085,30943519,30775112,32487125,30943557,31584194,29929605,30943142,32547336,31767650,29548814,31082738,30747253,30989504,32838619,35052220,35310020]
????????4[13080937,555876,7977532,19937304]
????????????2[1481183,29076045]
????????????????????2[1956565,34551626]
????????????????????????2[31469439,30293052]
???????????????????????? ????????????????????????2[33162048,33061987]
????????????????????????????3[36177882,36489581,36633928]
????????????????????????????????5[28277776,26501398,34237464,29655875,29774585]
????????????????????????????????????2[36874166,36135233]
????????????????????????????????????????????3[36932906,33559373,34844422]

But all of these users have different usernames (although they all use unusual characters):

select *
from enwiki.actor
where actor_user in (
    7971454, 7630487, 29548314, 169932,
    1241395, 14754180, 16163829, 36027965,
    36919251, 37020397, 28526886, 28303085,
    30943519, 30775112, 32487125, 30943557,
    31584194, 29929605, 30943142, 32547336,
    31767650, 29548814, 31082738, 30747253,
    30989504, 32838619, 35052220, 35310020,
    13080937, 555876, 7977532, 19937304,
    1481183, 29076045, 1956565, 34551626,
    31469439, 30293052, 33162048, 33061987,
    36489581, 36633928, 28277776, 26501398,
    34237464, 29655875, 29774585, 36874166,
    36135233, 36932906, 33559373, 34844422
)
actor_idactor_useractor_name
784003169932๐Ÿ˜‚
1755400555876????????
24387611241395๔ฟฟ
26784231481183????????????
31534211956565????????????????????
88253477630487๐ฅฃ
91656587971454๐กšด
91717347977532๐ชšฅ๐ชšฅ
1427635313080937๐ ”ป๐ชšฅ
1594954414754180๐›ฅ
1735912616163829๐ŸŒˆ
2112062119937304๐Ÿ‡บ๐Ÿ‡ธ
2766005826501398๐Œผ๐Œฐ๐Œป๐Œพ๐Œฐ๐‚๐Œฝ๐Œน
2941387428277776๐•€๐““๐’†๐“ณ๐“ช๐“ฟ๐“Š๐•€
2943862628303085๐œ†
2965743928526886๐”Š
3019183929076045๐จ€๐จจ๐จช
24050829548314๐Ÿฑ
3065542329548814๐Ÿค–
3076049729655875๐•ฎ๐–—๐–š๐–˜๐–†๐–‰๐–Š๐–—
3087716929774585๐Š๐จ๐ฐ๐š๐ฅ๐ฌ๐ค๐ข
3102989529929605๐“ผ
3138883230293052๐•น๐–†๐–•๐–™๐–”๐–—
3183509630747253๐Ÿ’ค
3186244530775112๐Ÿฃ
3202696130943142๐Ÿ”
3202733430943519๐Ÿ˜Š
3202737230943557๐Ÿ˜Ž
3207232530989504๐Ÿ’Š
3216368831082738๐Ÿ’ฉ
3254401631469439๐Œด๐Œน๐ƒ๐Œท๐‰๐‚
3265720631584194๐ŸŒš
3283781931767650๐Ÿœ
3354687632487125๐Ÿ™ƒ
3360617032547336๐Ÿ 
3389352232838619๐Ÿ”ฅ
3411353333061987๐•ฒ๐•บ๐•ท๐•ฏ๐•ฐ๐•น ๐Ÿ„บ๐Ÿ„ธ๐Ÿ„ป๐Ÿ„ป๐Ÿ„ด๐Ÿ…
3421198033162048๐•€๐•ฆ๐•๐•š๐•’๐•Ÿ ๐•‹๐•’๐•Ÿ๐•’๐•ค๐•–
3460336833559373๐“˜๐“ฝ๐“ผ๐“™๐“ฒ๐“พ๐“™๐“ฒ๐“ฝ๐“ผ๐“พ
3527259634237464๐•ฎ๐–๐•ป๐–‘๐–†๐–™๐–Š๐–‘
3558369634551626๐’ฝ๐“Ž๐‘’๐“ˆ๐“Š
3587320634844422๐–‡๐–š๐–Ž๐–‘๐–Š๐–‰๐–š๐–ˆ๐–†๐–“๐–
3607782635052220๐ˆ
3633164435310020๐ฉน‰
19008355636027965๐ŸฆŠ
19053964436135233๐•ฏ๐–Ž๐–†๐–“๐–†๐Ÿ ๐Ÿš๐Ÿก๐Ÿ›
19229462936489581๐–๐–†๐–๐––๐–š๐–Ž๐–“
19298896936633928๐•ฒ๐–‘๐–†๐–˜๐–˜๐•ธ๐–†
19410141236874166๐“›๐“ฒ๐“ท๐“พ๐“›๐“ฒ๐“ท๐“พ๐“ผ
19431597136919251๐Ÿ‘€
19438444636932906๐ค๐š๐ญ๐ข๐ž๐›๐š๐›๐ข๐œ๐ณ
19480396837020397๐Ÿฆ

This propagates into mediawiki_history too. The following query returns only null values for event_user_text:

select event_user_text
from wmf.mediawiki_history
where
    event_user_id in (
        7971454, 7630487, 29548314, 169932,
        1241395, 14754180, 16163829, 36027965,
        36919251, 37020397, 28526886, 28303085,
        30943519, 30775112, 32487125, 30943557,
        31584194, 29929605, 30943142, 32547336,
        31767650, 29548814, 31082738, 30747253,
        30989504, 32838619, 35052220, 35310020,
        13080937, 555876, 7977532, 19937304,
        1481183, 29076045, 1956565, 34551626,
        31469439, 30293052, 33162048, 33061987,
        36489581, 36633928, 28277776, 26501398,
        34237464, 29655875, 29774585, 36874166,
        36135233, 36932906, 33559373, 34844422
    ) and
    wiki_db = "enwiki" and
    snapshot = "2019-07"
limit 20

Event Timeline

Restricted Application added a subscriber: Aklapper. ยท View Herald TranscriptAug 21 2019, 11:55 AM
Neil_P._Quinn_WMF renamed this task from Actor table copy in the Data Lake contains mangled usernames to Import of MediaWiki tables into the Data Lakes mangles usernames.Aug 21 2019, 12:01 PM
Neil_P._Quinn_WMF updated the task description. (Show Details)
fdans triaged this task as Normal priority.Aug 22 2019, 5:12 PM
fdans moved this task from Incoming to Data Quality on the Analytics board.