Page MenuHomePhabricator

Eliminate duplicated «"source": "wikimedia:its"» identities in korma identities DB
Closed, ResolvedPublic

Description

As we talked about performance scanning a large korma identities DB with 80000 entries today in our meeting...

If I remember correctly, "source": "wikimedia:its" is Bugzilla and "source": "wikimedia:its_1" is Phabricator.

There are 523 identities with "source": "wikimedia:its" in the korma identities DB which are duplicates simply due to encoding the @ in the username ("username@example.com" vs "username@example.com").
See the very last command below (the other commands are just explaining what I did plus checking if the data is reasonable).

$:andre\> pwd
/home/user/.local/bin
$:andre\> curl http://stedolan.github.io/jq/download/linux64/jq -o ./jq
$:andre\> chmod u+x jq
$:andre\> grep "wikimedia:its\"" ~/wm/git/mediawiki-identities/wikimedia-affiliations.json | wc -l
13137
$:andre\> cat ~/wm/git/mediawiki-identities/wikimedia-affiliations.json | jq '.uidentities | .[] | .identities | .[] | select(.source == "wikimedia:its") | .username' | wc -l
13137
$:andre\> cat ~/wm/git/mediawiki-identities/wikimedia-affiliations.json | jq '.uidentities | .[] | .identities | .[] | select(.source == "wikimedia:its") | .username' | grep -E '\&\#64|\@' | wc -l
9811
$:andre\> cat ~/wm/git/mediawiki-identities/wikimedia-affiliations.json | jq '.uidentities | .[] | .identities | .[] | select(.source == "wikimedia:its") | .username' | grep '\&\#64' | wc -l
605
$:andre\> cat ~/wm/git/mediawiki-identities/wikimedia-affiliations.json | jq '.uidentities | .[] | .identities | .[] | select(.source == "wikimedia:its") | .username' | grep '\@' | wc -l
9206
$:andre\> cat ~/wm/git/mediawiki-identities/wikimedia-affiliations.json | jq '.uidentities | .[] | .identities | .[] | select(.source == "wikimedia:its") | .username' | grep -E '\&\#64|\@' | sed 's/\&\#64;/\@/g' | sort | uniq -c | more | sort -rn | head -n 1000

Also, there are many username values which are not even email addresses but Bugzilla *required* an email address as a user name:

$:andre\> cat ~/wikimedia/git/bitergia/mediawiki-identities/wikimedia-affiliations.json | jq '.uidentities | .[] | .identities | .[] | select(.source == "wikimedia:its") | .username' | grep -Ev '\&\#64|\@' | wc -l
3326

Searching for some of those 3326 items, they all seem to be duplicates of other identities with "complete" email addresses and miss the @ and the domain.

Anybody having any explanations? :D

Event Timeline

Aklapper raised the priority of this task from to Low.
Aklapper updated the task description. (Show Details)
Aklapper added a project: wikimedia.biterg.io.
Aklapper added subscribers: Aklapper, Lcanasdiaz, Dicortazar.
Aklapper added a project: DevRel-March-2016.

There are 523 identities with "source": "wikimedia:its" in the korma identities DB which are duplicates simply due to encoding the @ in the username ("username@example.com" vs "username@example.com").

I've cleaned those manually in https://github.com/Bitergia/mediawiki-identities/commit/130799337e197ef79989fef38405b002ebcbdab1

Also, there are many username values which are not even email addresses but Bugzilla *required* an email address as a user name

This SQL query would be close enough. Still needs removing unclear items (e.g. is john john@foo.baror john@bar.foo) and then removing the non-ID columns in a spreadsheet application before using the data to batch-run sortinghat merge on the accounts DB:

SELECT a.id, b.uuid, a.username, b.username FROM 
(SELECT id,username FROM identities WHERE source = "wikimedia:its" AND username NOT LIKE "%@%" AND uuid = id) a, 
(SELECT uuid, username FROM identities WHERE source = "wikimedia:its" AND username LIKE "%@%") b 
WHERE SUBSTRING_INDEX(b.username, '@', 1) = a.username 
ORDER BY a.username ASC;

Merged nearly 2000 disconnected ITS (Bugzilla) identities in https://github.com/Bitergia/mediawiki-identities/commit/aee15abf880b10d881a0b5fc2c8b091e861da5fb

Not fully "done" because some are not unique but merged as much as we could, hence closing.