Since we talked in today's meeting about the performance of scanning a large korma identities DB with 80000 entries...
If I remember correctly, "source": "wikimedia:its" is Bugzilla and "source": "wikimedia:its_1" is Phabricator.
There are 523 identities with "source": "wikimedia:its" in the korma identities DB which are duplicates simply because the @ in the username is HTML-encoded in one copy and not in the other ("username&#64;example.com" vs "username@example.com").
See the very last command below (the other commands are just explaining what I did plus checking if the data is reasonable).
$:andre\> pwd
/home/user/.local/bin
$:andre\> curl http://stedolan.github.io/jq/download/linux64/jq -o ./jq
$:andre\> chmod u+x jq
$:andre\> grep "wikimedia:its\"" ~/wm/git/mediawiki-identities/wikimedia-affiliations.json | wc -l
13137
$:andre\> cat ~/wm/git/mediawiki-identities/wikimedia-affiliations.json | jq '.uidentities | .[] | .identities | .[] | select(.source == "wikimedia:its") | .username' | wc -l
13137
$:andre\> cat ~/wm/git/mediawiki-identities/wikimedia-affiliations.json | jq '.uidentities | .[] | .identities | .[] | select(.source == "wikimedia:its") | .username' | grep -E '\&\#64|\@' | wc -l
9811
$:andre\> cat ~/wm/git/mediawiki-identities/wikimedia-affiliations.json | jq '.uidentities | .[] | .identities | .[] | select(.source == "wikimedia:its") | .username' | grep '\&\#64' | wc -l
605
$:andre\> cat ~/wm/git/mediawiki-identities/wikimedia-affiliations.json | jq '.uidentities | .[] | .identities | .[] | select(.source == "wikimedia:its") | .username' | grep '\@' | wc -l
9206
$:andre\> cat ~/wm/git/mediawiki-identities/wikimedia-affiliations.json | jq '.uidentities | .[] | .identities | .[] | select(.source == "wikimedia:its") | .username' | grep -E '\&\#64|\@' | sed 's/\&\#64;/\@/g' | sort | uniq -c | more | sort -rn | head -n 1000
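To list only the usernames that exist in both the encoded and the plain form, the last command above could be narrowed down roughly as in the following sketch. It assumes one username per line on stdin (as the jq pipeline prints them, with or without surrounding quotes); the function name encoding_duplicates is made up for illustration and is not part of any tool mentioned here.

```shell
# Hypothetical helper (name chosen for illustration only).
# Reads one username per line on stdin and prints every address that
# appears both as "user&#64;host" and as "user@host".
encoding_duplicates() {
  awk '
    {
      name = $0
      sub(/^"/, "", name)   # drop the leading quote printed by jq
      sub(/"$/, "", name)   # drop the trailing quote printed by jq
    }
    name ~ /&#64;/ {
      gsub(/&#64;/, "@", name)   # decode the HTML entity
      encoded[name] = 1
      next
    }
    { plain[name] = 1 }
    END {
      for (n in encoded)
        if (n in plain) print n
    }
  ' | sort
}

# Possible usage (assumes jq and the JSON file from the session above):
#   jq '.uidentities | .[] | .identities | .[] | select(.source == "wikimedia:its") | .username' \
#     wikimedia-affiliations.json | encoding_duplicates | wc -l
```

Unlike a plain `sort | uniq -c`, this only reports names where one copy was encoded and the other was not, so identical repeats in the same form are not counted.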
Also, there are many username values which are not even email addresses, although Bugzilla *required* an email address as the user name:
$:andre\> cat ~/wikimedia/git/bitergia/mediawiki-identities/wikimedia-affiliations.json | jq '.uidentities | .[] | .identities | .[] | select(.source == "wikimedia:its") | .username' | grep -Ev '\&\#64|\@' | wc -l
3326
I searched for some of those 3326 items: they all seem to be duplicates of other identities with "complete" email addresses, just missing the @ and the domain.
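A rough way to check that observation across all entries, instead of spot-checking a few, might look like the sketch below. It assumes one username per line on stdin and treats everything before the first @ as the local part; the function name localpart_matches is hypothetical.

```shell
# Hypothetical helper (name chosen for illustration only).
# Reads one username per line on stdin and prints every username without
# an "@" that equals the local part of some complete address in the input.
localpart_matches() {
  awk '
    {
      name = $0
      sub(/^"/, "", name)        # drop the leading quote printed by jq
      sub(/"$/, "", name)        # drop the trailing quote printed by jq
      gsub(/&#64;/, "@", name)   # decode the HTML entity first
    }
    name ~ /@/ {
      lp = name
      sub(/@.*/, "", lp)         # keep only the part before the first "@"
      locals[lp] = 1
      next
    }
    { bare[name] = 1 }
    END {
      for (n in bare)
        if (n in locals) print n
    }
  ' | sort
}
```

A match only shows that a bare name and a local part are identical strings; it does not prove both belong to the same person, so the output would still need a manual look.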
Does anybody have an explanation? :D