
When indexing new users, identify identical email addresses and merge identities accordingly in the DB
Closed, Resolved (Public)

Description

Owlbot updated wikimedia-affiliations.json on 2016-11-25.
It added the uidentity f505a9da34c6bf833d9c461f863ea5e0b374753f with "source": "wikimedia:scm" and an "email" value defined.
That email value is the same as "email" in the "profile" 03fff776d570ededc83f7cb67d2cfcba5359dc5f.
So it's very likely the same person.

I'd expect an algorithm to recognize this and merge those identities automatically, so I do not have to merge them manually.

Does this sound reasonable?

Event Timeline

Aklapper created this task.

Going through today's diff, here is just one example: five detached uidentities that I'll have to merge manually, as they are all the same person:

0af3bfb0cc222e6c06a642b9b92fd1d7da4df5e0 (scr  : "username": "geek"; "email": "XYZ")
80a299df0262c114e270ae52220534a1ec40812d (its_1: "username": "ABC" ; "email": null )
d86b7859271e2a3f3faf091c3e953dcd19c40419 (irc  : "username": "ABC" ; "email": null )
f9bd84c000546714ebb3a6c06a24877f39c46633 (scr  : "username": "ABC" ; "email": "XYZ")
fe027a53c1a93f060e2b15b1d7df08191de72570 (scm  : "username": null  ; "email": "XYZ")

I'd at least expect the scm and scr ones to be merged automatically as their email values are identical.
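To make the expectation concrete, here is a minimal sketch of email-based matching applied to the five identities above. It is not the tool's actual matching code: the tuple layout and the `group_by_email` helper are hypothetical, and "ABC"/"XYZ" are the placeholders from the list, not real values.

```python
from collections import defaultdict

# Hypothetical records mirroring the list above: (uuid, source, username, email).
identities = [
    ("0af3bfb0cc222e6c06a642b9b92fd1d7da4df5e0", "scr", "geek", "XYZ"),
    ("80a299df0262c114e270ae52220534a1ec40812d", "its_1", "ABC", None),
    ("d86b7859271e2a3f3faf091c3e953dcd19c40419", "irc", "ABC", None),
    ("f9bd84c000546714ebb3a6c06a24877f39c46633", "scr", "ABC", "XYZ"),
    ("fe027a53c1a93f060e2b15b1d7df08191de72570", "scm", None, "XYZ"),
]


def group_by_email(records):
    """Return lists of uuids that share the same non-empty email.

    Emails are compared after trimming whitespace and lowercasing
    (an assumption; the real matcher may normalize differently).
    """
    groups = defaultdict(list)
    for uuid, _source, _username, email in records:
        if email:  # identities without an email cannot be matched this way
            groups[email.strip().lower()].append(uuid)
    # Only groups with more than one member are merge candidates.
    return [uuids for uuids in groups.values() if len(uuids) > 1]


merge_candidates = group_by_email(identities)
for uuids in merge_candidates:
    print("merge candidates:", ", ".join(u[:8] for u in uuids))
```

Under these assumptions, the two scr identities and the scm identity end up in one group, while the its_1 and irc identities (email null) are left out and would still need a username-based match or a manual merge.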

Going through the diff and merging manually (to have cleaner data and identify relevant contributors) takes quite some time. I'd like to avoid that.

Example from the last 9 days:
f94c87a3d35b3ab9d92d5c0d83426505d34f1820 should have been merged automatically into the existing 74a4c7d7fda0317c5844e7242d81b0ed17eb7d32, as they both have the same email address.

Last weekend, Jesus mentioned that this should be possible already and that it might "just" be our configuration that needs an update?

This seems to still be an issue and I'd highly welcome investigation:
OwlBot in git revision 25ea4f09445b8780a091eea8563333b3b9f85701 on March 25th 2017 added the uuid 5648ffae1ff7fbe200431898df8b744470a3a99a, which has an email address defined, to https://github.com/Bitergia/mediawiki-identities/commits/master/wikimedia-affiliations.json .
uuid 188579ca6a4438cf6450bd52b1e2d591c1c8d297 has the very same email address so I would have expected a merge.

The issue should be solved after the migration. As the uuids have changed, I cannot reproduce it. Let me know if you can still reproduce the issue and I'll work on it. If not, feel free to close it :)

@Albertinisg: Indeed! Thanks a lot!
The recent git change ffa92acf21a55c4f27f8e0935d970ca3e748b60b added ids with "source": "gerrit"/"git" (= sources which require email addresses) to the uuids 296a7b9077b1fcbf37bfb1a098504e62e8df1f2e and 6db0af9ad4c48883a4559032e2db1c01f55c1437, both of which already had ids that included email addresses. That looks exactly like what I was hoping for.
Hence closing this task as resolved. :)

Happy to see this; will save me some time (less manual checking of latest new accounts via JSON dump diffs and querying for email addresses to merge potential dups)!