OwlBot seems to merge random user accounts in korma user data
Closed, ResolvedPublic

Description

uuid 028b6b8dce6241c60c313f9a4dd1ee0c957db7e5 suddenly has numerous ids from very different people merged into that uuid.
This makes user data entirely unreliable by displaying random names and merging contributions of random persons.

I gave this a quick investigation but have no idea what's going on. Git change 77873dc9f778443df352013bcd77a80aaa4d45c6 somehow "renamed" uuid 136155e528cf630d61a8352ab1a80ae2ddbc7a03 into the new uuid 028b6b8dce6241c60c313f9a4dd1ee0c957db7e5 (which was previously one id under 136155e528cf630d61a8352ab1a80ae2ddbc7a03). Before that, git change 7b7d06d6a57e69f1cb855fee5673b66180cc4635 changed the uuid 00da234ef9520f36d14af411cda0755be9d6d529 into aff499e52b8bcd8e3357c7dea8e77d1a7be28f30.
(It also removed data for reasons I don't know, e.g. the email address value of id 00293d7819ecddd16fd561bca66c42e59213b532.)

I checked the diff of the last human git commit d27b36e73df818b156e45c11950efd4fd2ebe151 by myself but I don't see that account touched in there...
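
One way to spot such an over-merged uuid is to count distinct names per uuid in the imported data. The sketch below is only an illustration under assumptions: the `uidentities` mapping (uuid to a list of identity dicts with a `name` key), the function name and the `max_names` threshold are all hypothetical and not taken from the actual identities file format.

```python
def suspicious_uuids(uidentities, max_names=5):
    """Return {uuid: number_of_distinct_names} for uuids that carry
    more distinct names than max_names (hypothetical threshold)."""
    flagged = {}
    for uuid, identities in uidentities.items():
        # Collect the distinct, non-empty names attached to this uuid.
        names = {i.get("name") for i in identities if i.get("name")}
        if len(names) > max_names:
            flagged[uuid] = len(names)
    return flagged
```

A uuid reported here is not necessarily wrong (one person may use several spellings), so the result is a list of candidates for manual review, not a verdict.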

Event Timeline

Aklapper raised the priority of this task from to Unbreak Now!.
Aklapper updated the task description.
Aklapper added subscribers: Aklapper, Qgil, Lcanasdiaz, Dicortazar.
Aklapper added a project: DevRel-December-2015.
Aklapper moved this task from Backlog to Ready to Go on the DevRel-December-2015 board.

I'm having a look at the data. If there's an id with loads of merges, this is usually an id to be added to the sortinghat blacklist.

Blacklisted names are generic ones such as 'root' or blank spaces.
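
As a rough illustration of why a blacklist matters for merging, here is a minimal sketch of name-based matching that refuses to merge on generic names. It does not use sortinghat's real API or its real blacklist; `BLACKLIST`, `is_blacklisted` and `mergeable` are hypothetical names, and the identity dicts with a `name` key are an assumed shape.

```python
# Hypothetical blacklist holding only the examples named above:
# 'root' and names that are empty or consist of blank spaces.
BLACKLIST = {"root", ""}

def is_blacklisted(name):
    """Return True for names too generic to identify one person."""
    return name is None or name.strip().lower() in BLACKLIST

def mergeable(a, b):
    """Allow a name-based merge only when neither name is generic."""
    if is_blacklisted(a["name"]) or is_blacklisted(b["name"]):
        return False
    return a["name"].strip().lower() == b["name"].strip().lower()
```

Without such a check, every account named 'root' would match every other one, which produces exactly the kind of id with loads of merges described above.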

I'm importing the JSON file right now to check the ids you mention.

@Dicortazar: Any news to share?

Aklapper moved this task from Backlog to Doing on the wikimedia.biterg.io board. Dec 10 2015, 1:39 PM
Lcanasdiaz set Security to None.
Lcanasdiaz removed a subscriber: Lcanasdiaz.
Lcanasdiaz added a comment. Edited Dec 15 2015, 12:21 PM

In order to fix this issue I'm going to split this identity and run the procedure again. I've seen this behaviour before when people share mail accounts. That does not seem to be the case here, so it could just be an error introduced the first time the data was imported.

After splitting them, I identified at least 6 people. The first idea is to split all the accounts and see if sortinghat manages to group them correctly. If not, we could have the same error during the next executions.

We're having issues getting the metrics with the sortinghat database. Something is broken in the load process, I'm debugging the sortinghat workflow and crossing my fingers :-/

Ok, data is updated finally.

The identities were correctly split; have a look at the identity 028b6b8dce6241c60c313f9a4dd1ee0c957db7e5 in the identities file.

@Lcanasdiaz: I see the "Manual edition" change and data looks reliable again on korma, but do we know what underlying problem happened, and how to avoid that in the future?

Plus looking at http://korma.wmflabs.org/browser/scr-contributors.html it seems that several staffers lost their affiliation info. I guess I can fix that manually but it would be great if that's avoidable in the future.

The ones that lost their affiliation were part of the huge identity I've just split. I thought sortinghat would be able to sort them correctly, but it didn't match them.

Aklapper closed this task as Resolved. Dec 18 2015, 1:41 PM

If I got it right, this was a potential bug introduced by the first execution of the tool that searches for identities to merge.
According to Bitergia it's unlikely to happen again.
Hence closing it as the data looks good to me. Thank you!

Followup: Andre to clean up in the next days (reassign and update affiliations; potentially split the Gerrit Upload Tool mess?).

After @Lcanasdiaz's manual edition, OwlBot's next https://github.com/Bitergia/mediawiki-identities/commit/d4c99f0af9b617e282451309e7e5fd70dbbd55f7 diff has "64 additions and 1,291,607 deletions". And that number scares me, as JSON filesize also went down from 67MB to 18MB. Was that intended?

Aklapper reopened this task as Open. Edited Dec 19 2015, 3:46 PM

Reopening. I imported the JSON file and it successfully finishes at uuid 3af174979f787f4b8a1597ae5f5838f9d83b329b (with 19494 identities in total; it was about 80000 before). As it's extremely unlikely that the algorithm does not create uuid hashes in the full hexadecimal range, we either lost data or something broke when exporting to JSON.
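
The reasoning above can be turned into a quick sanity check: hash-based uuids should start with every hex digit from 0 to f, so an export whose uuids stop around 3a... is most likely truncated. A minimal sketch, assuming the uuids have already been read from the JSON file into a plain list of strings (the function names are hypothetical):

```python
from collections import Counter

HEX_DIGITS = "0123456789abcdef"

def first_digit_histogram(uuids):
    """Count how many uuids start with each hex digit."""
    return Counter(u[0].lower() for u in uuids if u)

def missing_first_digits(uuids):
    """Return the hex digits that no uuid starts with, in order."""
    seen = set(first_digit_histogram(uuids))
    return [d for d in HEX_DIGITS if d not in seen]
```

With tens of thousands of identities, a non-empty result from `missing_first_digits` is a strong hint that part of the data did not make it into the file.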

Aklapper closed this task as Resolved. Dec 28 2015, 11:12 PM

https://github.com/Bitergia/mediawiki-identities/commit/1bba45afaa3696d5f44a0a531b66be062a33e660#diff-93489085b8f0a3138e4f542cd8abef0e seems to have fixed this issue and we're back to a sane size (68513943 bytes).
Hence closing as resolved.