Page MenuHomePhabricator

Find out (and fix) why we have a higher number of identity entries than before switching to new Bitergia DB scheme
Closed, ResolvedPublic

Description

After T157898#3354370 (and fixing my local sortinghat setup) I see ~105000 uidentity / uuid entries in the DB:

$:acko\> grep enrollments wikimedia-affiliations.json | wc -l
104758

Before the number was usually slightly less than 80000.

The file hadn't seen updates for a while so maybe we had so many new users in several places, however I'm suspicious.

  • SELECT id,uuid,name FROM identities WHERE email = "no-reply\@phabricator.wikimedia.org" AND id = uuid; shows 150 detached identities with `"source": "pipermail" (I need to blacklist that, probably some bugspam mailing list being indexed).

Event Timeline

Aklapper created this task.Jun 19 2017, 2:44 AM
Qgil triaged this task as Medium priority.Jun 19 2017, 1:54 PM
Qgil moved this task from Backlog to June on the Developer-Advocacy (Apr-Jun 2017) board.
Qgil added a subscriber: Qgil.

Assuming priority Normal.

Aklapper updated the task description. (Show Details)Jun 19 2017, 8:39 PM
  • The DB dump on Jun 16, 2017 includes 1502 duplicated uuids with source=phabricator - see the results of SELECT username, source, COUNT(*) FROM identities WHERE source = "phabricator" GROUP BY username, source HAVING COUNT(*) > 1;. Most of them (but not all, e.g. 790c99c07cd1529297e537505d157bc76c2938e7 and 49c861497d82e38502bcd00be6e7818c72f39427 is a name change due to marriage) have the very same username but one has an empty name while the other has not.

@Albertinisg: I can batch-merge these uuids but sounds like a potential bug you may want to investigate if interested. (Not a high priority for us though.)

Aklapper updated the task description. (Show Details)Jun 19 2017, 8:42 PM

@Aklapper this is something tricky as we discard by policy to merge accounts using the same mail with different names. 99% of the time this activity belongs to a bot.

So I don't see the point in having those accounts grouped with the rest of the accounts used by a human contributor. All these mails sent from no-reply@phabricator.wikimedia.org are sent by a bot, why would you like to add them to a "human" identity?

Aklapper updated the task description. (Show Details)Jun 20 2017, 12:23 PM

I don't see the point in having those accounts grouped with the rest of the accounts used by a human contributor. All these mails sent from no-reply@phabricator.wikimedia.org are sent by a bot, why would you like to add them to a "human" identity?

Sorry, entirely my fault: I wrote "whitelist" when I meant "blacklist" AND I did not realize that git change 6596aca9e13c1c9e366418eb51af0ca06b7595c5 already added it to the blacklist (I was just looking at another part of the DB).

  • The DB dump on Jun 16, 2017 includes 1502 duplicated uuids with source=phabricator - see the results of SELECT username, source, COUNT(*) FROM identities WHERE source = "phabricator" GROUP BY username, source HAVING COUNT(*) > 1;. Most of them (but not all, e.g. 790c99c07cd1529297e537505d157bc76c2938e7 and 49c861497d82e38502bcd00be6e7818c72f39427 is a name change due to marriage) have the very same username but one has an empty name while the other has not.

@Albertinisg: I can batch-merge these uuids but sounds like a potential bug you may want to investigate if interested. (Not a high priority for us though.)

@Aklapper my bad. While merging the identities from the old db and the new one I missed this field, and "" is different than null, that's why those are duplicated and have different uuid. I'll check it out, but it should be easy to fix. Thanks!

Regarding the numbers, I will need some time to have a look there. Consider also that the old db was probably not updated since the korma dashboard was down (2016?) .

and "" is different than null, that's why those are duplicated and have different uuid. I'll check it out, but it should be easy to fix.

Do you plan to fix this? If not I could also save you time by using the SQL query results to mass-merge them in the JSON dump file.

Regarding the numbers, I will need some time to have a look there. Consider also that the old db was probably not updated since the korma dashboard was down (2016?) .

Yeah, the more I play with the data, the more I think the numbers are actually fine, subtracting numerous duplicates due to "" vs null.
Hence proposing to close this task once the path forward for the "" vs null issue is defined.

! In T168217#3377790, @Aklapper wrote:
Do you plan to fix this? If not I could also save you time by using the SQL query results to mass-merge them in the JSON dump file.

Changes applied, but as the load command does not remove identities, I suggest loading again the identities file in a clean database (and I hope this is the last time we need to perform this action) :)

https://wikimedia.biterg.io:443/goto/3423b26f9958ae594cbfcc88d7778ba8 lists currently 79 established devs as "Last Attracted Developers" in June 2017 who have been active for way way longer in our community. I assume that is another artifact of this problem.

https://wikimedia.biterg.io:443/goto/3423b26f9958ae594cbfcc88d7778ba8 lists currently 79 established devs as "Last Attracted Developers" in June 2017 who have been active for way way longer in our community. I assume that is another artifact of this problem.

Yes, it's related with the demography issue in T161309: Git's "Last Attracted Developers" lists established developers and developers without a First Commit Date and that was fixed a week ago in https://github.com/grimoirelab/GrimoireELK/commit/3fe1c8df9d0352271e1c8842e6c3a163969ecbea. I still have to test it and apply it in the dashboard.

Aklapper closed this task as Resolved.Jul 4 2017, 5:51 PM

I found a bunch of uuids (like c37bab850e1019bcd78637cc3ce4617edf8dca3a) from gerrit which were detached while having email values existing in other ids by going through SELECT email FROM profiles GROUP BY email HAVING COUNT(email)>1; and merged them.
And SELECT name,id FROM identities WHERE source = "gerrit" AND username = ""; was also a short list to merge.

I don't think there is anything left to do here for us and I trust the increased number of unique identities, hence closing.

Qgil awarded a token.Jul 5 2017, 10:36 AM