Correct affiliation for code review contributors of the past 30 days
Closed, Resolved · Public

Description

http://korma.wmflabs.org/browser/scr-contributors.html has many missing affiliations, and some of them are not correct or need updating.

We need to organize a crowdsourcing exercise to have at least the data of the contributors active in the past 30 days up to date. We can create a wiki page and dump the list. It would be a good opportunity to ask for previous affiliations (a timeline is possible, not simultaneous affiliations) and country of residence.

See https://www.mediawiki.org/wiki/Community_metrics#How_to_update_user_data

Event Timeline

Qgil raised the priority of this task from to Low.
Qgil updated the task description.
Qgil subscribed.
Aklapper raised the priority of this task from Low to Medium. (Sep 27 2015, 6:27 PM)
Aklapper added a subscriber: Dicortazar.

Went through http://korma.wmflabs.org/browser/scr-contributors.html by screenscraping and some fun with spreadsheet processing.

  • Korma User Affiliation corrections in P2112
  • Korma Username to merge (same persons/identity) in P2113

@Qgil: You're welcome to take another quick look at those two lists.
@Dicortazar: Passing this to you to get that data updated in the DB after Quim has given an ACK. As usual it would be awesome to have cleaner September stats already, but that might be too tight a schedule...

@Aklapper, according to the process previously defined, this is a task that can be easily done by you or @Qgil.

The process is now fully automated and this shouldn't take that long.

Once this is ready to go, it is as easy as uploading the latest version of the exported JSON file to the Bitbucket repository.

Do you want to give it a try?

@Dicortazar: I'd like to. But I'm afraid I don't fully get it yet:
I have read the docs (which explain how to use Sortinghat but not where) but I still have no clue how to get an exported JSON file. Is there maybe an implicit undocumented assumption that I have shell access to the korma.wmflabs.org machine? :) I don't see any obvious "identities" related file in mediawiki-dashboard's /browser/data/json/ folder either.

OK, for this there's a Bitbucket account where that info is updated daily. It holds the JSON file to import and work with. So far Quim has access to it; do you need access as well?

Yes, because I cannot be the single point of failure. ;)

@Dicortazar: Would be very welcome. Feel free to drop me a private email so I can proceed here. :)

Quick status update: I'm waiting for @Lcanasdiaz to provide me access.
Things potentially going slow because Bitergia is about to move away from Bitbucket if I got it correctly.

Oops, I made a mistake with the ticket. See my comment at https://phabricator.wikimedia.org/T60585#1734723

To sum up: we're ready, I just need your GitHub username.

I've got access to the JSON file (thanks!).
I git cloned sortinghat, ran ./setup.py build and ./setup.py install, and installed some deps (python-sqlalchemy, python-jinja2 and python-mysql in my case)....

....but it looks like sortinghat requires a DB and only offers an export command, with no way to import the JSON file into a DB (and create a schema):

$:andre\> sortinghat export
Error: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2 "No such file or directory") (err: 2002)
$:andre\> sortinghat import
Error: Unknown command import

...and no import-related command listed by sortinghat --help either.

$:andre\> sortinghat merge --help

does not print the argument parameters but only the error about not having set up a MySQL DB. I'd call that unexpected.

So I guess I have to stick to editing a huge JSON file in a text editor and cross fingers to not make any mistakes.

Looking at the JSON file (85ec9724c524618f02d593d5180c29d9e166ece9 and 1ac2657a87de642b39f5e17344ea42eeba842e64), the software does not seem to realize that in identities\username an encoded @ is the very same as a literal @ when trying to identify matches. Is there already an upstream ticket about that, or should I file one (and against which repo)?
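
A minimal sketch of the kind of normalization meant here, in Python. The thread does not preserve which encoded form of @ appears in the data, so this assumes it is either an HTML entity or a percent-encoded character; `normalize_username` is a hypothetical helper and not part of sortinghat.

```python
import html
from urllib.parse import unquote


def normalize_username(username):
    """Return the username with encoded characters decoded.

    Assumption: the encoded "@" is an HTML entity (e.g. "&#64;") or a
    percent-encoded character ("%40"). Other encodings are not handled.
    """
    decoded = html.unescape(unquote(username))
    return decoded.strip()


def same_username(a, b):
    """Compare two usernames after normalization."""
    return normalize_username(a) == normalize_username(b)
```

With a step like this before matching, `same_username("user&#64;example.org", "user@example.org")` would treat both spellings as one username.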

Note to myself: The import command is called "load", as Daniel told me, and as I see now in the options. Stupid me.

Aklapper raised the priority of this task from Medium to High. (Oct 20 2015, 4:46 PM)
Aklapper moved this task from Need Discussion to Doing on the DevRel-October-2015 board.

  • This is time consuming.
  • I prioritized merging accounts of people active in the last 30 days.
  • I prioritized adding affiliation info for people active in the last 30 days who have an affiliation.
  • I mostly ignored turning "Unknown" into "Independent" affiliation. Punting.
  • I mostly ignored (except for highly active folks) finding and adding info on when exactly someone joined an affiliation. Punting.
  • I've contacted WMDE asking for offboarding info of some folks (info is not public). Contact is on holidays this week, says the auto-reply message.
  • I've got a small list of highly active accounts left to sort out in the next weeks. After that I'll close this task. I think I've survived the worst part.

Notes to myself:

  • Concept of unique identities, each of which consists of at least one identity, exactly one profile, and zero or more enrollments.
  • To associate identities with a unique identity I want merge, not move or add. Docs.
  • When an affiliation already exists (with e.g. wrong dates) and sortinghat enroll is run without --merge, it just adds a second item for the very same org, as shown by sortinghat show. Always use --merge.
  • Importing the ~80000 identities on my old testing machine took about 9-10 hours.
  • Using both parameters in `sortinghat export --identities --orgs updated.json` does not work, so I stuck with the first one. It took 200 minutes.
  • I want better offboarding processes. In any context.
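
The data model from these notes can be sketched in Python as follows. This is an illustration only: the class and function names are hypothetical, and the `merge` flag is a simplification that replaces earlier records for the same organization. It is not a reimplementation of what sortinghat's --merge does internally; it only shows the duplicate-versus-single-record outcome described above.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Enrollment:
    organization: str
    start: date
    end: date


@dataclass
class UniqueIdentity:
    uuid: str
    profile_name: str                                 # exactly one profile
    identities: list = field(default_factory=list)    # at least one identity
    enrollments: list = field(default_factory=list)   # zero or more


def enroll(uid, organization, start, end, merge=False):
    """Add an enrollment to a unique identity.

    Without merge, a new record is always appended, so an organization
    that is already present ends up listed twice. With merge (simplified
    here), earlier records for that organization are replaced.
    """
    if merge:
        uid.enrollments = [
            e for e in uid.enrollments if e.organization != organization
        ]
    uid.enrollments.append(Enrollment(organization, start, end))
    return uid
```

Calling `enroll` twice for the same organization without `merge=True` leaves two records on the unique identity, which matches the duplicate entry seen in the notes.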

I pushed my first update. All change commands are listed in P2215. Data on korma should get updated soon+automatically. (Thanks Daniel for the explanations!)
After that update (and receiving more info; see comment above) I'll have a second go here and then we're done in this task, just in time for October.

@Aklapper the data on korma is already updated

@Luiscanasdiaz: I don't see that fully reflected.
Just one example (there are several), on http://korma.wmflabs.org/browser/scr-contributors.html I see both

  • "matma.rex" (under "Code Review Users")
  • "Bartosz Dziewoński" (under "Code Review Committers")

which are the same person. I'd expect only one single consistent name to be displayed as I merged all occurrences into uidentity 1ac2657a87de642b39f5e17344ea42eeba842e64 - see P2215 for a complete log of my changes to mediawiki-identities.
For that user I can only find a single identity for "source": "wikimedia:scr" which says "id": "e8a5dc53821029238c59cdd9daa100e432dbd2bb", "name": "Bartosz Dziewo\u0144ski" and the only place where I can actually find "name": "matma.rex" is in the "profile" of that uidentity.

So either my expectations are wrong as different names are displayed for reasons I don't understand, or the software sometimes (?) picks the name from the scr source (sub)identity of a uidentity and sometimes from the profile of a uidentity.

Also, looking at the right panel of http://korma.wmflabs.org/browser/scr.html and just picking one example, Legoktm is listed as Independent which should not be the case as I ran sortinghat enroll --merge --from 2013-12-12 --to 2100-01-01 25de7a6387ca833bd2fbe6e6f0a3ddc72cd065df "Wikimedia Foundation".

@Aklapper, I've been reviewing the data with my colleague Santiago Dueñas and we discovered that the name we are showing in some tables is random when you have more than one identity. That means that:
#1 you did it well
#2 data is already aggregated
#3 the information about matma on the left table shows the aggregated data for the two accounts
#4 the information about Bartosz on the right table shows the aggregated data for the two accounts
#5 we have a bug in our queries, that should return the same name always
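
One possible rule for always returning the same name, sketched in Python. This is not the actual GrimoireLib query: the dictionary layout only mirrors the field names quoted in this thread ("profile", "identities", "name"), the "uuid" key is an assumption, and preferring the profile name is one choice among several.

```python
def display_name(uidentity):
    """Pick one stable display name for a unique identity.

    Rule (an assumption for illustration): use the profile name when it
    is set; otherwise fall back to the alphabetically first identity
    name, so the result never depends on row order.
    """
    profile = uidentity.get("profile") or {}
    if profile.get("name"):
        return profile["name"]
    names = sorted(
        identity["name"]
        for identity in uidentity.get("identities", [])
        if identity.get("name")
    )
    return names[0] if names else uidentity.get("uuid", "")
```

Any deterministic rule would do; the point is that both tables would then show the same name for a merged identity.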

Thanks for clarifying! What's the bug report URL (in GitHub?) that allows me to follow and find out when it's fixed?

@Luiscanasdiaz: Another problem I still see with the refreshed and deployed data:

In the right panel of http://korma.wmflabs.org/browser/scr-contributors.html , tab "Last 30 days", Legoktm is listed as Independent which should not be the case as I ran sortinghat enroll --merge --from 2013-12-12 --to 2100-01-01 25de7a6387ca833bd2fbe6e6f0a3ddc72cd065df "Wikimedia Foundation".
I'd expect that to be reflected as this is about the "last 30 days" which are after 2013-12-12.
A bug? If so, where to file in upstream? :)
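
The expectation can be written down as a small date-range lookup in Python. `affiliation_on` is a hypothetical helper and the tuple layout is an assumption; the dates are the ones from the enroll command above.

```python
from datetime import date


def affiliation_on(enrollments, day, default="Unknown"):
    """Return the organization whose enrollment period contains `day`.

    `enrollments` is a list of (organization, start, end) tuples. If no
    period covers the day, the default label is returned.
    """
    for organization, start, end in enrollments:
        if start <= day <= end:
            return organization
    return default


# Enrollment as entered with --from 2013-12-12 --to 2100-01-01
enrollments = [("Wikimedia Foundation", date(2013, 12, 12), date(2100, 1, 1))]
```

Any day in October 2015 falls inside that period, so a lookup such as `affiliation_on(enrollments, date(2015, 10, 20))` should yield "Wikimedia Foundation" rather than "Independent".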

@ Bitergia: Generally, any identities with "email": "username@svn.wikimedia.org" got assigned to "organization": "Wikimedia Foundation". That is not good.
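
A rough way to list the affected entries, sketched in Python. The layout of the exported JSON is an assumption here: a mapping of uuid to an object with "identities" (each with an "email") and "enrollments" (each with an "organization"), based only on the field names quoted in this thread. The function name is hypothetical.

```python
def svn_identities_enrolled_in(uidentities,
                               organization="Wikimedia Foundation",
                               domain="svn.wikimedia.org"):
    """Return uuids enrolled in `organization` that have an identity
    whose email address is on `domain`.

    Assumed layout: {uuid: {"identities": [...], "enrollments": [...]}}.
    """
    flagged = []
    for uuid, uid in uidentities.items():
        orgs = {e.get("organization") for e in uid.get("enrollments", [])}
        if organization not in orgs:
            continue
        emails = [i.get("email") or "" for i in uid.get("identities", [])]
        if any(email.endswith("@" + domain) for email in emails):
            flagged.append(uuid)
    return sorted(flagged)
```

The resulting list would still need manual review, since some of those people may really be affiliated with the organization.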

(And note to myself: P2250 will be the 2nd iteration, once I've pulled again for safety)

The bug report URL: https://github.com/VizGrimoire/GrimoireLib

Second iteration of identity merges and affiliation corrections pushed in 7ba9d95f0039e6b6583d09aa0002ff444a4212a1

Once that's reflected in korma we can close this task.

Bug reports filed as:

  • "Incorrect affiliation displayed on scr-contributions.html"
  • "Convert @ in usernames to @ for improved identitiy matching"
  • "scr metrics are not using correctly information about profiles and affiliation"

The second iteration is now reflected in korma: http://korma.wmflabs.org/browser/scr-contributors.html

It is still misleading on korma due to "scr metrics are not using correctly information about profiles and affiliation" but that is out of scope for this task.
Correcting lots of stuff in the DB: Done. Hence resolving this task.