"Median time to review for Gerrit Changesets, per month": External vs. WMF/WMDE/etc patch authors
Closed, Resolved (Public)

Description

Once T97715 has been implemented, split the result into "external/3rd party" patch authors and WMF/WMDE/etc. patch authors.
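
A minimal sketch of such a split, assuming "time to review" means the time between a changeset being uploaded and its first review (the interpretation of T97715 is still open, so this is an assumption). The tuple layout, the function names, and the `INTERNAL_ORGS` set are hypothetical and only serve as illustration:

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Hypothetical list of organizations counted as "WMF/WMDE/etc".
INTERNAL_ORGS = {"Wikimedia Foundation", "Wikimedia Deutschland"}


def author_group(affiliation):
    """Map an affiliation name to one of the two groups in the report."""
    return "internal" if affiliation in INTERNAL_ORGS else "external"


def median_review_days(changesets):
    """Return {(month, group): median days to review}.

    changesets: iterable of (uploaded, first_review, affiliation) tuples,
    where the first two items are datetime objects.
    """
    buckets = defaultdict(list)
    for uploaded, first_review, affiliation in changesets:
        month = uploaded.strftime("%Y-%m")
        days = (first_review - uploaded).total_seconds() / 86400
        buckets[(month, author_group(affiliation))].append(days)
    return {key: median(values) for key, values in buckets.items()}
```

Grouping by the month of upload is one possible choice; grouping by the month of the first review would be equally defensible.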

See Also:
T95238: Handling multiple affiliations (at once; like work vs spare time) in tech community metrics

Event Timeline

Aklapper created this task. May 24 2015, 3:47 PM
Aklapper set the priority of this task to Needs Triage.
Aklapper triaged this task as Normal priority.
Aklapper updated the task description.
Aklapper added a project: wikimedia.biterg.io.
Aklapper set Security to None.
Aklapper added a subscriber: Aklapper.
Qgil added subscribers: Acs, Qgil.

In http://korma.wmflabs.org/browser/gerrit_review_queue.html we have "Age of open changesets by affilation (monthly snapshot)" which should be enough to find out whether there are significant differences between paid staff and volunteers, and to extract the list of repositories and developers involved, if needed.

Currently that graph has two problems, though:

From the current list we have some legit affiliations:

  • Wikimedia Foundation
  • Individual (better renamed as Independent, as it was before)
  • Wikimedia Deutschland
  • WikiWorks

These are wrong, since those patches haven't been made in the name of these organizations. They are just the ISPs that provided the email addresses of the contributors:

  • Debian GNU_Linux
  • XS4ALL Internet bv
  • Free Software Foundation
  • GMX
  • GRNet
  • University of Melbourne

Also, what happened to Wikia and Unknown?

These two problems show that we cannot trust that graph, and it has been around for a long time. @Acs and I have spent a lot of time on it in several waves. I think this is a blocker of the Gerrit Cleanup Day. I'm not sure about the Basic Metrics goal, but it would be great to have this problem solved in this sprint, once and for all.

There are plans to exclude obvious Freemail providers like GMX from the stats.
We kept this task as "normal priority" for June; "great to have" if time allows once I have decided on the interpretation of T97715; but might end up in July if the higher priority tasks take too much time. :-/

Qgil added a comment. Jul 3 2015, 11:52 AM

The WMF Annual Plan 2015-16 includes a goal related to this task:

Set and monitor code review KPIs for all community-sourced contributions

Solving this task will help identify "community-sourced" contributions.

Qgil added a comment. Aug 6 2015, 8:11 AM

At least from the point of view of the WMF Engineering management, having reliable metrics by affiliation is going to be very useful. Failing to address patch contributions when both the authors and the maintainers are WMF employees shows a clear lack of coordination that WMF teams alone can solve.

At least among WMF teams, this WMF-specific data can help "improving our code review practices".

Aklapper updated the task description. Aug 24 2015, 12:32 PM

There are plans to exclude obvious Freemail providers like GMX from the stats.

@Dicortazar: Is there some dedicated (upstream?) task for that? Where exactly is information like "Freemail provider hostnames" currently stored? In the database itself that I could not easily access, or in some code repository (which one?) that would allow patches for customization of a whitelist/blacklist?
Asking as e.g. organizations which use MetricsGrimoire might have different interpretations whether a contributor using an email address with a certain hostname (e.g. @debian.org) is considered representing that entity or considered an individual.

Qgil raised the priority of this task from Normal to High. Aug 25 2015, 12:00 PM

This task is blocking T88531: Goal: Organize a Gerrit Cleanup Day on September 23, 2015 and T107562: Tech community KPIs for the WMF metrics meeting and, as said above, even touches a WMF Annual Plan goal. @Dicortazar, it cannot be so difficult to get rid of "GitHub", "Deutsche Telekom", etc, and get back to the situation we had before grouping all those under "Independent".

Increasing priority.

We've removed a bunch of organizations (see below), but I'm going to explain the situation, just to try to be on the same page.

The situation is as follows. We usually try to find affiliations through a combination of automated and manual methods. We try to be as accurate as possible, especially for main contributors, with the following baseline idea: "if a person Alice works for the organization Org, we affiliate Alice to Org, and in all listings by organization, activity by Alice is attributed to Org". Then we have people "affiliated" to the fake organization "Individual": those we have established are not affiliated with any organization. And then there is the rest, those whose affiliation we don't know. Those are "Unknown", and depending on the stats/charts, sometimes they are considered, sometimes they are not.

In your case, it seems that you only want to have some organizations "as such" ("Wikimedia Foundation", "Wikimedia Deutschland", "WikiWorks", "Wikia", maybe some others), and the rest to be considered as "Individual" (or "Independent"). Is this right?

If this were the case, for the next iteration we could (at least for developers with activity in code review):

  • Define the organizations of your interest.
  • Find out as hard as reasonably possible the people affiliated to those organizations (probably with your help, for verification, and maybe for providing listings of people affiliated to eg WMF).
  • Consider the rest as "affiliated" to "Individual" (or "Independent" if you prefer).

My impression is that this could be close enough to your needs, but please let me know. The main constraint I see is that a person is affiliated to an organization for all the datasources. That would mean that if we say Alice is "Independent" for code review, she will be "Independent" as well in e.g. mailing lists, if we find her identity there. But I guess this is probably not an issue in your case, right?

Note: Organizations removed from the database: "University of Melbourne", "GitHub", "Debian GNU/Linux", "Free Software Foundation", "XS4ALL Internet bv", "GMX", and some others, including some email providers. These changes will be reflected in the next run of the scripts.

In your case, it seems that you only want to have some organizations "as such" ("Wikimedia Foundation", "Wikimedia Deutschland", "WikiWorks", "Wikia", maybe some others), and the rest to be considered as "Individual" (or "Independent"). Is this right?

I'm not entirely sure what "as such" means, but I'd say that's right.

  • Define the organizations of your interest.
  • Find out as hard as reasonably possible the people affiliated to those organizations (probably with your help, for verification, and maybe for providing listings of people affiliated to eg WMF).
  • Consider the rest as "affiliated" to "Individual" (or "Independent" if you prefer).

The main constraint I see is that a person is affiliated to an organization for all the datasources.

To me, assuming the same affiliation across all datasources feels like a valid assumption to start with.
But let's discuss whether to try differentiating whether one person might have some activity for a company and some other activity as part of their personal/individual capacity (e.g. by using different IDs like different email addresses) in T95238: Handling multiple affiliations (at once; like work vs spare time) in tech community metrics.

Note: Organizations removed from the database: "University of Melbourne", "GitHub", "Debian GNU/Linux", "Free Software Foundation", "XS4ALL Internet bv", "GMX", and some others, including some email providers. These changes will be reflected in the next run of the scripts.

Thanks, that's a great start and will make the data more reliable for our needs!
So such orgs (and whether they should be "recognized" as orgs instead of "individual"/"independent" or "unknown") are currently stored in the DB itself, and there is no way that someone like me could theoretically change such a list (and settings related to one org) via a patch in some code repository?

In your case, it seems that you only want to have some organizations "as such" ("Wikimedia Foundation", "Wikimedia Deutschland", "WikiWorks", "Wikia", maybe some others), and the rest to be considered as "Individual" (or "Independent"). Is this right?

I'm not entirely sure what "as such" means, but I'd say that's right.

Probably "as such" was not a good expression. I meant considering affiliations only to a certain list of organizations, and consider the rest as "individuals". In other words, it seems to me that, with respect to affiliation, you prefer to consider only affiliations to a certain list of organizations, instead of considering somebody affiliated to an organization eg. if she is using its email address, or is working in it. Am I right?

  • Define the organizations of your interest.
  • Find out as hard as reasonably possible the people affiliated to those organizations (probably with your help, for verification, and maybe for providing listings of people affiliated to eg WMF).
  • Consider the rest as "affiliated" to "Individual" (or "Independent" if you prefer).

The main constraint I see is that a person is affiliated to an organization for all the datasources.

To me, assuming the same affiliation across all datasources feels like a valid assumption to start with.

ok, let's start that way then.

But let's discuss whether to try differentiating whether one person might have some activity for a company and some other activity as part of their personal/individual capacity (e.g. by using different IDs like different email addresses) in T95238: Handling multiple affiliations (at once; like work vs spare time) in tech community metrics.

Right now, we can affiliate a person to different organizations in different periods. The underlying assumption is that a person is, at any given time, affiliated either to some organization or to "Individual". Right now we have no way of considering a given person affiliated to two organizations, or affiliated to one and individual, during the same period. However, if needed, we could explore having different unique identities for persons with "two lives" (one individual and another one corporate, for example). That way we could have for instance "Alice (Independent)", affiliated as "Independent", when contributing as alice@mydomain.org and "Alice (Org)", affiliated as "Org", when contributing as alice@org.com. But in this case, Alice would be considered as two different persons, with two different "contribution sets", and for example would count as two when counting the number of contributing developers. In addition, keep in mind that we do not always have an email address (e.g. IRC), and in those cases it would be impossible to make that distinction.
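
To illustrate the "one affiliation per period" model described above, here is a minimal sketch. It is not the actual MetricsGrimoire data model; the `ENROLLMENTS` structure, the identity key "alice", the dates, and the function name are all hypothetical:

```python
from datetime import date

# Hypothetical enrollment records per unique identity:
# (organization, start, end), with the end date exclusive.
ENROLLMENTS = {
    "alice": [
        ("Independent", date(2012, 1, 1), date(2014, 1, 1)),
        ("Org", date(2014, 1, 1), date.max),
    ],
}


def affiliation_on(person, day, enrollments=ENROLLMENTS):
    """Return the single affiliation of `person` on `day`.

    Periods are assumed not to overlap, so at most one record matches.
    Anyone without a matching record is reported as "Unknown".
    """
    for org, start, end in enrollments.get(person, []):
        if start <= day < end:
            return org
    return "Unknown"
```

Under this model, the "two lives" workaround would amount to two separate keys (for example one per email address), which is exactly why Alice would then be counted as two developers.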

[...]

So such orgs (and whether they should be "recognized" as orgs instead of "individual"/"independent" or "unknown") are currently stored in the DB itself, and there is no way that someone like me could theoretically change such a list (and settings related to one org) via a patch in some code repository?

We can produce a JSON file, which you can review and edit, and then we can reload it. We could explore defining a procedure for doing what you intend. Let me talk about that internally.

Qgil added a comment. Edited Aug 28 2015, 8:08 AM

We are happy acknowledging any real affiliation, no matter how small the organization, how small the contribution, and how distant from the Wikimedia movement such organization is. The problem comes from the assumptions powering the automatic assignment of a person to their affiliation based on their email address.

Someone with a "@wikimedia.org" email address will definitely be affiliated to the Wikimedia Foundation, but the same cannot be said about someone with a Debian or FSF etc email address. Of course, the situation will be the opposite if it's the Debian project or the Free Software Foundation who are using Metrics Grimoire, so you need to find a flexible and efficient solution to this problem.

Ideally, there would be a file somewhere where we could define those assumptions: "if email address domain equals X, then affiliation is Y". Then Grimoire would not make assumptions about the rest. Users willing to declare their affiliation can do so via https://www.mediawiki.org/wiki/Community_metrics#User_data, and if we find new patterns with email address domains then we could include them in that file.
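
A sketch of what such a rules file could boil down to, assuming a plain domain-to-affiliation mapping. The two rules are the ones named as safe in this task (@wikimedia.org and @wikia.com); the names `DOMAIN_RULES` and `affiliation_from_email` are hypothetical and not part of Grimoire:

```python
# Hypothetical rules: "if email address domain equals X, then affiliation is Y".
# Domains that are not listed imply nothing about the affiliation.
DOMAIN_RULES = {
    "wikimedia.org": "Wikimedia Foundation",
    "wikia.com": "Wikia",
}


def affiliation_from_email(email, rules=DOMAIN_RULES, default="Unknown"):
    """Return the affiliation implied by the email domain, or `default`."""
    # rpartition keeps everything after the last "@"; domains are
    # compared case-insensitively.
    domain = email.rpartition("@")[2].lower()
    return rules.get(domain, default)
```

With these rules, an address at debian.org or gmx.de falls through to "Unknown" instead of creating a "Debian" or "GMX" organization. Another organization using the tool could ship a different mapping.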

Qgil added a comment. Aug 28 2015, 8:13 AM

So such orgs (and whether they should be "recognized" as orgs instead of "individual"/"independent" or "unknown") are currently stored in the DB itself, and there is no way that someone like me could theoretically change such a list (and settings related to one org) via a patch in some code repository?

I have been wondering about this as well. I would like to document which affiliations are automatically deduced from email addresses (and how to modify these rules) at https://www.mediawiki.org/wiki/Community_metrics#Contributors

jgbarah moved this task from Backlog to Doing on the ECT-August-2015 board. Aug 28 2015, 8:18 AM

Summarizing from our monthly meeting:
Basically Wikimedia prefers to assume by default that somebody with a Uni of Cal email address is not affiliated, but to allow people to explicitly say that they do work for Uni of Cal.

  • "If there is no whitelist then guess as much as you can, if there is use that whitelist."
  • Start with WMF, WMDE, Wikia, WikiWorks, Independent, Unknown.
  • Remove all affiliations not among those first four.
  • Remove all Independent, as Independent means "user explicitly expressed Independent".
  • Mark everything else as Unknown
  • @jgbarah to talk to Santi; potentially store whitelist in code repo
  • Wikimedia to document the process on https://www.mediawiki.org/wiki/Community_metrics once "enough info given"
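
The rules agreed in the meeting can be sketched as follows. This is an illustration only, not the code that was deployed; the function name, its parameters, and the `WHITELIST` constant are hypothetical:

```python
# The four organizations named in the meeting notes.
WHITELIST = {
    "Wikimedia Foundation",
    "Wikimedia Deutschland",
    "Wikia",
    "WikiWorks",
}


def normalize_affiliation(guessed, declared=None, whitelist=WHITELIST):
    """Apply the whitelist rules to one contributor.

    guessed:   affiliation guessed automatically (may be None)
    declared:  affiliation the user stated explicitly (may be None)
    whitelist: set of accepted guessed organizations, or None for
               "no whitelist, guess as much as you can"
    """
    if declared:
        # An explicit statement always wins, including "Independent".
        return declared
    if whitelist is None:
        return guessed or "Unknown"
    # A guessed "Independent" is dropped too, since it was never declared.
    return guessed if guessed in whitelist else "Unknown"
```

Note that "Independent" only survives when the user declared it, which matches the rule that Independent means "user explicitly expressed Independent".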

Aklapper moved this task from Backlog to Doing on the DevRel-September-2015 board. Aug 31 2015, 7:23 PM

We're now working on an automated way to deal with identities and affiliations.

We already had related activity such as T88277: Instructions to update user data in korma, but T111767: Automate the identities generation should improve such activity having a daily snapshot of the identities JSON file and automatically updating the database according to your changes there.

OK, this is already rsynced to the server.

People outside the four organizations plus Independent are not listed in any organization, so they are Unknown.

Updating the affiliations is now in your court, following the process defined at T111767: Automate the identities generation.

Dicortazar closed this task as Resolved. Sep 14 2015, 12:36 PM
Dicortazar added a subscriber: sduenas.

Closing this task as resolved.

Kudos to @sduenas.

Qgil added a comment. Sep 14 2015, 1:28 PM

Good!

Before we had Wikia as well. Any @wikia.com contributor can be safely assigned to Wikia.

Qgil added a comment. Sep 15 2015, 7:17 AM

Before we had Wikia as well. Any @wikia.com contributor can be safely assigned to Wikia.

To be handled in T112621: Add Wikia as recognized organization in Korma.