Page MenuHomePhabricator

Provide open changeset snapshot data on Sep 22 and Sep 24 (for Gerrit Cleanup Day)
Closed, ResolvedPublic

Description

As Wed Sep 23 is Wikimedia's Gerrit Cleanup Day (T88531), ask Bitergia to provide the following numbers:

Raw numbers are totally sufficient. Nothing graphical.

Summary of values on the 22nd of September:

  • Total number of changesets waiting for a reviewer action and Unknown affiliation: 751.
  • Total number of changesets waiting for a reviewer action and Independent affiliation: 159.
  • Total number of changesets waiting for a reviewer action, Unknown affiliation and open in the last three months: 335
  • Total number of changesets waiting for a reviewer action, Independent affiliation and open in the last three monthts: 77

Summary of values on the 24th of September:

  • Total number of changesets waiting for a reviewer action and Unknown affiliation: 628 (was 751 ~ -17%)
  • Total number of changesets waiting for a reviewer action and Independent affiliation: 124 (was 159 ~ -23,1%)
  • Total number of changesets waiting for a reviewer action, Unknown affiliation and open in the last three months: 261 (was 335 ~ -22%)
  • Total number of changesets waiting for a reviewer action, Independent affiliation and open in the last three monthts: 53 (was 71 ~ -25,4%

Related Objects

StatusSubtypeAssignedTask
DuplicateQgil
ResolvedQgil
ResolvedQgil
InvalidNone
InvalidNone
ResolvedAklapper
DeclinedNone
DeclinedNone
OpenNone
ResolvedQgil
ResolvedQgil
ResolvedQgil
ResolvedQgil
ResolvedAklapper
ResolvedNone
ResolvedAklapper
ResolvedAklapper
ResolvedDicortazar

Event Timeline

Aklapper raised the priority of this task from to Medium.
Aklapper updated the task description. (Show Details)
Aklapper added subscribers: Aklapper, Qgil, Dicortazar.
Aklapper set Security to None.
Aklapper added a subscriber: jgbarah.

@Aklapper, is this related to this panel?: http://korma.wmflabs.org/browser/scr-backlog.html

This is a list of code reviews in progress that you wanted to have for cleaning purposes as far as I remember. If this is ok, or if at least that panel is ok for you, we may add some more info to that panel.

@Dicortazar: It's basically about those two "total" numbers of open changesets waiting for review and age of open changesets (see these links).

We're just interested in the pure numbers (if those panels should a green dot next to the graph line and the green dot corresponds to the results so far in our current month and that green dot got updated every day, that would exactly be the numbers we're after here).
There is no real need to update any panels or graphs.

Does that clarifies things? :-/

This is one case where, for once, we are happy with a simple report and we don't need changes in the dashboard. :)

Ok, I get the point :).

I'll simply calculate the open changesets on the 22th and on the 24th. Indeed this should be a copy of the numbers in the dashboards during those dates and specifically after the update in the system.

Now that T100189 is resolved (yay!) and we can clearly differentiate between individual/unknown and non-volunteers, let me please clarify and improve the scope of this task as our Gerrit Cleanup Day concentrates on volunteer contributions (and hence our evaluation should too).

I'd like to go for four numbers:

Raw numbers are totally sufficient. Nothing graphical.

So these four numbers on Sep 22. And these four numbers on Sep 24, to compare.

@Dicortazar: Is this doable?

Aklapper raised the priority of this task from Medium to High.Sep 15 2015, 6:20 PM
Aklapper updated the task description. (Show Details)

Sure @Aklapper, I'll start working on the queries. However, I'd like to be sure we're still using the same filters for the data.

Some requirements I see:

  • Ignore for any metrics those changesets declared as WIP.
  • Regarding to the third and fourth metric, when a changeset is 'open', that means that it has not been abandoned or merged and we do not care the current status (except WIP) of the changeset, but that this is authored by volunteers.

I'll try to produce a first version of the numbers and validate those with you.

It would be good to be consistent with the filters applied: WIP, CR -1, V -1.

@Qgil, if we want to measure 'Median age of open changesets authored by volunteers in the last three months', it may make sense to measure all of the changesets that are not merged nor abandoned.

If this is the case, we shouldn't take into account CR -1 and V -1 for the median age of open changesets, should we?

We are requesting four metrics. The two first are "Number of open changesets waiting for review authored by volunteers", where "waiting for review" is important. For consistence, the other two should be also waiting for review, which implies filter for WIP, CR -1, V -1.

The problem we want to measure is how well are we treating to volunteers waiting for us. If we add the waiting times of volunteers slow to respond, then the metrics is distorted.

Ok, I'll update the task description to add the 'waiting for review' requirement. It is now clear to me how to proceed. Thanks!

Hi again,

Some comments.

I've thought about the metrics, and I'd say that we need a more proper definition.

The backlog of open changesets at some point is the number of open changesets in a given date. So for instance, for the metric 'Number of open changesets waiting for review authored by volunteers (=Unknown; Independent) in the last three months', it makes sense to calculate those up to some point in time. But this does not make sense to calculate this for the last three months.

I mean, the number of open changesets yesterday is probably different to the number of open changesets tomorrow. We shouldn't use a timeframe, but a point in time.

We may change this to:

'Number of open changesets waiting for review authored by volunteers (=Unknown; Independent) on the 22nd of September at 09:00 am and on the 24th of September at 09:00 am'
and
'Median age of open changesets waiting for review authored by volunteers (=Unknown; Independent) on the 22nd of September at 09:00 am and on the 24th of September at 09:00 am'.

Summarizing: the use of 'in the last three months' does not make sense unless you're interested in knowing the backlog of open changesets one month ago, two months ago and three months ago.

We mean open changesets created in the last three months. We want to look at how well we are reacting to new changesets versus the old ones.

We mean open changesets created in the last three months. We want to look at how well we are reacting to new changesets versus the old ones.

Just to clarify: in this case, the whole definition is:

'Median age of open changesets authored by volunteers, that were open during the last three months'.

Is it a single number (that is, "as of now"), or a number to be calculated every some period? (eg, monthly).

Please, have into account that for comparison purposes, if it is periodic, it should be like:

"Given a snapshot time (ST), median age, calculated at ST, of open changesets authored by volunteers, that were open between ST-3 months and ST". ST would be, eg, 1st day of every month.

So, it is a bit more complex that it seemed...

In addition, have in mind that this metric is sensitive to the distribution of the time-of-opening. I mean that if for a given three-months period, many reviews were open during the last days of the period, the median will be quite different than if they were open during the first days of the period. So, if the distribution of opening-times within each period is similar, this will work well (in the sense timing will be comparable), but if they are not (eg, because you have a code-review-party where you close many code reviews close to the end of a period), numbers may be not easy to compare.

Good point. Then maybe we can forget about "median time" for the last three months (since it is a too volatile value) but still keep "number of open changesets".

Some numbers to have some context.

  • Total number of changesets still open: 3029.
  • Total number of changesets still open by Independent developers: 389
  • Total number of changesets still open by Unknown developers: 1486

From the set of Unknown affiliated developers:

  • Total number of changesets still open and ignoring those in CR -2 and WIP status: 1260
  • And finally, total number of changesets waiting for reviewer action: 869

From the set of Independent affiliated developers:

  • Total number of changesets waiting for reviewer: 180

These numbers still need some review and are tentative for this ticket.

Those are counted till today at 02:00 am CEST.

Hi again, some improved numbers after some minor filtering process:

  • Total number of changesets waiting for a reviewer action and Unknown affiliation: 751.
  • Total number of changesets waiting for a reviewer action and Independent affiliation: 159.
  • Total number of changesets waiting for a reviewer action, Unknown affiliation and open in the last three months: 335
  • Total number of changesets waiting for a reviewer action, Independent affiliation and open in the last three monthts: 77

Some comments related to affiliations:

  • Important people in the Independent group are legoktm or nemo_bis (double checking this with you).
  • Alex Monk is marked in Wikimedia Foundatio and Independent. For these numbers, he's counted as Independent.

Some comments related to the data:

  • When a changeset has received a CR +1, this is still measured as waiting for a reviewer action. The rationale is that the submitter does not really need to do anything till a negative evaluation is received.

Comments are welcome.

Nemo is clearly independent. Legoktm and Alex Monk (Krenair) are WMF employees, but they are cases of developers that also have significant contributions as volunteers. In case of doubt, the affiliation is expressed by the email address used for commits, and therefore the current affiliation as independent is correct.

I wish we could clean the list of "unknown", because I'm sure we know many of them, but now it's too late. No problem. It might be that tomorrow many people will realize they are "unknown" and they will provide their affiliation through some way. We should not update affiliations right now, only after the second snapshot after the Cleanup Day, to assure that we are measuring equivalent data sets.

Thanks for providing these numbers (and seeing the problems of medians that I didn't think about)!
Most important is that the numbers gathered today (Tue 22) and Thu 24 are comparable.

are WMF employees, but they are cases of developers that also have significant contributions as volunteers

Basically unsolved T95238: Handling multiple affiliations (at once; like work vs spare time) in tech community metrics...(volunteers becoming staff)?

I wish we could clean the list of "unknown", because I'm sure we know many of them, but now it's too late.

Meh, I should have realized after fixing T100189: "Median time to review for Gerrit Changesets, per month": External vs. WMF/WMDE/etc patch authors last week and before running T88531: Goal: Organize a Gerrit Cleanup Day on September 23, 2015. :(
Going through the top names in "Independent" and "Unknown" is one obvious followup action after this week.

Numbers for today. Data were updated up to 2:00 am CEST.

  • Total number of changesets waiting for a reviewer action and Unknown affiliation: 628 (was 751 ~ -17%)
  • Total number of changesets waiting for a reviewer action and Independent affiliation: 124 (was 159 ~ -23,1%)
  • Total number of changesets waiting for a reviewer action, Unknown affiliation and open in the last three months: 261 (was 335 ~ -22%)
  • Total number of changesets waiting for a reviewer action, Independent affiliation and open in the last three monthts: 53 (was 71 ~ -25,4%

Some method notes, just in case you liked to reproduce the analysis:

This analysis uses the Gerrit database provided in Korma. And the identities database. Given its private nature, this needs some request access from third parties.

Queries:

When retrieving 'Total number of changesets waiting for a reviewer action and Idependent affiliation in the last three months':

SELECT count(distinct(i.id))
FROM issues i,
             people_uidentities puid,
             last_patch lp
WHERE i.submitted_by = puid.people_id AND
                 puid.uuid IN 
                     (SELECT DISTINCT(uuid)
                      FROM <identities_db>.enrollments
                      WHERE organization_id=91) AND
                  i.status = 'NEW' AND 
                  i.id = lp.issue_id AND
                  lp.issue_id NOT IN  
                      (SELECT lp.issue_id
                       FROM last_patch lp, 
                                    changes ch
                       WHERE lp.issue_id = ch.issue_id AND
                                        lp.last_patch = ch.old_value AND 
                                        ch.new_value=-1) AND 
                   i.summary NOT LIKE '%WIP%' AND 
                   i.submitted_on>='2015-06-22';

In order to remove the 'last three months' filter, the very last filter needs to be removed from the query: "i.submitted_on>='2015-06-22'"

When retrieving 'Total number of changesets waiting for a reviewer action, Independent affiliation and open in the last three monthts'

SELECT count(distinct(i.id)) 
FROM issues i, 
              people_uidentities puid, 
              last_patch lp 
WHERE i.submitted_by = puid.people_id AND 
                 puid.uuid NOT IN 
                     (SELECT DISTINCT(uuid) 
                      FROM <identities_db>.enrollments) AND 
                  i.status = 'NEW' AND 
                  i.id = lp.issue_id AND 
                  lp.issue_id NOT IN  
                       (SELECT lp.issue_id 
                        FROM last_patch lp, 
                                      changes ch 
                        WHERE lp.issue_id = ch.issue_id AND 
                                         lp.last_patch = ch.old_value AND 
                                         ch.new_value=-1) AND 
                  i.summary NOT LIKE '%WIP%' AND 
                  i.submitted_on>='2015-06-22';

In order to remove the 'last three months' filter, the very last filter needs to be removed from the query: "i.submitted_on>='2015-06-22'"

Numbers for today. Data were updated up to 2:00 am CEST.

Just realizing and pointing out: As 02:00 CEST is 17:00 PST in San Francisco there may have been some later improvements on that day, but the numbers are definitely sufficient to give us a good impression.

@Aklapper, I can update numbers tomorrow if needed. Please, let me know!

@Aklapper, I can update numbers tomorrow if needed. Please, let me know!

I'm curious enough so if that takes you less than 10min it would be welcome!

General question: I assume the data is always collected at 00:00UTC?