
Have statistics which allow measuring success of a potential "Onboarding new developers" program
Closed, ResolvedPublic

Description

Statistics to measure the success of a potential "Onboarding New Developers" program:

  • Define them.
  • Implement them.

Note: Demographics already lists some items.

Related Objects

Event Timeline

We likely want to get T151161: Have "Last Attracted Developers" information for Gerrit automatically updated / Integrate new demography panels in GrimoireLab product fixed for this task.
The Git-Demographics dashboard currently has an "Attracted developers" widget showing new contributors per month who wrote changes merged into Wikimedia Git (not: Gerrit). Via one-click filtering on the "Top Repositories" widget or manual filtering in the Advanced filter field, one can see the stats for specific Wikimedia Git repositories.

I am pretty clueless about how we could measure activity outside of Wikimedia Git/Gerrit. See for example T132499#3110951.

Aklapper raised the priority of this task from Low to Medium.Mar 27 2017, 11:01 AM

Just a note to myself:
Quim mentioned "Do we know which Phabricator accounts are also active in Differential (if we ever kill Gerrit)?"
Answer: We have neither Differential stats in wikimedia.biterg.io, nor does our monthly email to wikitech-l@ cover this: "Active users" in that email only checks entries in the maniphest_transaction table. Phabricator has a differential_transaction table in the phabricator_differential database (just checked) which also has an authorPHID field (SHOW COLUMNS is your friend).
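A minimal sketch (an assumption, not an existing report) of pulling such a count straight from the database, given shell access to the Phabricator MySQL host; the table and column names come from the note above, and dateCreated as a Unix-timestamp column is standard Phabricator schema:

# count distinct Differential authors active in the last 30 days
mysql phabricator_differential -e "
  SELECT COUNT(DISTINCT authorPHID) AS active_differential_users
  FROM differential_transaction
  WHERE dateCreated > UNIX_TIMESTAMP(NOW() - INTERVAL 30 DAY);"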

Aklapper renamed this task from Have statistics on wikimedia.biterg.io which allow measuring success of a potential "Onboarding new developers" program to Have statistics which allow measuring success of a potential "Onboarding new developers" program.Apr 1 2017, 12:35 PM
Aklapper raised the priority of this task from Medium to High.
Aklapper moved this task from Ready to Go to April on the Developer-Advocacy (Apr-Jun 2017) board.

T132499: Review KPIs, workflows, and scorecards of the Technical Collaboration team proposes:

Developer Relations

  • Number of volunteers contributing code to Wikimedia repositories, total & selected projects. (*)
  • Number of new volunteers contributing code to Wikimedia repositories, total & selected projects. (*)
  • Percentage of retention of new developers 12 months after their first contribution, total & selected projects. (*)
  • Number of software projects selected for new developers, based on their ability to provide mentors, good entry-level documentation, first tasks, and a roadmap. (*)
  • Number of Wikimedia affiliates and partner organizations (through Wikipedia Education Program, GLAM...) involved in developer outreach activities.

(*) Recommended projects may include not only those going through Gerrit/Differential, but also Labs tools, bots, gadgets, Lua modules, and GitHub-hosted projects.

Is this a good list? Anything missing? Anything that should be removed or changed? What can we have in place by July 1st?

It is clear that we need a task about creating a selection of software projects for newcomers. I think it is more important to start small with accurate statistics about these selected projects, and then add total Wikimedia stats once we are confident about their accuracy.

OK, here are three metrics from the above list that we could consider our team KPIs:

  • Volunteers uploading code to Wikimedia repositories.
  • New volunteers contributing code to Wikimedia repositories in the last 90 days.
  • New volunteers active 1 year after their first contribution.

We can start counting all repositories in Gerrit (last quarter, QoQ, YoY).

@Aklapper, do you agree?

@Aklapper, do you agree?

Yes.

  • Volunteers contributing code to Wikimedia repositories.

We can already check that on https://wikimedia.biterg.io, as affiliation data is in acceptable shape. What timeframe should we compare?

  • New volunteers contributing code to Wikimedia repositories in the last 90 days.

(Meaning: who made their first contribution in the last 90 days.) This should be about Gerrit (uploading code proposals) and not (Git) merged code. This will be possible once T151161 is resolved.

  • New volunteers active 1 year after their first contribution.

That is not measurable currently. If we really go for specific people we'd have to find a way to "freeze" that specific subset of new (Gerrit) contributors and recheck/compare a year later.

We can start counting all repositories in Gerrit (last quarter, QoQ, YoY).

OK, since we agree on the principles, let's nail down the implementation:

  • KPI 1: Volunteers uploading code to Wikimedia repositories

How many unique independent users have uploaded code contributions to Gerrit. This can be new changesets or new versions of existing changesets. We need last 90 days, QoQ, YoY. Do you need anything else?

  • KPI 2: New volunteers uploading code to Wikimedia repositories in the last 90 days

How many unique independent users have uploaded their first code contribution to Gerrit in the last 90 days. We need last 90 days (obviously), QoQ, YoY.

  • KPI 3: New volunteers active 1 year after their first code upload

I think we can measure this easily (even if manually, every quarter) with a "Last attracted developers in the last 15 months" table (similar to the ones in "Demographics") filtered to show only independent users, with these columns:

  • Author
  • First upload
  • Last upload

Export to CSV. In a spreadsheet, sort the First Upload column oldest first, keep those who uploaded patches 15 to 12 months ago, and discard the rest. Then subtract First upload from Last upload, and check which differences are greater than one year.

I have created an example spreadsheet to illustrate the point.

With this last one, we also would count last 90 days, QoQ, YoY.
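The spreadsheet arithmetic above could also be scripted. A minimal sketch (an illustration, not part of the proposal; the file name and column layout are hypothetical, assuming a header row and ISO dates):

now=$(date +%s)
tail -n +2 last_attracted_developers.csv | while IFS=, read -r author first last; do
  f=$(date -d "$first" +%s)        # First upload
  l=$(date -d "$last" +%s)         # Last upload
  age=$(( (now - f) / 86400 ))     # days since first upload
  span=$(( (l - f) / 86400 ))      # days between first and last upload
  # keep those first active 15-12 months ago who stayed active for >= 1 year
  if [ "$age" -ge 365 ] && [ "$age" -le 456 ] && [ "$span" -ge 365 ]; then
    echo "$author"
  fi
done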

  • KPI 1: Volunteers uploading code to Wikimedia repositories

How many unique independent users have uploaded code contributions to Gerrit. This can be new changesets or new versions of existing changesets. We need last 90 days, QoQ, YoY. Do you need anything else?

  • KPI 2: New volunteers uploading code to Wikimedia repositories in the last 90 days

How many unique independent users have uploaded their first code contribution to Gerrit in the last 90 days. We need last 90 days (obviously), QoQ, YoY.

  • Rephrasing to "New volunteers uploading code change proposals to Wikimedia repositories...".
  • Blocked on T151161#3176235 and its "New Authors" bar chart widget. Once T151161 is fixed, it's more or less the same as above.
  • Again, admins may need to increase the buckets' size from the current 30 (as "Last Attracted Developers" on https://wikimedia.biterg.io/app/kibana#/dashboard/Git-Demographics (note: Git, not Gerrit) does not cover 3 months by default, I'd expect the same 'problem' for Gerrit).
  • KPI 3: New volunteers active 1 year after their first code upload

I think we can measure this easily (even if manually, every quarter) with a "Last attracted developers in the last 15 months" table (similar to the ones in "Demographics") filtered to show only independent users, with these columns:

  • Author
  • First upload
  • Last upload

Export to CSV. In a spreadsheet, sort the First Upload column oldest first, keep those who uploaded patches 15 to 12 months ago, and discard the rest. Then subtract First upload from Last upload, and check which differences are greater than one year.

I have created an example spreadsheet to illustrate the point.

With this last one, we also would count last 90 days, QoQ, YoY.

Thanks for these thoughts!

  • Rephrasing to "New volunteers active 1 year after their first upload of a code change proposal".

If we go for tracking specific people (=names),

  • this is again blocked on T151161#3176235 and its "New Authors" list-of-names widget.
  • Once T151161 is fixed, we can export author names and the date of their first commit to CSV from that widget.
  • TODO: However, we'd still have to add a "Last upload" column somewhere in order to get that data. That is currently not possible/available, and I do not see a workaround, as I do not want to manually check for recent activity of each person who was a new contributor one year ago. (See the sketch below for how such a column could be derived if raw data were available.)
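For what it's worth, if a raw two-column author,date export of uploads ever becomes available, a "Last upload" column could be derived on the command line. A hypothetical sketch, assuming such a CSV exists:

sort -t, -k1,1 -k2,2 activity.csv |
  awk -F, '{ last[$1] = $2 } END { for (a in last) print a "," last[a] }'   # newest date per author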

An attempt to make these titles more digestible:

  1. Volunteers contributing patches for review
  2. New volunteers contributing patches for review
  3. New volunteers active one year after their first contribution
  • KPI 1: Volunteers uploading code to Wikimedia repositories

How many unique independent users have uploaded code contributions to Gerrit. This can be new changesets or new versions of existing changesets. We need last 90 days, QoQ, YoY. Do you need anything else?

KPI 1: @Qgil: The "Changeset Submitters" widget on https://wikimedia.biterg.io:443/goto/3da7d73ffd88c10d9aed1b37b513ba08 shows the number of Gerrit uploaders with independent or unknown affiliation per month but not per quarter.
The "Interval" option in the non-public admin section offers no "Quarterly" option, and quarters do not always have the same amount of days either. An admin could create a similar widget with the (time) "Interval" field set to a "Custom" value like "91d" && setting an "absolute" timeframe (e.g. from March1 to May31) in the upper right corner, but that would require manual intervention every time (as 91d is not static). Or asking Bitergia to introduce a "Quarterly" Interval option. Any opinion?

KPI 2 and KPI 3 are still blocked on T151161.

An admin could create a similar widget with the (time) "Interval" field set to a "Custom" value like "91d" and set an "absolute" timeframe (e.g. from March 1 to May 31) in the upper right corner, but that would require manual intervention every time (as 91d is not static).

Correction: I am wrong. When setting 90d and the timeframe 2017-01-01 00:00:00.000 to 2017-03-31 23:59:59.999 I do not get one single bucket value which covers exactly that timeframe, but two which both partially cover the timeframe. Meh. So not an option to get one single value for an entire quarter.

(Screenshot attached: Screenshot from 2017-06-14 20-25-49.png, 111 KB)

I guess I'll have to ask upstream for a Quarterly option here if we go for quarterly instead of monthly stats.

Sorry, the link format breaks markdown, but the link works nevertheless.

[[ https://wikimedia.biterg.io/app/kibana?#/dashboard/Gerrit?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:'2017-01-01T17:03:34.084Z',mode:absolute,to:'2017-03-31T16:03:34.084Z'))&_a=(filters:!(),options:(darkTheme:!f),panels:!((col:1,id:Gerrit-metrics__gerrit_enrich,panelIndex:2,row:1,size_x:1,size_y:4,title:Gerrit,type:visualization),(col:1,id:Patchsets-per-review__gerrit_enrich,panelIndex:4,row:9,size_x:3,size_y:3,title:'Patchsets%20Statistics%20Per%20Review',type:visualization),(col:4,id:Time-per-review__gerrit_enrich,panelIndex:5,row:9,size_x:3,size_y:3,title:'Changesets%20Statistics%20(Open%20Time)',type:visualization),(col:2,id:Reviews-by-opening-time__gerrit_enrich,panelIndex:7,row:1,size_x:5,size_y:2,title:'Changesets%20Per%20Status',type:visualization),(col:7,id:Patchsets-per-review-per-month__gerrit_enrich,panelIndex:9,row:9,size_x:6,size_y:3,title:'Patchsets%20Per%20Review',type:visualization),(col:2,id:Change-submitters-per-month__gerrit_enrich,panelIndex:11,row:3,size_x:5,size_y:2,title:'Changeset%20Submitters',type:visualization),(col:7,id:Organizations-pie__gerrit_eclipse_enrich,panelIndex:15,row:1,size_x:3,size_y:4,title:Organizations,type:visualization),(col:1,id:gerrit_top_developers,panelIndex:17,row:5,size_x:6,size_y:4,title:Submitters,type:visualization),(col:7,id:gerrit_evolution_organizations,panelIndex:18,row:5,size_x:6,size_y:4,title:Organizations,type:visualization),(col:10,id:gerrit_repositories_table,panelIndex:19,row:1,size_x:3,size_y:4,title:Repositories,type:visualization)),query:(query_string:(analyze_wildcard:!t,query:'author_bot:%22false%22%20AND%20(author_org_name:%22Independent%22%20OR%20author_org_name:%22Unknown%22)')),title:Gerrit,uiState:(P-11:(title:'Changeset%20Submitters',vis:(legendOpen:!f)),P-15:(title:Organizations),P-17:(title:Submitters),P-18:(title:Organizations),P-19:(title:Repositories),P-2:(title:Gerrit),P-4:(title:'Patchsets%20Statistics%20Per%20Review'),P-5:(title:'Changesets%20Statistics%20(Open%20Time)'),P-7:(title:'Changesets%20Per%20Status',vis:(legendOpen:!f)),P-9:(title:'Patchsets%20Per%20Review'))) | This query from Jan-Mar 2017 ]]shows "147 submitters". Isn't this correct?

It is introduced manually, true, but it is done once every quarter, four times a year. If the number is correct, the manual query looks doable. :)

Ah, these "147 submitters" is everybody, and the KPI is about "independent users". So I am wrong, and I will wait for news. :)

Ah, these "147 submitters" is everybody, and the KPI is about "independent users". So I am wrong, and I will wait for news. :)

Heh. I did not see the forest for the trees. @Qgil: You are right. Thank you.

To verify, as an admin I changed the size of the "Submitters" panel from 100 to 500 entries (admin link). I went to the Q1/2017 view for Unknown/Independent authors and exported the "Submitters" panel to CSV. It does have 147 entries. Which is the number you posted. (The number of all contributors is 280 for that quarter.) Works.
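(A quick way to re-check such entry counts without opening the file, assuming the export keeps its header row and the default gerrit_top_developers.csv name:)

tail -n +2 gerrit_top_developers.csv | wc -l   # skips the header row; expect 147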

KPI2: Data added to page on meta, after a hell of a lot of confusion. (Getting its data source on wikimedia.biterg.io into a stable condition is covered in T151161.)

KPI3:

Notes (on methodology etc):

  • Assuming today is 2017-07-01 and we look at users who first got active between 2016-01-01 and 2016-03-31 and whether they were still active between 2017-04-01 and 2017-06-30, there is an implicit assumption that people are active at least once per quarter, which is likely wrong and lowers our retention numbers.
  • Also, someone active in GCI in January 2016 and active in the next GCI in January 2017 will not be listed as retained.
  • I wondered whether account renames might make the numbers lower than reality, but that is not an issue if both lists are pulled at the same time (instead of from ancient local copies).
  • I wondered whether some names might show up because they went from staff to volunteer, but these should not be listed as "New" anyway (if they were, that's a bug).

KPI3:

KPI 3: New volunteers active 1 year after their first code upload
I think we can measure this easily (even if manually, every quarter) with a "Last attracted developers in the last 15 months" table (similar to the ones in "Demographics") filtered to show only independent users, with these columns: [...]
Export to CSV. In a spreadsheet, sort the First Upload column oldest first, keep those who uploaded patches 15 to 12 months ago, and discard the rest. Then subtract First upload from Last upload, and check which differences are greater than one year.
I have created an example spreadsheet to illustrate the point.

As finding out when dozens of people were last active sounds like a lot of manual work, I was wondering about a different approach which does not check for exactly "active after 12 months" but for "active between 12 and 15 months after":

  • Filter data of independent contributors in Gerrit from Q1 2016.
  • Double-check that "New Authors" widget on that page lists ALL names (compare with number "New Authors" number in "Summary" widget at the top).
    • If it does not, be an admin and edit that visualization to include more data.
  • Click "Export: Raw" at the bottom of the "New Authors" widget.
  • Open C_Gerrit_Demo_Table.csv (or whatever the file is called) in your favorite spreadsheet application
  • Set comma (,) as separator when importing into your favorite spreadsheet application
  • Remove all columns except for first one (Author)
  • Remove first row (Author)
  • Sort column by alphabet
  • Save (overwrite) file as CSV
  • Filter data of independent contributors in Gerrit of Q2 2017 (not Q1, to have a full year after the previous quarter) and exclude a good bunch of (upstream) repositories.
  • Double-check that "Submitters" widget on that page lists ALL names (compare with number "Submitters" number in "Summary" widget at the top).
    • If it does not, be an admin and edit that visualization to include more data.
  • Click "Export: Raw" at the bottom of the "Submitters" widget.
  • Open gerrit_top_developers.csv (or whatever the file is called) in your favorite spreadsheet application
  • Set comma (,) as separator when importing into your favorite spreadsheet application
  • Remove all columns except for first one (Submitter)
  • Remove first row (Submitter)
  • Sort column by alphabet
  • Save (overwrite) file as CSV
  • comm -12 C_Gerrit_Demo_Table.csv gerrit_top_developers.csv | wc -l (explanation)
$:acko\> comm -12 C_Gerrit_Demo_Table.csv gerrit_top_developers.csv | wc -l
comm: file 2 is not in sorted order
3
$:acko\> more C_Gerrit_Demo_Table.csv | wc -l
61
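The "not in sorted order" warning suggests the spreadsheet sorted with a different collation than comm expects. A variant that re-sorts both exports in the shell under a fixed locale (a tweak, not what was actually run above) avoids the warning:

LC_ALL=C sort C_Gerrit_Demo_Table.csv > new_authors.sorted
LC_ALL=C sort gerrit_top_developers.csv > submitters.sorted
comm -12 new_authors.sorted submitters.sorted | wc -l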

3 ÷ 61 × 100 = 4.918% so far (Q2 2017 not finished yet), comparing who was first active in the 3 months of Q1 2016 and was also active in the 3 months of Q2 2017.

I don't feel in a good position to audit this process leading to 4.918% (which is an interesting number). If we are going to use this process, I recommend checking it against a manual count at least once, to see whether we get the same result.

However, if we are going to count retention differently...

KPI3:

Notes (on methodology etc):

  • Assuming today is 2017-07-01 and we look at users who first got active between 2016-01-01 and 2016-03-31 and whether they were still active between 2017-04-01 and 2017-06-30, there is an implicit assumption that people are active at least once per quarter, which is likely wrong and lowers our retention numbers.

Good point. The more accurately we want to count that, the more complicated the counting gets.

IF newDevelopers = users uploading a first patch to Gerrit in the past 3 months
IF lostDevelopers = newDevelopers who have not registered any activity in Gerrit for more than 12 months
THEN retainedDevelopers = newDevelopers with Gerrit activity registered in the past 3-12 months.
ALSO newDevelopers active in Gerrit after 12 months would become oldDevelopers, being removed from the calculation above.

(((Idea: imagine that code review would count as an activity for retention too, not only patch upload.)))

I don't know how to count this automatically, but I guess there is a way to count that manually.

  • Also, someone active in GCI in January 2016 and active in the next GCI in January 2017 will not be listed as retained.

According to the definition above, a GCI repeater would not count as retained (more than 12 months since first contribution), but they would still count as an oldDeveloper. Fair enough, I would say, since that person is not really a newcomer anymore.
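A command-line sketch of one possible reading of those definitions (the input file author,first_upload,last_upload with ISO dates and no header is hypothetical, and the bucketing is an assumption, not an agreed method):

now=$(date +%s)
while IFS=, read -r author first last; do
  since_first=$(( (now - $(date -d "$first" +%s)) / 2592000 ))   # ~30-day months since first upload
  idle=$(( (now - $(date -d "$last" +%s)) / 2592000 ))           # ~30-day months since last upload
  if   [ "$since_first" -lt 3 ];  then echo "$author,newDeveloper"
  elif [ "$idle" -gt 12 ];        then echo "$author,lostDeveloper"
  elif [ "$since_first" -gt 12 ]; then echo "$author,oldDeveloper"
  else                                 echo "$author,retainedDeveloper"
  fi
done < gerrit_authors.csv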

I don't know how to count this automatically, but I guess there is a way to count that manually.

That's what I'd like to avoid: having to look at every single entity individually and manually. No matter which approach we take, there will always be deviations, depending on cohorts (currently: one quarter from >1 year ago as one single chunk) and chosen "time control points" (currently: "after one year span", "compare to last three months").

So I wondered if the "implicit assumption that people are active every quarter" is a statistically relevant problem or rather negligible. The "New Authors" widget data goes back to 2011 (when Gerrit was introduced, hence many early entries are only new to Gerrit but not new to Wikimedia). Taking a random quarter:

I see three options:

  1. Go for the currently proposed approach in T160430#3383647 with the implicit expectation of at least one activity every 3 months, resulting in "low numbers".
  2. Take an older quarter for "new volunteer developers" data and increase the length of the more recent comparison time window which is currently "the last 3 months". See the bullet points above for an example covering "the last 12 months" instead of "the last 3 months".
  3. Have more than one cohort (e.g. split the single 2016Q1 chunk into three Jan2016, Feb2016, Mar2016 chunks, or weekly, or even daily as in T160430#3258595) to increase the length of the more recent comparison time window. But the more cohorts, the more work. Hence I am reluctant. Example data for 3 month cohorts:

OK, the good thing about deadlines is that they force us to make decisions. Since the raw data will be there anyway, we can pick one system, and if we are not happy we can pick a different one later on.

The KPI was

KPI 3: New volunteers active 1 year after their first code upload

Then I propose that we go for the simplest approach:

From the new volunteers who uploaded their first patch during Quarter A in Year B, how many uploaded a patch during Quarter A in Year B+1?

For instance, in the next quarterly check-in in a couple of weeks:

From the new volunteers during April-June 2016, what percentage has been active in April-June 2017?

All the better if the YoY values corresponding to previous quarters are there as well, in order to see whether there is a possible trend.

What do you think?

Cool with me.

For clarification, that means we will not measure "activity >=12 months after", but activity in a timeframe between 9 and 15 months after the first upload. Which turns "New volunteers active 1 year after their first code upload" into "New volunteers active on average 1 year (±3 months) after their first code upload". Just saying.

Data:

  • 49 new authors in 2015Q1. 8 still active in 2016Q1 (Base, Cenarium, Chiefwei, Ferdbold, Majr, Negative24, Pmlineditor, PranavK). = 16.33%
  • 29 new authors in 2015Q2. 2 still active in 2016Q2 (Galorefitz, Putnik). = 6.90%
  • 39 new authors in 2015Q3. 5 still active in 2016Q3 (Abhinand, Devirk, Haritha28, Ljonka, MarcoAurelio). = 12.82%
  • 53 new authors in 2015Q4. 2 still active in 2016Q4 (Matěj Suchánek, Mhutti1). = 3.77%
  • 61 new authors in 2016Q1. 3 still active in 2017Q1 (Dalba, Harjotsingh, Oetterer). = 4.92%
  • 41 new authors in 2016Q2. TBD; so far 2 still active in 2017Q2 (Harej, Urbanecm). = 4.88%
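(Each percentage is simply the cohort's survivors divided by its size; e.g. for 2015Q1, a quick check:)

awk 'BEGIN { printf "%.2f%%\n", 8 / 49 * 100 }'   # prints 16.33%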

Data added (diff) to https://meta.wikimedia.org/wiki/Technical_Collaboration/Metrics#Onboarding_New_Developers

(Clueless question: Is the QoQ/MoM the absolute difference between two percentage values (e.g. 3.77% retention in 2016Q4, 4.92% retention in 2017Q1, hence "+1.15%") or the relative difference (e.g. 4.92% is "+30.50%" more than 3.77%)? I strongly assumed the latter in my on-wiki edit.)

Cool with me.

For clarification, that means we will not measure "activity >=12 months after", but activity in a timeframe between 9 and 15 months after the first upload. Which turns "New volunteers active 1 year after their first code upload" into "New volunteers active on average 1 year (±3 months) after their first code upload". Just saying.

Yes. :)

Data:

OK, so it looks like we have a relatively stable trend around ~5% a year after. Since the Wikimedia Hackathon 2017 we are talking about a goal of 10% retention. It looks like the challenge is served. :)

(Clueless question: Is the QoQ/MoM the absolute difference between two percentage values (e.g. 3.77% retention in 2016Q4, 4.92% retention in 2017Q1, hence "+1.15%") or the relative difference (e.g. 4.92% is "+30.50%" more than 3.77%)? I strongly assumed the latter in my on-wiki edit.)

After checking the first serious-looking page I found on the Internet, the correct answer in your example indeed seems to be the latter, +30.50%.
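(A quick sanity check of that relative reading, with bc:)

echo "scale=4; (4.92 - 3.77) / 3.77 * 100" | bc   # prints 30.5000, i.e. +30.50%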

Took a while to grasp... but quite interesting stuff and so well documented!

Questions/comments about KPI3:

  • The way we are measuring it right now, would this rule out people who contributed for 12 months after their first submission but left afterwards (and don't make it into the timeframe where we check)? In any case, I'm wondering if we also want to see "New volunteers active for one year after their first contribution" and what their activity pattern is like, meaning: compare the numbers of a quarter with the three subsequent ones. Could be useful in the context of T167085.
  • This might be a very superficial point, but I'm wondering: when we say retained and active, what do we mean? If someone made a one-line commit after a year, would that imply retention?

Questions/comments about KPI3:

  • The way we are measuring it right now, would this rule out people who contributed for 12 months after their first submission but left afterwards (and don't make it into the timeframe where we check)?

We check whether a volunteer who uploaded their first patch during Quarter A in Year B also uploaded a patch during Quarter A in Year B+1. The first patch could have happened on the first day of Quarter A in Year B and the last patch on the last day of Quarter A in Year B+1 (meaning: ~15 months difference), or the other way round (~9 months difference). So people who have been active for exactly 12 months are always included in the results.

In any case, I'm wondering if we also want to see "New volunteers active for one year after their first contribution" and what their activity pattern is like, meaning: compare the numbers of a quarter with the three subsequent ones. Could be useful in the context of T167085.

I interpret that as "New volunteers active at least once in each of the three calendar quarters after the quarter of their first contribution".
That could already be calculated by performing several iterations of the steps outlined in T160430#3383647 - basically: comm -12 NewDevelopersQuarter1.csv AllDevelopersQuarter2.csv > WhoWeRetainedInQuarter2.csv, then comm -12 WhoWeRetainedInQuarter2.csv AllDevelopersQuarter3.csv > WhoWeRetainedInQuarter3.csv, and so on (spelled out as a loop below).
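Spelled out as a loop (file names hypothetical; one sorted author name per line in each file):

cp NewDevelopersQuarter1.csv retained.csv
for quarter in AllDevelopersQuarter2.csv AllDevelopersQuarter3.csv AllDevelopersQuarter4.csv; do
  comm -12 retained.csv "$quarter" > retained.tmp && mv retained.tmp retained.csv
  echo "$quarter: $(wc -l < retained.csv) still retained"
done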

  • This might be a very superficial point, but I'm wondering: when we say retained and active, what do we mean? If someone made a one-line commit after a year, would that imply retention?

Yes. :)

Is there anything missing in this task? Since it is a team goal, a summary in the description would be welcome.

Is there anything missing in this task? Since it is a team goal, a summary in the description would be welcome.

Been wondering myself. :)
https://meta.wikimedia.org/w/index.php?title=Technical_Collaboration/Metrics&oldid=16967767#Onboarding_New_Developers lists five metrics:

So that last item is blurry to me. Apart from that, this task is indeed resolved (thanks for the ping).

Talked to Quim - the KPIs have been defined and they are measurable.
(For some KPIs the actual data is still missing, but that's out of scope for this task.)
Hence closing as resolved.

I'm pretty late here, but just to clarify, the 3 KPIs are the ones listed in T160430#3275660, and it's just about contributing patches, and no measurement on whether those patches were merged or even reviewed?

@Legoktm yes, this is intended. Technical Collaboration's direct responsibility is to recruit and onboard new developers, which is reflected in new patches from new contributors. We have no direct control over code review processes (a lesson that we have learned after a good couple of years of trying hard), and therefore we decided to count patches uploaded as opposed to code review success / patches merged.

Then again, we are co-responsible for code review efficiency, together with the related developer teams and volunteer reviewers. We take that efficiency into account when measuring new developer retention. New developers will stay or leave based on many factors, an important one being how satisfied they are with how their patches were handled.

Posting update steps for KPI3 here (as URLs changed and to clarify admin steps in case someone else performs this) for 2017Q3:

  • On https://wikimedia.biterg.io/app/kibana#/dashboard/C_Gerrit_Demo , filter data of independent new contributors in Gerrit from Q2 2016. See that certain repositories are excluded as they are upstream. Also see that it filters for "author_org_name:Independent". This requires setting an absolute time frame in the upper right corner, in this case From 2016-04-01 00:00:00.000 | To 2016-06-30 23:59:59.999.
  • Manually review the entries in the "New Authors" widget. If the list contains names that are obviously not "Independent" but paid staff, manually adjust the text filter by adding a NOT author_name:"Foo" AND NOT author_name:"Bar" AND ... prefix. (If account data in the DB were always up-to-date, this would not be needed.)
  • Double-check that "New Authors" widget on that page lists ALL names (compare with number "New Authors" number in "Summary" widget at the top).
  • So for 2016Q2, we end up with https://wikimedia.biterg.io/goto/71266ccebe8213aec1f92c569b089d0a and 39 new volunteer developers.
  • Click "Export: Raw" at the bottom of the "Submitters" widget.
  • Open C_Gerrit_Demo_Table.csv (or whatever the file is called) in your favorite spreadsheet application
  • Set comma (,) as separator when importing into your favorite spreadsheet application
  • Remove all columns except for first one (Author)
  • Remove first row (Author)
  • Sort column by alphabet
  • Save (overwrite) file as CSV
  • On https://wikimedia.biterg.io/app/kibana#/dashboard/Gerrit , filter data of independent contributors in Gerrit of Q3 2017 (not Q2, to have a full year after the previous quarter). See that certain repositories are excluded as they are upstream. Also see that it filters for "author_org_name:Independent" and excludes bots (via author_bot:false). This requires setting an absolute time frame in the upper right corner, in this case From 2017-07-01 00:00:00.000 | To 2017-09-30 23:59:59.999.
  • Manually review the entries in the "Submitters" widget. If the list contains names that are obviously not "Independent" but paid staff, manually adjust the text filter by adding a NOT author_name:"Foo" AND NOT author_name:"Bar" AND ... prefix. (If account data in the DB were always up-to-date, this would not be needed.)
  • Double-check that "Submitters" widget on that page lists ALL names (compare with number "# Changeset Submitters" number in "Summary" widget at the top).
  • Click "Export: Raw" at the bottom of the "Submitters" widget.
  • Open gerrit_top_developers.csv (or whatever the file is called) in your favorite spreadsheet application
  • Set comma (,) as separator when importing into your favorite spreadsheet application
  • Remove all columns except for first one (Submitter)
  • Remove first row (Submitter)
  • Sort column by alphabet
  • Save (overwrite) file as CSV
  • comm -12 C_Gerrit_Demo_Table.csv gerrit_top_developers.csv | wc -l (explanation)
$:acko\> comm -12 C_Gerrit_Demo_Table.csv gerrit_top_developers.csv 
Mbch331
MGChecker
$:acko\> more C_Gerrit_Demo_Table.csv | wc -l
39

2 ÷ 39 × 100 = 5.128% for Q3 2017, comparing who was first active in the 3 months of Q2 2016 and was also active in the 3 months of Q3 2017.