Check whether it is true that we have lost 40% of (Git) code contributors in the past 12 months
Closed, Resolved, Public

Description

Conclusion

Graph comparison (⚠: different scale) from before and after excluding / blacklisting 58 repositories :

40perc.png (620×719 px, 67 KB)

The mismatch in numbers (1166 - 58 ≠ 1135) is a result of new repositories created and indexed in the meantime.

Comparing previously displayed numbers in korma:

| Data displayed for month ↓ on date ➝ | 2015-06-22 | 2015-12-02 after excluding 8 repos | 2016-04-08 after excluding 58 repos |
| January 2014 ⚠ | 414 | 303 | 263 |
| May 2014 | 382 | 270 | 245 |
| August 2014 | 371 | 278 | 259 |
| November 2014 ⚠ | 228 | 233 | 189 |
| May 2015 | 225 | 235 | 213 |
| Diff Jan 2014 to Nov 2014: | -45% | -23% | -28% |
Diff Jan 2014 to Jan 2015: -19%
Diff Feb 2014 to Feb 2015: -10%
Diff Oct 2014 to Oct 2015: -12%
Diff Nov 2014 to Nov 2015: +1.6%
Diff Jan 2015 to Jan 2016: +0.5%
Diff Mar 2015 to Mar 2016: -21%
Diff Jan 2014 to Jan 2016: -19%

Note there might still be more repositories around which were imported/pulled from upstream at some point in the past and not updated since then. We also have repositories that are "mixed".

So I'd say we indeed lost contributors in Wikimedia Git.

whether we are still losing developers or we are stable in a lower but still flat line.

It's somewhere between "stable" and "losing a few". http://korma.wmflabs.org/browser/scm-contributors.html is available for everybody's interpretation and for picking up specific months to compare to each other.


Original description

According to the "Authors" graph at http://korma.wmflabs.org/browser/scm.html, in 12 months we have lost about 40% of code contributors (users that got their code merged in Wikimedia hosted repositories).

It is a significant number and according to the graph it's not an anti-spike but a consolidated number.

January 2014:    414
May 2014:        382
August 2014:     371
November 2014:   228 !!!
May 2015:        225

@Qgil is happy to highlight this number to WMF management and our community, but first we would need to be sure that these numbers are correct and not the result of a software bug or another type of misunderstanding, e.g. single users committing from multiple addresses before, and now only from one. CCing here different people that are good at looking at raw data and stats. Your help is welcome!

The Engineering Community team review is on July 6 (draft materials to be presented on June 30), and this would be a good context to bring these numbers.

Event Timeline

There are differences in numbers of repositories

There still are. I created T116483: Explain / sort out / fix SCM repository number mismatch on korma and T116484: Explain / sort out / fix SCR repository number mismatch on korma. So I can trust numbers a bit more soon, hopefully.

in http://korma.wmflabs.org/browser/scm.html "Authors - Last 30 days", in the #8 we have Evan Priestley with 125 commits.

Evan is still listed under "Last 365 days" with 1299 commits. His korma page says "Last contribution: 2015-10-05" and our last Phab update was on Oct 07.
My first guess was [[ https://phabricator.wikimedia.org/diffusion/PHAB/history/master/ | https://gerrit.wikimedia.org/r/p/phabricator/phabricator ]] but last commit there was in Oct 2014. So I found libphutil and arcanist as potential offenders.
Created a pull request to exclude them from Git repos scanned. (The repos were already listed in gerrit_trackers_blacklist.conf.)

⟹ Identifying which Git repositories pollute our results is between cumbersome and luck.
⟹ Software archaeology takes time.

Some of our Git repositories pull from (imported) upstream repositories and also have downstream commits, e.g. https://gerrit.wikimedia.org/r/p/phabricator/deployment (though a pull from upstream is counted as one single commit only), so there will always be some noise.

More Git repos that seem to have upstream commits only:

Created a pull request to exclude them from Git repos scanned.

I do not know if there's Git activity out of the usual code review process.

There is; depending on repository configuration. Example: I can directly push (self-merge) my commits into wikimedia/bugzilla/triagescripts, bypassing Gerrit.

My current assumption is that hhvm-dev (and its imported upstream-only authors) is a big polluter of our Git author data pre-10/2014 (we will know better after my pull requests (1,2) are merged and deployed on korma). Number of hhvm-dev's unique authors per month:

$:andre\> git clone https://gerrit.wikimedia.org/r/p/operations/software/hhvm-dev
Cloning into 'hhvm-dev'...
$:andre\> cd hhvm-dev/
$:andre\> git log --after=2014-06-01 --before=2014-07-01 --author='' --pretty=format:"%ae" | sort | uniq -c | sort -rn | wc -l
76
$:andre\> git log --after=2014-07-01 --before=2014-08-01 --author='' --pretty=format:"%ae" | sort | uniq -c | sort -rn | wc -l
72
$:andre\> git log --after=2014-08-01 --before=2014-09-01 --author='' --pretty=format:"%ae" | sort | uniq -c | sort -rn | wc -l
77
$:andre\> git log --after=2014-09-01 --before=2014-10-01 --author='' --pretty=format:"%ae" | sort | uniq -c | sort -rn | wc -l
0
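
The same per-month check can be scripted instead of typed once per month. A minimal Python sketch, assuming a local clone of the repository; the function names are made up for illustration, and the counting is case-sensitive like the `sort | uniq` pipeline above:

```python
import subprocess


def count_unique(emails):
    """Number of distinct non-empty author emails (case-sensitive)."""
    return len({e for e in emails if e})


def author_emails(repo_dir, after, before):
    """One author email per commit that `git log` reports between the two dates."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "log",
         f"--after={after}", f"--before={before}", "--pretty=format:%ae"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()


def monthly_unique_authors(repo_dir, ranges):
    """Map each (after, before) date pair to its number of unique authors."""
    return {(a, b): count_unique(author_emails(repo_dir, a, b)) for a, b in ranges}
```

Calling `monthly_unique_authors("hhvm-dev", [("2014-06-01", "2014-07-01"), ("2014-07-01", "2014-08-01")])` inside the directory that holds the clone should correspond to the first two commands in the transcript; it needs Python 3.7 or later and a `git` binary on the path.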

Wouldn't whitelisting commits to only those authors who are listed in LDAP be a bit easier? (just a thought).

Interesting thought, thanks. If there is a 1:1 between all LDAP, wikitech, Gerrit accounts (Edit: there is, according to paravoid) I wonder how hard it would be to implement checking LDAP accounts from korma.
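
To make the whitelisting idea concrete, here is a rough Python sketch. It assumes the email addresses of known accounts have already been exported from LDAP into a plain list, which is not something this thread shows how to do; `split_by_allowlist` and the sample addresses are hypothetical.

```python
def split_by_allowlist(commit_emails, known_account_emails):
    """Separate commit author emails into known accounts and everything else."""
    known = {e.lower() for e in known_account_emails}
    inside, outside = set(), set()
    for email in commit_emails:
        # Compare case-insensitively; this is an assumption, not korma behaviour.
        target = inside if email.lower() in known else outside
        target.add(email.lower())
    return inside, outside


# Hypothetical sample data, not taken from korma or LDAP.
counted, ignored = split_by_allowlist(
    ["dev@example.org", "upstream@example.com", "dev@example.org"],
    ["dev@example.org"],
)
print(len(counted), len(ignored))
```

One caveat with this approach: an author who commits with an address that differs from the one on their account would end up in the ignored set, so the result would still need a manual look.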

In T103292#1762313 on Oct 28, @Aklapper wrote:

@Dicortazar / @Lcanasdiaz: How often is this deployed?

It's been five days and I don't see these changes reflected to the "Authors" graph on scm.html and scm-contributors (which still lists Evan) yet...

Data were updated now. The issue with Git is that this is still not automated. I remember some discussion about having all of the repos of gerrit + some repos not found in gerrit. I'm afraid that until this is clear we cannot proceed with the automation of Git repositories.

On the other hand, this is quite straightforward, and this is ready to go.

I've removed the commented repositories, so in the following couple of days this should be visible. The process is right now running, so I guess this is more likely to be ready by tomorrow evening than today.

Hmm. Four days later the graph on scm-contributors.html has not changed and I still see Evan listed (which is an indicator that the three upstream Phabricator repositories are not excluded yet).

Daniel says there is some more work to do here to update these lists.

So far, the process has been focused on automating the list of Git repositories to be analyzed and later to remove the upstream ones.

With this data updated, the authors chart shows a peak of activity in January 2014 with 303 active developers (those who committed at least one source code change), while the lowest number of active developers is seen in August/September 2015 with 207 developers.

This is not exactly a decrease of 40%, but closer to 30% (207 is about 32% below 303), at least when comparing those two periods.

Up to now, the trends chart for authors shows a decrease of 35% if we compare the last 365-activity-days period with the previous 365-activity-days period.

It's worth mentioning that activity remains more stable according to the commits chart at http://korma.wmflabs.org/browser/scm.html, with a decrease of 5% in this case.

Finally, although the sets of Git and Gerrit repositories are not comparable as they are different, the activity in Gerrit shows a similar stable trend, with increases or decreases in activity or community of around 5% for the last 365 days.

An open question would be to check how many of the developers that were active in 2014 are still active nowadays and from those, how many of them were new developers having their first commit in the project.

On http://korma.wmflabs.org/browser/scm-contributors.html I don't see Evan listed anymore as expected (though T119755 makes that page show wrong data as we all know). After excluding upstream-only repositories in 1 and 2 we now see a smaller number of total contributors, as expected.
Comparison of today's chart to the chart from five weeks ago (please note the very different y scale!):

scm-contributors-20151202.png (500×1 px, 59 KB)

An open question would be to check how many of the developers that were active in 2014 are still active nowadays and from those, how many of them were new developers having their first commit in the project.

The first part should basically be covered by http://korma.wmflabs.org/browser/demographics.html ?

In T103292#1386502 on June 22, 2015, @Qgil wrote:

January 2014: 414
May 2014: 382
August 2014: 371
November 2014: 228 !!!
May 2015: 225

100−(100÷414×228) = 45%

Now:

January 2014:    303
May 2014:        270
August 2014:     278
November 2014:   233 !!! (higher; probably due to adding previously unindexed Git repositories, cf. T110678?)
May 2015:        235

100−(100÷303×233) = 23%
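
Both percentages come from the same relative-decrease formula. A minimal Python sketch that reproduces them from the monthly author counts quoted in this comment (the function name `percent_drop` is made up for illustration):

```python
def percent_drop(before: int, after: int) -> float:
    """Relative decrease from `before` to `after`, in percent."""
    return 100 - (100 / before * after)


# January 2014 vs. November 2014 author counts quoted in this comment.
old_snapshot = percent_drop(414, 228)  # numbers as displayed on June 22, 2015
new_snapshot = percent_drop(303, 233)  # numbers displayed now

print(round(old_snapshot), round(new_snapshot))  # prints "45 23"
```

The same function can be pointed at any other pair of months read off the korma charts.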

January 2014 went from 414 to 303. May 2014 went from 382 to 270. OK.

Still, it looks like we are losing developers indeed? Some that used Gerrit might be using GitHub only now, but would they be about 35 developers? And anyway, in the meantime WMF Engineering has grown, although probably not as much as in previous years? (citation needed)

What seems to be clear is that the number of code contributors is not increasing, and especially the amount of volunteer code contributors might be decreasing, partly as pure decrease, partly because some of these volunteers become affiliated with the WMF or WMDE.

Aklapper lowered the priority of this task from High to Medium. Dec 5 2015, 1:34 PM
Aklapper moved this task from Need discussion to Doing on the wikimedia.biterg.io board.
Aklapper raised the priority of this task from Medium to High. Dec 5 2015, 2:48 PM

Still, it looks like we are losing developers indeed? Some that used Gerrit might be using GitHub only now, but would they be about 35 developers? And anyway, in the meantime WMF Engineering has grown, although probably not as much as in previous years? (citation needed)

korma.wmflabs.org/browser/scm-organizations.html only lists total number of authors per org for all-time and commits (not authors) per org per month. I guess it's a mix of both some WMF work (Mobile? Services?) happening on GitHub instead of Wikimedia Git plus a slight decrease in volunteer contributors.

What seems to be clear is that the number of code contributors is not increasing, and especially the amount of volunteer code contributors might be decreasing, partly as pure decrease, partly because some of these volunteers become affiliated with the WMF or WMDE.

I share that interpretation.

Wondering if there is anything left to do in this task? I don't see anything obvious.

Make a list of counted code contributors for two months before and after the suspected decrease and check the names of the list and incoming ones, to look for obvious patterns.

Problem is that I have a hard time defining a specific point of decrease given the updated graph in F3042158...

Maybe make a year-long list then. https://www.openhub.net/orgs/wikimedia/projects shows a loss of 50 contributors in core and 50 in extensions in the period "Dec 19 2014 — Dec 19 2015" compared to the previous 12 months period.

With lists such as https://www.openhub.net/p/mediawiki/contributors?sort=latest_commit and https://www.openhub.net/p/mediawiki-extensions-wmf/contributors?sort=latest_commit you can go back to next pages and see what contributors disappeared.

P.s.: To aid such an analysis, I have:

  • added the main skins to the "mediawiki" project (as they were split to separate repositories);
  • created a "mediawiki-skins" project with some skins and the submodule-based parent project for skins;
  • added the mediawiki/extensions parent repository to the mediawiki-extensions-wmf-hosted project, given submodules seem now to work;
  • added 24 extensions newly added in operations/mediawiki-config/wmf-config/extensions-list since the last update of mediawiki-extensions-wmf-supported in March 2014 as found by git diff 8f470d3ac51c5ab768d4a248bfb5b2f696dc61a6..HEAD wmf-config/extension-list | grep -v "^ ", namely:
+$IP/extensions/ApiFeatureUsage/ApiFeatureUsage.php
+$IP/extensions/BounceHandler/BounceHandler.php
+$IP/extensions/Cards/extension.json
+$IP/extensions/CiteThisPage/extension.json
+$IP/extensions/Citoid/Citoid.php
+$IP/extensions/ContentTranslation/extension.json
+$IP/extensions/FundraisingTranslateWorkflow/FundraisingTranslateWorkflow.php
+$IP/extensions/Gather/Gather.php
+$IP/extensions/GlobalCssJs/GlobalCssJs.php
+$IP/extensions/GlobalUserPage/GlobalUserPage.php
+$IP/extensions/Graph/Graph.php
+$IP/extensions/ImageMetrics/ImageMetrics.php
+$IP/extensions/Josa/Josa.php
+$IP/extensions/JsonConfig/JsonConfig.php
+$IP/extensions/ParsoidBatchAPI/extension.json
+$IP/extensions/Petition/Petition.php
+$IP/extensions/Popups/Popups.php
+$IP/extensions/QuickSurveys/extension.json
+$IP/extensions/RestBaseUpdateJobs/RestbaseUpdate.php
+$IP/extensions/SandboxLink/SandboxLink.php
+$IP/extensions/WikidataPageBanner/extension.json
+$IP/extensions/XAnalytics/XAnalytics.php
+$IP/extensions/ZeroBanner/ZeroBanner.php
+$IP/extensions/ZeroPortal/ZeroPortal.php

Maybe make a year-long list then. https://www.openhub.net/orgs/wikimedia/projects shows a loss of 50 contributors in core and 50 in extensions in the period "Dec 19 2014 — Dec 19 2015" compared to the previous 12 months period.

http://korma.wmflabs.org/browser/scm.html under "Authors" states 238 for Nov 2014 and 182 for Nov 2015 in total.

@Nemo_bis: We could get a list of these account names (if that's not in some JSON file under mediawiki-dashboard/browser/data/json already) but that would obviously be the data that we want to check (challenge) in this very report. So apart from a complete checkout of all Wikimedia Git repositories (how to do that?), querying author names for each repository and then merging author names, I don't know how we could achieve that. Any ideas?

I'm not sure I follow your question but I'll try to rephrase my question: we have found no proof for the truth or falsehood of the statement "we lost 40 % of contributors". Logical/mathematical practice tells us to now assume the statement is true and find a counterexample. The simplest way to do so is to find the list of persons that the method identified as lost and then find a counterexample among them.

we have found no proof for the truth or falsehood of the statement "we lost 40 % of contributors".

Well, if we trust the numbers provided by korma.wmflabs.org and our settings of repositories to be covered, we have found proof for a loss of 23% in T103292#1844591, and I'm afraid I don't have the time to manually go through lists of users.
So I'm not sure how to proceed here and I propose to close this task as resolved.

When it comes to interpretation, we have assumptions that some projects being worked on on GitHub are an influence.

Thank you! If we assume that it is true that we have lost 23% (within a year?), then the next thing we need to check is whether we are still losing developers or we are stable in a lower but still flat line.

Then another important factor to look at is whether the current population is stalled (not many arrive, not many leave) or having high attrition (many arrive, but many leave).

These indicators would define which type of problem we are facing. What seems clear at this point is that we are facing an important problem, and this data should inform our Developer Relations strategy work during this quarter.

Thank you! If we assume that it is true that we have lost 23% (within a year?), then the next thing we need to check is whether we are still losing developers or we are stable in a lower but still flat line.

Looking at the top graph of scm-contributors.html and comparing a 2015 month to the same 2014 month, we are still losing Wikimedia Git developers.

Then another important factor to look at is whether the current population is stalled (not many arrive, not many leave) or having high attrition (many arrive, but many leave).

According to demographics.html, 175 new code developers were attracted in the time frame between 6m and 12m ago, and 51 of them were still active within the last 6m. But we do not have the complete number of developers to compare on this page, and Authors on scm-contributors.html only lists data per month instead of six months. Taking just the average number of SCM authors per month in Jan-Jun 2015 (248⅓) would be too low I'm afraid.
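
For reference, the retention rate implied by the two demographics.html figures in this comment can be worked out directly. A small Python sketch (the function name is made up; the two numbers are the ones quoted above):

```python
def retention_percent(newcomers: int, still_active: int) -> float:
    """Share of newcomers that were still active later, in percent."""
    return 100 * still_active / newcomers


# 175 developers arrived between 12 and 6 months ago; 51 were still active
# within the last 6 months.
print(round(retention_percent(175, 51)))  # prints "29"
```

So roughly seven out of ten newcomers from that window were no longer active in the following six months, with the caveat already mentioned that the total population for the same window is not available on that page.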

@Dicortazar: Yes (we talked about this), providing a list of names of our SCM users from something like Nov 2015 and also Nov 2014, to compare names of Git users.

Reassigning to @Dicortazar

@Aklapper, I just sent you an email with such info. As this contains some email addresses, I preferred to do this by direct message.

Compared Nov 2014 (299 identities) and Nov 2015 (228 identities) by importing the two CSV files I received into MariaDB.

There are some duplicates:

MariaDB [T103292]> SELECT email, COUNT(email) FROM x201411 GROUP BY email HAVING ( COUNT(email) > 1 );
35 rows in set
MariaDB [T103292]> SELECT email, COUNT(email) FROM x201511 GROUP BY email HAVING ( COUNT(email) > 1 );
15 rows in set

When it comes to email addresses not in the other set:

MariaDB [T103292]> SELECT email FROM x201511 WHERE email NOT IN (SELECT email FROM x201411);
87 rows in set
MariaDB [T103292]> SELECT email FROM x201411 WHERE email NOT IN (SELECT email FROM x201511);
145 rows in set
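
The same comparison can be done without a database. A minimal Python sketch, assuming the two CSV files have no header row and use the column order name, email, commits; the helper names are made up for illustration:

```python
import csv
from collections import Counter


def read_emails(csv_path):
    """Read the email column from a headerless name,email,commits CSV file."""
    with open(csv_path, newline="") as handle:
        return [row[1] for row in csv.reader(handle) if len(row) >= 2]


def duplicated(emails):
    """Emails that appear more than once (like GROUP BY ... HAVING COUNT > 1)."""
    return {email for email, n in Counter(emails).items() if n > 1}


def only_in_first(first, second):
    """Distinct emails present in `first` but absent from `second`."""
    return set(first) - set(second)
```

Note that `only_in_first` returns distinct addresses, whereas the `NOT IN` queries above return one row per table row, so the counts can come out lower than 87 and 145 wherever an address is duplicated.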

I manually went through the results / email addresses. I don't see obvious patterns, but I do have two thoughts:

  1. I see some email address domains like civicrm, redhat, linkedin, IBM, ebay, yelp, HP, OSSystems, LinuxFoundation, Merck, Puppetlabs. They might be contributing to Wikimedia repos, but some of those might again be coming from imported upstream repositories, like OpenStack stuff. Hence we likely need to identify and blacklist more repositories via https://github.com/Bitergia/mediawiki-repositories/blob/master/git_repositories_blacklist.conf
  2. Regarding duplicates, I think old data (like "number of SCM users in Nov 2014" displayed in korma) does get updated accordingly when merging two identities into one in the korma user database, but we might still want to double-check.

Ignore this comment:
Just documenting how I imported those CSV files into MariaDB, so I can easily find those steps again:

MariaDB [(none)]> CREATE DATABASE T103292;
MariaDB [(none)]> use T103292;
MariaDB [T103292]> CREATE TABLE x201411 (name VARCHAR(255), email VARCHAR(255), commits INTEGER);
MariaDB [T103292]> CREATE TABLE x201511 (name VARCHAR(255), email VARCHAR(255), commits INTEGER);
MariaDB [T103292]> LOAD DATA LOCAL INFILE 'people_commits_201411.csv' INTO TABLE x201411 FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' (name, email, commits);
Query OK, 299 rows affected             
MariaDB [T103292]> LOAD DATA LOCAL INFILE 'people_commits_201511.csv' INTO TABLE x201511 FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' (name, email, commits);
Query OK, 228 rows affected
  1. As per #2 above: 'people_commits_201411.csv' had some duplicates listed (I just went through them). As I am pretty sure that nobody ever merged some of those accounts manually in korma's identities DB, double-check whether the 2 (sub)ids are counted or the 1 uuid (as expected). One example entry:
"9e7dd9958be33ff8b57da8386f516ea58aabd676": {
    "enrollments": [], 
    "identities": [
        {
            "email": "kgoodhope@xxxxxxxxxx.com", 
            "id": "9e7dd9958be33ff8b57da8386f516ea58aabd676", 
            "name": "kgoodhop", 
            "source": "wikimedia:scm", 
            "uuid": "9e7dd9958be33ff8b57da8386f516ea58aabd676"
        }, 
        {
            "email": "kgoodhope@xxxxxxxxxx.com", 
            "id": "9f2f73eb9ddd3c4d17e393038195059294ec3ec5", 
            "name": "kengoodhope", 
            "source": "wikimedia:scm", 
            "uuid": "9e7dd9958be33ff8b57da8386f516ea58aabd676"
        }
    ], 
}
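
To illustrate the difference between counting (sub)ids and counting uuids, here is a small Python sketch. It assumes the identities file is a mapping from uuid to a record shaped like the example above; the function name is made up, and the sample keeps only the two fields that matter for the count.

```python
def count_people_and_identities(identities_by_uuid):
    """Return (unique uuids, unique ids) for a uuid -> record mapping."""
    people = set()
    raw_ids = set()
    for record in identities_by_uuid.values():
        for identity in record.get("identities", []):
            people.add(identity["uuid"])
            raw_ids.add(identity["id"])
    return len(people), len(raw_ids)


# Trimmed copy of the example entry: two ids that share one uuid.
sample = {
    "9e7dd9958be33ff8b57da8386f516ea58aabd676": {
        "enrollments": [],
        "identities": [
            {"id": "9e7dd9958be33ff8b57da8386f516ea58aabd676",
             "uuid": "9e7dd9958be33ff8b57da8386f516ea58aabd676"},
            {"id": "9f2f73eb9ddd3c4d17e393038195059294ec3ec5",
             "uuid": "9e7dd9958be33ff8b57da8386f516ea58aabd676"},
        ],
    }
}
print(count_people_and_identities(sample))  # prints "(1, 2)"
```

If korma counted per id, this person would show up twice in a monthly total; counting per uuid gives the expected single author.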
  1. I see some email address domains like civicrm, redhat, linkedin, IBM, ebay, yelp, HP, OSSystems, LinuxFoundation, Merck, Puppetlabs. They might be contributing to Wikimedia repos, but some of those might again be coming from imported upstream repositories, like OpenStack stuff. Hence we likely need to identify and blacklist more repositories via https://github.com/Bitergia/mediawiki-repositories/blob/master/git_repositories_blacklist.conf

Look at author/committer names (and dates!) in e.g. integration/jenkins-job-builder, operations/debs/jenkins-debian-glue or integration/zuul which we currently do not exclude from statistics.
As %&$!* Gerrit does not allow me to create a complete checkout of all our repos and then run a script to list authors/committers per repo (to search for those authors and hence identify candidate repos that we should investigate whether to exclude) I have no idea how to find such repos.

I don't like Gerrit.

Probably possible to find that information easily with gsql access.

Edit:

gerrit> select count(*) from changes, accounts where owner_account_id = account_id and preferred_email like '%@wikimedia.org';
 count(*)
 --------
 98194
(1 row; 302 ms)

Actually, that changes table probably only counts stuff going through review, likely not imported commits.

[offtopic]

Gerrit does not allow me to create a complete checkout of all our repos

To get a complete repository checkout from Gerrit, in order to grep author names in each repository: List all projects in your browser on gerrit.wikimedia.org in Firefox, press Ctrl and then mark content of first column with mouse, drop into textfile gerritrepos.txt, run shell script (that creates some detached repos due to subprojects but whatever):

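#!/bin/bash
# Clone every Gerrit project listed (one per line) in gerritrepos.txt.
filename='gerritrepos.txt'
for repos in $(cat "$filename"); do
  # Flatten "a/b/c" to "a-b-c" so each clone gets its own top-level directory.
  escapedrepo="$(echo "$repos" | sed 's/\//-/g')"
  git clone "ssh://aklapper@gerrit.wikimedia.org:29418/$repos" "$escapedrepo"
done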

After that's done, get the author names for each repo:

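#!/bin/bash
# For every cloned repository below the current directory,
# print author email addresses with their commit counts.
for i in */; do
  i="${i%/}"
  if [ -d "$i/.git" ]; then
    echo "===== " "$i" ": ====="
    # %ae is the author email; most active addresses are listed first.
    git -C "$i" log --author='' --pretty=format:"%ae" | sort | uniq -c | sort -rn
  fi
done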
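
The per-repository output of that script still has to be read by eye. One rough way to rank repositories for a closer look is the share of commits whose author address lies outside a set of "home" domains. This is only a heuristic sketch in Python: the function names and the choice of home domains are assumptions, not something korma does.

```python
from collections import Counter


def domain_shares(author_emails):
    """Fraction of commits per author email domain, largest first."""
    domains = Counter(
        e.rsplit("@", 1)[-1].lower() for e in author_emails if "@" in e
    )
    total = sum(domains.values())
    return [(d, n / total) for d, n in domains.most_common()]


def outside_share(author_emails, home_domains):
    """Fraction of commits whose author domain is not in `home_domains`."""
    home = {d.lower() for d in home_domains}
    return sum(share for d, share in domain_shares(author_emails) if d not in home)
```

A high outside share is only a hint, since many volunteers commit from personal addresses; it can narrow down which repositories to inspect, but it cannot decide on its own whether a repository is upstream-only.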

List all projects in your browser on gerrit.wikimedia.org in Firefox, press Ctrl and then mark content of first column with mouse, drop into textfile gerritrepos.txt

Might be simpler to ssh username@gerrit.wikimedia.org -p 29418 gerrit ls-projects > gerritrepos.txt

From a quick glance, candidate repositories to check and potentially exclude from our statistics in korma:

  • wikimedia-fundraising-civicrm-buildkit
  • wikimedia-fundraising-civicrm-buildkit-vendor-totten-amp
  • wikimedia-fundraising-civicrm-buildkit-vendor-totten-git-scan
  • wikimedia-fundraising-civicrm-core
  • wikimedia-fundraising-civicrm-drupal
  • wikimedia-fundraising-civicrm-joomla
  • wikimedia-fundraising-civicrm-packages
  • wikimedia-fundraising-civicrm-wordpress
  • gerrit
  • integration-jenkins-job-builder
  • integration-zuul
  • operations-debs-flannel
  • operations-debs-ganglia
  • operations-debs-git-fat
  • operations-debs-golang-burrow
  • operations-debs-jenkins-debian-glue
  • operations-debs-jmxtrans
  • operations-debs-kafka
  • operations-debs-kafkacat
  • operations-debs-kubernetes
  • operations-debs-latexml
  • operations-debs-logstash-gelf
  • operations-debs-logster
  • operations-debs-nodepool
  • operations-debs-perf-tools
  • operations-debs-phantomjs
  • operations-debs-puppet
  • operations-debs-python-dotted
  • operations-debs-python-gevent
  • operations-debs-python-jsonschema
  • operations-debs-python-kafka
  • operations-debs-python-mwparserfromhell
  • operations-debs-python-phabricator
  • operations-debs-python-sprockets
  • operations-debs-python-sprockets-clients-statsd
  • operations-debs-python-sprockets-mixins-statsd
  • operations-debs-python-stopit
  • operations-debs-StatsD
  • operations-debs-varnish
  • operations-debs-varnish4
  • operations-puppet-cdh
  • operations-puppet-cdh4
  • operations-puppet-jmxtrans
  • operations-puppet-kafka
  • operations-puppet-kafkatee
  • operations-puppet-mesos
  • operations-puppet-varnishkafka
  • operations-software-gdash
  • operations-software-grafana
  • operations-software-hhvm-dev
  • operations-software-hhvm-dev-folly
  • operations-software-hhvm-dev-third-party
  • operations-software-kibana
  • operations-software-librenms
  • operations-software-nginx
  • operations-software-tessera
  • operations-software-varnish-libvmod-header
  • operations-software-varnish-libvmod-tbf
  • operations-software-varnish-libvmod-vslp
  • operations-software-varnish-varnishkafka
  • operations-software-xhgui
  • phabricator-arcanist
  • phabricator-libphutil
  • phabricator-phabricator
  • pywikibot-externals-httplib2

Looks good to me and, in fact, I thought we got rid of these already.

Filtering at the repository level is too crude and too inefficient, in my opinion. There are cases where a repository's history consists primarily of upstream commits, but with some substantial packaging work by Wikimedia engineers, for example.

It's also a brittle approach: you are liable to get terribly skewed numbers if you omit even one highly-active repository -- a mistake that is going to be easy to make, given that new repositories are being created all the time.

Could you produce a list of Gerrit changes for each interval? That will be far more accurate.

@ori: "There are cases where a repository's history consists primarily of upstream commits, but with some substantial packaging work by Wikimedia engineers" is a perfect description of the issue we are running into, and that's why I believe we'll never have "clean" statistics on Wikimediasphere-only code activity on korma.wmflabs.org.

Wondering if I get you correctly: Are you saying that excluding / blacklisting those Git repositories listed in T103292#2071786 from the Git (not: Gerrit) statistics on korma.wmflabs.org is something we should not do?
Note that new repositories being created by default are included on korma.wmflabs.org.

https://github.com/Bitergia/mediawiki-repositories/pull/17 got merged on 2016-03-09 and should exclude 58 SCM repositories from the data here.
SCM data on korma was updated on 2016-03-11 and should incorporate that change.
However, http://korma.wmflabs.org/browser/scm.html still says 1,166 repositories (same number as on 2016-03-10) and the "Authors" line in the graph/chart has not changed at all, so I don't buy this yet.

I am still waiting for the data to get updated which has not happened for a while.

We're working on that @Aklapper . Sorry for the inconvenience :(

However, http://korma.wmflabs.org/browser/scm.html still says 1,166 repositories (same number as on 2016-03-10) and the "Authors" line in the graph/chart has not changed at all, so I don't buy this yet.

SCM data on korma was updated but still the same issue: Number of repos has not decreased. :(

I've been fixing that. SCM repos data with the removal of repos will be ready by today EOB.

http://korma.wmflabs.org/browser/scm.html now shows 1135 repositories. This list will be automatically updated on Tuesday, Thursday and Saturday.

whether the current population is stalled (not many arrive, not many leave) or having high attrition (many arrive, but many leave).

Covered by http://korma.wmflabs.org/browser/demographics.html

As Lcanasdiaz said, the script was not running. Now it's updated.

Graph comparison (⚠: different scale) from before and after excluding / blacklisting 58 repositories :

40perc.png (620×719 px, 67 KB)

The mismatch in numbers (1166 - 58 ≠ 1135) is a result of new repositories created and indexed in the meantime.

Comparing previously displayed numbers in korma:

| Data displayed for month ↓ on date ➝ | 2015-06-22 | 2015-12-02 after excluding 8 repos | 2016-04-08 after excluding 58 repos |
| January 2014 ⚠ | 414 | 303 | 263 |
| May 2014 | 382 | 270 | 245 |
| August 2014 | 371 | 278 | 259 |
| November 2014 ⚠ | 228 | 233 | 189 |
| May 2015 | 225 | 235 | 213 |
| Diff Jan 2014 to Nov 2014: | -45% | -23% | -28% |
Diff Jan 2014 to Jan 2015: -19%
Diff Feb 2014 to Feb 2015: -10%
Diff Oct 2014 to Oct 2015: -12%
Diff Nov 2014 to Nov 2015: +1.6%
Diff Jan 2015 to Jan 2016: +0.5%
Diff Mar 2015 to Mar 2016: -21%
Diff Jan 2014 to Jan 2016: -19%

Note there might still be more repositories around which were imported/pulled from upstream at some point in the past and not updated since then. We also have repositories that are "mixed".

So I'd say we indeed lost contributors in Wikimedia Git.

whether we are still losing developers or we are stable in a lower but still flat line.

It's somewhere between "stable" and "losing a few". http://korma.wmflabs.org/browser/scm-contributors.html is available for everybody's interpretation and for picking up specific months to compare to each other.

I don't have a good idea how to proceed from here and what good followup items could be.
Personally I'd close this task as we'll never be able to have a clear cut when it comes to blacklisting repositories due to the way some teams work when importing repositories from upstream, plus GitHub.

To me it is a good resolution of this task to conclude that "we indeed lost contributors in Wikimedia Git" and that currently the situation is between "stable" and "losing a few". If we could have some very basic numbers to support this (with all the disclaimers), then that would really help the Technical Collaboration team, as our strategy, annual plan, and quarterly goals need to put a focus on developer outreach.

Maybe instead of strategy and the annual plan we should just bring back a little fun to MediaWiki development:)

Sorry for being so boring. :) Before the Amen though, how should we "just bring back a little fun to MediaWiki development"?

Please continue discussing bringing back phun. (I'd order a big chunk of that.)
In the meantime I'll silently close this task as resolved. :) Thanks everybody for your input and patience here.