Automated generation of (Git) repositories for Korma
Closed, Resolved (Public)

Description

After today's meeting, split this task from T104845: Automated generation of (Gerrit) repositories for Korma to cover Git.
T104845 covers Gerrit.

Event Timeline

Aklapper raised the priority of this task from to Low.
Aklapper updated the task description.
Aklapper added subscribers: Aklapper, Qgil, Dicortazar.

Just checking: Git stats will not be affected by the blacklist at https://github.com/Bitergia/mediawiki-repositories/blob/master/gerrit_trackers_blacklist.conf, right? I hope so, because this will allow us to mark repositories as "Inactive" (and therefore ignored -- see T102920) in code review terms, while still keeping their Git data.

If this is the case, we will need another blacklist file to filter out upstream repos from Git data and eventually other weird cases.

@Qgil, that's the point.

However, as the name of the git repositories can be inferred from the name of the gerrit projects, that shouldn't be that difficult.
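As a minimal sketch of that inference: the update log later in this thread shows Git URLs of the form https://gerrit.wikimedia.org/r/<project name>, so the mapping can be a plain string join. The helper name `git_url_for` is hypothetical and not part of any Grimoire tool.

```python
# Base URL taken from the repository URLs shown in the update log in this thread.
GERRIT_BASE = "https://gerrit.wikimedia.org/r"


def git_url_for(project, base=GERRIT_BASE):
    """Build a Git URL from a Gerrit project name (hypothetical helper)."""
    # Strip stray slashes so the join never produces "//" or a trailing "/".
    return base.rstrip("/") + "/" + project.strip("/")


# Project names below are taken from the log in this thread.
projects = ["phabricator/libphutil", "operations/software/hhvm-dev/folly"]
git_urls = [git_url_for(p) for p in projects]
```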

As far as I understand, the process with Git would be the same: you add projects to the blacklist config file and the machinery then removes them. This allows us to have different sets of projects in Git and Gerrit, but then the data may not be that comparable. I think it's worth mentioning this. For instance, growth or decrease in activity or authors would not be comparable, given the different sets of repos under analysis.

Then, with respect to Git, the way I learn that there is a new Git repository is through Gerrit, when the project list is updated. Can you think of any other way, outside the Gerrit process?

Regarding deprecated repositories (those that are not listed in Gerrit), do you also want to remove them from the Git database?

> because this will allow us to mark repositories as "Inactive" in code review terms, but we will still want to keep their Git data.

> Regarding deprecated repositories (those that are not listed in Gerrit), do you also want to remove them from the Git database?

I currently cannot come up with a good use case for keeping Git data of deprecated repos. Also, we could easily "un-list" previously inactive projects once they become maintained again (which I assume would re-index the whole Git history of that repo anyway).

Git and Gerrit are pretty much coupled in Wikimedia (cf. https://www.mediawiki.org/wiki/Gerrit/New_repositories ) so if there are no convincing reasons against, I'd favor having and updating one single blacklist (could even just be a softlink) for Wikimedia.

As I see it, our interest in Gerrit data is mostly about the present: How is the queue doing? Are we progressing?

The interest in Git data extends to the past as well. A repository might be dead today but have an important active past that we want to capture in our metrics. Imagine an extension that today is deprecated, inactive, and basically dead, but that involved a dozen developers and hundreds of commits back in 2008-2012. We still want those numbers contributing to the general Wikimedia metrics for those years.

Going back to T103292: Check whether it is true that we have lost 40% of (Git) code contributors in the past 12 months: if we keep the Git history of inactive repos, then we can be more certain that, e.g., the "Authors" data of a year ago at http://korma.wmflabs.org/browser/scm.html is correct. If we remove that data, then the figures for a year ago will likely be wrong, because real authors of a year ago will not be counted.

Therefore, keeping this historical data is crucial to know whether we are really losing 40% of authors in one year.
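To illustrate the argument with a toy calculation: the sketch below counts distinct authors per year, once with all repositories and once with a now-inactive repository dropped. The commit records, repository names, and author names are invented for illustration and are not Korma data.

```python
from collections import defaultdict

# Hypothetical commit records: (repository, author, year). Not real Korma data.
commits = [
    ("extensions/OldExt", "alice", 2014),
    ("extensions/OldExt", "bob", 2014),
    ("core", "carol", 2014),
    ("core", "carol", 2015),
    ("core", "dave", 2015),
]


def authors_per_year(commits, excluded_repos=frozenset()):
    """Count distinct authors per year, skipping commits from excluded repos."""
    authors = defaultdict(set)
    for repo, author, year in commits:
        if repo not in excluded_repos:
            authors[year].add(author)
    return {year: len(names) for year, names in sorted(authors.items())}


# Keeping the inactive repo: 3 authors in 2014, 2 in 2015 (a real decline).
with_history = authors_per_year(commits)

# Dropping the inactive repo: 1 author in 2014, 2 in 2015 (an apparent increase).
without_history = authors_per_year(commits, {"extensions/OldExt"})
```

With these made-up numbers, removing the inactive repository turns a year-over-year decline into an apparent increase, which is why the historical data matters for the 40% question.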

Qgil convinced me. So yeah, we should keep such data.

Aklapper raised the priority of this task from Low to High. Sep 25 2015, 10:52 AM

As similarly done for gerrit, I've added the current list of Git repositories analyzed so far in Korma [1].

There are two new files: git_repositories.conf and git_repositories_blacklist.conf. The second one is empty.

Next, we should probably add more Git repositories, according to the latest version of the Gerrit project list [2].

With this in mind, the blacklist file needs to be populated. This should probably be done by you, and once it is ready to go, we can produce the new metrics and answer T103292: Check whether it is true that we have lost 40% of (Git) code contributors in the past 12 months.

[1] https://github.com/Bitergia/mediawiki-repositories
[2] https://gerrit.wikimedia.org/r/#/admin/projects/

Thank you a lot!
So as the technical infrastructure seems to be in place to support automated generation (and manual exclusion) of Git repositories, is anything left to do here or can this be closed as resolved?

Actually adding entries to the blacklist file feels like a separate task to me, and I'd rather discuss its potential obstacles in T103292.

@Aklapper, how can we automatically retrieve the list of Git repositories available from some place in the mediawiki infrastructure?

> @Aklapper, how can we automatically retrieve the list of Git repositories available from some place in the mediawiki infrastructure?

If I get it all right, in this specific case Gerrit=Git. When you use git clone in Wikimedia you will access the URL gerrit.wikimedia.org. As you wrote earlier here,

> we should probably add more Git repositories, according to the very last version of the Gerrit projects [2].

So I'd fall back to using the same process as for Gerrit repos to *gather* Git repos, and then decide which SCR/Gerrit repos and which SCM/Git repos we want displayed in korma via the separate blacklists for Git and Gerrit in https://github.com/Bitergia/mediawiki-repositories

If I get it all right.

@Aklapper, my only concern is about the git repositories that are not part of Gerrit but still are required in the list of repositories for historical reasons. In gerrit, the list of repositories is updated according to info provided by the gerrit server. If git repositories do not follow this convention, then how can we add and remove new/old repos?

> @Aklapper, my only concern is about the git repositories that are not part of Gerrit but still are required in the list of repositories for historical reasons. In gerrit, the list of repositories is updated according to info provided by the gerrit server. If git repositories do not follow this convention, then how can we add and remove new/old repos?

I might not fully understand the current Grimoire process. Or I might be missing something obvious. Let me try:

When it comes to SCR data in gerrit_trackers.conf, my understanding is that the file is generated by Automator running Octopus to retrieve Gerrit data source info and then rremoval takes the list in gerrit_trackers_blacklist.conf to remove some items again in korma. I am simplifying, but is that roughly correct?

When it comes to SCM data, in T110678#1684169 you wrote "As similarly done for gerrit, I've added the current list of Git repositories analyzed so far in Korma." How exactly was that initial list of Git repos in git_repositories.conf generated, in contrast to the SCR list creation process?

@Dicortazar: Still wondering:

> How exactly was that initial list of Git repos in git_repositories.conf generated, in contrast to the SCR list creation process?

We should not remove/blacklist inactive Gerrit repos for SCM data in korma (as past stuff is still useful to find out about code contributors). But we should remove/blacklist inactive Gerrit repos for SCR data in korma. Don't block inactive ones for Git, only for Gerrit.
So I see us getting the list of all Gerrit repos for SCM and SCR data in korma, and for SCM in korma we blacklist those repos which are upstream only.
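A minimal sketch of that policy, assuming each list can be treated as a set of project names: one shared Gerrit project list, with a separate blacklist for each data source. All project names and variable names below are hypothetical placeholders, and the sketch says nothing about how Automator or rremoval actually apply the blacklists.

```python
# Hypothetical project names standing in for the full Gerrit project list.
all_gerrit_projects = {"team/active-tool", "team/retired-tool", "vendor/upstream-mirror"}

# SCR (Gerrit, code review): blacklist inactive repos.
scr_blacklist = {"team/retired-tool"}

# SCM (Git): blacklist only upstream-only repos; inactive repos stay for history.
scm_blacklist = {"vendor/upstream-mirror"}

# Set difference yields the repositories each data source should cover.
scr_repos = all_gerrit_projects - scr_blacklist
scm_repos = all_gerrit_projects - scm_blacklist
```

The inactive repository remains in the SCM set, so its past commits and authors still count, while it no longer appears in code review metrics.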

I have not yet checked whether some repos are in our SCR data but not in our SCM data in korma. That would surprise me (but maybe I need a coffee first this morning).

Our new micro tool is running right now to automagically update the list of Git repos based on the following lists:

The URL is https://github.com/MetricsGrimoire/Reposaurs.

I'll let you know if the result is correct

It works like a charm. I ran it twice; the logs of both executions are below.

owl@atari:~/dashboards/mediawiki$ cat log/update_repos.log |grep INFO
[27/Nov/2015:08:19:34] - INFO - Reposaurs starts ..
[27/Nov/2015:08:36:55] - INFO - 1188 git clone directories
[27/Nov/2015:08:36:55] - INFO - 1220 repositories in the database
[27/Nov/2015:08:36:55] - INFO - 1 repos with git directory cloned not stored in database
[27/Nov/2015:08:36:55] - INFO - 33 repos in database without git clone directory
[27/Nov/2015:08:36:55] - INFO - 8 repos to be removed
[27/Nov/2015:08:37:08] - INFO - https://gerrit.wikimedia.org/r/operations/software/hhvm-dev/folly removed from database
[27/Nov/2015:08:37:08] - INFO - directory /home/owl/dashboards/mediawiki/scm/operations/software/hhvm-dev/folly removed
[27/Nov/2015:08:37:26] - INFO - https://gerrit.wikimedia.org/r/operations/software/hhvm-dev removed from database
[27/Nov/2015:08:37:28] - INFO - directory /home/owl/dashboards/mediawiki/scm/operations/software/hhvm-dev removed
[27/Nov/2015:08:37:35] - INFO - https://gerrit.wikimedia.org/r/phabricator/libphutil removed from database
[27/Nov/2015:08:37:35] - INFO - directory /home/owl/dashboards/mediawiki/scm/phabricator/libphutil removed
[27/Nov/2015:08:37:41] - INFO - https://gerrit.wikimedia.org/r/phabricator/arcanist removed from database
[27/Nov/2015:08:37:41] - INFO - directory /home/owl/dashboards/mediawiki/scm/phabricator/arcanist removed
[27/Nov/2015:08:37:49] - INFO - https://gerrit.wikimedia.org/r/gerrit removed from database
[27/Nov/2015:08:37:49] - INFO - directory /home/owl/dashboards/mediawiki/scm/gerrit removed
[27/Nov/2015:08:37:55] - INFO - https://gerrit.wikimedia.org/r/wikimedia/fundraising/crm/drush removed from database
[27/Nov/2015:08:37:55] - INFO - directory /home/owl/dashboards/mediawiki/scm/wikimedia/fundraising/crm/drush removed
[27/Nov/2015:08:38:03] - INFO - https://gerrit.wikimedia.org/r/phabricator/phabricator removed from database
[27/Nov/2015:08:38:04] - INFO - directory /home/owl/dashboards/mediawiki/scm/phabricator/phabricator removed
[27/Nov/2015:08:38:10] - INFO - https://gerrit.wikimedia.org/r/operations/software/hhvm-dev/third-party removed from database
[27/Nov/2015:08:38:10] - INFO - directory /home/owl/dashboards/mediawiki/scm/operations/software/hhvm-dev/third-party already removed by someone else
[27/Nov/2015:08:38:10] - INFO - 0 repos to be downloaded
[27/Nov/2015:08:38:10] - INFO - Finished

So, after the first execution, everything is up to date. The next execution should change nothing:

owl@atari:~$ cat dashboards/mediawiki/log/update_repos.log |grep INFO
[27/Nov/2015:08:40:11] - INFO - Reposaurs starts ..
[27/Nov/2015:08:55:48] - INFO - 1180 git clone directories
[27/Nov/2015:08:55:48] - INFO - 1212 repositories in the database
[27/Nov/2015:08:55:48] - INFO - 1 repos with git directory cloned not stored in database
[27/Nov/2015:08:55:48] - INFO - 33 repos in database without git clone directory
[27/Nov/2015:08:55:48] - INFO - 0 repos to be removed
[27/Nov/2015:08:55:48] - INFO - 0 repos to be downloaded
[27/Nov/2015:08:55:48] - INFO - Finished

It's working properly :)
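For readers who want the logic behind those log lines, here is a rough sketch of the kind of set comparison such a tool could perform. It is not the Reposaurs implementation; the function names, dictionary keys, and sample repository names are assumptions made for illustration.

```python
def plan_sync(wanted, in_database, cloned):
    """Compare three sets of repository names and list the pending actions."""
    return {
        "to_remove": in_database - wanted,          # e.g. newly blacklisted repos
        "to_download": wanted - in_database,        # repos not analyzed yet
        "cloned_not_in_db": cloned - in_database,   # clone directory without a DB entry
        "in_db_without_clone": in_database - cloned,
    }


def apply_plan(plan, in_database, cloned):
    """Return the database and clone sets after carrying out the plan."""
    new_db = (in_database - plan["to_remove"]) | plan["to_download"]
    new_cloned = (cloned - plan["to_remove"]) | plan["to_download"]
    return new_db, new_cloned


# Hypothetical state: one repo was blacklisted after it had been analyzed.
wanted = {"team/tool-a", "team/tool-b"}
in_database = {"team/tool-a", "team/tool-b", "vendor/upstream"}
cloned = {"team/tool-a", "team/tool-b", "vendor/upstream"}

first_run = plan_sync(wanted, in_database, cloned)
in_database, cloned = apply_plan(first_run, in_database, cloned)

# A second run on the updated state has nothing left to remove or download.
second_run = plan_sync(wanted, in_database, cloned)
```

The counts in the logs are consistent with this set view: 1188 - 1 + 33 = 1220 before the removals, and 1180 - 1 + 33 = 1212 after the 8 repos were removed.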