
Exclude certain repositories (upstream / inactive) from Gerrit metrics by blacklisting
Closed, Resolved (Public)

Description

Splitting from T103292#1387470 to discuss a feasible way (blacklist?) here:

In http://korma.wmflabs.org/browser/scm.html, under "Authors - Last 30 days", Evan Priestley is ranked #8 with 125 commits.

Evan is a Phabricator maintainer, and we are pulling Phabricator from upstream into our git repositories. Evan never went through our Gerrit code review to merge those patches, and neither did all the other Phabricator committers.

We should probably take the repository list from Gerrit and exclude from our Git metrics every project that is absent from it. Maybe that is what is causing the distortion, or maybe there is still something more.

Repositories to be removed

  • Apertium* (there are many)
  • Civiccrm*
  • DeskMessMirrored
  • Drupal
  • etherpad-lite
  • Gerrit* (many)
  • HHVM
  • integration/* has lots of stuff that seems to be upstream
  • libhubbub
  • librsvg
  • moodle
  • Nginx
  • Nodejs
  • Operations/debs/* Anything there seems to be removable
  • Operations/puppet/* too? Operations/puppet itself seems to be legit, but the repos in the puppet/* subdirectory do not.
  • Operations/software/* too?
  • OTRS
  • Phabricator (many)
  • Phantomjs
  • phpunit
  • ruby-jsduck
  • unicodejs
  • fundraising/* needs review, e.g. is "crm" upstream or our own development?

See also: T101777: Remove deprecated repositories from korma.wmflabs.org code review metrics

Event Timeline

https://www.mediawiki.org/wiki/Upstream_projects#Components and below would be useful.

Instead of asking Bitergia to add this repo and remove that other repo, we would need a way to manage this ourselves. It cannot be that difficult...

Aklapper renamed this task from Exclude pulled upstream code repositories from metrics to Exclude third-party / pulled upstream code repositories from metrics. Jun 29 2015, 12:37 PM
Qgil raised the priority of this task from Medium to High. Jul 2 2015, 1:03 PM

I have gone through the list of repos looking for easy catches, and I have updated the description with the findings.

It looks like a lot of cleaning can be done by filtering whole subdirectories (e.g. operations/debs). I'm sure more will be left after removing the list above, but I don't think the impact will be as big.
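To illustrate the subdirectory filtering idea, here is a minimal Python sketch. The pattern list and the function names are hypothetical, chosen only for illustration and loosely based on the description above; this is not how Korma itself is implemented. Matching is done case-insensitively because the list in the description does not follow the exact casing used in Gerrit.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns, loosely based on the task description.
# "operations/puppet/*" excludes the sub-repositories but keeps
# "operations/puppet" itself, because the pattern requires the slash.
EXCLUDE_PATTERNS = [
    "operations/debs/*",
    "operations/puppet/*",
    "phabricator*",
]


def is_excluded(project, patterns=EXCLUDE_PATTERNS):
    """Return True if the project name matches any exclusion pattern."""
    name = project.lower()
    return any(fnmatchcase(name, pattern.lower()) for pattern in patterns)


def filter_projects(projects, patterns=EXCLUDE_PATTERNS):
    """Keep only the projects that match no exclusion pattern (order preserved)."""
    return [p for p in projects if not is_excluded(p, patterns)]
```

Note that `*` in `fnmatch` patterns also matches `/`, so a single pattern such as `operations/debs/*` covers arbitrarily deep subdirectories.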

Following the work done in T104845: Automated generation of (Gerrit) repositories for Korma, I've updated the list of Gerrit repositories. I didn't find some of the ones listed there, but this may be a good way to start working on this list. You can see the differences at [1].

In addition to this, should we keep that list in the Bitergia-GitHub account, or should it be part of the korma repo at some place? In any case, this should be the central point to work with the list of repos.

[1] https://github.com/Bitergia/mediawiki-repositories/commit/9a3c55614a09b8ec1d4424b1b82481a631d55a0b

Varnish doesn't appear at https://github.com/Bitergia/mediawiki-repositories/blob/master/gerrit_projects.conf anymore, but is still scoring 5th in http://korma.wmflabs.org/browser/gerrit_review_queue.html. Is gerrit_projects.conf already defining the list of projects that korma should feature, or not yet?

As agreed in our meeting some weeks ago, let's keep that list in the Bitergia-GitHub account for now.

Varnish [...] is still scoring 5th in gerrit_review_queue.html

Varnish is not listed anymore. After testing a few items on the first 3 pages, this looks good so far.

After closing some of the tasks related to automating the retrieval of Gerrit repositories, I'll try to detail how we should all go about adding trackers to Korma or removing them from it.

Let's focus only on Gerrit, which will help to divide the problem.

At https://github.com/Bitergia/mediawiki-repositories there are several files, two of which hold Gerrit information:

  • gerrit_trackers.conf
  • gerrit_trackers_blacklist.conf

The first of them contains information retrieved directly from Gerrit, using a command similar to: "ssh -l <user> -p 29418 gerrit.wikimedia.org gerrit ls-projects"
At some point this file should be updated automatically in the Git repository, and it should never be edited by hand by any of us.
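As a rough sketch of how that retrieval could be scripted, the Python snippet below builds the same `gerrit ls-projects` command and splits its output into project names. The function names are hypothetical, the actual automation used for Korma is not shown in this task, and running `list_projects` requires a Gerrit account with an SSH key.

```python
import subprocess


def build_ls_projects_command(user, host="gerrit.wikimedia.org", port=29418):
    """Build the argument list for the ssh command quoted above."""
    return ["ssh", "-l", user, "-p", str(port), host, "gerrit", "ls-projects"]


def parse_ls_projects(output):
    """Split the command output into project names, one per non-empty line."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def list_projects(user):
    """Run the command and return the project names (needs network and SSH access)."""
    result = subprocess.run(
        build_ls_projects_command(user),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if ssh exits with an error
    )
    return parse_ls_projects(result.stdout)
```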

The second file lists all of the trackers found in the first file that shouldn't be part of the analysis. In other words: this is the list of projects to ignore during the analysis.

So the 'blacklist' file is the one we are all expected to update.

With respect to the internal machinery, the process is as follows:

  • The list of gerrit projects is retrieved each day.
  • If that list has changed, a new version of gerrit_trackers.conf is uploaded to the Git repo.
  • The resulting list of projects to analyze is calculated as gerrit_trackers minus those found in gerrit_trackers_blacklist.
    • If a new project has been added to the blacklist, it is removed from the database.
  • In addition to this, if a project has been removed from the Gerrit list, the machinery detects the change and that project is also removed from the database.
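The steps above amount to a couple of set differences. The Python sketch below illustrates that logic only; it is not Bitergia's actual code. It assumes, hypothetically, that both .conf files contain one project name per line (with blank lines and lines starting with "#" ignored), and all function names are made up for illustration.

```python
def parse_project_list(text):
    """Parse a list of projects, assuming one name per line.

    Blank lines and lines starting with '#' are skipped (assumed format).
    """
    projects = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            projects.add(line)
    return projects


def plan_update(gerrit_projects, blacklist, in_database):
    """Work out what to analyze, what to add and what to drop.

    to_analyze: projects listed in Gerrit and not blacklisted.
    to_add:     projects to analyze that are not in the database yet.
    to_remove:  projects in the database that were blacklisted or
                have disappeared from the Gerrit list.
    """
    to_analyze = gerrit_projects - blacklist
    to_add = to_analyze - in_database
    to_remove = in_database - to_analyze
    return to_analyze, to_add, to_remove
```

With this shape, blacklisting a project and deleting it from Gerrit are handled by the same rule: anything in the database that is no longer in the list to analyze gets removed.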

@Qgil, as you commented in another ticket, ExternalArticles should be part of the blacklist, although I still have to add it.

@Qgil, thank you for the documentation :).

I'll close this task and the blocking one, as we have finished the development. The process for Gerrit should be fully automated; it is now a matter of updating the blacklist file to fit your requirements.

Aklapper renamed this task from Exclude third-party / pulled upstream code repositories from metrics to Exclude certain repositories (upstream / inactive) from Gerrit metrics by blacklisting. Oct 2 2015, 11:20 AM