Page MenuHomePhabricator

Check the style of the commit message
Closed, ResolvedPublic

Description

After d3f9b7f0 was merged with an unfortunately long commit message I thought we might be able to check the style of the commit message using Jenkins. Afaik that job should be voting as I don't see a reason why it should be allowed to be merged without applying them. There is basically one exception and that is mentioning URLs with the line length. These are the tests which seem sensible:

  • If there is a ^Bug: T\d+$ line it should be together with the Change-Id.
  • Preferably it should check if the bug id is valid although that might be to complex. But at least it should have the T-prefix.
  • The line length should be below (or at?) 80 of all lines.
  • There should be no I… references but instead always git commit hashes.
  • The Change-Id must be separated by at least one line from the first line. Alternatively every line in the paragraph together with the Change-Id must match ^.+: .+$.
  • The second line is a blank line (aka \A[^\n]+\n\n).
  • Commit hashes which point to commits in pywikibot should only point to past (merged) commits.

Regarding the references being git commit hashes, my argumentation is that the I… prefix only work on gerrit. But if you work locally or on github it won't work. Git hashes on the other hand will also work on gerrit (afaik if they contain at least one letter) but also work on phabricator and github directly as links. You can also directly use them locally instead of searching for the change-id. And searching locally with change-id will only work when the commit with that id is the current commit or an ancestor. While a commit hash will also work when it's in another branch.

Git commit hashes won't work on github though if they point to another repository (e.g. to reference a change to MediaWiki) but a change-id doesn't work too so it's not lost functionality. There has been afaik the argument that you can see it's visible that it's not a pywikibot commit hash (as that could be a commit hash) but then it may be that someone overlooked it and referenced commits to pywikibot with the change-id. And this can be avoided by mentioning the project together with it like “…after it was added in mediawiki (abcedf12)”. The change-id and commit hash are themselves useless when the commit is shown outside WMF's Gerrit.

Event Timeline

XZise raised the priority of this task from to Needs Triage.
XZise updated the task description. (Show Details)
XZise subscribed.

I'm pro this being a generic job for all projects (iow: I don't think making per-project jobs for this is sustainable/wise).

Okay afaik there is nothing pywikibot specific. So if others projects want they could use it too or are the any projects which handle that differently? Part of the rules are also applying Gerrit/Commit message guidelines.

Also “Commit hashes which point to commits in pywikibot should only point to past (merged) commits.” is also applicable to any other project in gerrit. And in we could use gerrit to determine if that commit hash has been merged even if it's in a different commit.

Unfortunately I don't know how Jenkins run but I think I'd be ale to write a short script checking most of them with access to SSH (to run queries against Gerrit) and git (to get the commit message).

XZise set Security to None.

So I went ahead and “just” hacked something together in P1881. It just needs SSH, git and Python (should work on any Python version since 2.7).

Okay I updated P1881 to work more like pep8/flake8 as they are returning the error codes now in the order they appear in the message. It's not doing a SSH request if these two codes requiring it are ignored anyway. I added help text and a few other codes which check that the commit hash lengths are all equal and don't use mixed upper/lower case. I'd appreciate feedback about other repos than pywikibot whether the rules are also apply to their repos and if there are additional rules.

The code is not really perfect. At the moment M903 will be always replaced by M902 and M204 should check all commit hashes and not one… so if someone is using 1234aBcd it should flagged as well as when 1234abcd and 1234ABCD is used within a message. Note that the codes > 900 are only warnings so (unless enabled) won't cause a failure.

hashar triaged this task as Medium priority.Aug 24 2015, 1:52 PM

Change 237719 had a related patch set uploaded (by Legoktm):
Add commit-message-validator.py tool

https://gerrit.wikimedia.org/r/237719

@bd808 started working on this and neither of us were aware of this task I think :P

Hmm my main concern about such a checker is how we'd handle reasonable exceptions. Maybe there is some reason why something can violate the rules and then I don't want that some script is making it impossible.

Change 237719 merged by jenkins-bot:
Add commit-message-validator.py tool

https://gerrit.wikimedia.org/r/237719

For the script I just merged to be used with all repos a version of T111181 needs to be implemented which, as a minimum uses that script on nodepool slaves (running any other existing CI entry points from it is not a blocker for using that script).

Checking if dependent commits are merged: I have no idea if nodepool based jenkins slaves can ssh directly to gerrit, which is probably a good idea to check before spending more work on checking if dependencies are already merged. The check for already merged commit ids would be more difficult as the script would also run on patch upload where it should not fail. But it should on merge. So those should default to warnings. This is only useful for commit ids across repositories, as the dependencies in the same repo can be expressed through which parent the commit is based on. Some of those repos might be on different hosts. So perhaps it is better to introduce special syntax like a Depends-On: line only for corss repo dependencies on the same host. For different hosts we could add the repo location to the line so it knows which repo needs checking and if it knows how to check it do that remotely.

Regarding warnings vs. errors: Things that don't make the script fail will never be seen on gerrit (people don't look in the job output).

If there are things that should be enabled on some repos but not others a config file with a default location might be needed.

If there are IDs for errors and warnings we should make the ID not depend on the warning or error status, as when one wants to change that status one would need to change the ID.

So I went ahead and “just” hacked something together in P1881. It just needs SSH, git and Python (should work on any Python version since 2.7).

Could you look at the patch I just merged and create follow up patches with additional features you want from your script?

Hmm my main concern about such a checker is how we'd handle reasonable exceptions. Maybe there is some reason why something can violate the rules and then I don't want that some script is making it impossible.

Don't ever add rules that have reasonable exceptions. Only things that should never be done should be checked with such a script.

I volunteer the VE repos for testing this before we put it live everywhere.

One of the easy checks in the original task description is not yet implemented:

  • There should be no I… references but instead always git commit hashes.

That is ~ P1881 line 186

Then the non-trivial follow-on from that in the original task description is yet to be implemented:

  • Commit hashes which point to commits in <same repo> should only point to past (merged) commits.

It would also be good to introduce a strict syntax for referring to commits in other repos.

And another hard one that I would like is a standardised way of referring to regressions, with a reference to the commit where the regression occurred. This will help automated identification of changes which need to be cherry-picked to other branches based on whether the regression commit exists in the branch.

It would be good to standardise the list of other rfc822 headers which are allowed in the message tail. e.g. Signed-off-by , Original-author:/Co-author: .
Do other repos normally allow 'Conflicts:\n foo' which git automatically adds?

Please note that my paste hasn't been implemented and I haven't got around to port my code to the implemented variant.

And checking if a commit hash has been merged has the problem that without SSH (@hashar do you know if it's available for the script?) we can only check if a hash has been merged, but it might be that the hash is for another repository if it's not merged. With SSH we could query gerrit to check if it's a future patch.

Also what do you want with “Conflicts”? Suggest more checks (certainly possible) or are you concerned that the current parser doesn't allow it (as it doesn't allow any other Foo: Bar fields in the last paragraph)? In the latter case it shouldn't be a problem as the “Conflicts” section is separated by newlines.

Maybe we can query Phabricator for the sha1, which would identify which repository it is merged into.
For referencing commits in repos not in Phabricator, we could support github slugs (which will auto link when the commit is merged and replicated to github)
For all other cases, a URL can be used instead of, or as a prefix to, a sha1.

Re Conflicts:, I am wondering what other repos are practising. i.e. when should they be kept vs removed?

One of the easy checks in the original task description is not yet implemented:

  • There should be no I… references but instead always git commit hashes.

So I should not be able to make commit messages to refer to uncommitted change sets (or a series of related changes cherry-picked to multiple branches)? Is this desired so that commit history information is self contained? I assume you would intend it to extend to D... references when we finally get switched to Differential instead of Gerrit.

And checking if a commit hash has been merged has the problem that without SSH (@hashar do you know if it's available for the script?) we can only check if a hash has been merged, but it might be that the hash is for another repository if it's not merged. With SSH we could query gerrit to check if it's a future patch.

Let's try hard not to tie this to Gerrit internals. I think that migration to Differential will happen sooner than many people imagine.

One of the easy checks in the original task description is not yet implemented:

  • There should be no I… references but instead always git commit hashes.

So I should not be able to make commit messages to refer to uncommitted change sets (or a series of related changes cherry-picked to multiple branches)? Is this desired so that commit history information is self contained?

I personally prefer that self containment and don't really see when a patch needs to reference a future patch. That future patch may get abandoned to the feature from that patch may get implemented on another patch. Afaik patches should only reference in the past.

And in which case should a patch mention that it's going to be cherry picked in another branch? The same applies to the above, what happens if that cherry pick does not happen? And in any case that cherry pick references to the original patch it comes from. And what happens if the patch gets cherry picked into other branches and are thus not present in the original patch's message.

Mentioning git commit hashes has the big advantage that you don't have to resolve them as I mention in the original post.

I assume you would intend it to extend to D... references when we finally get switched to Differential instead of Gerrit.

I'm not sure how Differential actually works but alone for the reason I mention in the original post, that you need to resolve that manually or need Differential, it should apply to that too.

Please note that, as John mentioned, the current implementation does not pose that limit, so even if it would be enabled it wouldn't be necessary.

And checking if a commit hash has been merged has the problem that without SSH (@hashar do you know if it's available for the script?) we can only check if a hash has been merged, but it might be that the hash is for another repository if it's not merged. With SSH we could query gerrit to check if it's a future patch.

Let's try hard not to tie this to Gerrit internals. I think that migration to Differential will happen sooner than many people imagine.

Well that all depends on the implementation, doesn't it? I already have written a script to work with Gerrit so it's not much work to me. And then it could be “generic” and just be a function which returns whether a commit/changeid/D… reference hasn't been merged.

https access should be possible from Jenkins nodepool slaves, because it will be needed for npm and composer.

One of the easy checks in the original task description is not yet implemented:

  • There should be no I… references but instead always git commit hashes.

So I should not be able to make commit messages to refer to uncommitted change sets (or a series of related changes cherry-picked to multiple branches)? Is this desired so that commit history information is self contained? I assume you would intend it to extend to D... references when we finally get switched to Differential instead of Gerrit.

@bd808, we would still put uncommitted sha1/whatever in commit messages, due to dependencies, but the patch would be -1'd until the dependencies are committed.

The advantage is that with a checker, we can be confident that we'll be reminded to update git hashes if the dependencies were revised during code review and the referenced sha1 became obsolete. It also provides a helpful reminder that some retesting might be needed if the dependency changed before it was committed.

The disadvantage is that the -1 can reduce the amount of code review the patch will see as it will be filtered out of some queries.

Change 303724 had a related patch set uploaded (by Legoktm):
Initial import from integration/config, set up packaging

https://gerrit.wikimedia.org/r/303724

Change 303724 merged by Legoktm:
Initial import from integration/config, set up packaging

https://gerrit.wikimedia.org/r/303724

This is now running against all mediawiki/extensions/VisualEditor commits in non-voting mode.

Two months later, with no issues, would it be possible to put it everywhere in non-voting mode and make it voting for (at least) VE-MW and maybe a couple of others?

Two months later, with no issues, would it be possible to put it everywhere in non-voting mode and make it voting for (at least) VE-MW and maybe a couple of others?

Yes please. And let's put an explicit date (1 month after?) to make it voting everywhere.

Ops asked for some kind of commit message validation for operations/puppet. I went with adding a new env in tox which installs and run commit-message-validator from Pypi. https://gerrit.wikimedia.org/r/#/c/365609/3/tox.ini

So it should be straightforward for pywikibot and any repository already relying on a job running tox.

For other repositories, we had the idea of crafting a common job that would be attached on every single repos. Maybe we can add the existing job everywhere.

Change 368793 had a related patch set uploaded (by Dalba; owner: Dalba):
[pywikibot/core@master] tox.ini: Check commit message guidelines using commit-message-validator

https://gerrit.wikimedia.org/r/368793

Change 368793 merged by jenkins-bot:
[pywikibot/core@master] tox.ini: Check commit message guidelines using commit-message-validator

https://gerrit.wikimedia.org/r/368793

This comment was removed by Dalba.
jayvdb claimed this task.

@Legoktm , @hashar , I guess this task can be closed. It was originally a pywikibot task, but it grew big legs and achieved many things. Re-open if I am mistaken.