Page MenuHomePhabricator

Hieradata yaml style checking
Open, LowPublic

Description

A yaml style guide/linter might help clean things up a bit.

https://github.com/adrienverge/yamllint is an option. I'm using this config:

# /.yamllint
---

yaml-files:
  - '*.yaml'
  - '.yamllint'

rules:
  braces: enable
  brackets: enable
  colons: enable
  commas: enable
  comments:
    level: warning
  comments-indentation:
    level: warning
  document-end: disable
  document-start:
    level: warning
  empty-lines: enable
  empty-values: disable
  hyphens: enable
  indentation:
    spaces: 2
    indent-sequences: consistent
  key-duplicates: enable
  key-ordering: disable
  line-length:
    level: warning
  new-line-at-end-of-file: enable
  new-lines: enable
  octal-values: disable
  quoted-strings: disable
  trailing-spaces: enable
  truthy:
    level: warning

Event Timeline

Change 557060 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] yamllint: First stab at adding yamllint CI tests

https://gerrit.wikimedia.org/r/557060

Change 557061 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] yamllint: test yamllint CI

https://gerrit.wikimedia.org/r/557061

I created a quick CR to test out yamllint and another to make sure it only runs against changed files. I change added the following to the config to the one you provided however there are still a lot of violations

document-start: disable
line-length:
  max: 120

The changesets look great and appear to do the right thing.

The only other thing I could think of doing is to have yamllint run but not fail the build until we've cleaned up a bit. That way we can check against the linter as we go and not break everybody's workflow simultaneously.

The changesets look great and appear to do the right thing.

The only other thing I could think of doing is to have yamllint run but not fail the build until we've cleaned up a bit. That way we can check against the linter as we go and not break everybody's workflow simultaneously.

We probably need a few more eyes on this task before deploying to at least make sure the yaml settings are sane?

Great idea. Lets raise it at the next SRE meeting.

While debugging https://gerrit.wikimedia.org/r/c/operations/puppet/+/631508, @ssingh uncovered a possible bug around how Puppet yaml parser handles unquoted string values:
Given the yaml:

profile::wikidough::dnsdist::webserver:
  host: 0.0.0.0
  port: 8083
  acl:
    - '0.0.0.0/0'
    - ::/0

the catalog compiler renders:

Error: Evaluation Error: Error while evaluating a Function Call, Lookup of key 'profile::wikidough::dnsdist::webserver' failed: Value for key 'profile::wikidough::dnsdist::webserver', in hash returned from data_hash function 'yaml_data', when using location '/srv/jenkins-workspace/puppet-compiler/25616/change/src/hieradata/role/common/wikidough.yaml', has wrong type, expects Puppet::LookupValue, got Hash[Enum['acl', 'host', 'port'], Any, 3, 3] (file: /srv/jenkins-workspace/puppet-compiler/25616/change/src/modules/profile/manifests/wikidough.pp, line: 6, column: 51) on node malmok.wikimedia.org

This was solved by quoting the array value ::/0.

The opaqueness of this issue leads me to believe that there may be benefit in enabling quoted-strings in the yamllint config.

The opaqueness of this issue leads me to believe that there may be benefit in enabling quoted-strings in the yamllint config.

Yes this is an issue with the underlining ruby yaml parser

irb(main):007:0> YAML.safe_load("foo:\n- '::foo'")
=> {"foo"=>["::foo"]}
irb(main):008:0> YAML.safe_load("foo:\n- ::foo")
Traceback (most recent call last):
       16: from /System/Library/Frameworks/Ruby.framework/Versions/2.6/usr/lib/ruby/2.6.0/psych/visitors/visitor.rb:16:in `visit'
.....
Psych::DisallowedClass (Tried to load unspecified class: Symbol)
irb(main):009:0> YAML.safe_load("foo:\n- :foo")
Traceback (most recent call last):
       16: from /System/Library/Frameworks/Ruby.framework/Versions/2.6/usr/lib/ruby/2.6.0/psych/visitors/visitor.rb:16:in `visit'
.....
Psych::DisallowedClass (Tried to load unspecified class: Symbol)

irb(main):010:0> YAML.safe_load("foo:\n- foo")
=> {"foo"=>["foo"]}

for the ipv6 specific case i have uses '0::1' or '0::/0' however i agree enabling quoted-strings is a better solution

It is also worth mentioning that the private git repo on the puppetmaster does use yaml lint so we could possibly start by updating the config there

@jbond & @colewhite I put together a modest alternative proposal, I would love to hear your thoughts.

Problem

YAML is not prescriptive about certain aspects of its formatting, notably the amount of indentation. Having different amounts of indentation in a single file or repository makes YAML files harder to read, e.g.:

Original:

# List of recipes
recipies:
# todo
- pizza: {}
# todo
- lasagna: {}
- barley soup:
      ingredients:
      - barley
      - onions
      - beef
      serves: 4

Formatted, with two space indentation:

# List of recipes
recipies:
  # todo
  - pizza: {}
  # todo
  - lasagna: {}
  - barley soup:
        ingredients:
           - barley
           - onions
           - beef
        serves: 4

Possible Solutions

Linting

One option, as suggested by @colewhite, is to add a linter such as yamllint, which provides suggestions on consistent formatting:

2:1       warning  missing document start "---"  (document-start)
8:7       error    wrong indentation: expected 4 but found 6  (indentation)
Pros
  • Provides the user with suggestions on how to keep their YAML in line with an agreed upon style.
Cons
  • Requires the user to read the suggestions and manually apply them
  • Would require substantial effort to align current hiera data with a style guide
    • With the config provided by @colewhite there are 3599 suggestions from yamllint
    • With the config provided by @jbond there are 1343 suggestions, below are the top ones:
CountError
267error too many spaces after colon (colons)
257warning too few spaces before comment (comments)
153error wrong indentation: expected 2 but found 4 (indentation)
97warning missing starting space in comment (comments)
94error wrong indentation: expected 6 but found 8 (indentation)
80error too many spaces after comma (commas)
61error wrong indentation: expected 6 but found 4 (indentation)
53error wrong indentation: expected 2 but found 0 (indentation)
31error no new line character at the end of file (new-line-at-end-of-file)
30error wrong indentation: expected 4 but found 6 (indentation)
24error too many blank lines (1 > 0) (empty-lines)
19error too many spaces inside braces (braces)
18error wrong indentation: expected 2 but found 1 (indentation)
16warning comment not indented like content (comments-indentation)
15error too few spaces after comma (commas)
10warning truthy value should be one of [false, true] (truthy)

Automatic Formatting

Another option to linting is automatic formatting. With this method a program applies a prescribed format to a file automatically.

There are not that many YAML libraries which support preserving YAML comments when encoding. Two notable ones are Python's ruamel & Go's go-yaml. I decided to use Go's and create a small wrapper around the library, VINYL.

Pros
  • Formatting is done automatically by the program.
  • Current hiera data can be aligned to style with a single run of the formatter.

After applying vinyl to our hiera data, yamllint with a tuned config reports only 66 errors, the majority of those are duplicate mapping key errors, which will be removed in a separate pull request.

Cons
  • Deviations from the automatic formatter are not permitted.

Proposal

Pick vinyl. On balance automatic formatting with vinyl is easier to apply to the existing hiera data and easier to maintain.

Formatting Results

The results of the formatting can be viewed in this patch.

Format Output Style

In general the output format of vinyl closely mirrors the majority of the existing style for our hiera data. However there are a few notable exceptions:

  • Trailing Comment alignment is removed

Before:

cache_hosts:
- 10.64.0.130                    # cp1075.eqiad.wmnet
- 2620:0:861:101:10:64:0:130     # cp1075.eqiad.wmnet
- 10.64.0.131                    # cp1076.eqiad.wmnet
- 2620:0:861:101:10:64:0:131     # cp1076.eqiad.wmnet

After:

cache_hosts:
  - 10.64.0.130 # cp1075.eqiad.wmnet
  - 2620:0:861:101:10:64:0:130 # cp1075.eqiad.wmnet
  - 10.64.0.131 # cp1076.eqiad.wmnet
  - 2620:0:861:101:10:64:0:131 # cp1076.eqiad.wmnet
  • Almost all blank new lines are removed:

Before:

# Additional base overrides
profile::base::remote_syslog: []

# Labs statsd instance
statsd: cloudmetrics1004.eqiad.wmnet:8125

After:

# Additional base overrides
profile::base::remote_syslog: []
# Labs statsd instance
statsd: cloudmetrics1004.eqiad.wmnet:8125

Verification

In order to verify that the formatter did not alter any data, a script was used to verify that the YAML data after being formatted was the same from the perspective of Ruby's & Python's YAML libraries:

Methodology

Editor Plugin

  • Vim, a bare bones format with gq is supplied below, but much better would be a plugin for ALE.
au FileType yaml let &l:formatprg= "vinyl /dev/stdin"
  • Other editors?

Github Pre-Commit Hook

Prior to pushing a commit we can use a git pre-commit hook to verify the format. The downside to pre-commit hooks is that users need to manually install the hook in a given repo.

Jenkins job

As a secondary check we can instruct Jenkins to run the formatter as a CI job

Bugs

  • Trailing commas added to some flow sequences:
diff --git a/hieradata/codfw/profile/swift/proxy.yaml b/hieradata/codfw/profile/swift/proxy.yaml
index 3ffecd054c..1fa09b2250 100644
--- a/hieradata/codfw/profile/swift/proxy.yaml
+++ b/hieradata/codfw/profile/swift/proxy.yaml
@@ -1,12 +1,7 @@
 profile::swift::proxy::proxy_service_host: 'ms-fe.svc.codfw.wmnet'
-profile::swift::proxy::memcached_servers: [
-  'ms-fe2005.codfw.wmnet',
-  'ms-fe2006.codfw.wmnet',
-  'ms-fe2007.codfw.wmnet',
-  'ms-fe2008.codfw.wmnet'
-# no trailing comma!
+profile::swift::proxy::memcached_servers: ['ms-fe2005.codfw.wmnet', 'ms-fe2006.codfw.wmnet', 'ms-fe2007.codfw.wmnet', 'ms-fe2008.codfw.wmnet',
+  # no trailing comma!
 ]
  • Almost all blank new lines are removed. This could probably fixed, but I am not sure it would be accepted upstream.

Change 754114 had a related patch set uploaded (by Jbond; author: JHathaway):

[operations/puppet@production] Hieradata: format yaml with vinyl

https://gerrit.wikimedia.org/r/754114

Thanks for the work on this looks really good, in relation to linting vs automatic formatting i agree with the the conclusion that automatic formatting would be the better option. In relation to tooling i think it would be useful to check some other tools to see if they are free of the aforementioned bugs.

In relation to vinyl itself, it seems (from a scan of the commit histories) that the original author has renamed this to yamlfmt. do you know any more of the history i.e.did this project use to be called vinyl, and why didn't we go with yamlfmt?

copying @Joe's comments from the gerrit patch here

a justification of why we're reformatting yaml, and of why vinyl was specifically chosen. For instance, I'd like to see a consistent use of double quotes rather than single ones, but that doesn't seem to be included here.
why the formatting choices were made (for instance, I don't like the removal of blank lines)

This is an artefact of the tool we pick. for my view, and similar to the conversation on python black. i don't care to much what style ultimately get picked as long as there is consistency

A CI check so that we don't need to reformat everything all the time.

definitely, have updated the original CR with a link to the CI change (We will need to update the CI container to include whichever tool we pick)

Thanks for the work on this looks really good, in relation to linting vs automatic formatting i agree with the the conclusion that automatic formatting would be the better option. In relation to tooling i think it would be useful to check some other tools to see if they are free of the aforementioned bugs.

In relation to vinyl itself, it seems (from a scan of the commit histories) that the original author has renamed this to yamlfmt. do you know any more of the history i.e.did this project use to be called vinyl, and why didn't we go with yamlfmt?

copying @Joe's comments from the gerrit patch here

a justification of why we're reformatting yaml, and of why vinyl was specifically chosen. For instance, I'd like to see a consistent use of double quotes rather than single ones, but that doesn't seem to be included here.
why the formatting choices were made (for instance, I don't like the removal of blank lines)

This is an artefact of the tool we pick. for my view, and similar to the conversation on python black. i don't care to much what style ultimately get picked as long as there is consistency

100% agreed on consistency, I like the general idea and wanted to say +1 on not removing blank lines. For big files (e.g. service::catalog I find it useful to be able to navigate between "sections" (with { and } for vim users)) and for smaller files the blank lines seem to be used to group variables.

From: @Joe
I'm generally not a big fan of reformatting patches, because of how hard they make to reconstruct git history. However, they're often a necessary evil. Still, I would like to see:

  • a justification of why we're reformatting yaml, and of why vinyl was specifically chosen. For instance, I'd like to see a consistent use of double quotes rather than single ones, but that doesn't seem to be included here.
  • why the formatting choices were made (for instance, I don't like the removal of blank lines)
  • A CI check so that we don't need to reformat everything all the time.

@Joe I agree that formatting patches can be very disruptive to git history and should be avoided if possible. My intention is to introduce a formatter that minimizes the number of changes across the repo.

  1. I have created an initial justification in the proposal above, but I happy to expand more on it, if you feel like it is insufficient. VINYL is a thin wrapper around go-yaml, so that is essentially the software being chosen. I am torn on the question of single vs double quotes. Standardizing would create a much larger diff and may be difficult to implement as the escaping rules differ between the two styles, https://yaml.org/spec/1.2.2/#73-flow-scalar-styles.
  1. I consider the removal of blanks lines a bug, @fgiunchedi also mentioned concerns about removing them, as they aid navigation in Vim. If we consider this bug important enough, which I have come to agree with, then we can hold off until the bug is fixed, or an alternative parser is found.
  1. A CI check would definitely be added, more detail is on the proposal above.

Thanks for the work on this looks really good, in relation to linting vs automatic formatting i agree with the the conclusion that automatic formatting would be the better option. In relation to tooling i think it would be useful to check some other tools to see if they are free of the aforementioned bugs.

I have tried Python's ruamel & the Javascript tool prettier, they both unfortunately have their own set of bugs. At present I think the remove of blank lines is the most important bug in go-yaml.

In relation to vinyl itself, it seems (from a scan of the commit histories) that the original author has renamed this to yamlfmt. do you know any more of the history i.e.did this project use to be called vinyl, and why didn't we go with yamlfmt?

Apologies for not making this more clear, vinyl is my personal fork of yamlfmt, plus a few commits. Both libraries are thin wrappers around go-yaml which is doing the vast majority of the work. I created the fork to add a few more options, including duplicate key checking, and to turn off sorting by default, https://github.com/stuart-warren/yamlfmt/pull/16.

100% agreed on consistency, I like the general idea and wanted to say +1 on not removing blank lines. For big files (e.g. service::catalog I find it useful to be able to navigate between "sections" (with { and } for vim users)) and for smaller files the blank lines seem to be used to group variables.

@fgiunchedi thanks for looking over the proposal and for the added insight on using { & } to navigate in yaml docs. I agree that removing blank lines would hinder readability and navigation. I am going to look at getting that bug fixed upstream or explore other options.

Thanks for looking into this! Automatic formatting would be great as long as the output is human-oriented.

  • Deviations from the automatic formatter are not permitted.

Deviations are sometimes necessary to maintain human-readability. When deviations are necessary, they ought to be explicit. Puppet lint supports this with flags.

  • Trailing Comment alignment is removed

This is not human-oriented.

  • Almost all blank new lines are removed:

This hampers navigation and demarcations for major sections of large yaml documents (common comes to mind).

Personally, I much prefer CI to tell me (how) to fix my patch to conform to the style guide than to add commit hooks. Commit hooks tend to never update when new versions are available, dependencies are never available by default when it's time for an OS reinstall, and setting up a commit hook prior to contributing adds an additional burden on casual contributors who "just need this one thing so I can get on with it." I'm not sure how this could best be accomplished. Perhaps Jenkins could (optionally) amend the changeset with my patch run through the automatic formatter for me?

Deviations are sometimes necessary to maintain human-readability. When deviations are necessary, they ought to be explicit. Puppet lint supports this with flags.

The lack of support for deviations is the primary trade off with automatic formatting in comparison to linting. Automatic formatting frees you from needing to spend time to ensure your commit is formatted correctly, but limit's you to the formatters decisions. Ideally the automatic formatter produces easy to read commits for the general case. I think gofmt & shfmt are good examples of programs which produce very legible code for the general case.

  • Trailing Comment alignment is removed

This is not human-oriented.

Trailing comment alignment would be nice to have.

  • Almost all blank new lines are removed:

This hampers navigation and demarcations for major sections of large yaml documents (common comes to mind).

I agree and I consider this a blocker to using go-yaml at present.

Personally, I much prefer CI to tell me (how) to fix my patch to conform to the style guide than to add commit hooks. Commit hooks tend to never update when new versions are available, dependencies are never available by default when it's time for an OS reinstall, and setting up a commit hook prior to contributing adds an additional burden on casual contributors who "just need this one thing so I can get on with it." I'm not sure how this could best be accomplished. Perhaps Jenkins could (optionally) amend the changeset with my patch run through the automatic formatter for me?

Ideally a users local dev environment should give them as strong a signal as possible that their commit is correct. CI is an essential backstop for ensuring a commit does pass all guards, but using it as the primary guard tool slows down the development process. Git pre-commit hooks have their share of problems but they can be effective as a layer of defenses:

  1. Editor runs formatter
  2. Git pre-commit hook checks format
  3. CI checks format

With these three layers the pre-commit hook is helpful to avoid the developer committing their work, only to find out later that their commit was rejected by CI, which found that one of their committed files was not correctly formatted.

Although not optimal, in the worse case scenario in which we will be unable to find/modify a tool to preserve empty lines, we could also consider to have an automatic replacement of all empty lines with the start comment character, so that they will be treated as comments and left untouched, keeping the visual grouping intact.