CI jobs for authdns linting need to run on Stretch
Closed, Resolved · Public

Description

Our authdns servers now run stretch with a newer gdnsd package that only exists on stretch, so CI no longer matches prod: valid changes fail CI, and the mismatch could also produce false positives that pass CI and fail in prod.

Example: This change: https://gerrit.wikimedia.org/r/c/operations/dns/+/462693 fails to lint: https://integration.wikimedia.org/ci/job/operations-dns-lint/5558/console (because the old version of gdnsd on jessie doesn't support this newer RR type).

What needs to get configured where to update the authdns CI infra to run on stretch-wikimedia?

Event Timeline

BBlack triaged this task as Medium priority.Sep 25 2018, 1:51 PM
BBlack created this task.
Restricted Application added a subscriber: Aklapper.

Here is the full context

Current

The operations-dns-lint job runs on Jessie WMCS instances. They are provisioned by puppet and eventually include ::authdns::lint, which in turn does:

# == Class authdns::lint
# A class to lint Wikimedia's authoritative DNS system
#
class authdns::lint {
    include ::authdns::scripts
    include ::geoip

    package { 'gdnsd':
        ensure => installed,
    }

    service { 'gdnsd':
        ensure     => 'stopped',
        enable     => false,
        hasrestart => true,
        hasstatus  => true,
        require    => Package['gdnsd'],
    }
}

There might be some additional magic for geoip, since we don't have a MaxMind database available on WMCS.

The CI infrastructure relying on Jessie WMCS instances is legacy, and there is no plan to add Stretch instances. Old jobs are being migrated to Docker containers.

Todo

We would want to craft a new Docker container using docker-pkg. The container would:

  • be based on a stretch image
  • install the gdnsd and geoip packages
  • probably copy the script from authdns::lint using a temporary clone of operations/puppet

There is one drawback, though. Any time a new version of the gdnsd package is published or the authdns lint script is updated, we will have to rebuild the CI container manually. But that is fairly easy to do.

Shouldn't the container be able to puppetize from authdns::lint directly, which would provide all the pathways for updating the package/config/geoip/etc? Do the docker containers not get access to a puppetmaster?

Change 468578 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Fix authdns-lint for 3.x

https://gerrit.wikimedia.org/r/468578

None of the containers get provisioned via puppet. For CI, puppet was used mostly to provide a list of packages.

The various scripts are in operations/puppet under modules/authdns/. We can surely have the Dockerfile clone the puppet repo, cp all the scripts, and install whatever dependencies are needed.

For the current job, the CI puppetmaster ( integration-puppetmaster01.integration.eqiad.wmflabs ) seems to use geoip::data::maxmind though /etc/GeoIP.conf does not have a proper userid/license. Maybe the geoip-database Debian package is sufficient?

Change 478809 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] authdns config/CI refactor [1/5]

https://gerrit.wikimedia.org/r/478809

Change 478810 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] authdns config/CI refactor [3/5]

https://gerrit.wikimedia.org/r/478810

Change 478811 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] authdns config/CI refactor: [5/5]

https://gerrit.wikimedia.org/r/478811

Change 478812 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] authdns CI/config refactor [2/5]

https://gerrit.wikimedia.org/r/478812

Change 478813 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] authdns config/CI refactor: [4/5]

https://gerrit.wikimedia.org/r/478813

CI runs two jobs for operations/dns:

operations-dns-tabs

Does a shallow git clone on a permanent slave, then runs the shell command:

- shell: |
    #!/bin/bash -e
    echo "Looking for tabulations in files matching: {fileselector}"
    set -x
    (grep --recursive -P '^\t' --exclude-dir='.git' --include='{fileselector}' .) && HAS_TAB=1 || HAS_TAB=0
    exit $HAS_TAB

That has to migrate to a Docker container, which could use git grep as an entrypoint. The placeholder task is T210283, which I have updated with some hints on how we can implement it.
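For reference, the tab check itself is small enough to express without grep. The following is only a sketch of the same logic, not the checker that was eventually added to operations/dns; the function name `find_leading_tabs` and the `"*"` default selector are assumptions made for illustration.

```python
import fnmatch
import os


def find_leading_tabs(root, fileselector="*"):
    """Return (relative_path, line_number) pairs for lines starting with a tab.

    Hypothetical helper: mirrors the legacy grep call, it is not the real
    checker from operations/dns.
    """
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Same effect as grep's --exclude-dir='.git': never descend into it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in sorted(filenames):
            # Same effect as grep's --include: match on the file's basename.
            if not fnmatch.fnmatch(name, fileselector):
                continue
            path = os.path.join(dirpath, name)
            # Read bytes so that non-UTF-8 files cannot raise decode errors.
            with open(path, "rb") as handle:
                for number, line in enumerate(handle, start=1):
                    if line.startswith(b"\t"):
                        hits.append((os.path.relpath(path, root), number))
    return hits
```

A caller would exit non-zero when the returned list is non-empty, which matches the `exit $HAS_TAB` convention of the legacy job.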

operations-dns-lint

My previous comment T205439#4680386 describes how it is provisioned on the slave. The Jenkins job itself ends up being straightforward:

jjb/operations-misc.yaml
- job-template:
    name: 'operations-dns-lint'
    # Depends on production GeoIP T98737
    # Manual workaround has been applied though
    node: contintLabsSlave && DebianJessie
    defaults: use-remoteonly-zuul
    concurrent: true
    triggers:
     - zuul
    builders:
     - shell: |
         mkdir -p "$WORKSPACE"/build
         # Lint script provided via puppet authdns::lint class
         /usr/local/bin/authdns-lint "$WORKSPACE" "$WORKSPACE"/build

Namely, it does a full clone of the repository and then shells out to authdns-lint, which is provided via operations/puppet.git.

@hashar - I'm re-working the tools for the linting checks on operations/dns in the commits linked above, and we should be able to get away from cloning/using operations/puppet completely and just run a few simple commands on a checkout of operations/dns from a Docker image. I'm sure we can add the trivial tab-checking into the main CI run as well.

Change 478809 merged by BBlack:
[operations/puppet@production] authdns config/CI refactor [1/5]

https://gerrit.wikimedia.org/r/478809

Change 478812 merged by BBlack:
[operations/dns@master] authdns config/CI refactor [2/5]

https://gerrit.wikimedia.org/r/478812

Change 478810 merged by BBlack:
[operations/puppet@production] authdns config/CI refactor [3/5]

https://gerrit.wikimedia.org/r/478810

Change 478813 merged by BBlack:
[operations/dns@master] authdns config/CI refactor [4/5]

https://gerrit.wikimedia.org/r/478813

Change 478811 merged by BBlack:
[operations/puppet@production] authdns config/CI refactor [5/5]

https://gerrit.wikimedia.org/r/478811

Change 478929 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] Add tab checker and a run-test.sh CI script

https://gerrit.wikimedia.org/r/478929

Change 478929 merged by BBlack:
[operations/dns@master] Add tab checker and a run-tests.sh CI script

https://gerrit.wikimedia.org/r/478929

@hashar - So where we're at now is that we just need our CI switched to a Docker with the following properties (which is probably simple, but non-obvious to me!):

  • stretch-wikimedia base
  • install packages: python, python-jinja2, python3, gdnsd (the first 3 are from basic stretch, the last needs the version from stretch-wikimedia)
  • update git clone of operations/dns repo
  • run $REPO/utils/run-tests.sh $REPO where $REPO is wherever the updated checkout is at. This will execute all the CI checks (the equivalent of the legacy authdns-lint and tab checks, plus the new zone validator)
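The last two steps above could be wrapped in a small entrypoint. This is a rough sketch only, not the contents of the operations-dnslint image: the helper names `missing_tools` and `run_tests_command` are hypothetical, and it assumes the `python`, `python3`, and `gdnsd` packages each put an executable of the same name on PATH.

```python
import shutil

# Executables expected from the package list above (assumption: each package
# installs a binary of the same name on PATH). python-jinja2 is a library,
# so it is not checked here.
REQUIRED_TOOLS = ("python", "python3", "gdnsd")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the required executables that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def run_tests_command(repo):
    """Build the argv equivalent of `$REPO/utils/run-tests.sh $REPO`."""
    repo = repo.rstrip("/")
    return [repo + "/utils/run-tests.sh", repo]
```

An entrypoint could fail early when `missing_tools()` is non-empty, then pass `run_tests_command(repo)` to `subprocess.run(..., check=True)` so that a failing check fails the build.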

Change 479190 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[integration/config@master] Add docker image operations-dnslint

https://gerrit.wikimedia.org/r/479190

@BBlack that refactoring is awesome!

As for why the task got stuck: I did a first analysis in mid-October and eventually forgot about the task entirely. I should have concluded by stating that I was not planning to figure out a solution immediately.

Giuseppe's patch seems to be on the right track, so we should get a CI job soonish :)

Out of curiosity: how do you ship the GeoDNS database? Is that relying on a package available through Debian?

For basic CI purposes, we don't really need "real" GeoDNS data. We just need some kind of mock file in the correct binary format that the daemon can load successfully. So, as part of the mock-testing setup (where we're mocking other puppet production inputs as well, such as the config file that would supply listen addresses, etc), I put a 261-byte minimal binary database here that suffices for testing: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dns/+/master/utils/mock_etc/geoip/ .

This magical 261-byte binary test database was in turn copied from the upstream gdnsd repo. Over there it's accompanied by this README, which explains how it was generated and why it's not just generated on the fly instead: https://github.com/gdnsd/gdnsd/blob/master/t/014geoip/README .

Change 479204 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] Add README explaining the small binary db here

https://gerrit.wikimedia.org/r/479204

^ Fixing it to be self-explanatory! :)

Change 479204 merged by BBlack:
[operations/dns@master] Add README explaining the small binary db here

https://gerrit.wikimedia.org/r/479204

Change 479190 merged by Giuseppe Lavagetto:
[integration/config@master] Add docker image operations-dnslint

https://gerrit.wikimedia.org/r/479190

So I see @Joe has merged up some Dockerfile stuff. What's our next step to flip operations/dns CI checks over to the new operations-dnslint? AFAIK we're ready for this at any time (current repo passes under the new checks and they're ready to use).

BTW: https://gerrit.wikimedia.org/r/c/operations/dns/+/462693 is a good test job when it's flipped. This fails current linting because of the outdated gdnsd version there, but hypothetically should pass on the new Docker-based stuff with updated software.

A build against master:
https://integration.wikimedia.org/ci/job/operations-dns-lint-docker/1/console (parameters).

I have deployed the CI configuration. From now on, one can comment "check experimental" on a change and the job will be triggered. Once we are happy with the job, it can replace the old one as well as operations-dns-tabs-docker, which is now handled in the test script.

Change 479270 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Promote docker job for operations/dns.git

https://gerrit.wikimedia.org/r/479270

Change 479270 merged by jenkins-bot:
[integration/config@master] Promote docker job for operations/dns.git

https://gerrit.wikimedia.org/r/479270

@BBlack refactored the operations/dns tests to mock everything that was provided by puppet and GeoDNS. Hence the test suite can be run standalone, provided one has the few python dependencies that are required.

@Joe wrapped a new Docker container.

I have done the boilerplate ninja dance with Zuul and Jenkins configs to get the new job up and running. I have confirmed the job to run properly on the master branch.

@BBlack triggered it on a change that requires a new gdnsd version (Gerrit 462693) and the job works!

Thank you so much to both of you!

Change 468578 abandoned by BBlack:
Fix authdns-lint for 3.x

Reason:
This moved in the meantime!

https://gerrit.wikimedia.org/r/468578

Change 479435 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] authdns: remove authdns-lint bits

https://gerrit.wikimedia.org/r/479435

Change 479435 merged by BBlack:
[operations/puppet@production] authdns: remove authdns-lint bits

https://gerrit.wikimedia.org/r/479435