
Add check for changes applied at all runs
Open, MediumPublic

Description

We have cases of poorly written Puppet code that applies some change to a host at every Puppet run. This is usually a symptom that something is wrong, and we should detect it. There are a few particular cases in which this is expected, but we should try to fix those as well.

My proposal is to add a check that ensures that, during the last N Puppet runs (with a large enough N, say 48 to cover 24h), at least one run was a noop. The check could run once an hour or even less frequently.
I don't recall whether that data is available locally on the hosts, but it is certainly available in PuppetDB (see Puppetboard).
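
The core of the proposed check can be sketched as a small function over a host's recent report statuses. This is only an illustration under stated assumptions: the function name and the newest-first list of statuses are hypothetical, the status strings are the ones Puppet reports use ('changed', 'unchanged', 'failed'), and a failed run is treated as "not a noop". How the statuses are fetched is left out.

```python
def changes_on_every_run(statuses, window=48):
    """Return True if none of the `window` most recent runs was a noop.

    `statuses` is a newest-first list of Puppet report statuses
    ('changed', 'unchanged' or 'failed'). Hypothetical helper.
    """
    recent = statuses[:window]
    if len(recent) < window:
        # Not enough history yet (e.g. a freshly installed host): do not alert.
        return False
    # Alert only when no run in the window left the host unchanged.
    return all(status != 'unchanged' for status in recent)
```

Returning False when there is less history than the window is a design choice to avoid alerting on new hosts; the opposite choice would be equally easy to make.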

Event Timeline

Volans created this task.Jan 15 2020, 7:44 PM
Restricted Application added a subscriber: Aklapper.Jan 15 2020, 7:44 PM
MoritzMuehlenhoff triaged this task as Medium priority.Jan 16 2020, 9:19 AM
jbond moved this task from Unsorted 💣 to Friday tasks on the User-jbond board.
jbond added a comment.Fri, Feb 14, 5:03 PM

I had a quick look at this. I don't think anything can be done on the node itself, but as you say we can query PuppetDB. The following script is a quick and dirty example of this. The problem, however, is that notifies count as a change, which means anything using tlsproxy::localssl will have a changed report on every run; see mw2255 as an example. The comment on this specific notify suggests to me that we could probably remove it, but this needs further investigation.

#!/usr/bin/env python3
"""Print every node that has no 'unchanged' report stored in PuppetDB."""
from pypuppetdb import connect
from pypuppetdb.QueryBuilder import AndOperator, EqualsOperator


def main():
    db = connect()
    # A report with status 'unchanged' is a run that applied no changes.
    query_status = EqualsOperator('status', 'unchanged')
    for node in db.nodes():
        query = AndOperator()
        query.add(query_status)
        query.add(EqualsOperator('certname', node.name))
        # One matching report is enough to clear the node.
        unchanged_report = db.reports(query=query, limit=1)
        try:
            next(unchanged_report)
        except StopIteration:
            # No unchanged report: every stored run changed something or failed.
            print('{}: fail'.format(node.name))


if __name__ == '__main__':
    raise SystemExit(main())
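
Regarding the notify caveat mentioned before the script: one possible refinement, sketched here and not part of the script above, is to inspect the events of a changed report and treat the report as noise when the only resources that changed are Notify resources. The helper name is hypothetical, and the sketch assumes each event is a dict with `status` and `resource_type` keys, as in PuppetDB event data; fetching the events is left out.

```python
def only_notify_changes(events):
    """Return True if every successful change in `events` is a Notify resource.

    `events` is a list of dicts with 'status' and 'resource_type' keys
    (assumed shape, hypothetical helper).
    """
    changed = [event for event in events if event['status'] == 'success']
    # A report without any successful change is not "notify noise".
    return bool(changed) and all(
        event['resource_type'] == 'Notify' for event in changed
    )
```

A check built on this would still flag hosts where a real resource changes on every run, while ignoring hosts whose only recurring "change" is a notify.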

Change 572282 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] tlsproxy::localssl: change duplicate definition detection

https://gerrit.wikimedia.org/r/572282

Change 572282 merged by Jbond:
[operations/puppet@production] tlsproxy::localssl: change duplicate definition detection

https://gerrit.wikimedia.org/r/572282

I have deployed the change to remove the tlsproxy noise; however, there are still quite a few boxes which show as changed on every run:

./reports.py | sort
---
acmechief1001.eqiad.wmnet: fail
acmechief-test1001.eqiad.wmnet: fail
an-airflow1001.eqiad.wmnet: fail
analytics1030.eqiad.wmnet: fail
an-tool1006.eqiad.wmnet: fail
cloudcontrol2001-dev.wikimedia.org: fail
cloudcontrol2003-dev.wikimedia.org: fail
cloudmetrics1001.eqiad.wmnet: fail
cloudmetrics1002.eqiad.wmnet: fail
contint1001.wikimedia.org: fail
contint2001.wikimedia.org: fail
ganeti1001.eqiad.wmnet: fail
ganeti1002.eqiad.wmnet: fail
ganeti1003.eqiad.wmnet: fail
ganeti1004.eqiad.wmnet: fail
ganeti1005.eqiad.wmnet: fail
ganeti1006.eqiad.wmnet: fail
ganeti1007.eqiad.wmnet: fail
ganeti1008.eqiad.wmnet: fail
ganeti2001.codfw.wmnet: fail
ganeti2002.codfw.wmnet: fail
ganeti2003.codfw.wmnet: fail
ganeti2004.codfw.wmnet: fail
ganeti2005.codfw.wmnet: fail
ganeti2006.codfw.wmnet: fail
ganeti2007.codfw.wmnet: fail
ganeti2008.codfw.wmnet: fail
ganeti3001.esams.wmnet: fail
ganeti3002.esams.wmnet: fail
ganeti3003.esams.wmnet: fail
ganeti4001.ulsfo.wmnet: fail
ganeti4002.ulsfo.wmnet: fail
ganeti4003.ulsfo.wmnet: fail
ganeti5001.eqsin.wmnet: fail
ganeti5002.eqsin.wmnet: fail
ganeti5003.eqsin.wmnet: fail
ms-fe2005.codfw.wmnet: fail
mwdebug2001.codfw.wmnet: fail
mwdebug2002.codfw.wmnet: fail
mwmaint2001.codfw.wmnet: fail
netmon1002.wikimedia.org: fail
prometheus1003.eqiad.wmnet: fail
prometheus1004.eqiad.wmnet: fail
prometheus2003.codfw.wmnet: fail
prometheus2004.codfw.wmnet: fail
releases2001.codfw.wmnet: fail
stat1005.eqiad.wmnet: fail
vega.codfw.wmnet: fail
wdqs1003.eqiad.wmnet: fail
wdqs1004.eqiad.wmnet: fail
wdqs1005.eqiad.wmnet: fail
wdqs1006.eqiad.wmnet: fail
wdqs1007.eqiad.wmnet: fail
wdqs1008.eqiad.wmnet: fail
wdqs1009.eqiad.wmnet: fail
wdqs1010.eqiad.wmnet: fail
wdqs2001.codfw.wmnet: fail
wdqs2002.codfw.wmnet: fail
wdqs2003.codfw.wmnet: fail
wdqs2004.codfw.wmnet: fail
wdqs2005.codfw.wmnet: fail
wdqs2006.codfw.wmnet: fail
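
Most of the hosts in the list above belong to a handful of families (ganeti, wdqs, prometheus and so on), which suggests a small number of shared culprits. A minimal sketch for summarising such output by host family follows; the function name is hypothetical, and the family is assumed to be the first hostname label with everything from its first digit onwards removed.

```python
import re
from collections import Counter


def count_by_family(lines):
    """Count '<host>: fail' lines per host family (hypothetical helper)."""
    counts = Counter()
    suffix = ': fail'
    for line in lines:
        line = line.strip()
        if not line.endswith(suffix):
            # Skip separators and anything that is not a failure line.
            continue
        host = line[:-len(suffix)]
        label = host.split('.', 1)[0]
        # 'ganeti1001' -> 'ganeti', 'cloudcontrol2001-dev' -> 'cloudcontrol'
        family = re.sub(r'\d.*$', '', label)
        counts[family] += 1
    return counts
```

Calling `count_by_family(lines).most_common()` on the script output lists the families with the most affected hosts first.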

Change 572659 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] acme_chief::server: change exec job to a systemd timer

https://gerrit.wikimedia.org/r/572659

Change 572667 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] profile::ganeti: update the permissions of the users file

https://gerrit.wikimedia.org/r/572667

Change 572659 merged by Jbond:
[operations/puppet@production] acme_chief::server: change exec job to a systemd timer

https://gerrit.wikimedia.org/r/572659

Change 572684 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] query_service::common: ensure we dont run exec on every run

https://gerrit.wikimedia.org/r/572684

Change 572691 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] profile::prometheus::ops_mysql: change exec to a system timer

https://gerrit.wikimedia.org/r/572691

Change 572696 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] openstack::clientpackages::mitaka::buster: change notice to warning

https://gerrit.wikimedia.org/r/572696

Change 572707 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] profile::ci::docker: manage all group membership in data module

https://gerrit.wikimedia.org/r/572707

Change 572667 merged by Jbond:
[operations/puppet@production] profile::ganeti: update the permissions of the users file

https://gerrit.wikimedia.org/r/572667

Change 572829 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] profile::ganeti: update the permissions of the users file

https://gerrit.wikimedia.org/r/572829

Change 572829 merged by Jbond:
[operations/puppet@production] profile::ganeti: update the permissions of the users file

https://gerrit.wikimedia.org/r/572829

Change 572707 merged by Jbond:
[operations/puppet@production] profile::ci::docker: manage all group membership in data module

https://gerrit.wikimedia.org/r/572707

Change 572696 merged by Jbond:
[operations/puppet@production] openstack::clientpackages::mitaka::buster: change notice to warning

https://gerrit.wikimedia.org/r/572696

Change 572691 merged by Jbond:
[operations/puppet@production] profile::prometheus::ops_mysql: change exec to a system timer

https://gerrit.wikimedia.org/r/572691

Change 572684 merged by Jbond:
[operations/puppet@production] query_service::common: ensure we dont run exec on every run

https://gerrit.wikimedia.org/r/572684

Change 573243 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] ores::base: myspell-nl is provided by hunspell-nl in buster

https://gerrit.wikimedia.org/r/573243

Change 573243 merged by Jbond:
[operations/puppet@production] ores::base: myspell-nl is provided by hunspell-nl in buster

https://gerrit.wikimedia.org/r/573243

Change 573261 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] profile::prometheus::ops: ensure rsync service is stopped

https://gerrit.wikimedia.org/r/573261

Change 573265 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] swift::swiftrepl: force directory removal if resource absent

https://gerrit.wikimedia.org/r/573265

Change 573268 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] librenms: librenms and puppet managing files with different permissions

https://gerrit.wikimedia.org/r/573268

Change 573265 merged by Jbond:
[operations/puppet@production] swift::swiftrepl: force directory removal if resource absent

https://gerrit.wikimedia.org/r/573265

Change 573261 merged by Jbond:
[operations/puppet@production] profile::prometheus::ops: ensure rsync service is stopped

https://gerrit.wikimedia.org/r/573261

Change 573287 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] profile::microsites::static_rt: disable the rsync service

https://gerrit.wikimedia.org/r/573287

jbond added a comment.Wed, Feb 19, 5:23 PM

This is what we have left:

  • flowspec1001.eqiad.wmnet: currently down
  • mwdebug1001.eqiad.wmnet: puppet disabled
  • mwdebug2001.codfw.wmnet: puppet disabled
  • mwmaint2001.codfw.wmnet: puppet disabled
  • vega.codfw.wmnet: change waiting review

Change 573287 merged by Dzahn:
[operations/puppet@production] profile::microsites::static_rt: disable the rsync service

https://gerrit.wikimedia.org/r/573287

Dzahn added a subscriber: Dzahn.Wed, Feb 19, 11:05 PM

flowspec1001: setup in progress (Arzhel)

mwdebug: tests in progress (Effie)

mwmaint: not disabled anymore, ran puppet

vega: change merged, ran puppet