
Upgrade facter to version 2.4.6
Closed, Resolved (Public)

Description

Upgrade facter across the fleet to version 2.4.6, ensuring that the upgrade is a noop on every host. Manually investigate and fix any host where it is not a noop.

Event Timeline

The upgrade will be performed with these steps:

  • disable puppet reliably (waiting for any in-flight run)
  • compile the catalog and output the facts to a directory
  • upgrade facter
  • compile the catalog again and output the facts to another directory
  • compare the result of the two runs
  • enable puppet
  • remove temporary files

This can be achieved with a single Cumin run. If the output of the two runs differs, it will stop and leave the host with puppet disabled for manual inspection:

sudo cumin -m async -d -b 4 -s 1 'not F:facterversion = "2.4.6"' \
'disable-puppet "Upgrade facter - volans"' \
'puppet agent --onetime --no-daemonize --no-splay --ignorecache --no-usecacheonfailure --noop --vardir /root/__facter_upgrade_current__' \
'DEBIAN_FRONTEND=noninteractive apt-get install -y facter > /tmp/__apt_get_install_facter__' \
'puppet agent --onetime --no-daemonize --no-splay --ignorecache --no-usecacheonfailure --noop --vardir /root/__facter_upgrade_new__' \
'diff -u <(jq . /root/__facter_upgrade_current__/client_data/catalog/$(hostname -f).json | grep -v "\"version\"") \
            <(jq . /root/__facter_upgrade_new__/client_data/catalog/$(hostname -f).json | grep -v "\"version\"")' \
'enable-puppet "Upgrade facter - volans"' \
'rm -rf /root/__facter_upgrade_current__ /root/__facter_upgrade_new__ /tmp/__apt_get_install_facter__'
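The comparison step of the command above can be sketched on its own as a small function. This is only an illustration: the name catalogs_match is hypothetical, and it assumes both files are already pretty-printed (for example with jq .) so that the catalog version sits on a line of its own.

```shell
# Hypothetical helper, not part of the original Cumin run.
# Assumes both inputs are pretty-printed JSON (one key per line),
# so that dropping lines containing "version" removes only the catalog version.
catalogs_match() {
    old_tmp=$(mktemp) || return 2
    new_tmp=$(mktemp) || { rm -f "$old_tmp"; return 2; }

    # Strip the version lines, which differ between any two compilations.
    grep -v '"version"' "$1" > "$old_tmp"
    grep -v '"version"' "$2" > "$new_tmp"

    # diff exits 0 when the files are identical and 1 when they differ.
    status=0
    diff -u "$old_tmp" "$new_tmp" || status=$?

    rm -f "$old_tmp" "$new_tmp"
    return "$status"
}
```

A zero exit status means the catalog did not change apart from its version. A non-zero status is the case in which the run stops and the host is left with puppet disabled for inspection.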

Mentioned in SAL (#wikimedia-operations) [2017-05-24T13:28:46Z] <volans> slowly upgrading facter across the fleet checking is a noop T166203

Mentioned in SAL (#wikimedia-operations) [2017-05-24T16:54:36Z] <volans> pause slowly upgrading facter across the fleet, resuming tomorrow T166203

Mentioned in SAL (#wikimedia-operations) [2017-05-25T08:03:04Z] <volans> resuming slow upgrade of facter across the fleet checking is a noop T166203

Mentioned in SAL (#wikimedia-operations) [2017-05-25T18:47:39Z] <volans> completed upgrade of facter across the fleet T166203 (apart few hosts down)

Facter has been upgraded across the fleet, and the upgrade was verified to be a noop.

The only remaining hosts are a few that are currently offline:
analytics1030.eqiad.wmnet,cp3003.esams.wmnet,labstore[1001-1002].eqiad.wmnet

I'm investigating why the query to puppetdb was returning inconsistent results (against both nitrogen and nihal):

# Direct curl
curl -sG https://nitrogen.eqiad.wmnet/v3/nodes --data-urlencode 'query=["not", ["=", ["fact", "facterversion"], "2.4.6"]]' | grep '"name"'

# Through cumin
sudo cumin --dry-run 'not F:facterversion = "2.4.6"' date

Here is the list of hosts returned by the query above, run a few minutes apart each time:

analytics1030.eqiad.wmnet,cp3003.esams.wmnet,db2079.codfw.wmnet,db1068.eqiad.wmnet,labstore[1001-1002].eqiad.wmnet
analytics1030.eqiad.wmnet,cp3003.esams.wmnet,db2079.codfw.wmnet,db1068.eqiad.wmnet,labstore[1001-1002].eqiad.wmnet,lvs1006.wikimedia.org,tegmen.wikimedia.org
analytics1030.eqiad.wmnet,cp3003.esams.wmnet,db1068.eqiad.wmnet,labstore[1001-1002].eqiad.wmnet,tegmen.wikimedia.org
analytics1030.eqiad.wmnet,cp3003.esams.wmnet,db1068.eqiad.wmnet,labstore[1001-1002].eqiad.wmnet
analytics1030.eqiad.wmnet,cp3003.esams.wmnet,labstore[1001-1002].eqiad.wmnet
analytics1030.eqiad.wmnet,cp3003.esams.wmnet,db1068.eqiad.wmnet,labstore[1001-1002].eqiad.wmnet,phab2001.codfw.wmnet
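To see at a glance which hosts flap between runs, the six results above can be fed to a small counter. This is a sketch only: the function name flapping_hosts is hypothetical, and folded names such as labstore[1001-1002].eqiad.wmnet are treated as a single opaque token rather than expanded.

```shell
# Hypothetical helper: read one comma-separated host list per line and
# print every token that appears in some, but not all, of the lists.
flapping_hosts() {
    awk -F, '
        NF { runs++; for (i = 1; i <= NF; i++) seen[$i]++ }
        END { for (h in seen) if (seen[h] < runs) print seen[h] "/" runs, h }
    ' | LC_ALL=C sort -k2
}

# The six query results listed above, in order.
flapping_hosts <<'EOF'
analytics1030.eqiad.wmnet,cp3003.esams.wmnet,db2079.codfw.wmnet,db1068.eqiad.wmnet,labstore[1001-1002].eqiad.wmnet
analytics1030.eqiad.wmnet,cp3003.esams.wmnet,db2079.codfw.wmnet,db1068.eqiad.wmnet,labstore[1001-1002].eqiad.wmnet,lvs1006.wikimedia.org,tegmen.wikimedia.org
analytics1030.eqiad.wmnet,cp3003.esams.wmnet,db1068.eqiad.wmnet,labstore[1001-1002].eqiad.wmnet,tegmen.wikimedia.org
analytics1030.eqiad.wmnet,cp3003.esams.wmnet,db1068.eqiad.wmnet,labstore[1001-1002].eqiad.wmnet
analytics1030.eqiad.wmnet,cp3003.esams.wmnet,labstore[1001-1002].eqiad.wmnet
analytics1030.eqiad.wmnet,cp3003.esams.wmnet,db1068.eqiad.wmnet,labstore[1001-1002].eqiad.wmnet,phab2001.codfw.wmnet
EOF
```

Each output line shows how many of the six runs returned the host. The offline hosts appear in every run and are therefore not printed, while db1068, db2079, lvs1006, phab2001 and tegmen show up only intermittently.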

@akosiaris by any chance does this ring a bell, as something we already encountered in the past?

So it seems that these flapping results are caused by puppet ALSO running as a daemon on those hosts (thanks @faidon). If a puppet agent invocation ever has a typo in the options around the -t, puppet smartly decides to ignore the wrong option and runs as a daemon in the background.
Some examples were:

/usr/bin/ruby /usr/bin/puppet agent 0tv
/usr/bin/ruby /usr/bin/puppet agent .-t
/usr/bin/ruby /usr/bin/puppet agent -d

As discussed with @faidon I've opened T166371 to create an alert for it.
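A check for stray daemonized agents could look roughly like the filter below. It is a heuristic sketch, not the check implemented for T166371: the function name find_daemonized_puppet is hypothetical, and it assumes that an agent command line without --test, --onetime, --no-daemonize or a short option cluster containing t is running as a daemon.

```shell
# Hypothetical helper: read process command lines on stdin and print the
# agent invocations that carry none of the one-shot / foreground options.
# The [ ] in the pattern keeps this awk process from matching its own
# command line when the input comes from ps.
find_daemonized_puppet() {
    awk '
        /puppet[ ]agent/ {
            foreground = 0
            for (i = 1; i <= NF; i++) {
                if ($i == "--test" || $i == "--onetime" || $i == "--no-daemonize" ||
                    $i ~ /^-[a-zA-Z]*t[a-zA-Z]*$/)
                    foreground = 1
            }
            if (!foreground) print
        }
    '
}
```

For example, ps -eo args | find_daemonized_puppet would list the three command lines shown above, while a regular puppet agent -tv run would be skipped.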

Mentioned in SAL (#wikimedia-operations) [2017-05-26T08:45:30Z] <volans> killed daemonized puppet on tegmen, lvs1006 T166203

Since we're planning to gradually introduce features (e.g. structured facts) that are Facter >= 2 specific, we should probably do the same upgrade on Labs hosts as well. Since this has been tested in prod already, I don't think there's tremendous value in testing it the same way in Labs; we should just do it there with salt/clush/whatever.

faidon triaged this task as Medium priority. May 29 2017, 11:51 AM
faidon added a project: Cloud-Services.

Change 356212 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Force fact stringification servermon reporter

https://gerrit.wikimedia.org/r/356212

Change 356212 merged by Alexandros Kosiaris:
[operations/puppet@production] Force fact stringification in servermon reporter

https://gerrit.wikimedia.org/r/356212

Volans claimed this task.

Facter is upgraded in production on the whole fleet apart from cp3003.esams.wmnet and labstore[1001-1002].eqiad.wmnet, which will need to be reimaged anyway. Labs was also upgraded by Faidon via Salt.