Upgrade facter across the fleet to version 2.4.6, ensuring that it's a no-op on every host. Manually investigate and fix any host where it is not a no-op.
Details
Project | Branch | Lines +/- | Subject
---|---|---|---
operations/puppet | production | +3 -2 | Force fact stringification in servermon reporter
Event Timeline
The upgrade will be performed with these steps:
- disable puppet reliably (waiting for any in-flight run)
- compile the catalog and output the facts to a directory
- upgrade facter
- compile the catalog again and output the facts to another directory
- compare the result of the two runs
- enable puppet
- remove temporary files
This can be achieved with a single Cumin run; if the diff of the two runs differs, it will stop and leave the host with puppet disabled for manual inspection:
sudo cumin -m async -d -b 4 -s 1 'not F:facterversion = "2.4.6"' \
    'disable-puppet "Upgrade facter - volans"' \
    'puppet agent --onetime --no-daemonize --no-splay --ignorecache --no-usecacheonfailure --noop --vardir /root/__facter_upgrade_current__' \
    'DEBIAN_FRONTEND=noninteractive apt-get install -y facter > /tmp/__apt_get_install_facter__' \
    'puppet agent --onetime --no-daemonize --no-splay --ignorecache --no-usecacheonfailure --noop --vardir /root/__facter_upgrade_new__' \
    'diff -u <(jq . /root/__facter_upgrade_current__/client_data/catalog/$(hostname -f).json | grep -v "\"version\"") \
        <(jq . /root/__facter_upgrade_new__/client_data/catalog/$(hostname -f).json | grep -v "\"version\"")' \
    'enable-puppet "Upgrade facter - volans"' \
    'rm -rf /root/__facter_upgrade_current__ /root/__facter_upgrade_new__ /tmp/__apt_get_install_facter__'
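For a host where the run stopped because the two catalogs differed, the same comparison can be repeated interactively on the host itself. A minimal sketch, assuming the temporary vardirs from the Cumin run above are still in place:

# Inspect why the catalog changed after the facter upgrade (run on the host).
HOST="$(hostname -f)"
diff -u \
    <(jq . "/root/__facter_upgrade_current__/client_data/catalog/${HOST}.json" | grep -v '"version"') \
    <(jq . "/root/__facter_upgrade_new__/client_data/catalog/${HOST}.json" | grep -v '"version"') | less
# Once the difference is understood and fixed, re-enable puppet and clean up:
enable-puppet "Upgrade facter - volans"
rm -rf /root/__facter_upgrade_current__ /root/__facter_upgrade_new__ /tmp/__apt_get_install_facter__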
Mentioned in SAL (#wikimedia-operations) [2017-05-24T13:28:46Z] <volans> slowly upgrading facter across the fleet checking is a noop T166203
Mentioned in SAL (#wikimedia-operations) [2017-05-24T16:54:36Z] <volans> pause slowly upgrading facter across the fleet, resuming tomorrow T166203
Mentioned in SAL (#wikimedia-operations) [2017-05-25T08:03:04Z] <volans> resuming slow upgrade of facter across the fleet checking is a noop T166203
Mentioned in SAL (#wikimedia-operations) [2017-05-25T18:47:39Z] <volans> completed upgrade of facter across the fleet T166203 (apart few hosts down)
Facter upgraded and verified to be a no-op across the fleet.
The only remaining hosts are a few that are currently offline:
analytics1030.eqiad.wmnet,cp3003.esams.wmnet,labstore[1001-1002].eqiad.wmnet
I'm investigating why the query to PuppetDB was returning inconsistent results (against both nitrogen and nihal):
# Direct curl
curl -sG https://nitrogen.eqiad.wmnet/v3/nodes --data-urlencode 'query=["not", ["=", ["fact", "facterversion"], "2.4.6"]]' | grep '"name"'
# Through cumin
sudo cumin --dry-run 'not F:facterversion = "2.4.6"' date
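To show the inconsistency, the same PuppetDB query can be repeated in a loop and the returned host set recorded each time. A minimal sketch against the v3 API used above (the jq post-processing and the one-minute interval are my choices):

# Run the query 6 times, one minute apart, printing each result set on one line.
for i in $(seq 1 6); do
    curl -sG https://nitrogen.eqiad.wmnet/v3/nodes \
        --data-urlencode 'query=["not", ["=", ["fact", "facterversion"], "2.4.6"]]' \
        | jq -r '[.[].name] | sort | join(",")'
    sleep 60
done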
Here is the list of hosts that the above query returns when run a few minutes apart, one run per line:
analytics1030.eqiad.wmnet,cp3003.esams.wmnet,db2079.codfw.wmnet,db1068.eqiad.wmnet,labstore[1001-1002].eqiad.wmnet
analytics1030.eqiad.wmnet,cp3003.esams.wmnet,db2079.codfw.wmnet,db1068.eqiad.wmnet,labstore[1001-1002].eqiad.wmnet,lvs1006.wikimedia.org,tegmen.wikimedia.org
analytics1030.eqiad.wmnet,cp3003.esams.wmnet,db1068.eqiad.wmnet,labstore[1001-1002].eqiad.wmnet,tegmen.wikimedia.org
analytics1030.eqiad.wmnet,cp3003.esams.wmnet,db1068.eqiad.wmnet,labstore[1001-1002].eqiad.wmnet
analytics1030.eqiad.wmnet,cp3003.esams.wmnet,labstore[1001-1002].eqiad.wmnet
analytics1030.eqiad.wmnet,cp3003.esams.wmnet,db1068.eqiad.wmnet,labstore[1001-1002].eqiad.wmnet,phab2001.codfw.wmnet
@akosiaris by any chance does this ring a bell, something we already encountered in the past?
So it seems that those flapping results are due to puppet ALSO running as a daemon on those hosts (thanks @faidon): if there is a typo in the options when running puppet agent (e.g. around the -t), puppet smartly decides to ignore the wrong option and run as a daemon in the background.
Some examples were:
/usr/bin/ruby /usr/bin/puppet agent 0tv
/usr/bin/ruby /usr/bin/puppet agent .-t
/usr/bin/ruby /usr/bin/puppet agent -d
As discussed with @faidon I've opened T166371 to create an alert for it.
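In the meantime a quick spot check is possible by looking for agent processes started without --onetime, which with the invocation style used above indicates a daemonized agent. A sketch only; the --onetime heuristic is an assumption, not the logic of the eventual alert:

# List puppet agent processes missing --onetime (likely daemonized agents).
pgrep -af 'puppet agent' | grep -v -- '--onetime' \
    && echo "daemonized puppet agent found, kill it and check why" \
    || echo "ok: no daemonized puppet agent"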
Mentioned in SAL (#wikimedia-operations) [2017-05-26T08:45:30Z] <volans> killed daemonized puppet on tegmen, lvs1006 T166203
Since we're planning to gradually introduce features (e.g. structured facts) that are Facter >= 2 specific, we should probably do the same upgrade on Labs hosts as well. Since this has been tested in prod already, I don't think there's tremendous value in testing it the same way in Labs; we should just do it there with salt/clush/whatever.
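For reference, a one-shot Salt run along these lines would be enough; this is a sketch with an assumed target of all minions, not the exact command that ended up being used:

# Hypothetical Labs-wide install via Salt, 10 minions at a time.
sudo salt -b 10 '*' cmd.run \
    'DEBIAN_FRONTEND=noninteractive apt-get install -y facter'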
Change 356212 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Force fact stringification in servermon reporter
Change 356212 merged by Alexandros Kosiaris:
[operations/puppet@production] Force fact stringification in servermon reporter
Facter is upgraded in production on the whole fleet apart from cp3003.esams.wmnet and labstore[1001-1002].eqiad.wmnet, which will need to be reimaged anyway. Labs was also upgraded by Faidon via Salt.