Salt minions randomly crashing when the deployment server grain gets changed
Closed, Declined (Public)

Description

When I switched over the deployment server, puppet ran

grain-ensure set trebuchet_master mira.codfw.wmnet

This worked fine on about 60% of the hosts; on the others (regardless of OS version) it crashed the salt-minion.

This seems to be caused by some race condition; in the minion logs I find:

2016-01-25 10:10:58,656 [salt.log.setup   ][ERROR   ] An un-handled exception was caught by salt's global exception handler:
TypeError: string indices must be integers, not str
Traceback (most recent call last):
  File "/usr/bin/salt-minion", line 14, in <module>
    salt_minion()
  File "/usr/lib/python2.7/dist-packages/salt/scripts.py", line 57, in salt_minion
    minion.start()
  File "/usr/lib/python2.7/dist-packages/salt/__init__.py", line 264, in start
    self.minion.tune_in()
  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 558, in tune_in
    minion['minion'].pillar_refresh()
  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1407, in pillar_refresh
    self.opts['environment'],
  File "/usr/lib/python2.7/dist-packages/salt/pillar/__init__.py", line 91, in compile_pillar
    ret_pillar = self.sreq.crypted_transfer_decode_dictentry(load, dictkey='pillar', tries=3, timeout=7200)
  File "/usr/lib/python2.7/dist-packages/salt/transport/__init__.py", line 243, in crypted_transfer_decode_dictentry
    aes = key.private_decrypt(ret['key'], 4)
TypeError: string indices must be integers, not str

which honestly doesn't give me any clue about the cause.

This seems serious enough to be investigated further though.

Event Timeline

Joe raised the priority of this task to High.
Joe updated the task description.
Joe added projects: SRE, Salt.
Joe subscribed.

Tentatively, this looks like an issue with the singleton cache of master AES keys on the minion side, in a part of the transport code that needs to be updated. Still investigating.

To keep minions from dying we should do this:

in transport/__init__.py, in crypted_transfer_decode_dictentry()

instead of

aes = key.private_decrypt(ret['key'], 4)
pcrypt = salt.crypt.Crypticle(self.opts, aes)
return pcrypt.loads(ret[dictkey])

we should have

try:
    aes = key.private_decrypt(ret['key'], 4)
except (TypeError, KeyError):
    return None
else:
    pcrypt = salt.crypt.Crypticle(self.opts, aes)
    return pcrypt.loads(ret[dictkey])
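
For reference, the traceback's TypeError is what Python raises when a string is indexed with a string key, which suggests that ret was a plain string rather than a dict at that point. Here is a minimal, self-contained sketch of how the guard behaves in that case. The names fake_private_decrypt and decode_dictentry are hypothetical stand-ins for illustration, not Salt code, and the fake decrypt just reverses its input.

```python
# Illustrative sketch only: hypothetical stand-ins, not the Salt implementation.

def fake_private_decrypt(blob):
    """Stand-in for key.private_decrypt(ret['key'], 4); just reverses the input."""
    return blob[::-1]


def decode_dictentry(ret, dictkey="pillar"):
    """Mirror the shape of the proposed guard around ret['key']."""
    try:
        aes = fake_private_decrypt(ret["key"])
    except (TypeError, KeyError):
        # TypeError: ret is not a dict (e.g. a plain string reply).
        # KeyError: ret is a dict but has no 'key' entry.
        return None
    else:
        # As in the proposed change, a missing dictkey here is NOT caught.
        return {"aes": aes, "payload": ret[dictkey]}


if __name__ == "__main__":
    print(decode_dictentry("unexpected string reply"))
    print(decode_dictentry({}))
    print(decode_dictentry({"key": "abc", "pillar": {"role": "web"}}))
```

Note that callers would then have to cope with a None return instead of a pillar dict.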

Forgot to mention, this is actually an issue with the pillar refresh after the grain is set.

I've updated my Docker salt testbed to work with the latest Docker API and the latest WMF packages: https://github.com/apergos/docker-saltcluster

I'll be doing small scale testing to see if I can replicate this problem there; if not, I'll roll out the above change and log errors in hopes of catching the cause. The fix above will at least keep the minions from dying.

No joy reproducing it, so I'll add the above change to our salt packages with logging and update them all.
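
A rough sketch of what the logging could look like, assuming the goal is to record what the minion actually received in place of the expected dict. The logger name and the helper extract_key_field are hypothetical, and the reply is truncated before logging in case it is large or sensitive.

```python
import logging

# Hypothetical logger name, chosen for this sketch only.
log = logging.getLogger("pillar_guard_sketch")


def extract_key_field(ret):
    """Return ret['key'], or None after logging the unexpected reply."""
    try:
        return ret["key"]
    except (TypeError, KeyError) as exc:
        # Record the reply's type and a truncated repr so the cause can be
        # traced later without dumping the whole payload into the log.
        log.error(
            "Unexpected pillar reply (type=%s, error=%s): %.200r",
            type(ret).__name__,
            exc.__class__.__name__,
            ret,
        )
        return None
```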

ArielGlenn moved this task from active to testing needed on the Salt board.