
wmf-auto-reimage errors: failure to downtime (w/ no rename), python gc whine
Open, Low, Public

Description

When reimaging a few MediaWiki app servers, I encountered two sorts of errors.

Example 1:

19:48:01 | mw2232.codfw.wmnet | Scheduled delayed downtime on Icinga
19:48:01 | mw2232.codfw.wmnet | Started first puppet run (sit back, relax, and enjoy the wait)
START - Cookbook sre.hosts.downtime
Forcing a Puppet run on the Icinga server
Running Puppet with args --quiet --attempts 30 on 1 hosts: icinga1001.wikimedia.org
Exception raised while executing cookbook sre.hosts.downtime:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/cookbook.py", line 409, in _run
    ret = self.module.run(args, self.spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/downtime.py", line 64, in run
    puppet.run(quiet=True, attempts=30)
  File "/usr/lib/python3/dist-packages/spicerack/puppet.py", line 170, in run
    self._remote_hosts.run_sync(Command(command, timeout=timeout), batch_size=batch_size)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 476, in run_sync
    batch_sleep=batch_sleep, is_safe=is_safe)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 646, in _execute
    raise RemoteExecutionError(ret, 'Cumin execution failed')
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
20:34:43 | mw2232.codfw.wmnet | First Puppet run completed
20:34:43 | WARNING: failed to downtime host on Icinga, wmf-downtime-host returned 99
20:35:29 | cumin1001.eqiad.wmnet | Puppet run completed

Example 2:

20:41:17 | mw2234.codfw.wmnet | Puppet run checked
20:41:30 | mw2234.codfw.wmnet | Successfully tested with Apache fast-test
20:41:30 | To set back the conftool status to their previous values run:
sudo -i confctl select 'name=mw2234.codfw.wmnet,service=apache2,cluster=appserver,dc=codfw' set/pooled=yes
sudo -i confctl select 'name=mw2234.codfw.wmnet,service=nginx,cluster=appserver,dc=codfw' set/pooled=yes
20:41:30 | mw2234.codfw.wmnet | Reimage completed
20:41:30 | mw2234.codfw.wmnet | REIMAGE END | retcode=0
Fatal Python error: GC object already tracked

Thread 0x00007f81171c3700 (most recent call first):
  File "/usr/lib/python3/dist-packages/tqdm/_tqdm.py", line 97 in run
  File "/usr/lib/python3.5/threading.py", line 914 in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 882 in _bootstrap

Current thread 0x00007f8121b47700 (most recent call first):
Aborted

The downtime failure just meant Icinga noise in the channel, while the second one did not have any side effects. It would still be nice to get rid of them, though.

Related Objects

Event Timeline

ArielGlenn triaged this task as Medium priority. Dec 5 2019, 11:02 AM
ArielGlenn created this task.
Restricted Application added a subscriber: Aklapper. Dec 5 2019, 11:02 AM
Volans added a subscriber: jijiki. Dec 5 2019, 11:45 AM

For the first one, the downtime cookbook failed to run Puppet on the active Icinga host to pick up the definitions of the reimaged host that it needs to downtime. Given how slow Puppet is on the Icinga host, if there are multiple runs at the same time we can hit the timeout even with --attempts 30.
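As a rough illustration (not the real spicerack or wmf-auto-reimage API), the caller could retry the downtime step once or twice before accepting the failure, since the script already carries on and just logs a warning; run_downtime_cookbook below is a placeholder for whatever invokes sre.hosts.downtime and returns its exit code:

import logging
import time

logger = logging.getLogger(__name__)

def downtime_with_retry(host, run_downtime_cookbook, attempts=2, delay=60):
    """Try to downtime `host` on Icinga, retrying if the cookbook fails."""
    for attempt in range(1, attempts + 1):
        ret = run_downtime_cookbook(host)  # placeholder: returns the cookbook exit code
        if ret == 0:
            return True
        logger.warning("downtime of %s failed with exit code %d (attempt %d/%d)",
                       host, ret, attempt, attempts)
        if attempt < attempts:
            time.sleep(delay)  # let concurrent Puppet runs on the Icinga host drain
    # Give up and carry on: the reimage itself is unaffected, this only means Icinga noise.
    return False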
My suggestion for running parallel reimages is to open 2~3 tmux windows, run sequential reimages in each of them, and start them a few minutes apart from each other.
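Roughly, each tmux window would run one sequential "lane" like the sketch below; the wmf-auto-reimage invocation and its flags are illustrative only, and the stagger is passed as an initial delay in minutes:

#!/usr/bin/env python3
# One "lane" of sequential reimages: sleep an initial offset so lanes started in
# different tmux windows don't force Puppet runs on the Icinga host at the same time.
import subprocess
import sys
import time

def run_lane(hosts, start_delay_minutes=0):
    time.sleep(start_delay_minutes * 60)  # offset this lane from the others
    for host in hosts:
        # Sequential within the lane: start the next host only after the previous
        # reimage has finished. The exact command line is illustrative.
        subprocess.run(["sudo", "-i", "wmf-auto-reimage", host], check=False)

if __name__ == "__main__":
    # e.g. lane.py 0 mw2232.codfw.wmnet mw2233.codfw.wmnet (second window uses 3, third 6)
    run_lane(sys.argv[2:], int(sys.argv[1]))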

For the second one (the Python crash at shutdown) I've already discussed it offline with @jijiki, as it happened to her too. It seems to happen only when multiple reimages are run at the same time, and it doesn't affect the execution because it's triggered while Python is doing its final shutdown, after the run has completed.
Funnily enough we encountered the same thing with Python 3.4 too, see P7462.

So far it has happened only a few times, so it wasn't deemed worth the time to deep dive into it, as it might be a long rabbit hole; but if the frequency becomes unacceptable I'll find the time to look at it more in depth.

jcrespo lowered the priority of this task from Medium to Low. Dec 12 2019, 6:28 PM
jcrespo added a project: SRE-tools.
jcrespo moved this task from Backlog to Acknowledged on the Operations board.
jcrespo added a subscriber: jcrespo.

Low (for now) based on Riccardo's comments.