
[cumin] runs are broken on latest master (2683afe) but working on tag v4.0.0
Closed, Resolved (Public)

Description

When running the latest master I get:

04:05 PM <cumin-python3> ~/Work/wikimedia/cumin  (master|✔)
dcaro@vulcanus$ cumin -c ~/.config/cumin/cumin-config.yaml 'D{localhost}' hostname
1 hosts will be targeted:
localhost
Ok to proceed on 1 hosts? Enter the number of affected hosts to confirm or "q" to quit 1
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
Caught AttributeError exception: 'str' object has no attribute 'command'
PASS |                                                                                                                                                                                                                                                                |   0% (0/1) [00:00<?, ?hosts/s]Exception ignored in: <function tqdm.__del__ at 0x7f5c83beb280>                                                                                                                                                                                                                 |   0% (0/1) [00:00<?, ?hosts/s]
Traceback (most recent call last):
  File "/home/dcaro/.virtualenvs/cumin-python3/lib/python3.9/site-packages/tqdm/_tqdm.py", line 883, in __del__
    self.close()
  File "/home/dcaro/.virtualenvs/cumin-python3/lib/python3.9/site-packages/tqdm/_tqdm.py", line 1088, in close
    self._decr_instances(self)
  File "/home/dcaro/.virtualenvs/cumin-python3/lib/python3.9/site-packages/tqdm/_tqdm.py", line 439, in _decr_instances
    cls._instances.remove(instance)
  File "/usr/lib/python3.9/_weakrefset.py", line 110, in remove
    self.data.remove(ref(item))
KeyError: <weakref at 0x7f5c832eeb30; to 'tqdm' at 0x7f5c832e4d60>
Exception ignored in: <function tqdm.__del__ at 0x7f5c83beb280>
Traceback (most recent call last):
  File "/home/dcaro/.virtualenvs/cumin-python3/lib/python3.9/site-packages/tqdm/_tqdm.py", line 883, in __del__
    self.close()
  File "/home/dcaro/.virtualenvs/cumin-python3/lib/python3.9/site-packages/tqdm/_tqdm.py", line 1088, in close
    self._decr_instances(self)
  File "/home/dcaro/.virtualenvs/cumin-python3/lib/python3.9/site-packages/tqdm/_tqdm.py", line 439, in _decr_instances
    cls._instances.remove(instance)
  File "/usr/lib/python3.9/_weakrefset.py", line 110, in remove
    self.data.remove(ref(item))
KeyError: <weakref at 0x7f5c832ee860; to 'tqdm' at 0x7f5c832f11f0>

The same run on tag v4.0.0 works (the SSH connection failure is expected):

04:07 PM <cumin-python3> ~/Work/wikimedia/cumin  (master|…1)
dcaro@vulcanus$ cumin -c ~/.config/cumin/cumin-config.yaml 'D{localhost}' hostname
1 hosts will be targeted:
localhost
Confirm to continue [y/n]? y
----- OUTPUT of 'hostname' -----
ssh: connect to host localhost port 22: Connection refused
================
PASS |                                                                                                                                                                                                                                                                |   0% (0/1) [00:00<?, ?hosts/s]
FAIL |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00, 72.75hosts/s]
100.0% (1/1) of nodes failed to execute command 'hostname': localhost
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'hostname'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.

Cumin logs:
https://phabricator.wikimedia.org/P14419

It seems that something changed and now a string is being passed to the transport instead of a Command instance.

Relevant config options from the cumin.yaml config:

transport: clustershell
log_file: cumin.log
default_backend: direct

environment: {}

Event Timeline

This seems to be the commit that breaks it:

commit d46cf8b10e675d7076e4188ec3efd47f2cdd052b (HEAD -> master)
Author: Guillaume Lederrey <guillaume.lederrey@wikimedia.org>
Date:   Fri Sep 11 14:09:16 2020 +0200

    Extracting obvious reporting code to a Reporter class.
    
    This is a first cleanup pass, more will follow. This should allow to
    switch between Reporters, to have different kind of output (including
    null output).
    
    This first pass extract obvious reporting code to a single class,
    without much changes to the current structure of the code. It
    already moves all access to tqdm in a single place. More reporting
    done via logging in various methods still needs to be moved.
    
    Bug: T212783
    Change-Id: Iabdf4904e9855544557158f92be7ba7b23070c50

Looking into it.

This is the line where the command becomes a string:
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/cumin/+/refs/heads/master/cumin/transports/clustershell.py#697

Task.task_self().shell(command.command, handler=timer.eh, timeout=command.timeout, nodes=nodeset(node.name))

From that point on, 'command' is a string throughout the code rather than a 'Command'. Testing a change to swap that out...

It seems that in other places a Command is passed instead:

2021-02-19 17:13:45,521 [ERROR 1047676 cumin.cli.main] Failed to execute
Traceback (most recent call last):
  File "/home/dcaro/Work/wikimedia/cumin/cumin/cli.py", line 472, in main
    exit_code = run(args, config)
  File "/home/dcaro/Work/wikimedia/cumin/cumin/cli.py", line 404, in run
    exit_code = worker.execute()
  File "/home/dcaro/Work/wikimedia/cumin/cumin/transports/clustershell.py", line 79, in execute
    self.task.run(timeout=self.timeout, stdin=False)
  File "/home/dcaro/.virtualenvs/cumin-python3/lib/python3.9/site-packages/ClusterShell/Task.py", line 849, in run
    self.resume(timeout)
  File "/home/dcaro/.virtualenvs/cumin-python3/lib/python3.9/site-packages/ClusterShell/Task.py", line 803, in resume
    self._resume()
  File "/home/dcaro/.virtualenvs/cumin-python3/lib/python3.9/site-packages/ClusterShell/Task.py", line 766, in _resume
    self._run(self.timeout)
  File "/home/dcaro/.virtualenvs/cumin-python3/lib/python3.9/site-packages/ClusterShell/Task.py", line 400, in _run
    self._engine.run(timeout)
  File "/home/dcaro/.virtualenvs/cumin-python3/lib/python3.9/site-packages/ClusterShell/Engine/Engine.py", line 723, in run
    self.runloop(timeout)
  File "/home/dcaro/.virtualenvs/cumin-python3/lib/python3.9/site-packages/ClusterShell/Engine/EPoll.py", line 194, in runloop
    self.fire_timers()
  File "/home/dcaro/.virtualenvs/cumin-python3/lib/python3.9/site-packages/ClusterShell/Engine/Engine.py", line 681, in fire_timers
    self.timerq.fire_expired()
  File "/home/dcaro/.virtualenvs/cumin-python3/lib/python3.9/site-packages/ClusterShell/Engine/Engine.py", line 329, in fire_expired
    client._fire()
  File "/home/dcaro/.virtualenvs/cumin-python3/lib/python3.9/site-packages/ClusterShell/Engine/Engine.py", line 189, in _fire
    self.eh.ev_timer(self)
  File "/home/dcaro/Work/wikimedia/cumin/cumin/transports/clustershell.py", line 722, in ev_timer
    restart = self.end_command()
  File "/home/dcaro/Work/wikimedia/cumin/cumin/transports/clustershell.py", line 609, in end_command
    self.reporter.failed_commands_report(nodes=self.nodes, num_hosts=self.counters['total'], commands=self.commands,
  File "/home/dcaro/Work/wikimedia/cumin/cumin/transports/clustershell.py", line 311, in failed_commands_report
    short_command = self._get_short_command(command) if command is not None else ''
  File "/home/dcaro/Work/wikimedia/cumin/cumin/transports/clustershell.py", line 224, in _get_short_command
    return (command[:sublen] + '...' + command[-sublen:]) if len(command) > self.short_command_length else command
TypeError: object of type 'Command' has no len()
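The two failures above are mirror images: one code path receives a str where a Command is expected (AttributeError on '.command'), the other receives a Command where a str is expected (TypeError on len()). A minimal sketch of normalizing the value before reporting, assuming a Command object that exposes the command line as a '.command' attribute. The Command class below is a hypothetical stand-in, not cumin's real class, and command_text, short_command, max_length and the way sublen is derived are illustrative assumptions, not the merged patch:

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Command:
    """Hypothetical stand-in for cumin's Command (only the attributes seen in this task)."""

    command: str
    timeout: Optional[int] = None


def command_text(command: Union[str, Command]) -> str:
    """Return the command line whether given a plain string or a Command-like object."""
    return command if isinstance(command, str) else command.command


def short_command(command: Union[str, Command], max_length: int = 35) -> str:
    """Shorten long commands for reporting, using the same slicing as the traceback above.

    max_length and the sublen formula are assumptions made for this sketch.
    """
    text = command_text(command)
    sublen = (max_length - 3) // 2
    if len(text) > max_length:
        return text[:sublen] + '...' + text[-sublen:]
    return text
```

With this, short_command('hostname') and short_command(Command('hostname')) both return the unshortened string, so the reporting code no longer depends on which of the two types the caller happened to pass.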

Change 665366 had a related patch set uploaded (by David Caro; owner: David Caro):
[operations/software/cumin@master] transport.clustershell: handle str when reporting commands

https://gerrit.wikimedia.org/r/665366

Icinga downtime set by dcaro@cumin1001 for 5:00:00 27 host(s) and their services with reason: Restarting cloudcanary instances

cloudvirt[1012-1014,1016-1039].eqiad.wmnet

Icinga downtime set by dcaro@cumin1001 for 5:00:00 3 host(s) and their services with reason: Restarting cloudcanary instances

cloudvirt-wdqs[1001-1003].eqiad.wmnet

The two Icinga downtime comments above were meant for another task xd

dcaro triaged this task as High priority. Feb 22 2021, 2:15 PM

Change 665366 merged by jenkins-bot:
[operations/software/cumin@master] transport.clustershell: handle str when reporting commands

https://gerrit.wikimedia.org/r/665366

Merged to master

Change 666322 had a related patch set uploaded (by Volans; owner: Volans):
[operations/software/cumin@master] integration tests: add undeduplicated output test

https://gerrit.wikimedia.org/r/666322

Change 666322 merged by jenkins-bot:
[operations/software/cumin@master] integration tests: add undeduplicated output test

https://gerrit.wikimedia.org/r/666322