
codfw: wmf-auto reimage failing
Closed, Resolved, Public

Description

Since this morning, all the reimages I am running are failing. Please see the list of failing hosts below. On the first two, after it failed to downtime the host in Icinga, the reimage just gets stuck there. On the last one I get:

Unable to run wmf-auto-reimage-host: could not convert string to float: "Warning: Permanently added the ECDSA host key for IP address '2620:0:860:101:10:192:0:14' to the list of known hosts.\n1606867802

and the reimage just ends.
On the ms-be hosts:

2020-12-01 21:35:59 [INFO] (pt1979) wmf-auto-reimage::print_line: Unable to run wmf-auto-reimage-host: Failed to puppet_first_run
2020-12-01 21:35:59 [ERROR] (pt1979) wmf-auto-reimage::main: Unable to run wmf-auto-reimage-host
Traceback (most recent call last):
  File "/usr/local/sbin/wmf-auto-reimage-host", line 264, in main
    run(args, user, log_path)
  File "/usr/local/sbin/wmf-auto-reimage-host", line 183, in run
    lib.puppet_first_run(args.host)
  File "/usr/local/lib/python3.7/dist-packages/wmf_auto_reimage_lib.py", line 744, in puppet_first_run
    run_cumin('puppet_first_run', host, commands, timeout=10800, installer=True)
  File "/usr/local/lib/python3.7/dist-packages/wmf_auto_reimage_lib.py", line 473, in run_cumin
    raise RuntimeError('Failed to {label}'.format(label=label))
RuntimeError: Failed to puppet_first_run
2020-12-01 21:35:59 [INFO] (pt1979) wmf-auto-reimage::print_line: REIMAGE END | retcode=2
2020-12-01 21:35:59 [INFO] (pt1979) wmf-auto-reimage::phabricator_task_update: Updated Phabricator task 'T265419'

ms-be2060
ms-be2061
db2142

Event Timeline

@Papaul

  • for ms-be2061 it's puppet failing, from the logs in /var/log/wmf-auto-reimage/202012012121_pt1979_23205_ms-be2061_codfw_wmnet_cumin.out:
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, No puppet role has been assigned to this node. (file: /etc/puppet/manifests/site.pp, line: 2497, column: 9) on node ms-be2061.codfw.wmnet
  • for ms-be2060 there have been 6 different runs; in two of them the error is the same as for ms-be2061. I'm not sure whether the other attempts were interrupted by you or by something else.
  • for db2142 the polling for completion of the first puppet run failed because it couldn't parse the string "Warning: Permanently added the ECDSA host key for IP address '2620:0:860:101:10:192:0:14' to the list of known hosts.\n1606867802", as it was expecting just the timestamp. I'll check why that is and, worst case scenario, add a redirection of stderr to /dev/null for now.
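As an illustration of the db2142 parsing failure, here is a minimal sketch of a more tolerant parser. It is not the code from wmf_auto_reimage_lib.py; the function name `parse_timestamp` is hypothetical, and it assumes the timestamp is the last line of the output that parses as a number, with any SSH warning lines appearing before it.

```python
def parse_timestamp(output: str) -> float:
    """Return the last line of `output` that parses as a float.

    Hypothetical helper: lines that are not numbers (for example an SSH
    "Permanently added ... to the list of known hosts" warning) are skipped.
    """
    for line in reversed(output.splitlines()):
        candidate = line.strip()
        if not candidate:
            continue
        try:
            return float(candidate)
        except ValueError:
            # Not a number, keep looking at earlier lines.
            continue
    raise ValueError("no timestamp found in output: {!r}".format(output))


# The exact string reported in this task.
raw = ("Warning: Permanently added the ECDSA host key for IP address "
       "'2620:0:860:101:10:192:0:14' to the list of known hosts.\n1606867802")
print(parse_timestamp(raw))
```

Calling `float()` directly on `raw` raises the "could not convert string to float" error seen above, while scanning line by line ignores the warning. Redirecting stderr to /dev/null, as mentioned, would avoid the warning reaching the parser in the first place.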

@Papaul for db2142, because of the failure Netbox was not updated; you should run the https://netbox.wikimedia.org/extras/scripts/interface_automation/ImportPuppetDB/ Netbox script for it.

For the record, the mapped v6 address was set only at the 3rd puppet run: the custom fact plugins are installed only in the 2nd puppet run, because the custom configs in the puppet config file are set in the 1st puppet run. https://gerrit.wikimedia.org/r/c/operations/puppet/+/644850 should fix it.

Change 644850 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] late_command: configure the run, var and fact paths to match puppet conf

https://gerrit.wikimedia.org/r/644850

Change 644850 merged by Jbond:
[operations/puppet@production] late_command: configure the run, var and fact paths to match puppet conf

https://gerrit.wikimedia.org/r/644850

Volans triaged this task as Medium priority. Dec 2 2020, 4:39 PM

@Papaul feel free to resolve once you've tested the patch above on the next host, and if the ms-be failures were resolved by adding the hosts to site.pp.

Change 644892 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] wmf-auto-reimage: force a second puppet run

https://gerrit.wikimedia.org/r/644892

Change 644892 merged by Volans:
[operations/puppet@production] wmf-auto-reimage: force a second puppet run

https://gerrit.wikimedia.org/r/644892

db2144 failed

2020-12-02 20:52:24 [INFO] (pt1979) wmf-auto-reimage::print_line: Unable to run wmf-auto-reimage-host: hosts must be a non-empty ClusterShell NodeSet or list, got '<class 'ClusterShell.NodeSet.NodeSet'>':
2020-12-02 20:52:24 [ERROR] (pt1979) wmf-auto-reimage::main: Unable to run wmf-auto-reimage-host
cumin.transports.WorkerError: hosts must be a non-empty ClusterShell NodeSet or list, got '<class 'ClusterShell.NodeSet.NodeSet'>':
2020-12-02 20:52:24 [INFO] (pt1979) wmf-auto-reimage::print_line: REIMAGE END | retcode=2
2020-12-02 20:52:25 [INFO] (pt1979) wmf-auto-reimage::phabricator_task_update: Updated Phabricator task 'T267041'
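The WorkerError above is raised when an empty host set is handed to cumin. The thread does not say what the race actually was, so the following is only a generic sketch of one way to guard against a lookup that is transiently empty: retry until it returns at least one host. The names `wait_for_hosts` and `lookup` are hypothetical and are not part of cumin or wmf-auto-reimage.

```python
import time
from typing import Callable, Sequence


def wait_for_hosts(lookup: Callable[[], Sequence[str]],
                   attempts: int = 5,
                   delay: float = 1.0,
                   sleep: Callable[[float], None] = time.sleep) -> Sequence[str]:
    """Call `lookup` until it returns a non-empty result.

    Hypothetical helper: `lookup` stands in for whatever resolves the
    target hosts. Raises RuntimeError if every attempt comes back empty.
    """
    for attempt in range(1, attempts + 1):
        hosts = lookup()
        if len(hosts) > 0:
            return hosts
        if attempt < attempts:
            # Give the backing data source time to catch up before retrying.
            sleep(delay)
    raise RuntimeError("no hosts found after {} attempts".format(attempts))


# Simulated lookup that is empty twice before the host shows up.
_results = iter([[], [], ["db2144.codfw.wmnet"]])
hosts = wait_for_hosts(lambda: next(_results), delay=0)
print(hosts)
```

Failing with an explicit error after a bounded number of attempts keeps the reimage from hanging while still tolerating a short delay before the host becomes visible.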

Change 645080 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] wmf-autp-reimage: prevent race condition

https://gerrit.wikimedia.org/r/645080

Change 645080 merged by Volans:
[operations/puppet@production] wmf-autp-reimage: prevent race condition

https://gerrit.wikimedia.org/r/645080

Change 645102 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] wmf-auto-reimage: safer output parsing

https://gerrit.wikimedia.org/r/645102

Change 645102 merged by Volans:
[operations/puppet@production] wmf-auto-reimage: safer output parsing

https://gerrit.wikimedia.org/r/645102

Papaul claimed this task.

wmf-auto reimage is back working in codfw. @Volans thanks

Change 645168 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] wmf-auto-reimage: fix hack for SSH warning

https://gerrit.wikimedia.org/r/645168

Change 645168 merged by RobH:
[operations/puppet@production] wmf-auto-reimage: fix hack for SSH warning

https://gerrit.wikimedia.org/r/645168