
sre.hosts.decommission -> generate_dns_snippets -> Cumin execution failed
Closed, Resolved · Public

Description

[cumin1001:~] $ sudo cookbook sre.hosts.decommission -t T274023 mwdebug1002.eqiad.wmnet
START - Cookbook sre.hosts.decommission
>>> ATTENTION: destructive action for 1 hosts: mwdebug1002.eqiad.wmnet
Are you sure to proceed?
Type "go" to proceed or "abort" to interrupt the execution
> go
Looking for matches in puppetmaster1001.eqiad.wmnet:/var/lib/git/operations/puppet
----- OUTPUT of 'cd /var/lib/git/...46[^0-9A-Za-z])'' -----                                                                
conftool-data/node/eqiad.yaml:    mwdebug1002.eqiad.wmnet: [apache2]                                                       
modules/install_server/files/dhcpd/linux-host-entries.ttyS0-115200:    fixed-address mwdebug1002.eqiad.wmnet;              
modules/profile/files/trafficserver/x-wikimedia-debug-routing.lua:        ["mwdebug1002.eqiad.wmnet"] = "mwdebug1002.eqiad.wmnet",                                                                                                                    
================                                                                                                           
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.23hosts/s]          
FAIL |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'cd /var/lib/git/...46[^0-9A-Za-z])''.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Looking for matches in puppetmaster1001.eqiad.wmnet:/srv/private
----- OUTPUT of 'cd /srv/private ...46[^0-9A-Za-z])'' -----                                                                
================                                                                                                           
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.72hosts/s]          
FAIL |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'cd /srv/private ...46[^0-9A-Za-z])''.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Looking for matches in deploy1001.eqiad.wmnet:/srv/mediawiki-staging
----- OUTPUT of 'cd /srv/mediawik...46[^0-9A-Za-z])'' -----                                                                
debug.json:    "mwdebug1002.eqiad.wmnet",                                                                                  
================                                                                                                           
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.16s/hosts]          
FAIL |                                                                           |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'cd /srv/mediawik...46[^0-9A-Za-z])''.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
>>> Found match(es) in the Puppet or mediawiki-config repositories (see above), proceed anyway?
Type "go" to proceed or "abort" to interrupt the execution
> go
Looking for Kerberos credentials on KDC kadmin node.
----- OUTPUT of 'find /srv/kerber...02.eqiad.wmnet*"' -----                                                                
================                                                                                                           
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.79hosts/s]          
FAIL |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'find /srv/kerber...02.eqiad.wmnet*"'.
----- OUTPUT of '/usr/local/sbin/...02.eqiad.wmnet*"' -----                                                                
================                                                                                                           
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.88hosts/s]          
FAIL |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/local/sbin/...02.eqiad.wmnet*"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
No Kerberos credentials found.
Scheduling downtime on Icinga server alert1001.wikimedia.org for hosts: ['mwdebug1002.eqiad.wmnet']
----- OUTPUT of 'icinga-downtime ...n1001 - T274023"' -----                                                                
================                                                                                                           
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  2.45hosts/s]          
FAIL |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'icinga-downtime ...n1001 - T274023"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Downtimed host on Icinga
Found Ganeti VM
Shutting down VM mwdebug1002.eqiad.wmnet in cluster ganeti01.svc.eqiad.wmnet
----- OUTPUT of 'gnt-instance shu...1002.eqiad.wmnet' -----                                                                
Waiting for job 1134523 for mwdebug1002.eqiad.wmnet ...                                                                    
================                                                                                                           
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:10<00:00, 10.66s/hosts]          
FAIL |                                                                           |   0% (0/1) [00:10<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'gnt-instance shu...1002.eqiad.wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
VM shutdown
----- OUTPUT of 'systemctl start ...iad_sync.service' -----                                                                
================                                                                                                           
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.87hosts/s]          
FAIL |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'systemctl start ...iad_sync.service'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
Sleeping for 20s to avoid race conditions...
Removed host mwdebug1002.eqiad.wmnet from Debmonitor
Removed from DebMonitor
----- OUTPUT of 'puppet node clea...1002.eqiad.wmnet' -----                                                                
Notice: Revoked certificate with serial 2340                                                                               
Notice: Revoked certificate with serial 3962                                                                               
Notice: Revoked certificate with serial 5498                                                                               
mwdebug1002.eqiad.wmnet                                                                                                    
================                                                                                                           
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.96s/hosts]          
FAIL |                                                                           |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'puppet node clea...1002.eqiad.wmnet'.
----- OUTPUT of 'puppet node deac...1002.eqiad.wmnet' -----                                                                
Submitted 'deactivate node' for mwdebug1002.eqiad.wmnet with UUID db4a3b35-c781-4438-ae8e-618a1689d227                     
================                                                                                                           
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.83s/hosts]          
FAIL |                                                                           |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'puppet node deac...1002.eqiad.wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Removed from Puppet master and PuppetDB
Issuing Ganeti remove command, it can take up to 15 minutes...
Removing VM mwdebug1002.eqiad.wmnet in cluster ganeti01.svc.eqiad.wmnet. This may take a few minutes.
----- OUTPUT of 'gnt-instance rem...1002.eqiad.wmnet' -----                                                                
================                                                                                                           
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:05<00:00,  5.50s/hosts]          
FAIL |                                                                           |   0% (0/1) [00:05<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'gnt-instance rem...1002.eqiad.wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
VM removed
----- OUTPUT of 'systemctl start ...iad_sync.service' -----                                                                
================                                                                                                           
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.93hosts/s]          
FAIL |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'systemctl start ...iad_sync.service'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
Generating the DNS records from Netbox data. It will take a couple of minutes.
----- OUTPUT of 'cd /tmp && runus...e asset tag one"' -----                                                                
2021-02-12 22:56:23,745 [INFO] Gathering devices, interfaces, addresses and prefixes from Netbox                           
2021-02-12 23:02:15,871 [ERROR] Failed to run                                                                              
Traceback (most recent call last):                                                                                         
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 686, in main                                     
    batch_status, ret_code = run_commit(args, config, tmpdir)                                                              
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 590, in run_commit                               
    netbox.collect()                                                                                                       
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 156, in collect                                  
    self._collect_device(device, True)                                                                                     
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 197, in _collect_device                          
    if self.addresses[primary.id].dns_name:                                                                                
KeyError: 6398                                                                                                             
================                                                                                                           
PASS |                                                                           |   0% (0/1) [05:53<?, ?hosts/s]          
FAIL |██████████████████████████████████████████████████████████████████| 100% (1/1) [05:53<00:00, 353.00s/hosts]
100.0% (1/1) of nodes failed to execute command 'cd /tmp && runus...e asset tag one"': netbox1001.wikimedia.org
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'cd /tmp && runus...e asset tag one"'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
Failed to run the sre.dns.netbox cookbook
Traceback (most recent call last):
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/decommission.py", line 365, in run
    dns_netbox_run(dns_netbox_args, spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/dns/netbox.py", line 73, in run
    results = netbox_host.run_sync(command, is_safe=True)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 475, in run_sync
    batch_sleep=batch_sleep, is_safe=is_safe)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 637, in _execute
    raise RemoteExecutionError(ret, 'Cumin execution failed')
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
**Failed to run the sre.dns.netbox cookbook**: Cumin execution failed (exit_code=2)
ERROR: some step failed, check the task updates.
Updated Phabricator task T274023
END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)

Event Timeline

Correct me if I'm wrong, but is this the VM that is replacing another VM of the same name? There might be some assumptions in Netbox about how things are connected.

It's just an attempt to remove an existing VM (after reimaging the existing VM resulted in it not coming back from reboot).

Mentioned in SAL (#wikimedia-operations) [2021-02-13T00:26:49Z] <mutante> ganeti - attempting to recreate VM mwdebug1002 with cookbook that was previously deleted manually (T274689 T274023)

Summary:

  • existing VM, actually very old (from 2016), worked fine
  • tried to install a new distro version on it, which was no problem on other VMs in codfw, but this one simply did not come back from reboot
  • checked the console: nothing; gnt-instance said the status was UP, Icinga disagreed and said it was down, and I couldn't SSH to it
  • tried to restart it again: nothing on the console, nothing happened
  • decided to just delete it, using the decom script I am supposed to use for removing VMs
  • the decom script failed with the errors reported here
  • manually deleted it with gnt-instance, to then attempt to recreate it with makevm under the same name
  • the makevm cookbook suggests adding a public IP even though I said I need a private one:
sudo cookbook sre.ganeti.makevm --vcpus 4 --memory 4 --disk 50 --network private eqiad_A mwdebug1002.eqiad.wmnet
+mwdebug1002                              1H IN A 208.80.154.6
+mwdebug1002                              1H IN AAAA 2620:0:861:1:208:80:154:6
+mwdebug1002                              1H IN AAAA 2620:0:861:101:10:64:0:46

I ABORT because that is clearly wrong; it's not supposed to have a public IP.

I am now stuck on how to properly resolve this.

I'm doing a little looking around on this.

The last part about the public IP might have just been due to the ordering of the parameters I passed to makevm... give me a minute, I'm trying that one more time.

Nope, it wasn't. It is trying to assign a public IP again.

Cas is cleaning up Netbox and we will try it again.

+mwdebug1002                              1H IN A 10.64.0.93                                                                                
+mwdebug1002                              1H IN A 208.80.154.6                                                                              
+mwdebug1002                              1H IN AAAA 2620:0:861:1:208:80:154:6                                                              
+mwdebug1002                              1H IN AAAA 2620:0:861:101:10:64:0:46                                                              
+mwdebug1002                              1H IN AAAA 2620:0:861:101:10:64:0:93                                                              
 mwdebug1003                              1H IN A 10.64.32.9                                                                                
 mwdebug1003                              1H IN AAAA 2620:0:861:103:10:64:32:9

I've successfully got makevm to work as expected after deleting the IP addresses for mwdebug1002 that were left behind when the DNS generation failed at the decom step. That code failure in the DNS generation may be some VM/physical confusion, but it's not obvious from the code why this would have happened.
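To illustrate the failure mode, here is a minimal, self-contained reconstruction of the pattern in the traceback, with a defensive .get() guard; the collect_device function and its dict-based records are my own stand-ins, not the actual generate_dns_snippets.py code:

import logging

logger = logging.getLogger(__name__)

# Hypothetical reconstruction based only on the traceback above: collect()
# builds a dict of address id -> record, and _collect_device() then looks
# up each device's primary IP in it. If the decommission deletes the VM's
# addresses from Netbox between those two reads, the primary IP id (6398
# here) is missing from the dict and a plain [] lookup raises KeyError.
def collect_device(addresses, device):
    primary = device["primary_ip"]
    record = addresses.get(primary["id"])  # defensive .get() instead of []
    if record is None:
        logger.warning("primary IP %d of %s not in collected addresses, skipping",
                       primary["id"], device["name"])
        return
    if record["dns_name"]:
        print(record["dns_name"])

# Reproduces the race: address 6398 was deleted mid-run, so it is absent.
collect_device({}, {"name": "mwdebug1002", "primary_ip": {"id": 6398}})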

To be clear about the timeline:

  • Daniel attempted to reimage the Ganeti host, which failed
  • Attempted to reboot the Ganeti host, which failed
  • Attempted to decom the Ganeti host, which mostly worked (the host was removed, but DNS generation failed)
  • Attempted to create a new Ganeti host under the same name, which added a public IP address for some reason, and was aborted at DNS generation due to the weird diff above
  • Repeated the attempt, aborted for the same reason
  • I removed the IP addresses from Netbox and reattempted makevm, which worked.

So I think Daniel is unblocked, but there are open questions:

  • Why did decom fail for this box?
  • Why would makevm allocate a public address if the private address already existed, and/or why would it do this at all when --network private is passed? Or did this actually happen?

There's also a notable UX issue with makevm: when the operator is prompted to review the diff from the DNS generation, aborting does not actually clean up the changes made up to that point, so the addresses have already been allocated whether you like it or not, which is not the expected behavior.

This is also not an expected use case (basically recreating the same box), but I think it's a reasonable one, and it currently may always necessitate some manual cleanup of leftover IPs. It might be better if makevm offered an option to create a new VM with existing IPs. There could be safeguards, such as checking that the IP is currently unconnected to any VM and that no Ganeti host exists under the requested name, but it seems a reasonable compromise.
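As a sketch of what such a safeguard could look like, using the pynetbox client (the calls below are standard pynetbox API, but the URL, token, short-name handling, and the idea of wiring this into makevm are all assumptions on my part):

import pynetbox

def can_reuse_name(nb, fqdn):
    # Refuse if a VM already exists in Netbox under the requested name
    # (assumes Netbox stores the short hostname for VMs).
    hostname = fqdn.split(".")[0]
    if nb.virtualization.virtual_machines.get(name=hostname) is not None:
        return False
    # Leftover IPs are only reusable if they are no longer assigned to
    # any interface (assigned_object is None on Netbox >= 2.9).
    leftovers = nb.ipam.ip_addresses.filter(dns_name=fqdn)
    return all(ip.assigned_object is None for ip in leftovers)

nb = pynetbox.api("https://netbox.example.org", token="redacted")  # placeholders
print(can_reuse_name(nb, "mwdebug1002.eqiad.wmnet"))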

Yes, adding a revert of the Netbox changes on failure to the makevm cookbook was already on the TODO list. I didn't check whether there is already a task for it, but it's something we should definitely do.
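For reference, such a revert could follow the usual pattern of remembering every object created and deleting them in reverse order on failure; this is a minimal sketch with stand-in helpers, not the actual cookbook code:

class AbortError(Exception):
    """Raised when the operator answers 'abort' at the DNS diff prompt."""

def allocate_ips(fqdn, network):
    # Stand-in for the real Netbox IP allocation; it would return pynetbox
    # records, each of which exposes a .delete() method.
    raise NotImplementedError

def make_vm_with_rollback(fqdn):
    created = []  # Netbox records created so far, in creation order
    try:
        created.extend(allocate_ips(fqdn, network="private"))
        # ... create the VM, generate the DNS diff, prompt the operator ...
    except Exception:
        # Revert in reverse order so dependent objects are removed first,
        # leaving no leftover IPs behind for the next attempt.
        for record in reversed(created):
            record.delete()
        raise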

@crusnov Anything I should do for this task?

For the record: yes, I am unblocked, and this issue did NOT show up when I reimaged mwdebug1001, simply because that VM came back from reboot as normally expected, so nothing triggered an attempt to delete a VM.

The issue that remains, though, is whether this will happen again when the next VM is decommissioned.

Change 668505 had a related patch set uploaded (by Volans; owner: Volans):
[operations/cookbooks@master] sre.hosts.decommission: temporary fix for Netbox

https://gerrit.wikimedia.org/r/668505

Change 668505 merged by jenkins-bot:
[operations/cookbooks@master] sre.hosts.decommission: temporary fix for Netbox

https://gerrit.wikimedia.org/r/668505

Volans claimed this task.

The additional sleep in the above patch should have worked around the issue. Resolving for now; feel free to reopen if it happens again.
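For context, the shape of the workaround is roughly the following; the exact placement and the 60s duration are assumptions based on this thread, not taken from the merged patch:

import time

def run_dns_netbox_with_settle(dns_netbox_run, dns_netbox_args, spicerack, settle=60):
    # Give the forced Ganeti -> Netbox sync time to settle before
    # generating DNS snippets, so generate_dns_snippets does not read
    # Netbox mid-update and hit the KeyError seen above. The 60s default
    # is illustrative only.
    time.sleep(settle)
    return dns_netbox_run(dns_netbox_args, spicerack)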