
Do something to better handle wmf-reimage runs cleanups/failures
Closed, Resolved · Public

Description

wmf-auto-reimage fails 1 out of 2 times for me (not necessarily because of the script itself: faulty hardware, installer or IPMI problems, not passing --new after it failed the first time, etc.), and when it fails, it doesn't fail gracefully. Normally the neodymium process gets stuck (which is relatively easy to kill: just attach to the screen and hit Ctrl-C), but it also leaves at least 3 processes behind on puppetmaster1001.

Ideally, Ctrl-C would be caught and trigger a proper cleanup. Alternatively, add a parameter to perform the cleanup, or, on start, make sure there are no other zombie installs ongoing on the puppetmaster (see the sketch below). Maybe some watchdog to detect stuck installations? Maybe this will not be needed if wmf-reimage gets deprecated.
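For illustration only, here is a minimal sketch of what catching Ctrl-C plus a startup zombie check could look like; all names in it (LOCK_FILE, kill_remote_helpers, the pkill pattern) are hypothetical placeholders, not the actual script's API:

#!/usr/bin/env python3
"""Sketch: trap SIGINT for cleanup and refuse to start over a stale run."""
import atexit
import os
import signal
import subprocess
import sys

LOCK_FILE = '/var/run/wmf-auto-reimage.lock'  # hypothetical lock path


def abort_if_already_running():
    """On start, refuse to run if a previous install looks still ongoing."""
    if os.path.exists(LOCK_FILE):
        sys.exit('Another reimage seems in progress (%s exists); '
                 'clean it up or remove the lock first.' % LOCK_FILE)
    with open(LOCK_FILE, 'w') as lock:
        lock.write(str(os.getpid()))
    atexit.register(os.unlink, LOCK_FILE)  # remove the lock on any exit


def kill_remote_helpers():
    """Best-effort cleanup of helper processes left on the puppetmaster.

    The process pattern below is a placeholder; the real one would have
    to match whatever wmf-reimage actually spawns remotely."""
    subprocess.call(['ssh', 'puppetmaster1001.eqiad.wmnet',
                     'pkill', '-f', 'wmf-reimage'])


def handle_sigint(signum, frame):
    print('Interrupted, cleaning up before exiting...')
    kill_remote_helpers()
    sys.exit(130)  # conventional exit status for SIGINT


signal.signal(signal.SIGINT, handle_sigint)
abort_if_already_running()
# ... the actual reimage workflow would run here ...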

Event Timeline

I think most of this will go away when working on T166300, probably in Q1 as part of the Salt deprecation goal. My plan is to get rid of wmf-reimage completely and have a single script that handles the whole process.

I'll leave it open for now to collect feedback, but I'm leaning towards closing this one in favor of T166300. Thoughts?

I am ok with that; I just needed to create this because it annoyed me quite a bit.

jcrespo renamed this task from Do something to better handle run cleanups/failures to Do something to better handle wmf-reimage runs cleanups/failures.Jun 7 2017, 10:44 AM

The script failed on every single run I did in the past week except one:

root@neodymium:/var/log/wmf-auto-reimage$ ls 20170[56]*jynus* | wc -l
20
Volans claimed this task.

@jcrespo I'm marking this as resolved given that we merged the new reimage script, which no longer uses Salt or the wmf-reimage script.
The new one should be more reliable, fail immediately if remote IPMI doesn't work, and should not leave running processes behind.
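For reference, a fail-fast IPMI preflight could look roughly like the sketch below; the check_ipmi helper and the exact ipmitool flags are illustrative assumptions, not the actual implementation:

import subprocess
import sys


def check_ipmi(host, user, password):
    """Exit immediately if remote IPMI is unreachable.

    Uses ipmitool's lanplus interface to query power status; the flags
    and credential handling here are illustrative only."""
    cmd = ['ipmitool', '-I', 'lanplus', '-H', host,
           '-U', user, '-P', password, 'chassis', 'power', 'status']
    try:
        subprocess.check_output(cmd, stderr=subprocess.STDOUT, timeout=10)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired,
            OSError) as e:
        sys.exit('Remote IPMI check failed for %s: %s' % (host, e))


# Example usage (hypothetical management host and credentials):
# check_ipmi('db1001.mgmt.eqiad.wmnet', 'root', 'secret')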

Feel free to open new tasks (or re-open this one) if you encounter any issue with the new script.