
Do something to better handle wmf-reimage runs cleanups/failures
Closed, Resolved · Public

Description

wmf-auto-reimage fails 1 out of 2 times for me (not necessarily because of the script itself: faulty hardware, installer or IPMI problems, not passing --new after it failed the first time, etc.), and when it fails, it doesn't fail gracefully. Normally the neodymium process gets stuck (which is relatively easy to kill: just attach to the screen and hit Ctrl-C), but it also leaves at least 3 processes behind on puppetmaster1001.

Ideally, Ctrl-C would be caught and trigger a proper cleanup. Alternatively, add a parameter to perform the cleanup, or, on start, make sure there are no other zombie installs ongoing on the puppetmaster (see the sketch below). Maybe some watchdog to detect stuck installations? Maybe this will not be needed if wmf-reimage gets deprecated.
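For illustration only, here is a minimal sketch of what catching Ctrl-C plus a startup zombie check could look like; all names in it (LOCK_FILE, kill_remote_helpers, the pkill pattern) are hypothetical placeholders, not the actual script's API:

#!/usr/bin/env python3
"""Sketch: trap SIGINT for cleanup and refuse to start over a stale run."""
import atexit
import os
import signal
import subprocess
import sys

LOCK_FILE = '/var/run/wmf-auto-reimage.lock'  # hypothetical lock path


def abort_if_already_running():
    """On start, refuse to run if a previous install looks still ongoing."""
    if os.path.exists(LOCK_FILE):
        sys.exit('Another reimage seems in progress (%s exists); '
                 'clean it up or remove the lock first.' % LOCK_FILE)
    with open(LOCK_FILE, 'w') as lock:
        lock.write(str(os.getpid()))
    atexit.register(os.unlink, LOCK_FILE)  # remove the lock on any exit


def kill_remote_helpers():
    """Best-effort cleanup of helper processes left on the puppetmaster.

    The process pattern below is a placeholder; the real one would have
    to match whatever wmf-reimage actually spawns remotely."""
    subprocess.call(['ssh', 'puppetmaster1001.eqiad.wmnet',
                     'pkill', '-f', 'wmf-reimage'])


def handle_sigint(signum, frame):
    print('Interrupted, cleaning up before exiting...')
    kill_remote_helpers()
    sys.exit(130)  # conventional exit status for SIGINT


signal.signal(signal.SIGINT, handle_sigint)
abort_if_already_running()
# ... the actual reimage workflow would run here ...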

Event Timeline

I think most of this will go away when working on T166300, probably in Q1 as part of the Salt deprecation goal. My plan is to get rid of wmf-reimage completely and have a single script that handles the whole process.

I'll leave it open for now to collect feedback, but I'm leaning towards closing this one in favor of T166300. Thoughts?

I am ok with that; I just needed to create this because it annoyed me quite a bit.

jcrespo renamed this task from Do something to better handle run cleanups/failures to Do something to better handle wmf-reimage runs cleanups/failures.Jun 7 2017, 10:44 AM

The script failed on every single run I did in the past week except one:

root@neodymium:/var/log/wmf-auto-reimage$ ls 20170[56]*jynus* | wc -l
20
Volans claimed this task.

@jcrespo I'm marking this as resolved given that we merged the new reimage script, which no longer uses Salt or the wmf-reimage script.
The new one should be more reliable, fail immediately if remote IPMI doesn't work, and should not leave running processes behind.
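For reference, a fail-fast IPMI preflight could look roughly like the sketch below; the check_ipmi helper and the exact ipmitool flags are illustrative assumptions, not the actual implementation:

import subprocess
import sys


def check_ipmi(host, user, password):
    """Exit immediately if remote IPMI is unreachable.

    Uses ipmitool's lanplus interface to query power status; the flags
    and credential handling here are illustrative only."""
    cmd = ['ipmitool', '-I', 'lanplus', '-H', host,
           '-U', user, '-P', password, 'chassis', 'power', 'status']
    try:
        subprocess.check_output(cmd, stderr=subprocess.STDOUT, timeout=10)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired,
            OSError) as e:
        sys.exit('Remote IPMI check failed for %s: %s' % (host, e))


# Example usage (hypothetical management host and credentials):
# check_ipmi('db1001.mgmt.eqiad.wmnet', 'root', 'secret')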

Feel free to open new tasks (or re-open this one) if you encounter any issue with the new script.