Suggest install-console tool in sre.makevm cookbook failure message
Closed, Resolved (Public)

Description

Per an email conversation with @Volans, I'm raising this ticket because I was bitten by this in T341792.

  • Suggest usage of install-console in MakeVM cookbook failure message. This tool was extremely helpful to me when it comes to unblocking VM builds. I've updated the Ganeti docs to mention it as well.
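One way the suggestion above could look in code is sketched below. This is only an illustration: `failure_message` and `INSTALL_CONSOLE_HINT` are hypothetical names, not the actual cookbook code, and the hint deliberately names the tool without assuming anything about how it is invoked.

```python
# Hypothetical sketch; these names are NOT taken from the real cookbooks.
INSTALL_CONSOLE_HINT = (
    "Consider using the install-console tool to troubleshoot the host."
)


def failure_message(hostname: str, reason: str) -> str:
    """Build a failure message that always ends with a troubleshooting hint."""
    return f"Failed to build {hostname}: {reason}\nHint: {INSTALL_CONSOLE_HINT}"


if __name__ == "__main__":
    # Example values are made up for illustration.
    print(failure_message("vm1001.example.org", "timed out waiting for Puppet"))
```

Keeping the hint in one constant means every failure path can append the same wording.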

Thanks for looking, please let me know if you need more info.

Event Timeline

I think we have several possible improvements to avoid what happened in your related task:

  1. Puppet CI could catch that the role assigned in site.pp doesn't exist
  2. The reimage cookbook could check that a host has a role applied in site.pp before starting (though in this case it would not have caught the issue, since a role was assigned)
  3. The reimage cookbook could try to be smarter about reporting a complete Puppet failure (already tracked in T338990)
  4. The install_console command should be common knowledge and part of everyone's onboarding.
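The first two checks in the list above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not how Puppet CI or the reimage cookbook actually work: it assumes site.pp entries of the form `node 'host' { role(some::role) }` with literal quoted hostnames (regex node definitions are not handled), and that a role named `foo::bar` lives in `modules/role/manifests/foo/bar.pp`. All function names are hypothetical.

```python
import re
from typing import Dict, List, Optional, Set

# Assumed shape of an entry: node 'host.example.org' { role(some::role) }
NODE_RE = re.compile(r"node\s+'(?P<host>[^']+)'\s*\{(?P<body>[^}]*)\}", re.S)
ROLE_RE = re.compile(r"role\(\s*(?P<role>[\w:]+)\s*\)")


def roles_by_host(site_pp: str) -> Dict[str, Optional[str]]:
    """Map each literal node name to its role, or None if no role() call is found."""
    roles: Dict[str, Optional[str]] = {}
    for node in NODE_RE.finditer(site_pp):
        role = ROLE_RE.search(node.group("body"))
        roles[node.group("host")] = role.group("role") if role else None
    return roles


def role_manifest_path(role: str) -> str:
    """Translate a role name into its manifest path (assumed repository layout)."""
    return "modules/role/manifests/" + role.replace("::", "/") + ".pp"


def check_host(site_pp: str, hostname: str, manifests: Set[str]) -> List[str]:
    """Return a list of problems found for hostname; an empty list means none."""
    role = roles_by_host(site_pp).get(hostname)
    if role is None:
        return [f"{hostname}: no role found in site.pp"]
    path = role_manifest_path(role)
    if path not in manifests:
        return [f"{hostname}: role '{role}' has no manifest at {path}"]
    return []
```

Here `manifests` is the set of manifest paths present in the repository, which the caller would have to collect. A host whose role is assigned but missing would be caught by the second branch, while a host with no role at all would be caught by the first.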

Thanks for the links, very helpful. I'll file a ticket for 1.

As for 4, I think we need to be careful about what we expect of the user. If a WMF-specific command or process is very important for troubleshooting, it should be called out as often as possible. Think of it like Linux rescue mode's "Press CTRL-S to drop into a shell".

Change 956082 had a related patch set uploaded (by Bking; author: Bking):

[operations/cookbooks@master] sre.hosts.reimage: Suggest install-console for troubleshooting

https://gerrit.wikimedia.org/r/956082

Volans triaged this task as Medium priority. Jan 22 2024, 4:15 PM

@bking Assigning it to you as you've already sent a patch for it. When you have a minute please resume it so we can complete and merge it.

@Volans ACK, working on it as a Friday project now. Sorry for the delay.

Change 956082 merged by Bking:

[operations/cookbooks@master] sre.hosts.reimage: Suggest install-console for troubleshooting

https://gerrit.wikimedia.org/r/956082

bking moved this task from Backlog to Done on the Infrastructure-Foundations board.

@Volans Thanks again for your help. I just merged the change, so closing this ticket.