Research/fix IPMI errors on production elastic hosts
Open, Medium | Public | 3 Estimated Story Points

Description

While working on T289135, I observed some IPMI errors emitted by the reimage cookbook.

Per conversation with Papaul in DC Ops IRC, the error means that IPMI is disabled on the node. We can potentially enable it in the iDRAC GUI under the IPMI setting, or with the provision script using the --no-dhcp --no-user flags.

Creating this ticket to address the issues, and find/add needed documentation.

Event Timeline

Looks like the physical hosts installed in https://phabricator.wikimedia.org/T230746 (elastic10[53-67].eqiad.wmnet) never had IPMI enabled:

ansible -i ipmi.hosts --become all -m shell -a "ipmi-config --section=Lan_Channel --key-pair='Lan_Channel:Volatile_Access_Mode=Always_Available' --key-pair='Lan_Channel:Non_Volatile_Access_Mode=Always_Available' --diff" | tee ~/Documents/ansible-runs/ipmi-$(gdate -Iseconds).txt
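
The contents of ipmi.hosts are not shown in this ticket. As a rough sketch, assuming it is a plain one-host-per-line inventory covering the elastic10[53-67].eqiad.wmnet range, it could be generated like this (the expand_hosts helper is hypothetical, not part of any existing tooling):

```shell
#!/bin/sh
# Hypothetical helper: print one FQDN per line for elastic1053..elastic1067.
expand_hosts() {
    n=53
    while [ "$n" -le 67 ]; do
        printf 'elastic10%s.eqiad.wmnet\n' "$n"
        n=$((n + 1))
    done
}

# Assumption: ipmi.hosts is a simple one-host-per-line Ansible inventory.
expand_hosts > ipmi.hosts
```

Writing the list to a file keeps the ansible invocation unchanged and makes it easy to append other hosts later if they turn out to need the same fix.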

Using these docs, enabled IPMI permissions for elastic10[53-67].eqiad.wmnet and ran the reimage cookbook again. Leaving this ticket open until we confirm that the problem is fixed (keeping in mind that we may discover other hosts that need this fix).

Gehel triaged this task as Medium priority. Aug 29 2022, 3:51 PM
Gehel edited projects, added Discovery-Search; removed Discovery-Search (Current work).
Gehel moved this task from needs triage to Ops / SRE on the Discovery-Search board.
Gehel removed bking as the assignee of this task. Sep 22 2022, 9:14 AM