
Provision cookbook not setting serial console and other settings
Closed, Resolved (Public)

Description

I see some changes made in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1037573/8/cookbooks/sre/hosts/provision.py to the serial console configuration. We are having issues when running the provision cookbook on the r650xs and 450 models: it doesn't set the serial console to COM2.

Hosts to fix:

  • cloudcephmon1004.mgmt.eqiad.wmnet
  • cloudcephmon1005.mgmt.eqiad.wmnet
  • cloudcephmon1006.mgmt.eqiad.wmnet
  • cloudcephosd1035.mgmt.eqiad.wmnet
  • cloudcephosd1036.mgmt.eqiad.wmnet
  • cloudcephosd1037.mgmt.eqiad.wmnet
  • cloudcephosd1038.mgmt.eqiad.wmnet
  • cloudcephosd1039.mgmt.eqiad.wmnet
  • cloudcephosd1040.mgmt.eqiad.wmnet
  • cloudcephosd1041.mgmt.eqiad.wmnet
  • cloudvirt-wdqs1001.mgmt.eqiad.wmnet (still not set up, doesn't need a re-run)
  • db1179.mgmt.eqiad.wmnet (serving prod traffic, needs to be depooled)
  • db2221.mgmt.codfw.wmnet
  • db2222.mgmt.codfw.wmnet
  • db2223.mgmt.codfw.wmnet
  • db2224.mgmt.codfw.wmnet
  • db2225.mgmt.codfw.wmnet
  • db2226.mgmt.codfw.wmnet
  • db2227.mgmt.codfw.wmnet
  • db2228.mgmt.codfw.wmnet
  • db2229.mgmt.codfw.wmnet
  • db2230.mgmt.codfw.wmnet
  • db2231.mgmt.codfw.wmnet
  • db2232.mgmt.codfw.wmnet
  • db2233.mgmt.codfw.wmnet
  • db2234.mgmt.codfw.wmnet
  • db2235.mgmt.codfw.wmnet
  • db2236.mgmt.codfw.wmnet
  • db2237.mgmt.codfw.wmnet
  • db2238.mgmt.codfw.wmnet
  • db2239.mgmt.codfw.wmnet
  • db2240.mgmt.codfw.wmnet
  • gerrit2003.mgmt.codfw.wmnet
  • mw2432.mgmt.codfw.wmnet (renamed to wikikube-worker2035)
  • mw2433.mgmt.codfw.wmnet (renamed to wikikube-worker2036)
  • mw2438.mgmt.codfw.wmnet (renamed to wikikube-worker2037)
  • mw2439.mgmt.codfw.wmnet (renamed to wikikube-worker2038)
  • mw2441.mgmt.codfw.wmnet (renamed to wikikube-worker2039)
  • pc1017.mgmt.eqiad.wmnet
  • pc2017.mgmt.codfw.wmnet
  • sretest1001.mgmt.eqiad.wmnet
  • sretest2002.mgmt.codfw.wmnet - retry, weird error when executing it in _config_dell_host
  • wikikube-ctrl2003.mgmt.codfw.wmnet
  • wikikube-worker1240.mgmt.eqiad.wmnet

Event Timeline

@Papaul is there a host that I can check via Redfish? In theory the change was a no-op, so it should work as before (we just introduced the Supermicro support).

elukey triaged this task as Medium priority. Jul 29 2024, 2:32 PM

@elukey no, we have no servers right now. We manually fixed all the db nodes where we were having issues:
https://phabricator.wikimedia.org/T369654. If we get any, I will ping you.

@Papaul can you point us to a provision cookbook run that didn't set it, so we can check the logs please? Hostname and date/time if it was run multiple times.

I've noticed from the logs that we haven't been setting any additional values since May 31st; from SAL I saw that we didn't have any reimages between the 31st and when we merged the above patch.

I've commented on the patch about the code issue. I'm sorry I missed it during code review.

Volans renamed this task from Provision cookbook not setting serial console on 450 and 650xs model to Provision cookbook not setting serial console and other settings. Jul 29 2024, 10:45 PM
Volans raised the priority of this task from Medium to High.

Fix is in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1057927

Ideally we should re-run the provision on all hosts provisioned since then (list can be obtained in [1]), but we should first check if the missing changes could trigger a reboot of the underlying host or not, and it might also be host-dependent (some requiring reboot, some not).

[1] https://sal.toolforge.org/production?p=0&q=%22START+-+Cookbook+sre.hosts.provision%22&d=
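A minimal sketch of how the host list could be pulled out of SAL entries, assuming the log lines have been saved as plain text. Only the `START - Cookbook sre.hosts.provision` marker comes from the query in [1]; the rest of the line format, the function name `provisioned_hosts`, and the sample lines are assumptions made for illustration.

```python
import re

# Marker taken from the SAL query in [1].
PROVISION_MARKER = "START - Cookbook sre.hosts.provision"

# Matches management FQDNs such as db2221.mgmt.codfw.wmnet.
MGMT_HOST = re.compile(r"\b([a-z0-9-]+\.mgmt\.[a-z0-9]+\.wmnet)\b")


def provisioned_hosts(sal_lines):
    """Return the sorted, de-duplicated mgmt hostnames found in provision START lines."""
    hosts = set()
    for line in sal_lines:
        if PROVISION_MARKER in line:
            hosts.update(MGMT_HOST.findall(line))
    return sorted(hosts)


if __name__ == "__main__":
    # Hypothetical sample lines; the real SAL entries may be worded differently.
    sample = [
        "START - Cookbook sre.hosts.provision for host db2221.mgmt.codfw.wmnet",
        "END (PASS) - Cookbook sre.hosts.provision for host db2221.mgmt.codfw.wmnet",
        "START - Cookbook sre.hosts.provision for host pc1017.mgmt.eqiad.wmnet",
        "START - Cookbook sre.hosts.provision for host db2221.mgmt.codfw.wmnet",
    ]
    for host in provisioned_hosts(sample):
        print(host)
```

Hosts provisioned more than once appear only once in the result, which is what a re-run list needs.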

I agree, yes, there are probably some misconfigured hosts waiting there. We can pick one node that can be easily depooled and run the provision cookbook; I can check the before/after changes with a script using Redfish.

Probably wikikube-worker1240?
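The before/after check could be sketched roughly as below. This is not the script referred to above: it assumes the BIOS attributes have already been fetched via Redfish and dumped into two plain dictionaries, and it only shows the comparison step. The function name `diff_attributes` and the attribute names and values in the example are illustrative assumptions, not taken from the task.

```python
from typing import Any

# Placeholder shown when an attribute exists on only one side.
ABSENT = "<absent>"


def diff_attributes(before: dict[str, Any], after: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {name: (old, new)} for every attribute that differs between the two dumps."""
    changed = {}
    for name in sorted(before.keys() | after.keys()):
        old = before.get(name, ABSENT)
        new = after.get(name, ABSENT)
        if old != new:
            changed[name] = (old, new)
    return changed


if __name__ == "__main__":
    # Hypothetical attribute dumps taken before and after a provision run.
    before = {"SerialComm": "Off", "BootMode": "Bios"}
    after = {"SerialComm": "OnConRedirCom2", "BootMode": "Bios"}
    for name, (old, new) in diff_attributes(before, after).items():
        print(f"{name}: {old} -> {new}")
```

An empty result after a re-run would suggest the host was already configured as expected; any listed attribute is a setting the provision run changed.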

Marostegui subscribed.

From a database point of view, only db1179 is in production, but it is a slave, so it is easily depoolable. The other servers are not in production and don't have any data.

Volans updated the task description.

pc1017 and pc2017 also belong to the DBAs and are not in production either. They can be done anytime.

Mentioned in SAL (#wikimedia-operations) [2024-07-31T06:47:53Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool db1179 T371132', diff saved to https://phabricator.wikimedia.org/P67120 and previous config saved to /var/cache/conftool/dbconfig/20240731-064752-root.json

elukey claimed this task.
elukey updated the task description.