
IPMI Audit 2018-04
Closed, Resolved, Public

Description

While investigating a reimage issue, I noticed something strange in the output of the IPMI command chassis bootparam get 5, so I audited the whole fleet and found a worrying situation.
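For a fleet-wide check like this, the raw value can be pulled out of the command output programmatically. The following is only a minimal sketch, not the script used for this audit: it assumes the output contains a line of the form "Boot parameter data: <10 hex digits>" as in the excerpts in this task, that ipmitool is installed, and that the management interfaces are reached over lanplus with the password supplied through the IPMI_PASSWORD environment variable. The function names and the default user are hypothetical.

```python
import re
import subprocess

# Matches the raw value line as it appears in the excerpts in this task.
BOOT_DATA_RE = re.compile(r"Boot parameter data:\s*([0-9A-Fa-f]{10})\b")


def extract_boot_data(output):
    """Return the raw boot parameter data as lowercase hex, or None if absent."""
    match = BOOT_DATA_RE.search(output)
    return match.group(1).lower() if match else None


def query_host(mgmt_host, user="root", timeout=30):
    """Query one management interface (hypothetical wrapper around ipmitool).

    Assumes the password is exported as IPMI_PASSWORD (read via -E).
    Returns None when the command fails or times out, which is how a host
    with broken remote IPMI would surface in an audit.
    """
    cmd = ["ipmitool", "-I", "lanplus", "-H", mgmt_host, "-U", user, "-E",
           "chassis", "bootparam", "get", "5"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout, check=True)
    except (OSError, subprocess.SubprocessError):
        return None
    return extract_boot_data(result.stdout)
```

Hosts for which query_host returns None would need manual follow-up, since a failure can mean either broken remote IPMI or unexpected output.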

Hosts with broken remote IPMI

  • bast4002.mgmt.ulsfo.wmnet (FIXED, remote IPMI disabled)
  • cp2010.mgmt.codfw.wmnet (FIXED, reset password)
  • cp2021.mgmt.codfw.wmnet (FIXED, reset password)
  • cp2022.mgmt.codfw.wmnet (FIXED, reset password)
  • cp4024.mgmt.ulsfo.wmnet (FIXED, reset password)
  • dns4001.mgmt.ulsfo.wmnet (FIXED, remote IPMI disabled)
  • dns4002.mgmt.ulsfo.wmnet (FIXED, remote IPMI disabled)
  • es1019.mgmt.eqiad.wmnet (TO BE FIXED, had many failures in the past: T187530 T155691 T167121)
  • lawrencium.mgmt.eqiad.wmnet (IGNORING, to be decomm'ed: T191360)
  • mw2251.mgmt.codfw.wmnet (FIXED, reset password)
  • scb1004.mgmt.eqiad.wmnet (FIXED, racadm racreset)

Hosts with Sleep Button and Console overridden

These hosts have:

Boot parameter data: 0000020000
- Lock Out Sleep Button
- BIOS verbosity : Request console redirection be enabled

As opposed to the default:

Boot parameter data: 0000000000
 - BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)

List of affected hosts:

graphite2002.mgmt.codfw.wmnet
rdb1006.mgmt.eqiad.wmnet
scb2003.mgmt.codfw.wmnet
scb2004.mgmt.codfw.wmnet

Are you OK with me setting the overridden bit?

Hosts with Boot Flag, Sleep Button and Console overridden

These hosts have:

Boot parameter data: 8000020000
   - Boot Flag Valid
   - Lock Out Sleep Button
   - BIOS verbosity : Request console redirection be enabled

As opposed to the default:

Boot parameter data: 0000000000
 - Boot Flag Invalid
 - BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)
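The raw values above can also be split into their fields directly. This is a rough sketch that assumes the boot flags layout of parameter 5 in the IPMI 2.0 specification (byte 1 bit 7 is the valid flag, byte 1 bit 6 the persistent flag, byte 2 bits 5:2 the boot device selector). It deliberately leaves byte 3 undecoded and returns it raw, so the tool's own labels quoted above remain the reference for that byte. The function name is hypothetical, and the value 0004000000 in the usage note is illustrative, not taken from this audit.

```python
# Boot device selector values (byte 2, bits 5:2), assumed from the IPMI 2.0 spec.
BOOT_DEVICES = {
    0b0000: "no override",
    0b0001: "force PXE",
    0b0010: "force default hard drive",
    0b0011: "force default hard drive, safe mode",
    0b0100: "force diagnostic partition",
    0b0101: "force CD/DVD",
    0b0110: "force BIOS setup",
    0b1111: "force floppy/primary removable media",
}


def decode_boot_flags(raw_hex):
    """Decode the 5-byte 'Boot parameter data' hex string into a small dict."""
    data = bytes.fromhex(raw_hex)
    if len(data) != 5:
        raise ValueError("expected 5 bytes of boot parameter data")
    selector = (data[1] >> 2) & 0x0F
    return {
        "valid": bool(data[0] & 0x80),        # Boot Flag Valid / Invalid
        "persistent": bool(data[0] & 0x40),   # applies beyond the next boot
        "boot_device": BOOT_DEVICES.get(selector, f"reserved ({selector:#06b})"),
        "byte3_raw": data[2],                 # left undecoded on purpose
    }
```

With this layout, decode_boot_flags("8000020000") reports a valid boot flag and no boot device override, while a value such as 0004000000 would decode to a Force PXE override.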

List of affected hosts:

bast4001.mgmt.ulsfo.wmnet
conf2001.mgmt.codfw.wmnet
db1061.mgmt.eqiad.wmnet
db1062.mgmt.eqiad.wmnet
db1069.mgmt.eqiad.wmnet
labstore1004.mgmt.eqiad.wmnet
labstore1005.mgmt.eqiad.wmnet
labtestvirt2001.mgmt.codfw.wmnet
neodymium.mgmt.eqiad.wmnet
phab1001.mgmt.eqiad.wmnet
puppetmaster1001.mgmt.eqiad.wmnet
sodium.mgmt.eqiad.wmnet

Are you OK with me setting the overridden bit?

Force PXE (FIXED)

150 hosts had the Boot Device Selector overridden to Force PXE at the next reboot. This is the most worrying finding, particularly because most of them host stateful services. We agreed that, given our current infrastructure configuration, there is no use case for leaving any host in PXE mode, so I have already fixed it by resetting the bit so that it no longer overrides the default boot order.

aqs1006.mgmt.eqiad.wmnet
conf1004.mgmt.eqiad.wmnet
conf1006.mgmt.eqiad.wmnet
db1087.mgmt.eqiad.wmnet
db1090.mgmt.eqiad.wmnet
db2033.mgmt.codfw.wmnet
db2034.mgmt.codfw.wmnet
db2036.mgmt.codfw.wmnet
db2037.mgmt.codfw.wmnet
db2041.mgmt.codfw.wmnet
db2043.mgmt.codfw.wmnet
db2044.mgmt.codfw.wmnet
db2046.mgmt.codfw.wmnet
db2050.mgmt.codfw.wmnet
db2052.mgmt.codfw.wmnet
db2053.mgmt.codfw.wmnet
db2054.mgmt.codfw.wmnet
db2069.mgmt.codfw.wmnet
db2070.mgmt.codfw.wmnet
dbstore2001.mgmt.codfw.wmnet
dbstore2002.mgmt.codfw.wmnet
elastic1032.mgmt.eqiad.wmnet
elastic1033.mgmt.eqiad.wmnet
elastic1034.mgmt.eqiad.wmnet
elastic1035.mgmt.eqiad.wmnet
elastic1036.mgmt.eqiad.wmnet
elastic1037.mgmt.eqiad.wmnet
elastic1038.mgmt.eqiad.wmnet
elastic1039.mgmt.eqiad.wmnet
elastic1040.mgmt.eqiad.wmnet
elastic1041.mgmt.eqiad.wmnet
elastic1042.mgmt.eqiad.wmnet
elastic1043.mgmt.eqiad.wmnet
elastic1044.mgmt.eqiad.wmnet
elastic1045.mgmt.eqiad.wmnet
elastic1046.mgmt.eqiad.wmnet
elastic1047.mgmt.eqiad.wmnet
elastic1048.mgmt.eqiad.wmnet
elastic1049.mgmt.eqiad.wmnet
elastic1050.mgmt.eqiad.wmnet
elastic1051.mgmt.eqiad.wmnet
elastic1052.mgmt.eqiad.wmnet
elastic2018.mgmt.codfw.wmnet
elastic2020.mgmt.codfw.wmnet
elastic2025.mgmt.codfw.wmnet
elastic2026.mgmt.codfw.wmnet
elastic2027.mgmt.codfw.wmnet
elastic2028.mgmt.codfw.wmnet
elastic2029.mgmt.codfw.wmnet
elastic2030.mgmt.codfw.wmnet
elastic2031.mgmt.codfw.wmnet
elastic2032.mgmt.codfw.wmnet
elastic2033.mgmt.codfw.wmnet
elastic2034.mgmt.codfw.wmnet
elastic2035.mgmt.codfw.wmnet
elastic2036.mgmt.codfw.wmnet
labcontrol1003.mgmt.eqiad.wmnet
labcontrol1004.mgmt.eqiad.wmnet
labtestcontrol2003.mgmt.codfw.wmnet
labtestmetal2001.mgmt.codfw.wmnet
labtestneutron2002.mgmt.codfw.wmnet
labtestservices2002.mgmt.codfw.wmnet
labtestvirt2003.mgmt.codfw.wmnet
lvs1010.mgmt.eqiad.wmnet
lvs1011.mgmt.eqiad.wmnet
lvs1012.mgmt.eqiad.wmnet
lvs2004.mgmt.codfw.wmnet
lvs2005.mgmt.codfw.wmnet
lvs2006.mgmt.codfw.wmnet
mc1019.mgmt.eqiad.wmnet
mc1020.mgmt.eqiad.wmnet
mc1021.mgmt.eqiad.wmnet  -- iLO was not accepting changes, it worked after a reset
mc1022.mgmt.eqiad.wmnet
mc1023.mgmt.eqiad.wmnet
mc1024.mgmt.eqiad.wmnet
mc1025.mgmt.eqiad.wmnet
mc1026.mgmt.eqiad.wmnet
mc1027.mgmt.eqiad.wmnet
mc1028.mgmt.eqiad.wmnet
mc1029.mgmt.eqiad.wmnet
mc1030.mgmt.eqiad.wmnet
mc1031.mgmt.eqiad.wmnet
mc1032.mgmt.eqiad.wmnet
mc1033.mgmt.eqiad.wmnet
mc1034.mgmt.eqiad.wmnet
mc1035.mgmt.eqiad.wmnet
mc1036.mgmt.eqiad.wmnet
mc2036.mgmt.codfw.wmnet
ms-be1019.mgmt.eqiad.wmnet
ms-be1020.mgmt.eqiad.wmnet
ms-be1021.mgmt.eqiad.wmnet
ms-be1022.mgmt.eqiad.wmnet
ms-be1023.mgmt.eqiad.wmnet
ms-be1024.mgmt.eqiad.wmnet
ms-be1025.mgmt.eqiad.wmnet
ms-be1026.mgmt.eqiad.wmnet
ms-be1027.mgmt.eqiad.wmnet
ms-be1028.mgmt.eqiad.wmnet
ms-be1029.mgmt.eqiad.wmnet
ms-be1030.mgmt.eqiad.wmnet
ms-be1031.mgmt.eqiad.wmnet
ms-be1032.mgmt.eqiad.wmnet
ms-be1033.mgmt.eqiad.wmnet
ms-be1034.mgmt.eqiad.wmnet
ms-be1035.mgmt.eqiad.wmnet
ms-be1036.mgmt.eqiad.wmnet
ms-be1037.mgmt.eqiad.wmnet
ms-be1038.mgmt.eqiad.wmnet
ms-be1039.mgmt.eqiad.wmnet
ms-be2017.mgmt.codfw.wmnet
ms-be2018.mgmt.codfw.wmnet
ms-be2019.mgmt.codfw.wmnet
ms-be2020.mgmt.codfw.wmnet
ms-be2023.mgmt.codfw.wmnet
ms-be2025.mgmt.codfw.wmnet
ms-be2026.mgmt.codfw.wmnet
ms-be2027.mgmt.codfw.wmnet
ms-be2028.mgmt.codfw.wmnet
ms-be2029.mgmt.codfw.wmnet
ms-be2030.mgmt.codfw.wmnet
ms-be2031.mgmt.codfw.wmnet
ms-be2032.mgmt.codfw.wmnet
ms-be2033.mgmt.codfw.wmnet
ms-be2034.mgmt.codfw.wmnet
ms-be2035.mgmt.codfw.wmnet
ms-be2036.mgmt.codfw.wmnet
ms-be2037.mgmt.codfw.wmnet
ms-be2038.mgmt.codfw.wmnet
ms-be2039.mgmt.codfw.wmnet
relforge1001.mgmt.eqiad.wmnet
relforge1002.mgmt.eqiad.wmnet
restbase1010.mgmt.eqiad.wmnet
restbase1011.mgmt.eqiad.wmnet
restbase1012.mgmt.eqiad.wmnet
restbase1013.mgmt.eqiad.wmnet
restbase1014.mgmt.eqiad.wmnet
restbase1015.mgmt.eqiad.wmnet
restbase2001.mgmt.codfw.wmnet
restbase2002.mgmt.codfw.wmnet
restbase2003.mgmt.codfw.wmnet
restbase2004.mgmt.codfw.wmnet
restbase2005.mgmt.codfw.wmnet
restbase2006.mgmt.codfw.wmnet
restbase2007.mgmt.codfw.wmnet
restbase2008.mgmt.codfw.wmnet
restbase2009.mgmt.codfw.wmnet
stat1006.mgmt.eqiad.wmnet
wasat.mgmt.codfw.wmnet
wdqs1003.mgmt.eqiad.wmnet
wdqs2003.mgmt.codfw.wmnet
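This task does not record the exact command used to clear the Force PXE override on the hosts above. One plausible way, shown here purely as a sketch, is ipmitool's chassis bootdev none subcommand, which asks the BMC not to change the boot device order. The helper names, the default user, and the lanplus/IPMI_PASSWORD connection details are assumptions. The code only builds the command lines and does not run them.

```python
def reset_bootdev_command(mgmt_host, user="root"):
    """Build (but do not run) the ipmitool command clearing the boot override.

    Assumes lanplus access and the password exported as IPMI_PASSWORD (-E).
    """
    return ["ipmitool", "-I", "lanplus", "-H", mgmt_host, "-U", user, "-E",
            "chassis", "bootdev", "none"]


def plan_resets(mgmt_hosts):
    """Map each management hostname to the command that would be executed."""
    return {host: reset_bootdev_command(host) for host in mgmt_hosts}


if __name__ == "__main__":
    # Dry run: print the commands for review instead of executing them.
    for host, cmd in plan_resets(["mc1021.mgmt.eqiad.wmnet"]).items():
        print(host, " ".join(cmd))
```

Printing the plan first allows a review before anything touches a BMC; re-reading the boot parameters afterwards would confirm that the override is gone.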

Event Timeline

Volans triaged this task as Medium priority. Apr 26 2018, 11:22 AM
Vvjjkkii renamed this task from IPMI Audit 2018-04 to w5daaaaaaa. Jul 1 2018, 1:14 AM
Vvjjkkii removed Volans as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
Mainframe98 renamed this task from w5daaaaaaa to IPMI Audit 2018-04. Jul 1 2018, 7:51 AM
Mainframe98 assigned this task to Volans.
Mainframe98 lowered the priority of this task from High to Medium.
Mainframe98 updated the task description.
Mainframe98 added a subscriber: Aklapper.

Resolving, as this audit is now too old. We could re-run it if needed, but we also have alerts that check most of these scenarios.