
Degraded RAID on elastic2051
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host elastic2051. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid0] [raid1] [linear] [multipath] [raid6] [raid5] [raid4] [raid10] 
md1 : active raid0 sdb2[1] sda2[0]
      3066771456 blocks super 1.2 512k chunks
      
md0 : active raid1 sda1[0](F) sdb1[1]
      29279232 blocks super 1.2 [2/1] [_U]
      
unused devices: <none>
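
For reference, a minimal sketch of the usual md replacement flow for this layout, assuming the failed member is sda1 in md0 and the replacement disk gets the same partition layout as sdb. Note that md1 is a RAID0 spanning both disks, so losing sda takes that array's data with it and the host will need a reimage regardless.

# The failed member is already flagged (F); explicitly fail and remove it from md0.
sudo mdadm --manage /dev/md0 --fail /dev/sda1
sudo mdadm --manage /dev/md0 --remove /dev/sda1

# After the physical swap, copy the partition table from the healthy disk to the new one.
sudo sfdisk --dump /dev/sdb | sudo sfdisk /dev/sda

# Re-add the mirror member and watch the resync.
sudo mdadm --manage /dev/md0 --add /dev/sda1
cat /proc/mdstat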

Event Timeline

dcausse added a subscriber: dcausse.

elastic2051 being a master-eligible node on the omega cluster, we might want to change the list of masters if this host is going to be down for long.
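
For context, a quick way to see which omega nodes are currently master-eligible is the cat-nodes API; port 9400 for the codfw omega cluster is an assumption here. Master-eligible nodes show an 'm' in node.role, and the elected master is marked with '*':

# List node names and roles on the omega cluster.
curl -s 'http://localhost:9400/_cat/nodes?v&h=name,node.role,master'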

Change 751958 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elasticsearch: changed master eligible on codfw omega to 2052

https://gerrit.wikimedia.org/r/751958

Change 751958 merged by Bking:

[operations/puppet@production] elasticsearch: changed master eligible on codfw omega to 2052

https://gerrit.wikimedia.org/r/751958

Mentioned in SAL (#wikimedia-operations) [2022-01-06T16:37:16Z] <inflatador> restarting elastic2052 for configuration change - T298674

Mentioned in SAL (#wikimedia-operations) [2022-01-06T20:19:34Z] <inflatador> banned elastic2051 from both chi and omega search clusters - T298674
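
For context, the ban above was done with the usual search tooling; in plain Elasticsearch terms it roughly amounts to an allocation exclude like the sketch below (the ports and the node-name wildcard are assumptions, not the exact command that was run):

# Keep shards off elastic2051 on both clusters (assumed ports: 9200 for chi, 9400 for omega).
for port in 9200 9400; do
  curl -s -XPUT "http://localhost:${port}/_cluster/settings" \
    -H 'Content-Type: application/json' \
    -d '{"transient": {"cluster.routing.allocation.exclude._name": "elastic2051*"}}'
done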

@Papaul do you know if we have spare SSDs for this host?

The host is already banned from the cluster, you can take it offline and reboot it whenever you want.

(@bking will add some more details)

@Gehel we have some disks that we took out of decommissioned servers. I will look when I am back on site tomorrow to see if we can find one.

@Papaul Checked the box with hdparm; the failed disk is sda, but it is not reporting its serial number.

The working disk (sdb) has serial number 68GS105XTBWT; this one should NOT be removed.
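
For anyone repeating the check, the serials can be read as below; a dying drive may return nothing to hdparm, in which case smartctl is sometimes still able to read the identity data:

# Healthy disk: should report serial 68GS105XTBWT per the comment above.
sudo hdparm -I /dev/sdb | grep -i 'serial number'

# Failed disk: hdparm may come back empty, smartctl is worth a try.
sudo smartctl -i /dev/sda | grep -i 'serial'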

I've suppressed alerts and halted the machine, so feel free to replace the drive and re-image the server at your convenience. My IRC handle is "inflatador" if you have any questions.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2051.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2051.codfw.wmnet with OS bullseye executed with errors:

  • elastic2051 (FAIL)
    • Downtimed on Icinga
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run failed, asking the operator what to do
    • First Puppet run failed, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201071829_pt1979_1070579_elastic2051.out
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2051.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2051.codfw.wmnet with OS bullseye executed with errors:

  • elastic2051 (FAIL)
    • Downtimed on Icinga
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run failed, asking the operator what to do
    • First Puppet run failed, asking the operator what to do
    • First Puppet run failed, asking the operator what to do
    • First Puppet run failed, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201071918_pt1979_1077194_elastic2051.out
    • The reimage failed, see the cookbook logs for the details
Papaul added a subscriber: jbond.

This is ready, but according to @jbond it still has some puppet errors, which look related to the puppet policy not yet being ready for Debian bullseye.

To expand: it looks like there are some packages that are not available (or have possibly been renamed):

Error: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install elasticsearch-oss' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package elasticsearch-oss
Error: /Stage[main]/Elasticsearch::Packages/Package[elasticsearch]/ensure: change from 'purged' to 'present' failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install elasticsearch-oss' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package elasticsearch-oss
Error: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install liblogstash-gelf-java' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package liblogstash-gelf-java
Error: /Stage[main]/Elasticsearch::Packages/Package[liblogstash-gelf-java]/ensure: change from 'purged' to 'present' failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install liblogstash-gelf-java' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package liblogstash-gelf-java
Error: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /usr/share/elasticsearch/lib (file: /etc/puppet/modules/elasticsearch/manifests/packages.pp, line: 29)
Error: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /usr/share/elasticsearch/lib (file: /etc/puppet/modules/elasticsearch/manifests/packages.pp, line: 29)
Wrapped exception:
No such file or directory @ dir_chdir - /usr/share/elasticsearch/lib
Error: /Stage[main]/Elasticsearch::Packages/File[/usr/share/elasticsearch/lib/logstash-gelf.jar]/ensure: change from 'absent' to 'link' failed: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /usr/share/elasticsearch/lib (file: /etc/puppet/modules/elasticsearch/manifests/packages.pp, line: 29)
Error: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /usr/share/elasticsearch/lib (file: /etc/puppet/modules/elasticsearch/manifests/packages.pp, line: 33)
Error: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /usr/share/elasticsearch/lib (file: /etc/puppet/modules/elasticsearch/manifests/packages.pp, line: 33)
Wrapped exception:
No such file or directory @ dir_chdir - /usr/share/elasticsearch/lib
Error: /Stage[main]/Elasticsearch::Packages/File[/usr/share/elasticsearch/lib/json-simple.jar]/ensure: change from 'absent' to 'link' failed: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /usr/share/elasticsearch/lib (file: /etc/puppet/modules/elasticsearch/manifests/packages.pp, line: 33)
Error: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install wmf-elasticsearch-search-plugins' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package wmf-elasticsearch-search-plugins
Error: /Stage[main]/Profile::Elasticsearch::Cirrus/Package[wmf-elasticsearch-search-plugins]/ensure: change from 'purged' to 'present' failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install wmf-elasticsearch-search-plugins' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package wmf-elasticsearch-search-plugins
Error: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install elasticsearch-madvise' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package elasticsearch-madvise
Error: /Stage[main]/Profile::Elasticsearch::Cirrus/Package[elasticsearch-madvise]/ensure: change from 'purged' to 'present' failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install elasticsearch-madvise' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package elasticsearch-madvise
Error: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install logstash-oss' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package logstash-oss
Error: /Stage[main]/Logstash/Package[logstash]/ensure: change from 'purged' to 'present' failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install logstash-oss' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package logstash-oss
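
A quick way to confirm from a bullseye host whether these packages are published in the configured repositories at all, before digging into the puppet code (package names taken from the errors above):

# Show which repo, if any, provides each of the failing packages.
apt-cache policy elasticsearch-oss wmf-elasticsearch-search-plugins \
    elasticsearch-madvise liblogstash-gelf-java logstash-oss

# See what the configured repos offer under a similar name (in case of a rename).
apt-cache search elasticsearch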

This just alerted with:

20:04:31 <icinga-wm> PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on elastic2051 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} https://wikitech.wikimedia.org/wiki/Microcode

Downtimed the host for a day (from now), so that it will not show up in icinga.
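
For reference, the md_clear alert can be cross-checked on the host itself once it is back up; the flag should show up after the updated microcode is loaded and the machine has been rebooted:

# Count logical CPUs reporting md_clear; 0 means the MDS mitigation microcode is not loaded.
grep -c md_clear /proc/cpuinfo

# Show the currently loaded microcode revision.
grep -m1 microcode /proc/cpuinfo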

The server reimage to bullseye is incomplete due to missing packages (among other things). I found an epic with more details; my next steps are to review the output of the sre.hosts.reimage puppet run posted above and address the failures one by one.

Was this intentionally reimaged with Bullseye? I wouldn't entangle this with a hardware maintenance; I'd simply reimage with stretch and then start the Bullseye migration on e.g. relforge.

@MoritzMuehlenhoff This is a good point; will discuss further with my team today.

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2051.codfw.wmnet with OS stretch

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2051.codfw.wmnet with OS stretch executed with errors:

  • elastic2051 (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details
bking reopened this task as In Progress. Thu, Jan 13, 2:49 PM
bking claimed this task.

Per yesterday's conversation with @Gehel (and Moritz's suggestion above), we have elected to reimage this server to Stretch and deal with the Bullseye issues separately. Working this now...

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2051.codfw.wmnet with OS stretch

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2051.codfw.wmnet with OS stretch executed with errors:

  • elastic2051 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

More details on failure:
Exception raised while executing cookbook sre.hosts.reimage:
Traceback (most recent call last):

File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 234, in run
  raw_ret = runner.run()
File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 455, in run
  self._install_os()
File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 303, in _install_os
  self.remote_installer.wait_reboot_since(di_reboot_time, print_progress_bars=False)
File "/usr/lib/python3/dist-packages/wmflib/decorators.py", line 210, in wrapper
  return func(*args, **kwargs)
File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 582, in wait_reboot_since
  f"Uptime for {nodeset} higher than threshold: {round(uptime, 2)} > {round(delta, 2)}"

spicerack.remote.RemoteCheckError: Uptime for elastic2051.codfw.wmnet higher than threshold: 1415.17 > 1355.2
The reimage failed, see the cookbook logs for the details
Reimage executed with errors:

  • elastic2051 (FAIL)

Since this has happened twice, I'll log in to the mgmt console and watch the install as it happens; hopefully it will provide some hints.

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2051.codfw.wmnet with OS stretch

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2051.codfw.wmnet with OS stretch executed with errors:

  • elastic2051 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2051.codfw.wmnet with OS stretch

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2051.codfw.wmnet with OS stretch executed with errors:

  • elastic2051 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2051.codfw.wmnet with OS stretch

Looks like the server is trying to PXE boot from its 1G NICs, but it should be using its 10G NICs. Guessing this can be fixed through the BIOS based on Papaul's recommendations.
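
For reference, the cookbook's "Forced PXE for next reboot" step is done over IPMI roughly like the sketch below; which NIC actually attempts PXE is chosen in the BIOS / NIC firmware settings, so that part has to be fixed on the console (the mgmt hostname pattern here is an assumption):

# Force PXE for the next boot only, then power-cycle; the password is read from
# the IPMI_PASSWORD environment variable via -E.
ipmitool -I lanplus -H elastic2051.mgmt.codfw.wmnet -U root -E chassis bootdev pxe
ipmitool -I lanplus -H elastic2051.mgmt.codfw.wmnet -U root -E chassis power cycle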

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2051.codfw.wmnet with OS stretch executed with errors:

  • elastic2051 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2051.codfw.wmnet with OS stretch

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2051.codfw.wmnet with OS stretch executed with errors:

  • elastic2051 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh stretch OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201131750_bking_2092302_elastic2051.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2051.codfw.wmnet with OS stretch

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2051.codfw.wmnet with OS stretch executed with errors:

  • elastic2051 (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh stretch OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201131943_bking_2107733_elastic2051.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host elastic2051.codfw.wmnet with OS stretch

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host elastic2051.codfw.wmnet with OS stretch completed:

  • elastic2051 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh stretch OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201142226_ryankemper_2294794_elastic2051.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
RKemper triaged this task as Medium priority. Fri, Jan 14, 11:11 PM