db1140 (backup source) crashed
Closed, Resolved · Public

Description

Times in UTC:

03:26:26 <+icinga-wm> PROBLEM - Host db1140 is DOWN: PING CRITICAL - Packet loss = 100%
iLO Standard 1.40 at  Feb 05 2019
Server Name:
Server Power: Off

</>hpiLO->

Looks like a HW issue:

/system1/log1/record21
  Targets
  Properties
    number=21
    severity=Critical
    date=04/19/2020
    time=03:22:39
    description=Server Critical Fault (Service Information: Runtime Fault, Processor(Intel),  Processor 1 (02h))
  Verbs
    cd version exit show
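
As a rough illustration of how a record like the one above could be read programmatically, here is a minimal Python sketch that collects the `key=value` lines into a dictionary. The function name `parse_ilo_record` is hypothetical, and the sketch assumes the record layout is exactly what is pasted in this task; it is not an official iLO parser.

```python
def parse_ilo_record(text):
    """Collect the key=value lines of an iLO log record into a dict."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" in line:
            # Split on the first "=" only, so values may contain "=".
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props


# Record text copied from the task description.
record = """\
/system1/log1/record21
  Targets
  Properties
    number=21
    severity=Critical
    date=04/19/2020
    time=03:22:39
    description=Server Critical Fault (Service Information: Runtime Fault, Processor(Intel),  Processor 1 (02h))
  Verbs
    cd version exit show
"""

props = parse_ilo_record(record)
print(props["severity"], props["date"], props["time"])
```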

This server is under warranty.
The backups that run from this host are x1 and s2.

Event Timeline

Marostegui triaged this task as High priority.
Marostegui added projects: ops-eqiad, DC-Ops.
Marostegui added subscribers: jcrespo, wiki_willy.

This server is apparently fully broken: it is not powering on :-(. I have tried multiple combinations:

</>hpiLO-> power on

status=0
status_tag=COMMAND COMPLETED
Sun Apr 19 05:52:58 2020



Server powering on .......


</>hpiLO-> power

status=0
status_tag=COMMAND COMPLETED
Sun Apr 19 05:54:29 2020



power: server power is currently: Off

</>hpiLO-> power reset

status=2
status_tag=COMMAND PROCESSING FAILED
error_tag=COMMAND ERROR-UNSPECIFIED
Sun Apr 19 05:55:27 2020

Server power off.




</>hpiLO-> power

status=0
status_tag=COMMAND COMPLETED
Sun Apr 19 05:55:48 2020



power: server power is currently: Off

</>hpiLO-> power off hard

status=2
status_tag=COMMAND PROCESSING FAILED
error_tag=COMMAND ERROR-UNSPECIFIED
Sun Apr 19 05:56:51 2020

Server power already off.




</>hpiLO-> power on

status=0
status_tag=COMMAND COMPLETED
Sun Apr 19 05:56:57 2020



Server powering on .......



</>hpiLO-> power

status=0
status_tag=COMMAND COMPLETED
Sun Apr 19 05:58:22 2020



power: server power is currently: Off

We should probably contact HP and get a new motherboard. Assigning to @jcrespo so he can decide on the next steps for this backup source host.
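
The telling pattern in the session above is that `power on` returns `status=0` (command completed) while the follow-up `power` query still reports `Off`. A minimal Python sketch of that check follows. The helper names are hypothetical, and the state string `On` is an assumption, since only `Off` appears in this task.

```python
import re


def parse_ilo_response(text):
    """Return (status, status_tag, power_state) from one iLO CLI response."""
    status = re.search(r"^status=(\d+)$", text, re.MULTILINE)
    tag = re.search(r"^status_tag=(.+)$", text, re.MULTILINE)
    power = re.search(r"server power is currently: (\w+)", text)
    return (
        int(status.group(1)) if status else None,
        tag.group(1).strip() if tag else None,
        power.group(1) if power else None,
    )


def powered_on(power_on_response, power_query_response):
    """True only if 'power on' completed AND the later query reports On."""
    on_status, _, _ = parse_ilo_response(power_on_response)
    _, _, state = parse_ilo_response(power_query_response)
    # "On" is an assumed state string; the task only ever shows "Off".
    return on_status == 0 and state == "On"


# Responses copied from the session above.
power_on = """status=0
status_tag=COMMAND COMPLETED
Sun Apr 19 05:52:58 2020

Server powering on .......
"""
power_query = """status=0
status_tag=COMMAND COMPLETED
Sun Apr 19 05:54:29 2020

power: server power is currently: Off
"""

print(powered_on(power_on, power_query))
```

The point of checking both responses is that a completed command alone says nothing about whether the host actually came up.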

Change 590255 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1140: Disable notifications

https://gerrit.wikimedia.org/r/590255

Change 590255 merged by Marostegui:
[operations/puppet@production] db1140: Disable notifications

https://gerrit.wikimedia.org/r/590255

Change 590995 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Add s2, x1 to db1095 (eqiad backup source)

https://gerrit.wikimedia.org/r/590995

Change 590995 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Add s2, x1 to db1095 (eqiad backup source)

https://gerrit.wikimedia.org/r/590995

Mentioned in SAL (#wikimedia-operations) [2020-04-20T08:09:44Z] <jynus> restarting s3 instance on db1095 to reduce its buffer pool T250602

Change 590999 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Move s2, x1 eqiad backups to db1095

https://gerrit.wikimedia.org/r/590999

Change 590999 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Move s2, x1 eqiad backups to db1095

https://gerrit.wikimedia.org/r/590999

Service failover done; this is now 100% on the DC-Ops side for vendor handling, as per the description and initial triage by Manuel. No production service runs there now, and the data is useless (the host will be reimaged), so it can be handled freely by DC-Ops (although please note there is private data on these disks).

Change 591047 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Enable notifications on db1095 after failover from db1140

https://gerrit.wikimedia.org/r/591047

Cmjohnson added a subscriber: Cmjohnson.

@Marostegui @jynus Before I can start a ticket with HPE some local troubleshooting has to be done.

Mentioned in SAL (#wikimedia-operations) [2020-04-21T05:34:11Z] <marostegui> Add db1095:3312, db1095:3320 to tendril - T250602

Change 591047 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Enable notifications on db1095 after failover from db1140

https://gerrit.wikimedia.org/r/591047

I pulled all power plugs and reseated the PSUs, DIMM and CPU. The server will not power on; the LEDs are flashing orange and red.

@Marostegui this could be down for a bit, between HPE troubleshooting and getting a tech on-site.

I will be the first contact for this server (although Manuel will be around if needed, ofc :-D). @Cmjohnson, we are aware; that is why we migrated the service away, as it couldn't wait for the repair. So this is not an immediate issue. However, the reduced redundancy is not ideal long term; we will budget additional servers for the next fiscal year. Thanks for taking care of it!

Cmjohnson added a subscriber: Jclark-ctr.

@Jclark-ctr can you start the process with HPE please.

LEDs flashing in a sequence of 2, most likely a processor error.

Case Reference ID: 5346998524
Status: Case is generated and in Progress
Product: HPE ProLiant DL360 Gen10 8SFF Configure-to-order Server
Product number: 867959-B21
Serial number: MXQ91300JD

Updated the ticket with HP. Scheduled a main board replacement.

Downtime expired - I have acked the alerts in Icinga

Main board replaced today; entered the password and management address into the server.

Per my IRC chat with John, assigning this back to Jaime as the on-site part is done
Thank you John!

Per my IRC chat with John

Could you tell me more? A processor error was mentioned before, but then the board was changed.

My chat was more like a heads up from John that the mainboard was replaced and to check if we are able to connect back to the host.
I would assume that the vendor decided just to replace the whole mainboard rather than just the CPU (this is a guess)

Change 597780 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Prepare db1140 for reimage to buster

https://gerrit.wikimedia.org/r/597780

Change 597780 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Prepare db1140 for reimage to buster

https://gerrit.wikimedia.org/r/597780

Hi,

I cannot reinstall the server because the remote IPMI interface doesn't work (and the SSH and HTTPS accesses, which are enabled, don't accept my password). It looks like the password wasn't set up correctly after the reset, but common default user/password combinations don't work either.

I could restart the server and change it from the POST utilities, but I cannot connect to the serial interface through SSH/HTTPS to do so; it needs local access :-(.

Hi, @wiki_willy I just want to ping you so your team is aware that the maintenance here didn't complete correctly and that we need more onsite help (I don't need this fast, just making sure it doesn't fall under the radar).

Thanks @jcrespo. I don't think @Jclark-ctr has been onsite at the data center since the last update, but I'll follow up with him on this when he's out there this week. Thanks, Willy

Corrected the user name. Jcrespo confirmed he is able to log in.

@Jclark-ctr One last thing: this was not an issue for me because I had remote login and could fix it myself, but it may be interesting for you. Remote IPMI was disabled, so I had to enable it on the management interface, following the documented procedure.

Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts:

['db1140.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202006010810_jynus_86215.log.

@Jclark-ctr Serial port redirection doesn't work. This is a blocker because I cannot read the console output on restart to understand why it is not booting from the network when requested over IPMI (it just rebooted from the hard drive).

This host does not seem to have the regular WMF setup (serial port, initial setup). Could you please have a look so the net installer works?
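
For context on what a network boot request over IPMI typically looks like, here is a hedged Python sketch that only builds the `ipmitool` command lines (it does not run them). The management hostname and user are hypothetical placeholders, and this is not necessarily what `wmf-auto-reimage` does internally. The `-E` flag makes `ipmitool` read the password from the `IPMI_PASSWORD` environment variable.

```python
import shlex


def ipmi_pxe_commands(mgmt_host, user="root"):
    """Build ipmitool commands to request PXE on next boot, then power cycle."""
    base = ["ipmitool", "-I", "lanplus", "-H", mgmt_host, "-U", user, "-E"]
    return [
        base + ["chassis", "bootdev", "pxe"],   # next boot from network
        base + ["chassis", "power", "cycle"],   # restart to apply it
    ]


# "db1140.mgmt.example" and "root" are placeholders, not the real values.
for cmd in ipmi_pxe_commands("db1140.mgmt.example"):
    print(shlex.join(cmd))
```

Without working serial redirection there is no way to see whether the boot device request was honoured, which is why the console issue is the blocker here.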

Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts:

['db1140.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202006021417_jynus_99411.log.

Change 601744 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] install_server: Update NIC hw address for db1140

https://gerrit.wikimedia.org/r/601744

Change 601744 merged by Jcrespo:
[operations/puppet@production] install_server: Update NIC hw address for db1140

https://gerrit.wikimedia.org/r/601744

Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts:

['db1140.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202006021438_jynus_117969.log.

Jclark changed the serial port and we now have output on the serial console, including POST. The other issue was the MAC address changing because of the board change. After these two changes, the reimage went through successfully. I will take care of the remaining software steps.

I think the reason this took so long and needed several failure-correction steps is that there is no documented procedure/checklist for DC Ops covering what to update on HP hosts after a board replacement (network updates, installer updates, BIOS changes, management changes, IPMI changes, passwords, etc.). Even if it is not a common operation, maybe this ticket could be used as a starting point to document that, so future maintenance goes more smoothly :-D. CC willy
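
As a possible starting point, the items that came up in this task could be captured as a small checklist. The sketch below is only an illustration in Python: the keys and wording are my own summary of the problems hit here, not an existing DC Ops procedure.

```python
# Items drawn from the problems encountered in this task; wording is illustrative.
CHECKLIST = [
    ("passwords", "Management user name and password set and verified by remote login"),
    ("ipmi", "Remote IPMI enabled on the management interface"),
    ("serial", "Serial port redirection configured so POST output reaches the console"),
    ("network", "New NIC MAC address recorded in the installer configuration"),
    ("bios", "BIOS settings reviewed after the board change"),
]


def pending(done):
    """Return the descriptions of checklist items not yet marked done, in order."""
    return [desc for key, desc in CHECKLIST if key not in done]


# Example: only the first two items have been handled so far.
for item in pending({"passwords", "ipmi"}):
    print("TODO:", item)
```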

Completed auto-reimage of hosts:

['db1140.eqiad.wmnet']

and were ALL successful.

Thanks @jcrespo, our documentation looks to be a bit outdated, so we'll get this added in.

jcrespo reassigned this task from jcrespo to Jclark-ctr.

db1140 has been repopulated from dbprov snapshots of s1 and s6 and upgraded to 10.4.

It has been added to tendril and zarcillo, it appears on prometheus.

Replication is catching up on s1; s6 already did.

I will now try backups and snapshots of these sections, even if we still have backup sources on 10.1 for both s1 and s6.

Change 602044 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] Revert "db1140: Disable notifications"

https://gerrit.wikimedia.org/r/602044

Change 602044 merged by Jcrespo:
[operations/puppet@production] Revert "db1140: Disable notifications"

https://gerrit.wikimedia.org/r/602044