db1140 (backup source) crashed
Closed, Resolved, Public

Assigned To

Authored By

	• Marostegui
	Apr 19 2020, 5:51 AM

Description

Times in UTC:

03:26:26 <+icinga-wm> PROBLEM - Host db1140 is DOWN: PING CRITICAL - Packet loss = 100%

iLO Standard 1.40 at  Feb 05 2019
Server Name:
Server Power: Off

</>hpiLO->

Looks like a HW issue:

/system1/log1/record21
  Targets
  Properties
    number=21
    severity=Critical
    date=04/19/2020
    time=03:22:39
    description=Server Critical Fault (Service Information: Runtime Fault, Processor(Intel),  Processor 1 (02h))
  Verbs
    cd version exit show

This server is under warranty.
The backups that run from this host are x1 and s2.
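
The iLO log record quoted above is a flat list of `key=value` properties, so it can be triaged mechanically. The following is a minimal sketch, assuming the record text has already been copied out of the iLO session; the helper names `parse_ilo_record` and `is_critical_cpu_fault` are hypothetical and not part of any WMF or HPE tooling.

```python
def parse_ilo_record(text):
    """Collect the key=value properties of one iLO event-log record into a dict."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" in line:
            # Split on the first "=" only, so values may contain "=" themselves.
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props


def is_critical_cpu_fault(props):
    """True when the record is Critical and its description mentions a processor."""
    return (props.get("severity") == "Critical"
            and "Processor" in props.get("description", ""))


# Record text as pasted in the task description.
RECORD = """\
/system1/log1/record21
  Targets
  Properties
    number=21
    severity=Critical
    date=04/19/2020
    time=03:22:39
    description=Server Critical Fault (Service Information: Runtime Fault, Processor(Intel),  Processor 1 (02h))
  Verbs
    cd version exit show
"""

record = parse_ilo_record(RECORD)
print(record["severity"], record["date"], record["time"])
print("CPU fault:", is_critical_cpu_fault(record))
```

Note that the fault was logged at 03:22:39, a few minutes before the Icinga alert at 03:26:26.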

Details

Subject	Repo	Branch	Lines +/-
Revert "db1140: Disable notifications"	operations/puppet	production	+0 -1
db1140: Disable notifications	operations/puppet	production	+1 -0
install_server: Update NIC hw address for db1140	operations/puppet	production	+1 -1
mariadb-backups: Prepare db1140 for reimage to buster	operations/puppet	production	+3 -5
mariadb-backups: Enable notifications on db1095 after failover from db1140	operations/puppet	production	+0 -1
mariadb-backups: Move s2, x1 eqiad backups to db1095	operations/puppet	production	+4 -4
mariadb-backups: Add s2, x1 to db1095 (eqiad backup source)	operations/puppet	production	+5 -2


Related Objects

Mentioned In: T254272: Update Documentation for dl360 Motherboard Swap

Event Timeline

• Marostegui created this task. Apr 19 2020, 5:51 AM

Restricted Application added a subscriber: Aklapper. Apr 19 2020, 5:51 AM

Mentioned in SAL (#wikimedia-operations) [2020-04-19T05:51:57Z] <marostegui> Power back on db1140 T250602

This server is fully broken apparently, it is not powering ON :-( - I have tried multiple combinations:

</>hpiLO-> power on

status=0
status_tag=COMMAND COMPLETED
Sun Apr 19 05:52:58 2020



Server powering on .......


</>hpiLO-> power

status=0
status_tag=COMMAND COMPLETED
Sun Apr 19 05:54:29 2020



power: server power is currently: Off

</>hpiLO-> power reset

status=2
status_tag=COMMAND PROCESSING FAILED
error_tag=COMMAND ERROR-UNSPECIFIED
Sun Apr 19 05:55:27 2020

Server power off.




</>hpiLO-> power

status=0
status_tag=COMMAND COMPLETED
Sun Apr 19 05:55:48 2020



power: server power is currently: Off

</>hpiLO-> power off hard

status=2
status_tag=COMMAND PROCESSING FAILED
error_tag=COMMAND ERROR-UNSPECIFIED
Sun Apr 19 05:56:51 2020

Server power already off.




</>hpiLO-> power on

status=0
status_tag=COMMAND COMPLETED
Sun Apr 19 05:56:57 2020



Server powering on .......



</>hpiLO-> power

status=0
status_tag=COMMAND COMPLETED
Sun Apr 19 05:58:22 2020



power: server power is currently: Off

We should probably contact HP and get a new motherboard. Assigning to @jcrespo so he can decide on the next steps for this backup source host.
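
The pattern in the transcript above (a `power on` that returns `status=0`, followed by a `power` query that still reports `Off`) can be detected from the reply text alone. This is a minimal sketch under that assumption; `parse_ilo_response` and `power_on_failed` are hypothetical helpers, and the sample replies are copied from the session above rather than fetched from a live iLO.

```python
import re


def parse_ilo_response(text):
    """Extract the status fields and the reported power state from one iLO reply."""
    fields = dict(re.findall(r"^(status|status_tag|error_tag)=(.*)$",
                             text, flags=re.MULTILINE))
    state = re.search(r"server power is currently: (\w+)", text)
    return {
        "status": int(fields["status"]),
        "status_tag": fields.get("status_tag"),
        "error_tag": fields.get("error_tag"),
        "power_state": state.group(1) if state else None,
    }


def power_on_failed(power_on_reply, power_query_reply):
    """True when 'power on' was accepted but the host still reports Off afterwards."""
    accepted = parse_ilo_response(power_on_reply)["status"] == 0
    still_off = parse_ilo_response(power_query_reply)["power_state"] == "Off"
    return accepted and still_off


POWER_ON_REPLY = """\
status=0
status_tag=COMMAND COMPLETED
Sun Apr 19 05:52:58 2020

Server powering on .......
"""

POWER_QUERY_REPLY = """\
status=0
status_tag=COMMAND COMPLETED
Sun Apr 19 05:54:29 2020

power: server power is currently: Off
"""

POWER_RESET_REPLY = """\
status=2
status_tag=COMMAND PROCESSING FAILED
error_tag=COMMAND ERROR-UNSPECIFIED
Sun Apr 19 05:55:27 2020

Server power off.
"""

print("power on failed:", power_on_failed(POWER_ON_REPLY, POWER_QUERY_REPLY))
print("reset reply:", parse_ilo_response(POWER_RESET_REPLY))
```

The `status=2` replies to `power reset` and `power off hard` are consistent with the host already being off, so the signal that matters here is the accepted `power on` that never changes the reported state.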

• Marostegui moved this task from Triage to In progress on the DBA board. Apr 19 2020, 6:00 AM

Change 590255 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1140: Disable notifications

https://gerrit.wikimedia.org/r/590255

gerritbot added a project: Patch-For-Review. Apr 19 2020, 7:47 AM

Change 590255 merged by Marostegui:
[operations/puppet@production] db1140: Disable notifications

https://gerrit.wikimedia.org/r/590255

Maintenance_bot removed a project: Patch-For-Review. Apr 19 2020, 8:10 AM

Change 590995 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Add s2, x1 to db1095 (eqiad backup source)

https://gerrit.wikimedia.org/r/590995

gerritbot added a project: Patch-For-Review. Apr 20 2020, 7:47 AM

Change 590995 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Add s2, x1 to db1095 (eqiad backup source)

https://gerrit.wikimedia.org/r/590995

Mentioned in SAL (#wikimedia-operations) [2020-04-20T08:09:44Z] <jynus> restarting s3 instance on db1095 to reduce its buffer pool T250602

Maintenance_bot removed a project: Patch-For-Review. Apr 20 2020, 8:10 AM

Change 590999 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Move s2, x1 eqiad backups to db1095

https://gerrit.wikimedia.org/r/590999

gerritbot added a project: Patch-For-Review. Apr 20 2020, 8:20 AM

Change 590999 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Move s2, x1 eqiad backups to db1095

https://gerrit.wikimedia.org/r/590999

Maintenance_bot removed a project: Patch-For-Review. Apr 20 2020, 1:10 PM

Service failover is done; this is now 100% on the DC-Ops side for vendor handling, as per the description and initial triage by Manuel. No production service runs there now, and the data is useless (the host will be reimaged), so it can be handled freely by DC-Ops (although please note there is private data on these disks).

Change 591047 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Enable notifications on db1095 after failover from db1140

https://gerrit.wikimedia.org/r/591047

gerritbot added a project: Patch-For-Review. Apr 20 2020, 1:18 PM

@Marostegui @jynus Before I can start a ticket with HPE some local troubleshooting has to be done.

• Cmjohnson moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board. Apr 20 2020, 1:56 PM

Mentioned in SAL (#wikimedia-operations) [2020-04-21T05:34:11Z] <marostegui> Add db1095:3312, db1095:3320 to tendril - T250602

Change 591047 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Enable notifications on db1095 after failover from db1140

https://gerrit.wikimedia.org/r/591047

Maintenance_bot removed a project: Patch-For-Review. Apr 21 2020, 9:10 AM

I pulled all power plugs and reseated the PSUs, DIMMs and CPU. The server will not power on; the LEDs are flashing orange and red.

@Marostegui this could be down for a bit, between HPE troubleshooting and getting a tech on-site.

RhinosF1 subscribed. Apr 21 2020, 4:13 PM

I will be the first contact for this server (although Manuel will be around if needed, ofc :-D). @Cmjohnson we are aware; that is why we migrated the service away, as it couldn't wait for the repair/etc. So that is not an immediate issue. However, the reduced redundancy is not ideal long term; we will budget additional servers for the next fiscal year. Thanks for taking care of it!

@Jclark-ctr can you start the process with HPE, please?

LEDs flashing in a sequence of 2, most likely a processor error.

Case Reference ID: 5346998524
Status: Case is generated and in Progress
Product: HPE ProLiant DL360 Gen10 8SFF Configure-to-order Server
Product number: 867959-B21
Serial number: MXQ91300JD

Updated the ticket with HP. Scheduled main board replacement.

Downtime expired - I have acked the alerts in Icinga

Main board replaced today; entered the password and management address into the server.

Per my IRC chat with John, assigning this back to Jaime as the on-site part is done
Thank you John!

Per my IRC chat with John

Could you tell me more, as before a processor error was mentioned, but then a board change?

In T250602#6155334, @jcrespo wrote:

Per my IRC chat with John

Could you tell me more, as before a processor error was mentioned, but then a board change?

My chat was more like a heads up from John that the mainboard was replaced and to check if we are able to connect back to the host.
I would assume that the vendor decided just to replace the whole mainboard rather than just the CPU (this is a guess)

Change 597780 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Prepare db1140 for reimage to buster

https://gerrit.wikimedia.org/r/597780

gerritbot added a project: Patch-For-Review. May 21 2020, 2:00 PM

Change 597780 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Prepare db1140 for reimage to buster

https://gerrit.wikimedia.org/r/597780

Hi,

I cannot reinstall the server because the remote IPMI interface doesn't work (and the ssh and https accesses, which are enabled, don't accept my password). It looks like the password wasn't set up correctly after the reset, but common default user/password combinations don't work either.

I could restart the server and change it from the POST utilities, but I cannot connect to the serial interface through ssh/https to be able to do so, so this needs local access :-(.

Maintenance_bot removed a project: Patch-For-Review. May 21 2020, 4:11 PM

Hi, @wiki_willy I just want to ping you so your team is aware that the maintenance here didn't complete correctly and that we need more onsite help (I don't need this fast, just making sure it doesn't fall under the radar).

Thanks @jcrespo . I don't think @Jclark-ctr has been onsite at the data center since the last update, but I'll follow up with him on this when he's out there this week. Thanks, Willy

Corrected the user name. Jcrespo confirmed he is able to log in.

jcrespo claimed this task. May 28 2020, 2:20 PM

@Jclark-ctr One last thing: this was not an issue for me because I had remote login and could fix it myself, but it may be interesting for you: remote IPMI was disabled, so I had to enable it on the management interface, following the documented procedure.

Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts:

['db1140.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202006010810_jynus_86215.log.

@Jclark-ctr Serial port redirection doesn't work. This is a blocker because I cannot read the console output on restart, and understand why it is not booting from network when requested on ipmi (it just rebooted into hard drive).

One last thing: this was not an issue for me because I had remote login and could fix it myself, but it may be interesting for you: remote IPMI was disabled, so I had to enable it on the management interface, following the documented procedure.

This host seems to not have the regular WMF setup (serial port, initial setup), could you please have a look so the net installer works?

Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts:

['db1140.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202006021417_jynus_99411.log.

Change 601744 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] install_server: Update NIC hw address for db1140

https://gerrit.wikimedia.org/r/601744

gerritbot added a project: Patch-For-Review. Jun 2 2020, 2:30 PM

Change 601744 merged by Jcrespo:
[operations/puppet@production] install_server: Update NIC hw address for db1140

https://gerrit.wikimedia.org/r/601744

Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts:

['db1140.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202006021438_jynus_117969.log.

Jclark changed the serial port and we now have output on the serial console, including POST. The other issue was the MAC address changing because of the board change. After the 2 changes, the reimage went through successfully. I will take care of the remaining sw steps.

I think the reason this took so long, and needed several failure-correction steps, is that there is no documented procedure/checklist for what DC Ops has to update on HP hosts after a board replacement (network updates, installer updates, BIOS changes, management changes, IPMI changes, passwords, etc.). Even if it is not a common operation, maybe this ticket could be used as a starting point to document that so future maintenances are smoother :-D. CC willy
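
As a starting point for such a checklist, the items that actually blocked the reimage in this task can be written down as data. This is only a sketch: the `Check` class and helper names are hypothetical, and the list covers only what came up in this thread, not a complete DC Ops procedure.

```python
from dataclasses import dataclass


@dataclass
class Check:
    """One item to verify after a main board replacement."""
    name: str
    done: bool = False


def new_board_swap_checklist():
    """Items that needed fixing on db1140 after the board swap (per this task)."""
    return [
        Check("management (iLO) user name and password set"),
        Check("management address configured"),
        Check("remote IPMI enabled on the management interface"),
        Check("serial console redirection shows output, including POST"),
        Check("installer entry updated with the new NIC hardware address"),
    ]


def pending(checks):
    """Names of the checks that are still open."""
    return [c.name for c in checks if not c.done]


checks = new_board_swap_checklist()
checks[0].done = True
checks[1].done = True
for name in pending(checks):
    print("TODO:", name)
```

Keeping the list as data rather than prose makes it easy to report which steps are still open before handing a host back for reimaging.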

Completed auto-reimage of hosts:

['db1140.eqiad.wmnet']

and were ALL successful.

Maintenance_bot removed a project: Patch-For-Review. Jun 2 2020, 3:11 PM

Thanks @jcrespo , our documentation looks to be a bit outdated, so we'll get this added in

In T250602#6185325, @jcrespo wrote:

Jclark changed the serial port and we now have output on the serial console, including POST. The other issue was the MAC address changing because of the board change. After the 2 changes, the reimage went through successfully. I will take care of the remaining sw steps.

I think the reason this took so long, and needed several failure-correction steps, is that there is no documented procedure/checklist for what DC Ops has to update on HP hosts after a board replacement (network updates, installer updates, BIOS changes, management changes, IPMI changes, passwords, etc.). Even if it is not a common operation, maybe this ticket could be used as a starting point to document that so future maintenances are smoother :-D. CC willy

wiki_willy mentioned this in T254272: Update Documentation for dl360 Motherboard Swap. Jun 2 2020, 6:31 PM

db1140 has been repopulated from dbprov snapshots of s1 and s6 and upgraded to 10.4.

It has been added to tendril and zarcillo, and it appears on prometheus.

Replication is catching up on s1; s6 already did.

I will now try backups and snapshots of these sections, even if we still have backup sources on 10.1 for both s1 and s6.
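
Whether a section has caught up can be read from the `Seconds_Behind_Master` field of MariaDB's `SHOW SLAVE STATUS\G` output. The sketch below parses that text; the two sample outputs use made-up lag values purely for illustration (the task only states that s6 had caught up and s1 had not), and the helper names are hypothetical.

```python
def seconds_behind(show_slave_status_text):
    """Return Seconds_Behind_Master as an int, or None if the server reports NULL."""
    for line in show_slave_status_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key.strip() == "Seconds_Behind_Master":
            value = value.strip()
            return None if value == "NULL" else int(value)
    raise ValueError("Seconds_Behind_Master not found")


def is_caught_up(show_slave_status_text, max_lag=0):
    """True when replication is running and lag is at most max_lag seconds."""
    lag = seconds_behind(show_slave_status_text)
    return lag is not None and lag <= max_lag


# Illustrative samples only: the lag values are assumptions, not taken from db1140.
S1_STATUS = """\
*************************** 1. row ***************************
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
        Seconds_Behind_Master: 5400
"""

S6_STATUS = """\
*************************** 1. row ***************************
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
        Seconds_Behind_Master: 0
"""

for section, status in (("s1", S1_STATUS), ("s6", S6_STATUS)):
    print(section, "caught up:", is_caught_up(status))
```

A `NULL` value is treated as not caught up, since MariaDB reports it when the replication threads are not running.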

Change 602044 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] Revert "db1140: Disable notifications"

https://gerrit.wikimedia.org/r/602044

gerritbot added a project: Patch-For-Review. Jun 3 2020, 10:37 AM

Change 602044 merged by Jcrespo:
[operations/puppet@production] Revert "db1140: Disable notifications"

https://gerrit.wikimedia.org/r/602044

Maintenance_bot removed a project: Patch-For-Review. Jun 3 2020, 11:10 AM
