(OoW) lvs2006 crashed into (what it seems) an unrecoverable state
Closed, Declined (Public)

Description

At around 20:00 UTC on Nov 12th, lvs2006 crashed, logging the following:

Uncorrectable PCI Express Error (Embedded device, Bus 0, Device 2, Function 0, Error status 0x00000020)

Unrecoverable System Error (NMI) has occurred.  System Firmware will log additional details in a separate IML entry if possible

I tried a power reset and a hard power off followed by a power on, but I can't see any output from the VSP.

Event Timeline

elukey triaged this task as High priority. Nov 13 2018, 7:17 AM
elukey created this task.

The system has been online since 07:30 UTC

Change 473238 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] lvs: configure lvs2010 interfaces

https://gerrit.wikimedia.org/r/473238

Change 473238 merged by Vgutierrez:
[operations/puppet@production] lvs: configure lvs2010 interfaces

https://gerrit.wikimedia.org/r/473238

we will be replacing lvs2006 with lvs2010

Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts:

lvs2010.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811131628_vgutierrez_11175_lvs2010_codfw_wmnet.log.

Completed auto-reimage of hosts:

['lvs2010.codfw.wmnet']

Of which those FAILED:

['lvs2010.codfw.wmnet']

Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts:

lvs2010.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811131709_vgutierrez_20665_lvs2010_codfw_wmnet.log.

Completed auto-reimage of hosts:

['lvs2010.codfw.wmnet']

Of which those FAILED:

['lvs2010.codfw.wmnet']

Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts:

lvs2010.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811131710_vgutierrez_20786_lvs2010_codfw_wmnet.log.

Completed auto-reimage of hosts:

['lvs2010.codfw.wmnet']

Of which those FAILED:

['lvs2010.codfw.wmnet']

Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts:

lvs2010.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811141554_vgutierrez_13160_lvs2010_codfw_wmnet.log.

Completed auto-reimage of hosts:

['lvs2010.codfw.wmnet']

Of which those FAILED:

['lvs2010.codfw.wmnet']

Change 473576 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] hieradata: Add lvs2010 specific settings

https://gerrit.wikimedia.org/r/473576

Change 473576 merged by Vgutierrez:
[operations/puppet@production] hieradata: Add lvs2010 specific settings

https://gerrit.wikimedia.org/r/473576

Change 473734 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] lvs: Replace lvs2006 with lvs2010

https://gerrit.wikimedia.org/r/473734

Vgutierrez changed the task status from Open to Stalled. Nov 30 2018, 3:39 PM
Vgutierrez removed Papaul as the assignee of this task.
Vgutierrez added a subscriber: Papaul.

lvs2010 replacement is currently blocked by T203194

It's been blocked for some months; where are we on this?

@ArielGlenn that system is out of warranty and the plan is to replace it with the systems in T196560

wiki_willy renamed this task from "lvs2006 crashed into (what it seems) an unrecoverable state" to "(OoW) lvs2006 crashed into (what it seems) an unrecoverable state". Jul 15 2019, 8:45 PM
wiki_willy assigned this task to Papaul.

we will be replacing lvs2006 with lvs2010

Papaul lowered the priority of this task from High to Lowest. Jul 17 2019, 3:43 PM
Papaul raised the priority of this task from Lowest to Low. Sep 3 2019, 3:29 PM

Change 473734 abandoned by Vgutierrez:
lvs: Replace lvs2006 with lvs2010

Reason:
superseded by I1a1cd3b0148a51431836989080784d40d36dc9b8

https://gerrit.wikimedia.org/r/473734

Server is decommissioned in T246329