
(OoW) lvs2006 crashed into (what it seems) an unrecoverable state
Open, Stalled, Low · Public

Description

At around 20:00 UTC on Nov 12th, lvs2006 crashed, logging the following:

Uncorrectable PCI Express Error (Embedded device, Bus 0, Device 2, Function 0, Error status 0x00000020)

Unrecoverable System Error (NMI) has occurred.  System Firmware will log additional details in a separate IML entry if possible
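
The status value in the first message can be read as a bitmask. The sketch below decodes it under one explicit assumption: that the firmware's "Error status" field uses the same bit layout as the PCIe AER Uncorrectable Error Status register. The task does not confirm this, so treat the result as a hint rather than a diagnosis. The function name `decode_status` is made up for illustration.

```python
# Bit positions follow the PCIe AER Uncorrectable Error Status register.
# ASSUMPTION: the firmware's "Error status" value uses this same layout;
# nothing in the task confirms that.
AER_UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
}


def decode_status(status):
    """Return the names of all bits set in a 32-bit status value."""
    names = []
    for bit in range(32):
        if status & (1 << bit):
            # Bits missing from the table are reported by position only.
            names.append(AER_UNCORRECTABLE_BITS.get(bit, f"bit {bit} (not in table)"))
    return names


if __name__ == "__main__":
    print(decode_status(0x00000020))
```

Under that assumption, 0x00000020 has only bit 5 set, which the table maps to "Surprise Down Error".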

I tried a power reset, and then a hard power off followed by a power on, but I can't see any output from the vsp.

Details

Related Gerrit Patches:
[operations/puppet@production] lvs: Replace lvs2006 with lvs2010
[operations/puppet@production] hieradata: Add lvs2010 specific settings
[operations/puppet@production] lvs: configure lvs2010 interfaces

Event Timeline

elukey triaged this task as High priority. · Nov 13 2018, 7:17 AM
elukey created this task.
Restricted Application added a subscriber: Aklapper. · Nov 13 2018, 7:18 AM

The system has been online since 07:30 UTC.

Change 473238 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] lvs: configure lvs2010 interfaces

https://gerrit.wikimedia.org/r/473238

Change 473238 merged by Vgutierrez:
[operations/puppet@production] lvs: configure lvs2010 interfaces

https://gerrit.wikimedia.org/r/473238

we will be replacing lvs2006 with lvs2010

Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts:

lvs2010.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811131628_vgutierrez_11175_lvs2010_codfw_wmnet.log.

Completed auto-reimage of hosts:

['lvs2010.codfw.wmnet']

Of which those FAILED:

['lvs2010.codfw.wmnet']

Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts:

lvs2010.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811131709_vgutierrez_20665_lvs2010_codfw_wmnet.log.

Completed auto-reimage of hosts:

['lvs2010.codfw.wmnet']

Of which those FAILED:

['lvs2010.codfw.wmnet']

Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts:

lvs2010.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811131710_vgutierrez_20786_lvs2010_codfw_wmnet.log.

Completed auto-reimage of hosts:

['lvs2010.codfw.wmnet']

Of which those FAILED:

['lvs2010.codfw.wmnet']

Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts:

lvs2010.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811141554_vgutierrez_13160_lvs2010_codfw_wmnet.log.

Completed auto-reimage of hosts:

['lvs2010.codfw.wmnet']

Of which those FAILED:

['lvs2010.codfw.wmnet']
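
The four log names above appear to share one pattern: a 12-digit start time, the user, a number, and the host name with dots replaced by underscores. A minimal sketch for pulling those fields out follows, assuming that pattern holds. The meaning of the numeric field is not stated in the task, so it is kept as an opaque `run_id`, and `parse_log_name` is a hypothetical helper, not part of wmf-auto-reimage.

```python
import re
from datetime import datetime

# ASSUMED layout, inferred only from the four names in this task:
#   <YYYYmmddHHMM>_<user>_<number>_<host with "_" for ".">.log
LOG_NAME = re.compile(
    r"^(?P<stamp>\d{12})_(?P<user>[^_]+)_(?P<num>\d+)_(?P<host>.+)\.log$"
)


def parse_log_name(name):
    """Split a reimage log file name into its apparent fields."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected log name: {name!r}")
    return {
        "started": datetime.strptime(match["stamp"], "%Y%m%d%H%M"),
        "user": match["user"],
        # The task does not say what this number is; keep it opaque.
        "run_id": int(match["num"]),
        "host": match["host"].replace("_", "."),
    }


if __name__ == "__main__":
    names = [
        "201811131628_vgutierrez_11175_lvs2010_codfw_wmnet.log",
        "201811131709_vgutierrez_20665_lvs2010_codfw_wmnet.log",
        "201811131710_vgutierrez_20786_lvs2010_codfw_wmnet.log",
        "201811141554_vgutierrez_13160_lvs2010_codfw_wmnet.log",
    ]
    for name in names:
        info = parse_log_name(name)
        print(info["started"].isoformat(), info["host"], info["run_id"])
```

Read this way, the file names place three failed attempts on Nov 13 (16:28, 17:09, 17:10) and one on Nov 14 (15:54).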

Change 473576 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] hieradata: Add lvs2010 specific settings

https://gerrit.wikimedia.org/r/473576

Change 473576 merged by Vgutierrez:
[operations/puppet@production] hieradata: Add lvs2010 specific settings

https://gerrit.wikimedia.org/r/473576

ema moved this task from Triage to LoadBalancer on the Traffic board. · Nov 15 2018, 10:06 AM

Change 473734 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] lvs: Replace lvs2006 with lvs2010

https://gerrit.wikimedia.org/r/473734

Vgutierrez changed the task status from Open to Stalled. · Nov 30 2018, 3:39 PM
Vgutierrez removed Papaul as the assignee of this task.
Vgutierrez added a subscriber: Papaul.

lvs2010 replacement is currently blocked by T203194

It's been blocked for some months; where are we on this?

@ArielGlenn that system is out of warranty, and the plan is to replace it with the systems in T196560

wiki_willy renamed this task from "lvs2006 crashed into (what it seems) an unrecoverable state" to "(OoW) lvs2006 crashed into (what it seems) an unrecoverable state". · Jul 15 2019, 8:45 PM
wiki_willy assigned this task to Papaul.

we will be replacing lvs2006 with lvs2010

Papaul lowered the priority of this task from High to Lowest. · Jul 17 2019, 3:43 PM
Papaul raised the priority of this task from Lowest to Low. · Sep 3 2019, 3:29 PM