lvs2002 random shut down
Closed, Resolved · Public

Description

lvs2002 suddenly went down by itself (IRC timestamps are in PDT):

15:02 < icinga-wm> PROBLEM - Host lvs2002 is DOWN: PING CRITICAL - Packet loss = 100%

I connected to the mgmt console ("vsp", since this is an HP server), but I saw nothing, and the server stayed down for at least 5 minutes.

Then I power-cycled it with "power reset", and after a short wait it came back as normal and booted straight to the login prompt.

15:09 < mutante> !log power cycling lvs2002, it was down and console showed nothing
15:12 < icinga-wm> RECOVERY - Host lvs2002 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms

@Volans pointed out:

15:30 < volans> mutante: since all the kern.log logs it's looping every minute on an USB device flapping
15:30 < volans> Apr  3 22:26:16 lvs2002 kernel: [ 1044.983217] usb 3-1.3: USB disconnect, device number 19
15:30 < volans> Apr  3 22:26:16 lvs2002 kernel: [ 1045.407124] usb 3-1.3: new high-speed USB device number 20 using ehci-pci
15:30 < volans> Apr  3 22:26:17 lvs2002 kernel: [ 1045.499426] usb 3-1.3: New USB device found, idVendor=0424, idProduct=2660
15:30 < volans> Apr  3 22:26:17 lvs2002 kernel: [ 1045.499429] usb 3-1.3: New USB device strings: Mfr=0, Product=0, SerialNumber=0
15:30 < volans> Apr  3 22:26:17 lvs2002 kernel: [ 1045.499577] hub 3-1.3:1.0: USB hub found
15:30 < volans> Apr  3 22:26:17 lvs2002 kernel: [ 1045.499673] hub 3-1.3:1.0: 2 ports detected
15:31 < volans> according to a quick google search should be a "Standard Microsystems Corp. Hub" from idVendor and idProduct
15:31 < mutante> ah, i see, yea, there's a lot of that from hours 
15:31 < volans> from days
15:31 < volans> anyway it doesn't look too healthy as a behaviour ;)
15:31 < mutante> sounds like it ends in "replace mainboard" already
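
To put a number on the flapping described above, the disconnect events can be counted per USB port. The following is a minimal sketch, not something that was run for this task: the function name `count_usb_disconnects` and the regular expression are illustrative assumptions, and it only understands lines shaped like the kern.log excerpt quoted above.

```python
import re
from collections import Counter

# Matches kernel log lines shaped like the excerpt in this task, e.g.
#   Apr  3 22:26:16 lvs2002 kernel: [ 1044.983217] usb 3-1.3: USB disconnect, device number 19
# The pattern is an assumption based only on that excerpt.
DISCONNECT_RE = re.compile(
    r"kernel: \[\s*[\d.]+\] usb (\d+-[\d.]+): USB disconnect, device number \d+"
)


def count_usb_disconnects(lines):
    """Return a Counter mapping USB port path (e.g. '3-1.3') to disconnect count."""
    counts = Counter()
    for line in lines:
        match = DISCONNECT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two lines copied from the excerpt above: one disconnect, one re-enumeration.
sample = [
    "Apr  3 22:26:16 lvs2002 kernel: [ 1044.983217] usb 3-1.3: USB disconnect, device number 19",
    "Apr  3 22:26:16 lvs2002 kernel: [ 1045.407124] usb 3-1.3: new high-speed USB device number 20 using ehci-pci",
]
print(count_usb_disconnects(sample))
```

On a real host the same function could be fed the lines of the kernel log file (its location depends on the distribution and syslog setup); a port that accumulates roughly one disconnect per minute would match the pattern volans described.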

Event Timeline

Dzahn created this task. · Apr 3 2017, 10:41 PM
Restricted Application added a subscriber: Aklapper. · Apr 3 2017, 10:41 PM
ema triaged this task as Medium priority. · Apr 4 2017, 10:14 AM
ema moved this task from Triage to LoadBalancer on the Traffic board.
ema added a subscriber: ema.
Dzahn added a comment. · Apr 5 2017, 6:01 PM

It went down again:

11:02 < icinga-wm> PROBLEM - Host lvs2002 is DOWN: PING CRITICAL - Packet loss = 100%

Dzahn added a comment. · Apr 5 2017, 6:05 PM

< bblack> perhaps power it off to make sure it doesn't blip back on, for now

Server Power: On

</>hpiLO-> power off

status=0
status_tag=COMMAND COMPLETED
Wed Apr  5 18:03:48 2017



Server powering off .......
Dzahn raised the priority of this task from Medium to High. · Apr 5 2017, 6:05 PM
Dzahn added a subscriber: Papaul. · Apr 5 2017, 6:08 PM

@Papaul Could you take a look at this? It seems we might have to call HP. We should make this a priority since we'll soon be moving all our traffic to codfw temporarily.

ayounsi added a subscriber: ayounsi. · Apr 5 2017, 7:43 PM

Pushed the following to cr1/2.codfw.

When lvs2002 comes back online for troubleshooting it should not receive any traffic.

[edit routing-options rib inet6.0 static route 2620:0:860:ed1a::2:0/111]
-     next-hop 2620:0:860:101:10:192:1:2;
      readvertise;
      no-resolve;
+     next-hop 2620:0:860:102:10:192:17:5;
+     readvertise;
+     no-resolve;
[edit routing-options static route 208.80.153.240/28]
-    next-hop 10.192.1.2;
     readvertise;
     no-resolve;
+    next-hop 10.192.17.5;
+    readvertise;
+    no-resolve;
[edit policy-options policy-statement LVS_import]
     term secondary { ... }
+    /* T162099 */
+    term temporary {
+        from {
+            protocol bgp;
+            neighbor 10.192.1.2;
+        }
+        then {
+            metric add 20;
+        }
+    }
     term service_IPs { ... }
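
One rough way to read the `temporary` term above: adding 20 to the metric (MED) of routes learned from 10.192.1.2 makes them lose to the same prefix learned from another neighbor, provided every earlier BGP tie-breaker is equal. The sketch below is a simplified illustration only. The base metric values are hypothetical, treating 10.192.17.5 as a second BGP neighbor is an assumption, and real BGP path selection has several steps before MED that are not modelled here.

```python
from dataclasses import dataclass

# Values taken from the `temporary` term in the config diff above.
DEPRIORITIZED_NEIGHBOR = "10.192.1.2"
METRIC_PENALTY = 20


@dataclass(frozen=True)
class Route:
    neighbor: str
    metric: int  # MED as received; the numbers used below are hypothetical


def apply_temporary_term(route):
    """Mimic 'metric add 20' for routes learned from the deprioritized neighbor."""
    if route.neighbor == DEPRIORITIZED_NEIGHBOR:
        return Route(route.neighbor, route.metric + METRIC_PENALTY)
    return route


def best_route(routes):
    """Pick the lowest-metric route, assuming all earlier tie-breakers are equal."""
    return min((apply_temporary_term(r) for r in routes), key=lambda r: r.metric)


# Hypothetical metrics: without the penalty 10.192.1.2 would win (100 < 110).
candidates = [Route("10.192.1.2", 100), Route("10.192.17.5", 110)]
print(best_route(candidates).neighbor)
```

With these made-up numbers the penalized route ends up at 120, so the other neighbor is preferred even if lvs2002 starts announcing again.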
Papaul claimed this task. · Apr 5 2017, 10:57 PM
Papaul added a comment. · Apr 6 2017, 6:42 PM

Will have a replacement board tomorrow between 10:00 AM and 1:30 PM.

Dear Mr Papaul Tshibamba,

Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below.

Your request is being worked on under reference number 5318706460
Status: Case is generated and in Progress

Product description: HP ProLiant DL360p Gen8 4 LFF Configure-to-order Server
Product number: 655651-B21
Serial number: MXQ428048K
Subject: DL360p Gen8 - Server reboots every 1 hour with SD card error

Yours sincerely,
Hewlett Packard Enterprise

Papaul added a comment. · Apr 7 2017, 9:33 PM

The HP guy called around 12:30 PM to let me know that he was at UPS and hadn't received the main board yet; they told him that the truck would be there between 1 and 3 PM. The appointment, once again, was for 12:30 PM today and did not happen. Since it is Friday and traffic is very bad on this side of town on Fridays, I couldn't stay after 3 PM and had to reschedule for Monday.

@Rob once again, HP didn't keep their appointment.

Papaul added a subscriber: elukey. · Apr 10 2017, 5:19 PM

Main board replacement is complete on lvs2002 and the system is back up. @elukey please check that everything is okay while I am on site.
Thanks.

Papaul reassigned this task from Papaul to elukey. · Apr 10 2017, 5:29 PM
BBlack claimed this task. · Apr 10 2017, 5:32 PM
BBlack added a subscriber: BBlack.

Switching this to me

BBlack reassigned this task from BBlack to ayounsi. · Apr 10 2017, 6:11 PM

@Papaul Everything looks good with lvs2002 (checked icinga, interfaces on correct vlans, etc).

@ayounsi Let's let it burn in with no traffic until tomorrow sometime, then sync up on reverting the router config hacks and watching the traffic come back to it.

faidon added a subscriber: faidon. · Apr 13 2017, 11:41 AM

> @ayounsi Let's let it burn in with no traffic until tomorrow sometime, then sync up on reverting the router config hacks and watching the traffic come back to it.

I reverted these today while deploying another change.

ayounsi reassigned this task from ayounsi to BBlack. · Apr 13 2017, 1:05 PM

Moving that one back to Brandon

BBlack closed this task as Resolved. · Oct 23 2017, 4:18 PM