codfw: mw2251-mw2260 rack/setup
Open, Normal, Public

Description

This task will track the racking and setup of 10 new app servers received via T151779.

mw2251-mw2253 Row A rack A4

  • - receive in and attach packing slip to parent task T151779
  • - rack systems, update racktables
  • - create mgmt dns entries (both asset tag and hostname)
  • - create production dns entries (internal vlan)
  • - update/create sub task with network port info for all new hosts
  • - install_server module update (mac address and partitioning info, partition like existing mw systems)
  • - install os
  • - puppet/salt accept
  • - hand off to @Joe for service implementation.

mw2254-mw2260 row B rack B3

  • - receive in and attach packing slip to parent task T151779
  • - rack systems, update racktables
  • - create mgmt dns entries (both asset tag and hostname)
  • - create production dns entries (internal vlan)
  • - update/create sub task with network port info for all new hosts
  • - install_server module update (mac address and partitioning info, partition like existing mw systems)
  • - install os
  • - puppet/salt accept
  • - hand off to @Joe for service implementation.
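The mgmt DNS step in the checklists above amounts to generating a record per host; a minimal sketch of that, assuming a made-up zone layout (the `10.193.2.x` addresses and record format are placeholders for illustration, not the real codfw mgmt allocations):

```shell
# Hypothetical sketch: emit mgmt-DNS A-record lines for the ten new hosts.
# The zone layout and 10.193.2.x addresses are placeholders, not the real
# codfw mgmt network.
records=$(
  i=0
  for n in $(seq 2251 2260); do
    i=$((i+1))
    printf 'mw%d.mgmt.codfw.wmnet. IN A 10.193.2.%d\n' "$n" "$i"
  done
)
printf '%s\n' "$records"
```

The same loop shape covers the asset-tag entries and the production (internal vlan) records, with the zone and address range swapped.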

Related Objects

Status     Assigned   Task
Open       Joe
Resolved   RobH
Papaul created this task. · Jan 12 2017, 4:54 PM
Restricted Application added a project: Operations. · Jan 12 2017, 4:54 PM
Restricted Application added a subscriber: Aklapper.
Papaul edited the task description. · Jan 12 2017, 4:55 PM
Papaul triaged this task as "Normal" priority. · Jan 12 2017, 5:17 PM

During installation, at the partition disks step, I get the following error:

┌───────────────────────┤ [!!] Partition disks ├────────────────────────┐
│                                                                       │
│ The attempt to mount a file system with type ext4 in RAID1 device #1  │
│ at / failed.                                                          │
│                                                                       │
│ You may resume partitioning from the partitioning menu.               │
│                                                                       │
│ Do you want to resume partitioning?                                   │
│                                                                       │
│     <Go Back>                                       <Yes>    <No>     │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘

This happens on all 10 servers.
I remember we had this same problem with our Debian installer in January 2016, almost a year ago:
https://phabricator.wikimedia.org/T125256

To confirm that our Debian installer is causing the problem, I asked Daniel to help me hack the installation configuration of mw2251, changing it from Jessie to Trusty. When running Trusty I do not see the problem; the installation completes with no issues.
@MoritzMuehlenhoff
@faidon
Please advise.
Thanks

RobH added a parent task: Unknown Object (Task).
RobH added subscribers: Southparkfan, mark, emailbot.
Dzahn added a subscriber: Dzahn. · Jan 19 2017, 6:26 AM

Yup, mw2251 switched to trusty, first manually and then in puppet (https://gerrit.wikimedia.org/r/#/c/332930/); the issue only happens on jessie, not trusty.

That happens after jessie point releases (and there was one last weekend). Until a while ago the Squid cache needed to be purged manually, but I committed some config tweaks a few months ago which should make that step obsolete. What's still needed is a rebuild of the images; quoting Faidon from IRC a while ago:

[Monday, 19 September 2016] [10:58:02] <paravoid> updating the d-i isn't that simple though, as we need to bundle firmware into the initrd
[Monday, 19 September 2016] [10:58:12] <paravoid> I have an unpuppetized script for that in my home directory on palladium

That script is now in puppetmaster1001:/home/faidon, I just ran it, /var/lib/puppet/volatile is synced via rsync every 15 minutes, so it should work soon. If it still fails, please ping me on IRC.

Papaul edited the task description. · Jan 20 2017, 5:37 AM
Papaul reassigned this task from Papaul to Joe.

@Joe installation complete.

elukey added a subscriber: elukey.
elukey moved this task from Backlog to Ops Backlog on the User-Elukey board. · Thu, Mar 2, 2:48 PM
elukey added a comment. · Fri, Mar 3, 9:54 AM

Adding a note to check for recurrences:

09:09  <icinga-wm> PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100%
09:09  <icinga-wm> RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms

elukey@asw-b-codfw> show log messages | match fpc3
Mar  3 09:07:14  asw-b-codfw fpc3 [EX-BCM PIC] ex_bcm_linkscan_handler: Link 16 DOWN
Mar  3 09:07:28  asw-b-codfw fpc3 [EX-BCM PIC] ex_bcm_linkscan_handler: Link 16 UP
Mar  3 09:09:10  asw-b-codfw fpc3 [EX-BCM PIC] ex_bcm_linkscan_handler: Link 16 DOWN
Mar  3 09:09:13  asw-b-codfw fpc3 [EX-BCM PIC] ex_bcm_linkscan_handler: Link 16 UP
elukey added a comment. · Wed, Mar 8, 3:07 PM

https://phabricator.wikimedia.org/T156023#3046855 as a reminder of the appservers status after the last rebalance.

Mentioned in SAL (#wikimedia-operations) [2017-03-08T15:08:34Z] <elukey> rebooting mw225[123] as part of sanity check for T155180

Mentioned in SAL (#wikimedia-operations) [2017-03-08T15:19:57Z] <elukey> rebooting mw22(5[4-9]|60) as part of sanity check for T155180

Hosts rebooted; verified that puppet ran correctly and executed apt-get dist-upgrade. Also verified row allocation:

{'mw2251.codfw.wmnet': '    SysName:      asw-a-codfw'}
{'mw2252.codfw.wmnet': '    SysName:      asw-a-codfw'}
{'mw2253.codfw.wmnet': '    SysName:      asw-a-codfw'}
{'mw2254.codfw.wmnet': '    SysName:      asw-b-codfw'}
{'mw2255.codfw.wmnet': '    SysName:      asw-b-codfw'}
{'mw2256.codfw.wmnet': '    SysName:      asw-b-codfw'}
{'mw2257.codfw.wmnet': '    SysName:      asw-b-codfw'}
{'mw2258.codfw.wmnet': '    SysName:      asw-b-codfw'}
{'mw2259.codfw.wmnet': '    SysName:      asw-b-codfw'}
{'mw2260.codfw.wmnet': '    SysName:      asw-b-codfw'}
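The `SysName` field above comes from LLDP neighbour data, where the top-of-rack switch names the row. A hedged recreation against a canned sample, since the exact collection command isn't recorded in the task (the sample fields stand in for e.g. `lldpctl` output on the host):

```shell
# Pull the top-of-rack switch name out of LLDP neighbour output.
# $lldp_sample stands in for real `lldpctl` output; only SysName matters.
lldp_sample='Interface: eth0, via: LLDP
    SysName:      asw-b-codfw
    PortDescr:    ge-3/0/16'
row_switch=$(printf '%s\n' "$lldp_sample" | awk '$1 == "SysName:" {print $2}')
printf '%s\n' "$row_switch"
```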

Change 342194 had a related patch set uploaded (by Elukey):
[operations/puppet] Add new MW appservers and api-appservers in codfw

https://gerrit.wikimedia.org/r/342194

Change 342194 merged by Elukey:
[operations/puppet] Add new MW appservers and api-appservers in codfw

https://gerrit.wikimedia.org/r/342194

Ran puppet, rebooted the nodes for the mw-cgroups, re-ran puppet and scap pull, pooled the nodes via conftool.

Still to check: I had to reboot mw2256 because it wasn't responding to SSH, and given that we had a similar issue on March 3rd I'd like to investigate a bit more.

Happened again today:

04:08  <icinga-wm> PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100%
04:09  <icinga-wm> RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms

Definitely something odd, mw2256 needs to be investigated.

It keeps happening; I can see a lot of EDAC errors in kern.log:

elukey@mw2256:~$ sudo grep -i EDAC /var/log/kern.log
Mar 12 15:29:09 mw2256 kernel: [40837.335365] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Mar 12 15:29:09 mw2256 kernel: [40837.335368] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 1: 940000000000009f
Mar 12 15:29:09 mw2256 kernel: [40837.335370] EDAC sbridge MC0: TSC 51cb6d97f462
Mar 12 15:29:09 mw2256 kernel: [40837.335372] EDAC sbridge MC0: ADDR 857743fc0 EDAC sbridge MC0: MISC 0
Mar 12 15:29:09 mw2256 kernel: [40837.335377] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1489332549 SOCKET 0 APIC 0
Mar 12 15:29:10 mw2256 kernel: [40837.837147] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x857743 offset:0xfc0 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:1)
[..]
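A quick way to see whether the correctable errors cluster on one DIMM is to tally the `CE memory read error` lines by their DIMM locator. A sketch below, run against the log line quoted above plus one synthetic duplicate (added only so the tally demonstrates counting):

```shell
# Tally correctable errors per DIMM locator. $log stands in for
# `sudo grep EDAC /var/log/kern.log`; the second line is synthetic.
log='Mar 12 15:29:10 mw2256 kernel: [40837.837147] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x857743 offset:0xfc0 grain:32 syndrome:0x0)
Mar 12 15:30:10 mw2256 kernel: [40897.112233] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x857744 offset:0x040 grain:32 syndrome:0x0)'
tally=$(printf '%s\n' "$log" \
  | grep -o 'CPU_SrcID#[0-9]*_Ha#[0-9]*_Chan#[0-9]*_DIMM#[0-9]*' \
  | sort | uniq -c)
printf '%s\n' "$tally"
```

A count that keeps growing on one locator points at a single bad module, which matches the diagnosis below.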

@Papaul - Hi! This host might have a faulty RAM bank causing the reboots and errors; whenever you have time, can we sync up to try to replace it?

elukey moved this task from Ops Backlog to Stalled on the User-Elukey board. · Mon, Mar 13, 12:01 PM
Reedy added a subscriber: Reedy. · Tue, Mar 14, 11:23 PM

@elukey Can we get mw2256 depooled from the dsh lists etc?

Will stop scap giving timeout errors for the host

Mentioned in SAL (#wikimedia-operations) [2017-03-15T00:08:52Z] <mutante> depooled mw2256 because it's down again (T155180)

> @elukey Can we get mw2256 depooled from the dsh lists etc?
> Will stop scap giving timeout errors for the host

Looks like we are still getting these timeout errors with scap :(

Set mw2256 as inactive and ran puppet on tin; the host has been removed from DSH.

The system log shows DIMM A1 faulty. I swapped DIMM A1 with DIMM B1 and ran the system check again; now the system log shows DIMM B1 faulty. The memory is bad. Will contact Dell to send me a replacement.

Addshore removed a subscriber: Addshore. · Thu, Mar 16, 5:33 PM

This host is logging a ton of noise in logstash... can we shut down HHVM until the RAM problem is fixed?

@mmodell: mw2256 should be shut down now; we are not planning to bring it up again until we have the new DIMM :)

Service Request Information:

Dispatch Number: 324983627
Service Tag: FXLPND2
Service Request Number: 945459239
Express Service Code: 34683587462
System Type: PowerEdge R430 Server

DIMM replacement complete; the system is back online.

Removed downtime on mw2256, let's see if it holds up without rebooting for a couple of days.

Last step before closing - set mw2256 back to "active" via conftool and make sure it is put back in the scap DSH list (do a scap pull from the host before closing, too!).
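That last step can be sketched roughly as below; the conftool/scap invocations are from memory and should be checked against current docs, and `DRYRUN=echo` keeps the block side-effect free:

```shell
# Hedged sketch of the repool sequence; invocations are from memory, not
# verified against current tooling. DRYRUN=echo prints the commands
# instead of executing them; unset it to run for real.
DRYRUN=echo
$DRYRUN confctl select "name=mw2256.codfw.wmnet" set/pooled=yes
# From mw2256 itself, resync the MediaWiki tree before closing the task:
$DRYRUN scap pull
```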