
codfw: mw2251-mw2260 rack/setup
Closed, Resolved · Public

Description

This task will track the racking and setup of 10 new appservers received via T151779

mw2251-mw2253: Row A, rack A4

  • - receive in and attach packing slip to parent task T151779
  • - rack systems, update racktables
  • - create mgmt dns entries (both asset tag and hostname)
  • - create production dns entries (internal vlan)
  • - update/create subtask with network port info for all new hosts
  • - install_server module update (MAC address and partitioning info, partition like existing mw systems)
  • - install os
  • - puppet/salt accept
  • - hand off to @Joe for service implementation.

mw2254-mw2260: Row B, rack B3

  • - receive in and attach packing slip to parent task T151779
  • - rack systems, update racktables
  • - create mgmt dns entries (both asset tag and hostname)
  • - create production dns entries (internal vlan)
  • - update/create subtask with network port info for all new hosts
  • - install_server module update (MAC address and partitioning info, partition like existing mw systems; see the sketch after this list)
  • - install os
  • - puppet/salt accept
  • - hand off to @Joe for service implementation.
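
For reference, the install_server step in these checklists typically boils down to adding a dhcpd host entry for each box (plus pointing it at the same partman recipe the existing mw systems use). A minimal sketch, assuming the standard ISC dhcpd host-entry format; the MAC address is a placeholder:

host mw2254 {
    hardware ethernet aa:bb:cc:dd:ee:ff;  # placeholder MAC, read off the iDRAC/asset label
    fixed-address mw2254.codfw.wmnet;
}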

Event Timeline

Restricted Application added a subscriber: Aklapper.
Papaul triaged this task as Medium priority. · Jan 12 2017, 5:17 PM

During installation, at the partition disks step, I see the following error:

┌───────────────────────┤ [!!] Partition disks ├────────────────────────┐
│                                                                       │
│ The attempt to mount a file system with type ext4 in RAID1 device #1  │
│ at / failed.                                                          │
│                                                                       │
│ You may resume partitioning from the partitioning menu.               │
│                                                                       │
│ Do you want to resume partitioning?                                   │
│                                                                       │
│     <Go Back>                                       <Yes>    <No>     │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘

This happens on all 10 servers.
I remember we had this same problem with our Debian installer in January 2016, almost a year ago:
https://phabricator.wikimedia.org/T125256

To make sure it is our Debian installer causing the problem, I asked Daniel to help me hack the installation configuration of mw2251 by changing it from Jessie to Trusty. When running Trusty, I do not have the problem; the installation completes with no issues.
@MoritzMuehlenhoff
@faidon
Please advise.
Thanks

RobH added a parent task: Unknown Object (Task).
RobH added subscribers: Southparkfan, mark, emailbot.

Yup, mw2251 switched to trusty, first manually and then in puppet (https://gerrit.wikimedia.org/r/#/c/332930/); the issue only happens in jessie, not trusty.

That happens after jessie point releases (and there was one last weekend). Until a while ago the Squid cache needed to be purged manually, but I committed some config tweaks a few months ago which should make this obsolete. What's still needed is the rebuild of the images; quoting Faidon from IRC a while ago:

[Monday, 19 September 2016] [10:58:02] <paravoid> updating the d-i isn't that simple though, as we need to bundle firmware into the initrd
[Monday, 19 September 2016] [10:58:12] <paravoid> I have an unpuppetized script for that in my home directory on palladium

That script is now in puppetmaster1001:/home/faidon and I just ran it; /var/lib/puppet/volatile is synced via rsync every 15 minutes, so it should work soon. If it still fails, please ping me on IRC.
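
For posterity, a quick hedged check that the rebuilt images actually propagated; the exact directory layout under volatile is an assumption:

ls -l /var/lib/puppet/volatile/tftpboot/   # assumed location of the d-i netboot tree; timestamps should postdate the rebuild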

Papaul updated the task description.

@Joe: installation complete.

Adding a note to check for recurrences:

09:09  <icinga-wm> PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100%
09:09  <icinga-wm> RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms

elukey@asw-b-codfw> show log messages | match fpc3
Mar  3 09:07:14  asw-b-codfw fpc3 [EX-BCM PIC] ex_bcm_linkscan_handler: Link 16 DOWN
Mar  3 09:07:28  asw-b-codfw fpc3 [EX-BCM PIC] ex_bcm_linkscan_handler: Link 16 UP
Mar  3 09:09:10  asw-b-codfw fpc3 [EX-BCM PIC] ex_bcm_linkscan_handler: Link 16 DOWN
Mar  3 09:09:13  asw-b-codfw fpc3 [EX-BCM PIC] ex_bcm_linkscan_handler: Link 16 UP

https://phabricator.wikimedia.org/T156023#3046855 as a reminder of the appservers' status after the last rebalance.

Mentioned in SAL (#wikimedia-operations) [2017-03-08T15:08:34Z] <elukey> rebooting mw225[123] as part of sanity check for T155180

Mentioned in SAL (#wikimedia-operations) [2017-03-08T15:19:57Z] <elukey> rebooting mw22(5[4-9]|60) as part of sanity check for T155180

Hosts rebooted; verified that puppet ran correctly and executed apt-get dist-upgrade. Also verified row allocation:

{'mw2251.codfw.wmnet': '    SysName:      asw-a-codfw'}
{'mw2252.codfw.wmnet': '    SysName:      asw-a-codfw'}
{'mw2253.codfw.wmnet': '    SysName:      asw-a-codfw'}
{'mw2254.codfw.wmnet': '    SysName:      asw-b-codfw'}
{'mw2255.codfw.wmnet': '    SysName:      asw-b-codfw'}
{'mw2256.codfw.wmnet': '    SysName:      asw-b-codfw'}
{'mw2257.codfw.wmnet': '    SysName:      asw-b-codfw'}
{'mw2258.codfw.wmnet': '    SysName:      asw-b-codfw'}
{'mw2259.codfw.wmnet': '    SysName:      asw-b-codfw'}
{'mw2260.codfw.wmnet': '    SysName:      asw-b-codfw'}
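
For the record, output like the above can be gathered over LLDP from the hosts themselves; a sketch, assuming salt with a PCRE matcher and lldpctl installed on the minions:

sudo salt -E 'mw22(5[1-9]|60)\.codfw\.wmnet' cmd.run 'lldpctl | grep SysName'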

Change 342194 had a related patch set uploaded (by Elukey):
[operations/puppet] Add new MW appservers and api-appservers in codfw

https://gerrit.wikimedia.org/r/342194

Change 342194 merged by Elukey:
[operations/puppet] Add new MW appservers and api-appservers in codfw

https://gerrit.wikimedia.org/r/342194

Ran puppet, rebooted the nodes for the mw-cgroups, re-ran puppet and scap pull, then pooled the nodes via conftool.

Still to check: I had to reboot mw2256 because it wasn't responding to ssh, and given that we had a similar issue on March 3rd, I'd like to investigate a bit more.
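
For reference, pooling and depooling via conftool is done with confctl, using the same selector syntax visible in the logmsgbot entries further down; a sketch:

confctl select 'name=mw2251.codfw.wmnet,service=apache2' set/pooled=yes   # repeat per host and per service (apache2, nginx)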

Happened again today:

04:08  <icinga-wm> PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100%
04:09  <icinga-wm> RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms

Definitely something odd; mw2256 needs to be investigated.

It keeps happening; I can see a lot of EDAC errors in kern.log:

elukey@mw2256:~$ sudo grep -i EDAC /var/log/kern.log
Mar 12 15:29:09 mw2256 kernel: [40837.335365] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Mar 12 15:29:09 mw2256 kernel: [40837.335368] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 1: 940000000000009f
Mar 12 15:29:09 mw2256 kernel: [40837.335370] EDAC sbridge MC0: TSC 51cb6d97f462
Mar 12 15:29:09 mw2256 kernel: [40837.335372] EDAC sbridge MC0: ADDR 857743fc0 EDAC sbridge MC0: MISC 0
Mar 12 15:29:09 mw2256 kernel: [40837.335377] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1489332549 SOCKET 0 APIC 0
Mar 12 15:29:10 mw2256 kernel: [40837.837147] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x857743 offset:0xfc0 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:1)
[..]

@Papaul - Hi! This host might have a faulty RAM bank causing the reboots and errors; whenever you have time, can we sync up to try to replace it?
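
A hedged sketch for quantifying the corrected-error rate before swapping hardware (edac-util ships in the edac-utils package):

sudo edac-util --report=ce                         # per-controller corrected (CE) error counters
grep -c 'CE memory read error' /var/log/kern.log   # rough count from the log excerpt above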

@elukey Can we get mw2256 depooled from the dsh lists etc?

That will stop scap from giving timeout errors for the host.

Mentioned in SAL (#wikimedia-operations) [2017-03-15T00:08:52Z] <mutante> depooled mw2256 because it's down again (T155180)

Looks like we are still getting these timeout errors with scap :(

Set mw2256 as inactive and ran puppet on tin; host removed from DSH.
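
For reference, "inactive" is the conftool state that also drops a host from the DSH/scap targets, unlike a plain depool; a sketch using the same confctl syntax as above (reversed later with set/pooled=yes):

confctl select 'name=mw2256.codfw.wmnet' set/pooled=inactive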

The system log shows DIMM A1 as faulty. I swapped DIMM A1 with DIMM B1, ran the system check again, and now the system log shows DIMM B1 as faulty, so the fault follows the module: the memory is bad. I will contact Dell to send a replacement.

(Screenshot of the system log showing the faulty DIMM: Selection_004.png)

This host is logging a ton of noise in logstash... can we shut down HHVM until the RAM problem is fixed?

@mmodell: mw2256 should be shut down now; we are not planning to bring it up again until we have the new DIMM :)

Service Request Information:

Dispatch Number: 324983627
Service Tag: FXLPND2
Service Request Number: 945459239
Express Service Code: 34683587462
System Type: PowerEdge R430 Server

DIMM replacement complete; the system is back online.

Removed downtime on mw2256; let's see if it holds up without rebooting for a couple of days.

Last step before closing - set mw2256 as "active" via conftool and make sure that it is put back in the scap dsh (do a scap pull from the host before closing too!).

Created T161488 to decommission 7 old API codfw appservers (replaced by these ones).

mw2256 works fine and it is now active in conftool.

Dzahn reopened this task as Open (edited). · Apr 19 2017, 3:37 PM

mw2256 died again; it was powercycled, came back, was repooled... and shortly after it went down again:

07:40 < icinga-wm> PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100%
07:40 < elukey> mmm checking mw2256
07:42 < akosiaris> elukey: I think mw2256 has died.. console doesn't spew out anything meaningful
07:44 < akosiaris> !log powercycle mw2256
07:46 < akosiaris> mw2256 issues garbled text at the console.. looks like baud rate misconfiguration
07:47 < _joe_> akosiaris: mw2256 - if it comes up, please sync it
07:47 < icinga-wm> RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms
07:47 < akosiaris> mw2256 is back up
07:50 < elukey> godog: is all mw2256 related? (nutcracker
07:51 <+logmsgbot> !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw2256.codfw.wmnet,service=apache2
07:51 <+logmsgbot> !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw2256.codfw.wmnet,service=nginx
07:55 < akosiaris> mw2256 synced I 'll repool it
07:56 <+logmsgbot> !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw2256.codfw.wmnet,service=nginx
07:56 <+logmsgbot> !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw2256.codfw.wmnet,service=apache2
08:34 < icinga-wm> PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100%
08:35 < mutante> !log mw2256 went down and showed " PANIC: double fault, error_code: 0x0"

@Papaul just the kernel panic (PANIC: double fault, error_code: 0x0) and this during boot:

Apr 19 14:44:03 mw2256 kernel: [   28.623132] ACPI Error: No handler for Region [SYSI] (ffff88085eca09c8) [IPMI] (20150930/evregion-163)
Apr 19 14:44:03 mw2256 kernel: [   28.623140] ACPI Error: Region IPMI (ID=7) has no handler (20150930/exfldio-297)
Apr 19 14:44:03 mw2256 kernel: [   28.623146] ACPI Error: Method parse/execution failed [\_SB.PMI0._GHL] (Node ffff88105e4192c0), AE_NOT_EXIST (20150930/psparse-542)
Apr 19 14:44:03 mw2256 kernel: [   28.623158] ACPI Error: Method parse/execution failed [\_SB.PMI0._PMC] (Node ffff88105e419270), AE_NOT_EXIST (20150930/psparse-542)
Apr 19 14:44:03 mw2256 kernel: [   28.623168] ACPI Exception: AE_NOT_EXIST, Evaluating _PMC (20150930/power_meter-755)
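
A hedged sketch of a follow-up check: the hardware System Event Log (where the earlier DIMM diagnosis came from) can also be read in-band, assuming ipmitool and the IPMI kernel modules are available on the host:

sudo ipmitool sel list   # look for fresh memory/CPU entries around the crash times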

Closing this again to handle mw2256 in a subtask (please continue in T163346).