codfw: mw2251-mw2260 rack/setup
Open, Normal, Public

Description

This task will track the racking and setup of 10 new app servers received via T151779.

mw2251-mw2253 Row A rack A4

  • - receive in and attach packing slip to parent task T151779
  • - rack systems, update racktables
  • - create mgmt dns entries (both asset tag and hostname)
  • - create production dns entries (internal vlan)
  • - update/create sub task with network port info for all new hosts
  • - install_server module update (mac address and partitioning info, partition like existing mw systems)
  • - install os
  • - puppet/salt accept
  • - hand off to @Joe for service implementation.

mw2254-mw2260 row B rack B3

  • - receive in and attach packing slip to parent task T151779
  • - rack systems, update racktables
  • - create mgmt dns entries (both asset tag and hostname)
  • - create production dns entries (internal vlan)
  • - update/create sub task with network port info for all new hosts
  • - install_server module update (mac address and partitioning info, partition like existing mw systems)
  • - install os
  • - puppet/salt accept
  • - hand off to @Joe for service implementation.
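The mgmt DNS step in the checklists above amounts to generating a record per host; a minimal sketch of that, assuming a made-up zone layout (the `10.193.2.x` addresses and record format are placeholders for illustration, not the real codfw mgmt allocations):

```shell
# Hypothetical sketch: emit mgmt-DNS A-record lines for the ten new hosts.
# The zone layout and 10.193.2.x addresses are placeholders, not the real
# codfw mgmt network.
records=$(
  i=0
  for n in $(seq 2251 2260); do
    i=$((i+1))
    printf 'mw%d.mgmt.codfw.wmnet. IN A 10.193.2.%d\n' "$n" "$i"
  done
)
printf '%s\n' "$records"
```

The same loop shape covers the asset-tag entries and the production (internal vlan) records, with the zone and address range swapped.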

Related Objects

Status     Assigned   Task
Open       Joe
Resolved   RobH
Papaul created this task. · Jan 12 2017, 4:54 PM
Restricted Application added a project: Operations. · Jan 12 2017, 4:54 PM
Restricted Application added a subscriber: Aklapper.
Papaul edited the task description. · Jan 12 2017, 4:55 PM
Papaul triaged this task as "Normal" priority. · Jan 12 2017, 5:17 PM

During installation, at the partition disks step, I get the following error:

┌───────────────────────┤ [!!] Partition disks ├────────────────────────┐
│                                                                       │
│ The attempt to mount a file system with type ext4 in RAID1 device #1  │
│ at / failed.                                                          │
│                                                                       │
│ You may resume partitioning from the partitioning menu.               │
│                                                                       │
│ Do you want to resume partitioning?                                   │
│                                                                       │
│     <Go Back>                                       <Yes>    <No>     │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘

This happens on all 10 servers.
I remember we had this same problem with our Debian installer in January 2016, almost a year ago:
https://phabricator.wikimedia.org/T125256

To confirm that our Debian installer is causing the problem, I asked Daniel to help me hack the installation configuration of mw2251, changing it from Jessie to Trusty. When running Trusty I do not see the problem; the installation completes with no issues.
@MoritzMuehlenhoff
@faidon
Please advise.
Thanks

RobH added a parent task: Unknown Object (Task).
RobH added subscribers: Southparkfan, mark, emailbot.
Dzahn added a subscriber: Dzahn. · Jan 19 2017, 6:26 AM

Yup, mw2251 switched to trusty, first manually and then in puppet (https://gerrit.wikimedia.org/r/#/c/332930/); the issue only happens on jessie, not trusty.

That happens after jessie point releases (and there was one last weekend). Until a while ago the Squid cache needed to be purged manually, but I committed some config tweaks a few months ago which should make that step obsolete. What's still needed is a rebuild of the images; quoting Faidon from IRC a while ago:

[Monday, 19 September 2016] [10:58:02] <paravoid> updating the d-i isn't that simple though, as we need to bundle firmware into the initrd
[Monday, 19 September 2016] [10:58:12] <paravoid> I have an unpuppetized script for that in my home directory on palladium

That script is now in puppetmaster1001:/home/faidon, I just ran it, /var/lib/puppet/volatile is synced via rsync every 15 minutes, so it should work soon. If it still fails, please ping me on IRC.

Papaul edited the task description. · Jan 20 2017, 5:37 AM
Papaul reassigned this task from Papaul to Joe.

@Joe installation complete.

elukey added a subscriber: elukey.
elukey moved this task from Backlog to Ops Backlog on the User-Elukey board. · Thu, Mar 2, 2:48 PM
elukey added a comment. · Fri, Mar 3, 9:54 AM

Adding a note to check for recurrences:

09:09  <icinga-wm> PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100%
09:09  <icinga-wm> RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms

elukey@asw-b-codfw> show log messages | match fpc3
Mar  3 09:07:14  asw-b-codfw fpc3 [EX-BCM PIC] ex_bcm_linkscan_handler: Link 16 DOWN
Mar  3 09:07:28  asw-b-codfw fpc3 [EX-BCM PIC] ex_bcm_linkscan_handler: Link 16 UP
Mar  3 09:09:10  asw-b-codfw fpc3 [EX-BCM PIC] ex_bcm_linkscan_handler: Link 16 DOWN
Mar  3 09:09:13  asw-b-codfw fpc3 [EX-BCM PIC] ex_bcm_linkscan_handler: Link 16 UP
elukey added a comment. · Wed, Mar 8, 3:07 PM

https://phabricator.wikimedia.org/T156023#3046855 as a reminder of the appservers status after the last rebalance.

Mentioned in SAL (#wikimedia-operations) [2017-03-08T15:08:34Z] <elukey> rebooting mw225[123] as part of sanity check for T155180

Mentioned in SAL (#wikimedia-operations) [2017-03-08T15:19:57Z] <elukey> rebooting mw22(5[4-9]|60) as part of sanity check for T155180

Hosts rebooted; verified that puppet ran correctly and executed apt-get dist-upgrade. Also verified row allocation:

{'mw2251.codfw.wmnet': '    SysName:      asw-a-codfw'}
{'mw2252.codfw.wmnet': '    SysName:      asw-a-codfw'}
{'mw2253.codfw.wmnet': '    SysName:      asw-a-codfw'}
{'mw2254.codfw.wmnet': '    SysName:      asw-b-codfw'}
{'mw2255.codfw.wmnet': '    SysName:      asw-b-codfw'}
{'mw2256.codfw.wmnet': '    SysName:      asw-b-codfw'}
{'mw2257.codfw.wmnet': '    SysName:      asw-b-codfw'}
{'mw2258.codfw.wmnet': '    SysName:      asw-b-codfw'}
{'mw2259.codfw.wmnet': '    SysName:      asw-b-codfw'}
{'mw2260.codfw.wmnet': '    SysName:      asw-b-codfw'}
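The `SysName` field above comes from LLDP neighbour data, where the top-of-rack switch names the row. A hedged recreation against a canned sample, since the exact collection command isn't recorded in the task (the sample fields stand in for e.g. `lldpctl` output on the host):

```shell
# Pull the top-of-rack switch name out of LLDP neighbour output.
# $lldp_sample stands in for real `lldpctl` output; only SysName matters.
lldp_sample='Interface: eth0, via: LLDP
    SysName:      asw-b-codfw
    PortDescr:    ge-3/0/16'
row_switch=$(printf '%s\n' "$lldp_sample" | awk '$1 == "SysName:" {print $2}')
printf '%s\n' "$row_switch"
```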

Change 342194 had a related patch set uploaded (by Elukey):
[operations/puppet] Add new MW appservers and api-appservers in codfw

https://gerrit.wikimedia.org/r/342194

Change 342194 merged by Elukey:
[operations/puppet] Add new MW appservers and api-appservers in codfw

https://gerrit.wikimedia.org/r/342194

Ran puppet, rebooted the nodes for the mw-cgroups, re-ran puppet and scap pull, pooled the nodes via conftool.

Still to check: I had to reboot mw2256 because it wasn't responding to SSH, and given that we had a similar issue on March 3rd I'd like to investigate a bit more.

Happened again today:

04:08  <icinga-wm> PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100%
04:09  <icinga-wm> RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms

Definitely something odd, mw2256 needs to be investigated.

It keeps happening; I can see a lot of EDAC errors in kern.log:

elukey@mw2256:~$ sudo grep -i EDAC /var/log/kern.log
Mar 12 15:29:09 mw2256 kernel: [40837.335365] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Mar 12 15:29:09 mw2256 kernel: [40837.335368] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 1: 940000000000009f
Mar 12 15:29:09 mw2256 kernel: [40837.335370] EDAC sbridge MC0: TSC 51cb6d97f462
Mar 12 15:29:09 mw2256 kernel: [40837.335372] EDAC sbridge MC0: ADDR 857743fc0 EDAC sbridge MC0: MISC 0
Mar 12 15:29:09 mw2256 kernel: [40837.335377] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1489332549 SOCKET 0 APIC 0
Mar 12 15:29:10 mw2256 kernel: [40837.837147] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x857743 offset:0xfc0 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:1)
[..]
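A quick way to see whether the correctable errors cluster on one DIMM is to tally the `CE memory read error` lines by their DIMM locator. A sketch below, run against the log line quoted above plus one synthetic duplicate (added only so the tally demonstrates counting):

```shell
# Tally correctable errors per DIMM locator. $log stands in for
# `sudo grep EDAC /var/log/kern.log`; the second line is synthetic.
log='Mar 12 15:29:10 mw2256 kernel: [40837.837147] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x857743 offset:0xfc0 grain:32 syndrome:0x0)
Mar 12 15:30:10 mw2256 kernel: [40897.112233] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x857744 offset:0x040 grain:32 syndrome:0x0)'
tally=$(printf '%s\n' "$log" \
  | grep -o 'CPU_SrcID#[0-9]*_Ha#[0-9]*_Chan#[0-9]*_DIMM#[0-9]*' \
  | sort | uniq -c)
printf '%s\n' "$tally"
```

A count that keeps growing on one locator points at a single bad module, which matches the diagnosis below.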

@Papaul - Hi! This host might have a faulty RAM bank causing the reboots and errors; whenever you have time, can we sync up to try to replace it?

elukey moved this task from Ops Backlog to Stalled on the User-Elukey board. · Mon, Mar 13, 12:01 PM
Reedy added a subscriber: Reedy. · Tue, Mar 14, 11:23 PM

@elukey Can we get mw2256 depooled from the dsh lists etc?

Will stop scap giving timeout errors for the host

Mentioned in SAL (#wikimedia-operations) [2017-03-15T00:08:52Z] <mutante> depooled mw2256 because it's down again (T155180)

> @elukey Can we get mw2256 depooled from the dsh lists etc?
> Will stop scap giving timeout errors for the host

Looks like we are still getting these timeout errors with scap :(

Set mw2256 as inactive and ran puppet on tin; the host has been removed from DSH.

The system log shows DIMM A1 faulty. I swapped DIMM A1 with DIMM B1 and ran the system check again; now the system log shows DIMM B1 faulty. The memory is bad. Will contact Dell to send me a replacement.

Addshore removed a subscriber: Addshore. · Thu, Mar 16, 5:33 PM

This host is logging a ton of noise in logstash... can we shut down HHVM until the RAM problem is fixed?

@mmodell: mw2256 should be shut down now; we are not planning to bring it up again until we have the new DIMM :)

Service Request Information:

Dispatch Number: 324983627
Service Tag: FXLPND2
Service Request Number: 945459239
Express Service Code: 34683587462
System Type: PowerEdge R430 Server

DIMM replacement complete; the system is back online.

Removed downtime on mw2256, let's see if it holds up without rebooting for a couple of days.

Last step before closing - set mw2256 back to "active" via conftool and make sure it is put back in the scap DSH list (do a scap pull from the host before closing, too!).
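That last step can be sketched roughly as below; the conftool/scap invocations are from memory and should be checked against current docs, and `DRYRUN=echo` keeps the block side-effect free:

```shell
# Hedged sketch of the repool sequence; invocations are from memory, not
# verified against current tooling. DRYRUN=echo prints the commands
# instead of executing them; unset it to run for real.
DRYRUN=echo
$DRYRUN confctl select "name=mw2256.codfw.wmnet" set/pooled=yes
# From mw2256 itself, resync the MediaWiki tree before closing the task:
$DRYRUN scap pull
```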