
mw2201, mw2202 - contact Dell and replace main board
Closed, Resolved (Public)

Description

per parent task (T169360)

mw2201 and mw2202 have broken BMC / unresponsive DRACs

We tried draining flea power and resetting (which worked for another host, mw2154) but it did not work for these.

See errors Papaul got:

It looks like broken hardware now, contact Dell and get it replaced.

Details

Related Gerrit Patches:
operations/puppet : production - mw2202: remove from conftool-data
operations/puppet : production - install_server: update MAC address of mw2201 in DHCP

Event Timeline

Dzahn created this task. Jul 11 2017, 6:57 PM
Dzahn updated the task description. Jul 11 2017, 7:16 PM
Papaul added a comment (edited). Jul 12 2017, 2:57 PM

Case for mw2201
Hi Papaul,

Thank you for contacting the Dell Enterprise Technical Support team.

We have opened up a new case # 950882366 for issues reported with – idrac not working.

I will be your case owner and single point of contact to ensure that the issue you have reported is resolved.

Please feel free to reach out to me on mail for information or help.

Regards,

Sumesh Ravindran

Technical Support Analyst

Dell EMC | NA Basic Server Support

Enterprise Remote Services and Solutions

Case for mw2202
Hi Papaul,

Thank you for contacting the Dell Enterprise Technical Support team.

We have opened up a new case # 950882983 for issues reported with – idrac not working.

I will be your case owner and single point of contact to ensure that the issue you have reported is resolved.

Please feel free to reach out to me on mail for information or help.

Regards,

Sumesh Ravindran

Technical Support Analyst

Dell EMC | NA Basic Server Support

Enterprise Remote Services and Solutions

@Dzahn I received a call from the dispatch team saying that they received the part for only one server and that they don't know the ETA for the part for the other server. The tech will come and replace the main board on one server today before 3 PM; if not, he will let me know.

Dzahn added a comment. Jul 13 2017, 4:04 PM

Thanks @Papaul I have depooled both servers just in case. Let me know when we can repool them.

Dzahn added a comment. Jul 13 2017, 6:40 PM

Dell has told Papaul that it was delayed and they won't be there today. Repooled servers for now.

Update on case

Dell now uses a company called Unisys to perform all the service calls when it comes to part replacement. Dell receives the call and sends the dispatch to Unisys, and one of the Unisys techs calls you to let you know that they have received the case and that they will be there between 8 and 5.

So the tech scheduled for this service call called me this morning to let me know that he had received only 1 of the 2 parts and that he did not have an ETA for the second part. I told him that it was okay to proceed with 1 server today and that we could work on the other one later, once we received the part. I also told him that I would be on site until 3 pm, because I can't stay after 3 pm because of traffic. That call was made at 9:45 am.

At 1:30, since I hadn't heard from the tech, I decided to call him for an update. He said that he was on another service call and that he would not be able to show up before 3 pm.

I then decided to call Dell directly to explain to them what was going on. This time I requested that the part be sent to me directly, and said that I don't want any tech on site to replace the part. The Dell engineer guaranteed me that he would do his best to get the part to me tomorrow, Friday; since they do not have the part in stock, if that is not possible it will be first thing on Monday.

@Dzahn I received 1 main board. Can you please depool mw2201 so I can go ahead and replace the main board?

Thanks.

Dzahn added a comment. Jul 17 2017, 4:13 PM

@Papaul Ok, thanks. Done. you can go ahead.

mw2201.codfw.wmnet: pooled changed yes => no
mw2201.codfw.wmnet: pooled changed yes => no

@Dzahn main board replacement on mw2201 complete. Please test and let me know. Thanks.

Change 365673 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server: update MAC address of mw2201 in DHCP

https://gerrit.wikimedia.org/r/365673

RobH assigned this task to Dzahn (edited). Jul 17 2017, 6:29 PM
RobH added a subscriber: RobH.

chatted with Daniel and he is handling the follow up on mw2201.

when mw2201 is done, please assign this task back to @Papaul for processing on mw2202

in the future, it's best to make a single task per hardware/server failure so it's easier to track them

Change 365673 merged by Dzahn:
[operations/puppet@production] install_server: update MAC address of mw2201 in DHCP

https://gerrit.wikimedia.org/r/365673
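A replacement main board comes with new onboard NICs, so the MAC address pinned for the host in the install server's DHCP configuration has to change before a network reinstall can work. The patch itself is not reproduced in this task; the following is only a minimal sketch, assuming an ISC dhcpd-style `host` stanza. The helper name `update_host_mac`, the sample configuration, and all MAC addresses are hypothetical.

```python
import re


def update_host_mac(dhcp_conf, host, new_mac):
    """Replace the 'hardware ethernet' value inside one host stanza."""
    pattern = re.compile(
        r'(host\s+' + re.escape(host) + r'\s*\{[^}]*?hardware\s+ethernet\s+)'
        r'[0-9A-Fa-f:]{17}(\s*;)'
    )
    # A function replacement avoids any backreference parsing of new_mac.
    updated, count = pattern.subn(
        lambda m: m.group(1) + new_mac + m.group(2), dhcp_conf
    )
    if count != 1:
        raise ValueError(f"expected exactly one stanza for {host}, found {count}")
    return updated


# Hypothetical sample data; these are not the real addresses of the hosts.
SAMPLE_CONF = """\
host mw2201 {
    hardware ethernet AA:BB:CC:00:00:01;
    fixed-address mw2201.codfw.wmnet;
}
host mw2202 {
    hardware ethernet AA:BB:CC:00:00:03;
    fixed-address mw2202.codfw.wmnet;
}
"""

if __name__ == "__main__":
    print(update_host_mac(SAMPLE_CONF, "mw2201", "AA:BB:CC:00:00:11"))
```

Only the stanza for the named host is touched, and the function raises if it cannot find exactly one match, so a typo in the hostname does not silently leave the old address in place.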

Mentioned in SAL (#wikimedia-operations) [2017-07-17T20:33:30Z] <mutante> mw2201 - reinstalling OS after mainboard replacement (network interfaces became eth2/eth3 from eth0/eth1 so ferm failed etc) - T170307

Mentioned in SAL (#wikimedia-operations) [2017-07-17T21:01:40Z] <mutante> mw2201 - revoke old puppet cert, salt key, accept/sign news cert and key, initial pupet run .. T170307

Dzahn reassigned this task from Dzahn to Papaul. Jul 17 2017, 11:48 PM

mw2201 has been reinstalled and repooled and is working now. the issue is resolved. thanks papaul. giving the ticket back for mw2202 now.

Mentioned in SAL (#wikimedia-operations) [2017-07-18T18:26:02Z] <mutante> mw2202 - remove /etc/udev/rules.d/70-persistent-net.rules for mainboard replacement - to detect new NICs with new MACs (T170307)

Return information for bad main board on mw2201

Change 366024 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mw2202: remove from conftool-data

https://gerrit.wikimedia.org/r/366024

Change 366024 merged by Dzahn:
[operations/puppet@production] mw2202: remove from conftool-data

https://gerrit.wikimedia.org/r/366024

Main board replacement complete on mw2202. System is back up. See below for return information for bad main board.

Papaul reassigned this task from Papaul to Dzahn. Jul 18 2017, 7:33 PM
Dzahn added a comment (edited). Jul 18 2017, 7:47 PM

looks good and came back just fine without having to reinstall, deleting /etc/udev/rules.d/70-persistent-net.rules did the trick, thanks!
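For context on why deleting the rules file helps: the generated 70-persistent-net.rules file pins interface names to MAC addresses, so after a board swap the old NICs still hold eth0/eth1 and the new ones are pushed to eth2/eth3 (as seen on mw2201). A minimal sketch of how one might spot such stale entries before deciding to remove the file is below. It assumes the usual generated rule layout; the helper name `stale_rules` and all MAC addresses are hypothetical.

```python
import re

# Matches the MAC address and interface name in one generated udev rule line.
RULE_RE = re.compile(r'ATTR\{address\}=="([0-9a-fA-F:]{17})".*NAME="([^"]+)"')


def stale_rules(rules_text, current_macs):
    """Return (name, mac) pairs pinned to MACs that are no longer installed."""
    present = {mac.lower() for mac in current_macs}
    stale = []
    for line in rules_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip the generated comment lines
        match = RULE_RE.search(line)
        if match and match.group(1).lower() not in present:
            stale.append((match.group(2), match.group(1).lower()))
    return stale


# Hypothetical example data: the old board's NICs still hold eth0/eth1.
OLD_RULES = '''\
# PCI device 0x14e4:0x165f (tg3)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="aa:bb:cc:00:00:01", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="aa:bb:cc:00:00:02", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"
'''
NEW_BOARD_MACS = ["AA:BB:CC:00:00:11", "aa:bb:cc:00:00:12"]

if __name__ == "__main__":
    for name, mac in stale_rules(OLD_RULES, NEW_BOARD_MACS):
        print(f"{name} is still reserved for missing NIC {mac}")
```

If every entry is stale, as in this example, removing the file and letting it be regenerated on the next boot is the simplest fix, which matches what was done on mw2202.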

re-adding to conftool data https://gerrit.wikimedia.org/r/#/c/366031/

Chad pulled latest code so it should be back in deployment-sync.

Dzahn closed this task as Resolved. Jul 18 2017, 7:57 PM

repooled mw2202 - this should resolve this ticket - thanks @Papaul

Dzahn reassigned this task from Dzahn to Papaul. Jul 18 2017, 7:57 PM