
mw2201, mw2202 - contact Dell and replace main board
Closed, Resolved (Public)

Description

per parent task (T169360)

mw2201 and mw2202 have broken BMC / unresponsive DRACs

We tried draining flea power and resetting (which worked for another host, mw2154) but it did not work for these.

See errors Papaul got:

It looks like broken hardware now, contact Dell and get it replaced.

Event Timeline

Case for mw2201
Hi Papaul,

Thank you for contacting the Dell Enterprise Technical Support team.

We have opened up a new case # 950882366 for issues reported with – idrac not working.

I will be your case owner and single point of contact to ensure that the issue you have reported is resolved.

Please feel free to reach out to me on mail for information or help.

Regards,

Sumesh Ravindran

Technical Support Analyst

Dell EMC | NA Basic Server Support

Enterprise Remote Services and Solutions

Case for mw2202
Hi Papaul,

Thank you for contacting the Dell Enterprise Technical Support team.

We have opened up a new case # 950882983 for issues reported with – idrac not working.

I will be your case owner and single point of contact to ensure that the issue you have reported is resolved.

Please feel free to reach out to me on mail for information or help.

Regards,

Sumesh Ravindran

Technical Support Analyst

Dell EMC | NA Basic Server Support

Enterprise Remote Services and Solutions

@Dzahn I received a call from the dispatch team saying that they received the part for only one server and that they don't know the ETA for the part for the other server. The tech will come and replace the main board on one server today before 3 PM. If not, he will let me know.

Thanks @Papaul I have depooled both servers just in case. Let me know when we can repool them.

Dell has told Papaul that it was delayed and they won't be there today. Repooled servers for now.

Update on case

Dell now uses a company called Unisys to perform all the call services when it comes to part replacement. Dell receives the call and sends the dispatch to Unisys, and one of the Unisys techs calls you to let you know that they have received the case and that they will be there between 8 and 5.

So the tech scheduled for this service call called me this morning to let me know that he had only received 1 part out of 2 and that he doesn't have an ETA for the second part. I told him that it is okay to proceed with working on 1 server today and that we can work on the other one later, once we receive the part. I also told him that I will be on site until 3 pm, because I can't stay after 3 pm due to traffic. That call was made at 9:45 am.

At 1:30, since I hadn't heard from the tech, I decided to call him for an update. He said that he was on another service call and that he would not be able to show up before 3 pm.

I then decided to call Dell directly to explain to them what was going on. This time I requested that the part be sent to me directly, and said that I don't want any tech on site to replace the part. The Dell engineer assured me that he will do his best to get the part to me tomorrow, Friday. Since they do not have the part in stock, if that is not possible it will arrive first thing on Monday.

@Dzahn I received 1 main board. Can you please depool mw2201 so I can go ahead and replace the main board?

Thanks.

@Papaul Ok, thanks. Done. you can go ahead.

mw2201.codfw.wmnet: pooled changed yes => no
mw2201.codfw.wmnet: pooled changed yes => no

@Dzahn main board replacement on mw2201 complete. Please test and let me know. Thanks.

Change 365673 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server: update MAC address of mw2201 in DHCP

https://gerrit.wikimedia.org/r/365673

RobH added a subscriber: RobH.

chatted with Daniel and he is handling the follow up on mw2201.

when mw2201 is done, please assign this task back to @Papaul for processing on mw2202

in the future, it's best to make a single task per hardware/server failure so it's easier to track them

Change 365673 merged by Dzahn:
[operations/puppet@production] install_server: update MAC address of mw2201 in DHCP

https://gerrit.wikimedia.org/r/365673

Mentioned in SAL (#wikimedia-operations) [2017-07-17T20:33:30Z] <mutante> mw2201 - reinstalling OS after mainboard replacement (network interfaces became eth2/eth3 from eth0/eth1 so ferm failed etc) - T170307

Mentioned in SAL (#wikimedia-operations) [2017-07-17T21:01:40Z] <mutante> mw2201 - revoke old puppet cert, salt key, accept/sign news cert and key, initial pupet run .. T170307

mw2201 has been reinstalled and repooled and is working now. the issue is resolved. thanks papaul. giving the ticket back for mw2202 now.

Mentioned in SAL (#wikimedia-operations) [2017-07-18T18:26:02Z] <mutante> mw2202 - remove /etc/udev/rules.d/70-persistent-net.rules for mainboard replacement - to detect new NICs with new MACs (T170307)

Return information for bad main board on mw2201

Change 366024 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mw2202: remove from conftool-data

https://gerrit.wikimedia.org/r/366024

Change 366024 merged by Dzahn:
[operations/puppet@production] mw2202: remove from conftool-data

https://gerrit.wikimedia.org/r/366024

Main board replacement complete on mw2202. System is back up. See below for return information for bad main board.

looks good and came back just fine without having to reinstall, deleting /etc/udev/rules.d/70-persistent-net.rules did the trick, thanks!
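For anyone wondering why deleting that file helps: as far as I understand it, 70-persistent-net.rules pins interface names to MAC addresses, so after a main board swap the old entries still claim eth0/eth1 for NICs that no longer exist, and the new NICs end up as eth2/eth3 (which is what happened on mw2201). Below is a minimal Python sketch of that check. The rule lines and MAC addresses are made-up examples, and the helper names parse_persistent_net_rules and stale_entries are hypothetical, not part of any existing tool.

```python
import re

# Matches the MAC address and the assigned interface name in one udev rule line.
RULE_RE = re.compile(r'ATTR\{address\}=="([0-9a-fA-F:]{17})".*?NAME="([^"]+)"')

# Hypothetical contents of an old 70-persistent-net.rules (MACs are invented).
OLD_RULES = '''\
# PCI device (example entries only)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="aa:bb:cc:00:00:01", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="aa:bb:cc:00:00:02", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"
'''

# Hypothetical MACs of the NICs on the replacement main board.
NEW_BOARD_MACS = ["aa:bb:cc:00:00:11", "aa:bb:cc:00:00:12"]


def parse_persistent_net_rules(text):
    """Return {interface_name: mac} for every rule line found in the text."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = RULE_RE.search(line)
        if match:
            mapping[match.group(2)] = match.group(1).lower()
    return mapping


def stale_entries(mapping, present_macs):
    """Return the rules whose MAC is not among the NICs currently present."""
    present = {mac.lower() for mac in present_macs}
    return {name: mac for name, mac in mapping.items() if mac not in present}


if __name__ == "__main__":
    rules = parse_persistent_net_rules(OLD_RULES)
    for name, mac in sorted(stale_entries(rules, NEW_BOARD_MACS).items()):
        print(f"{name} is still pinned to old MAC {mac}")
```

If every entry comes back stale, as in this example, removing the file before the first boot on the new board lets the names be assigned fresh, which matches what worked on mw2202.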

re-adding to conftool data https://gerrit.wikimedia.org/r/#/c/366031/

Chad pulled latest code so it should be back in deployment-sync.

repooled mw2202 - this should resolve this ticket - thanks @Papaul