Fri, Oct 11
Great job @Papaul in troubleshooting this and tracking it down to the root cause. Thanks! ~Willy
@Jclark-ctr - this arrived Thursday via https://www.fedex.com/en-us/home.html. Just a heads up, this will need to be replaced before the PDU upgrade next Tuesday, to retain redundant power on labsdb1009. Thanks, Willy
I'll dig around a bit and check with Dell to see if we can figure out why Com1 and Com2 have to be flipped to get it working. Talked to Luca and worst case, if we can't find any answers to why it's happening, then we'll just leave them as is. Thanks, Willy
Hi @ayounsi - I talked to a couple other people who had the same concern the other day, and I agree as well...so I started scheduling downtime for the PDU alerts in Icinga starting from today's B1 PDU upgrade, and will continue for the remaining PDU swaps. Thanks, Willy
Wed, Oct 9
Hi @Papaul - this task is relabel, update in Netbox, and update switchport descriptions to the newly renamed hostnames. Thanks, Willy
@Cmjohnson - this task is relabel, update in Netbox, and update switchport descriptions to the newly renamed hostnames
@Jclark-ctr - can you wrap up the netbox entries on this one, and then close out the task? Thanks, Willy
Thanks for confirming, @ayounsi. Resolving task.
Tue, Oct 8
Re-assigning to @RobH to complete install/updating of new PDU. Thanks, Willy
@RobH - can you take care of DNS for this to get things completed from the dc-ops side for this install? This one's super urgent, so if you can complete in the AM, it would be much appreciated. Thanks, Willy
Mon, Oct 7
Ok @Dzahn - just let us know when it's ready to go. Thanks, Willy
@Cmjohnson - let me know if we need to order a replacement drive (along with what type of disk), since it's out of warranty. Thanks, Willy
@Dzahn - just wanted to confirm that this has been depooled. Thanks, Willy
Thanks @elukey . Should we ignore/resolve this alert then? Thanks, Willy
Hi @elukey - looks like this host is out of warranty (ended in June 2018). Let me know if you want us to purchase a replacement part or if this system is close to being decommissioned. Thanks, Willy
@Marostegui - it was ordered last Friday morning. We haven't received the tracking number from the vendor yet, but will update that in T233277 once provided. There's still a chance it arrives before the 15th, but we should have an ETA soon. Thanks, Willy
Tue, Oct 1
Hi @Vgutierrez - just following up on this to see if there was an ETA, since these are supposed to replace lvs2001-2006...which are all past their 5yr mark, and have the following hardware issues associated with them:
@Marostegui - sure, will do. This week is the approval & ordering phase of the procurement cycle, so it shouldn't be an issue getting the PO submitted for labsdb1009. Thanks, Willy
Mon, Sep 30
New target date for upgrading the PDUs on this network rack is Thursday 10/17 @11am UTC. @ayounsi will be in Europe this week to oversee, in case any potential issues occur. Thanks, Willy
New date for upgrading the remaining PDU on the network rack A1 will be targeting Tuesday, 10/15 at 11am UTC. Thanks, Willy
Wed, Sep 25
Mon, Sep 23
Sat, Sep 21
Thu, Sep 19
Wed, Sep 18
Tue, Sep 17
@Jclark-ctr - since Chris had to use a sick day, can one of you guys take a look at this for Luca? Thanks, Willy
Mon, Sep 16
Checked with @Cmjohnson , who says he'll follow up to check the connections.
Hi @Dzahn @jbond - looks like this host is out of warranty, and about 3/4 of a year away from a hardware refresh, so just wanted to double-check whether you're considering retiring this system soon or would like us to purchase the replacement part. Thanks, Willy
Originally scheduled for Thursday 9/19, but will reschedule for a later date, since this is a network rack.
Thanks @Jclark-ctr , can you have the drive replaced this week? Also, you might need to coordinate with @jcrespo via IRC to get a couple other things completed to get backup1001 up and running. Thanks, Willy
Sep 13 2019
@Cmjohnson - can you provide an update on this one next week? Thanks, Willy
Hi @Dzahn - just following up on this one, to see when the server can be taken down. Thanks, Willy
Sep 11 2019
Sep 10 2019
Talked to @akosiaris, who will open up a new task to replace the newly failed drive. We ordered a few of them last time, so hopefully we'll have more spares lying around.
@Cmjohnson - just following up to see if we have the correct part
@Cmjohnson - could be the drive isn't seated securely or possibly a loose cable/connection
Looks like the warranty expired on Jan. 14, 2018. @Papaul - let me know if you have any spares lying around or if we need to purchase a new disk. Thanks, Willy
Sep 9 2019
Here's the response I got from Dell (pasted below). @Cmjohnson or @Jclark-ctr : can one of you guys call Dell at 1-800-456-3355, explain to them the numerous parts we've already replaced (and that it continues to crash on load) and get them to analyze the logs for the system? Let me know how it goes.
Per SRE meeting, we'll be rescheduling the PDU upgrades for this rack to a later date TBA due to a lot of the ongoing work related to the recent outages.
Sep 4 2019
Emailed our Dell account rep, who responded that they will look into what our options are and get back to us. Thanks, Willy
Assigning to @Bstorm to follow up on the previous comment.
Thanks @Andrew - I'll reach out to our Account Rep, to see if something else can be done.
Hi @Andrew - I mentioned the ongoing issues with this machine to our Dell account rep last week, since we've basically replaced every CPU/DIMM/MB on this box. They mentioned we could install Live Optics to evaluate load, but I'm not sure this is something we want to run on our hardware. Do you have another cloudvirt machine up and running right now on the same hardware specs? Essentially running at the same CPU usage...mainly so we can compare and try to isolate any other type of config differences between them.
Hi @Volans - I was wondering, in the meantime, would it be possible to give all the FTE dc-ops engineers the necessary permissions to install and decom hosts from beginning to end? Maybe either by adding these rights to a dc-ops group or granting root access for Papaul? He's definitely going to need the ability to do all this in the next 1.5 months, since he'll be in Amsterdam refreshing the entire site. Thanks, Willy
Aug 30 2019
Thanks for confirming @Cmjohnson , subtask T231670 created for Rob to order the part. Thanks, Willy
Aug 27 2019
@Bstorm - I was able to confirm we originally ordered this machine to include 1.6TB drives via https://phabricator.wikimedia.org/T155075 , but wasn't able to find any other tasks that showed when/how they were replaced with 1.9TB drives (which Dell won't support). Do you have any details from previous records on where these 1.9TB disks came from? (i.e., swapped from another server, ordered separately, etc.) Thanks, Willy
@Volans - ah, that makes sense. Thanks, let's just resolve this task then.
@Jclark-ctr - can we resolve this task? Thanks, Willy
@Volans - hey Riccardo, not sure if you're the right person for this, but thought I'd try asking you. Is there a different output we can get for this alert, to help us isolate the disk issue a bit more?
Aug 23 2019
Aug 20 2019
Confirmed by Chris that the drive arrived on August 8
Thanks @Marostegui , I appreciate it.
Aug 19 2019
@Marostegui - I would say just go for it and fail out in advance, if it's not too much trouble. Master DBs are very critical, so my opinion is to just take the extra precautionary measures. Thanks, Willy
@Marostegui - I'll defer to Faidon or Mark for their opinion, but my suggestion is to go ahead and fail out in advance if it's not too much of a hassle. The success rate of us upgrading PDUs without any issues is pretty good, but unexpected accidents can occur, and master DBs are very critical to the infrastructure.
Aug 16 2019
Thanks Chris, hopefully this will solve things.