@Jclark-ctr - since Chris had to use a sick day, can one of you guys take a look at this for Luca? Thanks, Willy
Mon, Sep 16
Checked with @Cmjohnson , who says he'll follow up to check the connections.
Hi @Dzahn @jbond - looks like this host is out of warranty and about 3/4 of a year away from a hardware refresh....so just wanted to double-check: are you considering retiring this system soon, or would you like us to purchase the replacement hardware part? Thanks, Willy
Originally scheduled for Thursday 9/19, but will reschedule for a later date, since this is a network rack.
Thanks @Jclark-ctr , can you have the drive replaced this week? Also, you might need to coordinate with @jcrespo via IRC to get a couple other things completed to get backup1001 up and running. Thanks, Willy
Fri, Sep 13
@Cmjohnson - can you provide an update on this one next week? Thanks, Willy
Hi @Dzahn - just following up on this one, to see when the server can be taken down. Thanks, Willy
Wed, Sep 11
Tue, Sep 10
Talked to @akosiaris, who will open up a new task to replace the newly failed drive. We ordered a few of them last time, so hopefully we'll have more spares lying around.
@Cmjohnson - just following up to see if we have the correct part
@Cmjohnson - could be the drive isn't seated securely, or possibly a loose cable/connection
Looks like the warranty expired on Jan. 14, 2018. @Papaul - let me know if you have any spares lying around or if we need to purchase a new disk. Thanks, Willy
Mon, Sep 9
Here's the response I got from Dell (pasted below). @Cmjohnson or @Jclark-ctr : can one of you guys call Dell at 1-800-456-3355, explain to them the numerous parts we've already replaced (and that it continues to crash on load) and get them to analyze the logs for the system? Let me know how it goes.
Per SRE meeting, we'll be rescheduling the PDU upgrades for this rack to a later date TBA due to a lot of the ongoing work related to the recent outages.
Wed, Sep 4
Emailed our Dell account rep, who responded that they will look into what our options are and get back to us. Thanks, Willy
Assigning to @Bstorm to follow up on the previous comment.
Thanks @Andrew - I'll reach out to our Account Rep, to see if something else can be done.
Hi @Andrew - I mentioned the ongoing issues with this machine to our Dell account rep last week, since we've basically replaced every CPU/DIMM/MB on this box. They mentioned we could install Live Optics to evaluate load, but I'm not sure this is something we want to run on our hardware. Do you have another cloudvirt machine up and running right now with the same hardware specs and essentially the same CPU usage? Mainly so we can compare and try to isolate any other config differences between them.
Hi @Volans - I was wondering, in the meantime, would it be possible to give all the FTE dc-ops engineers the necessary permissions to install and decom hosts from beginning to end? Maybe either by adding these rights to a dc-ops group or granting root access for Papaul? He's definitely going to need the ability to do all this in the next 1.5 months, since he'll be in Amsterdam refreshing the entire site. Thanks, Willy
Fri, Aug 30
Thanks for confirming @Cmjohnson , subtask T231670 created for Rob to order the part. Thanks, Willy
Tue, Aug 27
@Bstorm - I was able to confirm we originally ordered this machine to include 1.6TB drives via https://phabricator.wikimedia.org/T155075 , but wasn't able to find any other tasks that showed when/how they were replaced with 1.9TB drives (which Dell won't support). Do you have any details from previous records on where these 1.9TB disks came from? (i.e. swapped from another server, ordered separately, etc.) Thanks, Willy
@Volans - ah, that makes sense. Thanks, let's just resolve this task then.
@Jclark-ctr - can we resolve this task? Thanks, Willy
@Volans - hey Riccardo, not sure if you're the right person for this, but thought I'd try asking you. Is there a different output we can get for this alert, to help us isolate the disk issue a bit more?
Fri, Aug 23
Tue, Aug 20
Confirmed by Chris that the drive arrived on August 8
Thanks @Marostegui , I appreciate it.
Mon, Aug 19
@Marostegui - I would say just go for it and fail out in advance, if it's not too much trouble. Master DBs are very critical, so my opinion is to just take the extra precautionary measures. Thanks, Willy
@Marostegui - I'll defer to Faidon or Mark for their opinion, but my suggestion is to go ahead and fail out in advance if it's not too much of a hassle. The success rate of us upgrading PDUs without any issues is pretty good, but unexpected accidents can occur, and master DBs are very critical to the infrastructure.
Aug 16 2019
Thanks Chris, hopefully this will solve things.
Aug 15 2019
Aug 14 2019
Aug 13 2019
Aug 12 2019
Just a heads up Chris, the system is under warranty thru June 2021. Thanks, Willy
Aug 9 2019
Info entered into Netbox by @RobH. Resolving task.
Aug 8 2019
Drives received last Wed, July 31 by @Jclark-ctr
@Cmjohnson - just following up on this one, since you were out on vacation last week when the task came in.
Aug 7 2019
Moving back to @Cmjohnson - can you try getting Dell to RMA you a motherboard? If they give you pushback, let me know and I can try escalating with our account manager.
Aug 6 2019
created task as a test. resolving.
@elukey , thank you
Aug 5 2019
Confirmed server is under warranty thru March 2021.
@Marostegui - Ha, we tied. =)
Cable between mr1-eqsin p4 <---> asw-0603-eqsin p23 looks like it accidentally got bumped by the contractor during the server install. Called him back and he was able to resolve the issue by reseating the cables. Link has been stable for the past 15min now. Resolving task.
Alright, I'm asking him to go back to the datacenter to check all the connections on mr1-eqsin.
Asset tags applied by Jin from DreamICC today as follows (also emailed out via a spreadsheet):
@CDanis - I just checked with our 3rd party contractor, and he says it shouldn't have been affected by the work he was doing. That said, he was working in the racks from 1:45-4:00 UTC, and if it only alerted for a few minutes, it's possible something was accidentally bumped while he was installing the 3 servers. It's no longer alerting, right?
Info gathered by Jin from DreamICC today. Here's the info below (also sent out via email):
Completed by Jin from DreamICC today. The missing IPv4 addresses used are the following, with the gateway set to 10.132.129.1 accordingly (instead of 10.132.128.1):
Aug 2 2019
@faidon - The majority of the influx in Netbox errors looks like it's from the new PDUs. Some of the info was updated in Netbox to fix the discrepancies reported by Accounting earlier this week, but this also created new Netbox errors, like missing purchase date and procurement ticket. I'll follow up with @RobH or @Cmjohnson next week - it'll be a good training exercise/task for John to work on. Thanks, Willy
Aug 1 2019
@Papaul - if you can't find a spare from any of those decom servers, we can order it, since it's still a while before the 5yr mark.
Jul 31 2019
@Jclark-ctr - whenever you have a few min free, can you see if this is just a loose cable that maybe got accidentally pulled from the PDU swap last week? If it's actually a bad PSU, I think we can leave it, since it's due to be refreshed via T221636.
Jul 30 2019
Assigning to @RobH for results from ePSA pre-boot system assessment, before determining the next steps.
Jul 29 2019
System is in-warranty (doesn't expire until May 2020)
Jul 26 2019
Approved for the following: