@andrewbogott This server is ready for you. I updated the RAID config to RAID 10 with 2 spare disks. The network switch has been updated (old switch info removed) with the new ports and correct vlans. I disabled the ethernet ports, and the 10G ports are enabled. I sent you the MAC address via PM on IRC. Resolve this task once finished. I am removing the ops-eqiad tag and assigning the task to you. If there is an issue, please add the tag back and assign it to me.
Tue, Jul 16
@andrewbogott: Is it safe to move forward with this task?
@Volans I have the new SSD. Are you positive that /dev/sda is in slot 0?
cmjohnson@analytics1072:~$ sudo megacli -LdPdInfo -aall | grep -e 'Virtual Drive' -e Slot
Virtual Drive: 0 (Target Id: 0)
Slot Number: 12
Slot Number: 13
Virtual Drive: 2 (Target Id: 2)
Slot Number: 1
Virtual Drive: 3 (Target Id: 3)
Slot Number: 2
Virtual Drive: 4 (Target Id: 4)
Slot Number: 3
Virtual Drive: 5 (Target Id: 5)
Slot Number: 4
Virtual Drive: 6 (Target Id: 6)
Slot Number: 5
Virtual Drive: 7 (Target Id: 7)
Slot Number: 6
Virtual Drive: 8 (Target Id: 8)
Slot Number: 7
Virtual Drive: 9 (Target Id: 9)
Slot Number: 8
Virtual Drive: 10 (Target Id: 10)
Slot Number: 9
Virtual Drive: 11 (Target Id: 11)
Slot Number: 10
Virtual Drive: 12 (Target Id: 12)
Slot Number: 11
@elukey the disk has been replaced; it is still in the Unconfigured (good) state. The disk needs to be mapped back to Virtual Drive: 1 (Target Id: 1)
Slot Number: 0
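A minimal sketch, assuming the grep-filtered megacli output keeps the exact "Virtual Drive: N" / "Slot Number: N" line format shown in the paste above, of how the gap (Virtual Drive 1, slot 0) could be spotted automatically instead of by eye. The function names and the rebuilt sample text are hypothetical and only for illustration; nothing here calls megacli itself.

```python
import re

def slots_by_virtual_drive(text):
    """Map each virtual drive number to the slot numbers listed under it."""
    mapping = {}
    current = None
    for line in text.splitlines():
        vd = re.match(r"\s*Virtual Drive:\s*(\d+)", line)
        slot = re.match(r"\s*Slot Number:\s*(\d+)", line)
        if vd:
            current = int(vd.group(1))
            mapping[current] = []
        elif slot and current is not None:
            mapping[current].append(int(slot.group(1)))
    return mapping

def find_gaps(mapping):
    """Return (missing virtual drives, missing slots) within the observed ranges."""
    if not mapping:
        return [], []
    used_slots = {s for slots in mapping.values() for s in slots}
    missing_vds = [n for n in range(min(mapping), max(mapping) + 1) if n not in mapping]
    top_slot = max(used_slots) if used_slots else -1
    # Assumption: slot numbering starts at 0.
    missing_slots = [n for n in range(0, top_slot + 1) if n not in used_slots]
    return missing_vds, missing_slots

# Rebuild the pasted output: VD 0 holds slots 12 and 13, VD n holds slot n-1 for n = 2..12.
lines = ["Virtual Drive: 0 (Target Id: 0)", "Slot Number: 12", "Slot Number: 13"]
for n in range(2, 13):
    lines += [f"Virtual Drive: {n} (Target Id: {n})", f"Slot Number: {n - 1}"]
SAMPLE = "\n".join(lines)

if __name__ == "__main__":
    missing_vds, missing_slots = find_gaps(slots_by_virtual_drive(SAMPLE))
    print("missing virtual drives:", missing_vds)
    print("missing slots:", missing_slots)
```

On the pasted output this reports Virtual Drive 1 and slot 0 as the gaps, which matches the disk that still needs to be mapped back.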
@godog replaced the disk, all should be good now. If you find that it's not please re-open the task and ping me.
Resolving this task for now, if the error returns please re-open and ping me.
I am resolving this ticket, please re-open and ping me if the problem returns.
I swapped all the DIMMs from side A to side B, cleared the log, and powered the server back up. Please put the server back in service and let's see if the reseating worked.
Last log paste before clearing the log
Swapped DIMM A3 with DIMM B3. Now we have to power the server back on and let it run for a few days to see whether the error returns and where it returns.
one last paste of the idrac log
@godog I did get the new disk, but since it has not failed I am not sure which disk is actually bad on my end. Do you know which slot the disk is in? Otherwise, let's coordinate and see if you can make the disk blink.
I received the disk on-site but I cannot tell which disk has failed; they all have green LEDs. @elukey could you please let me know which disk slot it is, or let's coordinate to make the disk blink.
Mon, Jul 15
@godog, no worries about the earlier comment. Dell approved the disk replacement. I will update the task once it's been replaced.
Disk is on its way.
Thanks @godog, is there any way you can put some stress on that disk? It's hard for me to justify to Dell that we need a disk replacement when it shows that it is working.
Fri, Jul 12
The Dell ticket has been created: "You have successfully submitted request SR994463766". I did see the disk in megacli, so I am not sure the TSR report I sent them will include the disk. I did include what you pasted in this ticket showing sdb as failed. Hopefully, that's enough to get a new disk shipped.
I am not sure what I was looking at yesterday, but this server is out of warranty. However, I think I have a 4TB disk that I can replace it with. I will confirm when I get back to eqiad next week.
This server is out of warranty. I can reseat the DIMM but will need the server to be taken down.
Here is the Dell task: "You have successfully submitted request SR994463101".
This is a Dell server. I will try to put in a ticket with Dell, but all h/w is showing that there isn't a problem, so I may have trouble getting Dell to give me a disk for an issue that they do not think exists.
Thu, Jul 11
Declining the task since the server is out of warranty.
Since there is no need to replace these disks...declining the task
All items on rows A, B and C have been updated. Row D will need some on-site verification.
This was completed a while ago...never updated the task.
@elukey I am not sure which disk this is. I think it's a smaller SSD? Can you confirm the disk type and size please?
I still need to move the DIMM around...I need the server taken down. If this needs to be scheduled, please let me know when you can have the server down.
@Dzahn, I don't know what that means, and I need to know: what does DC-ops need to troubleshoot? Thanks
I am removing the ops-eqiad tag; if on-site work is still required, please add the ops-eqiad tag.
I am removing the ops-eqiad tag on this task; if you need additional DC ops work, please add the tag back.
The server is out of warranty and I do not have any spare DIMMs.
I am resolving this task
I have removed the ops-eqiad tag; if you have an issue that requires DC ops, please add the ops-eqiad tag back to the task.
Tue, Jul 9
@Eevans We did a test run for an install and the server was able to reach the installer without an issue. I did see on IRC something about stretch. I will leave that up to you if you like and the server can be installed whenever you need it.
restbase1017 has been moved to rack B5
network port updated
Wed, Jul 3
Please decommission the current servers to spare role
Please provide the new hostnames you want to use
These are all located in row B...will that be okay or do you need them spread out across the rows?
Thu, Jun 27
@Eevans Do you still want to move this server? Let's coordinate a day/time
Assigning this to @RobH; he is able to allocate a spare.
@Marostegui disk swapped but this server is out of warranty. I would suggest moving masters to new servers.
Mon, Jun 24
Wed, Jun 19
I updated the switch config to private1-d. Both servers are currently off and ready for installs. Assigning to @RobH to install.
Dell is sending me a new RAID card, cables and backplane. Sorry it took so long; I had to call them after they denied my second request.
Closing this for now, let me know if there is another issue. Keep in mind this server is out of warranty.
The DIMM has been reseated and swapped to the opposite sides.
Jun 17 2019
these have been racked
servers are set up and have been added to the tracking sheet
servers are ready as spares and in tracking sheet
@ayounsi I would rather not move the servers...I racked them based on the instructions and they're already in racks and set up.
@Marostegui: Do they all go to the cloud vlan? If they do, then 1020 and 1021 are in row D, and that support-cloud vlan is not available on row D yet. I need Arzhel to copy the vlan over.
Assigning to @ayounsi to add cloud-support1-d-eqiad. Once that is done, the vlan for dbproxy1020 and 1021 will need to be set up. Switch port descriptions are done.
Jun 11 2019
They declined my ticket...they say I didn't isolate the problem well enough.
This server accepts all the racadm commands successfully. I verified on-site that these things actually happened
@Andrew what parts? There is nothing that suggests that it is the CPU on the server side of things. I reseated and moved the DIMM and that error has not returned. It may very well have been a poorly seated DIMM. I checked dmesg and do not see any more errors related to memory or CPU. Try putting it back into production and let's see if anything comes back. Unfortunately, I need to demonstrate and prove there is a problem for Dell to do anything, and right now I do not have anything to give them.