
hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet
Closed, Resolved · Public · Request

Authored By: Eevans
Aug 15 2023, 3:18 PM

Description

  • Provide FQDN of system.
  • If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
  • Put system into a failed state in Netbox.
  • Provide urgency of request, along with justification (redundancy, dependencies, etc)
  • Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
  • Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

restbase1030.eqiad.wmnet has a failing SSD (see dmesg output below).

The errors are causing one of the three Cassandra instances/nodes hosted there to exit with SIGSEGV. So long as the other two instances continue to function, we're better off leaving them in service than shutting the host down until repairs can be made. Errors are accumulating, however, and the longer it sits in a degraded state like this, the greater the possibility of inconsistency (so urgency is medium(ish)?).

The SSD can be replaced at any time, but all three instances should be shut down and prevented from restarting beforehand. I am happy to coordinate (and drop everything), or someone else can do this with the following:

$ sudo rm /etc/cassandra-{a,b,c}/service-enabled
$ sudo systemctl stop cassandra-a
$ sudo systemctl stop cassandra-b
$ sudo systemctl stop cassandra-c

Once the SSD is replaced, I can take it from there.
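For reference, the inverse once everything is back in place (assuming the service-enabled flag file just needs to be recreated; it may normally be managed by Puppet):

$ sudo touch /etc/cassandra-{a,b,c}/service-enabled
$ sudo systemctl start cassandra-a
$ sudo systemctl start cassandra-b
$ sudo systemctl start cassandra-c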

NOTE: Normally we'd decommission the entire host and wait with minimal urgency for a repair, but we are over-capacity and no longer able to do so. 🙁

dmesg output
[ 2042.187996] ata5.00: exception Emask 0x0 SAct 0xf880440f SErr 0x0 action 0x0
[ 2042.195050] ata5.00: irq_stat 0x40000008
[ 2042.198995] ata5.00: failed command: READ FPDMA QUEUED
[ 2042.204147] ata5.00: cmd 60/00:b8:a8:72:c1/01:00:9b:00:00/40 tag 23 ncq dma 131072 in
                        res 51/40:00:a8:72:c1/00:01:9b:00:00/40 Emask 0x409 (media error) <F>
[ 2042.220222] ata5.00: status: { DRDY ERR }
[ 2042.224240] ata5.00: error: { UNC }
[ 2042.228286] ata5.00: configured for UDMA/133
[ 2042.228360] sd 4:0:0:0: [sdc] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 2042.228365] sd 4:0:0:0: [sdc] tag#23 Sense Key : Medium Error [current] 
[ 2042.228371] sd 4:0:0:0: [sdc] tag#23 Add. Sense: Unrecovered read error - auto reallocate failed
[ 2042.228376] sd 4:0:0:0: [sdc] tag#23 CDB: Read(10) 28 00 9b c1 72 a8 00 01 00 00
[ 2042.228379] print_req_error: I/O error, dev sdc, sector 2613146280
[ 2042.234633] ata5: EH complete
[ 2042.363986] ata5.00: exception Emask 0x0 SAct 0x1202fff8 SErr 0x0 action 0x0
[ 2042.371045] ata5.00: irq_stat 0x40000008
[ 2042.374992] ata5.00: failed command: READ FPDMA QUEUED
[ 2042.380151] ata5.00: cmd 60/08:e0:60:73:c1/00:00:9b:00:00/40 tag 28 ncq dma 4096 in
                        res 51/40:08:60:73:c1/00:00:9b:00:00/40 Emask 0x409 (media error) <F>
[ 2042.396050] ata5.00: status: { DRDY ERR }
[ 2042.400068] ata5.00: error: { UNC }
[ 2042.404049] ata5.00: configured for UDMA/133
[ 2042.404078] sd 4:0:0:0: [sdc] tag#28 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 2042.404080] sd 4:0:0:0: [sdc] tag#28 Sense Key : Medium Error [current] 
[ 2042.404082] sd 4:0:0:0: [sdc] tag#28 Add. Sense: Unrecovered read error - auto reallocate failed
[ 2042.404084] sd 4:0:0:0: [sdc] tag#28 CDB: Read(10) 28 00 9b c1 73 60 00 00 08 00
[ 2042.404086] print_req_error: I/O error, dev sdc, sector 2613146464
[ 2042.410275] ata5: EH complete

See also: T344210: restbase1030: Cassandra crashing (signal 11)

Event Timeline

@wiki_willy @RobH we do not have any replacement SSD for this server, and it is out of warranty; we would need to order a replacement.

RobH mentioned this in Unknown Object (Task). Aug 22 2023, 7:16 PM
RobH added a subtask: Unknown Object (Task).

Mentioned in SAL (#wikimedia-operations) [2023-08-29T08:26:21Z] <claime> downtiming cassandra-a alerts on restbase1030.eqiad.wmnet for 14 days T344210 T344259

RobH closed subtask Unknown Object (Task) as Resolved. Aug 30 2023, 8:29 PM

The new SSD was picked up as /dev/sdd (instead of /dev/sdc), so I rebooted the host (and the new device came up as sdc).

Afterward, I copied the partition table (from sda) and generated new random GUIDs. Basically, I followed the directions from SRE/Dc-operations/Sw_raid_rebuild_directions, stopping short of the mdadm ... --add ..., at which point I rebooted once more so that I could first verify the partition table. This time though, the host never came back up.
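(For the record, the partition-table steps from the runbook amount to roughly this; a sketch, not a verbatim transcript of what I ran:)

$ sudo sgdisk --replicate=/dev/sdc /dev/sda      # copy sda's partition table onto the new disk
$ sudo sgdisk --randomize-guids /dev/sdc         # give the copy new, unique GUIDs
$ sudo sgdisk --print /dev/sdc                   # verify before any mdadm --add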

There was no output on the serial console, so I rebooted again, and found that output from the boot sequence stopped at about the point it handed off to a boot device. For example:

last console screen.png (575×1 px, 46 KB)

At this point, I (mistakenly) assumed that I must have somehow messed up the disk config, rendering the host no longer bootable. Since the Cassandra nodes on this host were going to be a total loss anyway, and since my very next action was to be an upgrade to Bullseye (see: T331713), I moved on pretty quickly to the reimage cookbook (I had been hoping to kill two birds with one stone anyway).

The reimage cookbook failed, however, and output from the serial console continued to freeze at the same place. It was at this point that I tried the virtual console in the web UI and realized that it was just the serial console cutting out; the boot sequence was carrying on afterward (it just wasn't visible). See below for an example of the actual failures:

rpviewer(4).png (400×720 px, 18 KB)


I have also tried upgrading the NIC firmware (from 21.40.21 to 22.31.6), in the hopes that it might permit a PXE boot; the iDRAC (from 4.00 to 7.00), in the hopes that it might fix the serial console; and the BIOS (from 2.4.8 to 2.17.1), out of desperation, because why not?

The server is currently powered off, I'll regroup on it tomorrow. In the meantime, if anyone has any ideas, they would be greatly appreciated! /cc @Jclark-ctr @ssingh

Hi @Eevans,

I have also tried upgrading the NIC firmware (from 21.40.21 to 22.31.6)

As per https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation and what we have seen in the Traffic hosts reimaging, NIC firmware 22.x and above breaks the installer -- which is not the same as your issue, but might be worth a look. I definitely do recommend trying 21.85, if you run out of other options.


As discussed elsewhere, but noted here for posterity:

This machine has the 1G Broadcom NICs; 21.85 is for the NetXtreme-E (10G) (see: https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=rxp80)
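(If it helps to double-check which NIC and firmware a host has without an OS, the iDRAC should be able to report the embedded NIC's model and firmware version with something like the following; hwinventory per Dell's racadm documentation, exact field names may differ:)

racadm>>racadm hwinventory NIC.Embedded.1-1-1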

From IRC:

1:29 PM <papaul> urandom: looking at the server in netbox is looks like it racked in a 10G rack or it connected using 1g so in the pass when we had this issue the fix was to  swaped the tranceiver that is the first place i will start in looking 
1:31 PM <urandom> papaul: the transceiver?  Is that the SFP-T?
1:31 PM <papaul> urandom: yes
1:40 PM <topranks> urandom: just checked and on the switch side it's showing hard down 
1:40 PM <topranks> at the same time on the console it was trying to get IP off DHCP as part of PXEboot 
1:40 PM <topranks> so yeah I think first step is probably to check the physicals, maybe replace the transceiver 
1:40 PM <urandom> wait, I powered it down last I handled it
1:41 PM <topranks> it's trying to PXEboot right now
1:41 PM <topranks> https://usercontent.irccloud-cdn.com/file/1pSVwdf2/image.png
1:42 PM <topranks> But switch-side shows as down:
1:42 PM <topranks> cmooney@asw2-d-eqiad> show interfaces ge-4/0/20 
1:42 PM <topranks> Physical interface: ge-4/0/20, Enabled, Physical link is Down
1:42 PM <urandom> oh, ok
1:43 PM <urandom> yeah, that looks like https://phabricator.wikimedia.org/T340055, where we had to replace the SFP-T with a specific brand (Wave2Wave).
1:43 PM <topranks> so yeah probably just a physical cable issue or transceiver is bad 
1:43 PM <topranks> hmm yeah I remember that now (odd brand name!)
1:44 PM <topranks> definitely the first thing to try though yep 
1:44 PM <urandom> awesome, thanks for having a look!
1:44 PM <topranks> np!

@Jclark-ctr would you be able to check the physical connection (network)?

This is looking similar to T340055. For some reason it wasn't simply a matter of a bad SFP-T; it would only work with one of the Wave2Wave transceivers (Wave2Wave 77J-S010-T).

@Eevans we do not have that optic at our site in eqiad and never have; Wave2Wave is a newer distributor we are using for optics and cables. It might be a mixture of cable/optic or a loose connection: the link was down, but when pressure was applied to the cable the link returned.

@Eevans I disposed of the old optic and replaced the cable; can you verify whether the error is still present? Previous eqiad onsite staff did not always dispose of defective optics, so it might have been replaced with another defective optic.


Thanks @Jclark-ctr, unfortunately though it's still not working.

I don't know what this looks like from the switch perspective, but from the DRAC I see this:

racadm>>racadm nicstatistics NIC.Embedded.1-1-1
Device Description:                           Embedded NIC 1 Port 1 Partition 1
Total Bytes Received:                         135094
Total Bytes Transmitted:                      5346
Total Unicast Bytes Received:                 0
Total Multicast Bytes Received:               0
Total Broadcast Bytes Received:               0
Total Unicast Bytes Transmitted:              0
Total Multicast Bytes Transmitted:            0
Total Broadcast Bytes Transmitted:            9
FCS error packets Received:                   0
Alignment error packets Received:             0
False Carrier error packets Received:         Not Applicable
Runt frames Received:                         0
Jabber error frames Received:                 0
Total Pause XON frames Received:              0
Total Pause XOFF frames Received:             0
Discarded packets:                            0
Single Collision frames Transmitted:          0
Multiple Collision frames Transmitted:        0
Late Collision frames Transmitted:            0
Excessive Collision frames Transmitted:       0
Link Status:                                  Up
OS Driver State:                              Non Operational

This is what it looked like initially (link up, and a few bytes on the tx/rx counters). However, after the reimage cookbook kicked off a restart and it attempted to PXE boot for the first time, the NIC stats disappeared, which seems...weird.

racadm>>racadm nicstatistics NIC.Embedded.1-1-1
No Port Statistics found for FQDD "NIC.Embedded.1-1-1"
No Partition Statistics found for FQDD "NIC.Embedded.1-1-1"
racadm>>racadm nicstatistics NIC.Embedded.1-1-1
No Port Statistics found for FQDD "NIC.Embedded.1-1-1"
No Partition Statistics found for FQDD "NIC.Embedded.1-1-1"
racadm>>racadm nicstatistics NIC.Embedded.1-1-1
No Port Statistics found for FQDD "NIC.Embedded.1-1-1"
No Partition Statistics found for FQDD "NIC.Embedded.1-1-1"
racadm>>racadm nicstatistics NIC.Embedded.1-1-1
No Port Statistics found for FQDD "NIC.Embedded.1-1-1"
No Partition Statistics found for FQDD "NIC.Embedded.1-1-1"
racadm>>racadm nicstatistics NIC.Embedded.1-1-1
No Port Statistics found for FQDD "NIC.Embedded.1-1-1"
No Partition Statistics found for FQDD "NIC.Embedded.1-1-1"
racadm>>racadm nicstatistics NIC.Embedded.1-1-1
No Port Statistics found for FQDD "NIC.Embedded.1-1-1"
No Partition Statistics found for FQDD "NIC.Embedded.1-1-1"
racadm>>

After the first attempt at DHCP failed, the stats output returned (also weird). And notice how the counters have been zeroed out.

racadm>>racadm nicstatistics NIC.Embedded.1-1-1
Device Description:                           Embedded NIC 1 Port 1 Partition 1
Total Bytes Received:                         0
Total Bytes Transmitted:                      0
Total Unicast Bytes Received:                 0
Total Multicast Bytes Received:               0
Total Broadcast Bytes Received:               0
Total Unicast Bytes Transmitted:              0
Total Multicast Bytes Transmitted:            0
Total Broadcast Bytes Transmitted:            0
FCS error packets Received:                   0
Alignment error packets Received:             0
False Carrier error packets Received:         Not Applicable
Runt frames Received:                         0
Jabber error frames Received:                 0
Total Pause XON frames Received:              0
Total Pause XOFF frames Received:             0
Discarded packets:                            0
Single Collision frames Transmitted:          0
Multiple Collision frames Transmitted:        0
Late Collision frames Transmitted:            0
Excessive Collision frames Transmitted:       0
Link Status:                                  Up
OS Driver State:                              Operational
No Partition Statistics found for FQDD "NIC.Embedded.1-1-1"
racadm>>

After the subsequent DHCP attempt (and each one that follows, in perpetuity), the tx byte counter gets bumped, but the rx counter remains zeroed out.

racadm>>racadm nicstatistics NIC.Embedded.1-1-1
Device Description:                           Embedded NIC 1 Port 1 Partition 1
Total Bytes Received:                         0
Total Bytes Transmitted:                      2376
Total Unicast Bytes Received:                 0
Total Multicast Bytes Received:               0
Total Broadcast Bytes Received:               0
Total Unicast Bytes Transmitted:              0
Total Multicast Bytes Transmitted:            0
Total Broadcast Bytes Transmitted:            4
FCS error packets Received:                   0
Alignment error packets Received:             0
False Carrier error packets Received:         Not Applicable
Runt frames Received:                         0
Jabber error frames Received:                 0
Total Pause XON frames Received:              0
Total Pause XOFF frames Received:             0
Discarded packets:                            0
Single Collision frames Transmitted:          0
Multiple Collision frames Transmitted:        0
Late Collision frames Transmitted:            0
Excessive Collision frames Transmitted:       0
Link Status:                                  Up
OS Driver State:                              Operational
No Partition Statistics found for FQDD "NIC.Embedded.1-1-1"
racadm>>

From install1004, I can see my attempts on Aug 31 failing with no free leases, which would seem to suggest (at least for some subset of attempts) that it wasn't a hardware/networking issue.

...
Aug 31 16:45:19 install1004 dhcpd[1181237]: DHCPDISCOVER from 4c:d9:8f:a7:f9:c9 via 10.64.48.2: network 10.64.48.0/22: no free leases
Aug 31 16:45:19 install1004 dhcpd[1181237]: DHCPDISCOVER from 4c:d9:8f:a7:f9:c9 via 10.64.48.3: network 10.64.48.0/22: no free leases
Aug 31 16:45:23 install1004 dhcpd[1181237]: DHCPDISCOVER from 4c:d9:8f:a7:f9:c9 via 10.64.48.2: network 10.64.48.0/22: no free leases
Aug 31 16:45:23 install1004 dhcpd[1181237]: DHCPDISCOVER from 4c:d9:8f:a7:f9:c9 via 10.64.48.3: network 10.64.48.0/22: no free leases
Aug 31 16:45:31 install1004 dhcpd[1181237]: DHCPDISCOVER from 4c:d9:8f:a7:f9:c9 via 10.64.48.3: network 10.64.48.0/22: no free leases
Aug 31 16:45:31 install1004 dhcpd[1181237]: DHCPDISCOVER from 4c:d9:8f:a7:f9:c9 via 10.64.48.2: network 10.64.48.0/22: no free leases
Aug 31 16:45:47 install1004 dhcpd[1181237]: DHCPDISCOVER from 4c:d9:8f:a7:f9:c9 via 10.64.48.3: network 10.64.48.0/22: no free leases
Aug 31 16:45:47 install1004 dhcpd[1181237]: DHCPDISCOVER from 4c:d9:8f:a7:f9:c9 via 10.64.48.2: network 10.64.48.0/22: no free leases
Aug 31 16:46:19 install1004 dhcpd[1181237]: DHCPDISCOVER from 4c:d9:8f:a7:f9:c9 via 10.64.48.3: network 10.64.48.0/22: no free leases
Aug 31 16:46:19 install1004 dhcpd[1181237]: DHCPDISCOVER from 4c:d9:8f:a7:f9:c9 via 10.64.48.2: network 10.64.48.0/22: no free leases
...

Similar logs exist from yesterday's attempts as well:

...
Sep  5 21:09:16 install1004 dhcpd[2055096]: DHCPDISCOVER from 4c:d9:8f:a7:f9:c9 via 10.64.48.3: network 10.64.48.0/22: no free leases
Sep  5 21:09:16 install1004 dhcpd[2055096]: DHCPDISCOVER from 4c:d9:8f:a7:f9:c9 via 10.64.48.2: network 10.64.48.0/22: no free leases
Sep  5 21:09:32 install1004 dhcpd[2055096]: DHCPDISCOVER from 4c:d9:8f:a7:f9:c9 via 10.64.48.3: network 10.64.48.0/22: no free leases
Sep  5 21:09:32 install1004 dhcpd[2055096]: DHCPDISCOVER from 4c:d9:8f:a7:f9:c9 via 10.64.48.2: network 10.64.48.0/22: no free leases
Sep  5 21:10:04 install1004 dhcpd[2055096]: DHCPDISCOVER from 4c:d9:8f:a7:f9:c9 via 10.64.48.3: network 10.64.48.0/22: no free leases
Sep  5 21:10:04 install1004 dhcpd[2055096]: DHCPDISCOVER from 4c:d9:8f:a7:f9:c9 via 10.64.48.2: network 10.64.48.0/22: no free leases
Sep  5 21:10:56 install1004 dhcpd[2055096]: DHCPDISCOVER from 4c:d9:8f:a7:f9:c9 via 10.64.48.2: network 10.64.48.0/22: no free leases
Sep  5 21:10:56 install1004 dhcpd[2055096]: DHCPDISCOVER from 4c:d9:8f:a7:f9:c9 via 10.64.48.3: network 10.64.48.0/22: no free leases
Sep  5 21:11:00 install1004 dhcpd[2055096]: DHCPDISCOVER from 4c:d9:8f:a7:f9:c9 via 10.64.48.2: network 10.64.48.0/22: no free leases
Sep  5 21:11:00 install1004 dhcpd[2055096]: DHCPDISCOVER from 4c:d9:8f:a7:f9:c9 via 10.64.48.3: network 10.64.48.0/22: no free leases
Sep  5 21:11:08 install1004 dhcpd[2055096]: DHCPDISCOVER from 4c:d9:8f:a7:f9:c9 via 10.64.48.2: network 10.64.48.0/22: no free leases
...

But these logs don't seem to cover all of my attempts during this time period, and I see no corresponding log output from dhcpd for the attempts I've made today.
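(For context: for a relayed DHCPDISCOVER, "no free leases" generally means dhcpd matched no host declaration for that MAC and has no dynamic pool on that subnet; presumably whatever per-host entry the reimage automation normally puts in place wasn't in effect at that moment. A hypothetical ISC dhcpd host stanza, just to illustrate what has to exist:)

host restbase1030 {
    hardware ethernet 4c:d9:8f:a7:f9:c9;      # the MAC seen in the DISCOVERs above
    fixed-address restbase1030.eqiad.wmnet;
    # plus the usual PXE/installer options (next-server, filename, ...)
}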

I'm not 100% sure what's going on here. What I can say:

  • During PXE boot, or from Linux booted off the Debian ISO, the eno1 interface shows "UP" on the host side
  • In both cases it stays hard DOWN on the switch side
  • It appears that at some point the link came up, as there are logs on install1004 showing DHCP DISCOVER packets from the correct MAC getting there
  • The logs on the switch, for the time it was UP, show this error:
Sep  5 21:12:26  asw2-d-eqiad fpc4 DCBCM: dcbcm_drv_port_get_lipa_status BCM speed is set on ge-4/0/20 with external phy speed is 100Mbps and bcm_port_speed is 1000Mbps

Forcing the switch-side to 100Mb brought the port up on the switch, after which DHCP worked fine (cookbook was running so server responded):

{F37667728}
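(For the record, the switch-side change was along these lines; the exact Junos stanza varies by platform and version, so treat this as a sketch:)

cmooney@asw2-d-eqiad# set interfaces ge-4/0/20 ether-options speed 100m
cmooney@asw2-d-eqiad# commit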

All that points very much to a bad cable, or maybe bad SFP module, but they've been swapped already. The SFP model is the same as others that are working ok on the switch, but obviously could be faulty. It's also not impossible the switch port itself is somehow bad, but I've rarely if ever seen that.

cmooney@asw2-d-eqiad> show chassis pic fpc-slot 4 pic-slot 0 | match "^  20"    
  20   GIGE 1000T        n/a   FiberStore         SFP-GB-GE-T       n/a                       0.0           REV 02   SFF-8472 ver n/a

It could potentially also be firmware. I'm not sure we've seen issues like this with the embedded 1G BCM5720 NICs before to be honest. I do note restbase1030 is using firmware 22.31.6, whereas say restbase1029 is on 21.40.21 (but I've not done any kind of extensive audit here).

At this point I'd maybe try to confirm with dc-ops that the firmware is ok, and if it is, potentially move to another switch port and see if that performs any better.

Ok so I downgraded the NIC firmware to 21.80.9 but the pattern is the exact same.

I'd possibly try another SFP just in case, and another cable. If it was an amplifier I'd spray contact cleaner on the RJ45 pins on the server too; perhaps they are somehow dirty?

I've left the system booted into a live Debian environment from virtual CD (reachable from iDRAC web GUI only). That means the NIC should be UP 100% of the time, which might make troubleshooting easier. We can reboot out of that into the normal PXEboot cycle if we wish.

I'd still probably start with swapping the cable/SFP again, and seeing if the switch-side link will come up. It may also be worth connecting the server to a laptop or something else with an RJ45 1G port to see if it'll come up plugged into that. And lastly, move it to another switch port and see if that helps. Very strange.
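(From the live environment, the host-side view of the link can be checked, and 100Mb forced as a test, with something along these lines:)

$ ip link show eno1                                        # carrier / UP state
$ sudo ethtool eno1                                        # negotiated speed, duplex, "Link detected"
$ sudo ethtool -s eno1 speed 100 duplex full autoneg off   # force 100Mb full-duplex, as a test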

Replaced optic and cable again @cmooney @Eevans

Thanks @Jclark-ctr. Unfortunately it didn't work. :(

@cmooney, could we try forcing the port to 100mbit? I realize that shouldn't be necessary, but it would be an interesting data point, and if it worked we could reimage the server. We could set it back afterward and return the machine to service if there were no other signs of trouble.


From IRC:

9:00 AM <urandom> topranks: Did you see https://phabricator.wikimedia.org/T344259#9152935 by chance?  Any thoughts?  
9:00 AM <urandom> it seems like we're reaching the desperation phase of this ticket, and I'm not sure where else to go :)
...
9:01 AM <topranks> I think we can be certain that it won't work at 1G when booted to debian, given it doesn't from live debian environment 
9:01 AM <topranks> 1G uses all 4 pairs in the RJ45, 100Mb only 2
9:01 AM <topranks> So usually this scenario means bad cable, but it could be the port on server or SFP on switch 
9:01 AM <topranks> did we try connecting a laptop to the server?  did it work at 1G?
9:02 AM <urandom> I don't think we did, no.
9:02 AM <topranks> worth a shot - also moving to a new switch port 
9:02 AM <urandom> I'll update the ticket.

@Jclark-ctr could we try connecting something else (a notebook, for example) and see if that works at 1G? If so, that might indicate a bad port on the server. If not, perhaps we try a different port on the switch?


On second thought @Jclark-ctr, let's hold off on this for the time being; we are going to try a reimage with the port forced to 100mbit, and see if we can't troubleshoot from the OS.


This (actually) worked. Forcing the switch port to 100mbit was enough to successfully PXE boot and reimage the server. Once complete, auto-negotiation was re-enabled, and everything seems to be working "normally".

*slowly disappears*


Not sure if you can hear me from behind this hedge but... just for the sake of clarification. On the switch side auto-neg was never disabled, we just hard-set the port to 100Mb, which means it still auto-negotiated, but only advertised 100Mb as a supported speed. I'm not sure if this distinction matters that much (without auto-neg the host would end up at half-duplex which would be another complication), but I figured I may as well mention it.


Ah, right; makes sense. Thanks!