
Jhancock.wm (Jenn Hancock)
User

User Details

User Since
Dec 5 2022, 4:37 PM (76 w, 6 d)
Availability
Available
LDAP User
Jhancock.wm
MediaWiki User
Jhancock.wm [ Global Accounts ]

Recent Activity

Thu, May 23

Jhancock.wm closed T365712: Relabel codfw Kubernetes hosts as Resolved.

completed

Thu, May 23, 2:43 PM · SRE, serviceops, DC-Ops, ops-codfw
Jhancock.wm moved T365712: Relabel codfw Kubernetes hosts from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board.
Thu, May 23, 1:51 PM · SRE, serviceops, DC-Ops, ops-codfw

Wed, May 22

Jhancock.wm closed T365291: ml-serve2002 memory errors on DIMM_B1 as Resolved.

no new errors.

Wed, May 22, 1:21 PM · SRE, Machine-Learning-Team, ops-codfw, DC-Ops
Jhancock.wm closed T365543: ManagementSSHDown as Resolved.

this was me yesterday. was fixing a cabling issue and must have unseated it by accident. reseated and pinging.

Wed, May 22, 1:16 PM · SRE, DC-Ops, ops-codfw

Tue, May 21

Jhancock.wm added a comment to T365291: ml-serve2002 memory errors on DIMM_B1.

not seeing any alerts right now. I'll keep an eye on it. if it stays up until tomorrow I'll close the ticket. thanks!

Tue, May 21, 2:44 PM · SRE, Machine-Learning-Team, ops-codfw, DC-Ops
Jhancock.wm closed T365213: Degraded RAID on es2022 as Resolved.

It's been replaced. Alert has cleared. Let us know if you need any further help!

Tue, May 21, 2:43 PM · DC-Ops, DBA, SRE, ops-codfw
Jhancock.wm closed T365423: Duplicate IP on mgmt network as Resolved.
Tue, May 21, 1:50 PM · SRE, DC-Ops, ops-codfw
Jhancock.wm added a comment to T365213: Degraded RAID on es2022.

It does. Do you need to do anything before I replace the drive?

Tue, May 21, 1:23 PM · DC-Ops, DBA, SRE, ops-codfw
Jhancock.wm added a comment to T365204: Problem re-imaging hosts on row-wide vlan on EVPN switches.

@cmooney I put the server in the wrong vlan. can you fix it for me? private1-a8 to private-a-codfw. thanks!

Tue, May 21, 12:21 AM · DC-Ops, ops-codfw, netops, Infrastructure-Foundations, SRE

Mon, May 20

Jhancock.wm added a comment to T365204: Problem re-imaging hosts on row-wide vlan on EVPN switches.

@Papaul still getting an error on provisioning of the new server.

Mon, May 20, 11:17 PM · DC-Ops, ops-codfw, netops, Infrastructure-Foundations, SRE
Jhancock.wm closed T365379: Duplicate IP on mgmt network as Resolved.

manually reset the idrac ip of the offending server. alert cleared.

Mon, May 20, 11:11 PM · SRE, DC-Ops, ops-codfw
Jhancock.wm added a comment to T365291: ml-serve2002 memory errors on DIMM_B1.

I rotated B1 to B2 to see if the error moves with it. After booting, not getting any errors. Can we repool it to see if the error comes back? If it does, I can find a replacement. Server is out of warranty.

Mon, May 20, 2:10 PM · SRE, Machine-Learning-Team, ops-codfw, DC-Ops

Fri, May 17

Jhancock.wm reassigned T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 from Jhancock.wm to Papaul.

@Papaul I'm still having trouble with the same spot as noted before. Can you take a look at it?

Fri, May 17, 6:40 PM · SRE, ops-codfw, serviceops, DC-Ops
Jhancock.wm added a comment to T365213: Degraded RAID on es2022.

@ABran-WMF I tried to check the warranty status on this server on Dell's site but that function is not working at the moment. they are having technical difficulties. I do not have any 2TB drives on hand but I do have some 4TB ones that are from a decommissioned server. Is it possible to provision the drive down to the size you need?

Fri, May 17, 2:04 PM · DC-Ops, DBA, SRE, ops-codfw
Jhancock.wm added a comment to T365217: Degraded RAID on backup2010.

I found this in the idrac log.

Fri, May 17, 1:38 PM · Data-Persistence-Backup, Data-Persistence, DC-Ops, SRE, ops-codfw
Jhancock.wm moved T365213: Degraded RAID on es2022 from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board.
Fri, May 17, 1:26 PM · DC-Ops, DBA, SRE, ops-codfw
Jhancock.wm moved T365217: Degraded RAID on backup2010 from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board.
Fri, May 17, 1:26 PM · Data-Persistence-Backup, Data-Persistence, DC-Ops, SRE, ops-codfw

Wed, May 15

Dzahn awarded T364863: InterfaceSpeedError - mw2286 a Like token.
Wed, May 15, 3:23 PM · serviceops, SRE, ops-codfw
Jhancock.wm added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

2009 is on timeout until I can take another crack at it. stuck on this even though the rest passed.

Wed, May 15, 2:50 PM · SRE, ops-codfw, serviceops, DC-Ops
Jhancock.wm updated the task description for T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.
Wed, May 15, 2:49 PM · SRE, ops-codfw, serviceops, DC-Ops
Jhancock.wm updated the task description for T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.
Wed, May 15, 2:28 PM · SRE, ops-codfw, serviceops, DC-Ops
Jhancock.wm claimed T364863: InterfaceSpeedError - mw2286.
Wed, May 15, 2:27 PM · serviceops, SRE, ops-codfw
Jhancock.wm added a comment to T364863: InterfaceSpeedError - mw2286.

I reseated the sfp and the cable. Looks fixed and remained steady for an hour. Should be good to add back.

Wed, May 15, 2:27 PM · serviceops, SRE, ops-codfw
Jhancock.wm closed T364948: ManagementSSHDown as Resolved.

rebooted switch. back up. alerts clearing.

Wed, May 15, 1:52 PM · SRE, ops-codfw

Tue, May 14

Jhancock.wm added a comment to T364863: InterfaceSpeedError - mw2286.

the cable or the 1G SFP might need to be replaced. can we downtime the server for a small window to test the cabling?

Tue, May 14, 2:50 PM · serviceops, SRE, ops-codfw
Jhancock.wm closed T364810: ManagementSSHDown as Resolved.

rebooted. all in C6 up now.

Tue, May 14, 2:46 PM · SRE, ops-codfw
Jhancock.wm moved T364863: InterfaceSpeedError - mw2286 from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board.
Tue, May 14, 2:45 PM · serviceops, SRE, ops-codfw
Jhancock.wm closed T364809: ManagementSSHDown as Resolved.

reseated. pings on mgmt.

Tue, May 14, 1:50 PM · SRE, ops-codfw

Mon, May 13

Jhancock.wm updated subscribers of T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

@Papaul, This was the last screen I got. The servers all have the OS installed and it failed at the certificate stage. I think it's cause I used puppet 7 instead of 5. when I attempt to retry with 5, it fails.

Mon, May 13, 3:13 PM · SRE, ops-codfw, serviceops, DC-Ops
Jhancock.wm updated Other Assignee for T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010, added: Papaul.
Mon, May 13, 1:36 PM · SRE, ops-codfw, serviceops, DC-Ops
Jhancock.wm closed T364633: connected console ports attached to unracked device as Resolved.

Initiated: 2024-05-13 13:35 Duration: 0 minutes, 1.60 seconds Completed

Mon, May 13, 1:36 PM · SRE, ops-codfw
Jhancock.wm moved T364559: Create (or teach Andrew how to create) private connections+dns entries for new cloudcontrols from Backlog to Codfw Switch migration on the ops-codfw board.
Mon, May 13, 1:33 PM · SRE, netops, ops-codfw, Infrastructure-Foundations, cloud-services-team

Thu, May 9

Jhancock.wm updated the task description for T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.
Thu, May 9, 10:33 PM · SRE, ops-codfw, serviceops, DC-Ops

Wed, May 8

Jhancock.wm updated the task description for T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.
Wed, May 8, 2:56 PM · SRE, ops-codfw, serviceops, DC-Ops
Jhancock.wm closed T364439: ManagementSSHDown as Resolved.

uplink for msw2 was degraded and flapping. repaired. staying up now.

Wed, May 8, 2:14 PM · SRE, ops-codfw
Jhancock.wm closed T364464: Comms to msw-d2-codfw down as Resolved.

port 47 on the msw was going up and down on its own. replaced the rj-45 terminator. remained steady.

Wed, May 8, 2:11 PM · netops, SRE, Infrastructure-Foundations, ops-codfw

Tue, May 7

Jhancock.wm closed T364358: Inbound interface errors as Resolved.
Tue, May 7, 2:17 PM · SRE, ops-codfw

Mon, May 6

Jhancock.wm added a comment to T362938: Degraded RAID on mw2382.

Forgot I left it there. All yours now!

Mon, May 6, 3:18 PM · serviceops, SRE, ops-codfw
Jhancock.wm claimed T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.
Mon, May 6, 2:51 PM · SRE, ops-codfw, serviceops, DC-Ops

Thu, May 2

Jhancock.wm added a comment to T362938: Degraded RAID on mw2382.

@JMeybohm papaul helped me identify the missing disk. I replaced it with a compatible drive. please let me know if that fixed the issue. Thanks.

Thu, May 2, 4:27 PM · serviceops, SRE, ops-codfw
Jhancock.wm closed T363926: PowerSupplyFailure as Resolved.

reseated psu2 and cable. alert cleared on machine.

Thu, May 2, 1:19 PM · SRE, ops-codfw

Wed, May 1

Jhancock.wm closed T363838: Degraded RAID on mw2382 as Declined.

see T362938

Wed, May 1, 2:53 PM · SRE, ops-codfw
Jhancock.wm closed T363847: Degraded RAID on mw2382 as Declined.

see T362938

Wed, May 1, 2:53 PM · SRE, ops-codfw
Jhancock.wm moved T363838: Degraded RAID on mw2382 from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board.
Wed, May 1, 2:51 PM · SRE, ops-codfw
Jhancock.wm moved T363847: Degraded RAID on mw2382 from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board.
Wed, May 1, 2:51 PM · SRE, ops-codfw
Jhancock.wm closed T363756: PowerSupplyFailure as Resolved.

removed the error by rebooting the idrac

Wed, May 1, 2:49 PM · SRE, ops-codfw
Jhancock.wm claimed T363756: PowerSupplyFailure.

fixed the main source of the alert (PSU and power cable reseated) but still getting the following error.

Wed, May 1, 2:41 PM · SRE, ops-codfw
Jhancock.wm moved T363756: PowerSupplyFailure from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board.
Wed, May 1, 2:24 PM · SRE, ops-codfw

Tue, Apr 30

Jhancock.wm added a comment to T362938: Degraded RAID on mw2382.

idrac upgraded to 7.0.0. won't go any higher. Bios is already at 2.9.3. Reset the factory defaults and tried rebooting the idrac. reseated the backplane. None of these have fixed the issue. Going to look into getting a replacement part. Might need to be salvaged from decommissioned servers. Will update when we have a solution

Tue, Apr 30, 4:36 PM · serviceops, SRE, ops-codfw
Jhancock.wm closed T363783: Inbound interface errors as Resolved.

known issue with no impact

Tue, Apr 30, 3:10 PM · SRE, ops-codfw
Jhancock.wm added a comment to T362938: Degraded RAID on mw2382.

draining didn't fix it. I'm gonna update the firmware and bios and then see where it is.

Tue, Apr 30, 2:07 PM · serviceops, SRE, ops-codfw

Mon, Apr 29

Jhancock.wm closed T362801: decommission db2103.codfw.wmnet as Resolved.
Mon, Apr 29, 6:36 PM · SRE, ops-codfw, decommission-hardware
Jhancock.wm closed T362799: decommission db2106.codfw.wmnet as Resolved.
Mon, Apr 29, 6:35 PM · SRE, ops-codfw, decommission-hardware
Jhancock.wm closed T362800: decommission db2105.codfw.wmnet as Resolved.
Mon, Apr 29, 6:35 PM · SRE, ops-codfw, decommission-hardware
Jhancock.wm closed T362798: decommission db2107.codfw.wmnet as Resolved.
Mon, Apr 29, 6:35 PM · SRE, ops-codfw, decommission-hardware
Jhancock.wm closed T362797: decommission db2108.codfw.wmnet as Resolved.
Mon, Apr 29, 6:34 PM · SRE, ops-codfw, decommission-hardware
Jhancock.wm closed T362796: decommission db2109.codfw.wmnet as Resolved.
Mon, Apr 29, 6:34 PM · SRE, ops-codfw, decommission-hardware
Jhancock.wm closed T362795: decommission db2110.codfw.wmnet as Resolved.
Mon, Apr 29, 6:33 PM · SRE, ops-codfw, decommission-hardware
Jhancock.wm closed T362794: decommission db2111.codfw.wmnet as Resolved.
Mon, Apr 29, 6:32 PM · SRE, ops-codfw, decommission-hardware
Jhancock.wm closed T362793: decommission db2112.codfw.wmnet as Resolved.
Mon, Apr 29, 6:32 PM · SRE, ops-codfw, decommission-hardware
Jhancock.wm closed T362792: decommission db2113.codfw.wmnet as Resolved.
Mon, Apr 29, 6:32 PM · SRE, ops-codfw, decommission-hardware
Jhancock.wm closed T362790: decommission db2119.codfw.wmnet as Resolved.
Mon, Apr 29, 6:31 PM · SRE, ops-codfw, decommission-hardware
Jhancock.wm closed T362787: decommission db2120.codfw.wmnet as Resolved.
Mon, Apr 29, 6:30 PM · SRE, ops-codfw, decommission-hardware
Jhancock.wm added a comment to T362938: Degraded RAID on mw2382.

Apologies for the wait on this one. I checked out the server and the drives look to be working physically. But when I logged into the idrac it sees zero disks. Checked the warranty and it expired in February. I do have a pair of decommed 960GB drives that could replace it. However, I cannot tell which drive needs to be replaced. Please let me know if this still needs attention and how I can help.

Mon, Apr 29, 5:12 PM · serviceops, SRE, ops-codfw

Apr 23 2024

Jhancock.wm moved T362938: Degraded RAID on mw2382 from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board.
Apr 23 2024, 12:37 PM · serviceops, SRE, ops-codfw
Jhancock.wm closed T363120: Inbound interface errors as Resolved.

known issue with no impact

Apr 23 2024, 12:37 PM · SRE, ops-codfw

Apr 22 2024

Jhancock.wm updated the task description for T362729: Q4:rack/setup/install cp70[01-16].
Apr 22 2024, 5:23 PM · Traffic, ops-magru, DC-Ops
Jhancock.wm updated the task description for T362730: Q4:rack/setup/install magru misc servers.
Apr 22 2024, 5:22 PM · Traffic, netops, ops-magru, DC-Ops, Infrastructure-Foundations
Jhancock.wm updated the task description for T362730: Q4:rack/setup/install magru misc servers.
Apr 22 2024, 4:07 PM · Traffic, netops, ops-magru, DC-Ops, Infrastructure-Foundations
Jhancock.wm updated the task description for T362729: Q4:rack/setup/install cp70[01-16].
Apr 22 2024, 4:06 PM · Traffic, ops-magru, DC-Ops
Jhancock.wm closed Unknown Object (Task), a subtask of T346722: Sao Paulo, Brazil, South America POP tracking task, as Resolved.
Apr 22 2024, 4:01 PM · ops-magru, Patch-For-Review

Apr 18 2024

Jhancock.wm added a comment to T361525: Degraded RAID on elastic2088.

All tests passed on the diagnostic test, including the pci bus. It's pinging on the idrac and the network ips.
@RKemper give it another go. @ me if you run into an issue again.

Apr 18 2024, 6:26 PM · ops-codfw, Data-Platform-SRE (2024.04.15 - 2024.05.05)
Jhancock.wm added a comment to T361525: Degraded RAID on elastic2088.

Tried to run a diagnostic from the Lifecycle controller. Halted because of a DIMM error on B4. It's been replaced. re-running the diagnostic to check for any more issues.

Apr 18 2024, 4:01 PM · ops-codfw, Data-Platform-SRE (2024.04.15 - 2024.05.05)

Apr 17 2024

Jhancock.wm moved T362787: decommission db2120.codfw.wmnet from Backlog to Decommission on the ops-codfw board.
Apr 17 2024, 3:58 PM · SRE, ops-codfw, decommission-hardware
Jhancock.wm added a comment to T361525: Degraded RAID on elastic2088.

@RKemper I am going to check it out and get back in touch with dell. These are the same errors we were getting before the card was replaced.

Apr 17 2024, 1:14 PM · ops-codfw, Data-Platform-SRE (2024.04.15 - 2024.05.05)
Jhancock.wm added a project to T361525: Degraded RAID on elastic2088: ops-codfw.
Apr 17 2024, 1:12 PM · ops-codfw, Data-Platform-SRE (2024.04.15 - 2024.05.05)

Apr 16 2024

Jhancock.wm added a comment to T358542: Netbox errors caused by system board replacement .

I updated the sheet with the needed information but spaced submitting that to this task. Please let me know if there's anything else I can do to help out with the tasks. Thanks!

Apr 16 2024, 4:53 PM · SRE, ops-codfw
Jhancock.wm closed T361229: titan200[12] RAM/SSD upgrade coordination as Resolved.
Apr 16 2024, 4:47 PM · SRE Observability (FY2023/2024-Q4), SRE, observability, ops-codfw
Jhancock.wm added a comment to T362438: decommission cloudbackup200[12].codfw.wmnet.

ty!

Apr 16 2024, 2:30 PM · SRE, ops-codfw, cloud-services-team, decommission-hardware
Jhancock.wm closed T362438: decommission cloudbackup200[12].codfw.wmnet as Resolved.
Apr 16 2024, 2:29 PM · SRE, ops-codfw, cloud-services-team, decommission-hardware
Jhancock.wm updated subscribers of T362438: decommission cloudbackup200[12].codfw.wmnet.

@Papaul @Andrew
what are we doing with cloudbackup2001-array1 and cloudbackup2002-array1?

Apr 16 2024, 1:51 PM · SRE, ops-codfw, cloud-services-team, decommission-hardware
Jhancock.wm closed T362465: ManagementSSHDown as Resolved.

alert cleared. being decommed in T362438

Apr 16 2024, 1:28 PM · SRE, ops-codfw
Jhancock.wm moved T361305: decommission elastic20[37-54].codfw.wmnet from Decommission to Blocked on the ops-codfw board.
Apr 16 2024, 1:25 PM · SRE, ops-codfw, decommission-hardware
Jhancock.wm moved T346661: cloud: prepare codfw for expansion (racks, switches, ceph) from Racking Tasks to Blocked on the ops-codfw board.
Apr 16 2024, 1:25 PM · User-dcaro, SRE, cloud-services-team (Hardware), ops-codfw, User-aborrero
Jhancock.wm moved T356216: Q#:rack/setup/install (2) cloudbackup hosts from Racking Tasks to Blocked on the ops-codfw board.
Apr 16 2024, 1:25 PM · SRE, ops-codfw, cloud-services-team (Hardware), DC-Ops
Jhancock.wm moved T361229: titan200[12] RAM/SSD upgrade coordination from Racking Tasks to Blocked on the ops-codfw board.
Apr 16 2024, 1:24 PM · SRE Observability (FY2023/2024-Q4), SRE, observability, ops-codfw
Jhancock.wm closed T362311: Decommission db2101 (was: db2101 crashed) as Resolved.
Apr 16 2024, 1:24 PM · SRE, ops-codfw, decommission-hardware, DC-Ops, database-backups, Data-Persistence-Backup, DBA
Jhancock.wm closed T362311: Decommission db2101 (was: db2101 crashed), a subtask of T358741: Decommission db2096-db2120, as Resolved.
Apr 16 2024, 1:22 PM · DBA
Jhancock.wm closed T362596: Inbound interface errors as Resolved.

known issue, no impact

Apr 16 2024, 1:36 AM · SRE, ops-codfw
Jhancock.wm moved T362311: Decommission db2101 (was: db2101 crashed) from Backlog to Decommission on the ops-codfw board.
Apr 16 2024, 1:35 AM · SRE, ops-codfw, decommission-hardware, DC-Ops, database-backups, Data-Persistence-Backup, DBA
Jhancock.wm moved T362596: Inbound interface errors from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board.
Apr 16 2024, 1:35 AM · SRE, ops-codfw

Apr 15 2024

Jhancock.wm renamed T354896: Q3:rack/setup/install cloudcontrol2009-dev.codfw.wmnet from Q3:rack/setup/install cloudcontrol2006-dev.codfw.wmnet to Q3:rack/setup/install cloudcontrol2009-dev.codfw.wmnet.
Apr 15 2024, 4:51 PM · SRE, ops-codfw, cloud-services-team (Hardware), DC-Ops
Jhancock.wm claimed T354896: Q3:rack/setup/install cloudcontrol2009-dev.codfw.wmnet.

@cmooney what is the vlan for this server?

Apr 15 2024, 4:41 PM · SRE, ops-codfw, cloud-services-team (Hardware), DC-Ops
Jhancock.wm updated the task description for T354896: Q3:rack/setup/install cloudcontrol2009-dev.codfw.wmnet.
Apr 15 2024, 4:35 PM · SRE, ops-codfw, cloud-services-team (Hardware), DC-Ops
Jhancock.wm closed T362550: PowerSupplyFailure as Resolved.

reseated blue cable

Apr 15 2024, 4:31 PM · SRE, ops-codfw
Jhancock.wm moved T362550: PowerSupplyFailure from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board.
Apr 15 2024, 4:29 PM · SRE, ops-codfw
Jhancock.wm moved T362465: ManagementSSHDown from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board.
Apr 15 2024, 1:52 PM · SRE, ops-codfw

Apr 12 2024

Jhancock.wm updated the task description for T354896: Q3:rack/setup/install cloudcontrol2009-dev.codfw.wmnet.
Apr 12 2024, 5:17 PM · SRE, ops-codfw, cloud-services-team (Hardware), DC-Ops
Jhancock.wm added a comment to T361525: Degraded RAID on elastic2088.

@bking I got the HBA card replaced and it booted without any issues that I can find in the iDRAC. Can you check CLI to see if the raid is still degraded?

Apr 12 2024, 4:52 PM · ops-codfw, Data-Platform-SRE (2024.04.15 - 2024.05.05)

Apr 11 2024

Jhancock.wm added a comment to T361525: Degraded RAID on elastic2088.

Update: Dell finally agreed to replace the HBA card. I sent the shipping address confirmation just now. Hopefully it'll be here tomorrow. Latest Monday morning.

Apr 11 2024, 1:40 PM · ops-codfw, Data-Platform-SRE (2024.04.15 - 2024.05.05)