RobH (Rob Halsell)
Senior Data Center Engineer · Administrator

User Details

User Since
Nov 24 2014, 1:43 PM (594 w, 2 d)
Roles
Administrator
Availability
Available
IRC Nick
RobH
LDAP User
RobH
MediaWiki User
RobH

My GPG Key fingerprint = CB1F C7E7 0FF8 5DB2 6820 9C7E 75ED 14C7 0245 D22A

I am a Senior Data Center Engineer on Wikimedia's Data Center SRE Team.

Please note that private messages via Phabricator are not my preferred means of contact. Please feel free to contact me (robh) directly via IRC (freenode), or email my @wikimedia.org email address.

Recent Activity

Yesterday

RobH added a parent task for T423314: Q4:rack/setup/install wdqs103[6-8]: Unknown Object (Task).
Tue, Apr 14, 4:24 PM · Wikidata Platform Team, Data-Platform-SRE (2026-03-27 - 2026-04-17), ops-eqiad, SRE, DC-Ops
RobH assigned T423314: Q4:rack/setup/install wdqs103[6-8] to bking.

Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add the new servers to preseed.yml for partition info.

Tue, Apr 14, 4:24 PM · Wikidata Platform Team, Data-Platform-SRE (2026-03-27 - 2026-04-17), ops-eqiad, SRE, DC-Ops
RobH created T423314: Q4:rack/setup/install wdqs103[6-8].
Tue, Apr 14, 4:23 PM · Wikidata Platform Team, Data-Platform-SRE (2026-03-27 - 2026-04-17), ops-eqiad, SRE, DC-Ops
RobH assigned T423312: Q4:rack/setup/install wdqs20[28-31] to bking.

Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add the new servers to preseed.yml for partition info.

Tue, Apr 14, 4:22 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Wikidata Platform Team, SRE, ops-codfw, DC-Ops
RobH added a parent task for T423312: Q4:rack/setup/install wdqs20[28-31]: Unknown Object (Task).
Tue, Apr 14, 4:21 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Wikidata Platform Team, SRE, ops-codfw, DC-Ops
RobH added a project to T423312: Q4:rack/setup/install wdqs20[28-31]: Data-Platform-SRE (2026-03-27 - 2026-04-17).
Tue, Apr 14, 4:21 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Wikidata Platform Team, SRE, ops-codfw, DC-Ops
RobH created T423312: Q4:rack/setup/install wdqs20[28-31].
Tue, Apr 14, 4:21 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Wikidata Platform Team, SRE, ops-codfw, DC-Ops

Mon, Apr 13

RobH added a comment to T408704: offline rackspace wikitech-static, online aws wikitech-static.

They make it very difficult to cancel:

  • called into 800-480-8365, they do nothing but open a ticket on our behalf
  • the account retention team called to try to convince me to stay
    • "I am just an employee and don't make decisions, I am canceling this because I was told to do so."
    • Not quite true but it is the short answer ;D
  • 1-3 business days for the cancellation and billing team to call me back, so I should get a call from them later this week.
Mon, Apr 13, 3:07 PM · Infrastructure-Foundations

Thu, Apr 9

RobH added a comment to T414411: cp5022 is unreachable.

Dell has confirmed the case update and will dispatch a new mainboard and CPU bracket. Once they do, they'll email/update with tracking and then dispatch will reach back out to schedule the third Unisys site visit.

Thu, Apr 9, 6:00 PM · SRE, DC-Ops, ops-eqsin, Traffic

Wed, Apr 8

RobH added a comment to T414411: cp5022 is unreachable.

I'll put a more detailed timeline and update tomorrow but as it stands now:

Wed, Apr 8, 4:37 AM · SRE, DC-Ops, ops-eqsin, Traffic

Tue, Apr 7

RobH added a comment to T414411: cp5022 is unreachable.

Mainboard swap will occur on Wednesday, April 8th @ 10:00 Singapore time, which is Tuesday, April 7th 18:00 Pacific.

Tue, Apr 7, 2:31 PM · SRE, DC-Ops, ops-eqsin, Traffic

Tue, Mar 31

RobH added a comment to T420623: netbox report error for puppetdb serial versus netbox serial for backup1012.

Please note 'S480845X3505676' is NOT a valid serial under Supermicro support, but S480845X4915849 is.

Tue, Mar 31, 3:00 PM · collaboration-services, SRE, ops-eqiad, DC-Ops

Mon, Mar 30

RobH added a comment to T420623: netbox report error for puppetdb serial versus netbox serial for backup1012.

This host was purchased 2024-08-07, so it is still under warranty. If Papaul doesn't know how to use the SUM (I've never used it) then the support ticket is the way to go.

Mon, Mar 30, 3:32 PM · collaboration-services, SRE, ops-eqiad, DC-Ops

Fri, Mar 27

RobH reassigned T419884: NVMe versus standard SSD performance info from RobH to gmodena.

I've gotten back the following links to the whitepapers for our currently used SSDs and NVMe offerings:

Fri, Mar 27, 2:51 PM · Wikidata, DC-Ops, SRE
RobH updated the task description for T419884: NVMe versus standard SSD performance info.
Fri, Mar 27, 2:49 PM · Wikidata, DC-Ops, SRE

Thu, Mar 26

RobH added a comment to T419298: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Port with no description on access switch.

Who is best to address and fix this bug?

Thu, Mar 26, 3:23 PM · ops-magru

Tue, Mar 24

RobH updated subscribers of T419298: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Port with no description on access switch.

description: Rule: Port with no description on access switch Faults: #1: ge-0/0/47 - ge-0/0/47

Tue, Mar 24, 3:26 PM · ops-magru
RobH closed T418978: cr2-magru <-> asw1-b3-magru link down March 2026 as Resolved.
Tue, Mar 24, 3:23 PM · ops-magru, netops, Infrastructure-Foundations, SRE
RobH closed T403275: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit as Resolved.
Tue, Mar 24, 3:23 PM · ops-magru
RobH closed T403273: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit as Resolved.
Tue, Mar 24, 3:23 PM · ops-magru
RobH closed T415743: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) as Resolved.
Tue, Mar 24, 3:23 PM · ops-magru
RobH added a comment to T415743: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}).

Fixed on Friday; synced up in the meeting today and no more errors. Cathal is closing the ticket on the Lumen portal.

Tue, Mar 24, 3:22 PM · ops-magru
RobH added a comment to T408704: offline rackspace wikitech-static, online aws wikitech-static.

Account cancellation cannot be accomplished in the portal, so I'll have to call in later today.

Tue, Mar 24, 2:50 PM · Infrastructure-Foundations

Thu, Mar 19

RobH renamed T420623: netbox report error for puppetdb serial versus netbox serial for backup1012 from netbox report error for puppetdb serial versus netbox serial to netbox report error for puppetdb serial versus netbox serial for backup1012.
Thu, Mar 19, 6:10 PM · collaboration-services, SRE, ops-eqiad, DC-Ops
RobH created T420623: netbox report error for puppetdb serial versus netbox serial for backup1012.
Thu, Mar 19, 6:04 PM · collaboration-services, SRE, ops-eqiad, DC-Ops
RobH added a comment to T415743: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}).

Summary:

  • EdgeUno says they see no errors, only our flap
  • Arzhel replied back stating that we are still seeing errors, stressed that we've already swapped optics and fibers on our end and re-requested they do the same as I did in the original request.
  • They replied back about 45 minutes or so ago stating 'We will check our side as requested and let you know.'
Thu, Mar 19, 3:08 PM · ops-magru
RobH added a comment to T415743: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}).

Please note the ticket was opened but their portal doesn't seem to email me, Arzhel, or Cathal even though I listed all three of us on the ticket. The only way to see ticket updates is to log in to the actual ticket view: https://edgeuno.cloud/tickets.php/view/484278000266995093

Thu, Mar 19, 3:02 PM · ops-magru

Wed, Mar 18

RobH claimed T415743: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}).
Wed, Mar 18, 9:41 PM · ops-magru
RobH added a comment to T415743: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}).

The optic was swapped, but the errors resumed.

Wed, Mar 18, 9:41 PM · ops-magru
RobH added a comment to T415743: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}).

Errors returned, Arzhel redrained the link, update sent to ticket:

Wed, Mar 18, 7:45 PM · ops-magru
RobH added a comment to T415743: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}).

Comment generated in Smart Hands: Good afternoon,

We carried out the replacement of the fiber optic patch cable. A 10‑meter patch cable available in Rack B03 was used.
Attached is the evidence of the activity performed.

Inventory of materials available in Rack B03:

07 units of fiber optic patch cables – 2 meters
02 units of MPO patch cables – 1 meter
02 units QFX‑SFP‑10GE‑LR
01 unit JNP‑SFP‑25G‑LR
01 unit QFX‑SFP‑1GE‑T

Wed, Mar 18, 7:34 PM · ops-magru
RobH added a comment to T415743: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}).

Support,

The link came back up after your cleaning and re-seating the optic and patch cable, but the errors have resumed after the circuit came back online.

Next step, we would like to have you swap patch cable 70152 with a spare patch cable in our rack (should be on top of the servers) for this link at your earliest possible convenience. Please let us know the cable ID of this new patch; if it doesn't have one, please apply ID 260301.

The link is currently depooled (will show a link light but it is not serving traffic). When you replace the fiber patch cable, it should resume its link light.

Please also check in our racks and report back an inventory of how many spare fiber optic patch cables and lengths we have, along with the spare optics. These should be in our racks on top of the servers.

This work can take place at any time. Thank you in advance!

Wed, Mar 18, 5:07 PM · ops-magru
RobH added a comment to T413409: Inbound errors on interface cr1-magru:xe-0/1/1 (Transport: cr2-eqiad:xe-1/0/1:3 (Telxius, CRT-008508) {#70089}).

This also looks like it's no longer throwing errors, but I've done nothing:

Wed, Mar 18, 4:46 PM · ops-magru
RobH added a comment to T413409: Inbound errors on interface cr1-magru:xe-0/1/1 (Transport: cr2-eqiad:xe-1/0/1:3 (Telxius, CRT-008508) {#70089}).

Remote Hands Directions:
I can write up the directions for them to pull the patch and clean it, and also reseat the optic in the port. However, all the patch IDs so far in magru have been incorrect when photos are matched to netbox, so we'll likely need to tell them the port and optic serial, and suggest the patch ID and note it may differ. They'll take photos at our request to match the patch cable, so someone needs to be around and answering the tickets in case they require another confirmation.

Wed, Mar 18, 4:11 PM · ops-magru
RobH added a comment to T415743: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}).

728a9fed-0f12-45d5-80e1-c31c44a1295a.jpg (1×900 px, 72 KB)

Wed, Mar 18, 3:32 PM · ops-magru
RobH added a comment to T415743: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}).

Remote hands cleaned the patch cable and reseated the optic, and provided photos to show the work.

Wed, Mar 18, 3:32 PM · ops-magru

Tue, Mar 17

RobH updated the task description for T419884: NVMe versus standard SSD performance info.
Tue, Mar 17, 9:48 PM · Wikidata, DC-Ops, SRE
RobH added a subtask for T414411: cp5022 is unreachable: Unknown Object (Task).
Tue, Mar 17, 2:56 PM · SRE, DC-Ops, ops-eqsin, Traffic
RobH removed a parent task for T414411: cp5022 is unreachable: Unknown Object (Task).
Tue, Mar 17, 2:56 PM · SRE, DC-Ops, ops-eqsin, Traffic
RobH added a parent task for T414411: cp5022 is unreachable: Unknown Object (Task).
Tue, Mar 17, 2:56 PM · SRE, DC-Ops, ops-eqsin, Traffic
RobH closed Unknown Object (Task), a subtask of T414411: cp5022 is unreachable, as Resolved.
Tue, Mar 17, 2:50 PM · SRE, DC-Ops, ops-eqsin, Traffic
RobH added a comment to T414411: cp5022 is unreachable.

The distro swap did not fix this host; it will require a mainboard swap via a procurement task (linked in).

Tue, Mar 17, 2:49 PM · SRE, DC-Ops, ops-eqsin, Traffic

Mon, Mar 16

RobH added a comment to T419884: NVMe versus standard SSD performance info.

Sent a gentle followup to the Dell team today.

Mon, Mar 16, 9:41 PM · Wikidata, DC-Ops, SRE

RobH added a comment to T415743: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}).

Confirmed with @cmooney via IRC that 70152 is indeed xe-0/1/0 in these photos and updated the remote hands for Wednesday.

Mar 16 2026, 5:44 PM · ops-magru
RobH added a comment to T415743: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}).

IMG_20260316_105509576.jpg (3×4 px, 2 MB)

Mar 16 2026, 5:27 PM · ops-magru
RobH added a comment to T415743: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}).

I want to ensure I'm reading the photos correctly, but the update from remote hands is the fiber ID 70091 wasn't found, and it appears to me that the fiber ID for the patch in xe-0/1/0 is 70152.

Mar 16 2026, 5:24 PM · ops-magru
RobH moved T420229: ganeti3005 didn't come up after reboot from Backlog to Hardware Failure / Repair on the ops-esams board.
Mar 16 2026, 4:43 PM · DC-Ops, ops-esams, SRE
RobH assigned T420229: ganeti3005 didn't come up after reboot to MoritzMuehlenhoff.

Ok, BIOS update done and it's booting to the Debian loader, so handing back to @MoritzMuehlenhoff

Mar 16 2026, 4:43 PM · DC-Ops, ops-esams, SRE
RobH added a comment to T420229: ganeti3005 didn't come up after reboot.

Updated the idrac, then the backplane firmware, and as it was rebooting to update the BIOS firmware the SEL updated with:

Mar 16 2026, 4:34 PM · DC-Ops, ops-esams, SRE
RobH added a comment to T415743: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}).

They had an issue where they couldn't locate the fiber listed and instead skipped the work entirely! I need to review the photos and find out what the patch is actually labeled and confirm they need to remove it.

Mar 16 2026, 2:51 PM · ops-magru

Mar 13 2026

RobH added a comment to T414411: cp5022 is unreachable.

Tech is onsite and performing the hw power distro board swap on cp5022

Mar 13 2026, 2:51 AM · SRE, DC-Ops, ops-eqsin, Traffic
RobH added a comment to T414411: cp5022 is unreachable.

Tech is running late, their dispatcher called me to let me know. They were set to be onsite at 7AM, but it will now be closer to 10:30AM / 19:30 Pacific

Mar 13 2026, 1:43 AM · SRE, DC-Ops, ops-eqsin, Traffic

Mar 12 2026

RobH added a comment to T408704: offline rackspace wikitech-static, online aws wikitech-static.

They'll bill us less though, so that's already an improvement, thank you! I've gone ahead and updated the checklist so it is a bit more detailed on what has happened and next steps.

Mar 12 2026, 9:45 PM · Infrastructure-Foundations
RobH updated the task description for T408704: offline rackspace wikitech-static, online aws wikitech-static.
Mar 12 2026, 9:44 PM · Infrastructure-Foundations
RobH updated the task description for T408704: offline rackspace wikitech-static, online aws wikitech-static.
Mar 12 2026, 9:44 PM · Infrastructure-Foundations
RobH assigned T416395: Q3:rack/setup/install cloudcephosd1054 to Andrew.

Please note there are (3) racking tasks for the (3) orders of cloudcephosd hosts in eqiad that have just been placed. Assumptions have been made about hostname, simply picking the next in sequence. Please review each of these racking tasks and update the racking info for network and rack placement in addition to the boilerplate below:

Mar 12 2026, 6:47 PM · ops-eqiad, SRE, DC-Ops
RobH assigned T416394: Q3:rack/setup/install cloudcephosd1053 to Andrew.

Please note there are (3) racking tasks for the (3) orders of cloudcephosd hosts in eqiad that have just been placed. Assumptions have been made about hostname, simply picking the next in sequence. Please review each of these racking tasks and update the racking info for network and rack placement in addition to the boilerplate below:

Mar 12 2026, 6:47 PM · ops-eqiad, SRE, DC-Ops
RobH assigned T419892: Q3:rack/setup/install cloudcephosd105[56] to Andrew.

Please note there are (3) racking tasks for the (3) orders of cloudcephosd hosts in eqiad that have just been placed. Assumptions have been made about hostname, simply picking the next in sequence. Please review each of these racking tasks and update the racking info for network and rack placement in addition to the boilerplate below:

Mar 12 2026, 6:47 PM · SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops
RobH added a parent task for T419892: Q3:rack/setup/install cloudcephosd105[56]: Unknown Object (Task).
Mar 12 2026, 6:44 PM · SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops
RobH renamed T419892: Q3:rack/setup/install cloudcephosd105[56] from Q#:rack/setup/install X to Q3:rack/setup/install cloudcephosd105[56].
Mar 12 2026, 6:44 PM · SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops
RobH created T419892: Q3:rack/setup/install cloudcephosd105[56].
Mar 12 2026, 6:44 PM · SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops
RobH added a parent task for T416395: Q3:rack/setup/install cloudcephosd1054: Unknown Object (Task).
Mar 12 2026, 6:41 PM · ops-eqiad, SRE, DC-Ops
RobH added a parent task for T416394: Q3:rack/setup/install cloudcephosd1053: Unknown Object (Task).
Mar 12 2026, 6:40 PM · ops-eqiad, SRE, DC-Ops
RobH added a comment to T408704: offline rackspace wikitech-static, online aws wikitech-static.

With the migration of status.wikimedia.org, are we good to kill this? It has tripled in cost over the last year and is overdue for extension in Coupa.

Mar 12 2026, 5:05 PM · Infrastructure-Foundations
RobH added a comment to T419884: NVMe versus standard SSD performance info.

After spending about 30 minutes on the Dell site I'm not locating the usual whitepapers the Dell Team sent us back when we selected the SSDs years and years ago, so I've asked for them directly:

Mar 12 2026, 4:22 PM · Wikidata, DC-Ops, SRE
RobH updated the task description for T419884: NVMe versus standard SSD performance info.
Mar 12 2026, 4:20 PM · Wikidata, DC-Ops, SRE
RobH created T419884: NVMe versus standard SSD performance info.
Mar 12 2026, 4:11 PM · Wikidata, DC-Ops, SRE
RobH added a comment to T414411: cp5022 is unreachable.

Please note the maint window for this offline host is 2026-03-13 @ 07:00 AM Singapore, which is 5PM Thursday evening for me. I'll be online to remotely supervise the swap and attempt to log in to the idrac when done.

Mar 12 2026, 3:36 PM · SRE, DC-Ops, ops-eqsin, Traffic
RobH closed T418411: Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS as Resolved.

Hi Rob,

This is to confirm that we received the files, and they have been shared with the relevant team.

Thank you.

Regards,

ROSCHELLE SHIELLA LOTO
Customer Care Associate - Special Projects

Mar 12 2026, 1:21 AM · SRE, ops-esams, ops-magru, DC-Ops

Mar 11 2026

RobH updated the task description for T419611: hw troubleshooting: Comm Error: Backplane 0 for cp7012.
Mar 11 2026, 10:48 PM · Traffic, ops-magru, DC-Ops
RobH updated the task description for T419611: hw troubleshooting: Comm Error: Backplane 0 for cp7012.
Mar 11 2026, 10:48 PM · Traffic, ops-magru, DC-Ops
RobH reassigned T419611: hw troubleshooting: Comm Error: Backplane 0 for cp7012 from RobH to BCornwall.

After firmware updates and resetting the SEL and rebooting the issue now seems to have cleared up.

Mar 11 2026, 10:47 PM · Traffic, ops-magru, DC-Ops
RobH added a comment to T419611: hw troubleshooting: Comm Error: Backplane 0 for cp7012.

I had some ISP issues with upload speeds to magru, so Papaul helped me out and flashed the firmware for idrac, bios, and backplane. The error persists, so I'm now doing a TSR report collection and download before resetting the logs and seeing if it returns.

Mar 11 2026, 9:41 PM · Traffic, ops-magru, DC-Ops
RobH added a comment to T415743: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}).

Ok, that is annoying, these auto-created tasks cannot have things appended into the task description or phaultfinder removes it...

Mar 11 2026, 9:05 PM · ops-magru
RobH added a comment to T418411: Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS.

To:CustomerCare@digitalrealty.com
Wikimedia metrics for Energy Efficiency Directive (EED) - AMS17 and MRS02
Customer Care,

We (Wikimedia) received a notice that we were required to fill out these templates for our customer accounts/deployments in AMS17 and MRS02. I've done so according to directions and am now submitting to CustomerCare@digitalrealty.com.

Please advise if any further action is required at this time.

Thank you in advance,

Mar 11 2026, 9:04 PM · SRE, ops-esams, ops-magru, DC-Ops
RobH added a comment to T418411: Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS.

Thanks everyone for the feedback, I'll fill out the templates and submit them over!

Mar 11 2026, 8:18 PM · SRE, ops-esams, ops-magru, DC-Ops
RobH updated the task description for T419611: hw troubleshooting: Comm Error: Backplane 0 for cp7012.
Mar 11 2026, 5:27 PM · Traffic, ops-magru, DC-Ops
RobH added a comment to T419611: hw troubleshooting: Comm Error: Backplane 0 for cp7012.

Dell will require all the firmware and such be the latest versions before they call it a failure, so I'll steal this and update the firmware on this host and pull the logs and see if it clears it up or not.

Mar 11 2026, 5:18 PM · Traffic, ops-magru, DC-Ops
RobH claimed T419611: hw troubleshooting: Comm Error: Backplane 0 for cp7012.
Mar 11 2026, 5:17 PM · Traffic, ops-magru, DC-Ops
RobH moved T419611: hw troubleshooting: Comm Error: Backplane 0 for cp7012 from Backlog to Hardware Failure / Repair on the ops-magru board.
Mar 11 2026, 5:17 PM · Traffic, ops-magru, DC-Ops
RobH moved T413409: Inbound errors on interface cr1-magru:xe-0/1/1 (Transport: cr2-eqiad:xe-1/0/1:3 (Telxius, CRT-008508) {#70089}) from Backlog to Hardware Failure / Repair on the ops-magru board.

Rob, could you investigate those as well? Same as T415743: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}). Please sync up with us to drain the link ahead of time.

Mar 11 2026, 5:01 PM · ops-magru
RobH reassigned T415743: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) from RobH to ayounsi.

Not sure who wants to take point on this, but since I chatted briefly with Arzhel in IRC I'll default to him and y'all can reassign as needed!

Mar 11 2026, 5:00 PM · ops-magru
RobH moved T415743: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) from Backlog to Hardware Failure / Repair on the ops-magru board.
Mar 11 2026, 4:57 PM · ops-magru
RobH triaged T415743: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) as High priority.
Mar 11 2026, 4:57 PM · ops-magru
RobH added a comment to T415743: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}).

Remote hands ticket CS1254900 filed:

Mar 11 2026, 4:55 PM · ops-magru
RobH updated the task description for T415743: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}).
Mar 11 2026, 4:48 PM · ops-magru
RobH added a comment to T415743: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}).

Apologies this was neglected. Since we likely need to give 24 hours notice for smart hands to avoid expedite fees, I suggest we schedule this for Monday, 2026-03-16.

Mar 11 2026, 4:46 PM · ops-magru
RobH updated the task description for T418411: Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS.
Mar 11 2026, 4:39 PM · SRE, ops-esams, ops-magru, DC-Ops
RobH added a comment to T418411: Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS.

I added rough network numbers.

Mar 11 2026, 4:38 PM · SRE, ops-esams, ops-magru, DC-Ops

Mar 10 2026

RobH added a comment to T418978: cr2-magru <-> asw1-b3-magru link down March 2026.

Sent an email to investigate the return/repair of the two optics

Mar 10 2026, 8:06 PM · ops-magru, netops, Infrastructure-Foundations, SRE
RobH added a comment to T418978: cr2-magru <-> asw1-b3-magru link down March 2026.

They swapped the optic GT3AAG00314 out of the switch for optic GT3AAG00316 and now the link shows up:

Mar 10 2026, 7:35 PM · ops-magru, netops, Infrastructure-Foundations, SRE
RobH added a comment to T414411: cp5022 is unreachable.

The order is placed and I'm currently scheduling the Unisys/Dell engineer to go onsite sometime between Friday-Wednesday of this/next week. Host is hard down, so no traffic intervention required.

Mar 10 2026, 6:21 PM · SRE, DC-Ops, ops-eqsin, Traffic
RobH added a comment to T418978: cr2-magru <-> asw1-b3-magru link down March 2026.

Support,

You have swapped the optic on the router side, and the MPO patch cable. The link is still down, so we'd like you to swap the optic on the switch side. The switch is located in B3:U37::asw1-b3-magru port et-0/0/50. Please remove the optic serial GT3AAG00314 in asw1-b3-magru port et-0/0/50 and swap it with another 100G optic spare from our rack.

Please place the optic serial GT3AAG00314 in an envelope marked T119524-switch and set aside in our racks until we determine what caused the link failure.

Mar 10 2026, 4:47 PM · ops-magru, netops, Infrastructure-Foundations, SRE

Mar 9 2026

RobH added a comment to T418978: cr2-magru <-> asw1-b3-magru link down March 2026.

They've now replaced the patch cable but we're still seeing the link down:

Mar 9 2026, 10:28 PM · ops-magru, netops, Infrastructure-Foundations, SRE
RobH added a comment to T418978: cr2-magru <-> asw1-b3-magru link down March 2026.

Support,

Thank you, we can see the old module QSFP-100GBASE-SR4 SN GT3AAG00321 was removed and replaced with QSFP-100GBASE-SR4 module GT3AAG00315. However, the link is still showing down for us.

Acknowledged that all optics are actually QSFP-100GBASE-SR4, my initial listing of 40G was incorrect.

Thank you for the photos, as they show the new optic did not resolve the issue and the link is still offline (red) not online (green).
Please re-seat both sides of the patch cable and check for link light. If the link light doesn't go green, please source a new fiber patch from our rack spares, note its label (and report it back to us) and then replace patch cable ID 70130 with a new patch.

Mar 9 2026, 7:35 PM · ops-magru, netops, Infrastructure-Foundations, SRE
RobH added a comment to T418978: cr2-magru <-> asw1-b3-magru link down March 2026.

Ok, they swapped the optic in cr2-magru but the link still shows down:

Mar 9 2026, 7:27 PM · ops-magru, netops, Infrastructure-Foundations, SRE
RobH moved T418978: cr2-magru <-> asw1-b3-magru link down March 2026 from Backlog to Hardware Failure / Repair on the ops-magru board.

CS1253254 filed, listed myself, Arzhel, Cathal, and Papaul on the CC list.

Mar 9 2026, 4:00 PM · ops-magru, netops, Infrastructure-Foundations, SRE
RobH added a comment to T418978: cr2-magru <-> asw1-b3-magru link down March 2026.

I'll work on this now.

Mar 9 2026, 3:48 PM · ops-magru, netops, Infrastructure-Foundations, SRE

Mar 6 2026

RobH added a comment to T414411: cp5022 is unreachable.

Set to failed.

Mar 6 2026, 5:33 PM · SRE, DC-Ops, ops-eqsin, Traffic

Mar 4 2026

RobH updated the task description for T418012: eqiad row A/B switch upgrade.
Mar 4 2026, 9:48 PM · Infrastructure-Foundations, netops, DC-Ops, SRE, ops-eqiad