RobH (Rob Halsell)
Senior Data Center Engineer, Administrator

User Details

User Since: Nov 24 2014, 1:43 PM (576 w, 2 d)
Roles: Administrator
Availability: Available
IRC Nick: RobH
LDAP User: RobH
MediaWiki User: RobH [ Global Accounts ]

My GPG Key fingerprint = CB1F C7E7 0FF8 5DB2 6820 9C7E 75ED 14C7 0245 D22A

I am a Senior Data Center Engineer on Wikimedia's Data Center SRE Team.

Please note that private messages via Phabricator are not my preferred means of contact. Please feel free to contact me (robh) directly via IRC/freenode, or email my @wikimedia.org address.

Recent Activity

Today

RobH added a parent task for T412230: Q2:rack/setup/install mwlog1003: Unknown Object (Task).
Wed, Dec 10, 3:20 PM · SRE, observability, ops-eqiad, DC-Ops
RobH moved T412230: Q2:rack/setup/install mwlog1003 from Backlog to Racking Tasks on the ops-eqiad board.
Wed, Dec 10, 3:19 PM · SRE, observability, ops-eqiad, DC-Ops
RobH updated the task description for T412229: Q2:rack/setup/install mwlog2003.
Wed, Dec 10, 3:19 PM · SRE, observability, ops-codfw, DC-Ops
RobH created T412230: Q2:rack/setup/install mwlog1003.
Wed, Dec 10, 3:19 PM · SRE, observability, ops-eqiad, DC-Ops
RobH moved T412229: Q2:rack/setup/install mwlog2003 from Backlog to Racking Tasks on the ops-codfw board.
Wed, Dec 10, 3:18 PM · SRE, observability, ops-codfw, DC-Ops
RobH added a parent task for T412229: Q2:rack/setup/install mwlog2003: Unknown Object (Task).
Wed, Dec 10, 3:18 PM · SRE, observability, ops-codfw, DC-Ops
RobH created T412229: Q2:rack/setup/install mwlog2003.
Wed, Dec 10, 3:17 PM · SRE, observability, ops-codfw, DC-Ops

Thu, Dec 4

RobH updated the task description for T404609: eqiad: rows C/D Upgrade Tracking.
Thu, Dec 4, 5:47 PM · SRE, Infrastructure-Foundations, netops, DC-Ops, ops-eqiad
RobH added a comment to T404609: eqiad: rows C/D Upgrade Tracking.

Day 13 Update:

Thu, Dec 4, 5:31 PM · SRE, Infrastructure-Foundations, netops, DC-Ops, ops-eqiad
RobH added a parent task for T411781: lvs1018: remove cross-rack links to rows A, C and D: T404609: eqiad: rows C/D Upgrade Tracking.
Thu, Dec 4, 5:29 PM · Patch-For-Review, DC-Ops, ops-eqiad, Infrastructure-Foundations, netops, SRE
RobH added a subtask for T404609: eqiad: rows C/D Upgrade Tracking: T411781: lvs1018: remove cross-rack links to rows A, C and D.
Thu, Dec 4, 5:29 PM · SRE, Infrastructure-Foundations, netops, DC-Ops, ops-eqiad

Wed, Dec 3

RobH closed T411678: Updating RobH ssh pubkey file to add fido backing as Resolved.
Wed, Dec 3, 8:56 PM · SRE, SRE-Access-Requests
RobH created T411678: Updating RobH ssh pubkey file to add fido backing.
Wed, Dec 3, 8:50 PM · SRE, SRE-Access-Requests
RobH added a comment to T404609: eqiad: rows C/D Upgrade Tracking.

Day 12 Update (in progress, will edit as day progresses):

Wed, Dec 3, 3:21 PM · SRE, Infrastructure-Foundations, netops, DC-Ops, ops-eqiad
RobH closed T405946: eqiad row C/D Observability host migrations, a subtask of T404609: eqiad: rows C/D Upgrade Tracking, as Resolved.
Wed, Dec 3, 3:06 PM · SRE, Infrastructure-Foundations, netops, DC-Ops, ops-eqiad
RobH closed T405946: eqiad row C/D Observability host migrations as Resolved.

This migration was completed just now with no issues. Thanks to both @Jclark-ctr and @herron for the on-site part and the icinga handling!

Wed, Dec 3, 3:06 PM · observability, SRE, DC-Ops, ops-eqiad

Wed, Nov 26

RobH updated the task description for T404609: eqiad: rows C/D Upgrade Tracking.
Wed, Nov 26, 5:36 PM · SRE, Infrastructure-Foundations, netops, DC-Ops, ops-eqiad
RobH added a comment to T404609: eqiad: rows C/D Upgrade Tracking.

Day 11 Update:

  • 8 hosts moved, 5 remain out of 308 total hosts.
  • John did all the moves today working with Andrew.
  • Migrated 6 of the 8 WMCS hosts found and added to T411025, only clouddumps1002 and cloudelastic1010 remain from WMCS.
  • lvs10[19,20] hosts pending migration scheduling with Cathal and Brett.
  • alert1002 scheduled for migration on 2025-12-03 @ 13:00 GMT
Wed, Nov 26, 5:35 PM · SRE, Infrastructure-Foundations, netops, DC-Ops, ops-eqiad

Tue, Nov 25

RobH added a comment to T404609: eqiad: rows C/D Upgrade Tracking.

New host count:

Tue, Nov 25, 8:46 PM · SRE, Infrastructure-Foundations, netops, DC-Ops, ops-eqiad
RobH assigned T411025: eqiad row C/D cloud hosts pending migration to Andrew.

I prefer we not wait for the entire refresh of pending Q2 hosts but instead migrate all these hosts during the first or second week of December.

Tue, Nov 25, 8:43 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), cloud-services-team, ops-eqiad, DC-Ops, SRE
RobH renamed T411025: eqiad row C/D cloud hosts pending migration from eqiad row C/D visual audit remaining host migrations to eqiad row C/D cloud hosts pending migration.
Tue, Nov 25, 8:39 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), cloud-services-team, ops-eqiad, DC-Ops, SRE
RobH added a parent task for T411025: eqiad row C/D cloud hosts pending migration: T404609: eqiad: rows C/D Upgrade Tracking.
Tue, Nov 25, 8:38 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), cloud-services-team, ops-eqiad, DC-Ops, SRE
RobH added a subtask for T404609: eqiad: rows C/D Upgrade Tracking: T411025: eqiad row C/D cloud hosts pending migration.
Tue, Nov 25, 8:38 PM · SRE, Infrastructure-Foundations, netops, DC-Ops, ops-eqiad
RobH added a comment to T410743: Degraded RAID on ganeti1039.

So the disk hasn't failed out of md0, just md1 and md2. I'd attempt to rebuild manually and if that doesn't work then RMA the drive since it shows no errors in smartctl.

Tue, Nov 25, 5:56 PM · SRE, ops-eqiad, DC-Ops
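
The "failed out of md1 and md2 but not md0" observation above is the kind of state that /proc/mdstat reports per array. A minimal sketch of how degraded arrays could be picked out of that file, assuming the usual Linux software RAID status format (`[2/1] [U_]`). The sample text, device names, and block counts are made up for illustration and are not output from ganeti1039.

```python
import re

# Hypothetical /proc/mdstat contents, invented for illustration only.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md2 : active raid1 sda3[0]
      900000000 blocks super 1.2 [2/1] [U_]

md1 : active raid1 sda2[0]
      50000000 blocks super 1.2 [2/1] [U_]

md0 : active raid1 sda1[0] sdb1[1]
      1000000 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return names of md arrays whose member map contains '_' (a missing member)."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            # Remember which array the following status line belongs to.
            current = header.group(1)
            continue
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current and "_" in status.group(3):
            degraded.append(current)
    return degraded


print(degraded_arrays(SAMPLE_MDSTAT))  # ['md2', 'md1']
```

On a real host the same function could be fed the contents of /proc/mdstat instead of the sample string.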
RobH added a comment to T410743: Degraded RAID on ganeti1039.
robh@ganeti1039:~$ sudo smartctl -a -T permissive /dev/sdb
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-40-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
Tue, Nov 25, 5:55 PM · SRE, ops-eqiad, DC-Ops
RobH closed T351352: Update Wikitech Common Data center Specifications as Declined.
Tue, Nov 25, 4:38 PM · DC-Ops
RobH added a comment to T405950: eqiad row C/D Service Ops host migrations.

Both kafka-main100[89] moved, last one to move is wikikube-ctrl1003

Tue, Nov 25, 4:29 PM · serviceops, SRE, DC-Ops, ops-eqiad

Mon, Nov 24

RobH added a comment to T407897: Q2:rack/setup/install x1 host.

Thanks Rob, I think the confusion was whether we ordered the right HW or not. Doing 1G is fine for this host, 10G would be ideal, but we are not expecting all hosts to be doing 10G anyway (as far as I know, we are (or were) very limited in 10G switches).
So whatever works best for DCOps, we can live with 1G or 10G.

Mon, Nov 24, 9:41 PM · ops-eqiad, SRE, Data-Persistence, DC-Ops
RobH updated subscribers of T405950: eqiad row C/D Service Ops host migrations.

I've chatted with @brouberol via IRC:

Mon, Nov 24, 7:55 PM · serviceops, SRE, DC-Ops, ops-eqiad
RobH added a comment to T404609: eqiad: rows C/D Upgrade Tracking.

Day 9 Update:

  • 9 hosts moved, 10 remain - 300 hosts total at start of migration
  • John worked with Ben directly to migrate the (8) Data Platform hosts this AM.
  • The last (4) Data Platform hosts are scheduled for migration tomorrow.
  • John and I worked with Scott to get conf1009 migrated today.
  • (2) lvs hosts will be moved after Cathal coordinates with Brett for patch submission this week
  • alert1002 will migrate on 2025-12-03 @ 13:00 GMT T405946
  • (3) ServiceOps hosts remain: wikikube-ctrl1003, kafka-main100[89], feedback pending on sub-task T405950#11402388.
Mon, Nov 24, 6:58 PM · SRE, Infrastructure-Foundations, netops, DC-Ops, ops-eqiad
RobH updated subscribers of T405950: eqiad row C/D Service Ops host migrations.

IRC Echo Update (chatting with Scott in irc about this just echoing to task for history):

Mon, Nov 24, 6:52 PM · serviceops, SRE, DC-Ops, ops-eqiad
RobH added a comment to T405950: eqiad row C/D Service Ops host migrations.

conf1009 migrated,

@brouberol: Please provide feedback on migration of wikikube-ctrl1003 and kafka-main1008 as these are the last serviceops hosts to migrate:

@brouberol, you were tagged into this task by T405950#11236474 but I don't have any feedback on the migration details for kafka-main1009 other than: "Coordinate with ServiceOps, some badly written clients may need a kick"

We would like to migrate this on either Thursday, Nov 20 or next week Nov 24-26, earlier in the week preferred. Would you detail what needs to be done for this to happen?

If we need to run any depool commands or if you can depool it from use for one of those days, please let us know!

Mon, Nov 24, 6:49 PM · serviceops, SRE, DC-Ops, ops-eqiad
RobH added a comment to T405950: eqiad row C/D Service Ops host migrations.

conf1009 migrated,

Mon, Nov 24, 6:28 PM · serviceops, SRE, DC-Ops, ops-eqiad

Fri, Nov 21

RobH added a comment to T405950: eqiad row C/D Service Ops host migrations.

Can we also set a date/time for moving the other three hosts remaining?

Fri, Nov 21, 5:54 PM · serviceops, SRE, DC-Ops, ops-eqiad
RobH added a comment to T405943: eqiad row C/D Data Platform host migrations.

IRC Update:

Fri, Nov 21, 5:29 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, SRE, DC-Ops, ops-eqiad
RobH added a comment to T404609: eqiad: rows C/D Upgrade Tracking.

Day 8 Update:

Fri, Nov 21, 5:02 PM · SRE, Infrastructure-Foundations, netops, DC-Ops, ops-eqiad
RobH added a comment to T405943: eqiad row C/D Data Platform host migrations.

Sent an email to @BTullis to ensure he is aware of these 12 hosts pending his feedback, subject line: Need IF feedback for 12 remaining hosts since November 12th

Fri, Nov 21, 4:52 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, SRE, DC-Ops, ops-eqiad

Thu, Nov 20

RobH added a comment to T407897: Q2:rack/setup/install x1 host.

Clarification Questions and statements:

Thu, Nov 20, 11:41 PM · ops-eqiad, SRE, Data-Persistence, DC-Ops
RobH added a comment to T405950: eqiad row C/D Service Ops host migrations.

@RobH - Confirming conf1009 for 2025-11-24, but the SRE staff meeting runs from 17:00 - 18:00 UTC. I'd suggest starting no later than 16:30 (and pausing by 17:00) or starting at / after 18:00.

Thu, Nov 20, 7:22 PM · serviceops, SRE, DC-Ops, ops-eqiad
RobH closed T405945: eqiad row C/D Infrastructure Foundations host migrations, a subtask of T404609: eqiad: rows C/D Upgrade Tracking, as Resolved.
Thu, Nov 20, 6:17 PM · SRE, Infrastructure-Foundations, netops, DC-Ops, ops-eqiad
RobH closed T405945: eqiad row C/D Infrastructure Foundations host migrations as Resolved.

All Infrastructure-Foundations hosts in eqiad c/d rows migrated to the new switch stacks.

Thu, Nov 20, 6:17 PM · Infrastructure-Foundations, SRE, DC-Ops, ops-eqiad
RobH added a comment to T404609: eqiad: rows C/D Upgrade Tracking.

Day 8 Update:

Thu, Nov 20, 6:13 PM · SRE, Infrastructure-Foundations, netops, DC-Ops, ops-eqiad
RobH added a comment to T405950: eqiad row C/D Service Ops host migrations.

Please note all wikikube workers have been migrated and we're now down to only 4 hosts left with serviceops to migrate:

Thu, Nov 20, 6:09 PM · serviceops, SRE, DC-Ops, ops-eqiad
RobH added a comment to T405942: eqiad row C/D Data Persistence host migrations.

@Ladsgroup had other things going on and wasn't able to do this today but did link me to the directions on how to depool: https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting

Thu, Nov 20, 6:05 PM · media-backups, DBA, Data-Persistence, SRE, DC-Ops, ops-eqiad
RobH added a comment to T405946: eqiad row C/D Observability host migrations.

I've set a gcal event for 2025-12-03 @ 10AM EST / 15:00 GMT for the alert1002 migration.

Thu, Nov 20, 4:29 PM · observability, SRE, DC-Ops, ops-eqiad
RobH added a comment to T405950: eqiad row C/D Service Ops host migrations.

I totally messed up the dates on your comment: 2025-11-20, 2025-11-24, 2025-11-25, 2025-12-03, 2025-12-04

Thu, Nov 20, 12:38 AM · serviceops, SRE, DC-Ops, ops-eqiad

Wed, Nov 19

RobH closed T405948: eqiad row C/D Search Platform host migrations, a subtask of T404609: eqiad: rows C/D Upgrade Tracking, as Resolved.
Wed, Nov 19, 11:08 PM · SRE, Infrastructure-Foundations, netops, DC-Ops, ops-eqiad
RobH closed T405948: eqiad row C/D Search Platform host migrations as Resolved.

Please note all hosts listed on this task have been migrated.

Wed, Nov 19, 11:08 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, SRE, DC-Ops, ops-eqiad
RobH added a comment to T405945: eqiad row C/D Infrastructure Foundations host migrations.

Please note we didn't get to these two today, will do tomorrow!

Wed, Nov 19, 11:06 PM · Infrastructure-Foundations, SRE, DC-Ops, ops-eqiad
RobH added a comment to T405950: eqiad row C/D Service Ops host migrations.

conf1009 is (1) a member of eqiad main-etcd cluster, so clients will attempt to issue writes to it, (2) the upstream source for etcd-mirror replication to codfw, and (3) not a PyBal config host. Also, it sounds like the connectivity disruption should be very brief, similar to previous migrations (i.e., switch-side of the cable just moves from adjacent old device to new).

Given that, the simplest reasonable sequence would be:

  • Silence EtcdReplicationDown
  • Migrate to the new switch
  • On "done" from DC-Ops, in parallel:
    • Test conf1009 (e.g., check cluster membership on peers, probe conf1009 with a quorum-read)
    • Check the health of etcd-mirror on conf2005 and restart if needed
  • Delete silence
  • Restart eqiad-associated confds and navtiming, verify Liberica control-plane daemons are healthy

Rationale:

  • This is only a single cluster member (i.e., the cluster will maintain quorum, though there may be an election), so there's no strong justification to temporarily point eqiad-associated clients at codfw, particularly given the amount of effort and disruption (e.g., read-only periods) involved.
  • We could temporarily point etcd-mirror at a different member, but I don't think that's worth the effort either. The timed-watch strategy it uses tends to be remarkably robust to "temporarily pull the network cable" (more so than to a proper upstream connection close). Also, the "check and restart if needed" approach worked well when the recent round of nginx updates reached conf1009.
  • There is some risk that if replication is disrupted and a large influx of writes pushes us out of the 1000 event log window, then recovery is a bit involved. We could rule this out by making etcd read-only during the migration. However, that turns the "oops, your client hit a transient error attempting to communicate with conf1009" into "full write unavailability," which again seems rather disruptive for a low-likelihood issue.

In any case, I'll give some thought to timing and follow up in the sheet. It's only the last point that I'm still on the fence about (i.e., whether to bracket the migration with read-only), but (1) that's technically easy to do if we choose to do it and (2) should not have much of an impact on scheduling (it mostly just requires additional upfront comms).

Wed, Nov 19, 11:00 PM · serviceops, SRE, DC-Ops, ops-eqiad
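
The risk described in the last rationale bullet above comes down to simple arithmetic: if more events are written during the disruption than the event log retains, the mirror can no longer resume from where it stopped. A rough sketch of that check, taking only the 1000-event window from the comment. The index names and the inclusive boundary are assumptions for illustration, not etcd-mirror's actual logic.

```python
EVENT_WINDOW = 1000  # event log window size quoted in the comment above


def can_resume(last_replicated_index, current_index, window=EVENT_WINDOW):
    """True if the events missed during the disruption still fit in the window.

    Both index parameters are hypothetical names; the <= boundary is an assumption.
    """
    missed = current_index - last_replicated_index
    return missed <= window


# A brief disruption with few writes stays inside the window.
print(can_resume(last_replicated_index=5000, current_index=5400))  # True
# A large influx of writes pushes replication outside the window.
print(can_resume(last_replicated_index=5000, current_index=6500))  # False
```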
RobH triaged T405946: eqiad row C/D Observability host migrations as High priority.

We've migrated 9 of the 10 observability hosts. We're now only left with alert1002, which the notes detail will require scheduling. In addition to picking a scheduled date, can you also provide any steps and details I'll need on the date of the migration?

Wed, Nov 19, 10:57 PM · observability, SRE, DC-Ops, ops-eqiad
RobH added a comment to T405943: eqiad row C/D Data Platform host migrations.

We're now down to 44 hosts overall to migrate, and 12 of those belong to your team.

Wed, Nov 19, 10:49 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, SRE, DC-Ops, ops-eqiad
RobH added a comment to T405942: eqiad row C/D Data Persistence host migrations.

Migration Update:
Only 3 Data-Persistence hosts remain for migration: pc101[678].

Wed, Nov 19, 10:32 PM · media-backups, DBA, Data-Persistence, SRE, DC-Ops, ops-eqiad
RobH reassigned T405950: eqiad row C/D Service Ops host migrations from RobH to brouberol.

@brouberol, you were tagged into this task by T405950#11236474 but I don't have any feedback on the migration details for kafka-main1009 other than: "Coordinate with ServiceOps, some badly written clients may need a kick"

Wed, Nov 19, 7:31 PM · serviceops, SRE, DC-Ops, ops-eqiad
RobH added a comment to T404609: eqiad: rows C/D Upgrade Tracking.

Day 7 Update:

  • 33 hosts moved today, 44 remain
  • all row c wikikube migrated, some of row D wikikube migrated
    • 23 wikikube hosts remain out of the 44 left to move
  • (2) pc hosts will be moved next week on 24, 25, or 26th. Manuel is out this Thursday-Friday.
  • (2) lvs hosts will be moved whenever @cmooney would like to schedule being around for it as they are a bit touchy. lvs1020 is backup to 1019, both must move so 1020 will move first.
  • other hosts left on list need details submitted (they've been asked on sub tasks this week) or scheduling.
Wed, Nov 19, 7:27 PM · SRE, Infrastructure-Foundations, netops, DC-Ops, ops-eqiad
RobH added a comment to T405950: eqiad row C/D Service Ops host migrations.

Going to depool wikikube in rack eqiad D1 for port migrations.

Wed, Nov 19, 6:39 PM · serviceops, SRE, DC-Ops, ops-eqiad
RobH added a comment to T405950: eqiad row C/D Service Ops host migrations.

sudo cookbook sre.k8s.pool-depool-node -t T405950 -r 'network migration' --k8s-cluster wikikube-eqiad pool wikikube-worker126[0-9].eqiad.wmnet
sudo cookbook sre.k8s.pool-depool-node -t T405950 -r 'network migration' --k8s-cluster wikikube-eqiad pool wikikube-worker10[36,51,52,54,54,55,83].eqiad.wmnet

Wed, Nov 19, 6:28 PM · serviceops, SRE, DC-Ops, ops-eqiad
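
The host expressions in the two commands above use bracket groups with comma lists and ranges. A simplified sketch of how such an expression could be expanded and checked for repeated entries before it is run. This is a hypothetical helper written for illustration, not the parser the cookbook itself uses, and it handles only a single bracket group per expression.

```python
import re


def expand_hosts(expr):
    """Expand one bracket group, e.g. 'name10[36,51-53].domain', into hostnames."""
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if not match:
        return [expr]  # no bracket group: treat as a single hostname
    prefix, body, suffix = match.groups()
    hosts = []
    for item in body.split(","):
        if "-" in item:
            start, end = item.split("-")
            width = len(start)  # keep zero padding of the range start
            for n in range(int(start), int(end) + 1):
                hosts.append(f"{prefix}{n:0{width}d}{suffix}")
        else:
            hosts.append(f"{prefix}{item}{suffix}")
    return hosts


def duplicates(hosts):
    """Return hostnames that appear more than once, in first-repeat order."""
    seen, dupes = set(), []
    for host in hosts:
        if host in seen and host not in dupes:
            dupes.append(host)
        seen.add(host)
    return dupes


first = expand_hosts("wikikube-worker126[0-9].eqiad.wmnet")
second = expand_hosts("wikikube-worker10[36,51,52,54,54,55,83].eqiad.wmnet")
print(len(first))          # 10
print(duplicates(second))  # ['wikikube-worker1054.eqiad.wmnet']
```

Run against the second expression, the check flags the repeated 54 entry. Whether that repeat stands in for a different host cannot be told from the command alone.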
RobH added a comment to T405950: eqiad row C/D Service Ops host migrations.

Depooling wikikube in rack C6:

Wed, Nov 19, 5:35 PM · serviceops, SRE, DC-Ops, ops-eqiad
RobH added a comment to T405942: eqiad row C/D Data Persistence host migrations.

Please ping me before moving pc1014 so I can depool the pc4 cluster from rotation.

Wed, Nov 19, 3:53 PM · media-backups, DBA, Data-Persistence, SRE, DC-Ops, ops-eqiad
RobH added a comment to T405942: eqiad row C/D Data Persistence host migrations.
  • backup1006, backup1007, ms-backup1002 moved yesterday.
  • db1189 was moved yesterday by accident, sorry about that!
  • The only data persistence hosts left to move are:
    • moss-be1002 - no directions provided on moving this, please advise
    • pc1014 - scheduled to move today
    • pc1016 - not yet scheduled for migration
    • pc1017 - not yet scheduled for migration
    • pc1018
Wed, Nov 19, 3:45 PM · media-backups, DBA, Data-Persistence, SRE, DC-Ops, ops-eqiad

Tue, Nov 18

RobH added a comment to T405950: eqiad row C/D Service Ops host migrations.

I think it'll be OK when we move things tomorrow. Since I know exactly the mistake I made, I don't think I'll make it again for a few months minimum ;D

Tue, Nov 18, 10:32 PM · serviceops, SRE, DC-Ops, ops-eqiad
RobH added a comment to T405950: eqiad row C/D Service Ops host migrations.

@RLazarus and I were looking into verifying workloads returning to the migrated workers, and ran into a few surprises.

Going by what's been marked complete in the sheet from today, there appears to be a difference between what was depooled and what was ostensibly migrated.

Specifically, it looks like the following, where - indicates "in the sheet, not depooled" and + indicates "not in the sheet, depooled" (no mark indicates agreement):

+wikikube-worker1016
 wikikube-worker1063
-wikikube-worker1135
-wikikube-worker1136
-wikikube-worker1137
-wikikube-worker1138
-wikikube-worker1139
-wikikube-worker1154
-wikikube-worker1155
-wikikube-worker1156
 wikikube-worker1157
+wikikube-worker1254
+wikikube-worker1255
+wikikube-worker1256
 wikikube-worker1305
 wikikube-worker1306
 wikikube-worker1313

So, there appear to be 4 workers depooled that weren't in-scope for migration, and 8 workers migrated that weren't depooled (of which 3 look like maybe a typo in the depool command - i.e., wikikube-worker[1254-1256] vs wikikube-worker[1154-1156]).

In any case, as far as we can tell, workloads on the 8 workers appear to have weathered the disruption.

@RobH - Were the 8 that weren't depooled, but appear to be checked off, actually deferred to tomorrow? If not (i.e., they were indeed migrated), is there some way we can help prepare the depool commands ahead of time to make sure they're aligned with the planned migrations?

Tue, Nov 18, 9:47 PM · serviceops, SRE, DC-Ops, ops-eqiad
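
The comparison in the comment above is a pair of set differences between "marked complete in the sheet" and "depooled". A small sketch that rebuilds both sets from the listed workers and reproduces the 4 and 8 counts, plus the suspected off-by-100 typo. The variable names are illustrative, and the two sets are reconstructed from the marked list rather than taken from the sheet or the depool logs.

```python
def names(numbers):
    """Build full worker names from their numeric suffixes."""
    return {f"wikikube-worker{n}" for n in numbers}


# Taken from the marked list in the comment above.
agreed = names([1063, 1157, 1305, 1306, 1313])                       # no mark
sheet_only = names([1135, 1136, 1137, 1138, 1139, 1154, 1155, 1156])  # '-' lines
depool_only = names([1016, 1254, 1255, 1256])                         # '+' lines

marked_in_sheet = agreed | sheet_only
depooled = agreed | depool_only

migrated_not_depooled = marked_in_sheet - depooled
depooled_not_in_scope = depooled - marked_in_sheet

print(len(depooled_not_in_scope), len(migrated_not_depooled))  # 4 8

# Suspected typo: 1254-1256 depooled where 1154-1156 was intended.
suspected_intended = names(n - 100 for n in (1254, 1255, 1256))
print(suspected_intended <= migrated_not_depooled)  # True
```

Running the same comparison on the planned host list before issuing depool commands would surface this kind of mismatch ahead of the move.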
RobH added a comment to T405945: eqiad row C/D Infrastructure Foundations host migrations.

Thank you for the update, we'll likely move these two hosts tomorrow!

Tue, Nov 18, 8:44 PM · Infrastructure-Foundations, SRE, DC-Ops, ops-eqiad
RobH updated the task description for T408510: ULSFO: switch refresh.
Tue, Nov 18, 8:04 PM · Traffic, SRE, Infrastructure-Foundations, DC-Ops, netops, ops-ulsfo
RobH updated the task description for T408510: ULSFO: switch refresh.
Tue, Nov 18, 8:02 PM · Traffic, SRE, Infrastructure-Foundations, DC-Ops, netops, ops-ulsfo
RobH updated the task description for T408510: ULSFO: switch refresh.
Tue, Nov 18, 7:59 PM · Traffic, SRE, Infrastructure-Foundations, DC-Ops, netops, ops-ulsfo
RobH changed the status of T410456: ulsfo switch refresh from Invalid to Resolved.

dupe of T408510

Tue, Nov 18, 7:58 PM · SRE, Traffic, netops, Infrastructure-Foundations, ops-ulsfo, DC-Ops
RobH closed T410456: ulsfo switch refresh as Invalid.

dupe of T410456

Tue, Nov 18, 7:57 PM · SRE, Traffic, netops, Infrastructure-Foundations, ops-ulsfo, DC-Ops
RobH moved T408510: ULSFO: switch refresh from Backlog to Racking Tasks on the ops-ulsfo board.
Tue, Nov 18, 7:57 PM · Traffic, SRE, Infrastructure-Foundations, DC-Ops, netops, ops-ulsfo
RobH created T410456: ulsfo switch refresh.
Tue, Nov 18, 7:57 PM · SRE, Traffic, netops, Infrastructure-Foundations, ops-ulsfo, DC-Ops
RobH closed Unknown Object (Task), a subtask of T408510: ULSFO: switch refresh, as Resolved.
Tue, Nov 18, 7:54 PM · Traffic, SRE, Infrastructure-Foundations, DC-Ops, netops, ops-ulsfo
RobH claimed T405950: eqiad row C/D Service Ops host migrations.

IRC Discussion Update:

Tue, Nov 18, 7:48 PM · serviceops, SRE, DC-Ops, ops-eqiad
RobH closed T410434: eno1 on wikikube-worker1016:9100 has the wrong speed: 1.25e+07. as Resolved.

optic swap by john fixed it.

Tue, Nov 18, 7:46 PM · SRE, ops-eqiad, DC-Ops
RobH added a comment to T410434: eno1 on wikikube-worker1016:9100 has the wrong speed: 1.25e+07..

This is indeed detecting slow:

Tue, Nov 18, 7:41 PM · SRE, ops-eqiad, DC-Ops
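
The value 1.25e+07 in the task title is easier to read once converted. A quick sketch, assuming the figure is a link speed in bytes per second (the convention of the node exporter's network speed metric, which the :9100 port suggests). The expected speed for this host is not stated in the feed, so the 1 Gbit/s comparison is only illustrative.

```python
def bytes_per_sec_to_mbit(speed_bytes_per_sec):
    """Convert a link speed in bytes per second to megabits per second."""
    return speed_bytes_per_sec * 8 / 1_000_000


print(bytes_per_sec_to_mbit(1.25e7))  # 100.0, i.e. 100 Mbit/s
print(bytes_per_sec_to_mbit(1.25e8))  # 1000.0, what a 1 Gbit/s link would report
```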
RobH added a comment to T404609: eqiad: rows C/D Upgrade Tracking.

Day 6 Update:

Tue, Nov 18, 7:31 PM · SRE, Infrastructure-Foundations, netops, DC-Ops, ops-eqiad
RobH added a comment to T405950: eqiad row C/D Service Ops host migrations.

Awesome! We're also moving dns1006 at the same time (we'll move it first while k8 hosts drain) and then we'll move onto moving these! I'll ping you in about 70 minutes for the start of the window at 17:00 GMT. Thanks!

Tue, Nov 18, 3:48 PM · serviceops, SRE, DC-Ops, ops-eqiad
RobH closed T405647: eqiad row C/D Machine Learning host migrations, a subtask of T404609: eqiad: rows C/D Upgrade Tracking, as Resolved.
Tue, Nov 18, 3:37 PM · SRE, Infrastructure-Foundations, netops, DC-Ops, ops-eqiad
RobH closed T405647: eqiad row C/D Machine Learning host migrations as Resolved.

All machine learning hosts have been migrated, resolving this task.

Tue, Nov 18, 3:37 PM · Machine-Learning-Team, SRE, DC-Ops, ops-eqiad

Mon, Nov 17

RobH raised the priority of T405945: eqiad row C/D Infrastructure Foundations host migrations from Medium to High.

@LSobanski,

The only two Infrastructure-Foundations hosts left to migrate are

  • aux-k8s-worker100[67]: can be drained at any time, we have a cookbook

However, this doesn't quite make it clear if I should run the cookbook and move whenever works for me, or if your team should run the cookbook and you want to set a date/time to work with us or have us ping you via irc?

If I can run the cookbook myself, can you advise exactly which cookbook and flags I'd run to migrate each of these two hosts?

Mon, Nov 17, 4:23 PM · Infrastructure-Foundations, SRE, DC-Ops, ops-eqiad
RobH added a comment to T405943: eqiad row C/D Data Platform host migrations.

Please note we now only have 12 data platform hosts remaining for migration. I still need clarification for

Mon, Nov 17, 4:22 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, SRE, DC-Ops, ops-eqiad
RobH added a comment to T405942: eqiad row C/D Data Persistence host migrations.

@Marostegui and/or @jcrespo,

Mon, Nov 17, 3:59 PM · media-backups, DBA, Data-Persistence, SRE, DC-Ops, ops-eqiad
RobH added a comment to T405950: eqiad row C/D Service Ops host migrations.

Is it possible that I could send the commands for this, or do we need someone on your team? If we need someone on your team, could we schedule an hour or so for this tomorrow (Tuesday, Nov 18th) with an 18:00 GMT start time? (So 10:00 AM Pacific?)

Mon, Nov 17, 3:53 PM · serviceops, SRE, DC-Ops, ops-eqiad
RobH added a comment to T405623: eqiad row C/D Traffic host migrations.

@BCornwall and @ssingh: We chatted about this last week, can we schedule this work to move dns1006 tomorrow, Tuesday November 18th at 9AM Pacific / 5PM GMT?

Mon, Nov 17, 3:29 PM · Traffic, SRE, DC-Ops, ops-eqiad

Fri, Nov 14

RobH added a comment to T405940: eqiad row C/D Collaboration Services host migrations.

Sorry about that, we've now migrated all of the non-scheduled migration hosts (except k8) so we can schedule this for next week. Would Tuesday, Nov 18th @ 9:15 Pacific / 17:15 GMT work for migrating the two gitlab hosts?

Fri, Nov 14, 8:35 PM · collaboration-services, SRE, DC-Ops, ops-eqiad

Thu, Nov 13

RobH reassigned T405945: eqiad row C/D Infrastructure Foundations host migrations from MoritzMuehlenhoff to LSobanski.

The only two Infrastructure-Foundations hosts left to migrate are

Thu, Nov 13, 8:45 PM · Infrastructure-Foundations, SRE, DC-Ops, ops-eqiad
RobH moved T410073: Netbox Cable report - incorrectly parsing Nokia power supplies from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.
Thu, Nov 13, 6:18 PM · SRE, Infrastructure-Foundations, DC-Ops, netops, ops-eqiad
RobH updated the task description for T410073: Netbox Cable report - incorrectly parsing Nokia power supplies.
Thu, Nov 13, 6:18 PM · SRE, Infrastructure-Foundations, DC-Ops, netops, ops-eqiad
RobH created T410073: Netbox Cable report - incorrectly parsing Nokia power supplies.
Thu, Nov 13, 6:17 PM · SRE, Infrastructure-Foundations, DC-Ops, netops, ops-eqiad
RobH moved T410072: netbox cable report cleanup: unterminated cable ends from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.
Thu, Nov 13, 6:15 PM · SRE, DC-Ops, ops-eqiad
RobH created T410072: netbox cable report cleanup: unterminated cable ends.
Thu, Nov 13, 6:15 PM · SRE, DC-Ops, ops-eqiad
RobH added a comment to T404609: eqiad: rows C/D Upgrade Tracking.

Day 5 Update:

Thu, Nov 13, 5:11 PM · SRE, Infrastructure-Foundations, netops, DC-Ops, ops-eqiad
RobH added a comment to T405945: eqiad row C/D Infrastructure Foundations host migrations.

All ganeti hosts migrated to their new switch ports in eqiad rows c/d

Thu, Nov 13, 5:01 PM · Infrastructure-Foundations, SRE, DC-Ops, ops-eqiad
RobH added a comment to T405945: eqiad row C/D Infrastructure Foundations host migrations.

ganeti1028
ganeti1047
ganeti1048
ganeti1037

Thu, Nov 13, 4:27 PM · Infrastructure-Foundations, SRE, DC-Ops, ops-eqiad

Wed, Nov 12

RobH added a comment to T405647: eqiad row C/D Machine Learning host migrations.

IRC Update from chat with Tobias:

Wed, Nov 12, 8:23 PM · Machine-Learning-Team, SRE, DC-Ops, ops-eqiad
RobH added a comment to T405623: eqiad row C/D Traffic host migrations.

IRC Update:

Wed, Nov 12, 8:14 PM · Traffic, SRE, DC-Ops, ops-eqiad
RobH added a comment to T405623: eqiad row C/D Traffic host migrations.

For dns1006, since the downtime of the host is around 5-12 seconds (missing about 5-12 seq numbers via ping) I'm not sure it even has to be fully depooled as long as we don't move it during an active dns sync.

Wed, Nov 12, 7:04 PM · Traffic, SRE, DC-Ops, ops-eqiad
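
The 5-12 second downtime estimate above is read off missing ping sequence numbers. A minimal sketch of that calculation, assuming one echo request per second (the common ping default). The received sequence list is invented for illustration and is not data from any of these migrations.

```python
def longest_gap(received_seqs):
    """Largest run of consecutive missing sequence numbers between replies."""
    ordered = sorted(set(received_seqs))
    longest = 0
    for prev, cur in zip(ordered, ordered[1:]):
        longest = max(longest, cur - prev - 1)
    return longest


# Hypothetical capture: replies for seq 1-20 and 29-40, with seq 21-28 lost.
received = list(range(1, 21)) + list(range(29, 41))
missed = longest_gap(received)
print(missed)  # 8, roughly 8 seconds of downtime at one ping per second
```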
RobH added a comment to T405623: eqiad row C/D Traffic host migrations.

All cp hosts in rows C/D have been migrated as of today (last ones done) and all that is left in Traffic realm for migration is dns1006 and lvs1020 via T405602.

Wed, Nov 12, 7:03 PM · Traffic, SRE, DC-Ops, ops-eqiad
RobH updated the task description for T404609: eqiad: rows C/D Upgrade Tracking.
Wed, Nov 12, 7:01 PM · SRE, Infrastructure-Foundations, netops, DC-Ops, ops-eqiad
RobH added a parent task for T409800: Row C traffic outage Nov 11 2025: T404609: eqiad: rows C/D Upgrade Tracking.
Wed, Nov 12, 7:00 PM · netops, Infrastructure-Foundations, SRE
RobH added a subtask for T404609: eqiad: rows C/D Upgrade Tracking: T409800: Row C traffic outage Nov 11 2025.
Wed, Nov 12, 7:00 PM · SRE, Infrastructure-Foundations, netops, DC-Ops, ops-eqiad