
RobH (Rob Halsell)
Operations Engineer · Administrator

User Details

User Since
Nov 24 2014, 1:43 PM (309 w, 4 d)
Roles
Administrator
Availability
Available
IRC Nick
RobH
LDAP User
RobH
MediaWiki User
RobH

My GPG Key fingerprint = CB1F C7E7 0FF8 5DB2 6820 9C7E 75ED 14C7 0245 D22A

I am an Operations Engineer on Wikimedia's Datacenter Operations Team.

I am also the primary triage engineer for the hardware-requests project, as well as for the private S4 procurement space and the procurement project.

All questions involving allocation of hardware can be initially addressed on https://wikitech.wikimedia.org/wiki/Operations_requests.

Please note that private messages via Phabricator are not my preferred means of contact. Feel free to contact me (robh) directly via IRC (freenode), or email my @wikimedia.org address.

Recent Activity

Yesterday

RobH added a comment to T260269: (Need By: TBD) rack/setup/install maps10[05-10].eqiad.wmnet.

Can anything be done to help unblock this task? This capacity is needed for the cluster as it's quite short on resources.

Thu, Oct 29, 9:55 PM · Maps, ops-eqiad, DC-Ops, Operations
RobH updated the task description for T260269: (Need By: TBD) rack/setup/install maps10[05-10].eqiad.wmnet.
Thu, Oct 29, 9:06 PM · Maps, ops-eqiad, DC-Ops, Operations
RobH updated the task description for T260269: (Need By: TBD) rack/setup/install maps10[05-10].eqiad.wmnet.
Thu, Oct 29, 8:21 PM · Maps, ops-eqiad, DC-Ops, Operations
RobH updated the task description for T260269: (Need By: TBD) rack/setup/install maps10[05-10].eqiad.wmnet.
Thu, Oct 29, 7:14 PM · Maps, ops-eqiad, DC-Ops, Operations

Wed, Oct 28

RobH moved T266724: (Need By: TBD) rack/setup/install rdb101[12] from Backlog to Racking Tasks on the ops-eqiad board.
Wed, Oct 28, 10:29 PM · Operations, ops-eqiad, DC-Ops
RobH added a parent task for T266724: (Need By: TBD) rack/setup/install rdb101[12]: Unknown Object (Task).
Wed, Oct 28, 10:29 PM · Operations, ops-eqiad, DC-Ops
RobH created T266724: (Need By: TBD) rack/setup/install rdb101[12].
Wed, Oct 28, 10:29 PM · Operations, ops-eqiad, DC-Ops
RobH moved T266721: (Need By: TBD) rack/setup/install rdb20[09|10] from Backlog to Racking Tasks on the ops-codfw board.
Wed, Oct 28, 10:08 PM · Operations, ops-codfw, DC-Ops
RobH created T266721: (Need By: TBD) rack/setup/install rdb20[09|10].
Wed, Oct 28, 10:07 PM · Operations, ops-codfw, DC-Ops
RobH added a parent task for T266709: an-coord1001 ram upgrade: Unknown Object (Task).
Wed, Oct 28, 8:43 PM · Reading Epics (Analytics), Operations, ops-eqiad
RobH triaged T266709: an-coord1001 ram upgrade as Medium priority.
Wed, Oct 28, 8:43 PM · Reading Epics (Analytics), Operations, ops-eqiad
RobH added a parent task for T266623: relocate/reimage cloudvirt1030 with 10G interfaces: Unknown Object (Task).
Wed, Oct 28, 8:16 PM · cloud-services-team (Hardware), ops-eqiad, DC-Ops, Operations
RobH added a parent task for T266514: relocate/reimage cloudvirt1028 with 10G interfaces: Unknown Object (Task).
Wed, Oct 28, 8:15 PM · cloud-services-team (Hardware), ops-eqiad, DC-Ops, Operations
RobH added a parent task for T266369: relocate/reimage cloudvirt1027 with 10G interfaces: Unknown Object (Task).
Wed, Oct 28, 8:15 PM · cloud-services-team (Hardware), ops-eqiad, DC-Ops, Operations
RobH added a parent task for T266281: relocate/reimage cloudvirt1026 with 10G interfaces: Unknown Object (Task).
Wed, Oct 28, 8:14 PM · cloud-services-team (Hardware), ops-eqiad, DC-Ops, Operations
RobH added a parent task for T266206: relocate/reimage cloudvirt1029 with 10G interfaces: Unknown Object (Task).
Wed, Oct 28, 8:14 PM · cloud-services-team (Hardware), ops-eqiad, DC-Ops, Operations
RobH added a parent task for T266187: relocate/reimage cloudvirt1025 with 10G interfaces: Unknown Object (Task).
Wed, Oct 28, 8:14 PM · cloud-services-team (Hardware), ops-eqiad, DC-Ops, Operations
RobH added a comment to T266623: relocate/reimage cloudvirt1030 with 10G interfaces.

IRC Update:

Wed, Oct 28, 7:57 PM · cloud-services-team (Hardware), ops-eqiad, DC-Ops, Operations
RobH claimed T266623: relocate/reimage cloudvirt1030 with 10G interfaces.

This server does not have a 10G NIC.

Wed, Oct 28, 7:10 PM · cloud-services-team (Hardware), ops-eqiad, DC-Ops, Operations
RobH reopened T266497: fix/replace cable ID 2648 on FB peering patch - cable report error as "Open".

2649 is also already in use, so your fix has introduced a new error:

Wed, Oct 28, 6:56 PM · ops-eqiad, DC-Ops, Operations
RobH reopened T266497: fix/replace cable ID 2648 on FB peering patch - cable report error, a subtask of T265916: patch in FB peering into cr1-eqiad:xe-3/2/1, as Open.
Wed, Oct 28, 6:56 PM · netops, Operations, DC-Ops

Tue, Oct 27

RobH triaged T266604: ms-be1057 down - cable disconnected? as High priority.

The switch port sees this port enabled (admin up) but link down, supporting that it could be a bad cable or cable disconnect.
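The reasoning in this comment can be sketched as a small decision function. This is only an illustration under assumed names (`PortState`, `diagnose`, and the returned strings are hypothetical, not part of any switch tooling mentioned here): a port that is administratively enabled but shows no link points at the physical layer rather than at configuration.

```python
from dataclasses import dataclass


@dataclass
class PortState:
    """Hypothetical summary of what the switch reports for one port."""
    admin_up: bool  # port is enabled in the switch configuration
    link_up: bool   # switch detects a physical link on the port


def diagnose(port: PortState) -> str:
    """Return a first guess at where to look, based on the two flags."""
    if not port.admin_up:
        # Port is shut down in configuration, so link state says nothing yet.
        return "port disabled in config"
    if not port.link_up:
        # Enabled but no link: cable, optic, or host NIC is the likely culprit.
        return "check cable or host NIC"
    return "link ok"


# The situation described for ms-be1057: admin up, link down.
print(diagnose(PortState(admin_up=True, link_up=False)))
```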

Tue, Oct 27, 9:42 PM · DC-Ops, ops-eqiad, Operations
RobH moved T266604: ms-be1057 down - cable disconnected? from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.
Tue, Oct 27, 9:41 PM · DC-Ops, ops-eqiad, Operations
RobH triaged T266192: Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 as Medium priority.
Tue, Oct 27, 6:18 PM · cloud-services-team (Hardware), ops-eqiad, Data-Services, Operations
RobH reassigned T266192: Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 from RobH to Cmjohnson.

So this appears to basically require someone in eqiad to perform a cross connection between the two hosts. I'm not actually the person to schedule this; that would be either @wiki_willy or one of the onsites in eqiad directly (@Cmjohnson or @Jclark-ctr). John is out due to a broken hand, so I'll assign this directly to Chris.

Tue, Oct 27, 6:16 PM · cloud-services-team (Hardware), ops-eqiad, Data-Services, Operations
RobH added a comment to T261130: ganeti5002 was down / powered off, machine check entries in SEL.

For some reason (we found this out a few months ago), Dell Singapore part replacements don't go out with return tags. They require you to call and schedule a pickup of the part with Dell after you swap things out.

Tue, Oct 27, 1:31 AM · serviceops, Operations, ops-eqsin

Mon, Oct 26

RobH added a comment to T266497: fix/replace cable ID 2648 on FB peering patch - cable report error.

I set this to high priority, since it's causing a report error. Once a Netbox report is in error, it won't repeat/append to its error state via IRC echo.
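The behaviour described here resembles notifying only on a change of state. The sketch below is an assumption about that general pattern, not the actual Netbox or IRC bot implementation; `ReportNotifier` and its fields are hypothetical names. It shows why a second problem stays silent while the report is already failing.

```python
class ReportNotifier:
    """Hypothetical notifier that announces a report only when its status changes."""

    def __init__(self) -> None:
        self.last_status: dict[str, str] = {}  # report name -> last seen status
        self.sent: list[str] = []              # messages that would be echoed to IRC

    def update(self, report: str, status: str) -> bool:
        """Record the new status; return True if a message was emitted."""
        previous = self.last_status.get(report)
        self.last_status[report] = status
        if status != previous:
            self.sent.append(f"{report}: {status}")
            return True
        # Status unchanged: a report already in error adds nothing new.
        return False


notifier = ReportNotifier()
notifier.update("cables", "error")  # first failure is announced
notifier.update("cables", "error")  # further failures are not
print(notifier.sent)
```

Under this assumption, clearing the existing error quickly matters, because any new error introduced in the meantime would go unannounced.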

Mon, Oct 26, 9:33 PM · ops-eqiad, DC-Ops, Operations
RobH triaged T266497: fix/replace cable ID 2648 on FB peering patch - cable report error as High priority.
Mon, Oct 26, 9:32 PM · ops-eqiad, DC-Ops, Operations
RobH updated the task description for T266497: fix/replace cable ID 2648 on FB peering patch - cable report error.
Mon, Oct 26, 6:28 PM · ops-eqiad, DC-Ops, Operations
RobH added a comment to T266497: fix/replace cable ID 2648 on FB peering patch - cable report error.

So to find an available cable ID at a given site, I do the following:

Mon, Oct 26, 6:26 PM · ops-eqiad, DC-Ops, Operations
RobH created T266497: fix/replace cable ID 2648 on FB peering patch - cable report error.
Mon, Oct 26, 6:22 PM · ops-eqiad, DC-Ops, Operations
RobH updated the task description for T265916: patch in FB peering into cr1-eqiad:xe-3/2/1.
Mon, Oct 26, 5:34 PM · netops, Operations, DC-Ops
RobH updated the task description for T265916: patch in FB peering into cr1-eqiad:xe-3/2/1.
Mon, Oct 26, 5:34 PM · netops, Operations, DC-Ops
RobH added a comment to T265916: patch in FB peering into cr1-eqiad:xe-3/2/1.

I've updated the circuit (with its circuit ID) and the cable (with its cable ID, and set to status connected).

Mon, Oct 26, 5:32 PM · netops, Operations, DC-Ops
RobH reassigned T265916: patch in FB peering into cr1-eqiad:xe-3/2/1 from Cmjohnson to ayounsi.
Mon, Oct 26, 5:31 PM · netops, Operations, DC-Ops
RobH removed a project from T265916: patch in FB peering into cr1-eqiad:xe-3/2/1: ops-eqiad.

Forgot to add: I do not have a link light.

Mon, Oct 26, 5:31 PM · netops, Operations, DC-Ops
RobH updated the task description for T265916: patch in FB peering into cr1-eqiad:xe-3/2/1.
Mon, Oct 26, 5:30 PM · netops, Operations, DC-Ops
RobH moved T266481: (Need By: TBD) rack/setup/install payments100[5-8] from Backlog to Racking Tasks on the ops-eqiad board.
Mon, Oct 26, 3:38 PM · ops-eqiad, Operations, DC-Ops
RobH added a parent task for T266481: (Need By: TBD) rack/setup/install payments100[5-8]: Unknown Object (Task).
Mon, Oct 26, 3:37 PM · ops-eqiad, Operations, DC-Ops
RobH updated the task description for T266481: (Need By: TBD) rack/setup/install payments100[5-8].
Mon, Oct 26, 3:37 PM · ops-eqiad, Operations, DC-Ops
RobH created T266481: (Need By: TBD) rack/setup/install payments100[5-8].
Mon, Oct 26, 3:37 PM · ops-eqiad, Operations, DC-Ops

Fri, Oct 23

RobH moved T266365: (Need By: TBD) rack/setup/install frqueue100[34] from Backlog to Racking Tasks on the ops-eqiad board.
Fri, Oct 23, 6:38 PM · Operations, ops-eqiad, DC-Ops
RobH added a parent task for T266365: (Need By: TBD) rack/setup/install frqueue100[34]: Unknown Object (Task).
Fri, Oct 23, 6:38 PM · Operations, ops-eqiad, DC-Ops
RobH created T266365: (Need By: TBD) rack/setup/install frqueue100[34].
Fri, Oct 23, 6:38 PM · Operations, ops-eqiad, DC-Ops
RobH added a parent task for T266363: (Need By: TBD) rack/setup/install deploy2002: Unknown Object (Task).
Fri, Oct 23, 6:31 PM · Operations, ops-codfw, DC-Ops
RobH moved T266363: (Need By: TBD) rack/setup/install deploy2002 from Backlog to Racking Tasks on the ops-codfw board.
Fri, Oct 23, 6:31 PM · Operations, ops-codfw, DC-Ops
RobH created T266363: (Need By: TBD) rack/setup/install deploy2002.
Fri, Oct 23, 6:31 PM · Operations, ops-codfw, DC-Ops

Thu, Oct 22

RobH changed the status of Unknown Object (Task), a subtask of T266016: Refresh and expand Swift hardware capacity, from Stalled to Open.
Thu, Oct 22, 4:04 PM · User-fgiunchedi, SRE-swift-storage

Wed, Oct 21

RobH updated the task description for T261405: db1139 memory errors on boot 2020-08-27.
Wed, Oct 21, 4:41 PM · Operations, DBA, ops-eqiad
RobH updated the task description for T261405: db1139 memory errors on boot 2020-08-27.
Wed, Oct 21, 4:40 PM · Operations, DBA, ops-eqiad
RobH added a comment to T261405: db1139 memory errors on boot 2020-08-27.

Oh, if it is a mainboard replacement, the host will need a reimage. I assume if that is the case, it can come offline well in advance, as it's basically re-entering service as a new host. We'll know later today.

Wed, Oct 21, 4:22 PM · Operations, DBA, ops-eqiad
RobH claimed T261405: db1139 memory errors on boot 2020-08-27.

Jaime: I didn't realize the DB systems hardware repair cadence was different than the other systems (with the DBA team only taking it offline immediately before work). I'll have to figure out where to document that so I don't forget when I don't work on the DB systems for a few months. Your explanation makes perfect sense, thank you!

Wed, Oct 21, 4:21 PM · Operations, DBA, ops-eqiad
RobH added a comment to T261405: db1139 memory errors on boot 2020-08-27.

They emailed me and required that I upload the AHS log via an HTTPS drop box utility, so I did so along with the IML log file.

Wed, Oct 21, 3:54 AM · Operations, DBA, ops-eqiad

Tue, Oct 20

RobH added a comment to T261405: db1139 memory errors on boot 2020-08-27.

Case ID: 5350976764 opened, requesting a new mainboard and any/all migration directions to be dispatched to eqiad to @Cmjohnson's attention. (He is currently out sick, but is projected to be on-site before John.)

Tue, Oct 20, 9:54 PM · Operations, DBA, ops-eqiad
RobH added a comment to T261405: db1139 memory errors on boot 2020-08-27.

DIMM Failure - Uncorrectable Memory Error (Processor 2, DIMM 5) is the actual failure from the log.

Tue, Oct 20, 9:29 PM · Operations, DBA, ops-eqiad
RobH added a comment to T261405: db1139 memory errors on boot 2020-08-27.

I'm waiting on the very slow HPE site upload to parse the AHS file I downloaded for this. I also noticed via the HTTPS interface (https://db1139.mgmt.eqiad.wmnet/) that it has an Integrated Management Log (nearly identical to Dell's Service Event Log), which includes the memory error.

Tue, Oct 20, 9:28 PM · Operations, DBA, ops-eqiad
RobH updated the task description for T261405: db1139 memory errors on boot 2020-08-27.
Tue, Oct 20, 9:13 PM · Operations, DBA, ops-eqiad
RobH reassigned T261405: db1139 memory errors on boot 2020-08-27 from RobH to jcrespo.
Tue, Oct 20, 9:12 PM · Operations, DBA, ops-eqiad
RobH added a comment to T261405: db1139 memory errors on boot 2020-08-27.

This task has a number of issues, starting with:

Tue, Oct 20, 9:11 PM · Operations, DBA, ops-eqiad
RobH updated the task description for T261405: db1139 memory errors on boot 2020-08-27.
Tue, Oct 20, 9:07 PM · Operations, DBA, ops-eqiad

Mon, Oct 19

RobH reopened T265653: (Need By: TBD) setup/install deploy1002 as "Open".

I shouldn't have resolved this; the hostname label still has to go on.

Mon, Oct 19, 7:08 PM · ops-eqiad, Operations, DC-Ops
RobH closed T265653: (Need By: TBD) setup/install deploy1002 as Resolved.
Mon, Oct 19, 7:07 PM · ops-eqiad, Operations, DC-Ops
RobH updated the task description for T265412: patch in FB peering into cr2-eqdfw.
Mon, Oct 19, 5:34 PM · netops, ops-codfw, Operations
RobH updated the task description for T265916: patch in FB peering into cr1-eqiad:xe-3/2/1.
Mon, Oct 19, 5:34 PM · netops, Operations, DC-Ops
RobH updated the task description for T265412: patch in FB peering into cr2-eqdfw.
Mon, Oct 19, 5:33 PM · netops, ops-codfw, Operations
RobH added a parent task for T265916: patch in FB peering into cr1-eqiad:xe-3/2/1: Unknown Object (Task).
Mon, Oct 19, 5:32 PM · netops, Operations, DC-Ops
RobH updated the task description for T265916: patch in FB peering into cr1-eqiad:xe-3/2/1.
Mon, Oct 19, 5:30 PM · netops, Operations, DC-Ops
RobH added a comment to T265916: patch in FB peering into cr1-eqiad:xe-3/2/1.

Please note that FB provided a circuit ID for the other peering connection but not this one, so its entry for the circuit is N/A. I noticed that we have other (peering) circuits that also lack a circuit ID, so I just copied those. (This makes sense, since older peerings are handshake-type deals and less formalized than things like transit/transport/wave/etc.)

Mon, Oct 19, 5:29 PM · netops, Operations, DC-Ops
RobH updated the task description for T265916: patch in FB peering into cr1-eqiad:xe-3/2/1.
Mon, Oct 19, 5:26 PM · netops, Operations, DC-Ops
RobH updated the task description for T265916: patch in FB peering into cr1-eqiad:xe-3/2/1.
Mon, Oct 19, 5:26 PM · netops, Operations, DC-Ops
RobH renamed T265916: patch in FB peering into cr1-eqiad:xe-3/2/1 from patch in FB peering into cr2-eqdfw to patch in FB peering into cr1-eqiad:xe-3/2/1.
Mon, Oct 19, 5:12 PM · netops, Operations, DC-Ops
RobH added a comment to T261130: ganeti5002 was down / powered off, machine check entries in SEL.

So we got some movement on this Friday/replies today. Dell Singapore is being very difficult and requires a local contact number. I've gone ahead and cleared Jin's info with him, and handed it to Dell. Jin 2/ DreamIIC will coordinate the part replacement and update us.

Mon, Oct 19, 3:50 PM · serviceops, Operations, ops-eqsin
RobH created T265916: patch in FB peering into cr1-eqiad:xe-3/2/1.
Mon, Oct 19, 3:38 PM · netops, Operations, DC-Ops
RobH updated the task description for T265412: patch in FB peering into cr2-eqdfw.
Mon, Oct 19, 3:35 PM · netops, ops-codfw, Operations
RobH added a member for acl*procurement-review: Arrbee.
Mon, Oct 19, 3:24 PM
RobH added a member for acl*procurement-review: Pginer-WMF.
Mon, Oct 19, 3:24 PM
RobH added a member for acl*procurement-review: KartikMistry.
Mon, Oct 19, 3:23 PM
RobH added a member for acl*procurement-review: santhosh.
Mon, Oct 19, 3:23 PM

Fri, Oct 16

RobH reassigned T265653: (Need By: TBD) setup/install deploy1002 from RobH to Dzahn.

This fails reimage due to the initial puppet run failing. Not sure if we should apply a different role, or if you want to take over and reimage from here.

Fri, Oct 16, 8:29 PM · ops-eqiad, Operations, DC-Ops
RobH removed a project from T265653: (Need By: TBD) setup/install deploy1002: Patch-For-Review.
Fri, Oct 16, 7:29 PM · ops-eqiad, Operations, DC-Ops

Thu, Oct 15

RobH updated the task description for T265653: (Need By: TBD) setup/install deploy1002.
Thu, Oct 15, 9:25 PM · ops-eqiad, DC-Ops, Operations
RobH reassigned T265653: (Need By: TBD) setup/install deploy1002 from RobH to Dzahn.

setup notes:

Thu, Oct 15, 8:05 PM · ops-eqiad, DC-Ops, Operations
RobH updated the task description for T265653: (Need By: TBD) setup/install deploy1002.
Thu, Oct 15, 7:36 PM · ops-eqiad, DC-Ops, Operations
RobH added a parent task for T265653: (Need By: TBD) setup/install deploy1002: Unknown Object (Task).
Thu, Oct 15, 7:18 PM · ops-eqiad, DC-Ops, Operations
RobH claimed T265653: (Need By: TBD) setup/install deploy1002.
Thu, Oct 15, 7:18 PM · ops-eqiad, DC-Ops, Operations
RobH created T265653: (Need By: TBD) setup/install deploy1002.
Thu, Oct 15, 7:18 PM · ops-eqiad, DC-Ops, Operations

Wed, Oct 14

RobH placed T238036: scs-c1-eqiad CPU usage over 85% up for grabs.
Wed, Oct 14, 4:26 PM · ops-eqiad, DC-Ops, Operations
RobH closed T238036: scs-c1-eqiad CPU usage over 85% as Resolved.

If this happens again on any scs device, other than scs-a8-eqiad, it means the firmware update to 4.9.0u1 (fleetwide) doesn't fix the CPU spike issue.

Wed, Oct 14, 4:26 PM · ops-eqiad, DC-Ops, Operations

Tue, Oct 13

RobH updated the task description for T265412: patch in FB peering into cr2-eqdfw.
Tue, Oct 13, 10:41 PM · netops, Operations, ops-codfw
RobH added a parent task for T265419: (Need By: TBD) rack/setup/install ms-be20[58-61]: Unknown Object (Task).
Tue, Oct 13, 10:05 PM · Operations, ops-codfw, DC-Ops
RobH created T265419: (Need By: TBD) rack/setup/install ms-be20[58-61].
Tue, Oct 13, 10:05 PM · Operations, ops-codfw, DC-Ops
RobH added a parent task for T265412: patch in FB peering into cr2-eqdfw: Unknown Object (Task).
Tue, Oct 13, 9:03 PM · netops, Operations, ops-codfw
RobH updated the task description for T265412: patch in FB peering into cr2-eqdfw.
Tue, Oct 13, 9:02 PM · netops, Operations, ops-codfw
RobH triaged T265412: patch in FB peering into cr2-eqdfw as Medium priority.
Tue, Oct 13, 9:02 PM · netops, Operations, ops-codfw
RobH added a comment to T238036: scs-c1-eqiad CPU usage over 85%.

I've successfully upgraded the scs firmware fleetwide, with the exception of two devices:

Tue, Oct 13, 7:02 PM · ops-eqiad, DC-Ops, Operations
RobH added a comment to T238036: scs-c1-eqiad CPU usage over 85%.
Tue, Oct 13, 6:00 PM · ops-eqiad, DC-Ops, Operations

Thu, Oct 8

RobH moved T265093: (Need By: TBD) rack/setup/install ms-be106[0-3] from Backlog to Racking Tasks on the ops-eqiad board.
Thu, Oct 8, 9:45 PM · Operations, ops-eqiad, DC-Ops
RobH added a parent task for T265093: (Need By: TBD) rack/setup/install ms-be106[0-3]: Unknown Object (Task).
Thu, Oct 8, 9:41 PM · Operations, ops-eqiad, DC-Ops
RobH created T265093: (Need By: TBD) rack/setup/install ms-be106[0-3].
Thu, Oct 8, 9:41 PM · Operations, ops-eqiad, DC-Ops

Wed, Oct 7

RobH removed a member for SRE-Access-Requests: RobH.
Wed, Oct 7, 7:53 PM