
RobH (Rob Halsell)
Operations Engineer · Administrator

Projects (21)

User Details

User Since
Nov 24 2014, 1:43 PM (273 w, 2 h)
Roles
Administrator
Availability
Available
IRC Nick
RobH
LDAP User
RobH
MediaWiki User
RobH [ Global Accounts ]

My GPG Key fingerprint = CB1F C7E7 0FF8 5DB2 6820 9C7E 75ED 14C7 0245 D22A

I am an Operations Engineer on Wikimedia's Datacenter Operations Team.

I am also the primary triage engineer for the hardware-requests project, as well as the private S4 procurement space and procurement project.

All questions involving allocation of hardware can be initially addressed on https://wikitech.wikimedia.org/wiki/Operations_requests.

Please note that private messages via Phabricator are not my preferred means of contact. Please feel free to contact me (robh) directly on IRC (freenode), or email my @wikimedia.org address.

Recent Activity

Fri, Feb 14

RobH added a project to T245279: decommission kraz.wikimedia.org: decommission.
Fri, Feb 14, 9:09 PM · decommission, serviceops, Operations, Analytics

Thu, Feb 13

RobH added a comment to T245188: Audit msw1-eqiad cables.

Please note this appears to also be an ideal time to do T225121 perhaps? Rather than updating netbox for the old msw?

Thu, Feb 13, 7:05 PM · Operations, ops-eqiad
RobH added a comment to T245164: Alert for device ps1-a8-codfw.mgmt.codfw.wmnet - Device rebooted.

So this came back after my firmware update, and I logged in, but then logged out after checking that the firmware had updated. Then Arzhel pointed out it wasn't showing online in librenms, so I went to log in a second time, and it didn't work.

Thu, Feb 13, 6:21 PM · Operations, ops-codfw
RobH added a comment to T245164: Alert for device ps1-a8-codfw.mgmt.codfw.wmnet - Device rebooted.

Ok, it was firmware

Thu, Feb 13, 4:32 PM · Operations, ops-codfw
RobH added a comment to T245164: Alert for device ps1-a8-codfw.mgmt.codfw.wmnet - Device rebooted.

Uptime: 0 days 0 hours 20 minutes 27 seconds

Thu, Feb 13, 4:07 PM · Operations, ops-codfw
RobH added a comment to T245056: snag asset tags from ulsfo, ship some to eqsin.

I'll mail this to Jin tomorrow (Friday) with directions to leave them in our racks at eqsin. They won't end up being applied for a couple of weeks, as he is working on another contract for a time, and this isn't an emergency.

Thu, Feb 13, 12:07 AM · Operations, ops-eqsin
RobH added a subtask for T245056: snag asset tags from ulsfo, ship some to eqsin: T244900: apply asset tags to s[12]-60[34]-eqsin.
Thu, Feb 13, 12:06 AM · Operations, ops-eqsin
RobH added a parent task for T244900: apply asset tags to s[12]-60[34]-eqsin: T245056: snag asset tags from ulsfo, ship some to eqsin.
Thu, Feb 13, 12:06 AM · Operations, ops-eqsin
RobH removed a project from T245056: snag asset tags from ulsfo, ship some to eqsin: ops-ulsfo.
Thu, Feb 13, 12:05 AM · Operations, ops-eqsin

Wed, Feb 12

RobH added a comment to T245056: snag asset tags from ulsfo, ship some to eqsin.

WMF7236 - WMF7247 snagged to mail to eqsin

Wed, Feb 12, 10:33 PM · Operations, ops-eqsin
RobH removed a project from T243450: Audit & update spares part tracking for all sites: ops-ulsfo.
Wed, Feb 12, 10:33 PM · ops-eqiad, ops-esams, ops-codfw, ops-eqsin, DC-Ops, Operations
RobH closed T238856: audit cable labels @ ulsfo as Resolved.

Of the missing labels (5 or 6), all but 1 had labels already applied, and 1 had new labels applied. The duplicate id was 1240, not 1241 (dupe).

Wed, Feb 12, 9:42 PM · Operations, ops-ulsfo
RobH triaged T245056: snag asset tags from ulsfo, ship some to eqsin as High priority.
Wed, Feb 12, 8:19 PM · Operations, ops-eqsin
RobH closed T211368: update PDUs for eqsin (asset tag and other info) as Resolved.

Ok, this task was to track the old PDUs, which have been replaced.

Wed, Feb 12, 5:43 PM · Operations, ops-eqsin
RobH added a comment to T243521: Hadoop Hardware Orders FY2019-2020.

So to put some of the figures I just posted in IRC about this:

Wed, Feb 12, 5:27 PM · Analytics-Cluster, Analytics
RobH reassigned T244958: db1095 backup source crashed: broken BBU from wiki_willy to Jclark-ctr.

Please note that we just ordered replacement raid batteries for HP Gen9 raid controllers via T243547.

Wed, Feb 12, 5:11 PM · Patch-For-Review, ops-eqiad, Operations, DBA
RobH closed T244886: (2020-01-15) rack/setup/install mc-gp100[123].eqiad.wmnet as Invalid.

@RobH, those hosts are delivered and in production per T241795

Wed, Feb 12, 4:18 PM · Operations, ops-eqiad, DC-Ops
RobH added a parent task for T241795: (Need By: Jan 10) rack/setup/install mc-gp100[123].eqiad.wmnet: Unknown Object (Task).
Wed, Feb 12, 4:17 PM · serviceops, Operations
RobH removed a subtask for T241795: (Need By: Jan 10) rack/setup/install mc-gp100[123].eqiad.wmnet: Unknown Object (Task).
Wed, Feb 12, 4:17 PM · serviceops, Operations
RobH updated the task description for T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems.
Wed, Feb 12, 4:06 PM · ops-eqiad, DC-Ops, Traffic, ops-esams, Operations
RobH added a comment to T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems.

Please note the outage caused the SAL of my adding back cp108[67] to service, (as the rest weren't really returned, but depooled.)

Wed, Feb 12, 4:05 PM · ops-eqiad, DC-Ops, Traffic, ops-esams, Operations

Tue, Feb 11

RobH updated the task description for T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems.
Tue, Feb 11, 8:56 PM · ops-eqiad, DC-Ops, Traffic, ops-esams, Operations
RobH triaged T244914: trace qfx5100-spare[12]-esams power cables as Low priority.
Tue, Feb 11, 7:29 PM · Operations, ops-esams
RobH closed T237009: Add missing labels for equipment and cables as Resolved.

Updated in netbox; all cables are now tracked in netbox except possibly the spare switches (which were not covered in this task). I'll make a lower-priority task, since the spare switches are only there as spares and not in active use.

Tue, Feb 11, 7:23 PM · DC-Ops, ops-esams, Operations
RobH closed T239125: duplicate cable IDs in eqsin as Resolved.

Fixed via audit of on-site cable mapping and update in netbox.

Tue, Feb 11, 6:02 PM · Operations, ops-eqsin
RobH updated the task description for T242250: rack/setup/install ps[12]-60[34]-eqsin.
Tue, Feb 11, 6:01 PM · Operations, ops-eqsin
RobH closed T242250: rack/setup/install ps[12]-60[34]-eqsin as Resolved.

T244900 created for asset tag application at a later date. This is now resolved and set up for monitoring in both librenms and icinga.

Tue, Feb 11, 6:01 PM · Operations, ops-eqsin
RobH created T244900: apply asset tags to s[12]-60[34]-eqsin.
Tue, Feb 11, 6:00 PM · Operations, ops-eqsin
RobH updated the task description for T242250: rack/setup/install ps[12]-60[34]-eqsin.
Tue, Feb 11, 5:58 PM · Operations, ops-eqsin
RobH added a parent task for T244886: (2020-01-15) rack/setup/install mc-gp100[123].eqiad.wmnet: Unknown Object (Task).
Tue, Feb 11, 4:34 PM · Operations, ops-eqiad, DC-Ops
RobH created T244886: (2020-01-15) rack/setup/install mc-gp100[123].eqiad.wmnet.
Tue, Feb 11, 4:34 PM · Operations, ops-eqiad, DC-Ops
RobH closed Unknown Object (Task), a subtask of T239675: Add 10G NICs to core site DNS servers (6 servers, 3 per site), as Resolved.
Tue, Feb 11, 4:22 PM · hardware-requests, Operations, Traffic
RobH moved T244783: (no date provided) rack/setup/install ganeti20[19-24] from Backlog to Racking Tasks on the ops-codfw board.
Tue, Feb 11, 4:20 PM · ops-codfw, Operations, DC-Ops
RobH reassigned T244783: (no date provided) rack/setup/install ganeti20[19-24] from RobH to Papaul.
Tue, Feb 11, 4:20 PM · ops-codfw, Operations, DC-Ops
RobH updated the task description for T242250: rack/setup/install ps[12]-60[34]-eqsin.
Tue, Feb 11, 4:20 PM · Operations, ops-eqsin
RobH updated the task description for T242250: rack/setup/install ps[12]-60[34]-eqsin.
Tue, Feb 11, 4:19 PM · Operations, ops-eqsin
RobH removed a project from T242250: rack/setup/install ps[12]-60[34]-eqsin: Traffic.
Tue, Feb 11, 4:19 PM · Operations, ops-eqsin
RobH removed a project from T242250: rack/setup/install ps[12]-60[34]-eqsin: Patch-For-Review.
Tue, Feb 11, 4:18 PM · Operations, ops-eqsin

Mon, Feb 10

RobH updated the task description for T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems.
Mon, Feb 10, 11:30 PM · ops-eqiad, DC-Ops, Traffic, ops-esams, Operations
RobH added a project to T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems: ops-eqiad.
Mon, Feb 10, 11:05 PM · ops-eqiad, DC-Ops, Traffic, ops-esams, Operations
RobH updated the task description for T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems.
Mon, Feb 10, 10:44 PM · ops-eqiad, DC-Ops, Traffic, ops-esams, Operations
RobH updated the task description for T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems.
Mon, Feb 10, 10:39 PM · ops-eqiad, DC-Ops, Traffic, ops-esams, Operations
RobH updated the task description for T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems.
Mon, Feb 10, 10:19 PM · ops-eqiad, DC-Ops, Traffic, ops-esams, Operations
RobH added a comment to T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems.

My plan is to do one from each service group (upload/text) at a time, batched together. (It is just as easy to watch two BIOS updates as one, but it doesn't quite scale beyond that for close supervision.) Would that be ok?

Mon, Feb 10, 8:28 PM · ops-eqiad, DC-Ops, Traffic, ops-esams, Operations
RobH updated the task description for T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems.
Mon, Feb 10, 8:28 PM · ops-eqiad, DC-Ops, Traffic, ops-esams, Operations
RobH renamed T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems from Upgrade BIOS and IDRAC firmware on esams caches to Upgrade BIOS and IDRAC firmware on R440 cp systems.
Mon, Feb 10, 8:26 PM · ops-eqiad, DC-Ops, Traffic, ops-esams, Operations
RobH added a comment to T244783: (no date provided) rack/setup/install ganeti20[19-24].

@RobH we already have ganeti2009-ganeti2014 in codfw

Mon, Feb 10, 8:05 PM · ops-codfw, Operations, DC-Ops
RobH renamed T244783: (no date provided) rack/setup/install ganeti20[19-24] from (no date provided) rack/setup/install ganeti2009-ganeti2014 to (no date provided) rack/setup/install ganeti20[16-21].
Mon, Feb 10, 8:05 PM · ops-codfw, Operations, DC-Ops
RobH added a parent task for T244783: (no date provided) rack/setup/install ganeti20[19-24]: Unknown Object (Task).
Mon, Feb 10, 7:53 PM · ops-codfw, Operations, DC-Ops
RobH updated subscribers of T244783: (no date provided) rack/setup/install ganeti20[19-24].
Mon, Feb 10, 7:52 PM · ops-codfw, Operations, DC-Ops
RobH updated subscribers of T244783: (no date provided) rack/setup/install ganeti20[19-24].

@Papaul: Please note that we're now implementing a new process change from @wiki_willy. No specific due date has been provided by the SRE sub-team, so please just set this due date to 20 days after they arrive on site.

Mon, Feb 10, 7:52 PM · ops-codfw, Operations, DC-Ops
RobH updated subscribers of T244783: (no date provided) rack/setup/install ganeti20[19-24].
Mon, Feb 10, 7:51 PM · ops-codfw, Operations, DC-Ops
RobH edited projects for T244783: (no date provided) rack/setup/install ganeti20[19-24], added: ops-codfw; removed ops-eqiad.
Mon, Feb 10, 7:51 PM · ops-codfw, Operations, DC-Ops
RobH reassigned T244783: (no date provided) rack/setup/install ganeti20[19-24] from Jclark-ctr to Papaul.
Mon, Feb 10, 7:51 PM · ops-codfw, Operations, DC-Ops
RobH created T244783: (no date provided) rack/setup/install ganeti20[19-24].
Mon, Feb 10, 7:50 PM · ops-codfw, Operations, DC-Ops
RobH added a comment to T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems.

@BBlack, can we modify this task to include the eqiad caches that need updating as well? I'll be handling these remotely. During this process, if any single server fails and requires on-site work, I'll make a sub-task for its repair off this task.

Mon, Feb 10, 5:44 PM · ops-eqiad, DC-Ops, Traffic, ops-esams, Operations

Thu, Feb 6

RobH edited Description on DC-Ops.
Thu, Feb 6, 8:22 PM
RobH closed T242885: Expand Eqiad Ganeti row_A capacity as Resolved.

Memory ordered on T243442 and implementation tracked on T244530. Resolving this task.

Thu, Feb 6, 7:52 PM · hardware-requests, Operations
RobH added a parent task for T244530: upgrade memory in ganeti100[5-8].eqiad.wmnet: Unknown Object (Task).
Thu, Feb 6, 7:51 PM · Operations, ops-eqiad
RobH triaged T244530: upgrade memory in ganeti100[5-8].eqiad.wmnet as Medium priority.
Thu, Feb 6, 7:51 PM · Operations, ops-eqiad
RobH moved T244506: rack/setup/install kafka-jumbo100[789].eqiad.wmnet from Backlog to Racking Tasks on the ops-eqiad board.
Thu, Feb 6, 5:24 PM · Operations, Analytics, ops-eqiad
RobH added a parent task for T244506: rack/setup/install kafka-jumbo100[789].eqiad.wmnet: Unknown Object (Task).
Thu, Feb 6, 5:23 PM · Operations, Analytics, ops-eqiad
RobH triaged T244506: rack/setup/install kafka-jumbo100[789].eqiad.wmnet as Medium priority.
Thu, Feb 6, 5:23 PM · Operations, Analytics, ops-eqiad
RobH added projects to T243450: Audit & update spares part tracking for all sites: ops-eqsin, ops-ulsfo, ops-codfw, ops-esams, ops-eqiad.

I'm adding in each site's project. Once an on-site engineer has audited and updated the spares tracking sheet for hardware, this task should be commented on and that site's project can be removed.

Thu, Feb 6, 4:30 PM · ops-eqiad, ops-esams, ops-codfw, ops-eqsin, DC-Ops, Operations

Wed, Feb 5

RobH updated the task description for T242250: rack/setup/install ps[12]-60[34]-eqsin.
Wed, Feb 5, 5:53 PM · Operations, ops-eqsin
RobH added a comment to T242250: rack/setup/install ps[12]-60[34]-eqsin.

I've coordinated with Jin via Google Hangout Messages and he has reviewed the rack and ensured he has all the cables needed. I sent in this email to him, but since then he also followed up immediately and went onsite last evening (my time, early am his time) to work on pre-staging things as best he could.

Wed, Feb 5, 5:53 PM · Operations, ops-eqsin

Tue, Feb 4

RobH added a comment to T244291: Upgrade Netbox to 2.7 series.

My only request is that this not happen during the planned eqsin PDU work, starting 2020-02-06 16:00 Pacific / 2020-02-07 00:00 GMT / 2020-02-07 08:00 Singapore time and expected to take a few hours.

Tue, Feb 4, 9:39 PM · Patch-For-Review, netbox, User-crusnov, SRE-tools

Mon, Feb 3

RobH added a comment to T242250: rack/setup/install ps[12]-60[34]-eqsin.

Please note this has been confirmed as likely to occur on Feb 6th (GMT). Jin has confirmed that he can work during that window, and we need to get confirmation from @BBlack that this is ok for Traffic.

Mon, Feb 3, 11:36 PM · Operations, ops-eqsin
RobH edited Description on DC-Ops.
Mon, Feb 3, 7:07 PM

Thu, Jan 30

RobH edited Description on DC-Ops.
Thu, Jan 30, 11:29 PM
RobH edited Description on DC-Ops.
Thu, Jan 30, 11:25 PM
RobH edited Description on DC-Ops.
Thu, Jan 30, 6:42 PM
RobH edited Description on DC-Ops.
Thu, Jan 30, 6:29 PM
RobH edited Description on decommission.
Thu, Jan 30, 6:28 PM
RobH edited Description on DC-Ops.
Thu, Jan 30, 6:27 PM
RobH edited Description on DC-Ops.
Thu, Jan 30, 6:21 PM
RobH edited Description on DC-Ops.
Thu, Jan 30, 6:18 PM
RobH added a hashtag to DC-Ops: #dc-operations.
Thu, Jan 30, 6:09 PM

Thu, Jan 23

RobH moved T242250: rack/setup/install ps[12]-60[34]-eqsin from Backlog to Acknowledged on the Operations board.
Thu, Jan 23, 6:51 PM · Operations, ops-eqsin
RobH added a comment to T242250: rack/setup/install ps[12]-60[34]-eqsin.

I've not seen @BBlack in IRC since posting the above comment, I suspect due to pre-all-hands-rush. We have SRE meeting time set aside during all hands, so I'll sync up with @BBlack about this.

Thu, Jan 23, 6:51 PM · Operations, ops-eqsin
RobH reassigned T229586: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 from RobH to Jclark-ctr.

These hosts are ready for the on-site wipe steps. I've also left the puppet and dns updates, so during our off-site during all hands we can show you how to push these commits to our gerrit server for merge.

Thu, Jan 23, 6:08 PM · ops-eqiad, DC-Ops, Operations, decommission
RobH updated the task description for T229586: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099.
Thu, Jan 23, 6:07 PM · ops-eqiad, DC-Ops, Operations, decommission
RobH placed T229586: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 up for grabs.
Thu, Jan 23, 5:26 PM · ops-eqiad, DC-Ops, Operations, decommission
RobH reassigned T229586: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 from RobH to Volans.
Thu, Jan 23, 4:10 PM · ops-eqiad, DC-Ops, Operations, decommission
jijiki awarded T241852: rack/setup/install new codfw mw systems an Orange Medal token.
Thu, Jan 23, 12:11 PM · ops-codfw, serviceops, Operations

Wed, Jan 22

RobH added a project to T243450: Audit & update spares part tracking for all sites: DC-Ops.
Wed, Jan 22, 9:42 PM · ops-eqiad, ops-esams, ops-codfw, ops-eqsin, DC-Ops, Operations
RobH triaged T243450: Audit & update spares part tracking for all sites as Medium priority.
Wed, Jan 22, 7:50 PM · ops-eqiad, ops-esams, ops-codfw, ops-eqsin, DC-Ops, Operations
RobH moved T204589: eqiad: (1) misc single cpu server allocation for performance browser testing from Pending Approval to Stalled on the hardware-requests board.
Wed, Jan 22, 7:17 PM · Performance-Team (Radar), Operations, hardware-requests
RobH closed T232654: eqiad: three clouvirt-wdqs servers for WDQS testing, a subtask of T221631: Dedicated servers on WMCS to test WDQS scalability strategy, as Resolved.
Wed, Jan 22, 7:17 PM · cloud-services-team (Kanban), Wikidata, Wikidata-Query-Service, Discovery-Search
RobH closed T232654: eqiad: three clouvirt-wdqs servers for WDQS testing as Resolved.

fulfilled by T235685, resolving task off @hw-request workboard.

Wed, Jan 22, 7:16 PM · DC-Ops, hardware-requests, Operations
RobH added a subtask for T242885: Expand Eqiad Ganeti row_A capacity: Unknown Object (Task).
Wed, Jan 22, 6:46 PM · hardware-requests, Operations
RobH moved T242885: Expand Eqiad Ganeti row_A capacity from Backlog to In Discussion / Review on the hardware-requests board.
Wed, Jan 22, 6:39 PM · hardware-requests, Operations
RobH reassigned T214024: Two test hosts for SREs from RobH to faidon.

Ok, wmf5175 was ordered and can be allocated as the dual cpu spare pool system currently available in eqiad.

Wed, Jan 22, 6:39 PM · Operations, hardware-requests
RobH created T243433: cloudclastic1006 malformed asset tag - report error.
Wed, Jan 22, 5:57 PM · ops-eqiad, Operations, DC-Ops
RobH added a comment to T242250: rack/setup/install ps[12]-60[34]-eqsin.

We only got confirmation of delivery of the PDUs yesterday via email. I'll be dispatching directions to Jin after we determine what date works best.

Wed, Jan 22, 4:09 PM · Operations, ops-eqsin

Tue, Jan 21

RobH added a comment to T242885: Expand Eqiad Ganeti row_A capacity.

Ok, next steps for this as far as I can tell:

Tue, Jan 21, 10:11 PM · hardware-requests, Operations
RobH updated subscribers of T242097: mr1-esams i2c syslog flood.

Next steps:

Tue, Jan 21, 8:42 PM · Operations, netops
RobH added a subtask for T242097: mr1-esams i2c syslog flood: Unknown Object (Task).
Tue, Jan 21, 8:37 PM · Operations, netops
RobH moved T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems from Triage to Hardware on the Traffic board.
Tue, Jan 21, 4:53 PM · ops-eqiad, DC-Ops, Traffic, ops-esams, Operations
RobH reassigned T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems from RobH to BBlack.

Please note that Traffic (and @BBlack) previously asked me NOT to do this on these hosts, while they worked out why they are crashing. (reference T238305)

Tue, Jan 21, 4:53 PM · ops-eqiad, DC-Ops, Traffic, ops-esams, Operations