
RobH (Rob Halsell)
Operations Engineer · Administrator

Projects (22)

User Details

User Since
Nov 24 2014, 1:43 PM (247 w, 6 d)
Roles
Administrator
Availability
Available
IRC Nick
RobH
LDAP User
RobH
MediaWiki User
RobH [ Global Accounts ]

My GPG Key fingerprint = CB1F C7E7 0FF8 5DB2 6820 9C7E 75ED 14C7 0245 D22A

I am an Operations Engineer on Wikimedia's Datacenter Operations Team.

I am also the primary triage engineer for the hardware-requests project, as well as for the private S4 procurement space and the procurement project.

All questions involving allocation of hardware can be initially addressed on https://wikitech.wikimedia.org/wiki/Operations_requests.

Please note that private messages via Phabricator are not my preferred means of contact. Please feel free to contact me (robh) directly via IRC (freenode), or email my @wikimedia.org address.

Recent Activity

Fri, Aug 23

RobH reassigned T223216: Decommission db2034 from RobH to Papaul.
Fri, Aug 23, 11:42 PM · Operations, decommission, ops-codfw
RobH removed a project from T223216: Decommission db2034: Patch-For-Review.
Fri, Aug 23, 11:42 PM · Operations, decommission, ops-codfw
RobH added a comment to T223216: Decommission db2034.

I cannot locate the labeled switch port on the switch, so @Papaul will need to trace and disable this via on-site work.

Fri, Aug 23, 11:36 PM · Operations, decommission, ops-codfw
RobH reassigned T228281: decommission db2045.codfw.wmnet from RobH to Papaul.
Fri, Aug 23, 11:13 PM · ops-codfw, Operations, DC-Ops, decommission
RobH updated the task description for T228281: decommission db2045.codfw.wmnet.
Fri, Aug 23, 11:11 PM · ops-codfw, Operations, DC-Ops, decommission
RobH updated the task description for T228281: decommission db2045.codfw.wmnet.
Fri, Aug 23, 11:02 PM · ops-codfw, Operations, DC-Ops, decommission
RobH reassigned T200209: Decom graphite2001/WMF6160 from RobH to Papaul.
Fri, Aug 23, 7:41 PM · Patch-For-Review, decommission, ops-codfw, Operations, observability
RobH updated the task description for T200209: Decom graphite2001/WMF6160.
Fri, Aug 23, 7:40 PM · Patch-For-Review, decommission, ops-codfw, Operations, observability
RobH added a comment to T200209: Decom graphite2001/WMF6160.

statsd.codfw.wmnet points to graphite2001.codfw.wmnet, so I'm not sure what to point this at.

Fri, Aug 23, 7:35 PM · Patch-For-Review, decommission, ops-codfw, Operations, observability
RobH added a comment to T200209: Decom graphite2001/WMF6160.

Ok, I synced with @wiki_willy about this and the comment above.

Fri, Aug 23, 7:34 PM · Patch-For-Review, decommission, ops-codfw, Operations, observability
RobH renamed T200209: Decom graphite2001/WMF6160 from Decom graphite2001 to Decom graphite2001/WMF6160.
Fri, Aug 23, 7:19 PM · Patch-For-Review, decommission, ops-codfw, Operations, observability
RobH updated the task description for T200209: Decom graphite2001/WMF6160.
Fri, Aug 23, 7:13 PM · Patch-For-Review, decommission, ops-codfw, Operations, observability
RobH updated the task description for T200209: Decom graphite2001/WMF6160.
Fri, Aug 23, 7:09 PM · Patch-For-Review, decommission, ops-codfw, Operations, observability
RobH updated the task description for T200209: Decom graphite2001/WMF6160.
Fri, Aug 23, 7:06 PM · Patch-For-Review, decommission, ops-codfw, Operations, observability
RobH updated subscribers of T200209: Decom graphite2001/WMF6160.

I wanted to check with @wiki_willy if we need to reclaim this to spares, or if we can decommission and pull it out of the rack.

Fri, Aug 23, 6:59 PM · Patch-For-Review, decommission, ops-codfw, Operations, observability
RobH reassigned T200210: Decom graphite2002 & return server to spares pool from RobH to Papaul.

This is ready for disk wipe, hostname label removal, and then to be returned to the spares pool. Please note I've modified the above checklist since this isn't a full decommission and disposal. Once the disks are wiped and the hostname label removed, this can be resolved.

Fri, Aug 23, 5:48 PM · decommission, observability, Operations, ops-codfw
RobH updated the task description for T200210: Decom graphite2002 & return server to spares pool.
Fri, Aug 23, 5:47 PM · decommission, observability, Operations, ops-codfw
RobH removed a project from T200210: Decom graphite2002 & return server to spares pool: Patch-For-Review.
Fri, Aug 23, 5:47 PM · decommission, observability, Operations, ops-codfw
RobH updated the task description for T200210: Decom graphite2002 & return server to spares pool.
Fri, Aug 23, 5:40 PM · decommission, observability, Operations, ops-codfw
RobH renamed T200210: Decom graphite2002 & return server to spares pool from Decom graphite2002 to Decom graphite2002 & return server to spares pool.
Fri, Aug 23, 5:36 PM · decommission, observability, Operations, ops-codfw

Thu, Aug 22

RobH added a comment to T231046: setup/install gerrit1001.

I think you are the person to push this into service, being the author of the original hardware request. If not you, please advise if you know who should get this, and if not sure, assign back to me for followup.

Thu, Aug 22, 11:33 PM · Operations
RobH reassigned T231046: setup/install gerrit1001 from RobH to Dzahn.
Thu, Aug 22, 11:32 PM · Operations
RobH updated the task description for T231046: setup/install gerrit1001.
Thu, Aug 22, 10:54 PM · Operations
RobH removed a project from T231046: setup/install gerrit1001: Patch-For-Review.
Thu, Aug 22, 10:23 PM · Operations
RobH updated the task description for T231046: setup/install gerrit1001.
Thu, Aug 22, 9:49 PM · Operations
RobH created T231047: apply hostname label for wmf5176/gerrit1001.
Thu, Aug 22, 9:48 PM · Operations
RobH updated the task description for T231046: setup/install gerrit1001.
Thu, Aug 22, 9:47 PM · Operations
RobH added a parent task for T231046: setup/install gerrit1001: Unknown Object (Task).
Thu, Aug 22, 9:46 PM · Operations
RobH triaged T231046: setup/install gerrit1001 as Normal priority.
Thu, Aug 22, 9:46 PM · Operations
RobH added a comment to T148541: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring.

From a chat with @faidon it emerged that we have at least three main use cases for PDU metrics:

  1. Checking overload / availability of rack infeeds (e.g. for redundant power, if we're using over 50% of available power that means that going non-redundant will trip the breaker)
  2. Power consumption for general site monitoring (per row/rack/site)
  3. Capacity planning (e.g. for footprint expansion or shrinkage as needed) (per row/rack/site)

I'd like to get some input / review on which of the above infeed metrics we should be looking at to get the right numbers out, cc DC-Ops @wiki_willy
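The overload rule in use case 1 can be sketched as a small check. This is only an illustration, assuming a rack with two infeeds of equal rating; the class name, field names, and amp values below are hypothetical and not taken from any real PDU or exporter.

```python
from dataclasses import dataclass


@dataclass
class RackInfeeds:
    """Hypothetical snapshot of a rack fed by two infeeds of equal rating."""
    rack: str
    capacity_amps: float  # assumed usable rating of ONE infeed
    feed_a_amps: float    # current draw on infeed A
    feed_b_amps: float    # current draw on infeed B


def redundant_utilization(r: RackInfeeds) -> float:
    """Fraction of the total available power (both infeeds) in use."""
    return (r.feed_a_amps + r.feed_b_amps) / (2 * r.capacity_amps)


def survives_failover(r: RackInfeeds) -> bool:
    """True if a single infeed could carry the whole rack load.

    Equivalent to redundant_utilization(r) <= 0.5: above 50%, losing
    one infeed pushes the other past its rating.
    """
    return r.feed_a_amps + r.feed_b_amps <= r.capacity_amps


if __name__ == "__main__":
    for r in (RackInfeeds("rack-x", 24.0, 7.0, 8.0),
              RackInfeeds("rack-y", 24.0, 13.0, 12.5)):
        print(r.rack, redundant_utilization(r), survives_failover(r))
```

The same total-load figure also serves use cases 2 and 3 when summed per rack, row, or site.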

Thu, Aug 22, 5:24 PM · User-fgiunchedi, Patch-For-Review, Prometheus-metrics-monitoring, Operations, observability

Mon, Aug 19

RobH added a comment to T230746: rack/setup/install elastic10[53-67].eqiad.wmnet.

Try to space out elastic nodes evenly across the row in 1G racks.

All new elastic servers are coming in with 10G cards and should go into 10G racks.
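For reference, the bracketed range in the task title (elastic10[53-67].eqiad.wmnet) reads as a run of consecutive hostnames. A minimal sketch of that expansion follows, assuming the brackets hold a single inclusive numeric range; the helper name is hypothetical and this is not the tooling used on the task.

```python
import re
from typing import List


def expand_host_range(pattern: str) -> List[str]:
    """Expand one 'prefix[start-end]suffix' range into hostnames.

    Assumes a single inclusive numeric range; a pattern without a
    range is returned unchanged as a one-element list.
    """
    m = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if m is None:
        return [pattern]
    prefix, start, end, suffix = m.groups()
    width = len(start)  # keep any zero padding from the start value
    return [f"{prefix}{n:0{width}d}{suffix}"
            for n in range(int(start), int(end) + 1)]


if __name__ == "__main__":
    for host in expand_host_range("elastic10[53-67].eqiad.wmnet"):
        print(host)
```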

Mon, Aug 19, 8:28 PM · Operations, ops-eqiad
RobH updated the task description for T230746: rack/setup/install elastic10[53-67].eqiad.wmnet.
Mon, Aug 19, 8:27 PM · Operations, ops-eqiad
RobH closed T221636: Replace elastic1017-1031, a subtask of T221630: [Epic] Search platform - Hardware requests for 2019-2020, as Resolved.
Mon, Aug 19, 7:00 PM · Discovery-Search, Epic
RobH closed T221636: Replace elastic1017-1031 as Resolved.

Please note that this hardware was ordered on T226843 and will be installed via T230746. As such, this request is resolved/granted.

Mon, Aug 19, 7:00 PM · Discovery-Search (Current work), Operations, hardware-requests
RobH added a parent task for T230746: rack/setup/install elastic10[53-67].eqiad.wmnet: Unknown Object (Task).
Mon, Aug 19, 6:58 PM · Operations, ops-eqiad
RobH triaged T230746: rack/setup/install elastic10[53-67].eqiad.wmnet as Normal priority.
Mon, Aug 19, 6:58 PM · Operations, ops-eqiad

Fri, Aug 16

RobH closed Unknown Object (Task), a subtask of T230077: refresh/replace scs-ulsfo, as Resolved.
Fri, Aug 16, 9:03 PM · Operations, ops-ulsfo
RobH updated the task description for T230597: can't SSH to elastic2050.mgmt .
Fri, Aug 16, 3:22 PM · ops-codfw, DC-Ops, Discovery-Search (Current work), Operations
RobH moved T230597: can't SSH to elastic2050.mgmt from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board.
Fri, Aug 16, 3:21 PM · ops-codfw, DC-Ops, Discovery-Search (Current work), Operations
RobH assigned T230597: can't SSH to elastic2050.mgmt to Papaul.

IRC sync: Chatted with @Mathew.onipe, who let me know they had synced with @Papaul to take this offline on Monday to reset the power/bmc.

Fri, Aug 16, 3:21 PM · ops-codfw, DC-Ops, Discovery-Search (Current work), Operations
RobH added a comment to T230597: can't SSH to elastic2050.mgmt .

Please note this mgmt interface is still down:

Fri, Aug 16, 2:47 PM · ops-codfw, DC-Ops, Discovery-Search (Current work), Operations

Thu, Aug 15

RobH updated the task description for T218751: Audit down ports.
Thu, Aug 15, 8:10 PM · DC-Ops, Operations, ops-eqiad
RobH removed a project from T218751: Audit down ports: ops-ulsfo.

The single ops-ulsfo item has been fixed (it's not on the switch any longer), so I'm removing that tag.

Thu, Aug 15, 8:10 PM · DC-Ops, Operations, ops-eqiad
RobH closed T230077: refresh/replace scs-ulsfo as Resolved.

Ok, the new scs is now in place, with all connections documented and tested as working.

Thu, Aug 15, 8:07 PM · Operations, ops-ulsfo
RobH updated the task description for T230077: refresh/replace scs-ulsfo.
Thu, Aug 15, 8:07 PM · Operations, ops-ulsfo
RobH closed T206185: connect atlas-ulsfo to scs-ulsfo, a subtask of T230077: refresh/replace scs-ulsfo, as Resolved.
Thu, Aug 15, 8:07 PM · Operations, ops-ulsfo
RobH closed T206185: connect atlas-ulsfo to scs-ulsfo as Resolved.

Done: tested and working on port 8 on scs-ulsfo (baud rate 19200 8n1; the default on the scs is 9600, so it's the only port that differs on the scs console right now).

Thu, Aug 15, 8:07 PM · DC-Ops, Operations, ops-ulsfo, ops-eqiad
RobH updated the task description for T230077: refresh/replace scs-ulsfo.
Thu, Aug 15, 7:41 PM · Operations, ops-ulsfo
RobH updated the task description for T230077: refresh/replace scs-ulsfo.
Thu, Aug 15, 7:37 PM · Operations, ops-ulsfo
RobH updated the task description for T230077: refresh/replace scs-ulsfo.
Thu, Aug 15, 7:33 PM · Operations, ops-ulsfo
RobH updated the task description for T230077: refresh/replace scs-ulsfo.
Thu, Aug 15, 7:32 PM · Operations, ops-ulsfo
RobH updated the task description for T230077: refresh/replace scs-ulsfo.
Thu, Aug 15, 7:11 PM · Operations, ops-ulsfo

Wed, Aug 14

RobH placed T227538: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) up for grabs.
Wed, Aug 14, 4:53 PM · DC-Ops, Operations, ops-eqiad
RobH placed T227143: a7-eqiad pdu refresh up for grabs.
Wed, Aug 14, 4:53 PM · DC-Ops, Operations, ops-eqiad
RobH placed T227142: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) up for grabs.
Wed, Aug 14, 4:53 PM · DC-Ops, Operations, ops-eqiad
RobH placed T227138: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) up for grabs.
Wed, Aug 14, 4:52 PM · DC-Ops, Operations, ops-eqiad
RobH placed T227133: a8-eqiad pdu refresh (Thursday 9/19 @11am UTC) up for grabs.
Wed, Aug 14, 4:52 PM · DC-Ops, Operations, ops-eqiad
RobH placed T226782: a1-eqiad pdu refresh (Thursday 9/12 @11am UTC) up for grabs.
Wed, Aug 14, 4:52 PM · DC-Ops, Operations, ops-eqiad
RobH placed T227536: b1-eqiad pdu refresh (Thursday 10/10 @11am UTC) up for grabs.
Wed, Aug 14, 4:51 PM · DC-Ops, Operations, ops-eqiad
RobH added a comment to T224188: rack/setup/install (3) new osd ceph nodes.

Please do not assign this to me; it is awaiting installation by DC ops into 10G racks and is not on me.

Wed, Aug 14, 3:07 PM · ops-eqiad, Operations, cloud-services-team (Kanban), Cloud-Services

Tue, Aug 13

RobH updated the task description for T230077: refresh/replace scs-ulsfo.
Tue, Aug 13, 4:22 PM · Operations, ops-ulsfo
RobH added a comment to T230077: refresh/replace scs-ulsfo.

The cables have arrived for this. I'll go onsite on Wednesday, August 14th to swap out the scs-ulsfo console server.

Tue, Aug 13, 4:20 PM · Operations, ops-ulsfo

Fri, Aug 9

RobH closed T211368: update PDUs for eqsin (asset tag and other info) as Resolved.

I've gone ahead and populated the serial number with the asset tags. That will remove it from our error reporting. As we plan to eventually replace these with smart/per-outlet control units, this seemed an acceptable solution.

Fri, Aug 9, 9:48 PM · Operations, ops-eqsin
RobH closed T227911: msw1-eqsin/msw2-eqsin missing serial number as Resolved.

netbox updated

Fri, Aug 9, 6:43 PM · Operations, ops-eqsin

Wed, Aug 7

RobH added a subtask for T230077: refresh/replace scs-ulsfo: Unknown Object (Task).
Wed, Aug 7, 11:41 PM · Operations, ops-ulsfo
RobH added a subtask for T230077: refresh/replace scs-ulsfo: T206185: connect atlas-ulsfo to scs-ulsfo.
Wed, Aug 7, 11:40 PM · Operations, ops-ulsfo
RobH added a parent task for T206185: connect atlas-ulsfo to scs-ulsfo: T230077: refresh/replace scs-ulsfo.
Wed, Aug 7, 11:40 PM · DC-Ops, Operations, ops-ulsfo, ops-eqiad
RobH changed the status of T206185: connect atlas-ulsfo to scs-ulsfo from Open to Stalled.

this is now blocked on the new scs setup and patch cables on T230077

Wed, Aug 7, 11:07 PM · DC-Ops, Operations, ops-ulsfo, ops-eqiad
RobH moved T230077: refresh/replace scs-ulsfo from Backlog to In Progress on the ops-ulsfo board.
Wed, Aug 7, 11:07 PM · Operations, ops-ulsfo
RobH moved T206185: connect atlas-ulsfo to scs-ulsfo from Backlog to In Progress on the ops-ulsfo board.
Wed, Aug 7, 11:07 PM · DC-Ops, Operations, ops-ulsfo, ops-eqiad
RobH updated the task description for T230077: refresh/replace scs-ulsfo.
Wed, Aug 7, 10:55 PM · Operations, ops-ulsfo
RobH triaged T230077: refresh/replace scs-ulsfo as Normal priority.
Wed, Aug 7, 10:47 PM · Operations, ops-ulsfo
RobH added a comment to T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory.

I neglected to update this, but it passed all dell epsa tests without crash.

Wed, Aug 7, 10:13 PM · Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban)
RobH updated subscribers of T228447: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen.

Updated ticket to reflect 3 business days passing and approval from Nuria. @RobH can you provide the patch or let us know who's on clinic duty?
Thank you!

Wed, Aug 7, 6:06 PM · Operations, SRE-Access-Requests
RobH changed the status of Unknown Object (Task), a subtask of T227314: eqiad+codfw: 6x hardware request for swift backend (each site), from Stalled to Open.
Wed, Aug 7, 5:12 PM · hardware-requests, Operations

Tue, Aug 6

RobH updated the task description for T227541: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC).
Tue, Aug 6, 9:57 AM · DC-Ops, Operations, ops-eqiad

Mon, Aug 5

RobH closed T229124: add jclark to datacenter-ops group as Resolved.
Mon, Aug 5, 5:59 PM · SRE-Access-Requests, Operations
RobH updated the task description for T229124: add jclark to datacenter-ops group.
Mon, Aug 5, 5:59 PM · SRE-Access-Requests, Operations
RobH added a comment to T229124: add jclark to datacenter-ops group.

This was reviewed in the weekly SRE meeting. After discussion, it was decided that the dc operations user group will be managed by @wiki_willy as the DC Operations Manager.

Mon, Aug 5, 5:41 PM · SRE-Access-Requests, Operations

Wed, Jul 31

RobH added a comment to T226778: Install new PDUs in rows A/B (Top level tracking task).

In reviewing the comments of T227138#5354060 and T226778#5358383, and in my IRC discussions with @wiki_willy, I propose the following schedule of rack swaps and cadence options.

Wed, Jul 31, 5:00 PM · DC-Ops, Operations, ops-eqiad

Tue, Jul 30

RobH renamed T223463: (2019-09) Create secteam groups in admin.yaml and define permissions from Create secteam groups in admin.yaml and define permissions to (2019-09) Create secteam groups in admin.yaml and define permissions.
Tue, Jul 30, 8:11 PM · SRE-Access-Requests, Operations, Security-Team, Patch-For-Review
RobH assigned T223463: (2019-09) Create secteam groups in admin.yaml and define permissions to sbassett.
Tue, Jul 30, 8:10 PM · SRE-Access-Requests, Operations, Security-Team, Patch-For-Review
RobH updated the task description for T228447: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen.
Tue, Jul 30, 7:41 PM · Operations, SRE-Access-Requests
RobH updated the task description for T228447: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen.
Tue, Jul 30, 7:40 PM · Operations, SRE-Access-Requests
RobH assigned T228447: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen to cchen.

Please note we will need you to review and sign the L3 document. Once done, if @Nuria has not attached approval, assign to @Nuria.

Tue, Jul 30, 7:39 PM · Operations, SRE-Access-Requests
RobH updated the task description for T228447: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen.
Tue, Jul 30, 7:37 PM · Operations, SRE-Access-Requests
RobH added a parent task for T228924: rack/setup/install ganeti10([09]|1[0-8[).eqiad.wmnet: Unknown Object (Task).
Tue, Jul 30, 5:26 PM · ops-eqiad, vm-requests, Operations
RobH updated the task description for T227695: Requesting access to analytics-privatedata-users for mbsantos.
Tue, Jul 30, 4:58 PM · SRE-Access-Requests, Operations
RobH moved T229143: Access to HUE for Mayakpwiki from Backlog to Acknowledged on the Operations board.
Tue, Jul 30, 4:54 PM · Operations, Analytics
RobH moved T229143: Access to HUE for Mayakpwiki from Manager/NDA Approval/Confirmation to In Discussion on the SRE-Access-Requests board.
Tue, Jul 30, 4:54 PM · Operations, Analytics
RobH moved T229143: Access to HUE for Mayakpwiki from Untriaged to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Tue, Jul 30, 4:53 PM · Operations, Analytics
RobH added a comment to T229143: Access to HUE for Mayakpwiki.

This seems to be something that the Analytics team needs to handle directly, rather than ops clinic duty, as the directions for HUE require someone who is already an Admin on it to grant other access.

Tue, Jul 30, 4:53 PM · Operations, Analytics
RobH updated the task description for T229284: add all remaining new pdus to netbox.
Tue, Jul 30, 3:44 PM · DC-Ops, Operations, ops-eqiad

Mon, Jul 29

RobH updated the task description for T229284: add all remaining new pdus to netbox.
Mon, Jul 29, 10:18 PM · DC-Ops, Operations, ops-eqiad
RobH updated the task description for T226778: Install new PDUs in rows A/B (Top level tracking task).
Mon, Jul 29, 10:17 PM · DC-Ops, Operations, ops-eqiad
RobH reassigned T229284: add all remaining new pdus to netbox from RobH to Jclark-ctr.
Mon, Jul 29, 10:15 PM · DC-Ops, Operations, ops-eqiad
RobH updated the task description for T229284: add all remaining new pdus to netbox.
Mon, Jul 29, 10:13 PM · DC-Ops, Operations, ops-eqiad
RobH updated the task description for T229284: add all remaining new pdus to netbox.
Mon, Jul 29, 10:12 PM · DC-Ops, Operations, ops-eqiad
RobH created T229284: add all remaining new pdus to netbox.
Mon, Jul 29, 10:06 PM · DC-Ops, Operations, ops-eqiad
RobH updated the task description for T229243: remote hands setups for ganeti500[123].
Mon, Jul 29, 8:04 PM · Operations, ops-eqsin