RobH (Rob Halsell)
Operations Engineer, Administrator

Projects (24)

User Details

User Since: Nov 24 2014, 1:43 PM (239 w, 5 h)
Roles: Administrator
Availability: Available
IRC Nick: RobH
LDAP User: RobH
MediaWiki User: RobH

My GPG Key fingerprint = CB1F C7E7 0FF8 5DB2 6820 9C7E 75ED 14C7 0245 D22A

I am an Operations Engineer on Wikimedia's Datacenter Operations Team.

I am also the primary triage engineer for the hardware-requests project, as well as for the private S4 procurement space and the procurement project.

All questions involving allocation of hardware can initially be addressed at https://wikitech.wikimedia.org/wiki/Operations_requests.

Please note that private messages via Phabricator are not my preferred means of contact. Please feel free to contact me (robh) directly via IRC (Freenode), or email my @wikimedia.org address.

Recent Activity

Today

RobH added a parent task for T226444: rack/setup/install ganeti400[123]: Unknown Object (Task).
Mon, Jun 24, 7:11 PM · Patch-For-Review, Operations, ops-ulsfo
RobH triaged T226444: rack/setup/install ganeti400[123] as Normal priority.
Mon, Jun 24, 7:11 PM · Patch-For-Review, Operations, ops-ulsfo
RobH added a comment to T225889: Degraded RAID on db2043.

Are the disks being installed (and then failing) new disks or old decom disks?

Mon, Jun 24, 4:21 PM · DBA, Operations, ops-codfw
RobH added a parent task for T226424: update RE-S-X6-64G-S in cr[12]-eqiad: Unknown Object (Task).
Mon, Jun 24, 4:00 PM · Operations, netops, ops-eqiad
RobH triaged T226424: update RE-S-X6-64G-S in cr[12]-eqiad as Normal priority.
Mon, Jun 24, 4:00 PM · Operations, netops, ops-eqiad
RobH added a parent task for T226422: update RE-S-X6-64G-S in cr[12]-codfw: Unknown Object (Task).
Mon, Jun 24, 3:59 PM · netops, Operations, ops-codfw
RobH triaged T226422: update RE-S-X6-64G-S in cr[12]-codfw as Normal priority.
Mon, Jun 24, 3:59 PM · netops, Operations, ops-codfw

Fri, Jun 21

RobH reassigned T226274: rack/setup/install kafka-main100[1-5] from RobH to herron.

This looks correct, except it doesn't mention whether these need internal or external VLAN/IP addresses.

Fri, Jun 21, 6:59 PM · ops-eqiad, Operations
RobH added a parent task for T226274: rack/setup/install kafka-main100[1-5]: Unknown Object (Task).
Fri, Jun 21, 6:53 PM · ops-eqiad, Operations
RobH removed a subtask for T226274: rack/setup/install kafka-main100[1-5]: Unknown Object (Task).
Fri, Jun 21, 6:53 PM · ops-eqiad, Operations
RobH added a comment to T225720: poll power data for redeployment of esams/knams.

Ok, as it is now peak hours (according to @BBlack) for eqiad, I'm re-pulling all the power data now. Please note that I'll update the task description AFTER this post (and likely won't update task description summary until Friday AM.)

Fri, Jun 21, 1:31 AM · Traffic, DC-Ops, Operations

Thu, Jun 20

RobH added a comment to T225720: poll power data for redeployment of esams/knams.

Updated from irc chat and @BBlack.

Thu, Jun 20, 6:36 PM · Traffic, DC-Ops, Operations
RobH updated the task description for T225720: poll power data for redeployment of esams/knams.
Thu, Jun 20, 6:33 PM · Traffic, DC-Ops, Operations
RobH updated the task description for T225720: poll power data for redeployment of esams/knams.
Thu, Jun 20, 6:31 PM · Traffic, DC-Ops, Operations
RobH updated subscribers of T225720: poll power data for redeployment of esams/knams.

so for the QFX5100 (thanks @Papaul) the command is:

Thu, Jun 20, 5:14 PM · Traffic, DC-Ops, Operations
RobH added a comment to T225720: poll power data for redeployment of esams/knams.

Power data:

Thu, Jun 20, 5:06 PM · Traffic, DC-Ops, Operations
RobH added a comment to T225720: poll power data for redeployment of esams/knams.

commands to run:

Thu, Jun 20, 4:52 PM · Traffic, DC-Ops, Operations
RobH added a comment to T225720: poll power data for redeployment of esams/knams.

Ok, in checking, EQIAD seems to enter its PEAK usage around 20:00 GMT (so about a half an hour from now at 10:00 Pacific.)

Thu, Jun 20, 4:30 PM · Traffic, DC-Ops, Operations
RobH removed a project from T225704: eqiad: rack/setup/install (4) dbproxy systems.: ops-eqiad.
Thu, Jun 20, 3:43 PM · Patch-For-Review, Operations, DBA
RobH reassigned T225704: eqiad: rack/setup/install (4) dbproxy systems. from RobH to Marostegui.

Assigned to @Marostegui per irc sync up (dns records are live.)

Thu, Jun 20, 3:43 PM · Patch-For-Review, Operations, DBA
RobH updated the task description for T225704: eqiad: rack/setup/install (4) dbproxy systems..
Thu, Jun 20, 3:43 PM · Patch-For-Review, Operations, DBA

Thu, Jun 13

RobH updated subscribers of T225720: poll power data for redeployment of esams/knams.

My understanding is we won't be using any MX80s when this is all done, so I did not pull that info.

Thu, Jun 13, 1:11 PM · Traffic, DC-Ops, Operations
RobH updated the task description for T225720: poll power data for redeployment of esams/knams.
Thu, Jun 13, 1:09 PM · Traffic, DC-Ops, Operations
RobH added a comment to T225720: poll power data for redeployment of esams/knams.
5 $> ssh cr2-esams.wikimedia.org
--- JUNOS 13.3R8.7 built 2015-10-23 21:23:16 UTC
{master}
robh@re0.cr2-esams> show power
                         ^
syntax error, expecting <command>.
Thu, Jun 13, 12:56 PM · Traffic, DC-Ops, Operations
RobH triaged T225720: poll power data for redeployment of esams/knams as Normal priority.
Thu, Jun 13, 12:54 PM · Traffic, DC-Ops, Operations
RobH updated the task description for T225704: eqiad: rack/setup/install (4) dbproxy systems..
Thu, Jun 13, 10:06 AM · Patch-For-Review, Operations, DBA
RobH added a parent task for T225704: eqiad: rack/setup/install (4) dbproxy systems.: Unknown Object (Task).
Thu, Jun 13, 10:05 AM · Patch-For-Review, Operations, DBA
RobH triaged T225704: eqiad: rack/setup/install (4) dbproxy systems. as Normal priority.
Thu, Jun 13, 10:05 AM · Patch-For-Review, Operations, DBA
RobH closed Unknown Object (Task), a subtask of T202367: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4], as Resolved.
Thu, Jun 13, 9:29 AM · Patch-For-Review, DBA

Wed, Jun 12

RobH placed T221785: ulsfo netbox updates up for grabs.
Wed, Jun 12, 4:19 PM · netbox, Operations, ops-ulsfo
RobH added a comment to T221785: ulsfo netbox updates.

I'm not sure who added the atlas-ulsfo serial since I commented I couldn't get it. Either I pulled it out of the rack to do it (doubtful or I'd have updated this task) or someone else pulled it somehow?

Wed, Jun 12, 4:18 PM · netbox, Operations, ops-ulsfo
RobH closed T221785: ulsfo netbox updates as Resolved.
Wed, Jun 12, 4:17 PM · netbox, Operations, ops-ulsfo
RobH moved T223216: Decommission db2034 from Backlog to Decommission on the ops-codfw board.
Wed, Jun 12, 1:11 PM · Operations, decommission, ops-codfw
RobH moved T223885: Decommission db2036 from Backlog to Decommission on the ops-codfw board.
Wed, Jun 12, 1:11 PM · decommission, Operations, ops-codfw
RobH moved T223949: lvs2002 possible broken BBU from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board.
Wed, Jun 12, 1:11 PM · ops-codfw, Operations
RobH moved T223950: Decommission db2041 from Backlog to Decommission on the ops-codfw board.
Wed, Jun 12, 1:11 PM · decommission, Operations, ops-codfw
RobH moved T224250: Setup new msw1-codfw from Backlog to Racking Tasks on the ops-codfw board.
Wed, Jun 12, 1:11 PM · ops-codfw, netops, Operations
RobH moved T224528: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet from Backlog to Racking Tasks on the ops-codfw board.
Wed, Jun 12, 1:11 PM · Cloud-Services, Operations, ops-codfw
RobH moved T224603: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] from Backlog to Racking Tasks on the ops-codfw board.
Wed, Jun 12, 1:11 PM · ops-codfw, Operations
RobH moved T224079: Decommission db2040 from Backlog to Decommission on the ops-codfw board.
Wed, Jun 12, 1:10 PM · ops-codfw, Operations, decommission
RobH moved T224720: Decommission db2037 from Backlog to Decommission on the ops-codfw board.
Wed, Jun 12, 1:10 PM · Operations, ops-codfw, decommission
RobH moved T225090: Decommission db2042 from Backlog to Decommission on the ops-codfw board.
Wed, Jun 12, 1:10 PM · Operations, ops-codfw, decommission
RobH moved T225131: Degraded RAID on es2003 from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board.
Wed, Jun 12, 1:10 PM · Operations, ops-codfw
RobH added a comment to T214903: labsdb1002-array1: status clarification.

It seems the server was a Cisco system (per a comment on task T146455), but this is a Dell disk shelf, which is a bit strange.

Wed, Jun 12, 8:32 AM · decommission, DC-Ops, cloud-services-team (Kanban)
RobH reassigned T214903: labsdb1002-array1: status clarification from RobH to Cmjohnson.

It would appear so. However, I cannot confirm anything from here.

Wed, Jun 12, 8:31 AM · decommission, DC-Ops, cloud-services-team (Kanban)
RobH moved T214903: labsdb1002-array1: status clarification from Backlog to Ready for Decommission on the decommission board.
Wed, Jun 12, 8:30 AM · decommission, DC-Ops, cloud-services-team (Kanban)
RobH moved T220590: Decom ms-be101[345] from Backlog to Ready for Decommission on the decommission board.
Wed, Jun 12, 8:30 AM · decommission, User-fgiunchedi, media-storage, Operations
RobH moved T221068: decom ms-be201[345] from Backlog to Ready for Decommission on the decommission board.
Wed, Jun 12, 8:30 AM · decommission, ops-codfw, media-storage, User-fgiunchedi, Operations
RobH moved T223216: Decommission db2034 from Backlog to Ready for Decommission on the decommission board.
Wed, Jun 12, 8:30 AM · Operations, decommission, ops-codfw
RobH moved T223217: Decommission db1064 from Backlog to Ready for Decommission on the decommission board.
Wed, Jun 12, 8:30 AM · Operations, ops-eqiad, decommission
RobH moved T223950: Decommission db2041 from Backlog to Ready for Decommission on the decommission board.
Wed, Jun 12, 8:30 AM · decommission, Operations, ops-codfw
RobH moved T223885: Decommission db2036 from Backlog to Ready for Decommission on the decommission board.
Wed, Jun 12, 8:30 AM · decommission, Operations, ops-codfw
RobH moved T224079: Decommission db2040 from Backlog to Ready for Decommission on the decommission board.
Wed, Jun 12, 8:30 AM · ops-codfw, Operations, decommission
RobH moved T224223: decommission lvs100[123456].wikimedia.org from Backlog to Ready for Decommission on the decommission board.
Wed, Jun 12, 8:30 AM · Traffic, Operations, ops-eqiad, DC-Ops, decommission
RobH moved T224268: Decommission rhenium from Backlog to Ready for Decommission on the decommission board.
Wed, Jun 12, 8:30 AM · Operations, ops-eqiad, decommission
RobH moved T224475: Return sulfur to spares from Backlog to Ready for Decommission on the decommission board.
Wed, Jun 12, 8:30 AM · decommission, ops-eqiad, Operations
RobH moved T224720: Decommission db2037 from Backlog to Ready for Decommission on the decommission board.
Wed, Jun 12, 8:30 AM · Operations, ops-codfw, decommission
RobH moved T220002: Decommission dbstore1001, dbstore2001, dbstore2002 from Backlog to Ready for Decommission on the decommission board.
Wed, Jun 12, 8:30 AM · DC-Ops, decommission, Goal, DBA
RobH moved T225090: Decommission db2042 from Backlog to Ready for Decommission on the decommission board.
Wed, Jun 12, 8:30 AM · Operations, ops-codfw, decommission

Fri, Jun 7

RobH reassigned T224188: rack/setup/install (3) new osd ceph nodes from Andrew to ayounsi.

Ok, I've synced up with @Bstorm via IRC, and we have the following questions to be addressed by our network admin(s) to ensure we aren't breaking any rules:

Fri, Jun 7, 6:33 PM · ops-eqiad, Operations, cloud-services-team (Kanban), Cloud-Services
RobH moved T225062: Requesting access to deployment cluster for awight from Untriaged to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Fri, Jun 7, 5:44 PM · Operations, SRE-Access-Requests

Wed, Jun 5

RobH added a comment to T224180: Send some LibreNMS alerts to dcops and netops only.

So I'd just email the Google group. The default setting for the folks in that group (DC ops) is to get email updates (unless they have disabled it).

Wed, Jun 5, 7:07 PM · Operations, DC-Ops, netops, observability
RobH added a comment to T225137: codfw humidity too high.

Since this may require CyrusOne techs to enter our cage, I've assigned this to @Papaul to set up, arrange, and handle directly with CyrusOne support. If I need to handle this instead (due to onsite time constraints), please let me know and assign this back to me!

Wed, Jun 5, 6:55 PM · Operations, ops-codfw
RobH triaged T225137: codfw humidity too high as Normal priority.
Wed, Jun 5, 6:55 PM · Operations, ops-codfw
RobH changed the status of T225121: upgrade msw1-eqiad from EX4200 to EX4300 from Open to Stalled.

Please note @Papaul is working with @ayounsi to upgrade the codfw msw1 on T224250. The current plan is to allow that to complete, and then replicate that work for eqiad.

Wed, Jun 5, 5:03 PM · netops, ops-eqiad, Operations
RobH renamed T225121: upgrade msw1-eqiad from EX4200 to EX4300 from "upgrade mr1-eqiad from EX4200 to EX4300" to "upgrade msw1-eqiad from EX4200 to EX4300".
Wed, Jun 5, 5:02 PM · netops, ops-eqiad, Operations
RobH added a parent task for T224250: Setup new msw1-codfw: Unknown Object (Task).
Wed, Jun 5, 5:00 PM · ops-codfw, netops, Operations
RobH added a parent task for T225121: upgrade msw1-eqiad from EX4200 to EX4300: Unknown Object (Task).
Wed, Jun 5, 5:00 PM · netops, ops-eqiad, Operations
RobH triaged T225121: upgrade msw1-eqiad from EX4200 to EX4300 as Normal priority.
Wed, Jun 5, 5:00 PM · netops, Operations, ops-eqiad

Tue, Jun 4

RobH assigned T225035: cp3035 PS Redundancy Lost to wiki_willy.

This system is no longer under warranty.

Tue, Jun 4, 10:38 PM · Traffic, Operations, ops-esams
RobH added a comment to T214183: Setup graphs for power usage readings in Grafana.

@fgiunchedi: https://grafana.wikimedia.org/d/cq0ZowkZz/pdus?orgId=1 lists:

Tue, Jun 4, 7:38 PM · DC-Ops, observability
RobH added a project to T222109: decommission frav1001.frack.eqiad.wmnet: decommission.
Tue, Jun 4, 6:02 PM · decommission, Operations, fundraising-tech-ops, ops-eqiad, DC-Ops
RobH added a project to T203520: decommission thulium.frack.eqiad.wmnet: decommission.
Tue, Jun 4, 6:02 PM · decommission, ops-eqiad, Operations
RobH added a project to T220002: Decommission dbstore1001, dbstore2001, dbstore2002: decommission.
Tue, Jun 4, 6:01 PM · DC-Ops, decommission, Goal, DBA
RobH reassigned T187456: Decommission labstore100[123] and their disk shelves from RobH to Cmjohnson.

For some reason this lacked the decommission tag and I didn't know about it until @MoritzMuehlenhoff pinged me about it yesterday.

Tue, Jun 4, 5:07 PM · decommission, cloud-services-team (Kanban), Data-Services, Operations, DC-Ops, ops-eqiad
RobH edited projects for T187456: Decommission labstore100[123] and their disk shelves, added: decommission; removed Patch-For-Review.
Tue, Jun 4, 5:04 PM · decommission, cloud-services-team (Kanban), Data-Services, Operations, DC-Ops, ops-eqiad
RobH updated the task description for T187456: Decommission labstore100[123] and their disk shelves.
Tue, Jun 4, 5:03 PM · decommission, cloud-services-team (Kanban), Data-Services, Operations, DC-Ops, ops-eqiad
RobH added a comment to T187456: Decommission labstore100[123] and their disk shelves.

Switch port info:

Tue, Jun 4, 4:39 PM · decommission, cloud-services-team (Kanban), Data-Services, Operations, DC-Ops, ops-eqiad
RobH updated the task description for T187456: Decommission labstore100[123] and their disk shelves.
Tue, Jun 4, 4:29 PM · decommission, cloud-services-team (Kanban), Data-Services, Operations, DC-Ops, ops-eqiad
RobH updated the task description for T187456: Decommission labstore100[123] and their disk shelves.
Tue, Jun 4, 4:28 PM · decommission, cloud-services-team (Kanban), Data-Services, Operations, DC-Ops, ops-eqiad

May 22 2019

RobH added a comment to T224180: Send some LibreNMS alerts to dcops and netops only.

I prefer we open tasks for anything requiring actual work, since yet another inbox spam is just that, more spam.

May 22 2019, 10:52 PM · Operations, DC-Ops, netops, observability
RobH assigned T224188: rack/setup/install (3) new osd ceph nodes to Andrew.

@Andrew or @Bstorm: Since you both were commenting on the hardware specification task, I'm assuming you would also be the ones to ask about the networking requirements/vlans for these systems as well as the redundancy requirements?

May 22 2019, 9:08 PM · ops-eqiad, Operations, cloud-services-team (Kanban), Cloud-Services
RobH created T224188: rack/setup/install (3) new osd ceph nodes.
May 22 2019, 8:59 PM · ops-eqiad, Operations, cloud-services-team (Kanban), Cloud-Services
RobH updated the task description for T223493: rack/setup/install kafka-main200[1-5].
May 22 2019, 5:25 PM · ops-codfw, Operations
RobH added a comment to T223493: rack/setup/install kafka-main200[1-5].

Please note these are showing as an error state of staged in netbox, when they are not yet installed with an OS and have not yet run puppet.

May 22 2019, 5:10 PM · ops-codfw, Operations

May 20 2019

RobH added a comment to T222383: pull decom hardware and ship to Harry/OIT @ SF office.

TRACKING # 1ZA19A021290889548

May 20 2019, 7:32 PM · ops-codfw, Operations

May 16 2019

RobH added a comment to T221068: decom ms-be201[345].

Please note I didn't actually change the state in puppet, since as @faidon pointed out, I'm not sure if we need to change the report, or the process, or what. I did add the decommission project so it is easy to look at the workboard for decommission and the report output and match hostnames to tasks.

May 16 2019, 10:13 PM · decommission, ops-codfw, media-storage, User-fgiunchedi, Operations
RobH added a project to T221068: decom ms-be201[345]: decommission.

Please note these show 'decommission' in netbox when they are still actively calling into puppet. So they should be active in netbox until they are added to the decommission queue and shifted to dc ops to decom them.

May 16 2019, 9:50 PM · decommission, ops-codfw, media-storage, User-fgiunchedi, Operations
RobH added a project to T220590: Decom ms-be101[345]: decommission.

Please note these show 'decommission' in netbox when they are still actively calling into puppet. So they should be active in netbox until they are added to the decommission queue and shifted to dc ops to decom them.

May 16 2019, 8:25 PM · decommission, User-fgiunchedi, media-storage, Operations
RobH created T223468: audit offline codfw devices with rack location assigned.
May 16 2019, 5:42 PM · ops-codfw, ops-eqiad, Operations, Operations-Software-Development, netbox, DC-Ops
RobH updated the task description for T223467: Cleanup/delete recycled and returned (lease tranche 1) hardware from Netbox.
May 16 2019, 5:32 PM · DC-Ops, Operations
RobH triaged T223467: Cleanup/delete recycled and returned (lease tranche 1) hardware from Netbox as Normal priority.
May 16 2019, 5:30 PM · DC-Ops, Operations
RobH added a comment to T209425: Decommission rdb2001, rdb2002.

I don't know why this needs my input? This sounds like a standard decom, unless I misunderstand it.

May 16 2019, 4:45 PM · Patch-For-Review, ops-codfw, User-jijiki, decommission, Operations

May 15 2019

RobH renamed T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory from "VMs on cloudvirt1015 crashing" to "VMs on cloudvirt1015 crashing - bad mainboard/memory".
May 15 2019, 8:50 PM · Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban)
RobH reassigned T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory from RobH to Cmjohnson.

Error output:

May 15 2019, 8:50 PM · Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban)
RobH claimed T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory.
May 15 2019, 8:42 PM · Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban)
RobH added a comment to T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory.

Ok, so this has had CPU issues from the get-go, tracked on both T215012 and T171473. It seems that the CPUs have been swapped, but not the mainboard. Considering it's still throwing CPU errors after all the CPU swaps, I advise we swap the mainboard next.

May 15 2019, 8:41 PM · Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban)
RobH added a comment to T222922: wmf7622 wont powercycle (cannot be allocated from spares).

Hello, process question about this. The current flowchart for states doesn't allow Spare->Failed to happen, so there are some implicit assumptions about that inside of, for example, the PuppetDB netbox report (the Failed state is expected to be in Puppet since it implicitly comes from a production state). Is it the preference that boxes like this go through a Failed state (and thus never appear in Puppet)? Thanks.

May 15 2019, 6:31 PM · Operations, ops-eqiad

May 14 2019

RobH added a member for acl*procurement-review: Gilles.
May 14 2019, 10:15 PM
RobH reassigned T223332: update *.tools.wmflabs.org certificate from RobH to aborrero.

@aborrero: Since you were the one to confirm the certificate usage on the procurement task, would you also be the person to implement the renewed certificate/keypair?

May 14 2019, 9:02 PM · cloud-services-team (Kanban), Operations