
herron (Keith Herron)
Ops Engineer

User Details

User Since
May 30 2017, 5:25 PM (217 w, 4 d)
Availability
Available
IRC Nick
herron
LDAP User
Herron
MediaWiki User
Unknown

Recent Activity

Tue, Jul 27

herron closed T281266: Decommission old ELK5 Logstash cluster as Resolved.

All elk5 hardware has been decommed at this point.

Tue, Jul 27, 5:39 PM · SRE Observability (FY2021/2022-Q1), Patch-For-Review, SRE
herron added a parent task for T287496: decommission servers logstash202[012].codfw.wmnet: T281266: Decommission old ELK5 Logstash cluster.
Tue, Jul 27, 5:30 PM · decommission-hardware
herron added a subtask for T281266: Decommission old ELK5 Logstash cluster: T287496: decommission servers logstash202[012].codfw.wmnet.
Tue, Jul 27, 5:30 PM · SRE Observability (FY2021/2022-Q1), Patch-For-Review, SRE
herron assigned T287496: decommission servers logstash202[012].codfw.wmnet to Papaul.
Tue, Jul 27, 5:29 PM · decommission-hardware
herron updated the task description for T287496: decommission servers logstash202[012].codfw.wmnet.
Tue, Jul 27, 5:29 PM · decommission-hardware
herron triaged T287496: decommission servers logstash202[012].codfw.wmnet as Medium priority.
Tue, Jul 27, 4:31 PM · decommission-hardware
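
The task titles above use a compact bracket notation for groups of hosts, such as logstash202[012].codfw.wmnet. As a minimal sketch of how that notation can be read, assuming each bracket group lists single characters (no ranges such as [0-2]), the hypothetical helper `expand_hosts` below expands a pattern into individual host names. It is an illustration only, not the tooling used on these tasks.

```python
import itertools
import re


def expand_hosts(pattern: str) -> list:
    """Expand bracketed character groups, e.g. 'kafka[12]00[123]'.

    Hypothetical helper: assumes every [...] group is a plain list of
    single characters, which matches the task titles but is an assumption.
    """
    # Splitting on a capturing group alternates literal text (even indexes)
    # with the contents of each bracket group (odd indexes).
    parts = re.split(r"\[([^\]]+)\]", pattern)
    choices = [
        list(part) if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    # The cartesian product of all choices yields every concrete name.
    return ["".join(combo) for combo in itertools.product(*choices)]


for host in expand_hosts("logstash202[012].codfw.wmnet"):
    print(host)
```

Under that reading, the pattern in T287496 covers three hosts: logstash2020, logstash2021 and logstash2022 in codfw.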

Thu, Jul 22

herron updated the task description for T286065: Switch buffer re-partition - Eqiad Row C.
Thu, Jul 22, 3:10 PM · Patch-For-Review, DBA, Analytics, Infrastructure-Foundations, SRE, netops
herron updated the task description for T286065: Switch buffer re-partition - Eqiad Row C.
Thu, Jul 22, 2:49 PM · Patch-For-Review, DBA, Analytics, Infrastructure-Foundations, SRE, netops

Mon, Jul 19

herron added a comment to T253810: Alert on ECC warnings in SEL.

I've PoC'd this with check_ipmi_sensor, which supports checking the SEL
...
The downside of this approach is potentially old SEL entries that we'll have to clear as they are surfaced on first deployment. Going forward, the SEL will need clearing after such errors so that the icinga alert can actually clear. Since deploying this means we'll be routinely clearing the SEL on errors, I think it is important to log its entries elsewhere too, and for that we can deploy freeipmi-ipmiseld, which polls the SEL and logs to syslog.

Mon, Jul 19, 4:56 PM · User-MoritzMuehlenhoff, Wikimedia-Incident, observability, SRE
herron added a comment to T225005: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345].

sure, sounds good to me!

Mon, Jul 19, 2:53 PM · Analytics-Radar, Patch-For-Review, Services (watching), Platform Team Legacy (Watching / External), User-herron, SRE
herron added a comment to T286911: Upgrade MXes to Bullseye.

+1 for option 2, I think that will be a more straightforward approach overall.

Mon, Jul 19, 2:26 PM · Infrastructure-Foundations, Mail

Jul 1 2021

herron triaged T285949: Redirect https://lists.wikimedia.org/pipermail/foobar/ to https://lists.wikimedia.org/hyperkitty/list/foobar@lists.wikimedia.org/ as Medium priority.
Jul 1 2021, 5:34 PM · SRE, Wikimedia-Mailing-lists
herron triaged T285569: Automated uploads of minimal & comprehensible timeseries metrics for statuspage display as Medium priority.
Jul 1 2021, 5:29 PM · User-jbond, SRE-OnFire, Patch-For-Review, observability, SRE
herron triaged T285769: Ensure SRE team has a good understanding of how & when to declare an outage on the status page; & it is easy to do so as Medium priority.
Jul 1 2021, 5:29 PM · SRE Observability (FY2021/2022-Q1), SRE-OnFire, SRE
herron triaged T285931: Grant Access to mediawiki gerrit group for divec as Medium priority.
Jul 1 2021, 5:27 PM · MediaWiki-Gerrit-Group-Requests, SRE
herron triaged T285936: Please add btullis@wikimedia.org to the analytics-alerts mailing list as Medium priority.
Jul 1 2021, 5:27 PM · SRE
herron added a comment to T285936: Please add btullis@wikimedia.org to the analytics-alerts mailing list.

Hi @BTullis, sure, I've just added you to analytics-alerts and you should be receiving these emails now.

Jul 1 2021, 5:26 PM · SRE
herron triaged T285927: Add the possibility to deploy calico on kubernetes master nodes as Medium priority.
Jul 1 2021, 5:18 PM · Patch-For-Review, Kubernetes, Machine-Learning-Team, SRE, serviceops
herron triaged T285835: Thanos bucket operations sporadic errors as High priority.
Jul 1 2021, 5:17 PM · Patch-For-Review, SRE Observability (FY2021/2022-Q1), User-fgiunchedi, SRE
herron triaged T285534: mtail testing infrastructure prints python deprecation warnings as Medium priority.
Jul 1 2021, 5:16 PM · good first task, SRE, observability
herron triaged T285533: mtail testing infrastructure does not report Runtime errors as Medium priority.
Jul 1 2021, 5:15 PM · observability, SRE
herron triaged T256641: Delay spinner showing for graphs for 1s as Medium priority.
Jul 1 2021, 5:13 PM · Patch-For-Review, serviceops, SRE, Graphoid
herron closed T285580: Grant Access to ldap/wmf for fgoodwin as Resolved.

Hi @FGoodwin, your ldap account has been added to group wmf. I'll transition this to resolved now, but please don't hesitate to reopen if any followup is needed. Thanks!

Jul 1 2021, 2:57 PM · SRE, LDAP-Access-Requests
herron triaged T285899: Root access to AQS cluster as Medium priority.
Jul 1 2021, 2:42 PM · SRE, Platform Engineering, SRE-Access-Requests
herron moved T285899: Root access to AQS cluster from Untriaged to SRE Meeting Review on the SRE-Access-Requests board.

Looks reasonable to me, and thanks much for writing the patch!

Jul 1 2021, 2:41 PM · SRE, Platform Engineering, SRE-Access-Requests
herron closed T285877: New production ssh key for sbassett as Resolved.

Key updated, but Gerrit was unable to update the task due to policy. Resolving!

Jul 1 2021, 2:12 PM · SecTeam-Processed, SRE-Access-Requests, SRE, Security

Jun 30 2021

herron added a comment to T285877: New production ssh key for sbassett.

Verified face to face via a google meet session

Jun 30 2021, 7:38 PM · SecTeam-Processed, SRE-Access-Requests, SRE, Security
herron closed T285326: Grant Access to ldap/wmf for TChin as Resolved.

Hi @tchin, your ldap account is now a member of the wmf group. I'll transition to resolved now but please don't hesitate to reopen if any follow-up is needed. Thanks!

Jun 30 2021, 6:38 PM · SRE, LDAP-Access-Requests
herron added a comment to T285754: Requesting access to analytics cluster for Ben Tullis.

@herron, so we should do step 1 and then help Ben do step 2?

Jun 30 2021, 4:01 PM · LDAP-Access-Requests, SRE, SRE-Access-Requests
herron added a comment to T285754: Requesting access to analytics cluster for Ben Tullis.

I also have an item on my checklist to say that I should be in the cn=ops LDAP group.

There are instructions on how I can add myself to that group, but only once I have sudo access.

Can anyone confirm this requirement? If so, can it be done on this ticket, or should I raise a new one?

Jun 30 2021, 3:02 PM · LDAP-Access-Requests, SRE, SRE-Access-Requests

Jun 29 2021

herron added a comment to T285754: Requesting access to analytics cluster for Ben Tullis.

Shell account has been created, and ldap account has been added to group wmf

Jun 29 2021, 6:53 PM · LDAP-Access-Requests, SRE, SRE-Access-Requests
herron updated the task description for T285754: Requesting access to analytics cluster for Ben Tullis.
Jun 29 2021, 6:19 PM · LDAP-Access-Requests, SRE, SRE-Access-Requests
herron moved T285754: Requesting access to analytics cluster for Ben Tullis from Untriaged to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.

Sure I'll go ahead and prep a patch. I may have missed it, but what realname should be used for btullis?

Jun 29 2021, 6:10 PM · LDAP-Access-Requests, SRE, SRE-Access-Requests
herron added a comment to T285754: Requesting access to analytics cluster for Ben Tullis.

@razzi will take care of this, and I will follow up with SRE on enabling root access after the initial access is granted.

Jun 29 2021, 5:22 PM · LDAP-Access-Requests, SRE, SRE-Access-Requests
herron updated the task description for T285754: Requesting access to analytics cluster for Ben Tullis.
Jun 29 2021, 5:15 PM · LDAP-Access-Requests, SRE, SRE-Access-Requests

Jun 28 2021

herron removed a project from T277629: Create new group for root access to snapshot*, dumpsdata* and labstore1006,7 with holger in it: SRE-Access-Requests.

Hey @ArielGlenn, since this has been idling in the access request queue for some time, I'm going to untag SRE-Access-Requests for the time being. If any follow-up is needed, please do re-tag. Thanks!

Jun 28 2021, 7:36 PM · SRE, Dumps-Generation
herron moved T285436: Access request to superset for user natalia-rodriguez from Awaiting User Input to Manager Approval Pending on the LDAP-Access-Requests board.
Jun 28 2021, 7:29 PM · SRE, LDAP-Access-Requests
herron moved T285580: Grant Access to ldap/wmf for fgoodwin from Awaiting User Input to Manager Approval Pending on the LDAP-Access-Requests board.
Jun 28 2021, 7:29 PM · SRE, LDAP-Access-Requests
herron assigned T285326: Grant Access to ldap/wmf for TChin to tchin.

Hi @tchin could you please coordinate obtaining a comment of approval on this task from your manager?

Jun 28 2021, 7:29 PM · SRE, LDAP-Access-Requests
herron updated the task description for T285580: Grant Access to ldap/wmf for fgoodwin.
Jun 28 2021, 7:26 PM · SRE, LDAP-Access-Requests
herron reassigned T285580: Grant Access to ldap/wmf for fgoodwin from MNadrofsky to FGoodwin.

Hi @FGoodwin could you please coordinate obtaining a comment of approval on this task from your manager?

Jun 28 2021, 7:26 PM · SRE, LDAP-Access-Requests
herron moved T285436: Access request to superset for user natalia-rodriguez from Backlog to Awaiting User Input on the LDAP-Access-Requests board.
Jun 28 2021, 6:55 PM · SRE, LDAP-Access-Requests
herron assigned T285436: Access request to superset for user natalia-rodriguez to NRodriguez.

Hi @NRodriguez there are a couple steps to check off in order to move forward on this request. When you have a moment could you please...

Jun 28 2021, 2:44 PM · SRE, LDAP-Access-Requests
herron updated the task description for T285436: Access request to superset for user natalia-rodriguez.
Jun 28 2021, 2:30 PM · SRE, LDAP-Access-Requests
herron updated the task description for T285436: Access request to superset for user natalia-rodriguez.
Jun 28 2021, 2:30 PM · SRE, LDAP-Access-Requests

Jun 24 2021

herron closed T279342: Migrate colocated kafka-logging brokers to dedicated kafka-logging hosts as Resolved.
Jun 24 2021, 6:52 PM · Patch-For-Review, observability
herron updated the task description for T279342: Migrate colocated kafka-logging brokers to dedicated kafka-logging hosts.
Jun 24 2021, 6:51 PM · Patch-For-Review, observability

Jun 14 2021

herron closed Restricted Task, a subtask of T234854: Upgrade ELK Stack to version 7, as Resolved.
Jun 14 2021, 5:02 PM · observability, Patch-For-Review, SRE, Wikimedia-Logstash
herron closed T234854: Upgrade ELK Stack to version 7 as Resolved.
Jun 14 2021, 3:58 PM · observability, Patch-For-Review, SRE, Wikimedia-Logstash
herron updated the task description for T234854: Upgrade ELK Stack to version 7.
Jun 14 2021, 3:58 PM · observability, Patch-For-Review, SRE, Wikimedia-Logstash
herron closed T234854: Upgrade ELK Stack to version 7, a subtask of T272655: Phatality doesn't work with Kibana 7, as Resolved.
Jun 14 2021, 3:57 PM · observability, Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), Wikimedia-Logstash, Phatality
herron closed T247014: ELK7 shards failed errors when loading saved objects, e.g. "field expansion matches too many fields, limit: 1024, got: 1726" as Resolved.
Jun 14 2021, 3:56 PM · observability, Patch-For-Review, SRE, Wikimedia-Logstash
herron closed T247014: ELK7 shards failed errors when loading saved objects, e.g. "field expansion matches too many fields, limit: 1024, got: 1726", a subtask of T234854: Upgrade ELK Stack to version 7, as Resolved.
Jun 14 2021, 3:55 PM · observability, Patch-For-Review, SRE, Wikimedia-Logstash
Krinkle awarded T233134: logstash-beta.wmflabs.org does not receive any mediawiki events an Orange Medal token.
Jun 14 2021, 1:50 AM · observability, SRE Observability, User-Majavah, Wikimedia-Logstash, Beta-Cluster-Infrastructure

Jun 11 2021

herron added a comment to T243057: Move Prometheus off eqsin/ulsfo/esams bastions.

That's interesting about the same behavior happening in the opposite direction with a disk add. I guess that makes some sense in a bug-ish kind of way -- network device being renumbered as a side-effect of changing the VM device layout. I was worried it would happen on the next reboot, but sounds like this should be stable unless we were to change the disk layout again. Feeling much better about leaving it as-is now.

Jun 11 2021, 7:11 PM · Patch-For-Review, SRE, observability

Jun 7 2021

herron added a comment to T243057: Move Prometheus off eqsin/ulsfo/esams bastions.

The 150G secondary disk has been removed from the prometheus3001 VM.

Jun 7 2021, 7:45 PM · Patch-For-Review, SRE, observability

Jun 4 2021

herron added a comment to T243057: Move Prometheus off eqsin/ulsfo/esams bastions.

+1! I plan to deploy the patch above (now amended to 80G retention), move the data, and release the vdb device from prometheus3001 next week.

Jun 4 2021, 3:45 PM · Patch-For-Review, SRE, observability

Jun 3 2021

herron updated the task description for T243057: Move Prometheus off eqsin/ulsfo/esams bastions.
Jun 3 2021, 2:04 PM · Patch-For-Review, SRE, observability
herron added a comment to T243057: Move Prometheus off eqsin/ulsfo/esams bastions.

Prometheus disk usage (106G) on prometheus3001 is larger than what can comfortably fit alongside the OS on /dev/vda (128G), so a 150G /dev/vdb was added as /srv.

Jun 3 2021, 2:04 PM · Patch-For-Review, SRE, observability

May 24 2021

herron reassigned T283507: decommission logstash102[012] from herron to Cmjohnson.
May 24 2021, 4:14 PM · observability, decommission-hardware
herron added a comment to T283507: decommission logstash102[012].

cookbooks.sre.hosts.decommission executed by herron@cumin1001 for hosts: logstash1020.eqiad.wmnet

  • logstash1020.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
May 24 2021, 2:48 PM · observability, decommission-hardware
herron added a subtask for T281266: Decommission old ELK5 Logstash cluster: T283507: decommission logstash102[012].
May 24 2021, 2:46 PM · SRE Observability (FY2021/2022-Q1), Patch-For-Review, SRE
herron added a parent task for T283507: decommission logstash102[012]: T281266: Decommission old ELK5 Logstash cluster.
May 24 2021, 2:45 PM · observability, decommission-hardware
herron triaged T283507: decommission logstash102[012] as Medium priority.
May 24 2021, 2:45 PM · observability, decommission-hardware
herron updated the task description for T283213: Move technical Wikimedia IRC bots from freenode to Libera Chat.
May 24 2021, 2:04 PM · User-bd808, Release-Engineering-Team (Radar), User-brennen, wikimedia-irc-libera, Wikimedia-General-or-Unknown

May 21 2021

herron committed rOHPUd6cffce99696: remove icinga[12]001 addresses from firewall rules (authored by herron).
May 21 2021, 3:10 PM
herron updated the task description for T282575: decommission mwlog1001.
May 21 2021, 3:05 PM · Patch-For-Review, SRE, observability, decommission-hardware
herron reassigned T282575: decommission mwlog1001 from herron to Cmjohnson.
May 21 2021, 3:05 PM · Patch-For-Review, SRE, observability, decommission-hardware

May 20 2021

herron updated the task description for T283213: Move technical Wikimedia IRC bots from freenode to Libera Chat.
May 20 2021, 2:16 PM · User-bd808, Release-Engineering-Team (Radar), User-brennen, wikimedia-irc-libera, Wikimedia-General-or-Unknown

May 19 2021

herron triaged T283204: Clarify 'wipe bootloader' step in sre.hosts.decommission as Medium priority.
May 19 2021, 9:29 PM · Infrastructure-Foundations, SRE-tools

May 17 2021

herron awarded T283013: Migrate beta cluster to ELK7 a Party Time token.
May 17 2021, 6:30 PM · User-Majavah, Beta-Cluster-Infrastructure
herron added a comment to T283013: Migrate beta cluster to ELK7.

Puppet manifests are trying to do that but that line is not present on deployment-logstash04 for some reason. Is that there on prod?

May 17 2021, 5:43 PM · User-Majavah, Beta-Cluster-Infrastructure
herron updated the task description for T282993: Offboard Cas.
May 17 2021, 2:59 PM · SRE

May 13 2021

herron awarded T282784: New VictorOps user request a Party Time token.
May 13 2021, 2:26 PM · SRE, SRE-Access-Requests, observability
herron added a comment to T282784: New VictorOps user request.

Hi @cmooney I've added your VO account to the GMT+1 "batphone" rotation just now. If you'd like to adjust that please feel free to ping observability any time. Thanks!

May 13 2021, 2:15 PM · SRE, SRE-Access-Requests, observability

May 11 2021

herron updated the task description for T282576: decommission mwlog2001.
May 11 2021, 4:32 PM · SRE, observability, decommission-hardware
herron reassigned T282576: decommission mwlog2001 from herron to Papaul.
May 11 2021, 4:32 PM · SRE, observability, decommission-hardware
herron added a comment to T282575: decommission mwlog1001.

Will give this a 1w grace period before handing off for disk removal/wipe

May 11 2021, 4:30 PM · Patch-For-Review, SRE, observability, decommission-hardware
herron updated the task description for T282575: decommission mwlog1001.
May 11 2021, 4:29 PM · Patch-For-Review, SRE, observability, decommission-hardware
herron updated the task description for T282576: decommission mwlog2001.
May 11 2021, 4:29 PM · SRE, observability, decommission-hardware
herron updated the task description for T282575: decommission mwlog1001.
May 11 2021, 3:47 PM · Patch-For-Review, SRE, observability, decommission-hardware
herron updated the task description for T282576: decommission mwlog2001.
May 11 2021, 3:27 PM · SRE, observability, decommission-hardware
herron renamed T282575: decommission mwlog1001 from decomission mwlog1001 to decommission mwlog1001.
May 11 2021, 3:23 PM · Patch-For-Review, SRE, observability, decommission-hardware
herron renamed T282576: decommission mwlog2001 from decomission mwlog2001 to decommission mwlog2001.
May 11 2021, 3:23 PM · SRE, observability, decommission-hardware
herron added a subtask for T224565: Migrate mwlog/udp2log servers to Buster: T282576: decommission mwlog2001.
May 11 2021, 3:14 PM · Patch-For-Review, observability, SRE
herron added a parent task for T282576: decommission mwlog2001: T224565: Migrate mwlog/udp2log servers to Buster.
May 11 2021, 3:14 PM · SRE, observability, decommission-hardware
herron added a subtask for T224565: Migrate mwlog/udp2log servers to Buster: T282575: decommission mwlog1001.
May 11 2021, 3:14 PM · Patch-For-Review, observability, SRE
herron added a parent task for T282575: decommission mwlog1001: T224565: Migrate mwlog/udp2log servers to Buster.
May 11 2021, 3:14 PM · Patch-For-Review, SRE, observability, decommission-hardware
herron triaged T282576: decommission mwlog2001 as Medium priority.
May 11 2021, 3:11 PM · SRE, observability, decommission-hardware
herron triaged T282575: decommission mwlog1001 as Medium priority.
May 11 2021, 3:10 PM · Patch-For-Review, SRE, observability, decommission-hardware

May 10 2021

herron added a comment to T224565: Migrate mwlog/udp2log servers to Buster.

Thanks for correcting the oversight on arclamp/xenon, TIL

May 10 2021, 2:02 PM · Patch-For-Review, observability, SRE

May 5 2021

herron added a comment to T224565: Migrate mwlog/udp2log servers to Buster.

Along with migrating these hosts to buster I've deployed an updated config to make mwlog more of a multi-datacenter service.

May 5 2021, 8:39 PM · Patch-For-Review, observability, SRE
herron added a comment to T282019: sre.hosts.decommission: don't FAIL when unable to set icinga downtime.

I'm not sure if hiding it completely is a great choice. We could look to see if we could improve the check and distinguish between a missing host and other failures.

May 5 2021, 7:13 PM · Infrastructure-Foundations, SRE-tools
herron assigned T279601: reclaim icinga1001.wikimedia.org to Cmjohnson.
May 5 2021, 4:39 PM · SRE Observability (FY2021/2022-Q1), decommission-hardware
herron updated the task description for T279601: reclaim icinga1001.wikimedia.org.
May 5 2021, 4:39 PM · SRE Observability (FY2021/2022-Q1), decommission-hardware
herron assigned T279602: reclaim icinga2001.wikimedia.org to Papaul.
May 5 2021, 4:38 PM · observability, decommission-hardware
herron updated the task description for T279602: reclaim icinga2001.wikimedia.org.
May 5 2021, 4:38 PM · observability, decommission-hardware
herron created T282019: sre.hosts.decommission: don't FAIL when unable to set icinga downtime.
May 5 2021, 4:26 PM · Infrastructure-Foundations, SRE-tools

May 4 2021

herron added a comment to T232343: Consider Postfix as MTA for our MXes (and OTRS/Mailman/Phab).

Some high level thoughts about how we might approach migrating:

May 4 2021, 5:32 PM · Infrastructure-Foundations, Patch-For-Review, User-MoritzMuehlenhoff, Mail, SRE

May 3 2021

herron added a comment to T279342: Migrate colocated kafka-logging brokers to dedicated kafka-logging hosts.

Reimaging the eqiad kafka-logging hosts and configuring them with a raid50 layout. This will give us 5T usable per host (as opposed to 3T with the current raid10).

May 3 2021, 10:08 PM · Patch-For-Review, observability
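
The raid50 versus raid10 figures in the comment above can be checked with a short calculation. This is a minimal sketch under stated assumptions: the task does not give the disk count or disk size, so the 12 disks of 0.5T and the two RAID5 groups used here are hypothetical values chosen only because they reproduce the 3T and 5T figures. The function names are illustrative as well.

```python
def raid10_usable(disks: int, size_tb: float) -> float:
    # RAID10 mirrors every disk, so half of the raw capacity is usable.
    return (disks // 2) * size_tb


def raid50_usable(disks: int, size_tb: float, groups: int = 2) -> float:
    # RAID50 stripes across RAID5 groups; each group gives up one disk
    # worth of capacity to parity.
    return (disks - groups) * size_tb


# Hypothetical layout (not stated in the task): 12 disks of 0.5T each.
DISKS = 12
SIZE_TB = 0.5

print("raid10 usable (T):", raid10_usable(DISKS, SIZE_TB))
print("raid50 usable (T):", raid50_usable(DISKS, SIZE_TB))
```

The general point holds for other layouts too: raid10 always gives up half of the raw capacity, while raid50 gives up only one disk per RAID5 group, at the cost of tolerating fewer simultaneous disk failures.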