herron (Keith Herron)
Ops Engineer

User Details

User Since
May 30 2017, 5:25 PM (177 w, 4 d)
Availability
Available
IRC Nick
herron
LDAP User
Herron
MediaWiki User
Unknown

Recent Activity

Mon, Oct 19

herron added a comment to T265590: ulog: filter out diffscan from ulog.

What are the downsides to using iptables rules for this?

Mon, Oct 19, 5:02 PM · observability, Security, Operations, netops, User-jbond

Fri, Oct 9

herron closed T264504: Don't get a mail to confirm my email address (mx2001 is blacklisted by abusix blacklist) as Resolved.

I think we're in good shape here now.

Fri, Oct 9, 1:37 PM · Operations, Security, Mail
herron added a comment to T265142: exim should log the reason for defer with disconnect after HELO/EHLO.

This would have been helpful in troubleshooting T264504

Fri, Oct 9, 1:36 PM · Operations, Mail
herron triaged T265142: exim should log the reason for defer with disconnect after HELO/EHLO as Medium priority.
Fri, Oct 9, 1:35 PM · Operations, Mail
herron added a comment to T264504: Don't get a mail to confirm my email address (mx2001 is blacklisted by abusix blacklist).

I tried to get a confirmation mail again and got it. It worked as expected now. Thanks a lot.

Fri, Oct 9, 1:25 PM · Operations, Security, Mail

Thu, Oct 8

herron added a comment to T264504: Don't get a mail to confirm my email address (mx2001 is blacklisted by abusix blacklist).

wiki-mail-codfw.wikimedia.org has been delisted. This should resolve the issue outlined in the description here.

Thu, Oct 8, 8:12 PM · Operations, Security, Mail
herron added a comment to T264504: Don't get a mail to confirm my email address (mx2001 is blacklisted by abusix blacklist).

Actually, after some further manual testing I think we have a reason:

Thu, Oct 8, 4:25 PM · Operations, Security, Mail
herron added a comment to T264504: Don't get a mail to confirm my email address (mx2001 is blacklisted by abusix blacklist).

With regard to why mail from eqiad seemed to be working while codfw was not -- part of this is because the working email examples are gerrit mails, which in addition to having different message contents are also sent outward via the main mx host interface instead of the wiki-mail-site.wikimedia.org bulk mail interface.

Thu, Oct 8, 4:02 PM · Operations, Security, Mail

Mon, Oct 5

herron closed T264127: Please replace Shannon Baileys SSH key as Resolved.

Hi @Sbailey, the updated SSH key has been deployed to servers by now. Please re-open if any follow-up is needed. Thanks!

Mon, Oct 5, 6:07 PM · Patch-For-Review, SRE-Access-Requests, Operations
herron added a comment to T264127: Please replace Shannon Baileys SSH key.

The new key has been confirmed via Google Chat and email.

Mon, Oct 5, 2:55 PM · Patch-For-Review, SRE-Access-Requests, Operations

Fri, Oct 2

herron closed T264392: Update public key for production shell for dedcode as Resolved.

This has been done; I'll transition to resolved now.

Fri, Oct 2, 6:07 PM · Operations, SRE-Access-Requests
herron added a comment to T264127: Please replace Shannon Baileys SSH key.

Hi @Sbailey I've reached out to you via google chat and by email to verify. Thanks!

Fri, Oct 2, 5:48 PM · Patch-For-Review, SRE-Access-Requests, Operations
herron triaged T264392: Update public key for production shell for dedcode as High priority.
Fri, Oct 2, 5:18 PM · Operations, SRE-Access-Requests
herron triaged T264274: Define a methodology to track WMF services backup requirements as Medium priority.
Fri, Oct 2, 5:18 PM · Goal, Operations, Data-Persistence-Backup
herron triaged T264174: Migrate remaining services using Java to profile::java as Medium priority.
Fri, Oct 2, 5:18 PM · Operations
herron triaged T264176: Switch Zookeeper to profile::java as Medium priority.
Fri, Oct 2, 5:17 PM · Analytics-Clusters, Operations
herron triaged T264177: Switch cergen to profile::java as Medium priority.
Fri, Oct 2, 5:17 PM · Operations
herron triaged T264178: Switch puppetdb to profile::java as Medium priority.
Fri, Oct 2, 5:16 PM · Puppet, Operations
herron triaged T264181: Migrate WDQS to profile::java as Medium priority.
Fri, Oct 2, 5:16 PM · Wikidata-Query-Service, Operations, Wikidata
herron triaged T264272: Plan WMF infrastructure for 100% coverage of data recovery as Medium priority.
Fri, Oct 2, 5:16 PM · Goal, Epic, Operations, Data-Persistence-Backup
herron triaged T264275: Track all directly-owned SRE datasets into the new inventory system as Medium priority.
Fri, Oct 2, 5:15 PM · Goal, Operations, Data-Persistence-Backup
herron triaged T264292: Migrate maps to Buster as Medium priority.
Fri, Oct 2, 5:15 PM · Maps, Operations
herron triaged T264345: Change urbanecm's SSH production key as Medium priority.
Fri, Oct 2, 4:43 PM · SRE-Access-Requests, Operations
herron triaged T264409: makevm cookbook fails get_vm() call as Medium priority.
Fri, Oct 2, 4:39 PM · SRE-tools, Operations

Thu, Oct 1

herron added a comment to T264345: Change urbanecm's SSH production key.

The file bast2002:/home/urbanecm/id_ed25519_wmnet_20201001.pub.sig does indeed match the key in the description and the key on the patch.

Thu, Oct 1, 6:28 PM · SRE-Access-Requests, Operations
herron added a project to T264345: Change urbanecm's SSH production key: SRE-Access-Requests.
Thu, Oct 1, 6:23 PM · SRE-Access-Requests, Operations
herron added a comment to T264127: Please replace Shannon Baileys SSH key.

Is there another host in production where you have working access? Placing a file there would work too, just let me know where to check. Otherwise we can figure out another method. Thanks!

Thu, Oct 1, 6:06 PM · Patch-For-Review, SRE-Access-Requests, Operations
herron closed T263692: Requesting access to analytics-privatedata-users for Djellel Difallah as Resolved.

The requested access has been enabled and will become active within the next 30 minutes. I'll transition this task to resolved now, but please don't hesitate to re-open if any follow-up is needed. Thanks!

Thu, Oct 1, 6:05 PM · SRE-Access-Requests, Operations
herron updated the task description for T263692: Requesting access to analytics-privatedata-users for Djellel Difallah.
Thu, Oct 1, 6:01 PM · SRE-Access-Requests, Operations
herron updated the task description for T263692: Requesting access to analytics-privatedata-users for Djellel Difallah.
Thu, Oct 1, 2:55 PM · SRE-Access-Requests, Operations
herron added a comment to T263692: Requesting access to analytics-privatedata-users for Djellel Difallah.

Since this is somewhat of an atypical access request (in that the account and group membership are pre-existing, but attributes are changing) please have a close look at https://gerrit.wikimedia.org/r/631455 to ensure it matches the expected outcome. Thanks in advance!

Thu, Oct 1, 2:30 PM · SRE-Access-Requests, Operations
herron added a comment to T264127: Please replace Shannon Baileys SSH key.

Hi @Sbailey as a security precaution, could you please use your existing shell access to upload the desired new ssh key onto one of the bastions (let's say bast1002) as a file in your home directory called sbailey_new_ssh_key? Once done and confirmed we'll be ready to move forward with the above patch. Thanks in advance!

Thu, Oct 1, 1:14 PM · Patch-For-Review, SRE-Access-Requests, Operations

Wed, Sep 30

herron closed T262921: Add Bereket teshome to the ldap/wmde and ldap/nda group as Resolved.
Wed, Sep 30, 5:34 PM · LDAP-Access-Requests, Operations
herron removed a project from T253988: Adding Italian Wikinews to Google Search Console to add it to Google News: SRE-Access-Requests.

Removing the SRE-Access-Requests tag for now, please re-add when ready to proceed with this. Thanks!

Wed, Sep 30, 5:33 PM · Operations
herron moved T263692: Requesting access to analytics-privatedata-users for Djellel Difallah from Untriaged to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Wed, Sep 30, 5:31 PM · SRE-Access-Requests, Operations
herron moved T148976: Strongswan Icinga check: do not report issues about depooled hosts from Radar to Inbox on the observability board.
Wed, Sep 30, 5:29 PM · Patch-For-Review, serviceops, observability, Operations
herron triaged T148976: Strongswan Icinga check: do not report issues about depooled hosts as Medium priority.
Wed, Sep 30, 5:28 PM · Patch-For-Review, serviceops, observability, Operations
herron triaged T264014: unbound variable error when calling puppet-merge script with an explicit treeish as Medium priority.
Wed, Sep 30, 5:25 PM · Patch-For-Review, Operations, Puppet
herron triaged T264074: varnishkafka 1.1.0 CPU usage increase as High priority.
Wed, Sep 30, 5:23 PM · Patch-For-Review, Analytics-Clusters, Operations, Traffic
herron triaged T264127: Please replace Shannon Baileys SSH key as Medium priority.
Wed, Sep 30, 5:21 PM · Patch-For-Review, SRE-Access-Requests, Operations
herron triaged T264182: Migrate Gerrit to profile::java as Medium priority.
Wed, Sep 30, 5:19 PM · Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)), Release-Engineering-Team (Development services), Gerrit, Operations
herron triaged T264189: Prepare a proof of concept of the minimum setup capable of backup and recover testwiki media files as Medium priority.
Wed, Sep 30, 5:18 PM · Patch-For-Review, Data-Persistence-Backup, Goal, Operations, SRE-swift-storage
herron triaged T264190: Research storage solutions for media backups as Medium priority.
Wed, Sep 30, 5:17 PM · Data-Persistence-Backup, Goal, Operations, SRE-swift-storage

Sep 24 2020

herron moved T148976: Strongswan Icinga check: do not report issues about depooled hosts from Inbox to Radar on the observability board.
Sep 24 2020, 4:03 PM · Patch-For-Review, serviceops, observability, Operations

Sep 23 2020

herron added a comment to T263662: check-mariadb-backups pkg_resources.VersionConflict: 0.2 (/usr/lib/python3/dist-packages), Requirement.parse('wmfbackups==0.1').

This should be fixed now in alert1001 and alert2001. The issue was that my puppet only had ensure_package(wmfbackups-check) but I didn't upgrade to 0.2 on alert* hosts, only the icinga ones.

Sep 23 2020, 5:06 PM · Operations, Data-Persistence-Backup, Data-Persistence, observability
herron added a comment to T263662: check-mariadb-backups pkg_resources.VersionConflict: 0.2 (/usr/lib/python3/dist-packages), Requirement.parse('wmfbackups==0.1').

I see the issue: after an apt upgrade, version 0.2 became available, and that is the one that works. It must have been caught in the small window between the first deploy (0.1) and the 0.2 fix, and I didn't notice alert1001 was about to be made the primary host.

0.1 had missing dependencies. Those are fixed on 0.2. Now it works. Apologies.

Sep 23 2020, 5:05 PM · Operations, Data-Persistence-Backup, Data-Persistence, observability
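
The VersionConflict in the task title comes from an exact pin (wmfbackups==0.1) being checked against an installed 0.2. The following is a minimal sketch of that kind of check, not pkg_resources' actual implementation: it assumes only `==` pins and purely numeric dotted versions, and the helper names `parse_pin` and `satisfies_pin` are hypothetical.

```python
from typing import Tuple


def parse_pin(requirement: str) -> Tuple[str, Tuple[int, ...]]:
    """Split a 'name==X.Y' requirement into (name, version tuple).

    Hypothetical helper: handles only exact '==' pins with numeric parts.
    """
    name, sep, version = requirement.partition("==")
    if not sep or not version.strip():
        raise ValueError(f"not an exact pin: {requirement!r}")
    return name.strip(), tuple(int(part) for part in version.strip().split("."))


def satisfies_pin(installed_version: str, requirement: str) -> bool:
    """Return True only if the installed version equals the pinned one."""
    _, pinned = parse_pin(requirement)
    installed = tuple(int(part) for part in installed_version.split("."))
    return installed == pinned


if __name__ == "__main__":
    # An installed 0.2 does not satisfy an entry point that pins ==0.1.
    print(satisfies_pin("0.2", "wmfbackups==0.1"))
    print(satisfies_pin("0.1", "wmfbackups==0.1"))
```

An exact pin is the strictest possible requirement, so any version skew between a package and the metadata that references it shows up as a hard failure at startup.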
herron added a comment to T263662: check-mariadb-backups pkg_resources.VersionConflict: 0.2 (/usr/lib/python3/dist-packages), Requirement.parse('wmfbackups==0.1').

Was there an OS upgrade on the change from icinga1001 to alert1001? I must not have uploaded all versions for all os versions, or something.

Sep 23 2020, 5:00 PM · Operations, Data-Persistence-Backup, Data-Persistence, observability
herron renamed T263662: check-mariadb-backups pkg_resources.VersionConflict: 0.2 (/usr/lib/python3/dist-packages), Requirement.parse('wmfbackups==0.1') from check-mariadb-backups pkg_resources.VersionConflict: 0.2 (/usr/lib/python3/dist-packages), Requirement.parse('wmfbackups==0.1' to check-mariadb-backups pkg_resources.VersionConflict: 0.2 (/usr/lib/python3/dist-packages), Requirement.parse('wmfbackups==0.1').
Sep 23 2020, 4:50 PM · Operations, Data-Persistence-Backup, Data-Persistence, observability
herron created T263662: check-mariadb-backups pkg_resources.VersionConflict: 0.2 (/usr/lib/python3/dist-packages), Requirement.parse('wmfbackups==0.1').
Sep 23 2020, 4:50 PM · Operations, Data-Persistence-Backup, Data-Persistence, observability
herron added a comment to T247966: Migrate role::alerting_host to Buster.

Alert1001 is now the active Icinga server. Meta monitoring for alert[12]001 has been enabled as well.

Sep 23 2020, 4:25 PM · Patch-For-Review, observability
herron updated the task description for T247966: Migrate role::alerting_host to Buster.
Sep 23 2020, 4:22 PM · Patch-For-Review, observability

Sep 21 2020

herron updated the task description for T243057: Move Prometheus off eqsin/ulsfo/esams bastions.
Sep 21 2020, 2:30 PM · Patch-For-Review, Operations, observability

Sep 15 2020

herron updated the task description for T247966: Migrate role::alerting_host to Buster.
Sep 15 2020, 7:43 PM · Patch-For-Review, observability
herron moved T262512: Enable CAS authentication for Grafana from Inbox to In progress on the observability board.
Sep 15 2020, 4:06 PM · User-fgiunchedi, Patch-For-Review, observability, Operations
herron moved T262675: Store Kubernetes events for more than one hour from Inbox to In progress on the observability board.
Sep 15 2020, 4:06 PM · Patch-For-Review, observability, Prod-Kubernetes, Kubernetes, serviceops
herron added a comment to T262675: Store Kubernetes events for more than one hour.

Thanks @JMeybohm, ok I think we should defer to your expertise with regard to the optimal way to output these logs from the Kubernetes environment.

Sep 15 2020, 3:52 PM · Patch-For-Review, observability, Prod-Kubernetes, Kubernetes, serviceops

Sep 14 2020

herron added a comment to T262675: Store Kubernetes events for more than one hour.

What I think I need from your side is mainly the "okay" to push those events to the logstash-* indices of the elasticsearch cluster (I can try to figure out what that would mean in terms of documents per day, size etc. - as you might need some numbers there I guess) and probably some support in how to access it (set up needed credentials/accounts etc.). But if you have any objections in general or better ideas on how to do this, please let me know.

Sep 14 2020, 4:18 PM · Patch-For-Review, observability, Prod-Kubernetes, Kubernetes, serviceops

Sep 10 2020

herron added a member for observability: herron.
Sep 10 2020, 2:51 PM
herron added a watcher for observability: herron.
Sep 10 2020, 2:51 PM

Sep 8 2020

herron moved T262291: update nagios_nsca configuration in frack for new nsca servers from Inbox to In progress on the observability board.
Sep 8 2020, 5:10 PM · fundraising-tech-ops, Operations, netops, observability

Sep 3 2020

herron added a comment to T252773: Move kafkamon hosts to Debian Buster.

The buster kafkamon hosts are now live. Will let them settle for a bit before moving on to cleanup/teardown of the old hosts.

Sep 3 2020, 5:19 PM · Patch-For-Review, Analytics-Clusters, Analytics-Radar, observability, Operations
herron updated the task description for T252773: Move kafkamon hosts to Debian Buster.
Sep 3 2020, 5:17 PM · Patch-For-Review, Analytics-Clusters, Analytics-Radar, observability, Operations
herron updated the task description for T247966: Migrate role::alerting_host to Buster.
Sep 3 2020, 4:11 PM · Patch-For-Review, observability
herron updated the task description for T247966: Migrate role::alerting_host to Buster.
Sep 3 2020, 4:10 PM · Patch-For-Review, observability
herron closed T261342: ensure alert[12]001 are prepared for meta monitoring as Resolved.
Sep 3 2020, 3:30 PM · observability
herron renamed T261342: ensure alert[12]001 are prepared for meta monitoring from ensure alert[12]001 are configured for meta monitoring to ensure alert[12]001 are prepared for meta monitoring.
Sep 3 2020, 3:30 PM · observability
herron closed T261342: ensure alert[12]001 are prepared for meta monitoring, a subtask of T247966: Migrate role::alerting_host to Buster, as Resolved.
Sep 3 2020, 3:30 PM · Patch-For-Review, observability
herron added a comment to T261342: ensure alert[12]001 are prepared for meta monitoring.

Icinga/alerts certificate issue has been fixed and meta monitoring is now working against the new alert[12]001 hosts.

Sep 3 2020, 3:29 PM · observability

Sep 2 2020

herron updated the task description for T247966: Migrate role::alerting_host to Buster.
Sep 2 2020, 7:10 PM · Patch-For-Review, observability
herron added a comment to T261342: ensure alert[12]001 are prepared for meta monitoring.

Thanks @Volans, sync_check_icinga_contacts is happy now on alert[12]001

Sep 2 2020, 6:36 PM · observability

Aug 28 2020

herron awarded T260686: check_mariadb_dump failing on alert[12]* hosts a Love token.
Aug 28 2020, 2:02 PM · DBA, observability

Aug 27 2020

herron updated the task description for T252773: Move kafkamon hosts to Debian Buster.
Aug 27 2020, 3:46 PM · Patch-For-Review, Analytics-Clusters, Analytics-Radar, observability, Operations
herron added a comment to T252773: Move kafkamon hosts to Debian Buster.

Hey @elukey, prep work is done for the new hosts. Will be performing cut-over in the near future, will keep you on the cc.

Aug 27 2020, 3:10 PM · Patch-For-Review, Analytics-Clusters, Analytics-Radar, observability, Operations
herron updated the task description for T252773: Move kafkamon hosts to Debian Buster.
Aug 27 2020, 3:08 PM · Patch-For-Review, Analytics-Clusters, Analytics-Radar, observability, Operations

Aug 26 2020

herron added a comment to T234854: Upgrade ELK Stack to version 7.

I am getting a lot of 500 internal server errors on logstash-next instance. I am guessing that is expected/WIP?

Aug 26 2020, 6:07 PM · Patch-For-Review, Operations, Wikimedia-Logstash
herron updated the task description for T247966: Migrate role::alerting_host to Buster.
Aug 26 2020, 6:02 PM · Patch-For-Review, observability
herron added a comment to T261342: ensure alert[12]001 are prepared for meta monitoring.

Currently the sync_check_icinga_contacts unit is failed on alert1001. I've armed the keyholder, but am not sure if there's an additional step to carry out on the wikitech-static host to permit the key from a new host. Or even if the sync should be running from multiple places at the same time.

Aug 26 2020, 6:01 PM · observability
herron created T261342: ensure alert[12]001 are prepared for meta monitoring.
Aug 26 2020, 6:00 PM · observability
herron closed T260688: WMCS galera cluster checks failing from new alert[12]* hosts, a subtask of T247966: Migrate role::alerting_host to Buster, as Resolved.
Aug 26 2020, 5:03 PM · Patch-For-Review, observability
herron closed T260688: WMCS galera cluster checks failing from new alert[12]* hosts as Resolved.

Thanks! These Galera checks are now green in the new icinga instance.

Aug 26 2020, 5:03 PM · observability, cloud-services-team (Kanban)
herron awarded T261198: nagios-nrpe-server in jessie not compatibile with Buster version a Party Time token.
Aug 26 2020, 3:24 PM · User-fgiunchedi, Operations, observability

Aug 25 2020

herron renamed T259219: "shards failed" error: "Data too large" from "shards failed" error: "Data too large, data for [indices:data/read/search[phase/query]] would be ... which is larger than the limit of ..." to "shards failed" error: "Data too large".
Aug 25 2020, 3:18 PM · Wikimedia-Logstash, observability
herron renamed T259219: "shards failed" error: "Data too large" from "shards failed" error while loading the "varnish webrequest 50x" dashboard in logstash-next to "shards failed" error: "Data too large, data for [indices:data/read/search[phase/query]] would be ... which is larger than the limit of ...".
Aug 25 2020, 3:17 PM · Wikimedia-Logstash, observability
herron added a comment to T259219: "shards failed" error: "Data too large".

This came up again this morning in codfw: the cluster went yellow due to a shard allocation failure on the logstash-2020.08.18 index.

Aug 25 2020, 3:16 PM · Wikimedia-Logstash, observability

Aug 24 2020

herron awarded T259465: VictorOps behavior on long-ack'd incidents a Like token.
Aug 24 2020, 2:25 PM · User-fgiunchedi, Operations, observability

Aug 11 2020

Dzahn awarded T224586: Migrate fermium to Buster a Like token.
Aug 11 2020, 10:21 PM · Patch-For-Review, Operations
herron closed T224586: Migrate fermium to Buster, a subtask of T224549: Track remaining jessie systems in production, as Resolved.
Aug 11 2020, 5:02 PM · Operations
herron closed T224586: Migrate fermium to Buster as Resolved.

lists.wikimedia.org is now running from the buster host lists1001.wikimedia.org.

Aug 11 2020, 5:01 PM · Patch-For-Review, Operations
herron placed T260154: De-noise "Ensure local MW versions match expected deployment" alerts up for grabs.
Aug 11 2020, 3:43 PM · observability
herron created T260154: De-noise "Ensure local MW versions match expected deployment" alerts.
Aug 11 2020, 3:43 PM · observability

Aug 9 2020

herron added a comment to T259219: "shards failed" error: "Data too large".

Saw this same Data too large, data for... error also affecting shard allocation on the HDD hosts yesterday. Bumping the heap on the eqiad HDD hosts manually from 24G to 26G and issuing a /_cluster/reroute?retry_failed=true cleared it. Uploaded https://gerrit.wikimedia.org/r/619032 to persist the setting (and to deploy it to codfw).

Aug 9 2020, 12:24 AM · Wikimedia-Logstash, observability
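
The reroute call mentioned in the comment above can be sketched as follows. This is an illustration only: the host, port, and the helper name `build_reroute_request` are assumptions, and the snippet builds the request without sending it. The only detail taken from the comment is the /_cluster/reroute?retry_failed=true endpoint, which Elasticsearch expects as a POST.

```python
from urllib.parse import urlencode, urlunsplit
from urllib.request import Request


def build_reroute_request(host: str, port: int = 9200, scheme: str = "http") -> Request:
    """Build a POST to /_cluster/reroute?retry_failed=true.

    Hypothetical helper: host, port, and scheme are placeholders, not
    the values used on the real cluster.
    """
    query = urlencode({"retry_failed": "true"})
    url = urlunsplit((scheme, f"{host}:{port}", "/_cluster/reroute", query, ""))
    # An empty body is sufficient; retry_failed travels in the query string.
    return Request(url, data=b"", method="POST")


if __name__ == "__main__":
    request = build_reroute_request("localhost")
    print(request.get_method(), request.full_url)
    # To actually send it (assumes a reachable cluster):
    #   from urllib.request import urlopen
    #   with urlopen(request) as response:
    #       print(response.status)
```

The retry only helps once the underlying cause is gone, which is why the comment pairs it with the heap increase.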

Aug 4 2020

herron moved T247966: Migrate role::alerting_host to Buster from Backlog to In progress on the observability board.
Aug 4 2020, 4:45 PM · Patch-For-Review, observability
herron moved T252773: Move kafkamon hosts to Debian Buster from Backlog to In progress on the observability board.
Aug 4 2020, 4:45 PM · Patch-For-Review, Analytics-Clusters, Analytics-Radar, observability, Operations

Aug 3 2020

herron added a comment to T259465: VictorOps behavior on long-ack'd incidents.

I've updated the description to outline the two options, auto-retrigger and auto-resolve, that VO offers today.

Aug 3 2020, 5:39 PM · User-fgiunchedi, Operations, observability
herron updated the task description for T259465: VictorOps behavior on long-ack'd incidents.
Aug 3 2020, 5:36 PM · User-fgiunchedi, Operations, observability
herron updated the task description for T259465: VictorOps behavior on long-ack'd incidents.
Aug 3 2020, 5:35 PM · User-fgiunchedi, Operations, observability
herron moved T259388: Requesting access to production shell for Denny Vrandecic from Untriaged to Awaiting User Input on the SRE-Access-Requests board.

@Nuria could you please review and give a thumbs up/down on the request for analytics-privatedata-users membership?

Aug 3 2020, 5:09 PM · Analytics-Radar, SRE-Access-Requests, Operations
herron updated the task description for T259388: Requesting access to production shell for Denny Vrandecic.
Aug 3 2020, 5:01 PM · Analytics-Radar, SRE-Access-Requests, Operations

Jul 31 2020

herron renamed T257561: codfw: 1 VM for kafkamon - kafkamon2002 from codfw: 1 VM for kafkamon to codfw: 1 VM for kafkamon - kafkamon2002.
Jul 31 2020, 7:36 PM · vm-requests, Operations
herron renamed T257560: eqiad: 1 VM for kafkamon - kafkamon1002 from eqiad: 1 VM for kafkamon to eqiad: 1 VM for kafkamon - kafkamon1002.
Jul 31 2020, 7:36 PM · vm-requests, Operations