herron (Keith Herron)
Ops Engineer

User Details

User Since
May 30 2017, 5:25 PM (196 w, 6 d)
Availability
Available
IRC Nick
herron
LDAP User
Herron
MediaWiki User
Unknown

Recent Activity

Tue, Mar 2

herron closed T276104: scrape logstash mtail metrics from v7 cluster as Resolved.
Tue, Mar 2, 8:52 PM · observability
herron moved T224565: Migrate mwlog/udp2log servers to Buster from Radar to In progress on the observability board.
Tue, Mar 2, 6:34 PM · observability, SRE
herron claimed T224565: Migrate mwlog/udp2log servers to Buster.
Tue, Mar 2, 6:33 PM · observability, SRE

Mon, Mar 1

herron triaged T276104: scrape logstash mtail metrics from v7 cluster as Medium priority.
Mon, Mar 1, 3:08 PM · observability
herron added a comment to T274668: Standardize a SLI metrics naming/storage/mapping scheme.

Something worth considering here, in addition to the naming scheme for the SLIs themselves, is recording metrics that represent the SLO values themselves (e.g. 0.1 percent)

Mon, Mar 1, 2:55 PM · observability
herron triaged T276101: histogram bucket metrics for elasticsearch query latency as Medium priority.
Mon, Mar 1, 2:44 PM · observability

Mon, Feb 22

herron moved T274668: Standardize a SLI metrics naming/storage/mapping scheme from Inbox to In progress on the observability board.
Mon, Feb 22, 3:36 PM · observability
herron moved T274372: Improve Automation for Alert Reviews from Inbox to In progress on the observability board.
Mon, Feb 22, 3:36 PM · observability
herron moved T274377: Ingest Cron and Root Alerts Into Logstash from Inbox to In progress on the observability board.
Mon, Feb 22, 3:36 PM · SRE, netops, observability
herron moved T274374: Extend Retention of Alerts (Icinga) in Logstash from Inbox to In progress on the observability board.
Mon, Feb 22, 3:35 PM · observability
herron moved T274663: Icinga meta monitoring recovery didn't resolve VO page from Backlog to In progress on the observability board.
Mon, Feb 22, 3:35 PM · observability
herron moved T274662: Icinga meta monitoring pages during icinga host reboots from Inbox to Backlog on the observability board.
Mon, Feb 22, 3:35 PM · SRE, observability
herron moved T274663: Icinga meta monitoring recovery didn't resolve VO page from Inbox to Backlog on the observability board.
Mon, Feb 22, 3:35 PM · observability
herron removed a project from T274392: hosts failing puppet compile due to missing secrets: observability.
Mon, Feb 22, 3:33 PM · serviceops, cloud-services-team (Kanban), SRE
herron updated the task description for T274392: hosts failing puppet compile due to missing secrets.
Mon, Feb 22, 3:33 PM · serviceops, cloud-services-team (Kanban), SRE
herron closed T273984: eqiad: Move logstash1020 to rack A8 as Resolved.

Thanks @ayounsi it's been re-enabled and puppet has been run

Mon, Feb 22, 3:31 PM · SRE, observability, ops-eqiad
herron added a comment to T275170: Define monitoring for gitlab.

FWIW here's a quick review of current gerrit alerting in case it helps when thinking about checks to include in gitlab monitoring.

Mon, Feb 22, 3:26 PM · GitLab (Initialization), observability

Fri, Feb 12

herron added a comment to T274665: Design and implement SLO Dashboard tooling.

mentioning (but not yet linking) some pre-existing SLO tasks T258754 T254916 T256629 T263792

Fri, Feb 12, 7:25 PM · observability
herron created T274668: Standardize a SLI metrics naming/storage/mapping scheme.
Fri, Feb 12, 5:20 PM · observability
herron triaged T274663: Icinga meta monitoring recovery didn't resolve VO page as Medium priority.
Fri, Feb 12, 4:02 PM · observability
herron triaged T274662: Icinga meta monitoring pages during icinga host reboots as Medium priority.
Fri, Feb 12, 4:01 PM · SRE, observability

Wed, Feb 10

herron added a comment to T274377: Ingest Cron and Root Alerts Into Logstash.

Sorry, I should have clarified this initially. AFAICT a proxy won't work for this case because logstash configures the proxy at the JVM level, so it would have unwanted effects on the other inputs and outputs. So I was curious what other approaches might be recommended for this type of outward connection?

Wed, Feb 10, 8:53 PM · SRE, netops, observability
herron added a project to T274377: Ingest Cron and Root Alerts Into Logstash: netops.

Hey @ayounsi, what approach would you recommend for outward connectivity from logstash frontend hosts (logstash1023 for instance) to imap.gmail.com:993?

Wed, Feb 10, 3:21 PM · SRE, netops, observability
herron updated the task description for T274377: Ingest Cron and Root Alerts Into Logstash.
Wed, Feb 10, 3:21 PM · SRE, netops, observability
herron created T274377: Ingest Cron and Root Alerts Into Logstash.
Wed, Feb 10, 3:13 PM · SRE, netops, observability
herron created T274374: Extend Retention of Alerts (Icinga) in Logstash.
Wed, Feb 10, 3:01 PM · observability
herron triaged T274372: Improve Automation for Alert Reviews as Medium priority.
Wed, Feb 10, 2:53 PM · observability

Tue, Feb 9

herron added a comment to T274214: codfw: relocate logstash2035.

LGTM thanks @Papaul!

Tue, Feb 9, 3:48 PM · SRE, ops-codfw
herron updated the task description for T273065: Setup Fundraising team in VO/splunk oncall.
Tue, Feb 9, 3:47 PM · User-fgiunchedi, observability
herron updated the task description for T273065: Setup Fundraising team in VO/splunk oncall.
Tue, Feb 9, 3:37 PM · User-fgiunchedi, observability
herron awarded T273951: Update Icinga meta-monitoring to account for "no pagers" in contacts a Love token.
Tue, Feb 9, 2:34 PM · User-fgiunchedi, observability
herron added a comment to T274214: codfw: relocate logstash2035.

@Papaul sure, sounds good. This host is not yet in production so there will be no prep/depool needed before the re-rack.

Tue, Feb 9, 2:32 PM · SRE, ops-codfw

Mon, Feb 8

herron awarded T273984: eqiad: Move logstash1020 to rack A8 a Party Time token.
Mon, Feb 8, 4:52 PM · SRE, observability, ops-eqiad

Feb 5 2021

herron added a comment to T273984: eqiad: Move logstash1020 to rack A8.

Hey @Cmjohnson, @elukey, sure this should be no problem. I've set a reminder in my calendar to stop services on this host ahead of the window, and yup as long as the host/network config stays the same ES should do the right thing when services are brought back up. Would like to monitor it as it comes up though, just shoot a ping when ready. Thanks!

Feb 5 2021, 4:41 PM · SRE, observability, ops-eqiad
herron added a project to T273984: eqiad: Move logstash1020 to rack A8: observability.
Feb 5 2021, 4:36 PM · SRE, observability, ops-eqiad

Feb 4 2021

herron triaged T273919: Parse logstash error messages into fields as Medium priority.
Feb 4 2021, 6:53 PM · observability

Feb 2 2021

herron updated the task description for T273065: Setup Fundraising team in VO/splunk oncall.
Feb 2 2021, 8:56 PM · User-fgiunchedi, observability
herron added a comment to T225005: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345].

That's really exciting! Yes, I'd love to see this happen as well, and am on board with the plan that you outlined. Time will be the main constraint for me right now, but yes, let's get started on the prep work and then, if necessary, plan out the more time-consuming components for the next Q.

Feb 2 2021, 4:26 PM · Analytics-Radar, Patch-For-Review, Services (watching), Platform Team Legacy (Watching / External), User-herron, SRE

Feb 1 2021

herron claimed T273065: Setup Fundraising team in VO/splunk oncall.
Feb 1 2021, 4:45 PM · User-fgiunchedi, observability
herron added a comment to T267271: (Need By: TBD) rack/setup/install mwlog1002.eqiad.wmnet.

Hey @Cmjohnson, when do you estimate this one will be racked and installed?

Feb 1 2021, 2:57 PM · SRE, ops-eqiad, DC-Ops

Jan 27 2021

herron awarded T272391: Create "phaultfinder" Phabricator bot account for Alertmanager a Love token.
Jan 27 2021, 4:01 PM · observability, Phabricator-Bot-Requests

Jan 15 2021

herron added a comment to T272016: Update saved / short links with objects in ELK7.

I hear you. It depends on the use case a bit, but in general a screenshot or similar (along with saving useful views as visualizations and dashboards) will be more durable in the long term because, for example, logs will age off after 90d.

Jan 15 2021, 4:20 PM · SRE, Wikimedia-Logstash
herron added a comment to T272016: Update saved / short links with objects in ELK7.

Yes, /goto/ links will need to be re-created. We have updated the links within the operations/puppet repository. For things like bookmarks, simply log in to logstash.wikimedia.org, search for the dashboard, then hit share to obtain an updated /goto/ URL.

Jan 15 2021, 2:39 PM · SRE, Wikimedia-Logstash

Jan 14 2021

herron updated the task description for T234854: Upgrade ELK Stack to version 7.
Jan 14 2021, 1:58 AM · Patch-For-Review, SRE, Wikimedia-Logstash

Jan 13 2021

herron added a comment to T271123: Mailman password reminder mail (and other texts) has broken encoding in Czech.

Was hoping for some feedback on the above patch, but since it's been a few days I've gone ahead and merged it. The listinfo page in this task description looks to have improved to me, in that copy/pasting a sampling of text into a translator gives back a meaningful result. How does it look to you @Mormegil?

Jan 13 2021, 7:50 PM · I18n, SRE, Wikimedia-Mailing-lists

Jan 12 2021

herron reopened T266250: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung as "Open".
Jan 12 2021, 6:11 PM · Analytics, SRE, SRE-Access-Requests

Jan 7 2021

herron added a comment to T271123: Mailman password reminder mail (and other texts) has broken encoding in Czech.
Jan 7 2021, 6:59 PM · I18n, SRE, Wikimedia-Mailing-lists

Jan 6 2021

herron added a comment to T234854: Upgrade ELK Stack to version 7.

It might also have something to do with the logstash-* index names, which seems to be the first field that shows an error when editing those filter bubbles, in which case it might just be a configuration issue.

Jan 6 2021, 5:43 PM · Patch-For-Review, SRE, Wikimedia-Logstash

Dec 16 2020

herron closed T270325: API key for the production 'wikimedia' VictorOps environment, a subtask of T270324: launch Klaxon: manual paging app for trusted users to escalate urgent issues to SRE, as Resolved.
Dec 16 2020, 7:11 PM · SRE-OnFire, SRE
herron closed T270325: API key for the production 'wikimedia' VictorOps environment as Resolved.

An API key for Klaxon (discussed via IRC, that's what this is going to be used for, see linked task as well) has been created and added to the pwstore file 'victorops'.

Dec 16 2020, 7:11 PM · observability, SRE

Dec 14 2020

herron added a watcher for Wikimedia-Logstash: herron.
Dec 14 2020, 4:17 PM

Dec 11 2020

herron added a comment to T269552: Degraded RAID on logstash2022.

I think we can go without it. We plan to replace these older hosts in the near future and also have some logstash refresh hardware that was just ordered. Thanks!

Dec 11 2020, 6:26 PM · SRE, ops-codfw

Dec 10 2020

fgiunchedi awarded T266019: Several days of metrics not uploaded to Thanos object storage from Prometheus on PoPs a Party Time token.
Dec 10 2020, 9:54 AM · User-fgiunchedi, observability

Dec 9 2020

herron closed T266019: Several days of metrics not uploaded to Thanos object storage from Prometheus on PoPs as Resolved.

The missing cache pop metrics have been backfilled using the above method and the thanos bucket web viewer no longer shows a gap. I think we're good here!

Dec 9 2020, 11:31 PM · User-fgiunchedi, observability

Dec 8 2020

herron added a comment to T266019: Several days of metrics not uploaded to Thanos object storage from Prometheus on PoPs.

After some testing, I think this may be a viable approach for backfilling:

Dec 8 2020, 6:48 PM · User-fgiunchedi, observability

Dec 3 2020

herron added a comment to T267420: (Need By: TBD) rack/setup/install logstash203[345].

Thanks @Papaul!

Dec 3 2020, 8:25 PM · ops-codfw, SRE, DC-Ops
herron awarded T267420: (Need By: TBD) rack/setup/install logstash203[345] a Party Time token.
Dec 3 2020, 8:16 PM · ops-codfw, SRE, DC-Ops

Dec 2 2020

herron added a comment to T266019: Several days of metrics not uploaded to Thanos object storage from Prometheus on PoPs.

Copies of the missing blocks have been made into /root/gap_blocks on each of the prometheus pop instances

Dec 2 2020, 9:17 PM · User-fgiunchedi, observability

Nov 20 2020

herron closed T268200: Beta cluster logstash down as Resolved.

Apache2 on deployment-logstash03 was erroring with [auth_cas:error] [pid 18928:tid 139767719112768] MOD_AUTH_CAS: CASLoginURL or CASValidateURL not defined.

Nov 20 2020, 3:35 PM · observability, Release-Engineering-Team, User-DannyS712, SRE, Beta-Cluster-Infrastructure
herron triaged T268043: MW REST API should be routed to api_appserver MW cluster as Medium priority.
Nov 20 2020, 3:14 PM · serviceops, Traffic, SRE, Platform Team Workboards (Green)
herron moved T268150: LDAP access to the "nda" group for Alangi Derick from Awaiting User Input to NDA Pending on the LDAP-Access-Requests board.

Hi @KFrancis could you please help @DAlangi_WMF with an NDA? Thanks in advance!

Nov 20 2020, 3:13 PM · SRE, LDAP-Access-Requests
herron triaged T268150: LDAP access to the "nda" group for Alangi Derick as Medium priority.
Nov 20 2020, 3:13 PM · SRE, LDAP-Access-Requests
herron triaged T268211: Filter (if possible) downtimed hosts from check_puppet_run_changes.py's report as Medium priority.
Nov 20 2020, 3:11 PM · Patch-For-Review, SRE
herron triaged T268233: thanos u/i gives errors if left idle for a few hours as Medium priority.
Nov 20 2020, 3:10 PM · CAS-SSO, observability, SRE
herron triaged T268281: Degraded RAID on labstore1006 as High priority.
Nov 20 2020, 3:10 PM · cloud-services-team (Hardware), ops-eqiad, SRE
herron triaged T268285: update RAID controller firmware on labstore1006, 1007 as Medium priority.
Nov 20 2020, 3:09 PM · ops-eqiad, cloud-services-team (Kanban), SRE
herron triaged T268291: Requesting access to phab1001 for Brennen Bearnes (brennen) as Medium priority.
Nov 20 2020, 3:09 PM · SRE, SRE-Access-Requests
herron triaged T268301: Requesting access to contint1001 for mmodell as Medium priority.
Nov 20 2020, 3:08 PM · SRE, SRE-Access-Requests
herron triaged T268316: Base replication lag detection on heartbeat as Medium priority.
Nov 20 2020, 3:08 PM · Orchestrator, DBA, SRE
herron triaged T268320: Configure mariadb to notice/recover from replication issues quicker as Medium priority.
Nov 20 2020, 3:07 PM · Orchestrator, DBA
herron triaged T268336: Cleanup heartbeat.heartbeat on all production instances as Medium priority.
Nov 20 2020, 3:07 PM · Orchestrator, DBA
herron updated the task description for T268291: Requesting access to phab1001 for Brennen Bearnes (brennen).
Nov 20 2020, 3:00 PM · SRE, SRE-Access-Requests
herron updated the task description for T268301: Requesting access to contint1001 for mmodell.
Nov 20 2020, 2:58 PM · SRE, SRE-Access-Requests
herron updated the task description for T268301: Requesting access to contint1001 for mmodell.
Nov 20 2020, 2:55 PM · SRE, SRE-Access-Requests
herron updated the task description for T268301: Requesting access to contint1001 for mmodell.
Nov 20 2020, 2:48 PM · SRE, SRE-Access-Requests

Nov 19 2020

herron updated subscribers of T267744: LDAP access for Till Mletzko.

Hi @tmletzko could you please also coordinate a comment from @conny-kawohl_WMDE, @WMDE-leszek, @darthmon_wmde, or @Tobi_WMDE_SW approving this request?

Nov 19 2020, 4:21 PM · Patch-For-Review, LDAP-Access-Requests, SRE
herron updated subscribers of T267771: LDAP access for Jan Jaquemot.

@JanJaquemot could you please also coordinate a comment from @conny-kawohl_WMDE, @WMDE-leszek, @darthmon_wmde, or @Tobi_WMDE_SW approving this request?

Nov 19 2020, 4:21 PM · SRE, LDAP-Access-Requests
herron changed the status of T266791: Requesting access to production shell groups for DNdubane from Open to Stalled.
Nov 19 2020, 3:17 PM · SRE, SRE-Access-Requests

Nov 18 2020

herron closed T267962: Request for LDAP Access in order to access Superset for IJethroBT-WMF as Resolved.

Hi @IJethroBT-WMF, the requested access has been granted. I'll transition this to closed now, but please reopen if any follow-up is needed. Thanks!

Nov 18 2020, 3:29 PM · SRE, LDAP-Access-Requests
herron closed T267968: Add STran to `wmf` LDAP group as Resolved.

Hi @STran, you have been added to the wmf LDAP group. I'll transition this to closed now, but please reopen if any follow-up is needed. Thanks!

Nov 18 2020, 3:14 PM · LDAP-Access-Requests, SRE
herron closed T267314: Access to analytics-privatedata-users for Research volunteer Swagoel as Resolved.

Hi @Swagoel, the requested access has been granted and will be fully active within 30 minutes. I'll transition this to closed now, but please reopen if any follow-up is needed. Thanks!

Nov 18 2020, 3:05 PM · Research, SRE, SRE-Access-Requests
herron closed T267917: LDAP 'nda' access for Tobias Schumann as Resolved.

Hi @Tobias_Schumann_WMDE-ext, the requested access has been granted. I'll transition this to closed now, but please re-open if any follow-up is needed. Thanks!

Nov 18 2020, 2:49 PM · SRE, LDAP-Access-Requests
herron renamed T267917: LDAP 'nda' access for Tobias Schumann from LDAP access for Tobias Schumann to LDAP 'nda' access for Tobias Schumann.
Nov 18 2020, 2:48 PM · SRE, LDAP-Access-Requests
herron updated subscribers of T267744: LDAP access for Till Mletzko.

Hi @KFrancis, could you please confirm or coordinate an NDA for @tmletzko? Thanks in advance!

Nov 18 2020, 2:28 PM · Patch-For-Review, LDAP-Access-Requests, SRE
herron updated subscribers of T267771: LDAP access for Jan Jaquemot.

Hi @KFrancis, could you please confirm or coordinate an NDA for @JanJaquemot? Thanks in advance!

Nov 18 2020, 2:26 PM · SRE, LDAP-Access-Requests

Nov 17 2020

herron added a comment to T267968: Add STran to `wmf` LDAP group.

Hi @STran, for our records could you please give a high level description of what the requested access will be used for? Thanks in advance!

Nov 17 2020, 8:57 PM · LDAP-Access-Requests, SRE
herron updated the task description for T267314: Access to analytics-privatedata-users for Research volunteer Swagoel.
Nov 17 2020, 6:03 PM · Research, SRE, SRE-Access-Requests
herron closed T267817: Requesting access to analytics-privatedata-users and wmf LDAP for fkaelin as Resolved.

The requested shell and LDAP access has been granted, and will be fully active within 30 minutes. I'll transition this to closed now, but please re-open if any follow-up is needed. Thanks!

Nov 17 2020, 5:33 PM · SRE, SRE-Access-Requests
herron updated the task description for T267817: Requesting access to analytics-privatedata-users and wmf LDAP for fkaelin.
Nov 17 2020, 5:30 PM · SRE, SRE-Access-Requests
herron added a comment to T267817: Requesting access to analytics-privatedata-users and wmf LDAP for fkaelin.

If the comments say that, they are probably true. researchers is kinda outdated. Just analytics-privatedata-users then.

Nov 17 2020, 5:20 PM · SRE, SRE-Access-Requests
herron renamed T267817: Requesting access to analytics-privatedata-users and wmf LDAP for fkaelin from Requesting access to researchers, analytics-privatedata-users and wmf LDAP for fkaelin to Requesting access to analytics-privatedata-users and wmf LDAP for fkaelin.
Nov 17 2020, 5:19 PM · SRE, SRE-Access-Requests
herron closed T267961: Request Superset Access (LDAP group 'wmf') for KEchavarriqueen as Resolved.

Hi @KEchavarriqueen, the requested group access has been granted. I'll transition this to closed now, but please don't hesitate to re-open if any follow up is needed. Thanks!

Nov 17 2020, 5:13 PM · SRE, LDAP-Access-Requests

Nov 16 2020

herron renamed T267961: Request Superset Access (LDAP group 'wmf') for KEchavarriqueen from Request Superset Access for KEchavarriqueen to Request Superset Access (LDAP group 'wmf') for KEchavarriqueen.
Nov 16 2020, 8:37 PM · SRE, LDAP-Access-Requests
herron moved T266791: Requesting access to production shell groups for DNdubane from Awaiting User Input to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Nov 16 2020, 7:58 PM · SRE, SRE-Access-Requests
herron closed T267913: Add gmodena to wmf LDAP group as Resolved.

Hi @hnowlan gmodena has been added to LDAP group wmf, and the above patch has been merged. Thanks for that!

Nov 16 2020, 7:49 PM · SRE, LDAP-Access-Requests
herron updated subscribers of T267917: LDAP 'nda' access for Tobias Schumann.

Hi @KFrancis, could you please confirm that we have an NDA on file for Tobias? Thanks in advance!

Nov 16 2020, 7:42 PM · SRE, LDAP-Access-Requests
herron closed T266249: Requesting access to production shell groups for JAnstee as Resolved.

I'll transition this to closed for the time being due to inactivity. When ready to proceed please add a comment of manager approval and re-open the task. Thanks in advance!

Nov 16 2020, 7:32 PM · Analytics, SRE, SRE-Access-Requests
herron updated subscribers of T266791: Requesting access to production shell groups for DNdubane.

Hi @DNdubane_WMF, could you please coordinate obtaining a comment from your manager approving this request?

Nov 16 2020, 7:28 PM · SRE, SRE-Access-Requests
herron updated the task description for T266791: Requesting access to production shell groups for DNdubane.
Nov 16 2020, 7:24 PM · SRE, SRE-Access-Requests
herron updated subscribers of T267314: Access to analytics-privatedata-users for Research volunteer Swagoel.

Hi @KFrancis could you please verify that @Swagoel has a valid NDA on file? Thanks in advance!

Nov 16 2020, 7:23 PM · Research, SRE, SRE-Access-Requests