herron (Keith Herron)
Site Reliability Engineer

User Details

User Since: May 30 2017, 5:25 PM (329 w, 1 d)
Availability: Available
IRC Nick: herron
LDAP User: Herron
MediaWiki User: Unknown

Recent Activity

Wed, Sep 20

herron updated the task description for T346950: Prometheus rule evaluation failure.
Wed, Sep 20, 5:20 PM · Patch-For-Review, observability
herron created T346950: Prometheus rule evaluation failure.
Wed, Sep 20, 5:17 PM · Patch-For-Review, observability
herron updated the task description for T344937: Decom dispatch infrastructure.
Wed, Sep 20, 1:51 PM · Patch-For-Review, Incident Tooling, User-herron
herron updated the task description for T344937: Decom dispatch infrastructure.
Wed, Sep 20, 1:50 PM · Patch-For-Review, Incident Tooling, User-herron
herron updated the task description for T344937: Decom dispatch infrastructure.
Wed, Sep 20, 1:41 PM · Patch-For-Review, Incident Tooling, User-herron
herron updated the task description for T344937: Decom dispatch infrastructure.
Wed, Sep 20, 1:36 PM · Patch-For-Review, Incident Tooling, User-herron

Tue, Sep 12

herron added a comment to T346144: Hardcode the SLO time windows in Grafana dashboards generated via Grizzly.

+1 for trying this. Thinking out loud:

Tue, Sep 12, 2:06 PM · SRE Observability (FY2023/2024-Q1), Patch-For-Review, serviceops, observability

Fri, Sep 1

herron added a comment to T344751: Decide on default histogram buckets for MediaWiki timers.

Uploaded the above to get the ball rolling on a patch. As a starting point, it essentially borrows the values used for benthos mw accesslog: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 15, 20, 30, 60]
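
To illustrate how a bucket list like this behaves, here is a minimal sketch in Python. It assumes the values are upper bounds in seconds and that buckets follow the usual Prometheus convention (inclusive upper bound, cumulative counts, implicit +Inf bucket). The function names are hypothetical and are not taken from the patch.

```python
from bisect import bisect_left

# Bucket upper bounds copied from the comment above (assumed to be seconds).
BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 15, 20, 30, 60]


def bucket_for(seconds: float) -> str:
    """Return the label of the smallest bucket whose upper bound covers the value."""
    index = bisect_left(BUCKETS, seconds)  # upper bounds are inclusive
    return str(BUCKETS[index]) if index < len(BUCKETS) else "+Inf"


def cumulative_counts(observations: list[float]) -> dict[str, int]:
    """Count observations per bucket cumulatively, as a Prometheus histogram does."""
    counts = {str(bound): sum(1 for o in observations if o <= bound) for bound in BUCKETS}
    counts["+Inf"] = len(observations)
    return counts


if __name__ == "__main__":
    # Hypothetical timer samples: a fast request, a typical one, and an outlier.
    samples = [0.003, 0.3, 75.0]
    for sample in samples:
        print(sample, "->", bucket_for(sample))
    print(cumulative_counts(samples)["+Inf"])
```

Anything slower than 60 seconds lands only in the +Inf bucket, so quantile estimates above that bound lose resolution. That is one thing to weigh when choosing the top bucket.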

Fri, Sep 1, 2:41 PM · Patch-For-Review, serviceops, Observability-Metrics

Thu, Aug 31

herron renamed T345377: Deploy puppetized statsd exporter to mw hosts from Deploy puppetized statsd_exporter to mw hosts to Deploy puppetized statsd exporter to mw hosts.
Thu, Aug 31, 3:54 PM · Patch-For-Review, User-herron, Observability-Metrics, SRE Observability (FY2023/2024-Q1)
herron updated the task description for T343023: Deploy StatsD Exporter to production.
Thu, Aug 31, 3:54 PM · User-herron, Observability-Metrics, SRE Observability (FY2023/2024-Q1)
herron updated the task description for T343023: Deploy StatsD Exporter to production.
Thu, Aug 31, 3:54 PM · User-herron, Observability-Metrics, SRE Observability (FY2023/2024-Q1)
herron created T345377: Deploy puppetized statsd exporter to mw hosts.
Thu, Aug 31, 3:53 PM · Patch-For-Review, User-herron, Observability-Metrics, SRE Observability (FY2023/2024-Q1)

Wed, Aug 30

herron updated the task description for T344937: Decom dispatch infrastructure.
Wed, Aug 30, 1:59 PM · Patch-For-Review, Incident Tooling, User-herron

Tue, Aug 29

herron changed the status of T343987: Switch thanos-fe to cfssl from Open to Stalled.

Yes, stalling is fine. The original reason for the switch to cfssl was related to adding a SAN to the thanos-fe certificate. That shouldn't be blocked since we can still use cergen for the time being.

Tue, Aug 29, 1:49 PM · Patch-For-Review, Observability-Metrics
herron added a comment to T326657: Add prometheus-https load balancer.

Hi,

Just FYI, JS and CSS are currently broken on prometheus-{eqiad,codfw}.wikipedia.org due to 401 and 403 errors, with some CORS sprinkled in

Tue, Aug 29, 1:43 PM · Traffic, Patch-For-Review, Observability-Metrics

Fri, Aug 25

herron updated subscribers of T341606: Investigate why Traffic SLO Grafana dashboard has negative values on combined SLI.

@BCornwall fwiw switching from "sli good" to "sli bad" does have the above in mind, namely by working with the small margin of error (by switching the calculation to a bad and total metric) instead of against it (attempting to maintain identical good and total metric values). That'd be actionable in the near term and would avoid the negative sli with the exception of edge cases where haproxy is serving 100% errors. With that said, looping in @colewhite and @fgiunchedi for their thoughts and potential alternatives
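
A rough sketch of the arithmetic behind this suggestion, in Python. The dashboard's actual queries are not shown here, so the formulas, function names, and counts below are assumptions made for illustration only. The idea is that when "good" and "total" are nearly equal, a small measurement skew can push good above total, whereas a skewed "bad" count stays far below total unless almost everything is failing.

```python
def sli_from_good(good: float, total: float) -> float:
    """SLI computed as the fraction of good requests."""
    return good / total


def sli_from_bad(bad: float, total: float) -> float:
    """SLI computed as one minus the fraction of bad requests."""
    return 1.0 - bad / total


def error_ratio(sli: float) -> float:
    """Share of requests that missed the objective, derived from the SLI."""
    return 1.0 - sli


if __name__ == "__main__":
    total = 1000.0        # hypothetical request count
    good_skewed = 1002.0  # hypothetical: good slightly exceeds total due to skew
    bad_skewed = 5.0      # hypothetical: 3 real errors plus the same skew of 2

    # Good-based: the skew is larger than the real error count, so the ratio goes negative.
    print(error_ratio(sli_from_good(good_skewed, total)))

    # Bad-based: the same skew only nudges a small positive number.
    print(error_ratio(sli_from_bad(bad_skewed, total)))

    # Edge case mentioned above: near 100% errors, bad can exceed total and the SLI dips below zero.
    print(sli_from_bad(1002.0, total))
```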

Fri, Aug 25, 1:41 PM · Patch-For-Review, Traffic

Thu, Aug 24

herron added a comment to T344937: Decom dispatch infrastructure.

@lmata could you please confirm if/when ready to proceed with decom of dispatch infra?

Thu, Aug 24, 4:53 PM · Patch-For-Review, Incident Tooling, User-herron
herron closed T313228: Deploy Dispatch for SRE incident workflow automation as Declined.

Closing as dispatch has been ruled out as an option: See T308467 for follow-up discussion of where we're going.

Thu, Aug 24, 4:49 PM · User-herron, Incident Tooling
herron closed T313228: Deploy Dispatch for SRE incident workflow automation, a subtask of T308467: implementing an incident response workflow automation tool for SRE, as Declined.
Thu, Aug 24, 4:49 PM · Incident Tooling, SRE-OnFire
herron updated the task description for T308467: implementing an incident response workflow automation tool for SRE.
Thu, Aug 24, 4:48 PM · Incident Tooling, SRE-OnFire
herron changed the status of T313229: Production Dispatch Infrastructure from Resolved to Declined.
Thu, Aug 24, 4:40 PM · Incident Tooling, User-herron
herron changed the status of T313229: Production Dispatch Infrastructure, a subtask of T313228: Deploy Dispatch for SRE incident workflow automation, from Resolved to Declined.
Thu, Aug 24, 4:40 PM · User-herron, Incident Tooling
herron closed T313229: Production Dispatch Infrastructure, a subtask of T313228: Deploy Dispatch for SRE incident workflow automation, as Resolved.
Thu, Aug 24, 4:40 PM · User-herron, Incident Tooling
herron closed T313229: Production Dispatch Infrastructure as Resolved.
Thu, Aug 24, 4:40 PM · Incident Tooling, User-herron
herron created T344937: Decom dispatch infrastructure.
Thu, Aug 24, 4:39 PM · Patch-For-Review, Incident Tooling, User-herron
herron added a comment to T313229: Production Dispatch Infrastructure.

Agree, although let's create a decom task for that as there are some services on the alert hosts to clean up as well

Thu, Aug 24, 4:36 PM · Incident Tooling, User-herron
herron awarded T240667: Ingestion errors for production logs on ELK7 a Like token.
Thu, Aug 24, 2:01 PM · Observability-Logging, observability, SRE, Wikimedia-Logstash

Wed, Aug 23

herron triaged T343023: Deploy StatsD Exporter to production as Medium priority.
Wed, Aug 23, 2:57 PM · User-herron, Observability-Metrics, SRE Observability (FY2023/2024-Q1)
herron claimed T343023: Deploy StatsD Exporter to production.
Wed, Aug 23, 2:57 PM · User-herron, Observability-Metrics, SRE Observability (FY2023/2024-Q1)

Aug 21 2023

herron added a comment to T331512: Support for multiple SSO thanos-web backends.

I'm curious if the recent maglev hashing 'mh' inclusion/migration in T263797 provides any improvement here. On paper using 'mh' scheduler should address session stickiness better than 'sh' did.

Aug 21 2023, 2:58 PM · Observability-Metrics

Aug 18 2023

herron added a comment to T326657: Add prometheus-https load balancer.

I think we'll need two distinct services there, i.e. to be able to change prometheus (the existing internal endpoint) and prometheus-https (the web interface) independently:

# confctl select service=prometheus.* get
{"prometheus2005.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=prometheus,service=prometheus"}
{"prometheus2006.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=prometheus,service=prometheus"}
{"prometheus1005.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=prometheus,service=prometheus"}
{"prometheus1006.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=prometheus,service=prometheus"}
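
As a small illustration of the point above, the following Python sketch parses the four output lines and groups hosts by their `service` tag. It assumes each line is a standalone JSON object with one hostname key plus a `tags` string, as in the output shown. The helper names are hypothetical and are not part of confctl.

```python
import json
from collections import defaultdict

# The confctl output from the comment above, one JSON object per line.
CONFCTL_OUTPUT = """\
{"prometheus2005.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=prometheus,service=prometheus"}
{"prometheus2006.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=prometheus,service=prometheus"}
{"prometheus1005.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=prometheus,service=prometheus"}
{"prometheus1006.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=prometheus,service=prometheus"}
"""


def parse_line(line: str) -> dict:
    """Flatten one output line into host, weight, pooled, and the individual tags."""
    obj = json.loads(line)
    tags = dict(pair.split("=", 1) for pair in obj.pop("tags").split(","))
    (host, state), = obj.items()  # exactly one hostname key remains
    return {"host": host, **state, **tags}


def hosts_by_service(output: str) -> dict[str, list[str]]:
    """Group hostnames by the value of their service tag."""
    grouped = defaultdict(list)
    for line in output.splitlines():
        if line.strip():
            entry = parse_line(line)
            grouped[entry["service"]].append(entry["host"])
    return dict(grouped)


if __name__ == "__main__":
    for service, hosts in hosts_by_service(CONFCTL_OUTPUT).items():
        print(service, hosts)
```

With the output above, every host falls under the single `prometheus` service, so pooling or depooling a host affects both endpoints together.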
Aug 18 2023, 6:04 PM · Traffic, Patch-For-Review, Observability-Metrics

Aug 17 2023

herron updated subscribers of T343987: Switch thanos-fe to cfssl.
Aug 17 2023, 4:14 PM · Patch-For-Review, Observability-Metrics

Aug 15 2023

herron changed the status of T344202: Create VictorOps config for new Data Platform SRE team from Open to Stalled.

Thanks for the info. With this in mind I'm going to stall this victorops setup task while the details of the desired alerting/paging/team layout are decided. Once that's sorted please update with the desired team name(s) and members and we'll work on setting that up. Thanks!

Aug 15 2023, 5:00 PM · observability, Observability-Alerting, Data-Platform-SRE
herron changed the status of T344202: Create VictorOps config for new Data Platform SRE team, a subtask of T342578: Ensure Data Platform SREs have a contact group in puppet/alerting, from Open to Stalled.
Aug 15 2023, 4:59 PM · Data-Platform-SRE
herron awarded T326657: Add prometheus-https load balancer a Love token.
Aug 15 2023, 4:12 PM · Traffic, Patch-For-Review, Observability-Metrics
herron added a comment to T344202: Create VictorOps config for new Data Platform SRE team.

Hey @bking, thanks for the task! Could you please point me towards the current team roster in order to get the ball rolling for team/account creations?

Aug 15 2023, 4:10 PM · observability, Observability-Alerting, Data-Platform-SRE

Aug 14 2023

herron added a comment to T326657: Add prometheus-https load balancer.

Thanks for the patch @BCornwall LGTM

Aug 14 2023, 5:01 PM · Traffic, Patch-For-Review, Observability-Metrics

Aug 10 2023

herron renamed T343987: Switch thanos-fe to cfssl from Switch thanos-fe to csffl to Switch thanos-fe to cfssl.
Aug 10 2023, 2:19 PM · Patch-For-Review, Observability-Metrics
herron triaged T343987: Switch thanos-fe to cfssl as Medium priority.
Aug 10 2023, 2:16 PM · Patch-For-Review, Observability-Metrics

Aug 8 2023

herron created P50191 (An Untitled Masterwork).
Aug 8 2023, 2:00 PM

Aug 7 2023

herron closed T342998: Splunk not displaying rotations correctly as Resolved.

AFAIK this is sorted, please reopen if follow up is needed

Aug 7 2023, 4:02 PM · SRE Observability (FY2023/2024-Q1), Observability-Alerting

Aug 3 2023

herron added a comment to T341606: Investigate why Traffic SLO Grafana dashboard has negative values on combined SLI.

Based on the IRC conversation, wouldn't A have the same issues that we currently have?

Aug 3 2023, 8:38 PM · Patch-For-Review, Traffic
herron added a comment to T341606: Investigate why Traffic SLO Grafana dashboard has negative values on combined SLI.

Thanks for creating a tracking task! Quickly adding my notes from looking into this:

Aug 3 2023, 7:34 PM · Patch-For-Review, Traffic

Jul 27 2023

herron closed T326657: Add prometheus-https load balancer, a subtask of T301944: Web interface to navigate Prometheus alerts and their status, as Resolved.
Jul 27 2023, 3:50 PM · User-herron, Observability-Metrics
herron closed T326657: Add prometheus-https load balancer as Resolved.
Jul 27 2023, 3:50 PM · Traffic, Patch-For-Review, Observability-Metrics

Jul 19 2023

herron removed a project from T334733: Grant IdempotentWrite Kafka Cluster ACL to User:ANONYMOUS in all Kafka clusters: SRE Observability.

Untagging observability to table this wrt the kafka-logging cluster for the time being. Will need to revisit the kafka-logging acl config in more detail as part of planning out the kafka upgrade to 3.x

Jul 19 2023, 2:22 PM · Data-Platform-SRE, SRE, Data-Engineering

Jul 18 2023

herron awarded T341691: mw-log-cleanup fails often with errors about files not found a Love token.
Jul 18 2023, 1:34 PM · SRE Observability (FY2023/2024-Q1), Observability-Logging

Jun 29 2023

herron added a comment to T340772: Dispatch: enable gmail plugin.

When attempting to manually reproduce the api call that is erroring:

"code": 401,
"message": "API keys are not supported by this API. Expected OAuth2 access token or other authentication credentials that assert a principal. See https://cloud.google.com/docs/authentication",

Looking into alternate credentials

Jun 29 2023, 6:07 PM · Incident Tooling, User-herron
herron added a comment to T340772: Dispatch: enable gmail plugin.
Jun 29 17:00:40 alert1001 docker-dispatch[1484]: ERROR:dispatch.incident.messaging:Error in sending welcome email to redacted@wikimedia.org: [Errno 20] Not a directory: '/usr/local/bin/mjml'
Jun 29 2023, 5:08 PM · Incident Tooling, User-herron
herron placed T340772: Dispatch: enable gmail plugin up for grabs.

Gmail plugin is now enabled using the existing dispatch google cloud credentials (same as used by drive and docs plugins)

Jun 29 2023, 5:04 PM · Incident Tooling, User-herron
herron created T340772: Dispatch: enable gmail plugin.
Jun 29 2023, 5:03 PM · Incident Tooling, User-herron

Jun 28 2023

herron renamed T324725: Observability Bookworm/Bullseye upgrades from Observability Bullseye upgrades to Observability Bookworm/Bullseye upgrades.
Jun 28 2023, 2:19 PM · SRE Observability (FY2023/2024-Q1)

Jun 16 2023

herron added a comment to T339374: victorops: transition "SRE Business Hours (Escalation)" to use sub-policies.

Offhand, we will need to confirm that vo-escalate, vopsbot and the oncall schedule generator will work (or be adjusted to work) normally after this change.

Jun 16 2023, 4:57 PM · Observability-Alerting
herron triaged T339374: victorops: transition "SRE Business Hours (Escalation)" to use sub-policies as Medium priority.
Jun 16 2023, 4:56 PM · Observability-Alerting

Jun 13 2023

herron added a comment to T324725: Observability Bookworm/Bullseye upgrades.

The Grafana hosts are getting tight on disk space, let's bump disk spec for the bullseye hosts to something like 30GB

Jun 13 2023, 7:13 PM · SRE Observability (FY2023/2024-Q1)

Jun 9 2023

herron added a comment to T326657: Add prometheus-https load balancer.

@herron I see there's an unresolved conversation in that patch. Since @Vgutierrez +1ed it before that conversation, I just want to make sure that it is, indeed, ready for merging.

Jun 9 2023, 7:15 PM · Traffic, Patch-For-Review, Observability-Metrics

Jun 8 2023

herron closed T277445: Hourly log rotation for large MW logs as Resolved.

(fixed in T338127)

Jun 8 2023, 1:31 PM · User-herron, Observability-Logging, Developer Productivity, Platform Team Workboards (Clinic Duty Team)
herron closed T338127: log rotation stopped on mwlog for all files but "api.log", a subtask of T277445: Hourly log rotation for large MW logs, as Resolved.
Jun 8 2023, 1:30 PM · User-herron, Observability-Logging, Developer Productivity, Platform Team Workboards (Clinic Duty Team)
herron closed T338127: log rotation stopped on mwlog for all files but "api.log" as Resolved.

Checking in this morning, I see both /srv/mw-log/api.log hourly and /srv/mw-log/*.log daily rotations happened. And with the 6to4 relay in place I'm seeing logs arrive on both sides of the "udp tee". I think we're good here!

Jun 8 2023, 1:30 PM · User-herron, Observability-Logging, Developer Productivity, Platform Team Workboards (Clinic Duty Team)

Jun 7 2023

herron added a comment to T338127: log rotation stopped on mwlog for all files but "api.log".

Also note that mwlog2002 does not regularly rotate non-api.log logs, yet doesn't suffer from the "big log files" issue, suggesting the mirroring/teeing of udp2log traffic to it isn't working as expected

Jun 7 2023, 5:13 PM · User-herron, Observability-Logging, Developer Productivity, Platform Team Workboards (Clinic Duty Team)

Jun 6 2023

herron added a comment to T338127: log rotation stopped on mwlog for all files but "api.log".

3.21.0 looks promising. Here are my notes from testing on a throwaway VM. The behavior varies depending on the version and the configuration order.

Jun 6 2023, 4:59 PM · User-herron, Observability-Logging, Developer Productivity, Platform Team Workboards (Clinic Duty Team)
herron added a comment to T338127: log rotation stopped on mwlog for all files but "api.log".

Checking back on this after rotations overnight, looks like we're not out of the woods yet. Daily rotations are happening once again, however api.log is now being rotated daily as well. Looks like we're hitting https://github.com/logrotate/logrotate/issues/38

Jun 6 2023, 3:38 PM · User-herron, Observability-Logging, Developer Productivity, Platform Team Workboards (Clinic Duty Team)

Jun 5 2023

herron placed T331461: Logstash SLO excursion on 2023-02-11 up for grabs.
Jun 5 2023, 5:00 PM · SRE Observability (FY2023/2024-Q1), Wikimedia-Logstash, Observability-Logging, SRE
herron added a comment to T338127: log rotation stopped on mwlog for all files but "api.log".

looks like logrotate is treating the glob meant to exclude api.log as a literal filename 🤦‍♂️

Jun 5 2023, 1:43 PM · User-herron, Observability-Logging, Developer Productivity, Platform Team Workboards (Clinic Duty Team)

Jun 2 2023

herron created T338047: mirror1001 apache server reached MaxRequestWorkers.
Jun 2 2023, 4:27 PM · Infrastructure-Foundations
herron added a comment to T333614: Upgrade mwlog hosts to Bullseye.

Is there a task about the udp2log porting work to Python 3, or will that be unnecessary due to T205856?

Jun 2 2023, 2:23 PM · User-herron, SRE Observability (FY2022/2023-Q4)

Jun 1 2023

herron added a parent task for T333614: Upgrade mwlog hosts to Bullseye: T324725: Observability Bookworm/Bullseye upgrades.
Jun 1 2023, 3:03 PM · User-herron, SRE Observability (FY2022/2023-Q4)
herron added a subtask for T324725: Observability Bookworm/Bullseye upgrades: T333614: Upgrade mwlog hosts to Bullseye.
Jun 1 2023, 3:03 PM · SRE Observability (FY2023/2024-Q1)
herron updated the task description for T324725: Observability Bookworm/Bullseye upgrades.
Jun 1 2023, 3:03 PM · SRE Observability (FY2023/2024-Q1)
herron closed T333614: Upgrade mwlog hosts to Bullseye as Resolved.

mwlog1002 has been upgraded to bullseye as well, resolving!

Jun 1 2023, 3:02 PM · User-herron, SRE Observability (FY2022/2023-Q4)
herron closed T327277: Move excimer/arclamp redis from mwlog to arclamp hosts as Resolved.
Jun 1 2023, 3:01 PM · User-herron, Performance-Team (Radar), Observability-Tracing
herron closed T327277: Move excimer/arclamp redis from mwlog to arclamp hosts, a subtask of T328707: Update arclamp to active/active architecture, as Resolved.
Jun 1 2023, 3:01 PM · Arc-Lamp, Observability-Tracing

May 30 2023

herron added a comment to T333614: Upgrade mwlog hosts to Bullseye.

mwlog2002 is up and running now on bullseye. I made a cursory attempt to use python3, but after fixing errors thrown and getting the daemons up and running under python3, it still wasn't writing logs to the filesystem.

May 30 2023, 4:44 PM · User-herron, SRE Observability (FY2022/2023-Q4)
herron added a comment to T333614: Upgrade mwlog hosts to Bullseye.

END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mwlog2002.codfw.wmnet with OS bullseye

May 30 2023, 3:52 PM · User-herron, SRE Observability (FY2022/2023-Q4)
herron added a comment to T333614: Upgrade mwlog hosts to Bullseye.

I live-edited reuse-lvm-root-4dev.cfg, adding this to the bottom of the file. After another reimage, the host boots into the OS and is accessible from install_console

May 30 2023, 2:47 PM · User-herron, SRE Observability (FY2022/2023-Q4)
herron added a comment to T333614: Upgrade mwlog hosts to Bullseye.

mwlog2002 is throwing an error and dropping into grub rescue after reimage with the reuse partitions recipe, going to try and troubleshoot the recipe

May 30 2023, 2:20 PM · User-herron, SRE Observability (FY2022/2023-Q4)

May 25 2023

herron updated the task description for T327277: Move excimer/arclamp redis from mwlog to arclamp hosts.
May 25 2023, 1:42 PM · User-herron, Performance-Team (Radar), Observability-Tracing
herron added a comment to T327277: Move excimer/arclamp redis from mwlog to arclamp hosts.

I'm not seeing any traffic arriving at mwlog1002 on port 6379, and I'm considering this confirmation that mwlog has no further redis clients

May 25 2023, 1:42 PM · User-herron, Performance-Team (Radar), Observability-Tracing

May 24 2023

herron updated the task description for T327277: Move excimer/arclamp redis from mwlog to arclamp hosts.
May 24 2023, 1:17 PM · User-herron, Performance-Team (Radar), Observability-Tracing

May 23 2023

herron closed T335424: kafkamon: upgrade to bullseye as Resolved.
May 23 2023, 2:10 PM · SRE Observability (FY2022/2023-Q4)
herron closed T335424: kafkamon: upgrade to bullseye, a subtask of T324725: Observability Bookworm/Bullseye upgrades, as Resolved.
May 23 2023, 2:10 PM · SRE Observability (FY2023/2024-Q1)
herron added a comment to T335424: kafkamon: upgrade to bullseye.

cookbooks.sre.hosts.decommission executed by herron@cumin1001 for hosts: kafkamon2002.codfw.wmnet

  • kafkamon2002.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook, run it manually

ERROR: some step on some host failed, check the bolded items above

May 23 2023, 2:09 PM · SRE Observability (FY2022/2023-Q4)
herron added a comment to T335424: kafkamon: upgrade to bullseye.

Good catch, thanks. I had to take a look through my history on cumin1001 because I remember decomming these. Turns out I ran the cookbook with the -d (dry run) flag enabled 🤦‍♂️ Will re-run these decoms now.

May 23 2023, 1:45 PM · SRE Observability (FY2022/2023-Q4)

May 17 2023

herron moved T302995: Explore dedicated (non-grafana) SLO Visualization and Management from Backlog to Working on on the User-herron board.
May 17 2023, 1:56 PM · SRE Observability (FY2023/2024-Q1), Patch-For-Review, User-herron, Observability-Metrics
herron moved T327277: Move excimer/arclamp redis from mwlog to arclamp hosts from Backlog to Working on on the User-herron board.
May 17 2023, 1:56 PM · User-herron, Performance-Team (Radar), Observability-Tracing
herron moved T333614: Upgrade mwlog hosts to Bullseye from Backlog to Working on on the User-herron board.
May 17 2023, 1:55 PM · User-herron, SRE Observability (FY2022/2023-Q4)
herron moved T326419: Expand kafka-logging using hosts kafka-logging[12]00[45] from Working on to Backlog on the User-herron board.
May 17 2023, 1:55 PM · SRE Observability (FY2023/2024-Q1), User-herron, Observability-Logging
herron updated the task description for T326419: Expand kafka-logging using hosts kafka-logging[12]00[45].
May 17 2023, 1:54 PM · SRE Observability (FY2023/2024-Q1), User-herron, Observability-Logging
herron closed T324470: SRE/Oncall/Schedule: add history support as Resolved.
May 17 2023, 1:54 PM · User-herron, SRE Observability
herron closed T301944: Web interface to navigate Prometheus alerts and their status as Resolved.
May 17 2023, 1:54 PM · User-herron, Observability-Metrics
herron closed T301944: Web interface to navigate Prometheus alerts and their status, a subtask of T288622: All Prometheus based alerts move from Icinga to alert manager exclusively, as Resolved.
May 17 2023, 1:54 PM · SRE Observability (FY2023/2024-Q1)
herron added a project to T333614: Upgrade mwlog hosts to Bullseye: User-herron.
May 17 2023, 1:53 PM · User-herron, SRE Observability (FY2022/2023-Q4)

May 16 2023

herron closed T277445: Hourly log rotation for large MW logs as Resolved.

Looking good, I see the first hourly rotated file on disk

May 16 2023, 7:09 PM · User-herron, Observability-Logging, Developer Productivity, Platform Team Workboards (Clinic Duty Team)
herron updated the task description for T327277: Move excimer/arclamp redis from mwlog to arclamp hosts.
May 16 2023, 2:35 PM · User-herron, Performance-Team (Radar), Observability-Tracing
herron updated the task description for T335042: codfw row D switches upgrade.
May 16 2023, 2:24 PM · Data-Platform-SRE, Discovery-Search (Current work), SRE, DBA, netops, Machine-Learning-Team, Traffic, collaboration-services, SRE Observability, serviceops, cloud-services-team, Infrastructure-Foundations, Platform Engineering

May 15 2023

herron updated the task description for T335042: codfw row D switches upgrade.
May 15 2023, 3:41 PM · Data-Platform-SRE, Discovery-Search (Current work), SRE, DBA, netops, Machine-Learning-Team, Traffic, collaboration-services, SRE Observability, serviceops, cloud-services-team, Infrastructure-Foundations, Platform Engineering

May 10 2023

herron updated herron.
May 10 2023, 1:18 PM

May 8 2023

herron added a project to T327277: Move excimer/arclamp redis from mwlog to arclamp hosts: User-herron.
May 8 2023, 3:14 PM · User-herron, Performance-Team (Radar), Observability-Tracing
herron added a project to T277445: Hourly log rotation for large MW logs: User-herron.
May 8 2023, 3:14 PM · User-herron, Observability-Logging, Developer Productivity, Platform Team Workboards (Clinic Duty Team)
herron awarded T334880: cookbooks.sre.hosts.reimage should not fail if the first Puppet run failed and if the user was prompted a Love token.
May 8 2023, 2:24 PM · SRE-tools, Infrastructure-Foundations, Traffic, SRE