Wed, Nov 29

herron closed T349159: Transfer Arc Lamp alerts (Grafana and cron errors_mailto) to SRE O11y as Resolved.

optimistically resolving now that the checklist in the desc is complete

Wed, Nov 29, 2:50 PM · MediaWiki-Platform-Team (Radar), SRE Observability (FY2023/2024-Q2), observability
herron updated the task description for T349159: Transfer Arc Lamp alerts (Grafana and cron errors_mailto) to SRE O11y.
Wed, Nov 29, 2:49 PM · MediaWiki-Platform-Team (Radar), SRE Observability (FY2023/2024-Q2), observability

Tue, Nov 28

herron added a comment to T349159: Transfer Arc Lamp alerts (Grafana and cron errors_mailto) to SRE O11y.

can you help me identify what is relevant to:

  • Transfer Grafana alerts (AlertManager).

...I'm hoping we may be able to close this task with your latest patch.

Tue, Nov 28, 4:04 PM · MediaWiki-Platform-Team (Radar), SRE Observability (FY2023/2024-Q2), observability
herron triaged T352128: No on-call page notification when shift override was set on November 27 as Medium priority.

As we got an email from VO about unassigned overrides I think that the issue here is that only one rotation was assigned and not the one that actually pages:

Tue, Nov 28, 3:54 PM · Incident Tooling

Mon, Nov 27

herron updated the task description for T349159: Transfer Arc Lamp alerts (Grafana and cron errors_mailto) to SRE O11y.
Mon, Nov 27, 3:30 PM · MediaWiki-Platform-Team (Radar), SRE Observability (FY2023/2024-Q2), observability

Thu, Nov 16

herron added a comment to T351390: Istio recording rules for Pyrra and Grizzly.

One option that comes to mind is relabeling with something like labelkeep to ingest only the labels we want/need on the prometheus side. That'd let us cut down without modifying the framework at the source.
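
To make the labelkeep idea concrete, here is a minimal Python sketch of the semantics described in the Prometheus relabeling documentation: the regex is fully anchored, it is matched against label names, and any label that does not match is removed. The sample label set and the regex below are hypothetical illustrations, not the labels actually used for these recording rules.

```python
import re


def labelkeep(labels: dict, regex: str) -> dict:
    """Keep only labels whose *name* fully matches `regex`.

    Mirrors the documented behaviour of `action: labelkeep`;
    Prometheus anchors relabel regexes, hence fullmatch().
    """
    pattern = re.compile(regex)
    return {name: value for name, value in labels.items() if pattern.fullmatch(name)}


# Hypothetical label set, for illustration only.
sample = {
    "__name__": "istio_requests_total",
    "destination_service": "example.svc",
    "response_code": "200",
    "source_workload": "client",
    "request_protocol": "http",
}

kept = labelkeep(sample, r"__name__|destination_service|response_code")
print(kept)
```

One caveat from the Prometheus documentation applies here: after labels are dropped, the remaining label set still has to identify each series uniquely, so the keep list needs to be chosen with that in mind.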

Thu, Nov 16, 3:20 PM · SRE Observability (FY2023/2024-Q2), Patch-For-Review, Machine-Learning-Team, observability

Wed, Nov 15

herron added a comment to T351179: LVM vg0 close to getting full on prometheus eqiad.

Prometheus1005 is an R440, which should have 10 total 2.5" bays; today there are (6) 2T SSDs installed. I think it'd be worth getting the ball rolling on adding another (4) SSDs.

Wed, Nov 15, 4:07 PM · Patch-For-Review, User-fgiunchedi, Observability-Metrics

Mon, Nov 13

herron updated the task description for T302995: Explore Pyrra for SLO Visualization and Management.
Mon, Nov 13, 6:11 PM · SRE Observability (FY2023/2024-Q2), Patch-For-Review, User-herron, Observability-Metrics
herron moved T351111: Add footer including privacy policy to slo.wikimedia.org (pyrra) from Inbox to Backlog on the SRE Observability board.
Mon, Nov 13, 6:10 PM · SRE Observability
herron renamed T351111: Add footer including privacy policy to slo.wikimedia.org (pyrra) from Missing links for slo.wikimedia.org to Add footer including privacy policy to slo.wikimedia.org (pyrra).
Mon, Nov 13, 6:09 PM · SRE Observability

Thu, Nov 9

herron updated the task description for T350591: Audit legacy mediawiki stats used in production dashboards.
Thu, Nov 9, 7:12 PM · SRE Observability (FY2023/2024-Q2), Observability-Metrics
herron added a comment to T350591: Audit legacy mediawiki stats used in production dashboards.

I spent some time today experimenting with https://github.com/grafana/cortex-tools, specifically cortextool analyse grafana, which looked promising but unfortunately throws parse errors when it encounters a period in a metric name, which makes it unsuitable for Graphite metrics.
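
One possible alternative to a PromQL-oriented parser is to read the dashboard JSON directly and collect the Graphite query strings as-is. The sketch below is only an illustration under stated assumptions: it assumes Graphite panels store their query under `targets[].target`, that child panels may be nested under `panels` or (in older dashboards) under `rows`, and the sample dashboard and metric names are made up.

```python
import json


def graphite_targets(dashboard: dict) -> list:
    """Collect Graphite query strings from a Grafana dashboard JSON model."""
    found = []
    stack = list(dashboard.get("panels", []))
    # Older dashboard schemas nest panels under "rows".
    for row in dashboard.get("rows", []):
        stack.extend(row.get("panels", []))
    while stack:
        panel = stack.pop()
        # Row panels may hold their children in a nested "panels" list.
        stack.extend(panel.get("panels", []))
        for target in panel.get("targets", []):
            query = target.get("target")  # Prometheus targets use "expr" instead
            if isinstance(query, str) and query:
                found.append(query)
    return sorted(found)


# Hypothetical dashboard, for illustration only.
sample = json.loads("""
{
  "panels": [
    {"title": "Edits", "targets": [{"target": "MediaWiki.example.edits.rate"}]},
    {"type": "row", "panels": [
      {"title": "Saves", "targets": [{"target": "sumSeries(MediaWiki.example.save.*.count)"}]}
    ]},
    {"title": "Prometheus panel", "targets": [{"expr": "up"}]}
  ]
}
""")

print(graphite_targets(sample))
```

Because the query strings are never parsed, dotted metric names pass through untouched; splitting out individual metric names from Graphite function calls would still need a separate step.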

Thu, Nov 9, 7:10 PM · SRE Observability (FY2023/2024-Q2), Observability-Metrics
herron renamed T350591: Audit legacy mediawiki stats used in production dashboards from Audit legacy mediawiki stats used in production to Audit legacy mediawiki stats used in production dashboards.
Thu, Nov 9, 2:47 PM · SRE Observability (FY2023/2024-Q2), Observability-Metrics
herron renamed T350591: Audit legacy mediawiki stats used in production dashboards from Audit & convert stats in use in production to statslib to Audit legacy mediawiki stats used in production.
Thu, Nov 9, 2:46 PM · SRE Observability (FY2023/2024-Q2), Observability-Metrics

Wed, Nov 8

herron added a subtask for T222826: Leverage Grafana annotations to show events in graphs: T350825: Loki: add a channel(s) for git commits.
Wed, Nov 8, 7:32 PM · Observability-Logging, SRE
herron added a parent task for T350825: Loki: add a channel(s) for git commits: T222826: Leverage Grafana annotations to show events in graphs.
Wed, Nov 8, 7:32 PM · Observability-Logging, Observability-Metrics
herron added projects to T350825: Loki: add a channel(s) for git commits: Observability-Metrics, Observability-Logging.
Wed, Nov 8, 7:32 PM · Observability-Logging, Observability-Metrics
herron triaged T350825: Loki: add a channel(s) for git commits as Medium priority.
Wed, Nov 8, 7:32 PM · Observability-Logging, Observability-Metrics
herron added a comment to T350461: Set nofail for raid0 recipes.

I've had success mounting non-root filesystems that were unreliable (networked fs, external arrays, these kinds of things) using autofs, which these days can be done in systemd.

Wed, Nov 8, 6:32 PM · Infrastructure-Foundations, User-fgiunchedi, SRE

Tue, Nov 7

herron added a comment to T350434: Logstash collector tuning.

Thanks for the input everyone! Sounds like we have a consensus on option 1. I'll get started with rolling collector VM reboots into 12GB memory then upload a patch for the JVMs and go from there.

Tue, Nov 7, 2:41 PM · SRE Observability (FY2023/2024-Q2), Observability-Logging

Mon, Nov 6

herron added a comment to T350434: Logstash collector tuning.

Would we have enough Ganeti resources to bump all the VMs to 12GB?

Mon, Nov 6, 5:50 PM · SRE Observability (FY2023/2024-Q2), Observability-Logging

Fri, Nov 3

herron updated the task description for T350506: Explore Grafana OnCall for on-call schedule management and alert/page routing.
Fri, Nov 3, 7:41 PM · Observability-Alerting
herron created T350508: Grafana OnCall: Production service setup.
Fri, Nov 3, 7:41 PM · Observability-Alerting
herron triaged T350506: Explore Grafana OnCall for on-call schedule management and alert/page routing as Medium priority.
Fri, Nov 3, 7:30 PM · Observability-Alerting
herron updated the task description for T302995: Explore Pyrra for SLO Visualization and Management.
Fri, Nov 3, 2:15 PM · SRE Observability (FY2023/2024-Q2), Patch-For-Review, User-herron, Observability-Metrics
herron updated the task description for T302995: Explore Pyrra for SLO Visualization and Management.
Fri, Nov 3, 1:59 PM · SRE Observability (FY2023/2024-Q2), Patch-For-Review, User-herron, Observability-Metrics

Thu, Nov 2

herron added a comment to T350434: Logstash collector tuning.

Personally I'm for doing option 1 right away, seeing how we handle the next log spike(s), and going from there. One of the main upsides from my view is that this option changes the fewest things: essentially only the logstash JVM size, as opposed to shuffling services around or introducing new host variants.

Thu, Nov 2, 7:53 PM · SRE Observability (FY2023/2024-Q2), Observability-Logging
herron triaged T350434: Logstash collector tuning as Medium priority.
Thu, Nov 2, 7:45 PM · SRE Observability (FY2023/2024-Q2), Observability-Logging
herron closed T350402: logstash::collector apache high cpu usage as Resolved.

After a rolling restart of apache2, CPU utilization is down to essentially 0. Resolving for now; will investigate further if CPU utilization rises again.

Thu, Nov 2, 3:27 PM · Observability-Logging
herron added a comment to T350402: logstash::collector apache high cpu usage.

Mentioned in SAL (#wikimedia-operations) [2023-11-02T14:56:17Z] <herron> logstash1025 systemctl restart apache2.service T350402

Thu, Nov 2, 2:58 PM · Observability-Logging
herron triaged T350402: logstash::collector apache high cpu usage as Medium priority.
Thu, Nov 2, 2:53 PM · Observability-Logging

Oct 31 2023

herron updated the task description for T240685: MediaWiki Prometheus support.
Oct 31 2023, 2:20 PM · SRE Observability (FY2023/2024-Q2), MW-1.41-notes (1.41.0-wmf.28; 2023-09-26), MW-1.40-notes (1.40.0-wmf.27; 2023-03-13), MW-1.38-notes (1.38.0-wmf.19; 2022-01-24), MediaWiki-libs-Stats, Platform Team Workboards (External Code Reviews), Patch-For-Review, serviceops, SRE, MediaWiki-General, observability
herron closed T343026: Configure Prometheus to scrape MW metrics from statsd-exporter, a subtask of T343020: Converting MediaWiki Metrics to StatsLib, as Resolved.
Oct 31 2023, 2:19 PM · SRE Observability (FY2023/2024-Q2), Observability-Metrics
herron closed T343026: Configure Prometheus to scrape MW metrics from statsd-exporter as Resolved.

MW statsd-exporter instances are being scraped by production prometheus

Oct 31 2023, 2:19 PM · SRE Observability (FY2023/2024-Q2), Observability-Metrics
herron closed T343023: Deploy StatsD Exporter to production as Resolved.
Oct 31 2023, 2:18 PM · SRE Observability (FY2023/2024-Q2), User-herron, Observability-Metrics
herron closed T343023: Deploy StatsD Exporter to production, a subtask of T343020: Converting MediaWiki Metrics to StatsLib, as Resolved.
Oct 31 2023, 2:18 PM · SRE Observability (FY2023/2024-Q2), Observability-Metrics
herron updated the task description for T343023: Deploy StatsD Exporter to production.
Oct 31 2023, 2:18 PM · SRE Observability (FY2023/2024-Q2), User-herron, Observability-Metrics
herron updated the task description for T240685: MediaWiki Prometheus support.
Oct 31 2023, 2:05 PM · SRE Observability (FY2023/2024-Q2), MW-1.41-notes (1.41.0-wmf.28; 2023-09-26), MW-1.40-notes (1.40.0-wmf.27; 2023-03-13), MW-1.38-notes (1.38.0-wmf.19; 2022-01-24), MediaWiki-libs-Stats, Platform Team Workboards (External Code Reviews), Patch-For-Review, serviceops, SRE, MediaWiki-General, observability
herron updated the task description for T343023: Deploy StatsD Exporter to production.
Oct 31 2023, 1:43 PM · SRE Observability (FY2023/2024-Q2), User-herron, Observability-Metrics
herron closed T345377: Deploy puppetized statsd exporter to mw hosts as Resolved.

With the above patch statsd_exporter has been deployed to mw hosts via puppet

Oct 31 2023, 1:42 PM · SRE Observability (FY2023/2024-Q2), User-herron, Observability-Metrics
herron closed T345377: Deploy puppetized statsd exporter to mw hosts, a subtask of T343023: Deploy StatsD Exporter to production, as Resolved.
Oct 31 2023, 1:42 PM · SRE Observability (FY2023/2024-Q2), User-herron, Observability-Metrics
herron triaged T345377: Deploy puppetized statsd exporter to mw hosts as Medium priority.
Oct 31 2023, 1:42 PM · SRE Observability (FY2023/2024-Q2), User-herron, Observability-Metrics
herron closed T344751: Decide on default histogram buckets for MediaWiki timers as Resolved.

Change 954114 merged by Herron:

[operations/puppet@production] profile::mediawiki::common: set default histogram buckets

https://gerrit.wikimedia.org/r/954114

Does this mean we're decided? 😃

Oct 31 2023, 1:34 PM · MediaWiki-libs-Stats, serviceops, Observability-Metrics
herron closed T344751: Decide on default histogram buckets for MediaWiki timers, a subtask of T240685: MediaWiki Prometheus support, as Resolved.
Oct 31 2023, 1:34 PM · SRE Observability (FY2023/2024-Q2), MW-1.41-notes (1.41.0-wmf.28; 2023-09-26), MW-1.40-notes (1.40.0-wmf.27; 2023-03-13), MW-1.38-notes (1.38.0-wmf.19; 2022-01-24), MediaWiki-libs-Stats, Platform Team Workboards (External Code Reviews), Patch-For-Review, serviceops, SRE, MediaWiki-General, observability

Oct 27 2023

herron triaged T349909: Logstash-filter-verifier: Present logstash ERROR level logs in CI as Medium priority.
Oct 27 2023, 2:06 PM · Observability-Logging

Oct 23 2023

herron added a comment to T349521: Prometheus/Pyrra: establish backfill process for recording rules.

promtool tsdb create-blocks-from rules is looking promising so far. Here's an SLO recording rule that was deployed 2 days ago with history going back 12 weeks:

Oct 23 2023, 9:21 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q2), User-herron, Observability-Metrics
herron added a comment to T349521: Prometheus/Pyrra: establish backfill process for recording rules.

The initial approach I'm thinking of is running promtool tsdb create-blocks-from rules against the recording rules created by pyrra (since they are new and have no production dependencies yet) and output that into an unused scratch directory. Then transfer that over to pontoon and experiment with loading/using them, and go from there. Sound alright?
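
As a rough sketch of what such a backfill run could look like, the Python helper below only assembles the promtool command line and does not execute anything. The `--start`, `--end`, `--url` and `--output-dir` flags are the ones documented for `promtool tsdb create-blocks-from rules`; the rule file name, scratch directory, Prometheus URL and the 12-week window are all hypothetical placeholders.

```python
import shlex
from datetime import datetime, timedelta, timezone


def backfill_command(rule_files, output_dir, url, weeks, now=None):
    """Build argv for `promtool tsdb create-blocks-from rules`.

    The time range is passed as Unix timestamps covering the last `weeks` weeks.
    """
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(weeks=weeks)
    return [
        "promtool", "tsdb", "create-blocks-from", "rules",
        f"--start={int(start.timestamp())}",
        f"--end={int(now.timestamp())}",
        f"--url={url}",                # Prometheus API to evaluate the rules against
        f"--output-dir={output_dir}",  # scratch directory that receives the new blocks
        *rule_files,
    ]


# Hypothetical paths and URL, for illustration only.
cmd = backfill_command(
    rule_files=["pyrra-rules.yaml"],
    output_dir="/tmp/backfill-scratch",
    url="http://localhost:9090",
    weeks=12,
    now=datetime(2023, 10, 23, tzinfo=timezone.utc),
)
print(shlex.join(cmd))
```

Writing to a scratch directory keeps the generated blocks away from the live TSDB until they have been inspected and copied elsewhere for testing.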

Oct 23 2023, 3:35 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q2), User-herron, Observability-Metrics
herron added a comment to T349521: Prometheus/Pyrra: establish backfill process for recording rules.

Most promising option at the moment looks like https://prometheus.io/docs/prometheus/latest/command-line/promtool/#promtool-tsdb-create-blocks-from-rules

Oct 23 2023, 2:58 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q2), User-herron, Observability-Metrics
herron created T349521: Prometheus/Pyrra: establish backfill process for recording rules.
Oct 23 2023, 2:56 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q2), User-herron, Observability-Metrics
herron renamed T302995: Explore Pyrra for SLO Visualization and Management from Explore dedicated (non-grafana) SLO Visualization and Management to Explore Pyrra for SLO Visualization and Management.
Oct 23 2023, 2:45 PM · SRE Observability (FY2023/2024-Q2), Patch-For-Review, User-herron, Observability-Metrics

Oct 19 2023

herron added a comment to T349102: Expose Thanos rule web interface.

Thank you, this has been useful already!

Oct 19 2023, 8:02 PM · User-fgiunchedi, Observability-Metrics

Oct 18 2023

herron awarded T271138: Some Observability clusters do not support IPv6. a Party Time token.
Oct 18 2023, 1:40 PM · Observability-Metrics, Wikimedia-Incident, IPv6, User-crusnov

Oct 17 2023

herron added a comment to T344751: Decide on default histogram buckets for MediaWiki timers.

Perfect, I've updated https://gerrit.wikimedia.org/r/c/operations/puppet/+/954114 to reflect this and I think with a +1 from @Krinkle we'll be good to go.

Oct 17 2023, 2:44 PM · MediaWiki-libs-Stats, serviceops, Observability-Metrics

Oct 16 2023

herron added a comment to T344751: Decide on default histogram buckets for MediaWiki timers.

Thanks SGTM overall, I'll propose just one amendment

Oct 16 2023, 9:24 PM · MediaWiki-libs-Stats, serviceops, Observability-Metrics
herron added a comment to T348756: Wikimedia\MWConfig\Profiler::excimerFlushToArclamp(): PHP Warning: RedisException: Connection timed out.
10:52 AM <+jinxer-wm> (RedisMemoryFull) resolved: Redis memory full on arclamp1001:9121 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_arclamp - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_arclamp&var-instance=arclamp1001:9121&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
Oct 16 2023, 2:53 PM · SRE Observability (FY2023/2024-Q2), observability, Arc-Lamp, Wikimedia-production-error
herron added a comment to T348756: Wikimedia\MWConfig\Profiler::excimerFlushToArclamp(): PHP Warning: RedisException: Connection timed out.

Currently maxmemory is set to 1Mb
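
For a sense of scale, the sketch below converts a Redis-style memory string to bytes using the unit rules stated in the stock redis.conf comments (`1k` is 1000 bytes, `1kb` is 1024 bytes, and so on, case insensitive). Whether "1Mb" above is the literal configured string is an assumption; the helper name is hypothetical.

```python
import re

# Unit multipliers as documented in the comments of the default redis.conf.
_UNITS = {
    "": 1,
    "k": 1000, "kb": 1024,
    "m": 1000**2, "mb": 1024**2,
    "g": 1000**3, "gb": 1024**3,
}


def redis_memory_bytes(value: str) -> int:
    """Convert a redis.conf memory value such as '1mb' to a byte count."""
    match = re.fullmatch(r"(\d+)([a-z]*)", value.strip().lower())
    if not match or match.group(2) not in _UNITS:
        raise ValueError(f"unrecognised memory value: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


print(redis_memory_bytes("1Mb"))  # 1048576
```

Read that way, the limit is about one mebibyte, which a profiler's sample buffer could plausibly exhaust quickly.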

Oct 16 2023, 2:40 PM · SRE Observability (FY2023/2024-Q2), observability, Arc-Lamp, Wikimedia-production-error
herron added a comment to T348756: Wikimedia\MWConfig\Profiler::excimerFlushToArclamp(): PHP Warning: RedisException: Connection timed out.

We have a Redis dashboard but arclamp1001.eqiad.wmnet is not collected there.

Oct 16 2023, 2:33 PM · SRE Observability (FY2023/2024-Q2), observability, Arc-Lamp, Wikimedia-production-error

Oct 13 2023

herron added a comment to T321579: Audit/log AM silences.

Above is a patch for initial audit logging of POST data via modsec. Once we have some example data to work with we can refine the rules to log more human readable entries.

Oct 13 2023, 5:32 PM · SRE Observability (FY2023/2024-Q2), User-fgiunchedi, Observability-Alerting
herron added a comment to T341606: Investigate why Traffic SLO Grafana dashboard has negative values on combined SLI.

@herron Thanks for all of your help. We've implemented varnish_sli_bad. I followed the formulae presented at the top of grafana-grizzly's slo_definitions.libsonnet and got these results. They seem to differ from the current dashboard's values, however. Would you be kind enough to double-check to make sure I'm not missing anything? Thank you. :)

Oct 13 2023, 4:08 PM · Patch-For-Review, Traffic

Oct 11 2023

herron added a comment to T321579: Audit/log AM silences.

initial inclination/impressions:

Oct 11 2023, 5:09 PM · SRE Observability (FY2023/2024-Q2), User-fgiunchedi, Observability-Alerting
herron closed T346688: Icinga contact for dr0ptp4kt as Resolved.

Done!

Oct 11 2023, 1:56 PM · SRE Observability (FY2023/2024-Q2), observability, SRE

Oct 6 2023

herron added a comment to T344953: Manage jaeger-* index lifecycle.

When I raised the idea of using curator for jaeger earlier this year the thinking then was essentially that not managing these indices in curator was a feature.

Oct 6 2023, 3:39 PM · Observability-Tracing

Oct 3 2023

herron added a comment to T344937: Decom dispatch infrastructure.

That's been sorted: RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational

Oct 3 2023, 2:32 PM · Incident Tooling, User-herron

Oct 2 2023

herron closed T343987: Switch thanos-fe to cfssl as Invalid.

We (o11y) have moved pyrra to the titan hosts, which makes this task moot. Transitioning to invalid

Oct 2 2023, 4:14 PM · Patch-For-Review, Observability-Metrics

Sep 27 2023

herron added a project to T213902: Implement sensitive logstash access control: User-herron.
Sep 27 2023, 2:36 PM · Patch-Needs-Improvement, User-herron, SRE Observability (FY2023/2024-Q2), Observability-Logging
herron added a project to T347499: Grafana oncall pilot environment (in prod/ganeti): User-herron.
Sep 27 2023, 2:33 PM · User-herron, Observability-Alerting

Sep 25 2023

herron updated the task description for T313228: Deploy Dispatch for SRE incident workflow automation.
Sep 25 2023, 5:18 PM · User-herron, Incident Tooling
herron closed T344937: Decom dispatch infrastructure as Resolved.
build2001:~$ sudo -i docker-registryctl delete-tags docker-registry.discovery.wmnet/dispatch
We're about to delete the following tags for image docker-registry.discovery.wmnet/dispatch:
latest
v20220801-1-20220821
v20220801-1-20220828
v20220801-1-20220904
v20220801-1-20220911
v20220801-1-20220918
v20220801-1-20220925
v20220801-1-20221009
v20220801-1-20221016
v20220801-1-20221023
v20220801-1
v20220915-1-20221030
v20220915-1
v20220915-2-20221106
v20220915-2-20221113
v20220915-2
v20220915-3-20221120
v20220915-3-20221127
v20220915-3-20221204
v20220915-3-20221211
v20220915-3-20221218
v20220915-3-20221225
v20220915-3-20230101
v20220915-3-20230108
v20220915-3-20230115
v20220915-3-20230122
v20220915-3-20230129
v20220915-3-20230205
v20220915-3-20230212
v20220915-3-20230305
v20220915-3-20230312
v20220915-3-20230319
v20220915-3-20230326
v20220915-3-20230402
v20220915-3-20230409
v20220915-3-20230416
v20220915-3-20230423
v20220915-3-20230430
v20220915-3-20230507
v20220915-3-20230514
v20220915-3-20230521
v20220915-3-20230528
v20220915-3-20230604
v20220915-3-20230611
v20220915-3-20230612
v20220915-3-20230618
v20220915-3-20230625
v20220915-3-20230702
v20220915-3-20230716
v20220915-3-20230723
v20220915-3-20230730
v20220915-3-20230806
v20220915-3-20230813
v20220915-3-20230820
v20220915-3-20230827
v20220915-3-20230903
v20220915-3-20230910
v20220915-3-20230917
v20220915-3-20230924
v20220915-3
Ok to proceed? (y/n)y
Sep 25 2023, 5:11 PM · Incident Tooling, User-herron
herron updated the task description for T344937: Decom dispatch infrastructure.
Sep 25 2023, 5:11 PM · Incident Tooling, User-herron
herron closed T344937: Decom dispatch infrastructure, a subtask of T313229: Production Dispatch Infrastructure, as Resolved.
Sep 25 2023, 5:11 PM · Incident Tooling, User-herron
herron updated the task description for T344937: Decom dispatch infrastructure.
Sep 25 2023, 3:40 PM · Incident Tooling, User-herron
herron updated the task description for T344937: Decom dispatch infrastructure.
Sep 25 2023, 3:39 PM · Incident Tooling, User-herron
herron updated the task description for T344937: Decom dispatch infrastructure.
Sep 25 2023, 3:32 PM · Incident Tooling, User-herron
herron updated the task description for T344937: Decom dispatch infrastructure.
Sep 25 2023, 3:30 PM · Incident Tooling, User-herron
herron added a comment to T344937: Decom dispatch infrastructure.

GCP: Project "Dispatch" is now shut down and scheduled to be deleted after Oct 25, 2023.

Sep 25 2023, 3:30 PM · Incident Tooling, User-herron
herron updated the task description for T344937: Decom dispatch infrastructure.
Sep 25 2023, 3:21 PM · Incident Tooling, User-herron
herron updated the task description for T344937: Decom dispatch infrastructure.
Sep 25 2023, 2:57 PM · Incident Tooling, User-herron
herron updated the task description for T344937: Decom dispatch infrastructure.
Sep 25 2023, 2:54 PM · Incident Tooling, User-herron
herron updated the task description for T344937: Decom dispatch infrastructure.
Sep 25 2023, 2:44 PM · Incident Tooling, User-herron

Sep 21 2023

herron added a comment to T346144: Hardcode the SLO time windows in Grafana dashboards generated via Grizzly.

I think it wouldn't even need to be "make editable," this only changes the default range for the time picker, right? So you can still use the time picker to punch in the dates of other quarters. Not as clean as having a list to pick from, but no worse than it is now.

Sep 21 2023, 4:36 PM · SRE Observability (FY2023/2024-Q1), serviceops, observability
herron added a comment to T346950: Prometheus rule evaluation failure.

Had a quick look at the current file limits for thanos-rule on titan2001, I'm seeing 524k as the current limit

Sep 21 2023, 4:35 PM · observability

Sep 20 2023

herron updated the task description for T346950: Prometheus rule evaluation failure.
Sep 20 2023, 5:20 PM · observability
herron created T346950: Prometheus rule evaluation failure.
Sep 20 2023, 5:17 PM · observability
herron updated the task description for T344937: Decom dispatch infrastructure.
Sep 20 2023, 1:51 PM · Incident Tooling, User-herron
herron updated the task description for T344937: Decom dispatch infrastructure.
Sep 20 2023, 1:50 PM · Incident Tooling, User-herron
herron updated the task description for T344937: Decom dispatch infrastructure.
Sep 20 2023, 1:41 PM · Incident Tooling, User-herron
herron updated the task description for T344937: Decom dispatch infrastructure.
Sep 20 2023, 1:36 PM · Incident Tooling, User-herron

Sep 12 2023

herron added a comment to T346144: Hardcode the SLO time windows in Grafana dashboards generated via Grizzly.

+1 for trying this. Thinking out loud:

Sep 12 2023, 2:06 PM · SRE Observability (FY2023/2024-Q1), serviceops, observability

Sep 1 2023

herron added a comment to T344751: Decide on default histogram buckets for MediaWiki timers.

Uploaded the above to get the ball rolling on a patch. As a starting point it is essentially borrowing the values used for benthos mw accesslog [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 15, 20, 30, 60]
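
To show how these boundaries behave, here is a small Python sketch that places timer observations into Prometheus-style cumulative buckets, where each `le` bucket counts every observation less than or equal to its upper bound and `+Inf` counts everything. The bucket list is the one quoted above; the sample timings are made up.

```python
from bisect import bisect_left

# Bucket upper bounds (seconds) quoted in the comment above.
BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 15, 20, 30, 60]


def cumulative_counts(observations, buckets=BUCKETS):
    """Return {upper_bound: cumulative_count}, including a +Inf bucket."""
    per_bucket = [0] * (len(buckets) + 1)
    for value in observations:
        # Index of the first bucket whose upper bound is >= value;
        # values above the last bound land in the extra +Inf slot.
        per_bucket[bisect_left(buckets, value)] += 1
    result, running = {}, 0
    for bound, count in zip([*buckets, float("inf")], per_bucket):
        running += count
        result[bound] = running
    return result


# Hypothetical timings in seconds.
counts = cumulative_counts([0.003, 0.04, 0.04, 0.7, 12.0, 75.0])
for bound, count in counts.items():
    print(f"le={bound}: {count}")
```

The example makes the trade-off visible: resolution is fine below 100 ms and coarse above 10 s, and anything slower than 60 s is only visible in the `+Inf` bucket.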

Sep 1 2023, 2:41 PM · MediaWiki-libs-Stats, serviceops, Observability-Metrics

Aug 31 2023

herron renamed T345377: Deploy puppetized statsd exporter to mw hosts from Deploy puppetized statsd_exporter to mw hosts to Deploy puppetized statsd exporter to mw hosts.
Aug 31 2023, 3:54 PM · SRE Observability (FY2023/2024-Q2), User-herron, Observability-Metrics
herron updated the task description for T343023: Deploy StatsD Exporter to production.
Aug 31 2023, 3:54 PM · SRE Observability (FY2023/2024-Q2), User-herron, Observability-Metrics
herron updated the task description for T343023: Deploy StatsD Exporter to production.
Aug 31 2023, 3:54 PM · SRE Observability (FY2023/2024-Q2), User-herron, Observability-Metrics
herron created T345377: Deploy puppetized statsd exporter to mw hosts.
Aug 31 2023, 3:53 PM · SRE Observability (FY2023/2024-Q2), User-herron, Observability-Metrics

Aug 30 2023

herron updated the task description for T344937: Decom dispatch infrastructure.
Aug 30 2023, 1:59 PM · Incident Tooling, User-herron

Aug 29 2023

herron changed the status of T343987: Switch thanos-fe to cfssl from Open to Stalled.

Yes, stalling is fine. The original reason for the switch to cfssl was related to adding a SAN to the thanos-fe certificate. That shouldn't be blocked, since we can still use cergen for the time being.

Aug 29 2023, 1:49 PM · Patch-For-Review, Observability-Metrics
herron added a comment to T326657: Add prometheus-https load balancer.

Hi,

Just FYI, JS and CSS are currently broken on prometheus-{eqiad,codfw}.wikipedia.org due to 401 and 403 errors, with some CORS sprinkled in

Aug 29 2023, 1:43 PM · Traffic, Patch-For-Review, Observability-Metrics

Aug 25 2023

herron updated subscribers of T341606: Investigate why Traffic SLO Grafana dashboard has negative values on combined SLI.

@BCornwall fwiw switching from "sli good" to "sli bad" does have the above in mind, namely by working with the small margin-of-error (by switching the calculation to a bad and total metric) instead of against it (attempting to maintain identical good and total metric values). That'd be actionable in the near term and would avoid the negative sli with the exception of edge cases where haproxy is serving 100% errors. With that said, looping in @colewhite and @fgiunchedi for their thoughts and potential alternatives
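
The arithmetic behind the good-versus-bad argument can be sketched in a few lines of Python. This is an illustration with made-up counter values, not the formula used by the actual dashboards (which is not shown in this feed): it only demonstrates that a small skew between independently collected counters pushes a good/total ratio out of range, while a bad/total ratio stays in range unless bad itself exceeds total.

```python
def error_ratio_from_good(good: float, total: float) -> float:
    """Error ratio derived from a 'good' counter: 1 - good/total."""
    return 1 - good / total


def error_ratio_from_bad(bad: float, total: float) -> float:
    """Error ratio taken directly from a 'bad' counter: bad/total."""
    return bad / total


# Hypothetical counters: 'good' and 'total' come from different sources,
# so 'good' can overshoot 'total' by a small margin of error.
print(error_ratio_from_good(good=1001, total=1000))  # slightly below zero

# With a 'bad' counter the same skew is harmless for ordinary traffic...
print(error_ratio_from_bad(bad=2, total=1000))

# ...and only leaves the 0..1 range when nearly every request is an error.
print(error_ratio_from_bad(bad=1001, total=1000))
```

In other words, the margin of error only matters when the numerator is close to the denominator: that is the normal case for a "good" counter, but for a "bad" counter it only happens when almost all requests fail.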

Aug 25 2023, 1:41 PM · Patch-For-Review, Traffic

Aug 24 2023

herron added a comment to T344937: Decom dispatch infrastructure.

@lmata could you please confirm if/when ready to proceed with decom of dispatch infra?

Aug 24 2023, 4:53 PM · Incident Tooling, User-herron
herron closed T313228: Deploy Dispatch for SRE incident workflow automation as Declined.

Closing as dispatch has been ruled out as an option: See T308467 for follow-up discussion of where we're going.

Aug 24 2023, 4:49 PM · User-herron, Incident Tooling
herron closed T313228: Deploy Dispatch for SRE incident workflow automation, a subtask of T308467: implementing an incident response workflow automation tool for SRE, as Declined.
Aug 24 2023, 4:49 PM · Incident Tooling, SRE-OnFire