herron (Keith Herron)
Ops Engineer

User Details

User Since
May 30 2017, 5:25 PM (259 w, 1 d)
Availability
Available
IRC Nick
herron
LDAP User
Herron
MediaWiki User
Unknown

Recent Activity

Today

herron updated subscribers of T299106: Q3:(Need By: TBD) rack/setup/install netmon1003.
Wed, May 18, 2:17 PM · SRE, ops-eqiad, DC-Ops

Mon, May 9

herron added a comment to T307958: Reminders for unhandled/unacked alerts.

This idea arose in an IRC conversation while looking into stale Icinga alerts on the unhandled dashboard. @Dzahn, please add/adjust/edit anything I missed!

Mon, May 9, 7:26 PM · SRE, SRE Observability
herron triaged T307958: Reminders for unhandled/unacked alerts as Medium priority.
Mon, May 9, 7:24 PM · SRE, SRE Observability
herron renamed T307873: [mitigated] Google returning 503 error when delivering to mx1001 and mx2001 from [Urgent] Google returning 503 error when delivering to mx1001 and mx2001 to [mitigated] Google returning 503 error when delivering to mx1001 and mx2001.
Mon, May 9, 5:14 PM · SRE, Mail, Infrastructure-Foundations
herron lowered the priority of T307873: [mitigated] Google returning 503 error when delivering to mx1001 and mx2001 from High to Medium.

'chunking_advertise_hosts =' (disabling chunking) has been applied to both MXes and we have not seen this error recur since that change was made.

Mon, May 9, 3:57 PM · SRE, Mail, Infrastructure-Foundations
herron added a comment to T307873: [mitigated] Google returning 503 error when delivering to mx1001 and mx2001.

Hi @bcampbell, while SRE is investigating could ITS please open a case with the google postmasters about this issue as well?

Mon, May 9, 3:03 PM · SRE, Mail, Infrastructure-Foundations
herron added a comment to T307873: [mitigated] Google returning 503 error when delivering to mx1001 and mx2001.

Looking at the count of log lines matching "BDAT command used when CHUNKING not advertised" on mx1001, this appears to have begun on the 5th, which was also the date that exim was restarted after the host was rebooted for a kernel update.

Mon, May 9, 1:47 PM · SRE, Mail, Infrastructure-Foundations
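
A rough way to reproduce the per-day count described in the comment above is sketched below in Python. It assumes each log line begins with a `YYYY-MM-DD` date; the function name and the sample lines are made up for illustration and are not real exim output.

```python
from collections import Counter

# Message text quoted in the comment above.
PATTERN = "BDAT command used when CHUNKING not advertised"


def count_matches_per_day(lines, pattern=PATTERN):
    """Count lines containing `pattern`, grouped by a leading YYYY-MM-DD date.

    Assumption: the first 10 characters of each line are the date.
    """
    counts = Counter()
    for line in lines:
        if pattern in line:
            counts[line[:10]] += 1
    # Sort by date so the first day with matches is easy to spot.
    return dict(sorted(counts.items()))


# Synthetic sample lines (hypothetical, for illustration only).
sample = [
    "2022-05-04 09:00:00 some unrelated message",
    "2022-05-05 10:00:00 BDAT command used when CHUNKING not advertised",
    "2022-05-05 10:05:00 BDAT command used when CHUNKING not advertised",
    "2022-05-06 11:00:00 BDAT command used when CHUNKING not advertised",
]

if __name__ == "__main__":
    for day, n in count_matches_per_day(sample).items():
        print(day, n)
```

In practice the lines would come from iterating over the open log file rather than a list; the earliest date in the result is the candidate start of the problem.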

Tue, May 3

herron committed rLPRI0ced0b596b26: "private" add prometheus.wm.o placeholder key (authored by herron).
Tue, May 3, 3:08 PM
herron added a comment to T303803: Prometheus use of Squid proxies.

Seeing a significant drop in CONNECT (blue) since https://gerrit.wikimedia.org/r/776878 was applied, looking better!

Tue, May 3, 1:10 PM · SRE Observability (FY2021/2022-Q4)

Mon, Apr 25

herron closed T305652: sre.kafka.reboot-workers fails on logging cluster with failed to execute command 'systemctl stop kafka-mirror': as Resolved.

A round of kafka-logging rolling reboots was completed today using sre.kafka.reboot-workers. Resolving!

Mon, Apr 25, 4:56 PM · SRE Observability, SRE

Apr 14 2022

herron added a comment to T305954: Make 'status page' dashboard the default dashboard in Grafana.

JFTR this was discussed at the last o11y meeting and sounds good. I went ahead and made a copy of the status-page dashboard to incorporate the current home dashboard "welcome to grafana" bits with the status page panels and arranged things to try and fit as much as possible on screen at once. Edits welcome, but if that looks good let's go ahead and set it as the home.

Apr 14 2022, 2:44 PM · SRE-OnFire (FY2021/2022-Q4), SRE Observability (FY2021/2022-Q4), observability

Apr 7 2022

herron renamed T305652: sre.kafka.reboot-workers fails on logging cluster with failed to execute command 'systemctl stop kafka-mirror': from sre.kafka.reboot-workers fails on logging cluster with 100.0% (1/1) of nodes failed to execute command 'systemctl stop kafka-mirror': to sre.kafka.reboot-workers fails on logging cluster with failed to execute command 'systemctl stop kafka-mirror':.
Apr 7 2022, 5:51 PM · SRE Observability, SRE
herron created T305652: sre.kafka.reboot-workers fails on logging cluster with failed to execute command 'systemctl stop kafka-mirror':.
Apr 7 2022, 5:50 PM · SRE Observability, SRE

Apr 6 2022

herron renamed T305567: MX: increasing disk space from MX: increasing disks space to MX: increasing disk space.
Apr 6 2022, 4:47 PM · Mail, SRE, Infrastructure-Foundations
herron created T305567: MX: increasing disk space.
Apr 6 2022, 4:47 PM · Mail, SRE, Infrastructure-Foundations

Apr 5 2022

herron added a comment to T305147: ipmiseld not running reliably.

There are some things which are still puzzling here: Why wasn't this noticed before? Was the service manually started before? And if the service wasn't running on the vast majority of our fleet (2/3 are Buster after all), did we miss logged errors this way?

Apr 5 2022, 6:30 PM · User-MoritzMuehlenhoff, Infrastructure-Foundations, observability, SRE
herron added a comment to T305175: Fix conflict between monthly and weekly index buckets.

If I'm understanding correctly the idea is to have a set of generic curator rules that would automatically set retention based on patterns like "2weeks" or "2days" in the index name?

Apr 5 2022, 3:31 PM · Patch-For-Review, Observability-Logging
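
The idea raised in the comment above, deriving retention from a token such as "2weeks" or "2days" in the index name, could look roughly like this. This is a Python sketch, not curator configuration; the function name, the regular expression, and the example index names are assumptions.

```python
import re
from datetime import timedelta

# Matches tokens like "2weeks" or "2days" that are not glued to other
# letters or digits. The token format is an assumption for this sketch.
RETENTION_RE = re.compile(r"(?<![A-Za-z0-9])(\d+)(day|week)s?(?![A-Za-z0-9])")


def retention_from_index_name(name):
    """Return the retention encoded in `name` as a timedelta, or None."""
    match = RETENTION_RE.search(name)
    if match is None:
        return None
    amount = int(match.group(1))
    unit = match.group(2)
    if unit == "week":
        return timedelta(weeks=amount)
    return timedelta(days=amount)


if __name__ == "__main__":
    # Hypothetical index names, for illustration only.
    for index in ("example-2weeks-2022.04.05", "example-2days-2022.04.05"):
        print(index, retention_from_index_name(index))
```

An index name without such a token yields `None`, so a caller could fall back to a default retention rule for it.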

Apr 4 2022

herron added a parent task for T305403: apifeatureusage hosts hanging on shutdown: T275405: Logstash collector nodes hang indefinitely on reboot.
Apr 4 2022, 7:55 PM · SRE Observability (FY2021/2022-Q4), Observability-Logging, SRE
herron added a subtask for T275405: Logstash collector nodes hang indefinitely on reboot: T305403: apifeatureusage hosts hanging on shutdown.
Apr 4 2022, 7:55 PM · SRE Observability (FY2021/2022-Q1)
herron triaged T305403: apifeatureusage hosts hanging on shutdown as Medium priority.
Apr 4 2022, 7:02 PM · SRE Observability (FY2021/2022-Q4), Observability-Logging, SRE
herron closed T305193: Access for new Data Platform Dev: Thomas Chin as Resolved.

Hi @tchin, the requested access has now been provisioned and will be fully deployed within 30 minutes (as puppet runs complete across the fleet)

Apr 4 2022, 2:17 PM · SRE, SRE-Access-Requests
herron updated the task description for T305193: Access for new Data Platform Dev: Thomas Chin.
Apr 4 2022, 1:27 PM · SRE, SRE-Access-Requests
herron closed T303398: [WIP] Requesting access to deployment group for TThoabala as Invalid.

Is it ok if we close this ticket and you just reopen it again once he is back?

Apr 4 2022, 1:26 PM · Patch-For-Review, SRE, SRE-Access-Requests

Apr 1 2022

herron added a comment to T305175: Fix conflict between monthly and weekly index buckets.

Sounds like that'd work, although I wonder if there are any alternatives that may be more obvious at a glance? Could we get away with doing something like including the unit as a keyword before the stamp? e.g.

Apr 1 2022, 2:01 AM · Patch-For-Review, Observability-Logging

Mar 31 2022

herron updated subscribers of T305193: Access for new Data Platform Dev: Thomas Chin.

Looping in @Ottomata and @odimitrijevic for analytics groupadd approval

Mar 31 2022, 8:33 PM · SRE, SRE-Access-Requests
herron updated the task description for T305193: Access for new Data Platform Dev: Thomas Chin.
Mar 31 2022, 8:23 PM · SRE, SRE-Access-Requests
herron triaged T305174: Advertised RSS/Atom feeds for wikimediastatus.net don't work as Medium priority.

Hey @Legoktm thanks for the report, yes looks like these were indeed set to inactive. That's been enabled and should be working now.

Mar 31 2022, 5:39 PM · Infrastructure-Foundations, SRE
herron added a comment to T305147: ipmiseld not running reliably.

Looking more closely I see all bullseye hosts have the unit enabled, while all buster hosts do not.

Mar 31 2022, 5:17 PM · User-MoritzMuehlenhoff, Infrastructure-Foundations, observability, SRE
herron triaged T305155: Blubber setup for Image Suggestions Service as Medium priority.
Mar 31 2022, 3:05 PM · Image-Suggestions, Patch-For-Review, serviceops, Generated Data Platform, Service-deployment-requests, Services, SRE
herron triaged T305154: Setup Initial Image Suggestion Service CI and k8s params/stubs as Medium priority.
Mar 31 2022, 3:05 PM · Image-Suggestions, serviceops, Generated Data Platform, Service-deployment-requests, Services, SRE
herron triaged T305147: ipmiseld not running reliably as Medium priority.
Mar 31 2022, 3:05 PM · User-MoritzMuehlenhoff, Infrastructure-Foundations, observability, SRE
herron added a comment to T303398: [WIP] Requesting access to deployment group for TThoabala.

Changed to stalled until TsepoThoabala returns

Mar 31 2022, 3:04 PM · Patch-For-Review, SRE, SRE-Access-Requests
herron added a comment to T305147: ipmiseld not running reliably.

Looks like ipmiseld isn't enabled on a sampling of these hosts, letting puppet ensure the service is enabled and running seems like a good next step

Mar 31 2022, 2:41 PM · User-MoritzMuehlenhoff, Infrastructure-Foundations, observability, SRE

Mar 30 2022

herron updated the task description for T304502: Requesting access to google console for TomekSikora.Monsoon.
Mar 30 2022, 7:21 PM · Search-Console-access-request, SRE
herron changed the status of T304502: Requesting access to google console for TomekSikora.Monsoon from Stalled to Open.

I have received this information from a member of your team:

  1. It doesn't need SSH access or SSH keys unless you're going to need private data
  2. The resource you need is membership of the 'nda' group. It's missing in the title. As is the wikitech username.
  3. The name of the approving party. (Cost center owner probably)

I will need access to your dashboard data visualization software

Mar 30 2022, 7:19 PM · Search-Console-access-request, SRE
herron renamed T304502: Requesting access to google console for TomekSikora.Monsoon from Requesting access to RESOURCE for USER[S] to Requesting access to LDAP group NDA for TomekSikora.Monsoon.
Mar 30 2022, 7:14 PM · Search-Console-access-request, SRE
herron closed T304927: Failed to fetch API response from {wiki}. Error code {code} as Resolved.

Thanks for the report, @kostajh. Yes, this has been addressed, and an acknowledgement has been added here as well: https://www.wikimediastatus.net/incidents/ft72m2rcs8tg

Mar 30 2022, 3:32 PM · Growth-Team, SRE, Notifications, Wikimedia-production-error
herron triaged T304891: New Service Request Generated Datasets: Image Suggestions Service as Medium priority.
Mar 30 2022, 3:27 PM · User-Eevans, Image-Suggestions, Patch-For-Review, serviceops, Generated Data Platform, Service-deployment-requests, Services, SRE
herron triaged T305047: Degraded RAID on cp2028 as High priority.
Mar 30 2022, 3:19 PM · Traffic, SRE, ops-codfw
herron awarded T304897: Many Ganeti hosts have disk space warnings on /boot a Love token.
Mar 30 2022, 1:43 PM · SRE, Infrastructure-Foundations

Mar 28 2022

herron triaged T304898: puppetmaster1001 disk warning on / as High priority.
Mar 28 2022, 8:11 PM · Infrastructure-Foundations, SRE
herron created T304897: Many Ganeti hosts have disk space warnings on /boot.
Mar 28 2022, 7:51 PM · SRE, Infrastructure-Foundations
herron triaged T304873: Degraded RAID on thanos-be1003 as High priority.
Mar 28 2022, 6:26 PM · SRE, ops-eqiad
herron triaged T304814: MWoffliner scrapes slowed down by Thumbor failure throttling 429s as Medium priority.
Mar 28 2022, 4:27 PM · SRE, Traffic, Thumbor, affects-Kiwix-and-openZIM
herron triaged T304800: Set API server weights as Medium priority.
Mar 28 2022, 4:21 PM · Sustainability (Incident Followup), serviceops, SRE
herron triaged T304799: Investigate shorter-lived persistent connections for Envoy as Medium priority.
Mar 28 2022, 4:18 PM · Sustainability (Incident Followup), ChangeProp, envoy, serviceops, SRE
herron triaged T304788: File not found: /v1/AUTH_mw/wikipedia-commons-local-public.9e/9/9e/Christopher_Wilbrand.jpg as Medium priority.
Mar 28 2022, 4:17 PM · SRE, SRE-swift-storage
herron removed a project from T303857: Need a service account on deploy servers for automated train pre-sync operations: SRE-Access-Requests.

Removing from the sre access request queue while the details of the request are being clarified. Please re-tag when ready for implementation and/or assistance is needed from sre clinic duty.

Mar 28 2022, 3:37 PM · Release-Engineering-Team (Radar), SRE-Access-Requests, serviceops, SRE, Infrastructure-Foundations
herron closed T304502: Requesting access to google console for TomekSikora.Monsoon as Invalid.

Hello, I'll close this as invalid for now since the task will need to specify what access/group is being requested, and an approving party, in order to move forward.

Mar 28 2022, 3:24 PM · Search-Console-access-request, SRE
herron closed T302287: Requesting access to releaser for MarkAHershberger as Resolved.

Resolving as the near-term access requested in the description has been provisioned, please reopen if any follow up is needed. Thanks!

Mar 28 2022, 2:43 PM · SRE, SRE-Access-Requests
herron updated the task description for T302287: Requesting access to releaser for MarkAHershberger.
Mar 28 2022, 2:41 PM · SRE, SRE-Access-Requests

Mar 25 2022

herron awarded T202061: Implement an accurate and easy to understand status page for all wikis a Love token.
Mar 25 2022, 2:39 PM · Infrastructure-Foundations (FY2021/2022-Q4), SRE-OnFire (FY2021/2022-Q4), SRE Observability (FY2021/2022-Q4), SRE

Mar 22 2022

herron added a comment to T303593: increase of network errors on alert1001 after certspotter has been enabled.
  • as per the commit message above, "We first start by setting interface::rps to the alerting_host role in order to improve the network performance. In the next commit, we will try to adjust the RX queuelen, which is currently set to 200 (with a maximum support of 2047)." We can try doing this manually to check if it makes a difference perhaps and then puppetize it.
Mar 22 2022, 7:31 PM · Patch-For-Review, Traffic-Icebox, SRE
herron moved T299966: Incident: 2021-11-05 TOC language converter from Backlog to In Review on the SRE-OnFire (FY2021/2022-Q2) board.
Mar 22 2022, 1:46 PM · SRE-OnFire (FY2021/2022-Q2)
herron moved T299965: Incident: 2021-11-04 large file upload timeouts from In Progress to In Review on the SRE-OnFire (FY2021/2022-Q2) board.
Mar 22 2022, 1:46 PM · SRE-OnFire (FY2021/2022-Q2)
herron moved T297127: Incident: 2021-12-03 mx2001->Gmail delivery issues from In Progress to In Review on the SRE-OnFire (FY2021/2022-Q2) board.
Mar 22 2022, 1:46 PM · SRE-OnFire (FY2021/2022-Q2), Sustainability (Incident Followup), SRE

Mar 21 2022

herron claimed T299965: Incident: 2021-11-04 large file upload timeouts.
Mar 21 2022, 3:56 PM · SRE-OnFire (FY2021/2022-Q2)
herron claimed T299966: Incident: 2021-11-05 TOC language converter.
Mar 21 2022, 3:56 PM · SRE-OnFire (FY2021/2022-Q2)

Mar 17 2022

herron closed T302437: Q3: install 2 new HDD into centrallog1001 as Resolved.

Disks have been added and the volume group on the host has been grown. Thanks @Jclark-ctr!

Mar 17 2022, 4:26 PM · SRE, ops-eqiad, DC-Ops
herron updated the task description for T302437: Q3: install 2 new HDD into centrallog1001.
Mar 17 2022, 4:26 PM · SRE, ops-eqiad, DC-Ops
herron closed T300056: centrallog1001 high /srv filesystem utilization as Resolved.

Thanks to @RobH and @Jclark-ctr a pair of 1TB disks have been added to centrallog1001 and the VG and filesystem have been grown:

Mar 17 2022, 4:24 PM · SRE Observability (FY2021/2022-Q3)

Mar 16 2022

herron awarded T303599: Backport prometheus-elasticsearch-exporter version 1.1.0 to buster-wikimedia a Love token.
Mar 16 2022, 2:03 PM · Observability-Logging, Discovery, Infrastructure-Foundations

Mar 15 2022

herron added a comment to T303803: Prometheus use of Squid proxies.

The reason for checking these via the proxy is that the prometheus hosts can't reach all of the watchrat checked URLs directly, and it's simpler to have one blackbox exporter configuration that uses a proxy and works for all cases than to split the config out between proxied/non-proxied URLs. Here's the current config: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/prometheus/templates/blackbox_exporter/common.yml.erb$25-34

Mar 15 2022, 4:40 PM · SRE Observability (FY2021/2022-Q4)

Mar 14 2022

herron awarded T303605: Stop announcing and scheduling primary database switchovers a Like token.
Mar 14 2022, 1:48 PM · CommRel-Specialists-Support (Apr-Jun-2022), User-notice, Release-Engineering-Team (Radar), DBA

Mar 11 2022

herron added a comment to T303599: Backport prometheus-elasticsearch-exporter version 1.1.0 to buster-wikimedia.

Thanks for the task @BTullis!

Mar 11 2022, 3:33 PM · Observability-Logging, Discovery, Infrastructure-Foundations

Mar 4 2022

herron added a comment to T302437: Q3: install 2 new HDD into centrallog1001.

Hey @Jclark-ctr, could we schedule an installation window for next week?

Mar 4 2022, 6:48 PM · SRE, ops-eqiad, DC-Ops
herron updated the task description for T302437: Q3: install 2 new HDD into centrallog1001.
Mar 4 2022, 6:47 PM · SRE, ops-eqiad, DC-Ops
herron moved T302995: Explore dedicated (non-grafana) SLO Visualization and Management from Inbox to Backlog on the SRE Observability board.
Mar 4 2022, 6:45 PM · Observability-Metrics
herron edited projects for T302995: Explore dedicated (non-grafana) SLO Visualization and Management, added: SRE Observability; removed observability.
Mar 4 2022, 6:45 PM · Observability-Metrics
herron added a parent task for T303041: grafana-ldap-users-sync failing with Grafana 8: T303064: grafana-ldap-users-sync fails to finish intermittently.
Mar 4 2022, 5:25 PM · observability, SRE
herron added a subtask for T303064: grafana-ldap-users-sync fails to finish intermittently: T303041: grafana-ldap-users-sync failing with Grafana 8.
Mar 4 2022, 5:24 PM · SRE Observability (FY2021/2022-Q3), Observability-Metrics
herron added a comment to T302842: SLO dashboard refinements.

If no objections, I think we should delete the nontemplated dashboard as a hazard to navigation.

Mar 4 2022, 5:20 PM · SRE Observability (FY2021/2022-Q4), SRE
herron closed T257024: Buster elasticsearch-curator version not compatible with ELK7, a subtask of T234854: Upgrade ELK Stack to version 7, as Resolved.
Mar 4 2022, 4:31 PM · SRE Observability (FY2021/2022-Q1), observability, Patch-For-Review, SRE, Wikimedia-Logstash
herron closed T257024: Buster elasticsearch-curator version not compatible with ELK7 as Resolved.

SGTM!

Mar 4 2022, 4:31 PM · Observability-Logging, observability, SRE, Wikimedia-Logstash
herron updated subscribers of T303041: grafana-ldap-users-sync failing with Grafana 8.
Mar 4 2022, 4:29 PM · observability, SRE

Mar 3 2022

herron edited Description on Wikimedia-Incident.
Mar 3 2022, 10:36 PM
herron edited Description on Wikimedia-Incident.
Mar 3 2022, 10:29 PM
herron closed T303009: Phabricator form request for creation of tasks tagged wikimedia-incident as Resolved.

Neat, thanks @RhinosF1 that should be good enough to get started with this

Mar 3 2022, 10:12 PM · SRE-OnFire, Phabricator
herron updated the task description for T303009: Phabricator form request for creation of tasks tagged wikimedia-incident.
Mar 3 2022, 10:05 PM · SRE-OnFire, Phabricator
herron created T303009: Phabricator form request for creation of tasks tagged wikimedia-incident.
Mar 3 2022, 10:04 PM · SRE-OnFire, Phabricator
herron added a comment to T302842: SLO dashboard refinements.

To take a step back, the varnish slo dashboard linked in the description didn't actually originate from a template. Presumably this one was a manual fork of the original etcd slo example dashboard that's been manually adjusted.

Mar 3 2022, 6:49 PM · SRE Observability (FY2021/2022-Q4), SRE
herron triaged T302995: Explore dedicated (non-grafana) SLO Visualization and Management as Medium priority.
Mar 3 2022, 6:39 PM · Observability-Metrics

Mar 2 2022

herron added a comment to T300056: centrallog1001 high /srv filesystem utilization.

Added +20g to /dev/mapper/centrallog1001--vg-data

Mar 2 2022, 4:15 PM · SRE Observability (FY2021/2022-Q3)
herron added a comment to T300056: centrallog1001 high /srv filesystem utilization.

Centrallog1001 is above the icinga threshold today, I'll see what I can prune while we wait on the long term solution via T301926

Mar 2 2022, 2:56 PM · SRE Observability (FY2021/2022-Q3)
herron added a comment to T300056: centrallog1001 high /srv filesystem utilization.

00-partial_logs was a directory used during the centrallog host switchover, I've cleaned that up (removed it) just now and will keep an eye on the next run.

Mar 2 2022, 2:47 PM · SRE Observability (FY2021/2022-Q3)

Mar 1 2022

herron added a comment to T230570: De-noise systemd alerts (Reduce Icinga alert noise goal).

Had a shower of IRC alerts today after deploying the freeipmi-ipmiseld package. It wasn't a critical situation, but it overwhelmed the operations channel with noise, caused the bot to be kicked, and led to ircecho being temporarily disabled (to avoid a recovery shower).

Mar 1 2022, 5:26 PM · Observability-Alerting, Patch-For-Review, Goal

Feb 28 2022

herron added a comment to T290924: Tooling for end-of-quarter SLO reporting.

Something like https://github.com/pyrra-dev/pyrra seems worth exploring for this and possibly more

Feb 28 2022, 8:41 PM · Observability-Metrics, SRE
herron moved T292506: Investigate cp5006 crash from FY2021/2022-Q3 to Radar on the SRE Observability board.
Feb 28 2022, 8:32 PM · SRE Observability, User-ema, SRE, Traffic
herron triaged T301770: Remove obsolete Java 8 packages from logstash cluster as Medium priority.

SGTM, IIRC java 8 was in use by kafka-logging which is no longer colocated on the logstash hosts.

Feb 28 2022, 8:07 PM · SRE Observability (FY2021/2022-Q3), Observability-Logging
herron removed a project from T292175: rsyslog errors about duplicate module includes: Patch-For-Review.

To complicate matters, rsyslog also appears to throw errors when a module is loaded but not actively used, e.g.:

Feb 28 2022, 3:23 PM · Patch-For-Review, Observability-Logging, User-ema, SRE

Feb 25 2022

herron closed T302626: New VictorOps user request - Bking as Resolved.

Hey @bking! Just created your account, you should have received an email from the system to confirm.

Feb 25 2022, 9:27 PM · Discovery, observability

Feb 24 2022

herron closed T299967: Incident: 2021-11-10 cirrussearch commonsfile outage as Resolved.

Scorecard has been filled in based on the info in the incident report

Feb 24 2022, 9:13 PM · SRE-OnFire (FY2021/2022-Q2)

Feb 22 2022

herron closed T281266: Decommission old ELK5 Logstash cluster as Resolved.
Feb 22 2022, 7:25 PM · SRE Observability (FY2021/2022-Q3), Patch-For-Review, SRE
herron reassigned T298994: Decom centrallog2001 from herron to Papaul.
Feb 22 2022, 6:35 PM · SRE, ops-codfw, decommission-hardware, SRE Observability (FY2021/2022-Q3)
herron triaged T298994: Decom centrallog2001 as Medium priority.
Feb 22 2022, 6:21 PM · SRE, ops-codfw, decommission-hardware, SRE Observability (FY2021/2022-Q3)

Feb 17 2022

herron closed T300062: ElasticSearch Curator forbidden to set replica count on apifeatureusage indices as Resolved.

Looks much better now, resolving!

Feb 17 2022, 5:54 PM · Discovery-Search (Current work), Observability-Logging
herron added a comment to T301944: Web interface to navigate Prometheus alerts and their status.

+1 for reverse proxying the prometheus web interface behind SSO, that seems straightforward to me and could be useful in other cases as well

Feb 17 2022, 5:17 PM · Patch-For-Review, Observability-Metrics
herron awarded T301944: Web interface to navigate Prometheus alerts and their status a 100 token.
Feb 17 2022, 5:13 PM · Patch-For-Review, Observability-Metrics

Feb 16 2022

herron added a comment to T300056: centrallog1001 high /srv filesystem utilization.

Thanks for looking at this! From what I can tell centrallog1001 has 2x 1TB disks installed, but you are seeing 8x in hardware?

Feb 16 2022, 10:25 PM · SRE Observability (FY2021/2022-Q3)
herron reassigned T300056: centrallog1001 high /srv filesystem utilization from herron to RobH.

Reassigning for visibility, feel free to pass back!

Feb 16 2022, 9:29 PM · SRE Observability (FY2021/2022-Q3)