
Jun 18 2025

herron changed the status of T391714: Review logging cluster merge pressure from Open to Stalled.
Jun 18 2025, 2:44 PM · Patch-For-Review, Observability-Logging
herron changed the status of T391714: Review logging cluster merge pressure, a subtask of T391687: Consider sharding big logging indices, from Open to Stalled.
Jun 18 2025, 2:44 PM · Observability-Logging
herron triaged T383923: Prometheus: queries matching on {__name__} error out on larger instances as Low priority.
Jun 18 2025, 2:41 PM · Observability-Metrics
herron changed the status of T383923: Prometheus: queries matching on {__name__} error out on larger instances from Open to Stalled.
Jun 18 2025, 2:41 PM · Observability-Metrics
herron moved T397099: Grant Access to NDA LDAP for DerHexer from Backlog to NDA Pending on the LDAP-Access-Requests board.
Jun 18 2025, 1:33 PM · SRE, LDAP-Access-Requests

Jun 17 2025

herron closed T397004: Requesting access to analytics-privatedata-users for AndyRussG as Resolved.

The requested access has been merged and will be fully deployed within 30 minutes. I'll go ahead and resolve this but please don't hesitate to re-open if any followup is needed. Thanks!

Jun 17 2025, 6:03 PM · SRE, SRE-Access-Requests
herron added a member for WMF-NDA: AndyRussG_volunteer.
Jun 17 2025, 6:02 PM
herron added a comment to T397004: Requesting access to analytics-privatedata-users for AndyRussG.

Hi @AndyRussG_volunteer you should have just received an email regarding kerberos, and I'll update the account data to reflect krb: present now as well

Jun 17 2025, 5:58 PM · SRE, SRE-Access-Requests
herron updated the task description for T395917: Grant Access to analytics-privatedata-user (and LDAP nda, wmde) for Anton Kokh (WMDE).
Jun 17 2025, 5:40 PM · Data-Engineering-Radar, Data-Engineering, SRE, SRE-Access-Requests, LDAP-Access-Requests
herron added a comment to T397099: Grant Access to NDA LDAP for DerHexer.

Change #1160216 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] admin: add ldap_only entry for derhexer

https://gerrit.wikimedia.org/r/1160216

Jun 17 2025, 5:39 PM · SRE, LDAP-Access-Requests
herron updated the task description for T397004: Requesting access to analytics-privatedata-users for AndyRussG.
Jun 17 2025, 5:17 PM · SRE, SRE-Access-Requests
herron updated the task description for T397004: Requesting access to analytics-privatedata-users for AndyRussG.
Jun 17 2025, 5:17 PM · SRE, SRE-Access-Requests
herron added a comment to T397004: Requesting access to analytics-privatedata-users for AndyRussG.

Thanks! I've just emailed you as well for the out of band verification step

Jun 17 2025, 4:56 PM · SRE, SRE-Access-Requests
herron added a comment to T397004: Requesting access to analytics-privatedata-users for AndyRussG.

Yes please 👍

Jun 17 2025, 4:51 PM · SRE, SRE-Access-Requests
herron updated the task description for T397004: Requesting access to analytics-privatedata-users for AndyRussG.
Jun 17 2025, 4:49 PM · SRE, SRE-Access-Requests
herron added a comment to T397004: Requesting access to analytics-privatedata-users for AndyRussG.
  • I created a new SSH key for andrew.green@extern.wikimedia.de and added it in Gerrit, as per instructions.
Jun 17 2025, 4:47 PM · SRE, SRE-Access-Requests
herron closed T397200: Access to Wikipedia DB Replicas (SSH) as Invalid.

Do you have any pointers as to how I could achieve this?

Jun 17 2025, 2:47 PM · SRE, SRE-Access-Requests

Jun 16 2025

herron added a comment to T397004: Requesting access to analytics-privatedata-users for AndyRussG.

@WMDE-leszek is this for a contract with end-date, or for ongoing access?

Jun 16 2025, 2:24 PM · SRE, SRE-Access-Requests
herron updated subscribers of T397004: Requesting access to analytics-privatedata-users for AndyRussG.

Hello! Here are a few next-steps to complete before proceeding with access:

Jun 16 2025, 2:01 PM · SRE, SRE-Access-Requests
herron updated the task description for T397004: Requesting access to analytics-privatedata-users for AndyRussG.
Jun 16 2025, 1:49 PM · SRE, SRE-Access-Requests
herron moved T395917: Grant Access to analytics-privatedata-user (and LDAP nda, wmde) for Anton Kokh (WMDE) from Backlog to Awaiting User Input on the LDAP-Access-Requests board.
Jun 16 2025, 1:47 PM · Data-Engineering-Radar, Data-Engineering, SRE, SRE-Access-Requests, LDAP-Access-Requests
herron added a comment to T395917: Grant Access to analytics-privatedata-user (and LDAP nda, wmde) for Anton Kokh (WMDE).

Hi @Anton.Kokh could you please add a unique SSH key here? Thanks in advance!

Jun 16 2025, 1:47 PM · Data-Engineering-Radar, Data-Engineering, SRE, SRE-Access-Requests, LDAP-Access-Requests
herron renamed T395917: Grant Access to analytics-privatedata-user (and LDAP nda, wmde) for Anton Kokh (WMDE) from Grant Access to analytics-privatedata-user for Anton Kokh (WMDE) to Grant Access to analytics-privatedata-user (and LDAP nda, wmde) for Anton Kokh (WMDE).
Jun 16 2025, 1:41 PM · Data-Engineering-Radar, Data-Engineering, SRE, SRE-Access-Requests, LDAP-Access-Requests
herron changed the status of T393626: Grant Access to Product's Superset & Turnilo for SKivlehan from Open to Stalled.
Jun 16 2025, 1:30 PM · Data-Engineering, SRE, LDAP-Access-Requests
herron moved T393626: Grant Access to Product's Superset & Turnilo for SKivlehan from Manager Approval Pending to Awaiting User Input on the LDAP-Access-Requests board.
Jun 16 2025, 1:30 PM · Data-Engineering, SRE, LDAP-Access-Requests
herron closed T395966: Requesting access to deployment for cmelo as Resolved.
Jun 16 2025, 1:21 PM · SRE, SRE-Access-Requests

Jun 11 2025

herron created T396647: Cortobot: reply to 'help' in private message.
Jun 11 2025, 5:44 PM · Incident Tooling

Jun 10 2025

herron renamed T395920: Add a section to the SLO template that explains SLO windows, and Pyrra's dashboards and alerts from Add a section to the SLO template that explains Pyrra's dashboards and alerts to Add a section to the SLO template that explains SLO windows, and Pyrra's dashboards and alerts.
Jun 10 2025, 7:00 PM · SRE-SLO
herron closed T393797: Pyrra detail grafana dashboard contains two panels displaying misleading data as Resolved.

Since we've addressed the misleading panels I think we're ok to resolve. There will be some follow up down the road like upgrading Pyrra, and adapting existing dashboards to use updated recording rules that become available in new version(s) but I think it'll be ok to track that independently

Jun 10 2025, 6:59 PM · SRE-SLO, Observability-Metrics, SRE
herron closed T393797: Pyrra detail grafana dashboard contains two panels displaying misleading data, a subtask of T391852: Create a Pyrra template for Istio-based K8s services and apply it to Citoid, as Resolved.
Jun 10 2025, 6:59 PM · Citoid, SRE-SLO, Observability-Metrics, SRE
herron moved T395987: WDQS Update Lag SLO looks wrong from Backlog to In Progress on the SRE-SLO board.
Jun 10 2025, 6:56 PM · Data-Platform-SRE (2025.06.13 - 2025.07.04), SRE-SLO, observability
herron added a comment to T395987: WDQS Update Lag SLO looks wrong.

If I'm understanding correctly "WDQS update lag" was replaced by "Search update lag" which looks healthy but it seems we've hit a bug where the old SLO hasn't been fully removed from the pyrra output rules that are executed by thanos rule, so thanos and grafana thought it still existed. I've just cleared the stale output rule and we should start to see the old SLO drop off.

Jun 10 2025, 6:37 PM · Data-Platform-SRE (2025.06.13 - 2025.07.04), SRE-SLO, observability
herron moved T395987: WDQS Update Lag SLO looks wrong from Inbox to Radar on the observability board.
Jun 10 2025, 6:25 PM · Data-Platform-SRE (2025.06.13 - 2025.07.04), SRE-SLO, observability
herron added a comment to T395916: Reduce Pyrra's default window from 12w to 4w.

+1 for the 4w, my only doubt is about backfilling - do we have a way to do it? Or should we just start from a clean state without history?

Jun 10 2025, 2:30 PM · SRE Observability (FY2025/2026-Q1), Observability-Metrics, SRE-SLO

Jun 5 2025

herron added a comment to T395920: Add a section to the SLO template that explains SLO windows, and Pyrra's dashboards and alerts.

@Vgutierrez @herron I added more stuff to https://wikitech.wikimedia.org/wiki/SLO/Template_instructions/Dashboards_and_alerts, especially related to alerting. I am planning to have multiple people reviewing it, but lemme know if you like it and if it is close to what we discussed over meetings.

Jun 5 2025, 7:25 PM · SRE-SLO

Jun 3 2025

herron added a comment to T395916: Reduce Pyrra's default window from 12w to 4w.

I went ahead and made adjustments that I think simplify the fixed window view https://grafana-rw.wikimedia.org/d/ccssRIenz/slo-quarterly-drilldown

Jun 3 2025, 5:04 PM · SRE Observability (FY2025/2026-Q1), Observability-Metrics, SRE-SLO
herron added a comment to T393796: Set a predefined time window in Pyrra's configuration to measure SLOs with.

I'm experimenting with self referencing links in the grafana slo review/list dashboard (https://grafana-rw.wikimedia.org/d/YuUMRZ44z/slo-quarterly-review) to provide dashboard buttons that load specific date ranges (e.g. Q3, Q4, etc.). It seems promising, I think with a bit of tuning this will be an intuitive way to browse a fixed window view, and in theory we could add links something like once per FY.

Jun 3 2025, 4:17 PM · SRE-SLO, Observability-Metrics, SRE
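
The fixed-range dashboard links described in the comment above could be generated along these lines. This is only a sketch under stated assumptions: the fiscal year is assumed to start on July 1, quarter boundaries are taken in UTC, and the helper names quarter_range_ms and quarter_link are hypothetical. It relies on Grafana accepting from/to URL parameters as epoch milliseconds.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Dashboard URL taken from the comment above.
DASHBOARD = "https://grafana-rw.wikimedia.org/d/YuUMRZ44z/slo-quarterly-review"

def quarter_range_ms(fy_start_year, quarter):
    """Return (from_ms, to_ms) for a fiscal quarter; assumes the FY starts on July 1 UTC."""
    # Q1 -> July, Q2 -> October, Q3 -> January, Q4 -> April.
    months_from_july = (quarter - 1) * 3
    start_month = (6 + months_from_july) % 12 + 1
    start_year = fy_start_year + (6 + months_from_july) // 12
    end_month = start_month + 3
    end_year = start_year
    if end_month > 12:
        end_month -= 12
        end_year += 1
    start = datetime(start_year, start_month, 1, tzinfo=timezone.utc)
    end = datetime(end_year, end_month, 1, tzinfo=timezone.utc)
    return int(start.timestamp() * 1000), int(end.timestamp() * 1000)

def quarter_link(fy_start_year, quarter):
    """Build a dashboard link pinned to one quarter via Grafana's from/to URL parameters."""
    from_ms, to_ms = quarter_range_ms(fy_start_year, quarter)
    return f"{DASHBOARD}?{urlencode({'from': from_ms, 'to': to_ms})}"

# Hypothetical usage: Q4 of the fiscal year that started in July 2024 (April to June 2025).
print(quarter_link(2024, 4))
```

The end of each range is the first instant of the following quarter, so adjacent quarter links do not leave gaps.
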
herron added a comment to T395916: Reduce Pyrra's default window from 12w to 4w.

For a quick simulation of viewing 4w over a longer window like 8w or 12w, we could view our current 12w window over a longer period. For instance here is a 12w (84d) pyrra window rendered in Grafana over a 6 month period. The percentages and graphs look ok, but I do think we'll want to clarify or remove the "window" panel which displays the pyrra_window metric (as opposed to grafana time picker) and likewise consider adjusting the other instant percentage value panels in favor of graphs over time.

Jun 3 2025, 3:57 PM · SRE Observability (FY2025/2026-Q1), Observability-Metrics, SRE-SLO
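
As a rough illustration of what moving from a 12w (84d) window to a 4w (28d) window means for the error budget, here is a small arithmetic sketch. The 99.9% target is an assumed example value, not one taken from the task, and error_budget_minutes is a hypothetical helper.

```python
def error_budget_minutes(target_percent, window_days):
    """Minutes of full unavailability allowed by an SLO target over a rolling window."""
    window_minutes = window_days * 24 * 60
    allowed_fraction = (100 - target_percent) / 100
    return round(allowed_fraction * window_minutes, 2)

# Assumed example target of 99.9%; the windows are the 12w and 4w values discussed above.
for weeks in (12, 4):
    days = weeks * 7
    print(f"{weeks}w ({days}d): {error_budget_minutes(99.9, days)} minutes")
```

The budget scales linearly with the window, so the shorter window leaves a third of the absolute budget for the same target.
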

May 20 2025

herron added a comment to T393797: Pyrra detail grafana dashboard contains two panels displaying misleading data.

Made a few more improvements to the Pyrra detail dashboard, more specifically to properly include cluster labels when present, and automatically filter the site/cluster variables each time a different SLO is selected

May 20 2025, 3:08 PM · SRE-SLO, Observability-Metrics, SRE
herron renamed T393797: Pyrra detail grafana dashboard contains two panels displaying misleading data from Every Grafana dashboard generated by Pyrra contains two panels displaying misleading data to Pyrra detail grafana dashboard contains two panels displaying misleading data.
May 20 2025, 3:07 PM · SRE-SLO, Observability-Metrics, SRE
herron added a comment to T394415: Rework the Pyrra list dashboard.

Thanks, this makes a lot of sense. I've saved a version of the dashboard that pares way down on headings. FWIW Cluster I think can stay since some SLOs use it e.g. haproxy.

May 20 2025, 12:41 PM · SRE-SLO, Observability-Metrics, SRE

May 19 2025

herron added a comment to T387350: liftwing SLO performance issues.

Lift Wing Grizzly dashboards have been bulk deleted within Grafana as well

May 19 2025, 3:31 PM · SRE Observability (FY2024/2025-Q4), SRE-SLO, Observability-Metrics
herron updated the task description for T394319: Move thanos cache out of process.
May 19 2025, 3:13 PM · SRE Observability (FY2024/2025-Q4), Observability-Metrics
herron added a comment to T394319: Move thanos cache out of process.

Thanks, makes sense to me. Cross site latency is a good point, not sure why I had it in mind that we might share across all nodes when realistically it'd be per-site.

May 19 2025, 3:12 PM · SRE Observability (FY2024/2025-Q4), Observability-Metrics
herron added a comment to T393797: Pyrra detail grafana dashboard contains two panels displaying misleading data.

Thanks a lot! So the changes will not be reverted by future Pyrra filesystem syncs (new SLOs etc..) as for the time window right? If so I think it is something workable at the moment,

May 19 2025, 2:37 PM · SRE-SLO, Observability-Metrics, SRE

May 14 2025

herron added a comment to T394319: Move thanos cache out of process.

As part of this I think we should summarize the cache backends available today and document the reason for selecting the memcached backend. Other options available are:

May 14 2025, 4:50 PM · SRE Observability (FY2024/2025-Q4), Observability-Metrics
herron added a subtask for T368953: Thanos Cache Tuning: T394319: Move thanos cache out of process.
May 14 2025, 4:45 PM · Observability-Metrics
herron added a parent task for T394319: Move thanos cache out of process: T368953: Thanos Cache Tuning.
May 14 2025, 4:45 PM · SRE Observability (FY2024/2025-Q4), Observability-Metrics
herron added a comment to T369854: Occasional SLOMetricAbsent alerts.

linking with T383570 since these alerts are evaluated by thanos rule which depends on thanos query

May 14 2025, 3:41 PM · Observability-Alerting
herron added a subtask for T383570: thanos query/store OOM on titan hosts: T369854: Occasional SLOMetricAbsent alerts.
May 14 2025, 3:41 PM · Observability-Metrics
herron added a parent task for T369854: Occasional SLOMetricAbsent alerts: T383570: thanos query/store OOM on titan hosts.
May 14 2025, 3:41 PM · Observability-Alerting

May 13 2025

herron added a comment to T394080: Duplicate SLOMetricAbsent rules generated by Pyrra for varnish-combined-eqsin-cache_upload.yaml.

I see a lot of "read: connection timed out" errors in the thanos-rule journal, for instance on titan1001, and also noticed a few OOM events today affecting thanos query and thanos query frontend:

May 13 2025, 7:46 PM · SRE Observability
herron added a comment to T393797: Pyrra detail grafana dashboard contains two panels displaying misleading data.

Yeah, those panels are spiky and confusing. To try and make better use of the recording rules that are in place today I added an experimental "error ratio" panel to the Pyrra details dashboard (https://grafana-rw.wikimedia.org/d/ccssRIenz/pyrra-detail?var-slo=citoid-requests) and hid the bugged panels inside a collapsed row. With this we can at least visualize error spikes as a percentage, and for the time being defer to service specific dashboards for request rates, etc.

May 13 2025, 2:22 PM · SRE-SLO, Observability-Metrics, SRE
herron added a comment to T393796: Set a predefined time window in Pyrra's configuration to measure SLOs with.

In T393796#10815188, @elukey wrote:
@herron I think it is perfect, two manual changes are definitely ok for this use case! Do you know if the settings are kept even when we add new SLOs via Pyrra? I Assume that there is an auto-sync from Pyrra to Grafana, but I am not sure about it.

May 13 2025, 1:23 PM · SRE-SLO, Observability-Metrics, SRE

May 12 2025

herron added a comment to T393796: Set a predefined time window in Pyrra's configuration to measure SLOs with.
  1. In the Pyrra Grafana dashboards that are exported. Ideally we'd want to avoid setting the time manually and/or use a specialized URI every time that we need to check, it would be nice if it was pre-generated and updated upon single/centralized config change.
May 12 2025, 8:45 PM · SRE-SLO, Observability-Metrics, SRE
herron removed a project from T373995: CPU thermal throttling: saturation panel isn't working as expected: SRE Observability (FY2024/2025-Q3).
May 12 2025, 5:43 PM · Observability-Metrics
herron moved T392886: Revisit default Istio histogram buckets from Inbox to In progress on the SRE Observability (FY2024/2025-Q3) board.
May 12 2025, 5:43 PM · SRE Observability (FY2025/2026-Q1), Patch-For-Review, Observability-Metrics
herron removed a project from T391687: Consider sharding big logging indices: SRE Observability (FY2024/2025-Q3).
May 12 2025, 5:42 PM · Observability-Logging
herron closed T369854: Occasional SLOMetricAbsent alerts as Resolved.

Tentatively resolving

May 12 2025, 5:40 PM · Observability-Alerting
herron moved T391714: Review logging cluster merge pressure from Inbox to Up next on the SRE Observability (FY2024/2025-Q3) board.
May 12 2025, 5:38 PM · Patch-For-Review, Observability-Logging
herron moved T387350: liftwing SLO performance issues from Inbox to In progress on the SRE Observability (FY2024/2025-Q3) board.
May 12 2025, 5:38 PM · SRE Observability (FY2024/2025-Q4), SRE-SLO, Observability-Metrics
herron moved T392230: Explore improved isolation of non-ECS k8s log topics/indices from Inbox to Up next on the SRE Observability (FY2024/2025-Q3) board.
May 12 2025, 5:38 PM · Observability-Logging
herron closed T368088: upgrade prometheus-ipmi-exporter to 1.8.0, a subtask of T253810: Alert on ECC warnings in SEL, as Resolved.
May 12 2025, 5:37 PM · SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup), User-MoritzMuehlenhoff
herron closed T368088: upgrade prometheus-ipmi-exporter to 1.8.0 as Resolved.
May 12 2025, 5:37 PM · SRE Observability (FY2024/2025-Q3), Patch-For-Review, Infrastructure-Foundations, Packaging
herron closed T390194: Add read-only users capability to logs-api.svc as Resolved.

Method based acl has been rolled out today, and jaeger has been granted read/write access (reduced from what was effectively read/write/delete before). Spans are still flowing into opensearch, I think we're good here!

May 12 2025, 3:34 PM · SRE Observability (FY2024/2025-Q4)
herron updated the task description for T390194: Add read-only users capability to logs-api.svc.
May 12 2025, 3:32 PM · SRE Observability (FY2024/2025-Q4)

May 7 2025

herron updated the task description for T390194: Add read-only users capability to logs-api.svc.
May 7 2025, 2:04 PM · SRE Observability (FY2024/2025-Q4)

May 6 2025

herron added a comment to T391852: Create a Pyrra template for Istio-based K8s services and apply it to Citoid.

@herron @RLazarus There are a couple of logistical things to discuss:

  • https://gerrit.wikimedia.org/r/1142596 is supposed to add/enable SLO alerts that Pyrra generates, so we can start checking how they work. Have we already decided what alerts we are going to use? Are the Pyrra ones good enough?
May 6 2025, 3:40 PM · Citoid, SRE-SLO, Observability-Metrics, SRE

Apr 25 2025

herron added a comment to T391687: Consider sharding big logging indices.

Another option is partitioning the data into more indexes to reduce index size.

Apr 25 2025, 5:44 PM · Observability-Logging
herron added a comment to T392230: Explore improved isolation of non-ECS k8s log topics/indices.

An additional factor that I should have put in the description is the logstash pipeline(s). Backpressure on the opensearch output with the current single logstash pipeline slowed the opensearch indexing rate to about 1/5 the usual amount. Probably we should factor pipelines into the partitioning scheme as well, and could align topic, logstash pipeline and index to some degree.

Apr 25 2025, 5:34 PM · Observability-Logging
herron added a comment to T392488: kafka-logging2005 is down since six days.

Can confirm topics have rebalanced as well. Thanks!

Apr 25 2025, 3:13 PM · SRE, SRE Observability (FY2024/2025-Q4), DC-Ops, ops-codfw

Apr 23 2025

herron added a comment to T392488: kafka-logging2005 is down since six days.

Hi @Jhancock.wm I'll be helping out with this as a kafka-logging service owner, yes please proceed!

Apr 23 2025, 3:22 PM · SRE, SRE Observability (FY2024/2025-Q4), DC-Ops, ops-codfw

Apr 17 2025

herron updated the task description for T369122: On-call batphone escalation configuration holidays FY2024/25.
Apr 17 2025, 7:43 PM · SRE Observability (FY2024/2025-Q4)
herron placed T392230: Explore improved isolation of non-ECS k8s log topics/indices up for grabs.
Apr 17 2025, 3:07 PM · Observability-Logging
herron created T392230: Explore improved isolation of non-ECS k8s log topics/indices.
Apr 17 2025, 3:06 PM · Observability-Logging
herron updated the task description for T391714: Review logging cluster merge pressure.
Apr 17 2025, 2:23 PM · Patch-For-Review, Observability-Logging
herron closed T392092: Review logging index refresh_intervals, a subtask of T391714: Review logging cluster merge pressure, as Declined.
Apr 17 2025, 2:22 PM · Patch-For-Review, Observability-Logging
herron closed T392092: Review logging index refresh_intervals as Declined.

I think we're ok to just decline this for now. We tried doubling refresh_interval across all indices and saw no change even under extreme load. We've reverted back to the original value (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1137027). Declining this one and focusing on other tunables now

Apr 17 2025, 2:22 PM · SRE Observability (FY2024/2025-Q3), Observability-Logging
herron added a comment to T391714: Review logging cluster merge pressure.

Change #1136394 merged by Herron:

[operations/puppet@production] logstash: increase refresh_interval to 10s in index templates

https://gerrit.wikimedia.org/r/1136394

Apr 17 2025, 2:20 PM · Patch-For-Review, Observability-Logging

Apr 16 2025

herron added a comment to T391714: Review logging cluster merge pressure.
Apr 16 2025, 1:56 PM · Patch-For-Review, Observability-Logging
herron updated the task description for T391714: Review logging cluster merge pressure.
Apr 16 2025, 1:52 PM · Patch-For-Review, Observability-Logging
herron created T392092: Review logging index refresh_intervals.
Apr 16 2025, 1:52 PM · SRE Observability (FY2024/2025-Q3), Observability-Logging

Apr 15 2025

herron added a comment to T387350: liftwing SLO performance issues.

Yes, although for the purposes of e.g. citoid latency SLO, we need just a few labels: source_workload_namespace, destination_canonical_service, le, and site, along with the typical instance, job, etc.

Apr 15 2025, 2:31 PM · SRE Observability (FY2024/2025-Q4), SRE-SLO, Observability-Metrics

Apr 14 2025

herron added a comment to T391852: Create a Pyrra template for Istio-based K8s services and apply it to Citoid.

First thing I notice is the first panel (using recording rule) applies rate(sum()) and the second panel sum(rate())

Apr 14 2025, 5:22 PM · Citoid, SRE-SLO, Observability-Metrics, SRE
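
To illustrate why the order of rate and sum noted in the comment above can matter, here is a minimal Python sketch. It is not PromQL and not the actual recording rule: the two series, their values, and the simplified reset handling are assumptions made for illustration. It mimics how a counter increase treats any drop as a reset, and shows that summing first can turn a single-series reset into a large apparent spike.

```python
def increase(samples):
    """Counter increase over samples, treating any drop as a reset (simplified)."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # On a reset the counter restarts near zero, so the new value is the increase.
        total += cur - prev if cur >= prev else cur
    return total

# Two hypothetical per-pod counters; pod_b restarts between the 2nd and 3rd sample.
pod_a = [100, 110, 120, 130]
pod_b = [50, 60, 5, 15]

# sum(rate()) style: handle resets per series, then aggregate.
sum_of_increases = increase(pod_a) + increase(pod_b)

# rate(sum()) style: aggregate first; the reset is buried inside the summed series.
summed = [a + b for a, b in zip(pod_a, pod_b)]
increase_of_sum = increase(summed)

print(sum_of_increases, increase_of_sum)
```

In this toy data the per-series approach yields 55, while aggregating first yields 165, because the drop in the summed series is misread as a reset of the whole total.
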
herron updated subscribers of T391714: Review logging cluster merge pressure.

Change #1136394 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash: increase refresh_interval to 10s in index templates

https://gerrit.wikimedia.org/r/1136394

Apr 14 2025, 3:06 PM · Patch-For-Review, Observability-Logging
herron triaged T391793: FIRING: [2x] DatasourceNoData: <no value> - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData as Medium priority.

I had a look at what's generating these and they map back to various alerts defined in Grafana which make use of the graphite datasource.

Apr 14 2025, 2:14 PM · Wikidata, Observability-Alerting

Apr 11 2025

herron added a comment to T391687: Consider sharding big logging indices.

Adding shards is worth a try IMO. This got me thinking, what metrics should we monitor to know how much of an improvement sharding makes?

Apr 11 2025, 7:41 PM · Observability-Logging
herron added a subtask for T391687: Consider sharding big logging indices: T391714: Review logging cluster merge pressure.
Apr 11 2025, 7:18 PM · Observability-Logging
herron added a parent task for T391714: Review logging cluster merge pressure: T391687: Consider sharding big logging indices.
Apr 11 2025, 7:18 PM · Patch-For-Review, Observability-Logging
herron created T391714: Review logging cluster merge pressure.
Apr 11 2025, 6:25 PM · Patch-For-Review, Observability-Logging

Apr 7 2025

herron changed the status of T385727: etcd: adapt etcd-backup.py for etcd 3.4, a subtask of T381417: aux-k8s-codfw cluster setup, from Open to Stalled.
Apr 7 2025, 2:56 PM · SRE Observability (FY2024/2025-Q3), Infrastructure-Foundations, SRE, Kubernetes
herron changed the status of T385727: etcd: adapt etcd-backup.py for etcd 3.4 from Open to Stalled.
Apr 7 2025, 2:56 PM · SRE Observability, Kubernetes, SRE
herron closed T390938: Extend srv LV on kafka-logging hosts as Resolved.

Turns out we had superuser reserve enabled on these filesystems as well, so I've updated that to 0% and extended the srv filesystems. With that we get the same increase in FS capacity while keeping a ~500G reserve. Since we've seen some long lead times on storage I think it's a good idea to be conservative with reserve capacity for an emergency.

Apr 7 2025, 2:35 PM · SRE Observability

Apr 2 2025

herron added a comment to T390194: Add read-only users capability to logs-api.svc.

Rudimentary ACL would be possible in apache but FWIW we talked about this a bit in the beginning of 2023 in 881839 and the consensus then was that the security plugin will handle this

Apr 2 2025, 2:28 PM · SRE Observability (FY2024/2025-Q4)

Mar 19 2025

herron updated the task description for T378742: aux-k8s: eqiad expansion, codfw creation, & future hopes and dreams.
Mar 19 2025, 6:00 PM · Infrastructure-Foundations, SRE, Kubernetes, Epic
herron closed T381417: aux-k8s-codfw cluster setup as Resolved.

Thanks to @ssingh the k8s-ingress-aux.svc.codfw.wmnet LVS is alive!

Mar 19 2025, 5:59 PM · SRE Observability (FY2024/2025-Q3), Infrastructure-Foundations, SRE, Kubernetes
herron closed T381417: aux-k8s-codfw cluster setup, a subtask of T378742: aux-k8s: eqiad expansion, codfw creation, & future hopes and dreams, as Resolved.
Mar 19 2025, 5:59 PM · Infrastructure-Foundations, SRE, Kubernetes, Epic
herron updated the task description for T381417: aux-k8s-codfw cluster setup.
Mar 19 2025, 5:54 PM · SRE Observability (FY2024/2025-Q3), Infrastructure-Foundations, SRE, Kubernetes

Mar 13 2025

herron closed T388586: aux-k8s-codfw enable bgp, a subtask of T381417: aux-k8s-codfw cluster setup, as Resolved.
Mar 13 2025, 2:09 PM · SRE Observability (FY2024/2025-Q3), Infrastructure-Foundations, SRE, Kubernetes
herron closed T388586: aux-k8s-codfw enable bgp as Resolved.

Great -- Just re-set BGP true on the aux-k8s-(ctrl|worker)2* nodes in netbox and the homer diff/commit completed successfully. Thanks @cmooney!

Mar 13 2025, 2:08 PM · SRE Observability (FY2024/2025-Q3), Infrastructure-Foundations, SRE, Kubernetes