andrea.denisse (denisse)
Animal

User Details

User Since
Apr 26 2022, 12:59 AM (206 w, 4 d)
Availability
Available
IRC Nick
denisse
LDAP User
Unknown
MediaWiki User
ADenisse-WMF [ Global Accounts ]

Recent Activity

Yesterday

andrea.denisse moved T419820: Requesting access to analytics-admins for Jerrywang from Untriaged to Awaiting User Input on the SRE-Access-Requests board.
Fri, Apr 10, 4:54 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests

Mar 10 2026

andrea.denisse changed the status of T418723: Materialize analytics queries to improve superset dashboard latency from Open to In Progress.
Mar 10 2026, 8:26 PM · Patch-For-Review, SRE, Wikidata Platform Team (Sprint 03 (2026/03/03)), OKR-Work
andrea.denisse changed the status of T419029: Grant Access to ops for ebernhardson from Open to Stalled.

I was thinking of access as not solving the current issue, as we have a plan forward for that, but as more of addressing possibilities on a longer-term basis. It seems like once or twice a year I run into something that would go easier if I had more access. I see from the puppet data.yaml file that we have a couple, but very few, engineers with ops access. This isn't the first time the question of ops level access has come up, but in the past I've pushed off requesting access as it seemed not strictly necessary. It's still not strictly necessary, but I'm leaning towards this easing some of the work I do. The full solutions, like the readahead support being set up now, would still be the end-state we would be looking for, but the additional access would better allow figuring out where these things need to be before the full solution is ready to be deployed.

Historically we've used root access on the search fleet for a variety of reasons. Sometimes debugging amounts to attaching strace to a process, using a kernel probe (these days it would probably be bpf) to print the arguments to a particular kernel function, using the kernel page-types tool to evaluate file-cache effectiveness, or putting together a custom C program that reaches in with ptrace_do to execute syscalls inside the server process. These cases are rare, but I don't expect us to ever have general solutions for deep-diving into exactly how something works to understand why it isn't behaving as expected. I've been doing something similar to the ptrace_do approach in recent days, but it requires using LD_PRELOAD to inject the custom code, which means I need to roll a cluster restart for each evaluation.

Essentially, I noticed while trying to debug the opensearch servers in k8s recently that I don't have access to tools that I've used in the past, and it ends up being much more of a black box than it has to be.

Mar 10 2026, 8:20 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27), SRE-Access-Requests, SRE
andrea.denisse added a comment to T390734: Requesting Kerberos access for ben.buchenau.

Hello guys - follow-up request regarding Kerberos authentication: Can I get a keytab file for my user?

I checked the Kerberos docs on Wikitech and saw this is the common way to go for automated services. I currently fetch aggregated data in a scheduled script using Spark for WMDE internal monitoring, but this is only semi-automatic since I have to kinit every second week. We are building an internal dashboard on top of that soon, and having a keytab file for fully automated authentication would be very practical for that.

Best, Ben (Data Analyst at WMDE)
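
As a rough illustration of the keytab-based flow requested above, a scheduled job could obtain a ticket without a password prompt before running its queries. This is only a minimal sketch: the keytab path and principal are hypothetical placeholders, and the helper names are not taken from any Wikimedia tooling. It relies on the `-k` and `-t` options of MIT Kerberos `kinit`.

```python
import subprocess
from typing import List


def kinit_command(keytab_path: str, principal: str) -> List[str]:
    """Build a non-interactive kinit invocation that reads the key from a keytab."""
    # -k: authenticate with a keytab instead of prompting for a password.
    # -t: path of the keytab file to use.
    return ["kinit", "-k", "-t", keytab_path, principal]


def renew_ticket(keytab_path: str, principal: str) -> None:
    """Obtain a fresh ticket; raises CalledProcessError if kinit fails."""
    subprocess.run(kinit_command(keytab_path, principal), check=True)


if __name__ == "__main__":
    # Hypothetical path and principal, for illustration only.
    print(" ".join(kinit_command("/path/to/user.keytab", "user@EXAMPLE.ORG")))
```

Calling `renew_ticket` at the start of the scheduled script would replace the manual kinit step, assuming a keytab has actually been issued for the account.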

Mar 10 2026, 8:13 PM · Data-Platform-SRE (2025.03.22 - 2025.04.11), SRE, Data-Engineering-Radar, SRE-Access-Requests, Data-Engineering
andrea.denisse closed T419145: Requesting access to analytics-privatedata-users for EMcFarland as Resolved.

Hi @EMcFarland-WMF , access to the analytics-privatedata-users is granted along with the kerberos principal. You should receive an email regarding the kerberos principal requesting you to change your password. Feel free to reopen if there's anything else I can help with.

Mar 10 2026, 8:00 PM · Data-Engineering, SRE, SRE-Access-Requests
andrea.denisse updated the task description for T419145: Requesting access to analytics-privatedata-users for EMcFarland.
Mar 10 2026, 7:59 PM · Data-Engineering, SRE, SRE-Access-Requests
andrea.denisse added a project to T419145: Requesting access to analytics-privatedata-users for EMcFarland: Data-Engineering.
Mar 10 2026, 6:03 PM · Data-Engineering, SRE, SRE-Access-Requests
andrea.denisse updated the task description for T419145: Requesting access to analytics-privatedata-users for EMcFarland.
Mar 10 2026, 5:56 PM · Data-Engineering, SRE, SRE-Access-Requests
andrea.denisse changed the status of T419145: Requesting access to analytics-privatedata-users for EMcFarland from Open to In Progress.
Mar 10 2026, 1:50 AM · Data-Engineering, SRE, SRE-Access-Requests

Dec 18 2025

andrea.denisse closed T413006: Add yubikey SSH key for 'denisse' as Resolved.
Dec 18 2025, 8:39 PM · SRE, SRE-Access-Requests
andrea.denisse added a comment to T413006: Add yubikey SSH key for 'denisse'.

@andrea.denisse I assume you'll handle this yourself, or do you need help from clinic duty?

Dec 18 2025, 8:39 PM · SRE, SRE-Access-Requests

Dec 17 2025

andrea.denisse changed the status of T413006: Add yubikey SSH key for 'denisse' from Open to In Progress.
Dec 17 2025, 7:48 PM · SRE, SRE-Access-Requests
andrea.denisse moved T413006: Add yubikey SSH key for 'denisse' from Untriaged to Awaiting User Input on the SRE-Access-Requests board.
Dec 17 2025, 7:48 PM · SRE, SRE-Access-Requests
andrea.denisse created T413006: Add yubikey SSH key for 'denisse'.
Dec 17 2025, 7:47 PM · SRE, SRE-Access-Requests
andrea.denisse edited projects for T412793: Rotate statuspage API keys, added: Observability-Metrics; removed SRE Observability.
Dec 17 2025, 3:19 PM · SRE Observability
andrea.denisse moved T412793: Rotate statuspage API keys from Inbox to Backlog on the SRE Observability board.
Dec 17 2025, 3:19 PM · SRE Observability
andrea.denisse claimed T412793: Rotate statuspage API keys.
Dec 17 2025, 3:18 PM · SRE Observability
andrea.denisse edited projects for T412842: arclamp hosts ran out of space, added: Observability-Logging; removed observability.
Dec 17 2025, 3:10 PM · Observability-Logging, Arc-Lamp
andrea.denisse added a comment to T412842: arclamp hosts ran out of space.

Maybe we should look into implementing a way for arclamp to create tasks when this issue happens.

Dec 17 2025, 3:06 PM · Observability-Logging, Arc-Lamp

Dec 4 2025

andrea.denisse changed the status of T411436: Grant Access to analytics-privatedata-users for Silvia G from In Progress to Stalled.

In case it is useful: the MediaWiki page @Rmaung pointed out was the first thing that came up for me when I googled "Request Access Superset Wikimedia", so it might be good to update that one too? In any case, I'm super happy to learn about the proper process, so thank you, @Aklapper!

Hi Silvia, could you please update the task with the fields from the https://phabricator.wikimedia.org/maniphest/task/edit/form/8/ form?

Dec 4 2025, 10:41 PM · SRE, SRE-Access-Requests
andrea.denisse updated the task description for T411436: Grant Access to analytics-privatedata-users for Silvia G.
Dec 4 2025, 10:36 PM · SRE, SRE-Access-Requests
andrea.denisse edited projects for T411774: Requesting a new group allowing shell access to kafka-jumbo servers - with membership for JavierMonton, added: Infrastructure-Foundations, Data-Platform-SRE; removed SRE, SRE-Access-Requests.

Hi folks, if I understand correctly the access request can't be fulfilled because the requested type of access doesn't exist.

Dec 4 2025, 10:35 PM · Data-Platform-SRE (2026.01.05 - 2026.01.23), Essential-Work, Infrastructure-Foundations
andrea.denisse added a comment to T411730: Add FIDO-backed SSH key for brennen.

Hi folks, the patch for this task is merged. Can we close it as resolved?

Dec 4 2025, 10:31 PM · Essential-Work, SRE, User-brennen, SRE-Access-Requests
andrea.denisse closed T411679: Requesting access to analytics-privatedata-users for astein, a subtask of T405517: Make the shell group analytics-privatedata-users less confusing, as Resolved.
Dec 4 2025, 10:30 PM · Data-Platform-SRE, SRE
andrea.denisse closed T411679: Requesting access to analytics-privatedata-users for astein as Resolved.

Closing as resolved, feel free to reopen if there's anything else I can assist with.

Dec 4 2025, 10:30 PM · Fundraising-Backlog, SRE, SRE-Access-Requests
andrea.denisse edited projects for T408704: offline rackspace wikitech-static, online aws wikitech-static, added: Infrastructure-Foundations; removed SRE.
Dec 4 2025, 10:28 PM · Infrastructure-Foundations
andrea.denisse added a project to T403298: Provide auth-less access to Enterprise APIs from WMF Analytics cluster: Data-Platform-SRE.
Dec 4 2025, 10:28 PM · Data-Platform-SRE, Data-Engineering, Data-Platform, Wikimedia Enterprise
andrea.denisse edited projects for T403298: Provide auth-less access to Enterprise APIs from WMF Analytics cluster, added: Data-Platform; removed SRE.
Dec 4 2025, 10:27 PM · Data-Platform-SRE, Data-Engineering, Data-Platform, Wikimedia Enterprise
andrea.denisse closed T411624: Requesting access to analytics-privatedata-users for Riku Silvola as Resolved.

Closing as resolved, please let me know if there's anything else I can assist with.

Dec 4 2025, 10:11 PM · SRE, SRE-Access-Requests
andrea.denisse updated the task description for T411624: Requesting access to analytics-privatedata-users for Riku Silvola.
Dec 4 2025, 10:04 PM · SRE, SRE-Access-Requests
andrea.denisse updated the task description for T411624: Requesting access to analytics-privatedata-users for Riku Silvola.
Dec 4 2025, 9:26 PM · SRE, SRE-Access-Requests
andrea.denisse changed the status of T411624: Requesting access to analytics-privatedata-users for Riku Silvola from Open to In Progress.
Dec 4 2025, 9:25 PM · SRE, SRE-Access-Requests
andrea.denisse added a comment to T411612: Requesting access to analytics_privatedata_users for Thiemo Kreuz (WMDE).

Closing as resolved, please let me know if there's anything else I can assist with.

Dec 4 2025, 9:15 PM · SRE, SRE-Access-Requests
andrea.denisse closed T411543: Requesting access to analytics-privatedata-users for medelius as Resolved.

Closing as resolved, please let me know if there's anything else I can assist with.

Dec 4 2025, 9:14 PM · SRE, SRE-Access-Requests
andrea.denisse updated the task description for T411543: Requesting access to analytics-privatedata-users for medelius.
Dec 4 2025, 9:11 PM · SRE, SRE-Access-Requests
andrea.denisse updated the task description for T411543: Requesting access to analytics-privatedata-users for medelius.
Dec 4 2025, 8:50 PM · SRE, SRE-Access-Requests
andrea.denisse closed T411612: Requesting access to analytics_privatedata_users for Thiemo Kreuz (WMDE) as Resolved.
Dec 4 2025, 8:47 PM · SRE, SRE-Access-Requests
andrea.denisse changed the status of T411612: Requesting access to analytics_privatedata_users for Thiemo Kreuz (WMDE) from Open to In Progress.
Dec 4 2025, 7:11 PM · SRE, SRE-Access-Requests

Dec 3 2025

andrea.denisse changed the status of T411679: Requesting access to analytics-privatedata-users for astein from Open to In Progress.
Dec 3 2025, 9:32 PM · Fundraising-Backlog, SRE, SRE-Access-Requests
andrea.denisse updated the task description for T411543: Requesting access to analytics-privatedata-users for medelius.
Dec 3 2025, 9:30 PM · SRE, SRE-Access-Requests
andrea.denisse updated the task description for T411543: Requesting access to analytics-privatedata-users for medelius.
Dec 3 2025, 9:29 PM · SRE, SRE-Access-Requests
andrea.denisse changed the status of T411506: Requesting update of SSH key for zoe from Open to In Progress.

I wrote to Zoe directly to confirm this request.

Dec 3 2025, 6:49 AM · SRE, SRE-Access-Requests
andrea.denisse edited projects for T410572: Replace deprecated Phabricator Conduit API call by @ProdPasteBot with its stable equivalent, added: collaboration-services; removed SRE.

Was the collaboration-services tag removed? This tag showed up on the Clinic Duty dashboard.

Dec 3 2025, 6:45 AM · Essential-Work, Release-Engineering-Team (Doing 😎), collaboration-services, Phabricator
andrea.denisse changed the status of T411436: Grant Access to analytics-privatedata-users for Silvia G from Open to In Progress.

In case it is useful: the MediaWiki page @Rmaung pointed out was the first thing that came up for me when I googled "Request Access Superset Wikimedia", so it might be good to update that one too? In any case, I'm super happy to learn about the proper process, so thank you, @Aklapper!

Dec 3 2025, 6:38 AM · SRE, SRE-Access-Requests
andrea.denisse updated the task description for T411543: Requesting access to analytics-privatedata-users for medelius.
Dec 3 2025, 6:27 AM · SRE, SRE-Access-Requests
andrea.denisse changed the status of T411543: Requesting access to analytics-privatedata-users for medelius from Open to In Progress.

Hi @KFrancis, I was unable to find @medelius on the NDA spreadsheet, could you please help me to confirm their NDA status?

Dec 3 2025, 6:26 AM · SRE, SRE-Access-Requests
andrea.denisse added a comment to T411543: Requesting access to analytics-privatedata-users for medelius.

Hi @VPuffetMichel , do you approve this request?

Dec 3 2025, 6:24 AM · SRE, SRE-Access-Requests
andrea.denisse updated the task description for T411543: Requesting access to analytics-privatedata-users for medelius.
Dec 3 2025, 6:23 AM · SRE, SRE-Access-Requests

Dec 2 2025

andrea.denisse claimed T411543: Requesting access to analytics-privatedata-users for medelius.
Dec 2 2025, 8:49 PM · SRE, SRE-Access-Requests
andrea.denisse claimed T411506: Requesting update of SSH key for zoe.
Dec 2 2025, 8:48 PM · SRE, SRE-Access-Requests
andrea.denisse claimed T411436: Grant Access to analytics-privatedata-users for Silvia G.
Dec 2 2025, 8:48 PM · SRE, SRE-Access-Requests
andrea.denisse added a comment to T411404: Update SSH key for kamila.

Hi Raine, this is on the clinic duty dashboard.

Dec 2 2025, 8:46 PM · SRE-Access-Requests
andrea.denisse added a comment to T411365: Yubikey-SSH-FIDO for Hugh Nowlan (hnowlan).

Hi Hugh, this is on the clinic duty dashboard.

Dec 2 2025, 8:45 PM · SRE-Access-Requests, SRE

Nov 12 2025

andrea.denisse added a comment to T367370: Shift frack alerting to use prometheus-alertmanager instead of icinga.

@andrea.denisse Thanks for the feedback. We want to mimic what we currently have for the icinga alerting groups. I think I have done that with this commit (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1204648). It goes a little further to split out fr-tech and fr-tech-ops since we have some hardware/OS level alerts that the whole fr-tech team doesn't need to have.

Our alert rules will live in fundraising and fire from our prometheus server. Given that, I believe we can reference the team in our rules to make sure that they get properly routed. Here is an example of one of our rules we have been testing locally:

---
groups:
  - name: queue-alerts
    rules:
      - &redis_queue_size_warn
        alert: RedisQueueSize
        expr:
          redis_queue_total{
              cluster="frqueue",
              queue=~"(contribution_tracking|payments_init|pending|refund)"}
            > 1500
        for: 5m
        labels:
          severity: warning
          team: 'fr-tech'
        annotations:
          description: 'High redis queue size'
          summary: "Redis Queue {{ $labels.queue }} is high: [{{ $value }}]"
          dashboard: 'https://frmon.wikimedia.org/d/R5m3iU1Wk/queue?orgId=1&from=now-24h&to=now&timezone=utc'
      - &redis_queue_size_crit
        alert: RedisQueueSize
        expr:
          redis_queue_total{
              cluster="frqueue",
              queue=~"(contribution_tracking|payments_init|pending|refund)"}
            > 2000
        for: 5m
        labels:
          severity: critical
          team: 'fr-tech'
        annotations:
          description: 'Critical redis queue size'
          summary: "Redis Queue {{ $labels.queue }} is Critical: [{{ $value }}]"
          dashboard: 'https://frmon.wikimedia.org/d/R5m3iU1Wk/queue?orgId=1&from=now-24h&to=now&timezone=utc'

Does this make sense?

As for the PfwCoreBGPDown alert: reading through the config, I think we are getting email alerts because of this rule. When we have the new groups set up, maybe we can have that route to our paging level.
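
The idea in this comment, routing alerts by their `team` label and escalating by queue size, can be sketched in a few lines. This is a simplified model, not the Alertmanager implementation: the receiver names are hypothetical, the `for: 5m` hold time is ignored, and `severity_for` reports only the highest severity even though both rules would be firing above 2000.

```python
from typing import Dict

# Hypothetical receiver names; the real Alertmanager configuration may differ.
TEAM_RECEIVERS: Dict[str, str] = {
    "fr-tech": "fr-tech-email",
    "fr-tech-ops": "fr-tech-ops-email",
}
DEFAULT_RECEIVER = "default"


def pick_receiver(labels: Dict[str, str]) -> str:
    """Return the receiver for an alert based on its `team` label."""
    return TEAM_RECEIVERS.get(labels.get("team", ""), DEFAULT_RECEIVER)


def severity_for(queue_size: float) -> str:
    """Mirror the two thresholds in the rules: > 1500 warning, > 2000 critical.

    Returns only the highest severity whose threshold is exceeded.
    """
    if queue_size > 2000:
        return "critical"
    if queue_size > 1500:
        return "warning"
    return "ok"
```

An alert without a `team` label falls through to the default receiver, which is why consistent tagging matters when sharing a production Alertmanager with other teams.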

Nov 12 2025, 8:50 PM · Observability-Alerting, Fundraising-Backlog, fundraising-tech-ops
andrea.denisse added a comment to T367370: Shift frack alerting to use prometheus-alertmanager instead of icinga.

@fgiunchedi We (fr-tech) are getting close to live testing with some alerts. We have started to build a set of alerts and are firing them off to our local alertmanager instance that will just send us email. With a config change, we could start pointing those at the production alertmanager instance.

We think the next logical step would be to set up the contact groups (team?) within alertmanager so that we can get them routed correctly via email/irc/splunk on-call. We want to make sure that we tag our alerts properly so that we don't cause issues for other folks. Is this something you can assist us with/point us in the right direction for?

Nov 12 2025, 12:13 AM · Observability-Alerting, Fundraising-Backlog, fundraising-tech-ops

Nov 4 2025

andrea.denisse closed T408145: Improve the Alertmanager app templates as Resolved.
Nov 4 2025, 11:55 PM · SRE Observability (FY2025/2026-Q1)

Oct 23 2025

andrea.denisse created T408145: Improve the Alertmanager app templates.
Oct 23 2025, 6:17 PM · SRE Observability (FY2025/2026-Q1)
andrea.denisse claimed T401908: Define a policy for Grafana Alerting.

Hi folks, I'm working on updating the Grafana alerts Wikitech section. It's still a WIP but I'd greatly appreciate your feedback:

Oct 23 2025, 3:07 AM · SRE Observability (FY2025/2026-Q1), Grafana
andrea.denisse updated the task description for T405151: Create base alerts for REST API to Slack.
Oct 23 2025, 1:39 AM · MW-Interfaces-Team (MWI-Sprint-21 (2025-10-21 to 2025-11-04)), Patch-For-Review, OKR-Work, FY2025-26 WE5.2.3 API Monitoring & Alarms
andrea.denisse closed T405151: Create base alerts for REST API to Slack as Resolved.

Hi everyone,

Oct 23 2025, 1:39 AM · MW-Interfaces-Team (MWI-Sprint-21 (2025-10-21 to 2025-11-04)), Patch-For-Review, OKR-Work, FY2025-26 WE5.2.3 API Monitoring & Alarms

Oct 15 2025

andrea.denisse added a comment to T406689: Email alerts from Grafana stopped working?.

#api-alerts channel

Ah cool, didn't know, I'll try that out, thank you!

Oct 15 2025, 4:10 PM · Observability-Alerting, Grafana
andrea.denisse changed the status of T406689: Email alerts from Grafana stopped working? from Open to In Progress.

Hi @andrea.denisse, I wanted to check if you know what happened here. Is it an upgrade that changed it? Please let me know if there's anything I can do to help!

Oct 15 2025, 3:06 PM · Observability-Alerting, Grafana

Oct 8 2025

andrea.denisse claimed T406689: Email alerts from Grafana stopped working?.
Oct 8 2025, 2:28 PM · Observability-Alerting, Grafana

Oct 1 2025

andrea.denisse updated the task description for T353912: Observability Bookworm upgrades.
Oct 1 2025, 2:39 PM · SRE Observability (FY2025/2026-Q1), observability, Patch-For-Review
andrea.denisse added a comment to T405151: Create base alerts for REST API to Slack.

I've finished the patch and the tests. The alerts now measure across all of the clusters, so all the responses are accounted for when computing the percentages, and the alerts trigger as soon as those conditions are met. https://gerrit.wikimedia.org/r/c/operations/alerts/+/1192183/
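
To illustrate what measuring across all clusters means for a percentage, the sketch below sums error and total responses over every cluster before dividing, rather than evaluating each cluster separately. The function name, cluster names, and numbers are hypothetical and are not taken from the linked patch.

```python
from typing import Dict, Tuple


def overall_error_ratio(per_cluster: Dict[str, Tuple[int, int]]) -> float:
    """Compute the error ratio over all clusters combined.

    per_cluster maps a cluster name to (error_responses, total_responses).
    """
    errors = sum(e for e, _ in per_cluster.values())
    total = sum(t for _, t in per_cluster.values())
    # Avoid dividing by zero when no responses were recorded at all.
    return errors / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical counts: 50 errors out of 1000 responses in total.
    sample = {"cluster-a": (5, 100), "cluster-b": (45, 900)}
    print(overall_error_ratio(sample))
```

Aggregating first means a small cluster with few requests cannot dominate the percentage on its own.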

Oct 1 2025, 2:49 AM · MW-Interfaces-Team (MWI-Sprint-21 (2025-10-21 to 2025-11-04)), Patch-For-Review, OKR-Work, FY2025-26 WE5.2.3 API Monitoring & Alarms

Sep 29 2025

andrea.denisse updated subscribers of T404888: Parse DMARC reports and create a dashboard from data.

Hi folks,

Sep 29 2025, 6:59 PM · Patch-For-Review, SRE Observability, Epic, Infrastructure-Foundations, Mail

Sep 24 2025

andrea.denisse added a comment to T364622: Review/cleanup content of /srv/git/private/modules/secret/secrets/ssl in the private repo.

Hi @MoritzMuehlenhoff and @Dzahn, we have an alert for the Puppet CA certificate for kibana.discovery.wmnet, which is about to expire. However, that certificate is no longer required, as the services that used it were migrated to cfssl.
I've been unable to find that cert in the locations specified in the task. Do you know where it could be located?

This certificate was created on the CA operated by the Puppet 5 servers (which is where we managed pretty much all internal certs before cfssl was introduced). You can remove it by logging into puppetmaster1001.eqiad.wmnet and running

sudo puppet cert clean kibana.discovery.wmnet

The alert should then auto-resolve a bit later.

Sep 24 2025, 8:08 PM · Puppet-Infrastructure, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE
andrea.denisse added a comment to T364622: Review/cleanup content of /srv/git/private/modules/secret/secrets/ssl in the private repo.

Hi @MoritzMuehlenhoff and @Dzahn, we have an alert for the Puppet CA certificate for kibana.discovery.wmnet, which is about to expire. However, that certificate is no longer required, as the services that used it were migrated to cfssl.

Sep 24 2025, 1:39 AM · Puppet-Infrastructure, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE

Sep 22 2025

andrea.denisse closed T401730: Add a pathway for Alertmanager to send alerts in Slack as Resolved.

Hi folks, I added the documentation on using this to Wikitech: https://wikitech.wikimedia.org/wiki/Alertmanager#Sending_alerts_to_Slack

Sep 22 2025, 6:34 PM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms
andrea.denisse updated the task description for T401730: Add a pathway for Alertmanager to send alerts in Slack.
Sep 22 2025, 6:33 PM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms

Sep 12 2025

andrea.denisse added a comment to T401730: Add a pathway for Alertmanager to send alerts in Slack.

Hi folks, I’ve updated the Wikitech documentation for this feature. I’d really appreciate your feedback: https://wikitech.wikimedia.org/wiki/Alertmanager#Sending_alerts_to_Slack

Sep 12 2025, 7:30 PM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms

Sep 4 2025

andrea.denisse updated the task description for T401730: Add a pathway for Alertmanager to send alerts in Slack.
Sep 4 2025, 12:15 AM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms

Aug 28 2025

andrea.denisse updated the task description for T401730: Add a pathway for Alertmanager to send alerts in Slack.
Aug 28 2025, 12:06 AM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms
andrea.denisse added a comment to T401730: Add a pathway for Alertmanager to send alerts in Slack.

Hi folks,

Aug 28 2025, 12:06 AM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms

Aug 22 2025

andrea.denisse changed the status of T401730: Add a pathway for Alertmanager to send alerts in Slack from Open to In Progress.
Aug 22 2025, 4:20 PM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms

Aug 21 2025

andrea.denisse added a comment to T401908: Define a policy for Grafana Alerting.

Hi folks, Grafana 12.1.1 introduced a couple of features regarding alerting,

Aug 21 2025, 10:41 PM · SRE Observability (FY2025/2026-Q1), Grafana

Aug 20 2025

andrea.denisse removed a parent task for T401730: Add a pathway for Alertmanager to send alerts in Slack: T401908: Define a policy for Grafana Alerting.
Aug 20 2025, 2:23 PM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms
andrea.denisse removed a subtask for T401908: Define a policy for Grafana Alerting: T401730: Add a pathway for Alertmanager to send alerts in Slack.
Aug 20 2025, 2:23 PM · SRE Observability (FY2025/2026-Q1), Grafana
andrea.denisse claimed T401730: Add a pathway for Alertmanager to send alerts in Slack.
Aug 20 2025, 2:22 PM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms
andrea.denisse added a comment to T402247: rsyslog is segfaulting non-stop on ms-be1071.

I think that the drive is failing:

sudo dmesg | grep -i 'error\|fail\|ata':

[37968478.484217] blk_update_request: I/O error, dev sdg, sector 348391680 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 0
[37968478.495312] XFS: metadata IO error: 13 callbacks suppressed
[37968478.495415] XFS (sdg1): metadata I/O error in "xfs_imap_to_bp+0x61/0xb0 [xfs]" at daddr 0x14c40100 len 32 error 5
[37968478.646386] sd 0:2:5:0: [sdg] tag#345 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.690658] sd 0:2:5:0: [sdg] tag#804 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.714318] sd 0:2:5:0: [sdg] tag#806 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.728083] sd 0:2:5:0: [sdg] tag#807 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.744146] sd 0:2:5:0: [sdg] tag#808 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.760181] sd 0:2:5:0: [sdg] tag#810 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.760209] sd 0:2:5:0: [sdg] tag#810 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[37968478.760218] sd 0:2:5:0: [sdg] tag#810 Sense Key : Medium Error [current]
[37968478.760237] blk_update_request: I/O error, dev sdg, sector 348391680 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 0
[37968478.771419] XFS (sdg1): metadata I/O error in "xfs_imap_to_bp+0x61/0xb0 [xfs]" at daddr 0x14c40100 len 32 error 5
[37968478.783012] sd 0:2:5:0: [sdg] tag#817 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.796246] sd 0:2:5:0: [sdg] tag#772 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.812163] sd 0:2:5:0: [sdg] tag#779 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.828086] sd 0:2:5:0: [sdg] tag#782 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.848122] sd 0:2:5:0: [sdg] tag#787 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.868260] sd 0:2:5:0: [sdg] tag#791 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.868277] sd 0:2:5:0: [sdg] tag#791 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[37968478.868282] sd 0:2:5:0: [sdg] tag#791 Sense Key : Medium Error [current]
[37968478.868295] blk_update_request: I/O error, dev sdg, sector 348391680 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 0
[37968478.889217] XFS (sdg1): metadata I/O error in "xfs_imap_to_bp+0x61/0xb0 [xfs]" at daddr 0x14c40100 len 32 error 5
[37968478.900866] sd 0:2:5:0: [sdg] tag#330 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.920091] sd 0:2:5:0: [sdg] tag#331 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.940089] sd 0:2:5:0: [sdg] tag#332 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.964070] sd 0:2:5:0: [sdg] tag#334 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.984109] sd 0:2:5:0: [sdg] tag#337 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968479.004076] sd 0:2:5:0: [sdg] tag#341 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968479.004097] sd 0:2:5:0: [sdg] tag#341 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[37968479.004104] sd 0:2:5:0: [sdg] tag#341 Sense Key : Medium Error [current]
[37968479.004124] blk_update_request: I/O error, dev sdg, sector 348391680 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 0
[37968479.015325] XFS (sdg1): metadata I/O error in "xfs_imap_to_bp+0x61/0xb0 [xfs]" at daddr 0x14c40100 len 32 error 5
[37968479.146415] sd 0:2:5:0: [sdg] tag#10 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968479.160143] sd 0:2:5:0: [sdg] tag#11 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968479.176058] sd 0:2:5:0: [sdg] tag#12 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968479.192162] sd 0:2:5:0: [sdg] tag#13 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968479.208110] sd 0:2:5:0: [sdg] tag#14 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968479.224069] sd 0:2:5:0: [sdg] tag#15 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968479.224084] sd 0:2:5:0: [sdg] tag#15 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[37968479.224088] sd 0:2:5:0: [sdg] tag#15 Sense Key : Medium Error [current]
[37968479.224102] blk_update_request: I/O error, dev sdg, sector 348391680 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 0
[37968479.235286] XFS (sdg1): metadata I/O error in "xfs_imap_to_bp+0x61/0xb0 [xfs]" at daddr 0x14c40100 len 32 error 5

Looking further, the drive in slot 3 shows several reallocated sectors:

==== Checking Slot 3 ====
Device Model: TOSHIBA MG06ACA800EY
Serial Number: 81U0A02YF1QF
SMART overall-health self-assessment test result: PASSED
  5 Reallocated_Sector_Ct 0x0033 096 096 010 Pre-fail Always  - 416
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age  Offline - 0

I left a smartctl test running that will complete after Tue Aug 19 13:36:35 2025 UTC.
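
A quick way to summarize a dmesg excerpt like the one in this comment is to tally block-layer I/O errors per device and sector. The following is a minimal sketch; the function name is hypothetical, and the pattern only targets the `I/O error, dev ..., sector ...` lines, so the XFS metadata errors are deliberately not counted.

```python
import re
from collections import Counter
from typing import Iterable

# Matches lines such as:
#   blk_update_request: I/O error, dev sdg, sector 348391680 op 0x0:(READ) ...
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def count_io_errors(lines: Iterable[str]) -> Counter:
    """Count block-layer I/O errors keyed by (device, sector)."""
    counts: Counter = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Example: sudo dmesg | python3 this_script.py
    for (device, sector), n in count_io_errors(sys.stdin).most_common():
        print(f"{device} sector {sector}: {n} error(s)")
```

Repeated failures on the same sector of one device, as in the excerpt, point at a localized media problem rather than a controller-wide fault, which is consistent with the Medium Error sense key.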

Aug 20 2025, 12:29 AM · SRE Observability (FY2025/2026-Q1), Observability-Logging, SRE-swift-storage
andrea.denisse added a comment to T402346: hw troubleshooting: disk (sdg) errors on ms-be1071.

While investigating T402247 I left a smartctl test running for drive 3 (which is the one I suspect was failing due to the high number of reallocated sectors), here are the results.

Aug 20 2025, 12:25 AM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
andrea.denisse created P81577 (An Untitled Masterwork).
Aug 20 2025, 12:24 AM
andrea.denisse added a comment to T383309: rsyslog receiver on centrallog hosts misplaces some log host entries.

@andrea.denisse Do you still have any debug logs available? I’m just curious...

Hi Tiziano, I don't have any debug logs. I captured and analyzed them on the host so the logs didn't leave the prod infra, but I'll enable debug logging again to share them with the upstream maintainers, as they would like to see the log headers.

My plan is to enable debug logging and to share a sample of sanitized logs with the rsyslog maintainers for their advice. I can leave the file on the host if you'd like to analyze it, any findings you make could be pretty useful to further understanding or solving the issue.

Aug 20 2025, 12:02 AM · Patch-For-Review, Observability-Logging

Aug 19 2025

andrea.denisse updated the task description for T401730: Add a pathway for Alertmanager to send alerts in Slack.
Aug 19 2025, 6:27 PM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms
andrea.denisse added a comment to T402247: rsyslog is segfaulting non-stop on ms-be1071.

rsyslog is back up and running after clearing the queue (/var/spool/rsyslog/*), which apparently was corrupted.

Aug 19 2025, 5:34 PM · SRE Observability (FY2025/2026-Q1), Observability-Logging, SRE-swift-storage
andrea.denisse added a comment to T402247: rsyslog is segfaulting non-stop on ms-be1071.

I think that the drive is failing:

sudo dmesg | grep -i 'error\|fail\|ata':

[37968478.484217] blk_update_request: I/O error, dev sdg, sector 348391680 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 0
[37968478.495312] XFS: metadata IO error: 13 callbacks suppressed
[37968478.495415] XFS (sdg1): metadata I/O error in "xfs_imap_to_bp+0x61/0xb0 [xfs]" at daddr 0x14c40100 len 32 error 5
[37968478.646386] sd 0:2:5:0: [sdg] tag#345 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.690658] sd 0:2:5:0: [sdg] tag#804 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.714318] sd 0:2:5:0: [sdg] tag#806 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.728083] sd 0:2:5:0: [sdg] tag#807 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.744146] sd 0:2:5:0: [sdg] tag#808 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.760181] sd 0:2:5:0: [sdg] tag#810 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.760209] sd 0:2:5:0: [sdg] tag#810 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[37968478.760218] sd 0:2:5:0: [sdg] tag#810 Sense Key : Medium Error [current]
[37968478.760237] blk_update_request: I/O error, dev sdg, sector 348391680 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 0
[37968478.771419] XFS (sdg1): metadata I/O error in "xfs_imap_to_bp+0x61/0xb0 [xfs]" at daddr 0x14c40100 len 32 error 5
[37968478.783012] sd 0:2:5:0: [sdg] tag#817 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.796246] sd 0:2:5:0: [sdg] tag#772 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.812163] sd 0:2:5:0: [sdg] tag#779 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.828086] sd 0:2:5:0: [sdg] tag#782 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.848122] sd 0:2:5:0: [sdg] tag#787 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.868260] sd 0:2:5:0: [sdg] tag#791 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.868277] sd 0:2:5:0: [sdg] tag#791 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[37968478.868282] sd 0:2:5:0: [sdg] tag#791 Sense Key : Medium Error [current]
[37968478.868295] blk_update_request: I/O error, dev sdg, sector 348391680 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 0
[37968478.889217] XFS (sdg1): metadata I/O error in "xfs_imap_to_bp+0x61/0xb0 [xfs]" at daddr 0x14c40100 len 32 error 5
[37968478.900866] sd 0:2:5:0: [sdg] tag#330 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.920091] sd 0:2:5:0: [sdg] tag#331 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.940089] sd 0:2:5:0: [sdg] tag#332 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.964070] sd 0:2:5:0: [sdg] tag#334 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.984109] sd 0:2:5:0: [sdg] tag#337 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968479.004076] sd 0:2:5:0: [sdg] tag#341 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968479.004097] sd 0:2:5:0: [sdg] tag#341 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[37968479.004104] sd 0:2:5:0: [sdg] tag#341 Sense Key : Medium Error [current]
[37968479.004124] blk_update_request: I/O error, dev sdg, sector 348391680 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 0
[37968479.015325] XFS (sdg1): metadata I/O error in "xfs_imap_to_bp+0x61/0xb0 [xfs]" at daddr 0x14c40100 len 32 error 5
[37968479.146415] sd 0:2:5:0: [sdg] tag#10 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968479.160143] sd 0:2:5:0: [sdg] tag#11 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968479.176058] sd 0:2:5:0: [sdg] tag#12 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968479.192162] sd 0:2:5:0: [sdg] tag#13 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968479.208110] sd 0:2:5:0: [sdg] tag#14 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968479.224069] sd 0:2:5:0: [sdg] tag#15 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968479.224084] sd 0:2:5:0: [sdg] tag#15 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[37968479.224088] sd 0:2:5:0: [sdg] tag#15 Sense Key : Medium Error [current]
[37968479.224102] blk_update_request: I/O error, dev sdg, sector 348391680 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 0
[37968479.235286] XFS (sdg1): metadata I/O error in "xfs_imap_to_bp+0x61/0xb0 [xfs]" at daddr 0x14c40100 len 32 error 5
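
The repeated reads above all fail on the same device and sector. A minimal Python sketch for summarizing that from saved dmesg output is shown below. The function name `summarize_io_errors` and the regular expression are illustrative assumptions, not part of any existing tooling, and the pattern only targets the `blk_update_request` line format seen in this log.

```python
import re
from collections import Counter

# Matches kernel block-layer lines of the form seen above, e.g.
# "... blk_update_request: I/O error, dev sdg, sector 348391680 op 0x0:(READ) ..."
# The XFS "metadata I/O error in ..." lines are intentionally not matched.
IO_ERROR = re.compile(r"I/O error, dev (?P<dev>\w+), sector (?P<sector>\d+)")


def summarize_io_errors(lines):
    """Count block-layer I/O errors per (device, sector) pair.

    Hypothetical helper: `lines` is any iterable of dmesg lines.
    """
    counts = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            counts[(match["dev"], int(match["sector"]))] += 1
    return counts


if __name__ == "__main__":
    # Two lines copied from the log above, used as sample input.
    sample = [
        "[37968478.484217] blk_update_request: I/O error, dev sdg, "
        "sector 348391680 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 0",
        "[37968478.760218] sd 0:2:5:0: [sdg] tag#810 Sense Key : Medium Error [current]",
    ]
    for (dev, sector), n in summarize_io_errors(sample).most_common():
        print(f"{dev} sector {sector}: {n} I/O error(s)")
```

A single (device, sector) pair dominating the counts would be consistent with a localized medium error rather than a controller-wide problem, though that reading is an assumption and not a diagnosis.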
Aug 19 2025, 12:42 AM · SRE Observability (FY2025/2026-Q1), Observability-Logging, SRE-swift-storage
andrea.denisse added a comment to T402247: rsyslog is segfaulting non-stop on ms-be1071.

I think that the drive is failing:

Aug 19 2025, 12:33 AM · SRE Observability (FY2025/2026-Q1), Observability-Logging, SRE-swift-storage

Aug 18 2025

andrea.denisse added a comment to T401908: Define a policy for Grafana Alerting.

I took a look at this. It seems that all of the alerts use the default alerting policy, which delivers notifications to Alertmanager. Six actively alerting notifications use that policy, but I was unable to see them on Karma.

Aug 18 2025, 10:51 PM · SRE Observability (FY2025/2026-Q1), Grafana
andrea.denisse added a comment to T401730: Add a pathway for Alertmanager to send alerts in Slack.

@Mooeypoo , do you know if these alerts are already present in Prometheus?

We don't yet have any alerts set up, but if we need to we can definitely come up with one we'd need and can test with,

Aug 18 2025, 10:37 PM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms
andrea.denisse added a comment to T401730: Add a pathway for Alertmanager to send alerts in Slack.

@colewhite, are there any downsides to using the webhook method instead of email?

I believe this isn't possible because the grafana hosts cannot connect outside the production network. They can send emails through our internal mail servers, though.

But the Alertmanager hosts can connect outside of production so I think that the webhook can be used. The Alertmanager hosts already communicate with SplunkOnCall along with the Prometheus dead man switch.

As for the Grafana alerts, I think that they can be routed to Alertmanager, and then Alertmanager could send the alerts to Slack using the webhook.

Aug 18 2025, 10:36 PM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms
andrea.denisse added a comment to T383309: rsyslog receiver on centrallog hosts misplaces some log host entries.

@andrea.denisse Do you still have any debug logs available? I’m just curious...

Aug 18 2025, 7:52 PM · Patch-For-Review, Observability-Logging

Aug 15 2025

andrea.denisse added a comment to T383309: rsyslog receiver on centrallog hosts misplaces some log host entries.

Hi folks,

Aug 15 2025, 10:29 PM · Patch-For-Review, Observability-Logging
andrea.denisse added a comment to T401730: Add a pathway for Alertmanager to send alerts in Slack.

@colewhite, are there any downsides to using the webhook method instead of email?

I believe this isn't possible because the grafana hosts cannot connect outside the production network. They can send emails through our internal mail servers, though.

Aug 15 2025, 7:09 PM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms
andrea.denisse updated subscribers of T401730: Add a pathway for Alertmanager to send alerts in Slack.

Hi @hnowlan, I noticed the parent task is T401908. Is the goal here to ingest Grafana alerts into Alertmanager before sending them to Slack, or to route the alerts Alertmanager already receives into Slack channels?

Aug 15 2025, 12:14 AM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms

Aug 14 2025

andrea.denisse updated subscribers of T401730: Add a pathway for Alertmanager to send alerts in Slack.

Looking at our Alertmanager configuration, we currently send Slack notifications by creating an email address for a channel and sending alerts to it via email. However, Alertmanager can also send alerts directly to Slack using a webhook.
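
To illustrate what the webhook route amounts to, here is a minimal Python sketch that converts an Alertmanager webhook notification body into the JSON payload a Slack incoming webhook accepts. Alertmanager's native Slack receiver (`slack_configs`) would normally do this itself. The function name `slack_payload` and the message format are hypothetical, the sample alert is made up for illustration, and the actual HTTP POST to the webhook URL is deliberately left out.

```python
import json


def slack_payload(alertmanager_body: dict) -> dict:
    """Build a Slack incoming-webhook payload from an Alertmanager notification.

    Hypothetical helper: reads the `alerts` list of the Alertmanager webhook
    body and emits one text line per alert.
    """
    lines = []
    for alert in alertmanager_body.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown").upper()
        text = f"[{status}] {labels.get('alertname', 'unknown')}"
        summary = annotations.get("summary", "")
        if summary:
            text += f": {summary}"
        lines.append(text)
    # Slack incoming webhooks accept a JSON object with a "text" field.
    return {"text": "\n".join(lines)}


if __name__ == "__main__":
    # Made-up example notification, not taken from any real alert.
    body = {
        "status": "firing",
        "alerts": [
            {
                "status": "firing",
                "labels": {"alertname": "ExampleAlert"},
                "annotations": {"summary": "example summary"},
            }
        ],
    }
    print(json.dumps(slack_payload(body)))
```

Compared with the email-to-channel approach, a webhook removes the dependency on a per-channel email address, at the cost of requiring outbound connectivity from the Alertmanager hosts.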

Aug 14 2025, 11:48 PM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms

Jul 30 2025

andrea.denisse added a comment to T371366: Replace kafkatee o11y usage.

Hi Filippo, thanks for reviewing this! I had a question about this part:

Jul 30 2025, 7:26 PM · Observability-Metrics, Observability-Logging

Jul 7 2025

andrea.denisse triaged T396970: Degraded RAID on aqs1012 as High priority.
Jul 7 2025, 2:31 PM · DC-Ops, SRE, ops-eqiad

Jun 26 2025

andrea.denisse added a project to T359271: (Analytics?) Migrate MediaWiki.TemplateData to statslib: SRE Observability (FY2024/2025-Q4).
Jun 26 2025, 5:57 PM · SRE Observability (FY2024/2025-Q4), Editing-team, VisualEditor, Observability-Metrics
andrea.denisse closed T359271: (Analytics?) Migrate MediaWiki.TemplateData to statslib as Resolved.
Jun 26 2025, 5:57 PM · SRE Observability (FY2024/2025-Q4), Editing-team, VisualEditor, Observability-Metrics
andrea.denisse closed T359271: (Analytics?) Migrate MediaWiki.TemplateData to statslib, a subtask of T350592: EPIC: migrate in use metrics and dashboards to statslib, as Resolved.
Jun 26 2025, 5:57 PM · SRE Observability (FY2025/2026-Q1), MW-1.43-notes (1.43.0-wmf.21; 2024-09-03), Epic, MW-1.42-notes (1.42.0-wmf.15; 2024-01-23), MediaWiki-Platform-Team (Radar), Observability-Metrics