andrea.denisse (denisse)
Animal

User Details

User Since
Apr 26 2022, 12:59 AM (188 w, 5 d)
Availability
Available
IRC Nick
denisse
LDAP User
Unknown
MediaWiki User
ADenisse-WMF [ Global Accounts ]

Recent Activity

Thu, Dec 4

andrea.denisse changed the status of T411436: Grant Access to analytics-privatedata-users for Silvia G from In Progress to Stalled.

In case it is useful: the MediaWiki page @Rmaung pointed out was the first thing that came up for me when I googled "Request Access Superset Wikimedia", so it might be good to update that one too? In any case, I'm super happy to learn about the proper process, so thank you, @Aklapper!

Hi Silvia, could you please update the task with the fields from the https://phabricator.wikimedia.org/maniphest/task/edit/form/8/ form?

Thu, Dec 4, 10:41 PM · SRE, SRE-Access-Requests, LDAP-Access-Requests
andrea.denisse updated the task description for T411436: Grant Access to analytics-privatedata-users for Silvia G.
Thu, Dec 4, 10:36 PM · SRE, SRE-Access-Requests, LDAP-Access-Requests
andrea.denisse edited projects for T411774: Requesting a new group allowing shell access to kafka-jumbo servers - with membership for JavierMonton, added: Infrastructure-Foundations, Data-Platform-SRE; removed SRE, SRE-Access-Requests.

Hi folks, if I understand correctly, the access request can't be fulfilled because the requested type of access doesn't exist.

Thu, Dec 4, 10:35 PM · Data-Platform-SRE, Infrastructure-Foundations, Patch-For-Review
andrea.denisse added a comment to T411730: Add FIDO-backed SSH key for brennen.

Hi folks, the patch for this task is merged. Can we close it as resolved?

Thu, Dec 4, 10:31 PM · SRE, User-brennen, SRE-Access-Requests
andrea.denisse closed T411679: Requesting access to analytics-privatedata-users for astein, a subtask of T405517: Make the shell group analytics-privatedata-users less confusing, as Resolved.
Thu, Dec 4, 10:30 PM · Data-Platform-SRE, SRE
andrea.denisse closed T411679: Requesting access to analytics-privatedata-users for astein as Resolved.

Closing as resolved, feel free to reopen if there's anything else I can assist with.

Thu, Dec 4, 10:30 PM · Fundraising-Backlog, SRE, SRE-Access-Requests
andrea.denisse edited projects for T408704: offline rackspace wikitech-static, online aws wikitech-static, added: Infrastructure-Foundations; removed SRE.
Thu, Dec 4, 10:28 PM · Infrastructure-Foundations
andrea.denisse added a project to T403298: Provide auth-less access to Enterprise APIs from WMF Analytics cluster: Data-Platform-SRE.
Thu, Dec 4, 10:28 PM · Data-Platform-SRE, Data-Engineering, Data-Platform, Wikimedia Enterprise
andrea.denisse edited projects for T403298: Provide auth-less access to Enterprise APIs from WMF Analytics cluster, added: Data-Platform; removed SRE.
Thu, Dec 4, 10:27 PM · Data-Platform-SRE, Data-Engineering, Data-Platform, Wikimedia Enterprise
andrea.denisse closed T411624: Requesting access to analytics-privatedata-users for Riku Silvola as Resolved.

Closing as resolved, please let me know if there's anything else I can assist with.

Thu, Dec 4, 10:11 PM · SRE, SRE-Access-Requests
andrea.denisse updated the task description for T411624: Requesting access to analytics-privatedata-users for Riku Silvola.
Thu, Dec 4, 10:04 PM · SRE, SRE-Access-Requests
andrea.denisse updated the task description for T411624: Requesting access to analytics-privatedata-users for Riku Silvola.
Thu, Dec 4, 9:26 PM · SRE, SRE-Access-Requests
andrea.denisse changed the status of T411624: Requesting access to analytics-privatedata-users for Riku Silvola from Open to In Progress.
Thu, Dec 4, 9:25 PM · SRE, SRE-Access-Requests
andrea.denisse added a comment to T411612: Requesting access to analytics_privatedata_users for Thiemo Kreuz (WMDE).

Closing as resolved, please let me know if there's anything else I can assist with.

Thu, Dec 4, 9:15 PM · SRE, SRE-Access-Requests
andrea.denisse closed T411543: Requesting access to analytics-privatedata-users for medelius as Resolved.

Closing as resolved, please let me know if there's anything else I can assist with.

Thu, Dec 4, 9:14 PM · SRE, SRE-Access-Requests
andrea.denisse updated the task description for T411543: Requesting access to analytics-privatedata-users for medelius.
Thu, Dec 4, 9:11 PM · SRE, SRE-Access-Requests
andrea.denisse updated the task description for T411543: Requesting access to analytics-privatedata-users for medelius.
Thu, Dec 4, 8:50 PM · SRE, SRE-Access-Requests
andrea.denisse closed T411612: Requesting access to analytics_privatedata_users for Thiemo Kreuz (WMDE) as Resolved.
Thu, Dec 4, 8:47 PM · SRE, SRE-Access-Requests
andrea.denisse changed the status of T411612: Requesting access to analytics_privatedata_users for Thiemo Kreuz (WMDE) from Open to In Progress.
Thu, Dec 4, 7:11 PM · SRE, SRE-Access-Requests

Wed, Dec 3

andrea.denisse changed the status of T411679: Requesting access to analytics-privatedata-users for astein from Open to In Progress.
Wed, Dec 3, 9:32 PM · Fundraising-Backlog, SRE, SRE-Access-Requests
andrea.denisse updated the task description for T411543: Requesting access to analytics-privatedata-users for medelius.
Wed, Dec 3, 9:30 PM · SRE, SRE-Access-Requests
andrea.denisse updated the task description for T411543: Requesting access to analytics-privatedata-users for medelius.
Wed, Dec 3, 9:29 PM · SRE, SRE-Access-Requests
andrea.denisse changed the status of T411506: Requesting update of SSH key for zoe from Open to In Progress.

I wrote to Zoe directly to confirm this request.

Wed, Dec 3, 6:49 AM · SRE, SRE-Access-Requests
andrea.denisse edited projects for T410572: Replace deprecated Phabricator Conduit API call by @ProdPasteBot with its stable equivalent, added: collaboration-services; removed SRE.

Was the collaboration-services tag removed? This tag showed up on the Clinic Duty dashboard.

Wed, Dec 3, 6:45 AM · collaboration-services, Phabricator
andrea.denisse changed the status of T411436: Grant Access to analytics-privatedata-users for Silvia G from Open to In Progress.

In case it is useful: the MediaWiki page @Rmaung pointed out was the first thing that came up for me when I googled "Request Access Superset Wikimedia", so it might be good to update that one too? In any case, I'm super happy to learn about the proper process, so thank you, @Aklapper!

Wed, Dec 3, 6:38 AM · SRE, SRE-Access-Requests, LDAP-Access-Requests
andrea.denisse updated the task description for T411543: Requesting access to analytics-privatedata-users for medelius.
Wed, Dec 3, 6:27 AM · SRE, SRE-Access-Requests
andrea.denisse changed the status of T411543: Requesting access to analytics-privatedata-users for medelius from Open to In Progress.

Hi @KFrancis, I was unable to find @medelius on the NDA spreadsheet, could you please help me to confirm their NDA status?

Wed, Dec 3, 6:26 AM · SRE, SRE-Access-Requests
andrea.denisse added a comment to T411543: Requesting access to analytics-privatedata-users for medelius.

Hi @VPuffetMichel , do you approve this request?

Wed, Dec 3, 6:24 AM · SRE, SRE-Access-Requests
andrea.denisse updated the task description for T411543: Requesting access to analytics-privatedata-users for medelius.
Wed, Dec 3, 6:23 AM · SRE, SRE-Access-Requests

Tue, Dec 2

andrea.denisse claimed T411543: Requesting access to analytics-privatedata-users for medelius.
Tue, Dec 2, 8:49 PM · SRE, SRE-Access-Requests
andrea.denisse claimed T411506: Requesting update of SSH key for zoe.
Tue, Dec 2, 8:48 PM · SRE, SRE-Access-Requests
andrea.denisse claimed T411436: Grant Access to analytics-privatedata-users for Silvia G.
Tue, Dec 2, 8:48 PM · SRE, SRE-Access-Requests, LDAP-Access-Requests
andrea.denisse added a comment to T411404: Update SSH key for kamila.

Hi Raine, this is on the clinic duty dashboard.

Tue, Dec 2, 8:46 PM · SRE-Unowned
andrea.denisse added a comment to T411365: Yubikey-SSH-FIDO for Hugh Nowlan (hnowlan).

Hi Hugh, this is on the clinic duty dashboard.

Tue, Dec 2, 8:45 PM · SRE-Access-Requests, SRE

Wed, Nov 12

andrea.denisse added a comment to T367370: Shift frack alerting to use prometheus-alertmanager instead of icinga.

@andrea.denisse Thanks for the feedback. We want to mimic what we currently have for the icinga alerting groups. I think I have done that with this commit (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1204648). It goes a little further to split out fr-tech and fr-tech-ops since we have some hardware/OS level alerts that the whole fr-tech team doesn't need to have.

Our alert rules will live in fundraising and fire from our Prometheus server. Given that, I believe we can reference the team in our rules to make sure that they get properly routed. Here is an example of one of our rules that we have been testing locally:

---
groups:
  - name: queue-alerts
    rules:
      - &redis_queue_size_warn
        alert: RedisQueueSize
        expr:
          redis_queue_total{
              cluster="frqueue",
              queue=~"(contribution_tracking|payments_init|pending|refund)"}
            > 1500
        for: 5m
        labels:
          severity: warning
          team: 'fr-tech'
        annotations:
          description: 'High redis queue size'
          summary: "Redis Queue {{ $labels.queue }} is high: [{{ $value }}]"
          dashboard: 'https://frmon.wikimedia.org/d/R5m3iU1Wk/queue?orgId=1&from=now-24h&to=now&timezone=utc'
      - &redis_queue_size_crit
        alert: RedisQueueSize
        expr:
          redis_queue_total{
              cluster="frqueue",
              queue=~"(contribution_tracking|payments_init|pending|refund)"}
            > 2000
        for: 5m
        labels:
          severity: critical
          team: 'fr-tech'
        annotations:
          description: 'Critical redis queue size'
          summary: "Redis Queue {{ $labels.queue }} is Critical: [{{ $value }}]"
          dashboard: 'https://frmon.wikimedia.org/d/R5m3iU1Wk/queue?orgId=1&from=now-24h&to=now&timezone=utc'

Does this make sense?

As for the PfwCoreBGPDown alert: reading through the config, I think we are getting email alerts because of this rule. Once the new groups are set up, maybe we can route it to our paging level.
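
The label-based routing discussed above can be illustrated with a small Python sketch. It models only a simplified first-match lookup on the `team` and `severity` labels from the example rules; the real Alertmanager routing tree also supports nested routes, regex matchers and `continue`. The receiver names (`fr-tech-page`, `fr-tech-email`, `fr-tech-ops-email`, `default`) are hypothetical and are not taken from the production configuration.

```python
# Hypothetical routes: (label matchers, receiver name). Order matters,
# because the first route whose matchers all hold wins.
ROUTES = [
    ({"team": "fr-tech", "severity": "critical"}, "fr-tech-page"),
    ({"team": "fr-tech"}, "fr-tech-email"),
    ({"team": "fr-tech-ops"}, "fr-tech-ops-email"),
]
DEFAULT_RECEIVER = "default"  # assumed fallback for unlabelled alerts


def pick_receiver(labels):
    """Return the receiver of the first route matching every given label."""
    for matchers, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER


warning_alert = {"alertname": "RedisQueueSize", "team": "fr-tech", "severity": "warning"}
critical_alert = {"alertname": "RedisQueueSize", "team": "fr-tech", "severity": "critical"}
print(pick_receiver(warning_alert))
print(pick_receiver(critical_alert))
```

An alert without a `team` label falls through to the default receiver in this sketch, which is one reason to tag every rule explicitly.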

Wed, Nov 12, 8:50 PM · Patch-For-Review, Observability-Alerting, Fundraising-Backlog, fundraising-tech-ops
andrea.denisse added a comment to T367370: Shift frack alerting to use prometheus-alertmanager instead of icinga.

@fgiunchedi We (fr-tech) are getting close to live testing with some alerts. We have started to build a set of alerts and are firing them off to our local alertmanager instance that will just send us email. With a config change, we could start pointing those at the production alertmanager instance.

We think the next logical step would be to set up the contact groups (team?) within alertmanager so that we can get them routed correctly via email/irc/splunk on-call. We want to make sure that we tag our alerts properly so that we don't cause issues for other folks. Is this something you can assist us with/point us in the right direction for?

Wed, Nov 12, 12:13 AM · Patch-For-Review, Observability-Alerting, Fundraising-Backlog, fundraising-tech-ops

Nov 4 2025

andrea.denisse closed T408145: Improve the Alertmanager app templates as Resolved.
Nov 4 2025, 11:55 PM · SRE Observability (FY2025/2026-Q1)

Oct 23 2025

andrea.denisse created T408145: Improve the Alertmanager app templates.
Oct 23 2025, 6:17 PM · SRE Observability (FY2025/2026-Q1)
andrea.denisse claimed T401908: Define a policy for Grafana Alerting.

Hi folks, I'm working on updating the Grafana alerts Wikitech section. It's still a WIP but I'd greatly appreciate your feedback:

Oct 23 2025, 3:07 AM · SRE Observability (FY2025/2026-Q1), Grafana
andrea.denisse updated the task description for T405151: Create base alerts for REST API to Slack.
Oct 23 2025, 1:39 AM · MW-Interfaces-Team (MWI-Sprint-21 (2025-10-21 to 2025-11-04)), Patch-For-Review, OKR-Work, FY2025-26 WE5.2.3 API Monitoring & Alarms
andrea.denisse closed T405151: Create base alerts for REST API to Slack as Resolved.

Hi everyone,

Oct 23 2025, 1:39 AM · MW-Interfaces-Team (MWI-Sprint-21 (2025-10-21 to 2025-11-04)), Patch-For-Review, OKR-Work, FY2025-26 WE5.2.3 API Monitoring & Alarms

Oct 15 2025

andrea.denisse added a comment to T406689: Email alerts from Grafana stopped working?.

#api-alerts channel

Ah cool, didn't know, I'll try that out, thank you!

Oct 15 2025, 4:10 PM · Observability-Alerting, Grafana
andrea.denisse changed the status of T406689: Email alerts from Grafana stopped working? from Open to In Progress.

Hi @andrea.denisse, I wanted to check if you know what happened here. Is it an upgrade that changed it? Please let me know if there's anything I can do to help!

Oct 15 2025, 3:06 PM · Observability-Alerting, Grafana

Oct 8 2025

andrea.denisse claimed T406689: Email alerts from Grafana stopped working?.
Oct 8 2025, 2:28 PM · Observability-Alerting, Grafana

Oct 1 2025

andrea.denisse updated the task description for T353912: Observability Bookworm upgrades.
Oct 1 2025, 2:39 PM · SRE Observability (FY2025/2026-Q1), observability, Patch-For-Review
andrea.denisse added a comment to T405151: Create base alerts for REST API to Slack.

I've finished the patch and the tests. The alerts now measure across all of the clusters, so all the responses are accounted for when computing the percentages, and the alerts trigger as soon as those conditions are met. https://gerrit.wikimedia.org/r/c/operations/alerts/+/1192183/

Oct 1 2025, 2:49 AM · MW-Interfaces-Team (MWI-Sprint-21 (2025-10-21 to 2025-11-04)), Patch-For-Review, OKR-Work, FY2025-26 WE5.2.3 API Monitoring & Alarms

Sep 29 2025

andrea.denisse updated subscribers of T404888: Parse DMARC reports and create a dashboard from data.

Hi folks,

Sep 29 2025, 6:59 PM · Patch-For-Review, SRE Observability, Epic, Infrastructure-Foundations, Mail

Sep 24 2025

andrea.denisse added a comment to T364622: Review/cleanup content of /srv/git/private/modules/secret/secrets/ssl in the private repo.

Hi @MoritzMuehlenhoff and @Dzahn, we have an alert for the Puppet CA certificate for kibana.discovery.wmnet, which is about to expire. However, that certificate is no longer required, as the services that used it were migrated to cfssl.
I've been unable to find that cert in the locations specified in the task. Do you know where it could be located?

This certificate was created on the CA operated by the Puppet 5 servers (which is where we managed pretty much all internal certs before cfssl was introduced). You can remove it by logging into puppetmaster1001.eqiad.wmnet and running

sudo puppet cert clean kibana.discovery.wmnet

The alert should then auto-resolve a bit later.

Sep 24 2025, 8:08 PM · Puppet-Infrastructure, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE
andrea.denisse added a comment to T364622: Review/cleanup content of /srv/git/private/modules/secret/secrets/ssl in the private repo.

Hi @MoritzMuehlenhoff and @Dzahn, we have an alert for the Puppet CA certificate for kibana.discovery.wmnet, which is about to expire. However, that certificate is no longer required, as the services that used it were migrated to cfssl.

Sep 24 2025, 1:39 AM · Puppet-Infrastructure, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE

Sep 22 2025

andrea.denisse closed T401730: Add a pathway for Alertmanager to send alerts in Slack as Resolved.

Hi folks, I wrote the documentation on using this on Wikitech: https://wikitech.wikimedia.org/wiki/Alertmanager#Sending_alerts_to_Slack

Sep 22 2025, 6:34 PM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms
andrea.denisse updated the task description for T401730: Add a pathway for Alertmanager to send alerts in Slack.
Sep 22 2025, 6:33 PM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms

Sep 12 2025

andrea.denisse added a comment to T401730: Add a pathway for Alertmanager to send alerts in Slack.

Hi folks, I’ve updated the Wikitech documentation for this feature. I’d really appreciate your feedback: https://wikitech.wikimedia.org/wiki/Alertmanager#Sending_alerts_to_Slack

Sep 12 2025, 7:30 PM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms

Sep 4 2025

andrea.denisse updated the task description for T401730: Add a pathway for Alertmanager to send alerts in Slack.
Sep 4 2025, 12:15 AM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms

Aug 28 2025

andrea.denisse updated the task description for T401730: Add a pathway for Alertmanager to send alerts in Slack.
Aug 28 2025, 12:06 AM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms
andrea.denisse added a comment to T401730: Add a pathway for Alertmanager to send alerts in Slack.

Hi folks,

Aug 28 2025, 12:06 AM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms

Aug 22 2025

andrea.denisse changed the status of T401730: Add a pathway for Alertmanager to send alerts in Slack from Open to In Progress.
Aug 22 2025, 4:20 PM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms

Aug 21 2025

andrea.denisse added a comment to T401908: Define a policy for Grafana Alerting.

Hi folks, Grafana 12.1.1 introduced a couple of features regarding alerting,

Aug 21 2025, 10:41 PM · SRE Observability (FY2025/2026-Q1), Grafana

Aug 20 2025

andrea.denisse removed a parent task for T401730: Add a pathway for Alertmanager to send alerts in Slack: T401908: Define a policy for Grafana Alerting.
Aug 20 2025, 2:23 PM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms
andrea.denisse removed a subtask for T401908: Define a policy for Grafana Alerting: T401730: Add a pathway for Alertmanager to send alerts in Slack.
Aug 20 2025, 2:23 PM · SRE Observability (FY2025/2026-Q1), Grafana
andrea.denisse claimed T401730: Add a pathway for Alertmanager to send alerts in Slack.
Aug 20 2025, 2:22 PM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms
andrea.denisse added a comment to T402247: rsyslog is segfaulting non-stop on ms-be1071.

I think that the drive is failing:

sudo dmesg | grep -i 'error\|fail\|ata':

[37968478.484217] blk_update_request: I/O error, dev sdg, sector 348391680 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 0
[37968478.495312] XFS: metadata IO error: 13 callbacks suppressed
[37968478.495415] XFS (sdg1): metadata I/O error in "xfs_imap_to_bp+0x61/0xb0 [xfs]" at daddr 0x14c40100 len 32 error 5
[37968478.646386] sd 0:2:5:0: [sdg] tag#345 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.690658] sd 0:2:5:0: [sdg] tag#804 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.714318] sd 0:2:5:0: [sdg] tag#806 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.728083] sd 0:2:5:0: [sdg] tag#807 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.744146] sd 0:2:5:0: [sdg] tag#808 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.760181] sd 0:2:5:0: [sdg] tag#810 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.760209] sd 0:2:5:0: [sdg] tag#810 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[37968478.760218] sd 0:2:5:0: [sdg] tag#810 Sense Key : Medium Error [current]
[37968478.760237] blk_update_request: I/O error, dev sdg, sector 348391680 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 0
[37968478.771419] XFS (sdg1): metadata I/O error in "xfs_imap_to_bp+0x61/0xb0 [xfs]" at daddr 0x14c40100 len 32 error 5
[37968478.783012] sd 0:2:5:0: [sdg] tag#817 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.796246] sd 0:2:5:0: [sdg] tag#772 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.812163] sd 0:2:5:0: [sdg] tag#779 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.828086] sd 0:2:5:0: [sdg] tag#782 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.848122] sd 0:2:5:0: [sdg] tag#787 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.868260] sd 0:2:5:0: [sdg] tag#791 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.868277] sd 0:2:5:0: [sdg] tag#791 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[37968478.868282] sd 0:2:5:0: [sdg] tag#791 Sense Key : Medium Error [current]
[37968478.868295] blk_update_request: I/O error, dev sdg, sector 348391680 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 0
[37968478.889217] XFS (sdg1): metadata I/O error in "xfs_imap_to_bp+0x61/0xb0 [xfs]" at daddr 0x14c40100 len 32 error 5
[37968478.900866] sd 0:2:5:0: [sdg] tag#330 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.920091] sd 0:2:5:0: [sdg] tag#331 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.940089] sd 0:2:5:0: [sdg] tag#332 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.964070] sd 0:2:5:0: [sdg] tag#334 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968478.984109] sd 0:2:5:0: [sdg] tag#337 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968479.004076] sd 0:2:5:0: [sdg] tag#341 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968479.004097] sd 0:2:5:0: [sdg] tag#341 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[37968479.004104] sd 0:2:5:0: [sdg] tag#341 Sense Key : Medium Error [current]
[37968479.004124] blk_update_request: I/O error, dev sdg, sector 348391680 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 0
[37968479.015325] XFS (sdg1): metadata I/O error in "xfs_imap_to_bp+0x61/0xb0 [xfs]" at daddr 0x14c40100 len 32 error 5
[37968479.146415] sd 0:2:5:0: [sdg] tag#10 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968479.160143] sd 0:2:5:0: [sdg] tag#11 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968479.176058] sd 0:2:5:0: [sdg] tag#12 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968479.192162] sd 0:2:5:0: [sdg] tag#13 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968479.208110] sd 0:2:5:0: [sdg] tag#14 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968479.224069] sd 0:2:5:0: [sdg] tag#15 BRCM Debug mfi stat 0x2d, data len requested/completed 0x4000/0x0
[37968479.224084] sd 0:2:5:0: [sdg] tag#15 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[37968479.224088] sd 0:2:5:0: [sdg] tag#15 Sense Key : Medium Error [current]
[37968479.224102] blk_update_request: I/O error, dev sdg, sector 348391680 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 0
[37968479.235286] XFS (sdg1): metadata I/O error in "xfs_imap_to_bp+0x61/0xb0 [xfs]" at daddr 0x14c40100 len 32 error 5

Looking further, the drive in slot 3 shows several reallocated sectors:

==== Checking Slot 3 ====
Device Model: TOSHIBA MG06ACA800EY
Serial Number: 81U0A02YF1QF
SMART overall-health self-assessment test result: PASSED
  5 Reallocated_Sector_Ct   0x0033 096 096 010 Pre-fail Always  - 416
198 Offline_Uncorrectable   0x0030 100 100 000 Old_age  Offline - 0

I left a smartctl test running that will complete after Tue Aug 19 13:36:35 2025 UTC.
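
The repeated read errors in the dmesg excerpt above can be tallied per device with a short Python sketch. It assumes the kernel message format shown in that excerpt (`I/O error, dev <name>, sector <n>`); the function name `count_io_errors` and the two sample lines are only for illustration, and the pattern may need adjusting for other kernel versions.

```python
import re
from collections import Counter

# Matches block-layer errors such as:
#   blk_update_request: I/O error, dev sdg, sector 348391680 op 0x0:(READ) ...
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def count_io_errors(lines):
    """Return (error count per device, set of failing sectors per device)."""
    counts = Counter()
    sectors = {}
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            device, sector = match.group(1), int(match.group(2))
            counts[device] += 1
            sectors.setdefault(device, set()).add(sector)
    return counts, sectors


SAMPLE = [
    "[37968478.484217] blk_update_request: I/O error, dev sdg, sector 348391680 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 0",
    '[37968478.495415] XFS (sdg1): metadata I/O error in "xfs_imap_to_bp+0x61/0xb0 [xfs]" at daddr 0x14c40100 len 32 error 5',
    "[37968478.760237] blk_update_request: I/O error, dev sdg, sector 348391680 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 0",
]

counts, sectors = count_io_errors(SAMPLE)
print(dict(counts), sectors)
```

Many errors that all hit the same sector, as in the excerpt, point at a localized medium error rather than a controller or cabling problem, which is consistent with the `Sense Key : Medium Error` lines.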

Aug 20 2025, 12:29 AM · SRE Observability (FY2025/2026-Q1), Observability-Logging, SRE-swift-storage
andrea.denisse added a comment to T402346: hw troubleshooting: disk (sdg) errors on ms-be1071.

While investigating T402247 I left a smartctl test running for drive 3 (which is the one I suspect was failing due to the high number of reallocated sectors), here are the results.

Aug 20 2025, 12:25 AM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
andrea.denisse created P81577 (An Untitled Masterwork).
Aug 20 2025, 12:24 AM
andrea.denisse added a comment to T383309: rsyslog receiver on centrallog hosts misplaces some log host entries.

@andrea.denisse Do you still have any debug logs available? I’m just curious...

Hi Tiziano, I don't have any debug logs. I captured and analyzed them on the host so that the logs didn't leave the prod infra, but I'll enable debug logging again to share them with the upstream maintainers, as they would like to see the log headers.

My plan is to enable debug logging and share a sample of sanitized logs with the rsyslog maintainers for their advice. I can leave the file on the host if you'd like to analyze it; any findings you make could be very useful for understanding or solving the issue.

Aug 20 2025, 12:02 AM · Patch-For-Review, Observability-Logging

Aug 19 2025

andrea.denisse updated the task description for T401730: Add a pathway for Alertmanager to send alerts in Slack.
Aug 19 2025, 6:27 PM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms
andrea.denisse added a comment to T402247: rsyslog is segfaulting non-stop on ms-be1071.

rsyslog is back up and running after clearing the queue (/var/spool/rsyslog/*), which apparently was corrupted.

Aug 19 2025, 5:34 PM · SRE Observability (FY2025/2026-Q1), Observability-Logging, SRE-swift-storage
andrea.denisse added a comment to T402247: rsyslog is segfaulting non-stop on ms-be1071.

I think that the drive is failing:

Aug 19 2025, 12:42 AM · SRE Observability (FY2025/2026-Q1), Observability-Logging, SRE-swift-storage
andrea.denisse added a comment to T402247: rsyslog is segfaulting non-stop on ms-be1071.

I think that the drive is failing:

Aug 19 2025, 12:33 AM · SRE Observability (FY2025/2026-Q1), Observability-Logging, SRE-swift-storage

Aug 18 2025

andrea.denisse added a comment to T401908: Define a policy for Grafana Alerting.

I took a look at this. It seems that all of the alerts use the default alerting policy, which delivers the notifications to Alertmanager. There are 6 actively alerting notifications that use that policy, but I was unable to see them on Karma.

Aug 18 2025, 10:51 PM · SRE Observability (FY2025/2026-Q1), Grafana
andrea.denisse added a comment to T401730: Add a pathway for Alertmanager to send alerts in Slack.

@Mooeypoo , do you know if these alerts are already present in Prometheus?

We don't yet have any alerts set up, but if we need to we can definitely come up with one we'd need and can test with,

Aug 18 2025, 10:37 PM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms
andrea.denisse added a comment to T401730: Add a pathway for Alertmanager to send alerts in Slack.

@colewhite, are there any downsides to using the webhook method instead of email?

I believe this isn't possible because the grafana hosts cannot connect outside the production network. They can send emails through our internal mail servers, though.

But the Alertmanager hosts can connect outside of production so I think that the webhook can be used. The Alertmanager hosts already communicate with SplunkOnCall along with the Prometheus dead man switch.

As for the Grafana alerts, I think that they can be routed to Alertmanager, and then Alertmanager could send the alerts to Slack using the webhook.

Aug 18 2025, 10:36 PM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms
andrea.denisse added a comment to T383309: rsyslog receiver on centrallog hosts misplaces some log host entries.

@andrea.denisse Do you still have any debug logs available? I’m just curious...

Aug 18 2025, 7:52 PM · Patch-For-Review, Observability-Logging

Aug 15 2025

andrea.denisse added a comment to T383309: rsyslog receiver on centrallog hosts misplaces some log host entries.

Hi folks,

Aug 15 2025, 10:29 PM · Patch-For-Review, Observability-Logging
andrea.denisse added a comment to T401730: Add a pathway for Alertmanager to send alerts in Slack.

@colewhite, are there any downsides to using the webhook method instead of email?

I believe this isn't possible because the grafana hosts cannot connect outside the production network. They can send emails through our internal mail servers, though.

Aug 15 2025, 7:09 PM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms
andrea.denisse updated subscribers of T401730: Add a pathway for Alertmanager to send alerts in Slack.

Hi @hnowlan, I noticed the parent task is T401908. Is the goal here to ingest Grafana alerts into Alertmanager before sending them to Slack, or to route the alerts Alertmanager already receives into Slack channels?

Aug 15 2025, 12:14 AM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms

Aug 14 2025

andrea.denisse updated subscribers of T401730: Add a pathway for Alertmanager to send alerts in Slack.

Looking at our Alertmanager configuration, we currently send Slack notifications by creating an email address for a channel and sending alerts to it via email. However, Alertmanager can also send alerts directly to Slack using a webhook.
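
As a rough illustration of the webhook approach, here is a minimal Python sketch, not our actual configuration. It assumes an Alertmanager webhook-style payload with `status`, `alerts`, `labels`, and `annotations` fields, and a Slack incoming webhook that accepts a JSON body with a `text` field. The function names and the webhook URL are hypothetical. In practice Alertmanager's built-in Slack receiver would likely cover this without custom code.

```python
import json
import urllib.request


def build_slack_payload(alertmanager_payload: dict) -> dict:
    """Turn an Alertmanager webhook-style payload into a Slack message body.

    Assumption: the payload carries a top-level "status" and a list of
    "alerts", each with "labels" and "annotations" dictionaries.
    """
    status = alertmanager_payload.get("status", "unknown").upper()
    lines = []
    for alert in alertmanager_payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        name = labels.get("alertname", "unnamed alert")
        summary = annotations.get("summary", "no summary")
        lines.append(f"[{status}] {name}: {summary}")
    # Fall back to a placeholder line when the payload has no alerts.
    text = "\n".join(lines) or f"[{status}] (no alerts in payload)"
    return {"text": text}


def post_to_slack(webhook_url: str, payload: dict) -> int:
    """POST the message to a Slack incoming webhook and return the HTTP status.

    webhook_url is a placeholder; the real URL would come from Slack.
    """
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Building the message is kept separate from sending it so the formatting can be checked without network access; only `post_to_slack` needs outbound connectivity from the Alertmanager hosts.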

Aug 14 2025, 11:48 PM · SRE Observability (FY2025/2026-Q1), FY2025-26 WE5.2.3 API Monitoring & Alarms

Jul 30 2025

andrea.denisse added a comment to T371366: Replace kafkatee o11y usage.

Hi Filippo, thanks for reviewing this! I had a question about this part:

Jul 30 2025, 7:26 PM · Observability-Metrics, Observability-Logging

Jul 7 2025

andrea.denisse triaged T396970: Degraded RAID on aqs1012 as High priority.
Jul 7 2025, 2:31 PM · DC-Ops, SRE, ops-eqiad

Jun 26 2025

andrea.denisse added a project to T359271: (Analytics?) Migrate MediaWiki.TemplateData to statslib: SRE Observability (FY2024/2025-Q4).
Jun 26 2025, 5:57 PM · SRE Observability (FY2024/2025-Q4), Editing-team, VisualEditor, Observability-Metrics
andrea.denisse closed T359271: (Analytics?) Migrate MediaWiki.TemplateData to statslib as Resolved.
Jun 26 2025, 5:57 PM · SRE Observability (FY2024/2025-Q4), Editing-team, VisualEditor, Observability-Metrics
andrea.denisse closed T359271: (Analytics?) Migrate MediaWiki.TemplateData to statslib, a subtask of T350592: EPIC: migrate in use metrics and dashboards to statslib, as Resolved.
Jun 26 2025, 5:57 PM · SRE Observability (FY2025/2026-Q1), MW-1.43-notes (1.43.0-wmf.21; 2024-09-03), Epic, MW-1.42-notes (1.42.0-wmf.15; 2024-01-23), MediaWiki-Platform-Team (Radar), Observability-Metrics

Jun 20 2025

andrea.denisse updated the task description for T383309: rsyslog receiver on centrallog hosts misplaces some log host entries.
Jun 20 2025, 8:52 PM · Patch-For-Review, Observability-Logging

Jun 19 2025

andrea.denisse closed T359471: Migrate MediaWiki.extension.PageTriage to statslib, a subtask of T350592: EPIC: migrate in use metrics and dashboards to statslib, as Resolved.
Jun 19 2025, 3:28 PM · SRE Observability (FY2025/2026-Q1), MW-1.43-notes (1.43.0-wmf.21; 2024-09-03), Epic, MW-1.42-notes (1.42.0-wmf.15; 2024-01-23), MediaWiki-Platform-Team (Radar), Observability-Metrics
andrea.denisse closed T359471: Migrate MediaWiki.extension.PageTriage to statslib as Resolved.
Jun 19 2025, 3:28 PM · MW-1.44-notes (1.44.0-wmf.27; 2025-04-29), PageTriage, Moderator-Tools-Team, Observability-Metrics
andrea.denisse created P78391 (An Untitled Masterwork).
Jun 19 2025, 2:38 AM

Jun 16 2025

andrea.denisse closed T387256: [GRAFMIGR] Migrate MediaWiki.wikibase.articleplaceholder.button.translateArticle.count to statslib, a subtask of T371616: [EPIC][GRAFMIGR] Spruce up Wikidata Grafana Metrics, as Resolved.
Jun 16 2025, 2:37 PM · Wikidata Analytics (Kanban), User-ItamarWMDE, Wikidata, Epic, wmde-wikidata-tech
andrea.denisse closed T387256: [GRAFMIGR] Migrate MediaWiki.wikibase.articleplaceholder.button.translateArticle.count to statslib as Resolved.
Jun 16 2025, 2:37 PM · Wikidata Analytics (Radar/Epics/Stalled), MW-1.44-notes (1.44.0-wmf.24; 2025-04-08), wmde-wikidata-tech, Wikidata, Observability-Metrics

May 27 2025

andrea.denisse changed the status of T393894: New version of Grafana makes it not possible to remove option in long list of values from Open to Stalled.

I still see the issue:

Screen Recording 2025-05-27 191830.gif (770×1 px, 1 MB)

May 27 2025, 6:49 PM · SRE Observability (FY2025/2026-Q1), Grafana
andrea.denisse added a comment to T393883: Grafana 11.6.1 changed how it shows images in annotations.

Hi @Peter, I recently upgraded our Grafana instances to v12.0.1, so the issue might already be resolved in this version.

May 27 2025, 6:14 PM · Test-Platform (dek kvin (Current Sprint)), Synthetic-Performance-Testing
andrea.denisse added a comment to T394069: Rendering Graph's as images times out on Grafana 11.

This is still ongoing on Grafana v12.0.1. I'll explore clustered rendering further.

May 27 2025, 6:09 PM · SRE Observability (FY2025/2026-Q1)
andrea.denisse changed the status of T393894: New version of Grafana makes it not possible to remove option in long list of values from Stalled to Open.
May 27 2025, 6:05 PM · SRE Observability (FY2025/2026-Q1), Grafana
andrea.denisse added a comment to T393894: New version of Grafana makes it not possible to remove option in long list of values.

Hi @Dreamy_Jazz , I recently upgraded our Grafana instances to v12.0.1 and I think that this issue is fixed now.
Could you please take a look at it and let me know if the issue is still present for you?

May 27 2025, 6:05 PM · SRE Observability (FY2025/2026-Q1), Grafana
andrea.denisse added a comment to T395130: Migrate prometheus7001 to prometheus7002.

Thanks for the updates, @tappof, the revised task description looks much clearer now!

May 27 2025, 6:02 PM · SRE Observability (FY2024/2025-Q4), Observability-Metrics

May 26 2025

andrea.denisse closed T395098: Upgrade to Grafana 12.0.1 as Resolved.
May 26 2025, 7:38 PM · SRE Observability (FY2024/2025-Q4)
andrea.denisse closed T394045: When selecting a DC some Grafana panels show instances for other DC, a subtask of T395098: Upgrade to Grafana 12.0.1, as Resolved.
May 26 2025, 7:38 PM · SRE Observability (FY2024/2025-Q4)
andrea.denisse closed T394045: When selecting a DC some Grafana panels show instances for other DC as Resolved.

Hi @Vgutierrez, I've upgraded both Grafana instances to v12.0.1 and I can no longer reproduce the issue. I'll close this task as resolved; feel free to reopen it if it's still happening for you.

May 26 2025, 7:38 PM · SRE Observability (FY2024/2025-Q4)
andrea.denisse updated the task description for T395098: Upgrade to Grafana 12.0.1.
May 26 2025, 7:30 PM · SRE Observability (FY2024/2025-Q4)
andrea.denisse updated the task description for T395098: Upgrade to Grafana 12.0.1.
May 26 2025, 7:17 PM · SRE Observability (FY2024/2025-Q4)
andrea.denisse added a comment to T394045: When selecting a DC some Grafana panels show instances for other DC.

That file linked above is a screenshot uploaded by you. Nobody else could see your screenshot in the task description; they saw {F59935904} as plain text instead, because the file was private (this is likely due to a drag-and-drop bug in Phab). As a Phab admin, I get a button to "fix" these with one click. When it offered me this, I clicked "Yes", and that generated the above attachment log message.

I have no input on this task.

May 26 2025, 7:09 PM · SRE Observability (FY2024/2025-Q4)
andrea.denisse updated subscribers of T394045: When selecting a DC some Grafana panels show instances for other DC.

Hi @Krinkle, thanks for taking a look. Did you take that screenshot from the grafana-next instance? I can't reproduce the error anymore in that instance. The main grafana instance doesn't include the patch for this issue.

May 26 2025, 7:01 PM · SRE Observability (FY2024/2025-Q4)