hnowlan (Hugh Nowlan)
User

User Details

User Since
Jan 6 2020, 12:19 PM (308 w, 6 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
HNowlan (WMF)

Recent Activity

Thu, Dec 4

hnowlan moved T411343: thanos-store OOMing on titan eqiad from Inbox to Radar on the observability board.
Thu, Dec 4, 3:24 PM · observability, SRE
hnowlan closed T411365: Yubikey-SSH-FIDO for Hugh Nowlan (hnowlan) as Resolved.
Thu, Dec 4, 12:02 PM · SRE-Access-Requests, SRE
hnowlan added a comment to T411365: Yubikey-SSH-FIDO for Hugh Nowlan (hnowlan).

This is resolved, thank you!

Thu, Dec 4, 12:02 PM · SRE-Access-Requests, SRE
hnowlan added a comment to T411343: thanos-store OOMing on titan eqiad.

I think the worst of this trend has been reversed by the revert of setting cutoff days to 1: https://grafana.wikimedia.org/goto/rwrkdsWDg?orgId=1

Thu, Dec 4, 11:27 AM · observability, SRE

Wed, Dec 3

hnowlan added a project to T411204: Draft Guided Dashboards Design Proposal: SRE Observability.
Wed, Dec 3, 12:22 PM · SRE Observability, serviceops

Tue, Dec 2

hnowlan created T411527: Remove sockpuppet database.
Tue, Dec 2, 5:15 PM · database-backups, Data-Persistence-Backup, DBA, Data-Persistence
hnowlan added a comment to T411007: Investigate moving xhgui to Kubernetes.

An added benefit to this work would be defining xhgui as an actual service - currently both Arc Lamp and xhgui do not migrate with the services switchover, and we have an explicit ask from the data persistence team to accommodate this across SRE.

Tue, Dec 2, 3:32 PM · SRE Observability

Mon, Dec 1

hnowlan created T411365: Yubikey-SSH-FIDO for Hugh Nowlan (hnowlan).
Mon, Dec 1, 1:11 PM · SRE-Access-Requests, SRE

Wed, Nov 26

hnowlan moved T410933: Add Druid as a Private Grafana Datasource from Inbox to FY2025/2026-Q3 on the SRE Observability board.
Wed, Nov 26, 3:28 PM · Observability-Metrics, SRE Observability (FY2025/2026-Q3), SRE
hnowlan moved T411007: Investigate moving xhgui to Kubernetes from Inbox to Backlog on the SRE Observability board.
Wed, Nov 26, 3:28 PM · SRE Observability
hnowlan moved T410745: Strengthen regex for suffix matching in Prometheus::Blackbox::Check::(Http|Icmp|Tcp) generated rules from Inbox to FY2025/2026-Q2 on the SRE Observability board.
Wed, Nov 26, 3:07 PM · SRE Observability (FY2025/2026-Q2)

Tue, Nov 25

hnowlan triaged T411007: Investigate moving xhgui to Kubernetes as Low priority.
Tue, Nov 25, 11:16 AM · SRE Observability
hnowlan created T411007: Investigate moving xhgui to Kubernetes.
Tue, Nov 25, 11:16 AM · SRE Observability

Wed, Nov 19

hnowlan added a comment to T410296: Significant increase in wikifeeds latency and mobileapps error rate since 2025/11/13.

^ after deploying the above patch, the 504 errors have all but disappeared. Not sure if this has any bearing on the unavailable replicas issues, but it's something!

Wed, Nov 19, 5:58 PM · Wikipedia-Android-App-Backlog, Content-Transform-Team, serviceops, Wikifeeds
hnowlan renamed T410296: Significant increase in wikifeeds latency and mobileapps error rate since 2025/11/13 from Significant increase in wikifeeds latency since 2025/11/13 to Significant increase in wikifeeds latency and mobileapps error rate since 2025/11/13.
Wed, Nov 19, 5:07 PM · Wikipedia-Android-App-Backlog, Content-Transform-Team, serviceops, Wikifeeds
hnowlan moved T410152: Disk space saturation (/srv) on Titan hosts from Inbox to FY2025/2026-Q2 on the SRE Observability board.
Wed, Nov 19, 3:28 PM · SRE Observability (FY2025/2026-Q2)
hnowlan added a comment to T410198: Determine the source of internal requests going through the API gateway..

Seems like these cases should be changed to query the page-analytics service directly on https://page-analytics.discovery.wmnet:30443/metrics/pageviews/[...]

Wed, Nov 19, 12:54 PM · MediaWiki-Platform-Team (Kanban Board), Content-Transform-Team (Work In Progress), Essential-Work, PageViewInfo, Growth-Team, serviceops, OKR-Work

Tue, Nov 18

hnowlan added a comment to T410296: Significant increase in wikifeeds latency and mobileapps error rate since 2025/11/13.

Just as a datapoint - I roll-restarted mobileapps and it had an immediate impact on wikifeeds: https://grafana.wikimedia.org/goto/lmB4-hmvg?orgId=1

Tue, Nov 18, 3:59 PM · Wikipedia-Android-App-Backlog, Content-Transform-Team, serviceops, Wikifeeds
hnowlan updated subscribers of T410296: Significant increase in wikifeeds latency and mobileapps error rate since 2025/11/13.

This panel shows an increase in traffic to the wikifeeds_featured and wikifeeds_onthisday endpoints. And this panel shows an increase in upstream request timeouts for those two endpoints.

So my first guess is that traffic to these two endpoints has increased, and that they may have existing inefficiencies that need investigation and fixing, rather than something new happening to the services themselves.

Tue, Nov 18, 11:51 AM · Wikipedia-Android-App-Backlog, Content-Transform-Team, serviceops, Wikifeeds

Mon, Nov 17

hnowlan added a comment to T410296: Significant increase in wikifeeds latency and mobileapps error rate since 2025/11/13.

Wikifeeds logs quite heavily in general, but it's hard to determine signal. Looks like there has been a solid increase in internal 504s, but there isn't really any further context in the error messages.

Mon, Nov 17, 5:46 PM · Wikipedia-Android-App-Backlog, Content-Transform-Team, serviceops, Wikifeeds
hnowlan updated the task description for T410296: Significant increase in wikifeeds latency and mobileapps error rate since 2025/11/13.
Mon, Nov 17, 5:36 PM · Wikipedia-Android-App-Backlog, Content-Transform-Team, serviceops, Wikifeeds
hnowlan created T410296: Significant increase in wikifeeds latency and mobileapps error rate since 2025/11/13.
Mon, Nov 17, 5:34 PM · Wikipedia-Android-App-Backlog, Content-Transform-Team, serviceops, Wikifeeds
hnowlan added a comment to T410198: Determine the source of internal requests going through the API gateway..

Could you supply some of these IP addresses for investigation? My gut feeling is that these are going to be health checks. Is this the api-gateway or the rest-gateway?

I haven't checked the API gateway. Here are the top IPs for the REST gateway (naive sort to find dupes, not by rec/sec):

10.192.12.30
10.192.14.10
10.192.29.14
10.192.36.6
10.192.4.20
10.192.40.11
10.192.41.6
10.192.5.30
10.192.8.26
172.16.1.41
172.16.19.172
172.16.4.75

From 172.16.19.172 alone we see 40 req/sec. That seems a lot for health checks...

Also I'd assume that most health checks shouldn't go through the gateway, right?

Mon, Nov 17, 12:36 PM · MediaWiki-Platform-Team (Kanban Board), Content-Transform-Team (Work In Progress), Essential-Work, PageViewInfo, Growth-Team, serviceops, OKR-Work
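
The "naive sort to find dupes" step described in the comment above can be sketched roughly as follows. This is an illustrative Python sketch only, not the query that was actually run: the names `top_clients` and `group_by_network` are hypothetical, the /16 grouping is an arbitrary choice, and the sample list reuses a few addresses from the comment with made-up repetition.

```python
from collections import Counter
import ipaddress


def top_clients(ips, n=5):
    """Return the n most frequent client IPs with their counts."""
    return Counter(ips).most_common(n)


def group_by_network(ips, prefix=16):
    """Count requests per enclosing network (prefix length is an assumption)."""
    counts = Counter()
    for ip in ips:
        # strict=False lets us derive the network from a host address.
        net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        counts[str(net)] += 1
    return dict(counts)


# Illustrative input: addresses taken from the comment, repetition invented.
sample = ["10.192.12.30", "172.16.19.172", "172.16.19.172", "172.16.1.41"]
print(top_clients(sample, 1))
print(group_by_network(sample))
```

Grouping by network before looking at individual addresses can make it easier to see whether the traffic comes from one address range or several, which is one way to narrow down whether health checks are plausible.
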
hnowlan added a comment to T410198: Determine the source of internal requests going through the API gateway..

Could you supply some of these IP addresses for investigation? My gut feeling is that these are going to be health checks. Is this the api-gateway or the rest-gateway?

Mon, Nov 17, 11:38 AM · MediaWiki-Platform-Team (Kanban Board), Content-Transform-Team (Work In Progress), Essential-Work, PageViewInfo, Growth-Team, serviceops, OKR-Work
hnowlan added a comment to T410007: upstream request timeout, http-status 504 in the API.

This issue was happening as a result of the migration of the action API to a common gateway within WMF infrastructure (work ticket: T408223, higher level reasoning/tracking: T406607). We're currently doing a slow rollout by wiki group, with the exception of enwiki: all other wikis are now behind the gateway, along with 10% of requests for enwiki. The gateway itself imposes a default timeout of 15 seconds, which was causing the issue seen here. We've since raised the timeout and the queries in this ticket are now succeeding. Apologies for the disruption.

Mon, Nov 17, 11:31 AM · Discovery-Search (2025.10.20 - 2025.12.31), CirrusSearch, MW-Interfaces-Team, MediaWiki-Action-API
hnowlan added a comment to T408223: Action API via rest-gateway production rollout.

We received notifications from users that the search API, which is configured to allow 50s timeouts to support costly search requests, is now failing at 15s with an upstream request timeout (T410007). The user reported that the behavior started to change around Nov 11th, which is apparently when we started to roll out this new route on group2 wikis. I'm not 100% sure that this change is the cause of this new behavior, but IIUC on all wikis except enwiki we now route api.php requests to the rest-gateway. If I'm not mistaken, the rest-gateway has a default timeout of 15s, which might explain this new behavior? Are there ways to vary this timeout based on the target action API?

Mon, Nov 17, 11:27 AM · OKR-Work, [MWI] FY2025-26 Q2, MW-Interfaces-Team (MWI-Roadmap)
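
To make the question in the comment above concrete, here is a minimal Python sketch of what "varying the timeout based on the target action API" could mean. It is purely illustrative and says nothing about how the rest-gateway is actually configured: `timeout_for` is a hypothetical helper, the two constants are just the 15s and 50s figures quoted in the comment, and matching on `list=search` is an assumption about how a search request might be recognised.

```python
from urllib.parse import urlparse, parse_qs

DEFAULT_TIMEOUT_S = 15  # default gateway timeout quoted in the comment
SEARCH_TIMEOUT_S = 50   # search API allowance quoted in the comment


def timeout_for(url):
    """Pick a timeout for an api.php request (hypothetical routing rule)."""
    query = parse_qs(urlparse(url).query)
    # The action API accepts pipe-separated values, e.g. list=search|allpages.
    lists = set()
    for value in query.get("list", []):
        lists.update(value.split("|"))
    if "query" in query.get("action", []) and "search" in lists:
        return SEARCH_TIMEOUT_S
    return DEFAULT_TIMEOUT_S


print(timeout_for("https://example.org/w/api.php?action=query&list=search&srsearch=foo"))
print(timeout_for("https://example.org/w/api.php?action=parse&page=Foo"))
```
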

Fri, Nov 14

hnowlan added a comment to T409076: Public cloud account request for moving meta monitoring off of wikitech-static.

I think we can close this ticket as we won't need a public cloud account for this work, can you confirm @tappof?

Fri, Nov 14, 4:08 PM · Infrastructure-Foundations

Wed, Nov 12

hnowlan added a member for WMF-NDA: MLechvien-WMF.
Wed, Nov 12, 5:51 PM
hnowlan added a member for acl*sre-team: MLechvien-WMF.
Wed, Nov 12, 5:50 PM
hnowlan changed the status of T409738: New VictorOps user request for Blake from Open to In Progress.

Invited! Please let us know if you have any issues. Once you've created your account, please make sure you can log into https://app.oncall-optimizer.com/ and sync your calendar. Thanks!

Wed, Nov 12, 3:35 PM · observability
hnowlan moved T400074: ProbeDown - wdqs1015 from Inbox to Backlog on the SRE Observability board.
Wed, Nov 12, 3:08 PM · SRE Observability, collaboration-services
hnowlan moved T409115: Identify host "role" in alert summary / title for mariadb-related alerts from Inbox to Backlog on the SRE Observability board.
Wed, Nov 12, 3:07 PM · DBA, SRE Observability
hnowlan moved T372943: In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes from Inbox to Radar on the SRE Observability board.
Wed, Nov 12, 2:59 PM · SRE Observability, Sustainability (Incident Followup), MediaWiki-Platform-Team (Radar), serviceops, DBA

Nov 6 2025

hnowlan changed the status of T409409: Requesting access to analytics_privatedata_users and SQL Lab for Arian Bozorg (WMDE) from Open to In Progress.
Nov 6 2025, 1:19 PM · SRE, SRE-Access-Requests
hnowlan updated subscribers of T409409: Requesting access to analytics_privatedata_users and SQL Lab for Arian Bozorg (WMDE).

Hi Arian, thanks for the ticket - could you let us know what username you would like for your account? Usually we'd go with something akin to abozorg-wmde

Nov 6 2025, 1:16 PM · SRE, SRE-Access-Requests
hnowlan updated the task description for T409409: Requesting access to analytics_privatedata_users and SQL Lab for Arian Bozorg (WMDE).
Nov 6 2025, 1:11 PM · SRE, SRE-Access-Requests
hnowlan closed T409166: Grant Access to ops-limited for blake as Resolved.
Nov 6 2025, 12:22 PM · SRE-Access-Requests, SRE
hnowlan closed T408924: Requesting access to deployment for ItamarWMDE as Resolved.
Nov 6 2025, 10:19 AM · User-ItamarWMDE, SRE, SRE-Access-Requests
hnowlan added a comment to T408924: Requesting access to deployment for ItamarWMDE.

Merged! I see deployment in the user groups for itamar now.

Nov 6 2025, 10:19 AM · User-ItamarWMDE, SRE, SRE-Access-Requests

Nov 5 2025

hnowlan changed the status of T408924: Requesting access to deployment for ItamarWMDE from In Progress to Stalled.
Nov 5 2025, 4:50 PM · User-ItamarWMDE, SRE, SRE-Access-Requests
hnowlan assigned T400074: ProbeDown - wdqs1015 to tappof.
Nov 5 2025, 2:08 PM · SRE Observability, collaboration-services

Nov 4 2025

hnowlan added a comment to T408920: Grant Access to WMDE LDAP groups for vicaplet-wmde.

@hnowlan Just checking, it should be wmde not wmf in this case.

Nov 4 2025, 5:57 PM · SRE, LDAP-Access-Requests
hnowlan added a comment to T408920: Grant Access to WMDE LDAP groups for vicaplet-wmde.

Thanks for the clarification.

Nov 4 2025, 5:34 PM · SRE, LDAP-Access-Requests
hnowlan changed the status of T408920: Grant Access to WMDE LDAP groups for vicaplet-wmde from Open to In Progress.
Nov 4 2025, 5:20 PM · SRE, LDAP-Access-Requests
hnowlan added a comment to T408920: Grant Access to WMDE LDAP groups for vicaplet-wmde.

Hi Virginie, your account appears to already be a member of analytics-privatedata-users which should grant you Superset access. This access was added in T407605.

Nov 4 2025, 5:20 PM · SRE, LDAP-Access-Requests
hnowlan updated subscribers of T409166: Grant Access to ops-limited for blake.

L3 signed, NDA applies. Key verified OOB.

Nov 4 2025, 1:24 PM · SRE-Access-Requests, SRE

Nov 3 2025

hnowlan added a project to T372943: In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes: SRE Observability.
Nov 3 2025, 5:58 PM · SRE Observability, Sustainability (Incident Followup), MediaWiki-Platform-Team (Radar), serviceops, DBA
hnowlan edited projects for T408060: Distinguish request classes based on user-agent declaration, added: Traffic; removed SRE.
Nov 3 2025, 12:18 PM · Traffic, Hiddenparma
hnowlan added a comment to T408924: Requesting access to deployment for ItamarWMDE.

Key verified out of band.

Nov 3 2025, 12:16 PM · User-ItamarWMDE, SRE, SRE-Access-Requests
hnowlan updated the task description for T408924: Requesting access to deployment for ItamarWMDE.
Nov 3 2025, 12:16 PM · User-ItamarWMDE, SRE, SRE-Access-Requests
hnowlan changed the status of T408702: Promote dpogorzelski from ops-limited to ops from Open to Stalled.

Blocked on approval from @mark.

Nov 3 2025, 12:16 PM · SRE, SRE-Access-Requests, Machine-Learning-Team
hnowlan updated subscribers of T408924: Requesting access to deployment for ItamarWMDE.

Awaiting out of band verification of SSH key on Slack. Tagging @thcipriani as approver for deployment group.

Nov 3 2025, 12:12 PM · User-ItamarWMDE, SRE, SRE-Access-Requests
hnowlan changed the status of T408924: Requesting access to deployment for ItamarWMDE from Open to In Progress.
Nov 3 2025, 12:12 PM · User-ItamarWMDE, SRE, SRE-Access-Requests

Oct 22 2025

hnowlan moved T407484: Alert in need of triage: PuppetConstantChange (instance prometheus2007:9100) from Inbox to FY2025/2026-Q2 on the SRE Observability board.
Oct 22 2025, 2:18 PM · SRE Observability (FY2025/2026-Q2), sre-alert-triage

Oct 21 2025

hnowlan edited projects for T407826: X-Request-Id response header off by 5000, added: serviceops; removed observability.
Oct 21 2025, 10:07 AM · serviceops, Traffic

Oct 13 2025

hnowlan closed T406318: rest.php via rest-gateway production rollout as Resolved.

Migration complete. Impact on rest-gateway was minimal, no scaling up required.

Oct 13 2025, 2:33 PM · MW-Interfaces-Team (MWI-Roadmap), serviceops, Epic, OKR-Work
hnowlan closed T406318: rest.php via rest-gateway production rollout, a subtask of T400130: Central REST gateway for APIs, as Resolved.
Oct 13 2025, 2:33 PM · MW-Interfaces-Team (MWI-Roadmap), serviceops, Epic, OKR-Work
hnowlan updated the task description for T406318: rest.php via rest-gateway production rollout.
Oct 13 2025, 2:33 PM · MW-Interfaces-Team (MWI-Roadmap), serviceops, Epic, OKR-Work

Oct 9 2025

hnowlan added a comment to T406318: rest.php via rest-gateway production rollout.

We're rolling out 10% of enwiki at the moment, and we will leave things there until next week.

Oct 9 2025, 4:05 PM · MW-Interfaces-Team (MWI-Roadmap), serviceops, Epic, OKR-Work
hnowlan updated the task description for T406318: rest.php via rest-gateway production rollout.
Oct 9 2025, 4:05 PM · MW-Interfaces-Team (MWI-Roadmap), serviceops, Epic, OKR-Work
hnowlan updated the task description for T406318: rest.php via rest-gateway production rollout.
Oct 9 2025, 10:56 AM · MW-Interfaces-Team (MWI-Roadmap), serviceops, Epic, OKR-Work

Oct 7 2025

hnowlan closed T401396: Revisit backend routing for rest-gateway as Resolved.

We're now using mw-api-ext as appropriate for rest.php and mw-api-related APIs.

Oct 7 2025, 11:55 AM · MW-Interfaces-Team (MWI-Roadmap), serviceops, Epic, OKR-Work
hnowlan closed T401396: Revisit backend routing for rest-gateway, a subtask of T400130: Central REST gateway for APIs, as Resolved.
Oct 7 2025, 11:55 AM · MW-Interfaces-Team (MWI-Roadmap), serviceops, Epic, OKR-Work

Oct 3 2025

hnowlan created T406324: Add support to rest-gateway for action API.
Oct 3 2025, 12:03 PM · Patch-For-Review, MW-Interfaces-Team (MWI-Roadmap), serviceops, Epic, OKR-Work
hnowlan created T406318: rest.php via rest-gateway production rollout.
Oct 3 2025, 11:23 AM · MW-Interfaces-Team (MWI-Roadmap), serviceops, Epic, OKR-Work
hnowlan closed T400131: Improved API rerouting strategy for REST gateway, a subtask of T400130: Central REST gateway for APIs, as Resolved.
Oct 3 2025, 11:19 AM · MW-Interfaces-Team (MWI-Roadmap), serviceops, Epic, OKR-Work
hnowlan closed T400131: Improved API rerouting strategy for REST gateway as Resolved.

We've implemented rest-gateway-ro in multi-dc.lua, and traffic flows are moving as expected.

Oct 3 2025, 11:19 AM · Patch-For-Review, MW-Interfaces-Team, serviceops, OKR-Work
hnowlan closed T400346: Ensure REST Gateway headers don't conflict with Mediawiki headers, a subtask of T400130: Central REST gateway for APIs, as Resolved.
Oct 3 2025, 11:19 AM · MW-Interfaces-Team (MWI-Roadmap), serviceops, Epic, OKR-Work
hnowlan closed T400346: Ensure REST Gateway headers don't conflict with Mediawiki headers as Resolved.

I think everything here has been taken care of.

Oct 3 2025, 11:19 AM · MW-Interfaces-Team, serviceops, OKR-Work
hnowlan changed the status of T401396: Revisit backend routing for rest-gateway from Open to In Progress.
Oct 3 2025, 10:54 AM · MW-Interfaces-Team (MWI-Roadmap), serviceops, Epic, OKR-Work
hnowlan changed the status of T401396: Revisit backend routing for rest-gateway, a subtask of T400130: Central REST gateway for APIs, from Open to In Progress.
Oct 3 2025, 10:54 AM · MW-Interfaces-Team (MWI-Roadmap), serviceops, Epic, OKR-Work

Oct 1 2025

hnowlan moved T403301: Discuss OpenSearch 3 roadmap/future improvements from Inbox to Radar on the observability board.
Oct 1 2025, 2:48 PM · Essential-Work, Data-Platform-SRE (2025.09.26 - 2025.10.17), observability
hnowlan moved T405946: eqiad row C/D Observability host migrations from Inbox to Radar on the observability board.
Oct 1 2025, 2:09 PM · observability, SRE, DC-Ops, ops-eqiad
hnowlan moved T402889: Puppet CA certificate Puppet CA: mailman-puppetmaster.mailman.eqiad.wmflabs expired from Inbox to Radar on the SRE Observability board.
Oct 1 2025, 2:04 PM · SRE Observability, collaboration-services

Sep 30 2025

hnowlan added a comment to T405368: Execute test plan for rest gateway rerouting for rest.php requests and report findings.

I'm also seeing extra headers for all the transform POST endpoints that give HTML, e.g.:

WARNING: POST https://test2.wikipedia.org/w/rest.php/v1/transform/wikitext/to/html/Aleph from anon
REASON: mwapi vs gateway header differences
!+ x-frame-options: SAMEORIGIN
!+ x-xss-protection: 1; mode=block
~- server: mw-api-ext.codfw.main-6647cc6f69-59k4s
~+ server: mw-api-int.codfw.main-7c66456f9f-kg5cm
!+ referrer-policy: origin-when-cross-origin
!+ content-security-policy: default-src 'none'; frame-ancestors 'none'
!+ access-control-allow-methods: GET,HEAD
~- date: Wed, 24 Sep 2025 17:19:14 GMT
~+ date: Wed, 24 Sep 2025 17:19:13 GMT
!+ access-control-allow-headers: accept, content-type, content-length, cache-control, accept-language, api-user-agent, if-match, if-modified-since, if-none-match, dnt, accept-encoding
!+ access-control-expose-headers: etag
Sep 30 2025, 5:18 PM · MW-Interfaces-Team (MWI-Sprint-19 (2025-09-23 to 2025-10-07)), serviceops, OKR-Work
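
A header comparison like the one reported above can be approximated with a short Python sketch. This is not the test tool that produced that output: `diff_headers` is a hypothetical name, the reading of `!+` as "only present on one side" and `~-`/`~+` as "value changed" is inferred from the report, and the `!-` marker for headers missing on the second side is an addition of this sketch.

```python
def diff_headers(baseline, candidate):
    """Return diff lines between two header mappings, compared case-insensitively."""
    base = {k.lower(): v for k, v in baseline.items()}
    cand = {k.lower(): v for k, v in candidate.items()}
    lines = []
    for name in sorted(base.keys() | cand.keys()):
        if name not in base:
            lines.append(f"!+ {name}: {cand[name]}")   # only in candidate
        elif name not in cand:
            lines.append(f"!- {name}: {base[name]}")   # only in baseline (assumed marker)
        elif base[name] != cand[name]:
            lines.append(f"~- {name}: {base[name]}")   # old value
            lines.append(f"~+ {name}: {cand[name]}")   # new value
    return lines


# Illustrative values only, not real responses.
baseline = {"Server": "a", "Date": "x"}
candidate = {"server": "b", "date": "x", "X-Frame-Options": "SAMEORIGIN"}
for line in diff_headers(baseline, candidate):
    print(line)
```

In practice, headers that legitimately vary per request (such as `date` and `server` in the report above) would probably be excluded before comparing, so that only meaningful differences remain.
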

Sep 24 2025

hnowlan moved T404888: Parse DMARC reports and create a dashboard from data from Inbox to Backlog on the SRE Observability board.
Sep 24 2025, 3:12 PM · Patch-For-Review, SRE Observability, Epic, Infrastructure-Foundations, Mail
hnowlan moved T404546: IPoid: Rate of "request handled" log events flattened from Inbox to Radar on the SRE Observability board.
Sep 24 2025, 3:12 PM · Product Safety and Integrity (Sprint Mint Choc Chip Ice Cream (Oct 20 - Nov 7)), Essential-Work, SRE Observability, iPoid-Service
hnowlan closed T405367: Dial test2 traffic to have 50/50 split as Resolved.

The 50% change has been merged and is rolling out over the next 30 minutes or so. Please be aware of caching when testing, as cached responses might limit the distribution of requests.

Sep 24 2025, 11:10 AM · MW-Interfaces-Team, serviceops, OKR-Work
hnowlan closed T405367: Dial test2 traffic to have 50/50 split, a subtask of T400130: Central REST gateway for APIs, as Resolved.
Sep 24 2025, 11:09 AM · MW-Interfaces-Team (MWI-Roadmap), serviceops, Epic, OKR-Work

Sep 23 2025

hnowlan added a comment to T405367: Dial test2 traffic to have 50/50 split.

We have a change ready for this that can be pushed at any point. Unfortunately, at present the only easy way to distinguish between requests to the gateway and requests that don't hit it is the presence of the content-length or content-security-policy headers. The via header is stripped at the edge.

Sep 23 2025, 3:02 PM · MW-Interfaces-Team, serviceops, OKR-Work
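
As a rough illustration of the header-based check mentioned in the comment above, the following Python sketch buckets a set of responses by which of the two marker headers they carry. It is a sketch under assumptions: `marker_signature` and `tally` are hypothetical names, and the code deliberately does not say which bucket corresponds to the gateway, since the comment does not spell that out.

```python
from collections import Counter

# Headers named in the comment as the only easy distinguishing signal.
MARKER_HEADERS = ("content-length", "content-security-policy")


def marker_signature(headers):
    """Return the marker headers present in one response, as a sorted tuple."""
    names = {name.lower() for name in headers}
    return tuple(sorted(h for h in MARKER_HEADERS if h in names))


def tally(responses):
    """Count responses per marker signature to eyeball the traffic split."""
    return Counter(marker_signature(h) for h in responses)


# Illustrative header sets only, not captured traffic.
responses = [
    {"Content-Length": "10", "Content-Security-Policy": "default-src 'none'"},
    {"Date": "Wed, 24 Sep 2025 17:19:14 GMT"},
    {"content-length": "5"},
]
print(tally(responses))
```
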
hnowlan closed T402412: Route test2wiki rest.php APIs through rest-gateway as Resolved.

test2wiki's rest.php is now routed via the rest-gateway. This can be seen in the Via header supplied by the gateway.

Sep 23 2025, 2:23 PM · Patch-For-Review, MW-Interfaces-Team, serviceops, OKR-Work
hnowlan closed T402412: Route test2wiki rest.php APIs through rest-gateway , a subtask of T400152: [SPIKE] Test plan for rest.php routes in REST gateway, as Resolved.
Sep 23 2025, 2:23 PM · MW-Interfaces-Team (MWI-Sprint-19 (2025-09-23 to 2025-10-07)), serviceops, OKR-Work

Sep 10 2025

hnowlan moved T398092: Ensure pushgateway 1.11.0 avoids log spam when metric help strings are inconsistent from Inbox to FY2025/2026-Q2 on the SRE Observability board.
Sep 10 2025, 2:46 PM · Observability-Metrics, SRE Observability (FY2025/2026-Q2)
hnowlan renamed T398092: Ensure pushgateway 1.11.0 avoids log spam when metric help strings are inconsistent from Inconsistent Prometheus metrics generating many logs to Ensure pushgateway 1.11.0 avoids log spam when metric help strings are inconsistent.
Sep 10 2025, 2:45 PM · Observability-Metrics, SRE Observability (FY2025/2026-Q2)
hnowlan reopened T398092: Ensure pushgateway 1.11.0 avoids log spam when metric help strings are inconsistent, a subtask of T398091: Prometheus1005 out of disk on /, as Open.
Sep 10 2025, 2:45 PM · SRE Observability
hnowlan reopened T398092: Ensure pushgateway 1.11.0 avoids log spam when metric help strings are inconsistent as "Open".

Reopening, moving to o11y backlog.

Sep 10 2025, 2:45 PM · Observability-Metrics, SRE Observability (FY2025/2026-Q2)
hnowlan closed T398092: Ensure pushgateway 1.11.0 avoids log spam when metric help strings are inconsistent, a subtask of T398091: Prometheus1005 out of disk on /, as Resolved.
Sep 10 2025, 2:03 PM · SRE Observability
hnowlan closed T398092: Ensure pushgateway 1.11.0 avoids log spam when metric help strings are inconsistent as Resolved.

Resolving this issue for now, in order to track work elsewhere.

Sep 10 2025, 2:03 PM · Observability-Metrics, SRE Observability (FY2025/2026-Q2)
hnowlan removed a project from T403823: Create monitoring for incomplete GitLab restarts: observability.

Removing the observability tag here as I don't think there's anything for us to do on this task - please re-add us if needs be.

Sep 10 2025, 10:53 AM · collaboration-services

Sep 2 2025

hnowlan added a comment to T402181: Deploy Temporary accounts to all remaining small-sized projects.

One notable change that lines up exactly with these increases is a significant jump in mw-web 200s (graph attached to the task as image.png).

Notably, this does not seem to coincide with a significant increase in external requests that we can easily discern.

Sep 2 2025, 5:27 PM · Trust and Safety Product Sprint (Sprint Princess Tarta (August 18 - September 5)), OKR-Work, Trust and Safety Product Team, Temporary accounts

Aug 27 2025

hnowlan moved T402889: Puppet CA certificate Puppet CA: mailman-puppetmaster.mailman.eqiad.wmflabs expired from Inbox to Radar on the SRE Observability board.
Aug 27 2025, 2:37 PM · SRE Observability, collaboration-services
hnowlan added a comment to T402889: Puppet CA certificate Puppet CA: mailman-puppetmaster.mailman.eqiad.wmflabs expired.

In theory this should have become critical at 1 week remaining - is the critical alert defined properly?

Aug 27 2025, 2:37 PM · SRE Observability, collaboration-services
hnowlan moved T401908: Define a policy for Grafana Alerting from Inbox to FY2025/2026-Q1 on the SRE Observability board.
Aug 27 2025, 2:32 PM · SRE Observability (FY2025/2026-Q1), Grafana
hnowlan moved T402418: Figure out why some alerts aren't making it to #wikimedia-data-platform-alerts IRC from Inbox to Radar on the observability board.
Aug 27 2025, 2:32 PM · Essential-Work, observability, Data-Platform-SRE (2025.08.16 - 2025.09.05)
hnowlan raised the priority of T402418: Figure out why some alerts aren't making it to #wikimedia-data-platform-alerts IRC from Medium to Needs Triage.
Aug 27 2025, 2:31 PM · Essential-Work, observability, Data-Platform-SRE (2025.08.16 - 2025.09.05)
hnowlan triaged T402418: Figure out why some alerts aren't making it to #wikimedia-data-platform-alerts IRC as Medium priority.
Aug 27 2025, 2:31 PM · Essential-Work, observability, Data-Platform-SRE (2025.08.16 - 2025.09.05)

Aug 25 2025

hnowlan closed T230250: Broken thumbnails for Commons images as Resolved.

These images are now rendering correctly - hard to pinpoint why as this issue is quite old.

Aug 25 2025, 4:58 PM · Commons, Thumbor
hnowlan closed T302979: Failure to produce an image at specified resolution as Resolved.

Since adding the resource changes in T392348, it looks like the 7000px version of this image now renders correctly.

Aug 25 2025, 4:56 PM · Commons, Thumbor
hnowlan closed T394680: HTTP 500 error for specific thumbnail as Resolved.

Image is now rendering, most likely fixed by T381594.

Aug 25 2025, 4:53 PM · Thumbor, Commons
hnowlan added a comment to T381594: Thumbnailing for c:File:Carl_Weigert.jpg fails due to py3exiv2 handling of invalid ICC profiles.

Looks like the affected thumbs are working now - a big thanks to @AntiCompositeNumber for the fix.

Aug 25 2025, 1:09 PM · Thumbor, Commons