
hnowlan (Hugh Nowlan)
User

User Details

User Since
Jan 6 2020, 12:19 PM (334 w, 6 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
HNowlan (WMF) [ Global Accounts ]

Recent Activity

Thu, Jun 4

hnowlan closed T425299: ATS backend errors for performance.discovery.wmnet should not page as Resolved.
Thu, Jun 4, 4:14 PM · Observability-Alerting, SRE
hnowlan moved T422056: wdqs: database node logs should be pushed to logstash from Inbox to Radar on the SRE Observability board.
Thu, Jun 4, 3:38 PM · Wikidata Platform Team (Sprint 06 (2026/06/02)), SRE Observability, OKR-Work
hnowlan moved T421996: Create an automation against the logs from Radar to Inbox on the SRE Observability board.
Thu, Jun 4, 3:38 PM · SRE Observability

Wed, Jun 3

hnowlan added a comment to T424794: Shorten IRC alert messages.

We'd definitely like to make the config on these messages more flexible; there's a lot of stuff in there for sure. In this case, as a quick win, would it work to remove the redundancy in "TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity" by dropping "Transit, peering or transport OUT" from the summary? That text is already in the alert name, and the summary is the part that is more configurable right now.

Wed, Jun 3, 5:39 PM · Observability-Alerting
hnowlan closed T428026: Statuspage API keys needs to be rotated as Resolved.
Wed, Jun 3, 5:33 PM · SRE Observability
hnowlan added a comment to T428026: Statuspage API keys needs to be rotated.

Thanks for the heads-up! It looks like this message is for the old statograph key, which hasn't been used since we rotated in T412793 - our current key expires in 8 months. That said, the message about "existing API keys" is a little ambiguous and alarming, so I've rotated the key again.

Wed, Jun 3, 5:32 PM · SRE Observability
hnowlan moved T427469: Alerts showing "AlertLintProblem" from Inbox to Backlog on the SRE Observability board.
Wed, Jun 3, 5:09 PM · SRE Observability, SRE, observability
hnowlan moved T427469: Alerts showing "AlertLintProblem" from Inbox to Radar on the observability board.
Wed, Jun 3, 5:08 PM · SRE Observability, SRE, observability
hnowlan closed T427803: New VictorOps user request for cwilliams as Resolved.

Invite sent, added you to the SRE team. Let me know if you have any issues logging in

Wed, Jun 3, 3:01 PM · observability
hnowlan assigned T427469: Alerts showing "AlertLintProblem" to tappof.
Wed, Jun 3, 2:16 PM · SRE Observability, SRE, observability
hnowlan claimed T427803: New VictorOps user request for cwilliams.
Wed, Jun 3, 2:13 PM · observability

Tue, Jun 2

hnowlan closed T417160: wikimediastatus.net cache did not update and stale information was served to users as Declined.

Based on some initial requests I couldn't reproduce this - for now I am going to close but will reopen this if it comes back around.

Tue, Jun 2, 4:38 PM · SRE Observability, Sustainability (Incident Followup)

Fri, May 29

hnowlan created T427617: Create Observability best practices wikitech page.
Fri, May 29, 11:45 AM · SRE Observability

Wed, May 27

hnowlan added a comment to T357145: Consider moving to haproxy ingress for Thumbor workers.

Unfortunately I don't think Envoy is gonna be fit for purpose, even aside from the concerns above. Envoy's circuit breakers are per-cluster rather than per-host, so unless we have some kind of extremely chaotic one-cluster-per-worker template explosion, I don't think there's an easy way to mirror the per-worker maxconn behaviour in Envoy. If the Thumbor workers 503ed when busy, we could leverage retries, but at that point we'd be making it not single-threaded in the first place.
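
To illustrate the per-cluster versus per-host distinction in the comment above, here is a minimal Go sketch. It is not Envoy or HAProxy configuration: the function names (admitPerCluster, admitPerHost), the limits, and the worker state are all hypothetical, and exist only to show why a cluster-wide cap cannot stand in for a per-worker maxconn.

```go
package main

import "fmt"

// admitPerCluster models a single cluster-wide connection cap: a request is
// admitted whenever the total in-flight count is below clusterMax, no matter
// how that load is spread across the workers.
func admitPerCluster(inFlight []int, clusterMax int) bool {
	total := 0
	for _, n := range inFlight {
		total += n
	}
	return total < clusterMax
}

// admitPerHost models a per-worker cap (in the spirit of a per-server
// maxconn): a request for a given worker is admitted only if that specific
// worker is below hostMax.
func admitPerHost(inFlight []int, worker, hostMax int) bool {
	return inFlight[worker] < hostMax
}

func main() {
	// Hypothetical state: three single-threaded workers, worker 0 is busy.
	inFlight := []int{1, 0, 0}

	// A cluster-wide cap of 3 still admits a request, which may then be
	// routed to the already-busy worker 0.
	fmt.Println(admitPerCluster(inFlight, 3))

	// A per-host cap of 1 refuses a second request for worker 0...
	fmt.Println(admitPerHost(inFlight, 0, 1))

	// ...while still allowing the idle worker 1 to take it.
	fmt.Println(admitPerHost(inFlight, 1, 1))
}
```

With one busy worker out of three, the cluster-wide check passes even though the busy worker cannot take more work, which is the gap the comment describes.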

Wed, May 27, 4:25 PM · Kubernetes, serviceops-deprecated, Thumbor
hnowlan added a comment to T422056: wdqs: database node logs should be pushed to logstash.

In fact, being on Kubernetes might make this very straightforward.

Wed, May 27, 4:00 PM · Wikidata Platform Team (Sprint 06 (2026/06/02)), SRE Observability, OKR-Work
hnowlan moved T426809: Alert in need of triage: AlertLintProblem (instance localhost:9123) from Inbox to Backlog on the SRE Observability board.
Wed, May 27, 2:06 PM · Patch-For-Review, SRE Observability, sre-alert-triage

Thu, May 21

hnowlan added a comment to T426137: Increase trusted volunteer's visibility into production incidents.

As of a few minutes ago, all new incidents will be created as WMF-NDA by default. I'm working to move some historical events to WMF-NDA also.

Thu, May 21, 3:31 PM · Incident Tooling, SRE, corto
hnowlan closed T424835: Check for Mediawiki redirect loops via httpbb as Resolved.

@hnowlan I think we should either reframe this task to cover case 1 (detect and alert on redirect loops after deployment, by monitoring webrequest logs, somewhere other than httpbb) or close it as resolved since we already wrote the necessary tests for this incident. My instinct is just to close it. What do you think?

Thu, May 21, 9:38 AM · ServiceOps new, SRE Observability, Sustainability (Incident Followup)

Wed, May 20

hnowlan assigned T426809: Alert in need of triage: AlertLintProblem (instance localhost:9123) to tappof.
Wed, May 20, 2:09 PM · Patch-For-Review, SRE Observability, sre-alert-triage

Mon, May 18

hnowlan moved T426639: sre.kafka.roll-restart-reboot-brokers: command-config is not a recognized option from Inbox to Radar on the SRE Observability board.
Mon, May 18, 4:56 PM · Infrastructure-Foundations, SRE Observability, SRE-tools
hnowlan added a comment to T423108: Switch cortobot Google doc format to pageless.

This is a bit tricky to pull off in our current setup - the function essentially doesn't exist in the gdrive API we use, but is present in the gdoc API. Was there a reason for us going with that? Obviously there is other plumbing we'd need to add to change from the gdrive API but based on the scant docs it might look something like:

documentFormat := DocumentFormat{
	DocumentMode: "PAGELESS",
}
Mon, May 18, 3:46 PM · Patch-For-Review, Sustainability (Incident Followup), corto
hnowlan updated the task description for T426381: s7 master db2218 down.
Mon, May 18, 3:43 PM · DBA, Wikimedia-Incident
hnowlan updated the task description for T407200: Gerrit is down.
Mon, May 18, 3:40 PM · Incident Severity 2, Gerrit, Wikimedia-Incident
hnowlan updated the task description for T418840: mariadb replicas broken.
Mon, May 18, 3:33 PM · Incident Severity 2, Wikimedia-Incident
hnowlan changed the visibility for T419457: dse-k8s control plane OOM.
Mon, May 18, 3:31 PM · Incident Severity 3, Data-Platform-SRE (2026-03-06 - 2026-03-27), Wikimedia-Incident

Thu, May 14

hnowlan added a comment to T426137: Increase trusted volunteer's visibility into production incidents.

Some initial context: The kinds of issues SRE are dealing with have changed significantly in the last ~year. Historically, many incidents weren't ever documented on wikitech due to DENY reasons, so there isn't really any major change for that aspect of open visibility. The only change over the last few years is the increased quantity of these DENY-related tasks, related to both scrapers and attackers (which there are a few Diff posts about). The majority of these incidents aren't user-facing, thanks to the protections we have in place and to SREs following our incident response processes, so there is little to report outside of sensitive actions to protect the projects. I do think that we can do more on this, and there has been some work done on standardising how we communicate events like this that I'm hoping we can move forward soon.

Thu, May 14, 5:16 PM · Incident Tooling, SRE, corto
hnowlan closed T425400: Set external url for thanos.w.o web interface as Resolved.
Thu, May 14, 3:47 PM · SRE Observability
hnowlan moved T425795: Grafana: deploy grafana-dashboard-reporter-app from Inbox to FY2025/2026-Q4 on the SRE Observability board.
Thu, May 14, 3:45 PM · Patch-For-Review, SRE Observability (FY2025/2026-Q4), SRE-SLO
hnowlan moved T425400: Set external url for thanos.w.o web interface from Inbox to Backlog on the SRE Observability board.
Thu, May 14, 3:44 PM · SRE Observability
hnowlan changed the visibility for T425693: "upload at ulsfo depooled due to tcp timeout".
Thu, May 14, 1:29 PM · Incident Severity 2, Wikimedia-Incident
hnowlan created T426301: alert on server_ip NEL field for each DNS-pooled PoP.
Thu, May 14, 10:04 AM · Traffic, Sustainability (Incident Followup)
hnowlan added a project to T425693: "upload at ulsfo depooled due to tcp timeout": Incident Severity 2.
Thu, May 14, 9:58 AM · Incident Severity 2, Wikimedia-Incident
hnowlan created T426299: Ensure the pre-repooling checklist includes to restart liberica services whenever realserver IPs has changed.
Thu, May 14, 9:57 AM · Traffic, Sustainability (Incident Followup)
hnowlan updated the task description for T425693: "upload at ulsfo depooled due to tcp timeout".
Thu, May 14, 9:25 AM · Incident Severity 2, Wikimedia-Incident

Wed, May 13

hnowlan moved T425424: Deprecate Fundraising nsca icinga alert collection from Inbox to Radar on the observability board.
Wed, May 13, 1:44 PM · fundraising-tech-ops, observability

Mon, May 11

hnowlan moved T424835: Check for Mediawiki redirect loops via httpbb from Inbox to Radar on the SRE Observability board.
Mon, May 11, 3:00 PM · ServiceOps new, SRE Observability, Sustainability (Incident Followup)
hnowlan renamed T424835: Check for Mediawiki redirect loops via httpbb from Can we check content/http code of www.wikipedia.org to Check for Mediawiki redirect loops via httpbb.
Mon, May 11, 2:58 PM · ServiceOps new, SRE Observability, Sustainability (Incident Followup)
hnowlan claimed T417160: wikimediastatus.net cache did not update and stale information was served to users.
Mon, May 11, 2:57 PM · SRE Observability, Sustainability (Incident Followup)
hnowlan added a project to T424844: wdqs-proxy should integrate with prometheus: SRE Observability.
Mon, May 11, 9:12 AM · SRE Observability, Wikidata Platform Team (Sprint 05 (2026/05/05)), OKR-Work

May 6 2026

hnowlan assigned T425424: Deprecate Fundraising nsca icinga alert collection to tappof.
May 6 2026, 2:15 PM · fundraising-tech-ops, observability
hnowlan assigned T425400: Set external url for thanos.w.o web interface to tappof.
May 6 2026, 2:15 PM · SRE Observability
hnowlan moved T425115: Rebalance kafka logging codfw (2026 edition) from Inbox to FY2025/2026-Q4 on the SRE Observability board.
May 6 2026, 2:13 PM · SRE Observability (FY2025/2026-Q4)
hnowlan moved T424204: profile/module violations in use of profile::pki::get_cert() from Backlog to FY2025/2026-Q4 on the SRE Observability board.
May 6 2026, 2:12 PM · SRE Observability (FY2025/2026-Q4), Data-Persistence, ServiceOps new, Infrastructure-Foundations
hnowlan moved T424204: profile/module violations in use of profile::pki::get_cert() from Inbox to Backlog on the SRE Observability board.
May 6 2026, 2:12 PM · SRE Observability (FY2025/2026-Q4), Data-Persistence, ServiceOps new, Infrastructure-Foundations
hnowlan added a comment to T417160: wikimediastatus.net cache did not update and stale information was served to users.

The first step for this task is to determine whether this is reproducible - we've only ever hit this issue once, so it's not clear how/when this happens. If it's not reproducible, it can be closed.

May 6 2026, 2:10 PM · SRE Observability, Sustainability (Incident Followup)
hnowlan moved T417160: wikimediastatus.net cache did not update and stale information was served to users from Inbox to Backlog on the SRE Observability board.
May 6 2026, 2:10 PM · SRE Observability, Sustainability (Incident Followup)

May 5 2026

hnowlan added a project to T424765: webrequest_sampled not updated: Incident Severity 3.
May 5 2026, 1:55 PM · Incident Severity 3, Wikimedia-Incident

Apr 30 2026

hnowlan added a comment to T406308: Link to view requestctl rule in superset no longer working.

I suspect this is a side effect of the fact that it's now rare for requests to hit a single requestctl rule; instead, there are generally multiple overlapping requestctl rules (particularly hap: etc). The search works fine afaict, it's just that searching for *just* the rule in question returns nothing. I'm not sure if there's something in superset for regexing this search - I suspect not, although turnilo can do this fine.

Apr 30 2026, 4:06 PM · requestctl
hnowlan edited projects for T417160: wikimediastatus.net cache did not update and stale information was served to users, added: SRE Observability; removed Incident Tooling.
Apr 30 2026, 3:33 PM · SRE Observability, Sustainability (Incident Followup)
hnowlan added a comment to T424835: Check for Mediawiki redirect loops via httpbb.

I wonder if this would be better phrased as "should we detect redirect loops for wikis in general"? It would probably fit best as a httpbb check in that case.

Apr 30 2026, 3:29 PM · ServiceOps new, SRE Observability, Sustainability (Incident Followup)
hnowlan added a comment to T422923: New VictorOps user request - atsuko.

Could you please add me to the team? I don't have access to change the SRE rotation.

Apr 30 2026, 9:21 AM · observability

Apr 29 2026

hnowlan closed T422923: New VictorOps user request - atsuko as Resolved.
Apr 29 2026, 4:02 PM · observability
hnowlan added a comment to T422923: New VictorOps user request - atsuko.

Hi Atsuko, sorry for the delay - I have invited you to VictorOps. Please let me know if you have any issues logging in.

Apr 29 2026, 4:02 PM · observability
hnowlan added a project to T393201: Fail more gracefully when a job is out of resources: Sustainability (Incident Followup).
Apr 29 2026, 3:16 PM · Sustainability (Incident Followup), Continuous-Integration-Infrastructure
hnowlan updated subscribers of T424204: profile/module violations in use of profile::pki::get_cert().

@colewhite will be looking at the opensearch module.

Apr 29 2026, 2:30 PM · SRE Observability (FY2025/2026-Q4), Data-Persistence, ServiceOps new, Infrastructure-Foundations
hnowlan added a comment to T424835: Check for Mediawiki redirect loops via httpbb.

Is this in order to avoid a recurrence of the issue in the incident? It seems fairly specific - is there a risk it will recur? I'm not sure this is actionable enough to create a useful alert.

Apr 29 2026, 2:30 PM · ServiceOps new, SRE Observability, Sustainability (Incident Followup)
hnowlan moved T422816: Observability: Re-IP codfw private baremetal hosts to new per-rack vlans/subnets from Inbox to Migrating to SRE Observability on the observability board.
Apr 29 2026, 2:06 PM · observability, SRE
hnowlan edited projects for T360913: Swift proxy server misbehaviour (no longer calling `accept`?), added: Sustainability (Incident Followup); removed Sustainability.
Apr 29 2026, 1:58 PM · Sustainability (Incident Followup), SRE-OnFire, SRE-swift-storage
hnowlan moved T422186: New VictorOps user request from Radar to Migrating to SRE Observability on the observability board.
Apr 29 2026, 1:54 PM · observability
hnowlan moved T422923: New VictorOps user request - atsuko from Radar to Migrating to SRE Observability on the observability board.
Apr 29 2026, 1:54 PM · observability
hnowlan moved T422186: New VictorOps user request from Inbox to Radar on the observability board.
Apr 29 2026, 1:54 PM · observability
hnowlan moved T422923: New VictorOps user request - atsuko from Inbox to Radar on the observability board.
Apr 29 2026, 1:54 PM · observability
hnowlan moved T423851: Collect calico BGP metrics from Inbox to Radar on the observability board.
Apr 29 2026, 1:54 PM · Sustainability (Incident Followup), ServiceOps-good-first-task, ServiceOps new, observability, Prod-Kubernetes, Kubernetes
hnowlan moved T423852: Add calico network alerting from Inbox to Radar on the observability board.
Apr 29 2026, 1:53 PM · Sustainability (Incident Followup), ServiceOps-good-first-task, ServiceOps new, observability, Prod-Kubernetes, Kubernetes
hnowlan created T424849: Alert on puppetserver failure signals.
Apr 29 2026, 1:48 PM · Infrastructure-Foundations
hnowlan added a project to T357900: Alert on necessary puppetserver restarts: Sustainability (Incident Followup).
Apr 29 2026, 1:45 PM · Sustainability (Incident Followup), Puppet-Infrastructure, Infrastructure-Foundations
hnowlan assigned T418728: Increase Benthos capacity to tappof.
Apr 29 2026, 11:40 AM · SRE Observability (FY2025/2026-Q4), Sustainability (Incident Followup)

Apr 23 2026

hnowlan added a comment to T415975: Add urwikisource to RESTBase.

My bad, my sync didn't include the change as you deduced. Sync is underway and I see the url supplied working. For reference in future it's probably best to tag serviceops or ask in #wikimedia-serviceops on IRC for prompt support on restbase deploys.

Apr 23 2026, 10:42 AM · RESTBase

Apr 13 2026

hnowlan edited projects for T423109: Consider adding timestamps to vopsbot list, added: Sustainability (Incident Followup); removed Wikimedia-Incident.
Apr 13 2026, 11:44 AM · Observability-Alerting, Sustainability (Incident Followup)
hnowlan edited projects for T423108: Switch cortobot Google doc format to pageless, added: Sustainability (Incident Followup); removed Wikimedia-Incident.
Apr 13 2026, 11:44 AM · Patch-For-Review, Sustainability (Incident Followup), corto
hnowlan created T423109: Consider adding timestamps to vopsbot list.
Apr 13 2026, 10:50 AM · Observability-Alerting, Sustainability (Incident Followup)
hnowlan triaged T423108: Switch cortobot Google doc format to pageless as Low priority.
Apr 13 2026, 10:46 AM · Patch-For-Review, Sustainability (Incident Followup), corto
hnowlan created T423108: Switch cortobot Google doc format to pageless.
Apr 13 2026, 10:46 AM · Patch-For-Review, Sustainability (Incident Followup), corto
hnowlan created T423107: Add Mediawiki metrics around circuit breaking.
Apr 13 2026, 10:37 AM · MediaWiki-libs-Rdbms, Sustainability (Incident Followup)
hnowlan added a comment to T706: Requests for addition to the #acl*Project-Admins group (in comments).

Could I be added to the project admins ACL please? I would like to add tags to manage incident severity in our incident response and follow-up processes.

Apr 13 2026, 9:41 AM · Tracking-Neverending, Project-Admins

Apr 9 2026

hnowlan closed Restricted Task, a subtask of T418392: 503 Service Unavailable No server is available to handle this request., as Resolved.
Apr 9 2026, 10:00 AM · Wikimedia-Incident, netops, SRE, Infrastructure-Foundations, Traffic

Apr 8 2026

hnowlan claimed T422186: New VictorOps user request.
Apr 8 2026, 2:13 PM · observability
hnowlan moved T421996: Create an automation against the logs from Inbox to Radar on the SRE Observability board.
Apr 8 2026, 1:59 PM · SRE Observability

Apr 7 2026

hnowlan updated the task description for T422499: Create private Grafana instance.
Apr 7 2026, 2:31 PM · SRE Observability (FY2025/2026-Q4)
hnowlan created T422499: Create private Grafana instance.
Apr 7 2026, 1:59 PM · SRE Observability (FY2025/2026-Q4)
hnowlan closed T421650: IP Block/Throttling relief request: urbipedia.org - Bot attack mitigated as Resolved.

Glad that it's sorted out!

Apr 7 2026, 9:22 AM · Traffic, SRE

Apr 1 2026

hnowlan removed a project from T421465: Configure dse-k8s-worker10[20-23]: SRE.
Apr 1 2026, 5:01 PM · Data-Platform-SRE (2026-06-05 - 2026-06-26)
hnowlan closed T422020: Yubikey-SSH-FIDO for Tiziano Fogli (tappof / BACKUP) as Resolved.
Apr 1 2026, 5:00 PM · SRE, SRE-Access-Requests
hnowlan assigned T421996: Create an automation against the logs to herron.
Apr 1 2026, 2:19 PM · SRE Observability
hnowlan moved T331941: arclamp-log.py prunes data too soon (after 30d instead of 90d) from Inbox to Radar on the observability board.
Apr 1 2026, 2:16 PM · observability, Arc-Lamp
hnowlan moved T408917: Flame graphs don't seem to be collected on auth.wikimedia.org from Inbox to Radar on the observability board.
Apr 1 2026, 2:15 PM · observability, MediaWiki-Platform-Team (Radar), Arc-Lamp
hnowlan assigned T415317: Implement an alert to detect changes in the number of ingested series to tappof.
Apr 1 2026, 1:57 PM · SRE Observability (FY2025/2026-Q3), Observability-Metrics
hnowlan moved T418731: Either remove or update non-working metrics in VisualEditor from Inbox to Radar on the Observability-Metrics board.
Apr 1 2026, 1:57 PM · Patch-For-Review, Technical-Debt, VisualEditor, Observability-Metrics
hnowlan moved T417879: Improve OAuth API usage metrics from Inbox to Radar on the Observability-Metrics board.
Apr 1 2026, 1:57 PM · MediaWiki-Platform-Team (Kanban Board), Observability-Metrics, API Platform, MediaWiki-extensions-OAuth
hnowlan moved T420699: PrometheusSeriesCreationRateAnomalyHigh from Inbox to Radar on the observability board.
Apr 1 2026, 1:53 PM · Observability-Metrics, observability
hnowlan moved T421288: Action API: prefer the action parameter to be given as a query parameter, even for POST requests from Inbox to Radar on the observability board.
Apr 1 2026, 1:49 PM · MW-1.46-notes (1.46.0-wmf.24; 2026-04-14), MediaWiki-Platform-Team (Kanban Board), observability, ServiceOps new, MediaWiki-Action-API, MW-Interfaces-Team
hnowlan moved T420498: Factor in pooled status for SLO measurements from Inbox to Radar on the observability board.
Apr 1 2026, 1:47 PM · SRE-SLO, observability, Traffic
hnowlan moved T420676: Allow Prometheus query beyond 375 days in Grafana/Thanos from Radar to Migrating to SRE Observability on the observability board.
Apr 1 2026, 1:45 PM · Regression, Observability-Metrics, observability, Grafana
hnowlan moved T420676: Allow Prometheus query beyond 375 days in Grafana/Thanos from Inbox to Radar on the observability board.
Apr 1 2026, 1:45 PM · Regression, Observability-Metrics, observability, Grafana
hnowlan moved T418118: SystemdUnitFailed: grafana-ldap-users-sync.service on grafana1002:9100 from FY2025/2026-Q3 to FY2025/2026-Q4 on the SRE Observability board.
Apr 1 2026, 1:44 PM · SRE Observability (FY2025/2026-Q4)
hnowlan moved T417742: ThanosCompactHalted: pre compaction overlap check - overlaps found while gathering blocks. from FY2025/2026-Q3 to FY2025/2026-Q4 on the SRE Observability board.
Apr 1 2026, 1:44 PM · SRE Observability (FY2025/2026-Q4), Observability-Metrics
hnowlan moved T421183: Update Pint package from Inbox to FY2025/2026-Q4 on the SRE Observability board.
Apr 1 2026, 1:44 PM · SRE Observability (FY2025/2026-Q4)
hnowlan edited projects for T421183: Update Pint package, added: SRE Observability; removed observability.
Apr 1 2026, 1:44 PM · SRE Observability (FY2025/2026-Q4)
hnowlan changed the status of T421471: Requesting access to superset dashboard for mpostoronca from Open to In Progress.

Your access has been added - the change should be live within the next 30 or so minutes.

Apr 1 2026, 9:47 AM · SRE, SRE-Access-Requests
hnowlan added a comment to T421650: IP Block/Throttling relief request: urbipedia.org - Bot attack mitigated.

Upon reviewing our logs, every 429 for Urbipedia that I see is for the user agent QuickInstantCommons/1.5 MediaWiki/1.39.5; Urbipedia - addressing this UA will most likely resolve your issues.

Apr 1 2026, 9:32 AM · Traffic, SRE