User Details
- User Since
- Jan 6 2020, 12:19 PM (334 w, 6 d)
- Availability
- Available
- LDAP User
- Unknown
- MediaWiki User
- HNowlan (WMF)
Thu, Jun 4
Wed, Jun 3
We'd definitely like to make the config on these messages more flexible; there's a lot of stuff in there for sure. In this case, just for a quick win, would it work to remove the redundancy in "TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity" by removing "Transit, peering or transport OUT", as that's in the alert name already and is more configurable right now?
Thanks for the heads-up! It looks like this message is for the old statograph key, which hasn't been used since we rotated in T412793 - our current key expires in 8 months. That said, the message about "existing API keys" is a little ambiguous and alarming, so I've rotated the key again.
Invite sent, added you to the SRE team. Let me know if you have any issues logging in.
Tue, Jun 2
Based on some initial requests I couldn't reproduce this - for now I am going to close but will reopen this if it comes back around.
Fri, May 29
Wed, May 27
Unfortunately I don't think Envoy is gonna be fit for purpose even aside from the concerns above - Envoy's circuit breakers are per-cluster rather than per-host, so unless we have some kind of extremely chaotic one-cluster-per-worker template explosion I don't think there's an easy way to mirror the per-worker maxconn behaviour in Envoy. If the Thumbor workers 503ed when busy, that would mean we could leverage retries, but at that point we'd be making it not single threaded in the first place.
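To illustrate the distinction drawn above, here is a minimal sketch of a limit shared across a pool versus a limit applied to each worker. It is a toy model, not Envoy or Thumbor code: the `worker` type and both function names are hypothetical, and the limits are arbitrary.

```go
package main

import "fmt"

// worker is a hypothetical stand-in for one single-threaded backend process.
type worker struct {
	name     string
	inFlight int
}

// clusterAdmits applies one shared limit to the whole pool: it only looks at
// the total number of in-flight requests, not at which worker holds them.
func clusterAdmits(pool []*worker, maxTotal int) bool {
	total := 0
	for _, w := range pool {
		total += w.inFlight
	}
	return total < maxTotal
}

// pickWithHostLimit returns the first worker below its own limit, or nil when
// every worker is busy and the request should wait in a queue.
func pickWithHostLimit(pool []*worker, maxPerWorker int) *worker {
	for _, w := range pool {
		if w.inFlight < maxPerWorker {
			return w
		}
	}
	return nil
}

func main() {
	pool := []*worker{{name: "w0", inFlight: 1}, {name: "w1"}}

	// Shared limit of 2: the request is admitted, but nothing in this check
	// stops it from being routed to w0, which is already busy.
	fmt.Println("cluster-wide limit admits:", clusterAdmits(pool, 2))

	// Per-worker limit of 1: the busy worker is skipped.
	fmt.Println("per-worker limit picks:", pickWithHostLimit(pool, 1).name)

	// Once every worker is busy, the per-worker check asks the caller to queue.
	pool[1].inFlight = 1
	fmt.Println("all busy, queue the request:", pickWithHostLimit(pool, 1) == nil)
}
```

The shared check can pass while one worker is already occupied, which is the gap a per-worker maxconn closes.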
In fact, being on Kubernetes might make this very straightforward.
Thu, May 21
As of a few minutes ago, all new incidents will be created as WMF-NDA by default. I'm working to move some historical events to WMF-NDA also.
Wed, May 20
Mon, May 18
This is a bit tricky to pull off in our current setup - the function essentially doesn't exist in the gdrive API we use, but is present in the gdoc API. Was there a reason we went with the gdrive API? Obviously there is other plumbing we'd need to add to change from the gdrive API, but based on the scant docs it might look something like:
documentFormat := DocumentFormat{
	DocumentMode: "PAGELESS",
}
Thu, May 14
Some initial context: The kinds of issues SRE are dealing with have changed significantly in the last ~year. Historically, many incidents weren't ever documented on wikitech due to DENY reasons, so there isn't really any major change for that aspect of open visibility. The only change over the last few years is the increased quantity of these DENY-related tasks, related to both scrapers and attackers (which there are a few Diff posts about). The majority of these incidents aren't user-facing due to the protections we have in place and due to SREs following our incident response processes, so there is little to report outside of sensitive actions to protect the projects. I do think that we can do more on this, and there has been some work done on standardising communicating events like this that I'm hoping we can move forward soon.
Wed, May 13
Mon, May 11
May 6 2026
First step for this task is to determine whether this is reproducible - we've only ever hit this issue once, so it's not clear how/when this happens. If not reproducible it can be closed.
May 5 2026
Apr 30 2026
I suspect this is a side effect of the fact that it's now rare that we have requests hitting a single requestctl parameter, instead there are generally overlapping multiple requestctl rules (particularly hap: etc). The search works fine afaict, it's just that searching for *just* the rule in question alone returns nothing. I'm not sure if there's something in superset for regexing this search - I suspect not although turnilo can do this fine.
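As a rough illustration of why an exact search can come back empty while a regex search still finds the rule, here is a small sketch. It assumes, purely for illustration, that a request's matched rules are stored as one comma-separated string; the real field format, the rule names, and both function names below are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// ruleRegexp builds a pattern that matches rule as one whole entry in a
// comma-separated list. The list format is an assumption for illustration.
func ruleRegexp(rule string) *regexp.Regexp {
	return regexp.MustCompile(`(^|,)` + regexp.QuoteMeta(rule) + `(,|$)`)
}

// countMatches compares an exact-equality search with a regex search over
// the same set of field values.
func countMatches(values []string, rule string) (exact, byRegexp int) {
	re := ruleRegexp(rule)
	for _, v := range values {
		if v == rule {
			exact++
		}
		if re.MatchString(v) {
			byRegexp++
		}
	}
	return exact, byRegexp
}

func main() {
	// Every request here matched more than one rule, or a different rule.
	values := []string{"hap:example,rule_a", "rule_a,rule_b", "rule_ab"}
	exact, byRegexp := countMatches(values, "rule_a")
	fmt.Println("exact:", exact, "regex:", byRegexp)
}
```

With these sample values the exact search finds nothing because no request matched only `rule_a`, while the regex finds the two requests where it appears alongside other rules.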
I wonder if this would be better phrased as "should we detect redirect loops for wikis in general"? It would probably fit best as an httpbb check in that case.
Apr 29 2026
Hi Atsuko, sorry for the delay - I have invited you to VictorOps. Please let me know if you have any issues logging in.
@colewhite will be looking at the opensearch module.
Is this in order to avoid a recurrence of the issue in the incident? It seems fairly specific; is there a risk it will recur? I'm not sure this is actionable enough to create a useful alert.
Apr 23 2026
My bad, my sync didn't include the change as you deduced. Sync is underway and I see the url supplied working. For reference in future it's probably best to tag serviceops or ask in #wikimedia-serviceops on IRC for prompt support on restbase deploys.
Apr 13 2026
Could I be added to the project admins ACL please? I would like to add tags to manage incident severity in our incident response and follow-up processes.
Apr 9 2026
Apr 8 2026
Apr 7 2026
Glad that it's sorted out!
Apr 1 2026
Your access has been added - the change should be live within the next 30 or so minutes.
Upon reviewing our logs, every 429 for Urbipedia that I see is for the user agent QuickInstantCommons/1.5 MediaWiki/1.39.5; Urbipedia - addressing this UA will most likely resolve your issues.
