User Details
- User Since: Jan 6 2020, 12:19 PM (308 w, 6 d)
- Availability: Available
- LDAP User: Unknown
- MediaWiki User: HNowlan (WMF)
Thu, Dec 4
This is resolved, thank you!
I think the worst of this trend has been reversed by reverting the change that set cutoff days to 1: https://grafana.wikimedia.org/goto/rwrkdsWDg?orgId=1
Tue, Dec 2
An added benefit of this work would be defining xhgui as an actual service - currently neither Arc Lamp nor xhgui migrates with the services switchover, and we have an explicit ask from the data persistence team to accommodate this across SRE.
Wed, Nov 19
Seems like these cases should be changed to query the page-analytics service directly on https://page-analytics.discovery.wmnet:30443/metrics/pageviews/[...]
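For illustration, a minimal sketch of such a direct query in Python, assuming the service exposes the same per-article pageviews path shape as the public REST API; the project, article, and date parameters below are placeholders, and certificate handling will depend on the calling host:

```python
import requests

# Host and port taken from the comment above; everything after /pageviews/
# is illustrative and assumes the public pageviews API path shape.
BASE = "https://page-analytics.discovery.wmnet:30443/metrics/pageviews"

resp = requests.get(
    f"{BASE}/per-article/en.wikipedia/all-access/user/Main_Page/daily/20251101/20251107",
    timeout=10,
    # Internal endpoints are typically signed by the internal CA; point this
    # at whatever bundle the calling host actually uses.
    verify="/etc/ssl/certs/ca-certificates.crt",
)
resp.raise_for_status()
print(resp.json())
```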
Tue, Nov 18
Just as a datapoint - I roll-restarted mobileapps and it had an immediate impact on wikifeeds: https://grafana.wikimedia.org/goto/lmB4-hmvg?orgId=1
Mon, Nov 17
Wikifeeds logs quite heavily in general, but it's hard to pick out the signal. It looks like there has been a solid increase in internal 504s, but there isn't really any further context in the error messages.
Could you supply some of these IP addresses for investigation? My gut feeling is that these are going to be health checks. Is this the api-gateway or the rest-gateway?
This issue was happening as a result of the migration of the action API to a common gateway within WMF infrastructure (work ticket: T408223, higher-level reasoning/tracking: T406607). We're currently undergoing a slow rollout of wikis by group, with the exception of enwiki, which means that all wikis apart from enwiki are currently behind the gateway, along with 10% of requests for enwiki. The gateway itself imposes a default timeout of 15 seconds, which was causing the issue seen here. We've since raised the timeout and the queries in this ticket are now succeeding. Apologies for the disruption.
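As a rough sanity check rather than an exact reproduction, one way to confirm the raised timeout is to time a known slow query end to end: before the change, anything running past roughly 15 seconds through the gateway would surface as a 504. The endpoint and query parameters in this sketch are placeholders, not the exact queries from this ticket:

```python
import time
import requests

# Placeholder endpoint/params: substitute the slow query from this ticket.
URL = "https://en.wikipedia.org/w/api.php"
PARAMS = {"action": "query", "format": "json", "list": "allpages", "aplimit": "max"}

start = time.monotonic()
resp = requests.get(URL, params=PARAMS, timeout=120)  # generous client-side timeout
elapsed = time.monotonic() - start

# With the old 15s gateway timeout a long-running query came back as a 504;
# after the timeout was raised it should return 200 even if it runs longer.
print(f"HTTP {resp.status_code} in {elapsed:.1f}s")
```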
Fri, Nov 14
I think we can close this ticket as we won't need a public cloud account for this work - can you confirm, @tappof?
Wed, Nov 12
Invited! Please let us know if you have any issues. Once you've created your account, please make sure you can log into https://app.oncall-optimizer.com/ and sync your calendar. Thanks!
Nov 6 2025
Hi Arian, thanks for the ticket - could you let us know what username you would like for your account? Usually we'd go with something akin to abozorg-wmde.
Merged! I see deployment in the user groups for itamar now.
Nov 4 2025
Thanks for the clarification.
Hi Virginie, your account appears to already be a member of analytics-privatedata-users which should grant you Superset access. This access was added in T407605.
L3 signed, NDA applies. Key verified OOB.
Nov 3 2025
Key verified out of band.
Blocked on approval from @mark.
Awaiting out of band verification of SSH key on Slack. Tagging @thcipriani as approver for deployment group.
Oct 13 2025
Migration complete. Impact on rest-gateway was minimal, no scaling up required.
Oct 9 2025
We're rolling out 10% of enwiki at the moment, and we will leave things there until next week.
Oct 7 2025
We're now using mw-api-ext as appropriate for rest.php and mw-api-related APIs.
Oct 3 2025
We've implemented rest-gateway-ro in multi-dc.lua, and traffic flows are moving as expected.
I think everything here has been taken care of.
Sep 24 2025
The 50% change has been merged and is rolling out over the next 30 minutes or so. Please be aware of caching when testing, as cached responses might limit the distribution of requests.
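If it helps while verifying the split, here's a small cache-busting sketch, assuming the query string forms part of the cache key so each request is a fresh miss; the URL is a placeholder for an endpoint covered by the change:

```python
import uuid
import requests

# Placeholder URL: substitute an endpoint covered by the 50% change.
URL = "https://en.wikipedia.org/api/rest_v1/page/summary/Earth"

for _ in range(10):
    # A unique query parameter should make each request a cache miss, so the
    # responses reflect the actual backend split rather than cached copies.
    resp = requests.get(URL, params={"cachebust": uuid.uuid4().hex}, timeout=10)
    print(resp.status_code, resp.headers.get("cache-control"))
```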
Sep 23 2025
We have a change ready for this that can be pushed at any point. Unfortunately, at present the only easy way to distinguish between requests that go via the gateway and requests that don't hit it is the presence of the content-length or content-security-policy headers. The Via header is stripped at the edge.
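For anyone checking from the outside, a rough sketch of that heuristic in Python; the test URL is a placeholder and the header check is only a best-effort signal, not a guarantee:

```python
import requests

# Placeholder URL: point this at a rest.php route being tested.
URL = "https://test2.wikipedia.org/w/rest.php/v1/page/Main_Page"

resp = requests.get(URL, timeout=10)

# The Via header is stripped at the edge, so from outside the only easy
# signal is whether headers the gateway adds are present on the response.
gateway_markers = [h for h in ("content-length", "content-security-policy")
                   if h in resp.headers]
print("likely via gateway" if gateway_markers else "likely direct", gateway_markers)
```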
test2wiki's rest.php is now routed via the rest-gateway. This can be seen in the Via header supplied by the gateway.
Sep 10 2025
Reopening, moving to o11y backlog.
Resolving this issue for now, in order to track work elsewhere.
Removing the observability tag here as I don't think there's anything for us to do on this task - please re-add us if needs be.
Sep 2 2025
One notable change that lines up exactly with these increases is a significant jump in mw-web 200s.
Notably, this does not seem to coincide with any significant increase in external requests that we can easily discern.
Aug 27 2025
In theory this should have become critical at 1 week remaining - is the critical alert defined properly?
Aug 25 2025
These images are now rendering correctly - it's hard to pinpoint why, as this issue is quite old.
Since adding the resource changes in T392348, it looks like the 7000px version of this image now renders correctly.
Image is now rendering, most likely fixed by T381594.
Looks like the affected thumbs are working now - a big thanks to @AntiCompositeNumber for the fix.


