User Details
- User Since
- Jul 26 2022, 2:11 PM (176 w, 6 d)
- Availability
- Available
- IRC Nick
- claime
- LDAP User
- Clément Goubert
- MediaWiki User
- CGoubert-WMF
Today
Thanks for the heads up @VRiley-WMF.
Not strictly, but it would be nice to have for peace of mind. This may be a task that @Blake can work on in the coming quarter as prep for the next switchover.
Tagging in @KCVelaga_WMF as we discussed this briefly in Lisbon.
Thu, Dec 4
Yes, but it calls restGatewayGet, which routes through the rest-gateway instead of going directly to AWS.
Thanks @taavi for pointing out this is client-side JS and not internal.
I've linked the tasks I'd created as children of T410198: Determine the source of internal requests going through the API gateway. Feel free to dedupe as needed.
As far as I can tell from logstash, calls identified as internal (those that get no ratelimit_key) definitely come from Wikifeeds and MediaWiki itself, and all go to either page-analytics or device-analytics. The device-analytics calls also seem to originate from PageViewInfo.
Wed, Dec 3
cc @Kappakayala for SRE Manager approval
Tue, Dec 2
Thanks for all the support @brouberol <3
Hmm, I'd like to be able to actually see a failed run, so I'll change the job definition to keep runs for longer. That way I can inspect the actual Kubernetes objects next time it fails. CR uploaded.
Wed, Nov 26
Rebalance done on kafka-main-eqiad
Partition count stays imbalanced due to partition size variance, but storage is now balanced, which should equalize storage and bandwidth needs across brokers.
Added documentation for FOREACHWIKI_IGNORE_ERRORS: https://wikitech.wikimedia.org/w/index.php?title=Maintenance_scripts&oldid=2365511
Deployed and tested quickly, looks like it's fixed for me, resolving.
Feel free to reopen if there are still issues.
Tue, Nov 25
All ServiceOps hosts have been migrated to the new switch.
Waiting until T405950: eqiad row C/D Service Ops host migrations is done moving the kafka-main nodes, so we don't run into a network blip if the rebalance takes a while.
The commands should be run on deployment.eqiad.wmnet. They are in the task just for ease of copy/pasting; more complete troubleshooting documentation is on Wikitech.
Mon, Nov 24
Thu, Nov 20
I'd like to do the actual implementation of this with @Blake in the upcoming months.
@KOfori Could you approve this?
@Kappakayala and @hnowlan being OOO, @mark could I get approval for this please?
Wed, Nov 19
Tagging:
- Content-Transform-Team for Wikifeeds
- Growth-Team for GrowthExperiments
- PageViewInfo for PageViewInfo
Tue, Nov 18
Validating broker list:
Broker 1003 does not have a rack.id defined
Broker 1001 does not have a rack.id defined
Broker 1004 does not have a rack.id defined
Broker 1005 does not have a rack.id defined
Broker 1002 does not have a rack.id defined
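For tracking which brokers still need a rack.id, the validation output above can be reduced to a sorted list of broker IDs. This is only a rough Python sketch: the function name `brokers_missing_rack` is made up for illustration, and it assumes the message wording stays exactly as pasted above.

```python
import re

# Validation output as pasted in the comment above.
SAMPLE_OUTPUT = (
    "Validating broker list: "
    "Broker 1003 does not have a rack.id defined "
    "Broker 1001 does not have a rack.id defined "
    "Broker 1004 does not have a rack.id defined "
    "Broker 1005 does not have a rack.id defined "
    "Broker 1002 does not have a rack.id defined"
)

# Assumption: the message format is stable; adjust the pattern if it changes.
MISSING_RACK_RE = re.compile(r"Broker (\d+) does not have a rack\.id defined")


def brokers_missing_rack(output: str) -> list[int]:
    """Return the sorted, de-duplicated broker IDs reported without a rack.id."""
    return sorted({int(broker_id) for broker_id in MISSING_RACK_RE.findall(output)})


if __name__ == "__main__":
    print(brokers_missing_rack(SAMPLE_OUTPUT))
```

An empty list would mean no broker was reported as missing a rack.id in the text that was parsed.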
According to the rest-gateway logs, both MediaWiki itself and Wikifeeds are making direct calls to rest-gateway for the page-analytics_pageviews route.
Hmm, we should probably also figure out a way to route these to mw-api-int instead of mw-api-ext somehow. I have to think about this.
The kafka-main rebalance question is now pretty critical to figure out. One of the brokers' certificates is expiring in 2 days, so the brokers need a roll-restart to pick up the new one. As far as I can tell, this operation also requires topics to be balanced, which is not the case at the moment.
Silencing for 3 months.
Reopening for followup discussion of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1204865
Nov 13 2025
Nov 12 2025
@Blake will handle this one
Puppet updated, but we've got some work to do so the hosts can be racked in E/F (see https://phabricator.wikimedia.org/T405285#11350683).
We'll get back to you on this asap.
Puppet updated
Puppet updated
Puppet updated
Nov 11 2025
Nov 10 2025
Nov 5 2025
With a little bit of tweaking to the regex for api-gateway, we now have correctly labeled metrics and a (somewhat) useful rate limit graph.
I merged the mapping for the rest-gateway and exported metrics look pretty good:
cgoubert@deploy2002:/srv/deployment-charts/helmfile.d/services/rest-gateway$ curl http://localhost:9090/metrics | grep service_ | grep -v '#'
[...]
ratelimit_service_rest_gateway_near_limit{policy="experiment-2025-shadow",unit="HOUR",user_class="anon"} 1
ratelimit_service_rest_gateway_over_limit{policy="experiment-2025-shadow",unit="HOUR",user_class="anon"} 10
ratelimit_service_rest_gateway_shadow_mode{policy="experiment-2025-shadow",unit="HOUR",user_class="anon"} 10
ratelimit_service_rest_gateway_total_hits{policy="experiment-2025-shadow",unit="HOUR",user_class="anon"} 13
ratelimit_service_rest_gateway_within_limit{policy="experiment-2025-shadow",unit="HOUR",user_class="anon"} 3
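To sanity-check these numbers offline, here is a minimal Python sketch that parses the exported lines and computes the share of hits that went over the limit. It is not what the exporter or the dashboard does: `parse_metrics` and `over_limit_ratio` are hypothetical helper names, and the parser only handles the simple `name{labels} value` form shown above.

```python
import re

# The five samples from the curl output above.
SAMPLE = """\
ratelimit_service_rest_gateway_near_limit{policy="experiment-2025-shadow",unit="HOUR",user_class="anon"} 1
ratelimit_service_rest_gateway_over_limit{policy="experiment-2025-shadow",unit="HOUR",user_class="anon"} 10
ratelimit_service_rest_gateway_shadow_mode{policy="experiment-2025-shadow",unit="HOUR",user_class="anon"} 10
ratelimit_service_rest_gateway_total_hits{policy="experiment-2025-shadow",unit="HOUR",user_class="anon"} 13
ratelimit_service_rest_gateway_within_limit{policy="experiment-2025-shadow",unit="HOUR",user_class="anon"} 3
"""

# Assumption: every sample line has a label set and a single numeric value.
LINE_RE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text: str) -> dict:
    """Map (metric name, sorted label tuple) to the sample value."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore anything this simple parser does not understand
        labels = tuple(sorted(LABEL_RE.findall(match.group("labels"))))
        samples[(match.group("name"), labels)] = float(match.group("value"))
    return samples


def over_limit_ratio(samples: dict, prefix: str = "ratelimit_service_rest_gateway") -> float:
    """Fraction of total hits counted as over the limit, summed across label sets."""
    over = sum(v for (name, _), v in samples.items() if name == f"{prefix}_over_limit")
    total = sum(v for (name, _), v in samples.items() if name == f"{prefix}_total_hits")
    return over / total if total else 0.0


if __name__ == "__main__":
    print(f"{over_limit_ratio(parse_metrics(SAMPLE)):.1%}")  # prints 76.9%
```

In this sample, over_limit (10) plus within_limit (3) adds up to total_hits (13), which is what you'd hope to see from correctly labeled counters.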

