
Clement_Goubert (claime)
Senior SRE

Projects (8)

User Details

User Since
Jul 26 2022, 2:11 PM (176 w, 6 d)
Availability
Available
IRC Nick
claime
LDAP User
Clément Goubert
MediaWiki User
CGoubert-WMF

Recent Activity

Today

Clement_Goubert added a comment to T408760: Q2:rack/setup/install wikikube-worker refresh.

Thanks for the heads up @VRiley-WMF.

Mon, Dec 15, 5:34 PM · ops-eqiad, serviceops, DC-Ops, SRE
Clement_Goubert updated subscribers of T330997: Support locking cookbooks run except for switchover related cookbooks.

Not strictly, but it would be nice to have for peace of mind. This may be a task that @Blake can work on in the coming quarter as prep for the next switchover.

Mon, Dec 15, 5:22 PM · SRE-tools, Infrastructure-Foundations, Datacenter-Switchover, SRE, serviceops
Clement_Goubert updated subscribers of T412520: Ensure requests originating from InstantCommons on third party wikis doesn't get rate limited too much.

Tagging in @KCVelaga_WMF as we discussed this briefly in Lisbon.

Mon, Dec 15, 5:16 PM · MediaWiki-Platform-Team (Kanban Board), OKR-Work
Clement_Goubert added a comment to T412585: Epic: Enforce API rate limits (WE5.1.3c).

I haven't been able to find this in the related tasks: what does the error response you'll get look like? And is there some way to "force" an error response to test client libraries?

Mon, Dec 15, 5:11 PM · MediaWiki-Platform-Team (Roadmap), serviceops, Traffic, Epic, OKR-Work, MW-Interfaces-Team, FY2025-26 KR 5.1

Thu, Dec 4

Clement_Goubert added a comment to T411769: Migrate wikifeeds backend calls away from rest-gateway.

Yes, but it calls restGatewayGet, which routes through the rest-gateway instead of going directly to AQS.

Thu, Dec 4, 1:27 PM · Essential-Work, Content-Transform-Team, serviceops-radar, Wikifeeds
Clement_Goubert closed T411770: Migrate GrowthExperiments away from rest-gateway, a subtask of T410198: Determine the source of internal requests going through the API gateway., as Invalid.
Thu, Dec 4, 1:24 PM · MediaWiki-Platform-Team (Kanban Board), Content-Transform-Team (Work In Progress), Essential-Work, PageViewInfo, Growth-Team, serviceops, OKR-Work
Clement_Goubert closed T411770: Migrate GrowthExperiments away from rest-gateway, a subtask of T411641: Fix external calls to AQS in Wikimedia extensions, as Invalid.
Thu, Dec 4, 1:24 PM · MediaWiki-Platform-Team (Kanban Board), OKR-Work, Traffic, PageViewInfo, Math
Clement_Goubert closed T411770: Migrate GrowthExperiments away from rest-gateway as Invalid.

Thanks @taavi for pointing out this is client-side JS and not internal.

Thu, Dec 4, 1:24 PM · serviceops-radar, Growth-Team, GrowthExperiments
Clement_Goubert added a comment to T411641: Fix external calls to AQS in Wikimedia extensions.

I've linked the tasks I'd created as children of T410198: Determine the source of internal requests going through the API gateway., feel free to dedupe as wanted.

Thu, Dec 4, 1:13 PM · MediaWiki-Platform-Team (Kanban Board), OKR-Work, Traffic, PageViewInfo, Math
Clement_Goubert added a parent task for T411769: Migrate wikifeeds backend calls away from rest-gateway: T411641: Fix external calls to AQS in Wikimedia extensions.
Thu, Dec 4, 1:12 PM · Essential-Work, Content-Transform-Team, serviceops-radar, Wikifeeds
Clement_Goubert added a subtask for T411641: Fix external calls to AQS in Wikimedia extensions: T411769: Migrate wikifeeds backend calls away from rest-gateway.
Thu, Dec 4, 1:12 PM · MediaWiki-Platform-Team (Kanban Board), OKR-Work, Traffic, PageViewInfo, Math
Clement_Goubert added a parent task for T411770: Migrate GrowthExperiments away from rest-gateway: T411641: Fix external calls to AQS in Wikimedia extensions.
Thu, Dec 4, 1:12 PM · serviceops-radar, Growth-Team, GrowthExperiments
Clement_Goubert added a subtask for T411641: Fix external calls to AQS in Wikimedia extensions: T411770: Migrate GrowthExperiments away from rest-gateway.
Thu, Dec 4, 1:12 PM · MediaWiki-Platform-Team (Kanban Board), OKR-Work, Traffic, PageViewInfo, Math
Clement_Goubert added a subtask for T411641: Fix external calls to AQS in Wikimedia extensions: T411771: Migrate PageViewInfo calls away from rest-gateway.
Thu, Dec 4, 1:12 PM · MediaWiki-Platform-Team (Kanban Board), OKR-Work, Traffic, PageViewInfo, Math
Clement_Goubert added a parent task for T411771: Migrate PageViewInfo calls away from rest-gateway: T411641: Fix external calls to AQS in Wikimedia extensions.
Thu, Dec 4, 1:12 PM · serviceops-radar, PageViewInfo
Clement_Goubert added a subtask for T410198: Determine the source of internal requests going through the API gateway.: T411770: Migrate GrowthExperiments away from rest-gateway.
Thu, Dec 4, 1:11 PM · MediaWiki-Platform-Team (Kanban Board), Content-Transform-Team (Work In Progress), Essential-Work, PageViewInfo, Growth-Team, serviceops, OKR-Work
Clement_Goubert added a parent task for T411770: Migrate GrowthExperiments away from rest-gateway: T410198: Determine the source of internal requests going through the API gateway..
Thu, Dec 4, 1:11 PM · serviceops-radar, Growth-Team, GrowthExperiments
Clement_Goubert created P86400 (An Untitled Masterwork).
Thu, Dec 4, 1:08 PM
Clement_Goubert added a comment to T410764: MediaWiki periodic job startupregistrystats-mediawikiwiki failed.

If you rename them, the next time the alert fires, it will create a new task instead of appending to the existing open one.

IMO that's not ideal; these tasks will clutter up search results and will be hard to navigate. Couldn't it use a custom Maniphest field rather than the title for identifying which task is related to a script?

Thu, Dec 4, 1:02 PM · MediaWiki-Platform-Team, SRE Observability, serviceops
Clement_Goubert updated the task description for T411771: Migrate PageViewInfo calls away from rest-gateway.
Thu, Dec 4, 12:39 PM · serviceops-radar, PageViewInfo
Clement_Goubert added a comment to T410198: Determine the source of internal requests going through the API gateway..

As far as I can tell from logstash, calls identified as internal (that get no ratelimit_key) are definitely from Wikifeeds and MediaWiki itself, and all go to either page-analytics or device-analytics. The device-analytics calls also seem to originate from PageViewInfo.

Thu, Dec 4, 12:39 PM · MediaWiki-Platform-Team (Kanban Board), Content-Transform-Team (Work In Progress), Essential-Work, PageViewInfo, Growth-Team, serviceops, OKR-Work
Clement_Goubert added a subtask for T410198: Determine the source of internal requests going through the API gateway.: T411771: Migrate PageViewInfo calls away from rest-gateway.
Thu, Dec 4, 12:13 PM · MediaWiki-Platform-Team (Kanban Board), Content-Transform-Team (Work In Progress), Essential-Work, PageViewInfo, Growth-Team, serviceops, OKR-Work
Clement_Goubert added a parent task for T411771: Migrate PageViewInfo calls away from rest-gateway: T410198: Determine the source of internal requests going through the API gateway..
Thu, Dec 4, 12:13 PM · serviceops-radar, PageViewInfo
Clement_Goubert created T411771: Migrate PageViewInfo calls away from rest-gateway.
Thu, Dec 4, 12:13 PM · serviceops-radar, PageViewInfo
Clement_Goubert added a project to T411769: Migrate wikifeeds backend calls away from rest-gateway: Content-Transform-Team.
Thu, Dec 4, 12:10 PM · Essential-Work, Content-Transform-Team, serviceops-radar, Wikifeeds
Clement_Goubert added a project to T411770: Migrate GrowthExperiments away from rest-gateway: serviceops-radar.
Thu, Dec 4, 12:10 PM · serviceops-radar, Growth-Team, GrowthExperiments
Clement_Goubert updated the task description for T411770: Migrate GrowthExperiments away from rest-gateway.
Thu, Dec 4, 12:10 PM · serviceops-radar, Growth-Team, GrowthExperiments
Clement_Goubert created T411770: Migrate GrowthExperiments away from rest-gateway.
Thu, Dec 4, 12:09 PM · serviceops-radar, Growth-Team, GrowthExperiments
Clement_Goubert added a subtask for T410198: Determine the source of internal requests going through the API gateway.: T411769: Migrate wikifeeds backend calls away from rest-gateway.
Thu, Dec 4, 12:07 PM · MediaWiki-Platform-Team (Kanban Board), Content-Transform-Team (Work In Progress), Essential-Work, PageViewInfo, Growth-Team, serviceops, OKR-Work
Clement_Goubert added a parent task for T411769: Migrate wikifeeds backend calls away from rest-gateway: T410198: Determine the source of internal requests going through the API gateway..
Thu, Dec 4, 12:07 PM · Essential-Work, Content-Transform-Team, serviceops-radar, Wikifeeds
Clement_Goubert created T411769: Migrate wikifeeds backend calls away from rest-gateway.
Thu, Dec 4, 12:06 PM · Essential-Work, Content-Transform-Team, serviceops-radar, Wikifeeds
Clement_Goubert added projects to T410764: MediaWiki periodic job startupregistrystats-mediawikiwiki failed: serviceops, SRE Observability.

In general I think it would be nice to rename these auto-generated tasks and put something about the specific failure reason in their title. They are going to be very unhelpful when searching for related issues in the future.

Thu, Dec 4, 10:57 AM · MediaWiki-Platform-Team, SRE Observability, serviceops

Wed, Dec 3

Clement_Goubert updated subscribers of T411607: Security Issue Access Request for Blake.

cc @Kappakayala for SRE Manager approval

Wed, Dec 3, 11:59 AM · SecTeam-Processed, Security-Team, Security

Tue, Dec 2

Clement_Goubert added a comment to T407185: Fix Kafka replicas skew.

Thanks for all the support @brouberol <3

Tue, Dec 2, 4:39 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, Data-Engineering-Radar, serviceops, Observability-Logging, Data-Engineering
Clement_Goubert triaged T411508: network::constants::mw_appserver_networks is out of date (or named poorly?) as Low priority.

Probably a good task to pair on with @Blake or @jasmine_ on tracking down what's used where in puppet.

Tue, Dec 2, 3:49 PM · Patch-For-Review, Puppet, serviceops
Clement_Goubert added a comment to T411417: MediaWiki periodic job campaignevents-aggregateanswers-metawiki failed.

Hmm, I'd like to be able to actually see a failed run, so I'll change the job definition to keep runs for longer; that way I can inspect the actual Kubernetes objects next time it fails. CR uploaded.

Tue, Dec 2, 2:59 PM · serviceops, Connection-Team
Clement_Goubert added a comment to T341553: Allow running one-off scripts manually.

Not sure if it has already been mentioned, but mwscript-k8s doesn't seem to support running maintenance scripts that are not defined in public code.

Last time I tried, you could do it with --file.

--file has the following documentation which to me implied I could not use it to define the script but only input files:

--file FILE           Copy a text file into the MediaWiki container (in the script's working directory) to be used as script input. Format:
                       path/to/local-file.txt[:remote-file.txt] -- omit colon section to use the same filename (with any leading path stripped).
                       Pass --file again to copy multiple files.

I also read https://wikitech.wikimedia.org/wiki/Maintenance_scripts#Not_yet_supported as saying that non-text files (i.e. non .txt files) are not supported as input

Perhaps the documentation needs updating to indicate this functionality or at least say that the argument also supports more than just "text file[s] .. to be used as script input"?

Tue, Dec 2, 2:36 PM · MW-on-K8s, serviceops
Clement_Goubert added a comment to T327663: Create a visual representation of where each service is active from, any given time.

Is anything still needed beyond the functionality in sudo cookbook -d sre.discovery.datacenter status all? That provides the following table:

Service                       Type           eqiad     codfw
=================================================================
apertium                      Active/Active  pooled    pooled    
api-gateway                   Active/Active  pooled    pooled    
apt                           Active/Passive pooled              
apus                          Active/Active  pooled    pooled    
citoid                        Active/Active  pooled    pooled    
config-master                 Active/Active  pooled    pooled    
cxserver                      Active/Active  pooled    pooled    
device-analytics              Active/Active  pooled    pooled    
docker-registry               Active/Passive           pooled    
echostore                     Active/Active  pooled    pooled    
eventgate-analytics           Active/Active  pooled    pooled    
[...]
Tue, Dec 2, 11:51 AM · Patch-For-Review, serviceops, observability
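
For anyone wanting to consume that status table programmatically, here is a minimal sketch of a fixed-width parser. It assumes the column layout seen in the pasted output (columns aligned under the header labels, an empty cell meaning "not pooled"); the helper names and the sample rows are illustrative and not part of the cookbook.

```python
def parse_status_table(text):
    """Parse the fixed-width status table into {service: {...}}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0]
    # Column boundaries are derived from where each header label starts.
    starts = [header.index(col) for col in ("Service", "Type", "eqiad", "codfw")]
    ends = starts[1:] + [None]
    result = {}
    for line in lines[1:]:
        stripped = line.strip()
        # Skip the ===== rule and the elision marker.
        if set(stripped) == {"="} or stripped == "[...]":
            continue
        service, kind, eqiad, codfw = (line[s:e].strip() for s, e in zip(starts, ends))
        result[service] = {
            "type": kind,
            "eqiad": eqiad == "pooled",
            "codfw": codfw == "pooled",
        }
    return result


def _row(service, kind, eqiad, codfw):
    # Assumed column widths (30/15/10/10), matching the pasted table.
    return f"{service:<30}{kind:<15}{eqiad:<10}{codfw:<10}"


SAMPLE = "\n".join([
    _row("Service", "Type", "eqiad", "codfw").rstrip(),
    "=" * 65,
    _row("apertium", "Active/Active", "pooled", "pooled"),
    _row("apt", "Active/Passive", "pooled", ""),
    _row("docker-registry", "Active/Passive", "", "pooled"),
    "[...]",
])

status = parse_status_table(SAMPLE)
for name, info in status.items():
    print(name, info)
```

Slicing by header offsets rather than splitting on whitespace matters here, because an Active/Passive service pooled only in codfw has an empty eqiad cell.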

Wed, Nov 26

Clement_Goubert added a comment to T407185: Fix Kafka replicas skew.

Rebalance done on kafka-main-eqiad

Partition count stays imbalanced due to partition size variance, but storage is now balanced which should equalize storage and bandwidth needs.

Wed, Nov 26, 4:39 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, Data-Engineering-Radar, serviceops, Observability-Logging, Data-Engineering
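
To illustrate why partition counts can stay uneven while storage evens out, here is a small sketch that assigns partitions to brokers greedily by size. The partition sizes and broker count are made-up numbers, and this is not the algorithm of the tool used for the actual rebalance; it only shows the effect described above.

```python
import heapq


def assign_by_size(partition_sizes, broker_count):
    """Place each partition (largest first) on the least-loaded broker."""
    heap = [(0, broker) for broker in range(broker_count)]
    heapq.heapify(heap)
    loads = {broker: 0 for broker in range(broker_count)}
    counts = {broker: 0 for broker in range(broker_count)}
    for size in sorted(partition_sizes, reverse=True):
        load, broker = heapq.heappop(heap)
        loads[broker] = load + size
        counts[broker] += 1
        heapq.heappush(heap, (loads[broker], broker))
    return loads, counts


# Hypothetical partition sizes in GB, with one very large partition.
sizes = [100, 60, 40, 30, 30, 20, 10, 10]
loads, counts = assign_by_size(sizes, broker_count=3)
print("storage per broker:", loads)
print("partitions per broker:", counts)
```

With these numbers every broker ends up with the same storage, yet one broker holds a single partition and another holds four.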
Clement_Goubert added a comment to T411104: mwscript-k8s makes it challenging to run LoggedUpdateMaintenance across all wikis.

Thank you!

The issue with adding some type of log like you're mentioning is that set -e will exit mwscript-mwcron immediately without giving the opportunity to log which wiki was being processed, and neither the mwscript-k8s wrapper nor the Kubernetes Pod the script is running inside of is aware of what's happening within the loop.

Do we have to use set -e though? Perhaps we can check $? manually and, if it is non-zero, perform any additional logging we want. Or maybe there is a reason why checking $? is not advisable?

Wed, Nov 26, 4:21 PM · Patch-For-Review, Growth-Team, MediaWiki-Engineering, serviceops, MediaWiki-Maintenance-system, MW-on-K8s
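
The idea of checking the exit status manually, so the failing wiki can be logged before moving on, can be sketched as follows. This is written in Python rather than shell, and it is not how mwscript-mwcron is implemented; the wiki names and the stand-in command are hypothetical.

```python
import subprocess
import sys


def run_for_each_wiki(wikis, build_command):
    """Run one command per wiki; log and record failures instead of aborting."""
    failures = {}
    for wiki in wikis:
        proc = subprocess.run(build_command(wiki), capture_output=True, text=True)
        if proc.returncode != 0:
            # The loop still knows which wiki it was on, so it can say so.
            print(f"{wiki}: exited with status {proc.returncode}", file=sys.stderr)
            failures[wiki] = proc.returncode
    return failures


def fake_command(wiki):
    # Stand-in for a maintenance script: fails only for "brokenwiki".
    code = 3 if wiki == "brokenwiki" else 0
    return [sys.executable, "-c", f"import sys; sys.exit({code})"]


failures = run_for_each_wiki(["aawiki", "brokenwiki", "zzwiki"], fake_command)
print(failures)
```

The trade-off is that the loop no longer stops at the first error, so the caller has to decide what a non-empty failure map means for the overall exit status.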
Clement_Goubert added a comment to T411104: mwscript-k8s makes it challenging to run LoggedUpdateMaintenance across all wikis.

Added documentation of FOREACHWIKI_IGNORE_ERRORS https://wikitech.wikimedia.org/w/index.php?title=Maintenance_scripts&oldid=2365511

Wed, Nov 26, 3:56 PM · Patch-For-Review, Growth-Team, MediaWiki-Engineering, serviceops, MediaWiki-Maintenance-system, MW-on-K8s
Clement_Goubert closed T411066: wikifunctions.org API no longer works via that URL (without 'www.') as Resolved.

Deployed and tested quickly, looks like it's fixed for me, resolving.
Feel free to reopen if there are still issues.

Wed, Nov 26, 12:35 PM · serviceops, SRE, Traffic, Wikimedia-Apache-configuration, MW-Interfaces-Team, Abstract Wikipedia team, MediaWiki-Action-API, Wikifunctions
Clement_Goubert changed the status of T407185: Fix Kafka replicas skew from Open to In Progress.
Wed, Nov 26, 11:40 AM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, Data-Engineering-Radar, serviceops, Observability-Logging, Data-Engineering
Clement_Goubert claimed T411066: wikifunctions.org API no longer works via that URL (without 'www.').
Wed, Nov 26, 11:29 AM · serviceops, SRE, Traffic, Wikimedia-Apache-configuration, MW-Interfaces-Team, Abstract Wikipedia team, MediaWiki-Action-API, Wikifunctions
Clement_Goubert added a comment to T411066: wikifunctions.org API no longer works via that URL (without 'www.').

Maybe related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1198941

In trafficserver/gateway-check.lua.conf, "wikifunctions.org" is not in the "group0"/"group1"/"group2" config (which is right), so the rest gateway was not used for wikifunctions.org/w/api.php until that patch.

Incidentally, in profile/trafficserver/backend.yaml I see an early "target: http://www.wikifunctions.org/w/api.php" rule before the generic "target: 'http://(.*)/w/api.php'" rule. The former does not have the gateway-check.lua plugin and is meant to cause the use of a special mw-on-k8 cluster. There is no specific rule for wikifunctions.org, so it falls into the generic rule. I'm not sure why it 404s, though it might be related to rest-gateway/values.yaml not having domains/FQDNs for wikifunctions.org / www.wikifunctions.org.

Wed, Nov 26, 11:28 AM · serviceops, SRE, Traffic, Wikimedia-Apache-configuration, MW-Interfaces-Team, Abstract Wikipedia team, MediaWiki-Action-API, Wikifunctions
Clement_Goubert added a comment to T410748: MediaWiki periodic job campaignevents-aggregateanswers-metawiki failed.

The commands should be run on deployment.eqiad.wmnet. They are in the task just for ease of copy/pasting; more complete troubleshooting documentation is on Wikitech.

Thank you, I didn't realize there were more instructions there.

I had tried the commands there, but I don't have permission to run them; that's why I wondered if I was doing something wrong. (I am in restricted but not deployment.) In fact, the only deployer on the team is currently OOO (@cmelo). Is there anything else I could try? I still think it was probably just a fluke, but double-checking wouldn't hurt.

Wed, Nov 26, 11:18 AM · serviceops, Connection-Team

Tue, Nov 25

Clement_Goubert closed T405950: eqiad row C/D Service Ops host migrations as Resolved.

All ServiceOps hosts have been migrated to the new switch.

Tue, Nov 25, 4:52 PM · serviceops, SRE, DC-Ops, ops-eqiad
Clement_Goubert closed T405950: eqiad row C/D Service Ops host migrations, a subtask of T404609: eqiad: rows C/D Upgrade Tracking, as Resolved.
Tue, Nov 25, 4:52 PM · SRE, Infrastructure-Foundations, netops, DC-Ops, ops-eqiad
Clement_Goubert added a comment to T407185: Fix Kafka replicas skew.

Waiting until T405950: eqiad row C/D Service Ops host migrations is done with moving the kafka-main nodes so we don't run into a network blip if the rebalance takes a while

Tue, Nov 25, 3:23 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, Data-Engineering-Radar, serviceops, Observability-Logging, Data-Engineering
Clement_Goubert closed T400490: Capacity plan for REST gateway , a subtask of T400130: Central REST gateway for APIs, as Resolved.
Tue, Nov 25, 3:11 PM · MW-Interfaces-Team (MWI-Roadmap), serviceops, Epic, OKR-Work
Clement_Goubert closed T400490: Capacity plan for REST gateway as Resolved.
Tue, Nov 25, 3:11 PM · MW-Interfaces-Team, serviceops, OKR-Work
Clement_Goubert closed T410612: Requesting access to ops for blake as Resolved.
Tue, Nov 25, 3:11 PM · SRE, SRE-Access-Requests
Clement_Goubert added a comment to T410748: MediaWiki periodic job campaignevents-aggregateanswers-metawiki failed.

The commands should be run on deployment.eqiad.wmnet. They are in the task just for ease of copy/pasting; more complete troubleshooting documentation is on Wikitech.

Tue, Nov 25, 3:09 PM · serviceops, Connection-Team
Clement_Goubert closed T406607: [Hypothesis] 5.2.2b: Action API Rerouting as Resolved.
Tue, Nov 25, 1:03 PM · OKR-Work, [MWI] FY2025-26 Q2, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert closed T408223: Action API via rest-gateway production rollout as Resolved.
Tue, Nov 25, 1:02 PM · OKR-Work, [MWI] FY2025-26 Q2, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert closed T408223: Action API via rest-gateway production rollout, a subtask of T406607: [Hypothesis] 5.2.2b: Action API Rerouting, as Resolved.
Tue, Nov 25, 1:02 PM · OKR-Work, [MWI] FY2025-26 Q2, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert moved T345868: Rename the shellbox service to shellbox-score from 🛎 Services & Oids to 🥋Good First Task on the serviceops board.
Tue, Nov 25, 11:54 AM · Shellbox, serviceops

Mon, Nov 24

Clement_Goubert added a comment to T410612: Requesting access to ops for blake.

This broke Puppet runs on the puppetservers:

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, User 'blake' lacks an api token for HP: please add it in profile::conftool::hiddenparma::api_tokens (file: /srv/puppet_code/environments/production/modules/profile/manifests/conftool/requestctl_client.pp, line: 34, column: 17) on node puppetserver2001.codfw.wmnet

See the inline comment in data.yaml: https://github.com/wikimedia/operations-puppet/blob/production/modules/admin/data/data.yaml#L68

Mon, Nov 24, 3:05 PM · SRE, SRE-Access-Requests
Clement_Goubert closed T410612: Requesting access to ops for blake as Resolved.
Mon, Nov 24, 1:17 PM · SRE, SRE-Access-Requests
Clement_Goubert updated the task description for T408223: Action API via rest-gateway production rollout.
Mon, Nov 24, 10:34 AM · OKR-Work, [MWI] FY2025-26 Q2, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert closed T410858: Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) as Resolved.
Mon, Nov 24, 9:43 AM · serviceops, sre-alert-triage

Thu, Nov 20

Clement_Goubert moved T330996: Add scap lock/unlock steps to sre.switchdc.mediawiki cookbook from this.quarter 🍕 to 🥋Good First Task on the serviceops board.
Thu, Nov 20, 4:54 PM · serviceops, Datacenter-Switchover, SRE
Clement_Goubert moved T401396: Revisit backend routing for rest-gateway from Incoming 🐫 to 🥋Good First Task on the serviceops board.

I'd like to do the actual implementation of this with @Blake in the upcoming months.

Thu, Nov 20, 4:46 PM · MW-Interfaces-Team (MWI-Roadmap), serviceops, Epic, OKR-Work
Clement_Goubert updated subscribers of T410612: Requesting access to ops for blake.

@KOfori Could you approve this?

Thu, Nov 20, 4:45 PM · SRE, SRE-Access-Requests
Clement_Goubert moved T410537: Add a --rack flag to sre.k8s.pool-depool-node from Incoming 🐫 to 🥋Good First Task on the serviceops board.
Thu, Nov 20, 4:43 PM · Infrastructure-Foundations, SRE-tools, serviceops
Clement_Goubert removed a project from T410573: October 2025 Bullseye reboots: Search Platform-owned hosts: serviceops.
Thu, Nov 20, 12:56 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, SecTeam-Processed, SRE-swift-storage, Infrastructure Security, SRE, Security
Clement_Goubert updated the task description for T408223: Action API via rest-gateway production rollout.
Thu, Nov 20, 12:40 PM · OKR-Work, [MWI] FY2025-26 Q2, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert assigned T410552: Improve detection of kafka-main broker TLS certificate rotations to Blake.
Thu, Nov 20, 12:03 PM · Patch-For-Review, serviceops
Clement_Goubert triaged T410612: Requesting access to ops for blake as Medium priority.

@Kappakayala and @hnowlan being OOO, @mark could I get approval for this please?

Thu, Nov 20, 11:40 AM · SRE, SRE-Access-Requests

Wed, Nov 19

Clement_Goubert added projects to T410198: Determine the source of internal requests going through the API gateway.: Content-Transform-Team, Growth-Team, PageViewInfo.

Tagging:

Wed, Nov 19, 12:53 PM · MediaWiki-Platform-Team (Kanban Board), Content-Transform-Team (Work In Progress), Essential-Work, PageViewInfo, Growth-Team, serviceops, OKR-Work

Tue, Nov 18

Clement_Goubert added a comment to T407185: Fix Kafka replicas skew.
Validating broker list:
  Broker 1003 does not have a rack.id defined
  Broker 1001 does not have a rack.id defined
  Broker 1004 does not have a rack.id defined
  Broker 1005 does not have a rack.id defined
  Broker 1002 does not have a rack.id defined
  -
Tue, Nov 18, 4:04 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, Data-Engineering-Radar, serviceops, Observability-Logging, Data-Engineering
Clement_Goubert added a comment to T410198: Determine the source of internal requests going through the API gateway..

According to the rest-gateway logs, both MediaWiki itself and Wikifeeds are making direct calls to rest-gateway for the page-analytics_pageviews route.

Tue, Nov 18, 2:18 PM · MediaWiki-Platform-Team (Kanban Board), Content-Transform-Team (Work In Progress), Essential-Work, PageViewInfo, Growth-Team, serviceops, OKR-Work
Clement_Goubert added a comment to T405950: eqiad row C/D Service Ops host migrations.

@Clement_Goubert,

Is it possible that I could send the commands for this, or do we need someone in your team? If we need someone in your team, could we schedule an hour or so for this tomorrow (Tuesday, Nov 18th) at 17:00 GMT start time? (So 9:00 AM Pacific?)

Apart from the drain (which I've not done and which you've said someone in your team should do), the other steps to move a host take about 5 minutes total per host, with network connectivity loss of approximately 1 minute or less. (We ping the host during the move and it misses fewer than 12 ping sequence numbers.)

Tue, Nov 18, 1:35 PM · serviceops, SRE, DC-Ops, ops-eqiad
Clement_Goubert added a comment to T410273: api rate limiting: Assign ratelimit class based on IP range.

Hmm, we should probably also figure out a way to route these to mw-api-int instead of mw-api-ext somehow. I have to think about this.

Tue, Nov 18, 1:30 PM · MediaWiki-Platform-Team (Kanban Board), Patch-For-Review, serviceops, OKR-Work
Clement_Goubert added a comment to T407185: Fix Kafka replicas skew.

The kafka-main rebalance question is now pretty critical to figure out. One of the brokers' certificates is expiring in 2 days, so the brokers need a roll-restart to pick up the new one. As far as I can tell, this operation also requires topics to be balanced, which is not the case at the moment.

Tue, Nov 18, 1:15 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, Data-Engineering-Radar, serviceops, Observability-Logging, Data-Engineering
Clement_Goubert added a comment to T400969: Alert in need of triage: KubernetesWorkerUnschedulable .

Silencing for 3 months.

Tue, Nov 18, 12:34 PM · serviceops, sre-alert-triage
Clement_Goubert removed projects from T290357: Maintenance environment needed for running one-off commands: Prod-Kubernetes, serviceops.
Tue, Nov 18, 12:04 PM · Kubernetes, Toolhub
Clement_Goubert reopened T401396: Revisit backend routing for rest-gateway, a subtask of T400130: Central REST gateway for APIs, as In Progress.
Tue, Nov 18, 11:45 AM · MW-Interfaces-Team (MWI-Roadmap), serviceops, Epic, OKR-Work
Clement_Goubert reopened T401396: Revisit backend routing for rest-gateway as "In Progress".

Reopening for followup discussion of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1204865

Tue, Nov 18, 11:45 AM · MW-Interfaces-Team (MWI-Roadmap), serviceops, Epic, OKR-Work

Nov 13 2025

Clement_Goubert added a comment to T408223: Action API via rest-gateway production rollout.

I was checking some graphs this evening while finalizing the plan for winding down PHP_ENGINE routing tomorrow, and I ran into something odd. Compare mw-api-ext traffic served by its next release over the last 3 days in eqiad vs. codfw (note: you can see the same effect on the set of releases that share an endpoints object with main but it's much noisier).

Specifically, what shifted a bunch of mw-api-ext traffic from codfw to eqiad around 15:00 UTC on the 11th and 9:00 UTC on the 12th?

I then realized this has to be when https://gerrit.wikimedia.org/r/1198936 and https://gerrit.wikimedia.org/r/1198937 were applied across the CDN fleet.

Although we created rest-gateway-ro.discovery.wmnet and updated multi-dc.lua to ensure A/A routing to rest-gateway works, rest-gateway itself does not implement A/A routing. Stated differently, any mw-api-ext-bound traffic routed via rest-gateway will always hit the primary DC on the upstream side.

Nov 13 2025, 10:28 AM · OKR-Work, [MWI] FY2025-26 Q2, MW-Interfaces-Team (MWI-Roadmap)

Nov 12 2025

Clement_Goubert updated the task description for T408223: Action API via rest-gateway production rollout.
Nov 12 2025, 3:25 PM · OKR-Work, [MWI] FY2025-26 Q2, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert updated subscribers of T409817: Alert in need of triage: SystemdUnitFailed (instance registry1005:9100).

@Blake will handle this one

Nov 12 2025, 9:31 AM · serviceops, sre-alert-triage
Clement_Goubert added a comment to T408757: Q2:rack/setup/install wikikube-worker2332-56.

Puppet updated, but we've got some work to do so the hosts can be racked in E/F (see https://phabricator.wikimedia.org/T405285#11350683 )
We'll get back to you on this asap.

Nov 12 2025, 9:12 AM · SRE, ops-codfw, serviceops, DC-Ops
Clement_Goubert reassigned T408760: Q2:rack/setup/install wikikube-worker refresh from Clement_Goubert to Jhancock.wm.

Puppet updated

Nov 12 2025, 9:11 AM · ops-eqiad, serviceops, DC-Ops, SRE
Clement_Goubert assigned T408752: Q2:rack/setup/install wikikube-worker1335-59 to Jhancock.wm.

Puppet updated

Nov 12 2025, 9:10 AM · SRE, ops-eqiad, serviceops, DC-Ops
Clement_Goubert reassigned T408749: Q2:rack/setup/install wikikube-worker11XX from Clement_Goubert to Jhancock.wm.

Puppet updated

Nov 12 2025, 9:10 AM · SRE, ops-eqiad, serviceops, DC-Ops
Clement_Goubert updated the task description for T408223: Action API via rest-gateway production rollout.
Nov 12 2025, 8:44 AM · OKR-Work, [MWI] FY2025-26 Q2, MW-Interfaces-Team (MWI-Roadmap)

Nov 11 2025

Clement_Goubert updated the task description for T408223: Action API via rest-gateway production rollout.
Nov 11 2025, 2:56 PM · OKR-Work, [MWI] FY2025-26 Q2, MW-Interfaces-Team (MWI-Roadmap)

Nov 10 2025

Clement_Goubert updated the task description for T408223: Action API via rest-gateway production rollout.
Nov 10 2025, 11:38 AM · OKR-Work, [MWI] FY2025-26 Q2, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert added projects to T405636: api-gateway helm chart: rest routes should return retry-after when a rate limit applies.: Traffic, serviceops.

After thinking about it a bit, I tend to agree with @pmiazga
Adding Traffic for their opinion

Nov 10 2025, 11:14 AM · MediaWiki-Platform-Team (Kanban Board), serviceops, Traffic, OKR-Work

Nov 5 2025

Clement_Goubert triaged T408183: api-gateway chart: metrics mapping for rerst-gateway as Medium priority.
Nov 5 2025, 4:58 PM · serviceops, OKR-Work, MediaWiki-Platform-Team
Clement_Goubert changed the status of T408183: api-gateway chart: metrics mapping for rerst-gateway, a subtask of T398919: Epic: API rate limiting dry run (WE5.1.3b), from Open to In Progress.
Nov 5 2025, 4:58 PM · MediaWiki-Platform-Team (Kanban Board), Epic, OKR-Work
Clement_Goubert changed the status of T408183: api-gateway chart: metrics mapping for rerst-gateway, a subtask of T406498: Test api rate limiting on production cluster, from Open to In Progress.
Nov 5 2025, 4:58 PM · serviceops, OKR-Work, MediaWiki-Platform-Team
Clement_Goubert changed the status of T408183: api-gateway chart: metrics mapping for rerst-gateway from Open to In Progress.
Nov 5 2025, 4:58 PM · serviceops, OKR-Work, MediaWiki-Platform-Team
Clement_Goubert added a comment to T408183: api-gateway chart: metrics mapping for rerst-gateway.

With a little bit of tweaking to the regex for api-gateway we now have correctly labeled metrics, and a (somewhat) useful rate limit graph

Nov 5 2025, 4:57 PM · serviceops, OKR-Work, MediaWiki-Platform-Team
Clement_Goubert added a comment to T408183: api-gateway chart: metrics mapping for rerst-gateway.

I merged the mapping for the rest-gateway and exported metrics look pretty good:

cgoubert@deploy2002:/srv/deployment-charts/helmfile.d/services/rest-gateway$ curl http://localhost:9090/metrics | grep service_ | grep -v '#'
[...]
ratelimit_service_rest_gateway_near_limit{policy="experiment-2025-shadow",unit="HOUR",user_class="anon"} 1
ratelimit_service_rest_gateway_over_limit{policy="experiment-2025-shadow",unit="HOUR",user_class="anon"} 10
ratelimit_service_rest_gateway_shadow_mode{policy="experiment-2025-shadow",unit="HOUR",user_class="anon"} 10
ratelimit_service_rest_gateway_total_hits{policy="experiment-2025-shadow",unit="HOUR",user_class="anon"} 13
ratelimit_service_rest_gateway_within_limit{policy="experiment-2025-shadow",unit="HOUR",user_class="anon"} 3
Nov 5 2025, 3:43 PM · serviceops, OKR-Work, MediaWiki-Platform-Team
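
As a quick sanity check on exported metrics like these, the lines can be parsed and cross-checked. The sketch below is a deliberately simplified parser (no escaped quotes, no timestamps), fed with the sample lines from the comment above; for anything beyond a quick check, a real exposition-format parser such as the one in the prometheus_client Python library would be the safer choice.

```python
import re

LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) from Prometheus text lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if not match:
            continue  # skip anything that is not a sample line
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


SAMPLE = '''
ratelimit_service_rest_gateway_near_limit{policy="experiment-2025-shadow",unit="HOUR",user_class="anon"} 1
ratelimit_service_rest_gateway_over_limit{policy="experiment-2025-shadow",unit="HOUR",user_class="anon"} 10
ratelimit_service_rest_gateway_shadow_mode{policy="experiment-2025-shadow",unit="HOUR",user_class="anon"} 10
ratelimit_service_rest_gateway_total_hits{policy="experiment-2025-shadow",unit="HOUR",user_class="anon"} 13
ratelimit_service_rest_gateway_within_limit{policy="experiment-2025-shadow",unit="HOUR",user_class="anon"} 3
'''

samples = parse_metrics(SAMPLE)
prefix = "ratelimit_service_rest_gateway_"
by_name = {name[len(prefix):]: value for name, _labels, value in samples}
over_share = by_name["over_limit"] / by_name["total_hits"]
print(by_name)
print(f"share of hits over the limit: {over_share:.2f}")
```

In this sample, within_limit plus over_limit adds up to total_hits, and over_limit equals shadow_mode.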
Clement_Goubert updated the task description for T408223: Action API via rest-gateway production rollout.
Nov 5 2025, 11:02 AM · OKR-Work, [MWI] FY2025-26 Q2, MW-Interfaces-Team (MWI-Roadmap)

Nov 4 2025

Clement_Goubert updated the task description for T408223: Action API via rest-gateway production rollout.
Nov 4 2025, 10:50 AM · OKR-Work, [MWI] FY2025-26 Q2, MW-Interfaces-Team (MWI-Roadmap)

Nov 3 2025

Clement_Goubert added a member for WMF-NDA: Blake.
Nov 3 2025, 3:10 PM
Clement_Goubert added a member for acl*sre-team: Blake.
Nov 3 2025, 2:54 PM