
Yesterday

Clement_Goubert renamed T414434: [WE5.4.8b] Bandwidth-Based Media Rate Limiting from [WE5.4.8] Media rate limiting to [WE5.4.8b] Bandwidth-Based Media Rate Limiting.
Tue, Apr 21, 4:10 PM · ServiceOps new, Epic
Clement_Goubert added a project to T423175: wikikube-worker2190 System Configuration Check error: ServiceOps-Upgrades-Hardware.
Tue, Apr 21, 4:08 PM · ServiceOps-Upgrades-Hardware, SRE, ServiceOps new, DC-Ops, ops-codfw
Clement_Goubert triaged T416112: HTTP 500/503 error on [[skwiki:Zoznam_SD_objektov_2]] due to php getting SIGABRT as Medium priority.
Tue, Apr 21, 4:06 PM · ServiceOps-Mediawiki, ServiceOps new, Wikimedia-production-error, affects-Kiwix-and-openZIM
Clement_Goubert updated the task description for T414434: [WE5.4.8b] Bandwidth-Based Media Rate Limiting.
Tue, Apr 21, 3:01 PM · ServiceOps new, Epic
Clement_Goubert triaged T424052: Tune policy from shadow-mode data before enforcement as High priority.
Tue, Apr 21, 3:00 PM · ServiceOps new, ServiceOps-Services-Oids
Clement_Goubert created T424052: Tune policy from shadow-mode data before enforcement.
Tue, Apr 21, 3:00 PM · ServiceOps new, ServiceOps-Services-Oids
Clement_Goubert triaged T424051: Create redioscope survey for upload rate limit as High priority.
Tue, Apr 21, 2:59 PM · ServiceOps new, ServiceOps-Services-Oids
Clement_Goubert created T424051: Create redioscope survey for upload rate limit.
Tue, Apr 21, 2:58 PM · ServiceOps new, ServiceOps-Services-Oids
Clement_Goubert triaged T424047: Build a dashboard for monitoring and analyzing media rate limiting behavior as High priority.
Tue, Apr 21, 2:55 PM · ServiceOps new, ServiceOps-Services-Oids
Clement_Goubert created T424047: Build a dashboard for monitoring and analyzing media rate limiting behavior.
Tue, Apr 21, 2:55 PM · ServiceOps new, ServiceOps-Services-Oids
Clement_Goubert triaged T424045: Validate chunked-transfer rate limit behavior as High priority.
Tue, Apr 21, 2:51 PM · ServiceOps new, ServiceOps-Services-Oids
Clement_Goubert created T424045: Validate chunked-transfer rate limit behavior.
Tue, Apr 21, 2:51 PM · ServiceOps new, ServiceOps-Services-Oids
Clement_Goubert updated the task description for T414434: [WE5.4.8b] Bandwidth-Based Media Rate Limiting.
Tue, Apr 21, 1:24 PM · ServiceOps new, Epic
Clement_Goubert changed the status of T422804: Reroute LiftWing endpoints from In Progress to Stalled.

After some finagling with Lua patterns, everything on our side is now in place to start migrating traffic once rate limits for Lift Wing are implemented. Marking as stalled pending resolution of T413448: [5.2.2c Epic]: Support Higher Rate Limits for Lift Wing.

Tue, Apr 21, 11:22 AM · Patch-For-Review, Lift-Wing, Machine-Learning-Team, ServiceOps-SharedInfra, ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert changed the status of T422804: Reroute LiftWing endpoints, a subtask of T413438: [Hypothesis] 5.2.2c: Reroute API Portal Endpoints, from In Progress to Stalled.
Tue, Apr 21, 11:21 AM · ServiceOps-SharedInfra, ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert updated the task description for T422804: Reroute LiftWing endpoints.
Tue, Apr 21, 11:16 AM · Patch-For-Review, Lift-Wing, Machine-Learning-Team, ServiceOps-SharedInfra, ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert updated the task description for T422804: Reroute LiftWing endpoints.
Tue, Apr 21, 10:14 AM · Patch-For-Review, Lift-Wing, Machine-Learning-Team, ServiceOps-SharedInfra, ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)

Mon, Apr 20

Clement_Goubert closed T423177: wikikube-worker2188 bus errors as Resolved.

tyvm :)

Mon, Apr 20, 3:51 PM · SRE, ServiceOps new, DC-Ops, ops-codfw
Clement_Goubert added a comment to T413448: [5.2.2c Epic]: Support Higher Rate Limits for Lift Wing.

Please ping @Blake when the rate limits are enforced so they can start migrating traffic, thanks.

Mon, Apr 20, 3:00 PM · Patch-For-Review, MediaWiki-Platform-Team (Q3 Kanban Board), Epic, OKR-Work, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert updated the task description for T422804: Reroute LiftWing endpoints.
Mon, Apr 20, 10:21 AM · Patch-For-Review, Lift-Wing, Machine-Learning-Team, ServiceOps-SharedInfra, ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert updated the task description for T422804: Reroute LiftWing endpoints.
Mon, Apr 20, 10:20 AM · Patch-For-Review, Lift-Wing, Machine-Learning-Team, ServiceOps-SharedInfra, ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert updated the task description for T422804: Reroute LiftWing endpoints.
Mon, Apr 20, 10:02 AM · Patch-For-Review, Lift-Wing, Machine-Learning-Team, ServiceOps-SharedInfra, ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert updated the task description for T422804: Reroute LiftWing endpoints.
Mon, Apr 20, 9:51 AM · Patch-For-Review, Lift-Wing, Machine-Learning-Team, ServiceOps-SharedInfra, ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert added a comment to T423177: wikikube-worker2188 bus errors.

Depooled and downtimed for 30 days, all yours.

Mon, Apr 20, 8:15 AM · SRE, ServiceOps new, DC-Ops, ops-codfw
Clement_Goubert moved T423177: wikikube-worker2188 bus errors from Inbox to Radar (Pending) on the ServiceOps new board.
Mon, Apr 20, 8:13 AM · SRE, ServiceOps new, DC-Ops, ops-codfw
Clement_Goubert moved T423175: wikikube-worker2190 System Configuration Check error from Inbox to Scheduled (this Q) on the ServiceOps new board.
Mon, Apr 20, 8:12 AM · ServiceOps-Upgrades-Hardware, SRE, ServiceOps new, DC-Ops, ops-codfw
Clement_Goubert moved T423723: Upgrade kafka-logging to version 3.x from Inbox to Radar (Pending) on the ServiceOps new board.
Mon, Apr 20, 8:11 AM · Patch-For-Review, ServiceOps-Datastores, ServiceOps new, Infrastructure-Foundations, SRE
Clement_Goubert moved T423719: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13[73-82] from Inbox to Radar (Awareness) on the ServiceOps new board.
Mon, Apr 20, 8:10 AM · ServiceOps new, ServiceOps-Upgrades-Hardware, SRE, ops-eqiad, DC-Ops
Clement_Goubert added a comment to T422804: Reroute LiftWing endpoints.

@Clement_Goubert Hi, do we need to change anything on the recommendation-api side for this change?

Mon, Apr 20, 7:48 AM · Patch-For-Review, Lift-Wing, Machine-Learning-Team, ServiceOps-SharedInfra, ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert added a comment to T419976: Upgrade redis_misc hosts to Debian Trixie (Redis 8.0).

I *think* that api-gateway/Ratelimit is just storing ephemeral counters. You'd want to check with @daniel though, in case there is some token/session stuff (though we try to keep that stateless, with expiration, to avoid storage lookups). I haven't been involved in the ongoing API rate limiter work.

For the ratelimit service we don't care about losing state. A reset of all counters is fine.

As far as I know, the ratelimit counters are the only thing the gateway(s) use Redis for, but I am not 100% sure. Best check with @Clement_Goubert and @hnowlan.

Mon, Apr 20, 7:40 AM · Patch-For-Review, Infrastructure-Foundations, ServiceOps new, MediaWiki-Platform-Team (Radar), ServiceOps-Datastores, MW-Interfaces-Team

Thu, Apr 16

Clement_Goubert updated the task description for T422804: Reroute LiftWing endpoints.
Thu, Apr 16, 4:35 PM · Patch-For-Review, Lift-Wing, Machine-Learning-Team, ServiceOps-SharedInfra, ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert added a comment to T422804: Reroute LiftWing endpoints.

Routes merged into the rest-gateway, initial tests look good:

$ curl -H 'Host: api.wikimedia.org' https://rest-gateway.discovery.wmnet:4113/service/lw/inference/v1/models/enwiki-goodfaith:predict -X POST -d '{"rev_id": 12345}' -H "Content-type: application/json"
{"enwiki":{"models":{"goodfaith":{"version":"0.5.1"}},"scores":{"12345":{"goodfaith":{"score":{"prediction":true,"probability":{"false":0.07396339218373627,"true":0.9260366078162637}}}}}}}
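The response above can also be checked programmatically. What follows is a minimal sketch, not part of the original test: it parses the response body shown above and pulls out the prediction and the probability of `true`. The helper name `extract_score` is hypothetical, and the nesting (wiki, `scores`, revision ID, model, `score`) is taken only from this one response.

```python
import json

# Response body copied verbatim from the curl test above.
RESPONSE = '{"enwiki":{"models":{"goodfaith":{"version":"0.5.1"}},"scores":{"12345":{"goodfaith":{"score":{"prediction":true,"probability":{"false":0.07396339218373627,"true":0.9260366078162637}}}}}}}'


def extract_score(body, wiki, rev_id, model):
    """Return (prediction, probability of True) for one revision.

    Hypothetical helper: the key layout is inferred from the single
    response above, not from any API documentation.
    """
    data = json.loads(body)
    score = data[wiki]["scores"][str(rev_id)][model]["score"]
    return score["prediction"], score["probability"]["true"]


prediction, p_true = extract_score(RESPONSE, "enwiki", 12345, "goodfaith")
print(prediction, round(p_true, 3))
```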
Thu, Apr 16, 4:35 PM · Patch-For-Review, Lift-Wing, Machine-Learning-Team, ServiceOps-SharedInfra, ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert added a comment to T422955: Detect elevated rates of EtcdConfig fetch failures.

@Clement_Goubert @JMeybohm - Does this sound reasonable to you? I think it should be relatively low effort (e.g., exporter rule -> task-severity alert), and is a surprising enough gap in our monitoring that it probably makes sense to prioritize soon (i.e., errorlog is basically /dev/null).

Thu, Apr 16, 2:47 PM · ServiceOps-SharedInfra, ServiceOps new
Clement_Goubert added a comment to T406745: MediaWiki periodic job db-lag-stats-reporter failed.

I have no idea what this is, @FCeratto-WMF is this something you set up?

Thu, Apr 16, 1:40 PM · MW-Interfaces-Team, MediaWiki-Core-JobQueue
Clement_Goubert added a comment to T406745: MediaWiki periodic job db-lag-stats-reporter failed.

This one failed because the mesh didn't come up quickly enough.

maintenance/getLagTimes.php: Start run
The service mesh is unavailable, which can lead to unexpected results.
Thu, Apr 16, 12:58 PM · MW-Interfaces-Team, MediaWiki-Core-JobQueue
Clement_Goubert closed T423538: MediaWiki periodic job update-flaggedrev-stats failed as Resolved.

Failed job deleted.

Thu, Apr 16, 12:40 PM · ServiceOps new, Wikimedia-production-error, FlaggedRevs
Clement_Goubert closed T423538: MediaWiki periodic job update-flaggedrev-stats failed, a subtask of T422486: MediaWiki periodic job failures due to timeouts, as Resolved.
Thu, Apr 16, 12:40 PM · ServiceOps new (Next quarter), DBA
Clement_Goubert added a comment to T423538: MediaWiki periodic job update-flaggedrev-stats failed.
[pod/update-flaggedrev-stats-29604968-r4d4n/mediawiki-main-app] 2026-04-16T00:10:58.775367153Z fawiki ValidationStatistics           Wikimedia\Rdbms\DBConnectionError from line 1125 of /srv/mediawiki/php-1.46.0-wmf.23/includes/libs/Rdbms/LoadBalancer/LoadBalancer.php: Cannot access the database: MySQL server has gone away (db1181)
Thu, Apr 16, 12:37 PM · ServiceOps new, Wikimedia-production-error, FlaggedRevs

Wed, Apr 15

Clement_Goubert added a comment to T365687: Improve calico-typha firewall rules.

SGTM

Wed, Apr 15, 3:53 PM · Patch-For-Review, ServiceOps new, Prod-Kubernetes, Kubernetes
Clement_Goubert added a comment to T423395: wikikube-worker2280 unreachable.

Because the host is in a weird state where connections hang, the Kubernetes scheduler still thinks it's OK to schedule workloads there; they then never start or never terminate, which breaks deployments.
I had to manually force-delete all non-daemonset workloads on it to unblock.

Wed, Apr 15, 2:49 PM · ServiceOps-Upgrades-Hardware, Prod-Kubernetes, ServiceOps new
Clement_Goubert created T423413: debmonitor-client crashes for growthbook image.
Wed, Apr 15, 12:21 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Infrastructure-Foundations, SRE-tools
Clement_Goubert added a comment to T423395: wikikube-worker2280 unreachable.
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
/sys/devices/system/cpu/cpu0/cpuidle/state2/name:C1E
/sys/devices/system/cpu/cpu0/cpuidle/state3/name:C6

C6 seems to be enabled

I'd still leave it like that. This is the first occurrence, and we already have a couple of Supermicros. We can try disabling C6 if we see this again, I'd say.
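For checking this across hosts, the grep above could be scripted. The sketch below is only an illustration: the helper names are hypothetical, and it assumes that "enabled" means the per-state sysfs `disable` file reads `0` (or is absent). The demo builds a fake directory tree mirroring the listing above, so it runs without access to a real host.

```python
import tempfile
from pathlib import Path


def cpuidle_states(cpu_dir):
    """Return {state directory name: (state name, disabled flag)}."""
    states = {}
    for state in sorted(Path(cpu_dir).glob("state*")):
        name = (state / "name").read_text().strip()
        disable_file = state / "disable"
        # Assumption: a missing "disable" file is treated as "not disabled".
        disabled = disable_file.exists() and disable_file.read_text().strip() == "1"
        states[state.name] = (name, disabled)
    return states


def c6_enabled(cpu_dir):
    """True if a state named C6 exists and is not disabled."""
    return any(
        name == "C6" and not disabled
        for name, disabled in cpuidle_states(cpu_dir).values()
    )


# Demo on a fake tree that mirrors the grep output above.
with tempfile.TemporaryDirectory() as tmp:
    for index, state_name in enumerate(["POLL", "C1", "C1E", "C6"]):
        state_dir = Path(tmp) / f"state{index}"
        state_dir.mkdir()
        (state_dir / "name").write_text(state_name + "\n")
        (state_dir / "disable").write_text("0\n")
    demo_states = cpuidle_states(tmp)
    demo_c6 = c6_enabled(tmp)

print(demo_states["state3"], demo_c6)
```

On a real host the directory to pass would be `/sys/devices/system/cpu/cpu0/cpuidle`, as in the grep command above.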

Wed, Apr 15, 10:16 AM · ServiceOps-Upgrades-Hardware, Prod-Kubernetes, ServiceOps new
Clement_Goubert added a comment to T423395: wikikube-worker2280 unreachable.
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
/sys/devices/system/cpu/cpu0/cpuidle/state2/name:C1E
/sys/devices/system/cpu/cpu0/cpuidle/state3/name:C6

C6 seems to be enabled

Wed, Apr 15, 10:11 AM · ServiceOps-Upgrades-Hardware, Prod-Kubernetes, ServiceOps new

Tue, Apr 14

Clement_Goubert added a comment to T420223: High (relatively) number of memcached errors in eqiad.

That seems to have lowered the total number of TKOs, or at least some of their volatility, but the overall error level still bothers me.

Tue, Apr 14, 11:40 AM · Patch-For-Review, Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores

Mon, Apr 13

Clement_Goubert added a comment to T420223: High (relatively) number of memcached errors in eqiad.

@Clement_Goubert wikikube-worker1273 shows the majority of TKOs, and it is a big outlier compared to the other hosts. Can we depool it to see what the TKO impact looks like afterwards?

Mon, Apr 13, 4:28 PM · Patch-For-Review, Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
Clement_Goubert added projects to T422804: Reroute LiftWing endpoints: Machine-Learning-Team, Lift-Wing.
Mon, Apr 13, 3:58 PM · Patch-For-Review, Lift-Wing, Machine-Learning-Team, ServiceOps-SharedInfra, ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert closed T422926: Thumbor is using an unmantained HAProxy version as Resolved.

Thumbor has been upgraded to haproxy 3.2 on trixie. Resolving.

Mon, Apr 13, 10:18 AM · Patch-For-Review, Thumbor, ServiceOps-Services-Oids, Traffic, ServiceOps new

Sun, Apr 12

Clement_Goubert lowered the priority of T423027: 2026-04-12 Gerrit Outage (was: DiskSpace) from Unbreak Now! to High.

Temporary fix was to extend the root LV by 20GB. This should hold us over until Monday when collaboration-services and Release-Engineering-Team can take a deeper look.

Sun, Apr 12, 3:07 PM · Patch-For-Review, Wikimedia-Incident, Gerrit, collaboration-services
Clement_Goubert added a comment to T423027: 2026-04-12 Gerrit Outage (was: DiskSpace).
cgoubert@gerrit2003:/var/log/apache2$ sudo lvextend -L+20G -r /dev/vg0/root 
  Size of logical volume vg0/root changed from 74.50 GiB (19073 extents) to 94.50 GiB (24193 extents).
  Logical volume vg0/root successfully resized.
resize2fs 1.47.0 (5-Feb-2023)
Filesystem at /dev/mapper/vg0-root is mounted on /; on-line resizing required
old_desc_blocks = 10, new_desc_blocks = 12
The filesystem on /dev/mapper/vg0-root is now 24773632 (4k) blocks long.
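As a sanity check, the numbers in this output agree with each other. The sketch below recomputes them; the extent size is not printed in the output, so it is inferred here from the requested +20G and the change in extent count, which is an assumption of this example.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

old_extents, new_extents = 19073, 24193  # from the lvextend output
fs_blocks, block_size = 24773632, 4096   # from the resize2fs output ("4k" blocks)

# Inferred, not printed above: +20 GiB spread over the extent delta.
extent_size = 20 * GIB // (new_extents - old_extents)

old_gib = old_extents * extent_size / GIB
new_gib = new_extents * extent_size / GIB
fs_gib = fs_blocks * block_size / GIB

# Extent size in MiB, LV size before and after, filesystem size after.
print(extent_size // MIB, f"{old_gib:.2f}", f"{new_gib:.2f}", f"{fs_gib:.2f}")
```

The filesystem size after the online resize matches the new logical volume size exactly, so the filesystem now fills the extended volume.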
Sun, Apr 12, 2:45 PM · Patch-For-Review, Wikimedia-Incident, Gerrit, collaboration-services

Fri, Apr 10

Clement_Goubert added a project to T422804: Reroute LiftWing endpoints: ServiceOps-SharedInfra.
Fri, Apr 10, 1:21 PM · Patch-For-Review, Lift-Wing, Machine-Learning-Team, ServiceOps-SharedInfra, ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert added a project to T413438: [Hypothesis] 5.2.2c: Reroute API Portal Endpoints: ServiceOps-SharedInfra.
Fri, Apr 10, 1:20 PM · ServiceOps-SharedInfra, ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert triaged T422937: Cleanup ATS configuration for API paths as Medium priority.
Fri, Apr 10, 1:20 PM · ServiceOps-SharedInfra, ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert created T422937: Cleanup ATS configuration for API paths.
Fri, Apr 10, 1:19 PM · ServiceOps-SharedInfra, ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert claimed T422926: Thumbor is using an unmantained HAProxy version.
Fri, Apr 10, 10:26 AM · Patch-For-Review, Thumbor, ServiceOps-Services-Oids, Traffic, ServiceOps new
Clement_Goubert edited projects for T422926: Thumbor is using an unmantained HAProxy version, added: ServiceOps-Services-Oids, Thumbor; removed Patch-For-Review.

Tagging @JTweed-WMF for awareness.

Fri, Apr 10, 10:26 AM · Patch-For-Review, Thumbor, ServiceOps-Services-Oids, Traffic, ServiceOps new
Clement_Goubert added a comment to T422455: Massive increase in "EtcdConfig failed to fetch data: Timeout was reached" warnings and errors since March 17th.

Ah, thanks for highlighting that @JMeybohm!

I'd not noticed the in-drops previously. Those definitely correlate with when coredns was scheduled there, and appear to hover in the 0 - 50 mpps range (so a couple of lost packets per minute), though peaking at up to ~ 5x that. I do wonder to what degree that's a contributing cause vs. more an effect of a slowdown in software (e.g., something that slows processing of inbound datagrams).
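The unit conversion behind "a couple of lost packets per minute" can be spelled out. This is a small sketch that assumes "mpps" here means milli-packets per second, which is the reading consistent with the parenthetical above; the function name is hypothetical.

```python
def mpps_to_packets_per_minute(mpps):
    """Convert milli-packets per second to packets per minute."""
    return mpps * 60 / 1000


upper_bound = mpps_to_packets_per_minute(50)  # top of the 0 - 50 mpps range
peak = mpps_to_packets_per_minute(5 * 50)     # the "~ 5x that" peaks
print(upper_bound, peak)
```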

That's also an interesting observation about the conntrack entries - indeed, the same can be seen on other hosts (example) where coredns was scheduled at the time. No doubt these are rather "popular" pods (both for workloads that would likely "reuse" the same entries and for those where the source port would constantly churn), but the effect is still quite impressive.

Next steps

As much as I would like to see it, I suspect we're unlikely to get to the bottom of the specific issue that struck this time around.

In which case, I think it makes sense to think about what else we do next. Thoughts:

  1. Cleanup - We should decide whether to revert https://gerrit.wikimedia.org/r/1268573. On the one hand, it seems to have no effect on the lingering rate of timeouts. On the other, I'm not opposed to over-provisioning such a critical cluster-wide service.

I'd rather err on the side of over-provisioning than under-provisioning. There's a fairly good chance that having more pods would also lessen both the amount of traffic and the size of the conntrack table per host, which could help with the packet loss.

  2. Observability - This went on for weeks, and we didn't really notice until periodic jobs started failing more frequently.
    • One aspect of that is timeouts are happening early enough that the associated warnings will never make it into mediawiki-errors logstash unless they result in an unhandled ConfigException and instead end up in the errorlog (and that's only in the PHP-FPM case; for CLI, we only have stderr, unless they happen during LBFactory::autoReconfigure).
    • I'm wondering whether we should have an alert for this, though I'm not sure off hand what the standard pattern is for alerting on a logs-based signal.
Fri, Apr 10, 9:14 AM · Patch-For-Review, MediaWiki-Engineering, ServiceOps new, Wikimedia-production-error

Thu, Apr 9

Clement_Goubert triaged T422678: MediaWiki periodic job update-special-pages-s5 failed as Low priority.
Thu, Apr 9, 4:16 PM · ServiceOps new, Wikimedia-production-error, MediaWiki-Special-pages
Clement_Goubert changed the status of T422804: Reroute LiftWing endpoints from Open to In Progress.
Thu, Apr 9, 10:34 AM · Patch-For-Review, Lift-Wing, Machine-Learning-Team, ServiceOps-SharedInfra, ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert changed the status of T422804: Reroute LiftWing endpoints, a subtask of T413438: [Hypothesis] 5.2.2c: Reroute API Portal Endpoints, from Open to In Progress.
Thu, Apr 9, 10:34 AM · ServiceOps-SharedInfra, ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert created T422804: Reroute LiftWing endpoints.
Thu, Apr 9, 10:33 AM · Patch-For-Review, Lift-Wing, Machine-Learning-Team, ServiceOps-SharedInfra, ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert added a comment to T420336: mw-parsoid improvements.

Sounds good.

  • I will run another round of tests to see where we are with concurrency after bumping the resources.
  • I will keep using X-Wikimedia-Debug: k8s-mw-parsoid-eqiad for the immediate future
  • I will keep using ssh mw-experimental.eqiad.wmnet for the immediate future

That said, is there a better way to get the current active env than the following?

curl -s --user-agent <UA> "https://meta.wikimedia.org/w/api.php?action=query&meta=siteinfo&format=json" | jq -r '.query.general["wmf-config"].wmfMasterDatacenter'
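For scripts that would rather not shell out to jq, the same lookup can be sketched in Python. This only mirrors the jq path above and is not a different or "better" source for the value. The function name is hypothetical, and the sample body is a made-up, trimmed stand-in for a real siteinfo response.

```python
import json


def master_datacenter(siteinfo_body):
    """Follow the same path as the jq filter above."""
    data = json.loads(siteinfo_body)
    return data["query"]["general"]["wmf-config"]["wmfMasterDatacenter"]


# Hypothetical, trimmed response body for illustration only; the value is
# a placeholder and not a statement about the actual active datacenter.
SAMPLE = '{"query": {"general": {"wmf-config": {"wmfMasterDatacenter": "eqiad"}}}}'
print(master_datacenter(SAMPLE))
```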
Thu, Apr 9, 9:42 AM · Content-Transform-Team (Work In Progress), User-jijiki, ServiceOps-Services-Oids, ServiceOps new, OKR-Work

Wed, Apr 8

Clement_Goubert added a comment to T422455: Massive increase in "EtcdConfig failed to fetch data: Timeout was reached" warnings and errors since March 17th.

I agree with a longer soak, but I'm also not opposed to doubling the replicas even if the current situation holds. My rationale is that we keep having infrequent issues with coredns related to scaling up the number of clients making DNS queries, and that would give us more margin without having to revisit this once again. Thoughts?

Wed, Apr 8, 11:19 AM · Patch-For-Review, MediaWiki-Engineering, ServiceOps new, Wikimedia-production-error
Clement_Goubert added a comment to T422580: MediaWiki periodic job update-special-pages-s5 failed.

It's the same error with the same cause, but it's a different run of the job (started yesterday, failed this morning). Phaultfinder will refile a task for a failure if the previous one was closed, which is the case here. If we leave this task open, Phaultfinder will *update it* with the next alert.

Wed, Apr 8, 11:14 AM · ServiceOps new, Wikimedia-production-error, MediaWiki-Special-pages
Clement_Goubert moved T422580: MediaWiki periodic job update-special-pages-s5 failed from Inbox to Radar (Pending) on the ServiceOps new board.
Wed, Apr 8, 11:07 AM · ServiceOps new, Wikimedia-production-error, MediaWiki-Special-pages
Clement_Goubert added a comment to T422580: MediaWiki periodic job update-special-pages-s5 failed.

That job runs every 3 days; we'll see whether it runs correctly next time. Logs show a DB issue: dewiki Error: 2006 MySQL server has gone away

Wed, Apr 8, 11:07 AM · ServiceOps new, Wikimedia-production-error, MediaWiki-Special-pages
Clement_Goubert added a comment to T420223: High (relatively) number of memcached errors in eqiad.

In order to try and exclude the new switches with the old VLAN as a possible cause, I've pooled a host that is on the new VLAN (wikikube-worker1347.eqiad.wmnet), and am in the process of renumbering an existing one (wikikube-worker1273.eqiad.wmnet).
Once workloads get scheduled on them, we should see whether they exhibit the issue or not.

Wed, Apr 8, 9:58 AM · Patch-For-Review, Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores

Tue, Apr 7

Clement_Goubert added a comment to T420223: High (relatively) number of memcached errors in eqiad.

In order to try and exclude the new switches with the old VLAN as a possible cause, I've pooled a host that is on the new VLAN (wikikube-worker1347.eqiad.wmnet), and am in the process of renumbering an existing one (wikikube-worker1273.eqiad.wmnet).
Once workloads get scheduled on them, we should see whether they exhibit the issue or not.

Tue, Apr 7, 2:56 PM · Patch-For-Review, Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
Clement_Goubert added a comment to T421711: ServiceOps: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets.

We should create a spreadsheet for this, otherwise we'll lose track of which servers are renumbered.
For wikikube workers, I have a patch up to fix sre.k8s.renumber-node: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1268568

Tue, Apr 7, 2:26 PM · ServiceOps new
Clement_Goubert added a comment to T421337: MediaWiki periodic job updatequerypages-wantedpages-s1 failed.

I think that whole category of jobs should be run manually if they fail, but we should also check whether we need to be careful about what time of day we choose to do that. As an aside, maybe we should start documenting which of the cronjobs need a manual run in these cases.

Tue, Apr 7, 10:22 AM · ServiceOps new, Wikimedia-production-error, MediaWiki-Special-pages
Clement_Goubert added a comment to T416623: Decommission NodeJS IPoid service.

@kostajh @Gehel AFAIK the traffic is now migrated, could we clean up the old service?

It was written in the description to wait for SLO conversations to be finalized. While technically capturing the SLOs is still ongoing, I don't think it should be a blocker to decommission.

Yes, I think we could do that.

Tue, Apr 7, 10:15 AM · Essential-Work, Product Safety and Integrity, ServiceOps-Services-Oids, ServiceOps new, iPoid-Service (IPoid OpenSearch)
Clement_Goubert closed T418146: Reroute Core APIs through the REST gateway as Resolved.

All done, thanks @JMeybohm !

Tue, Apr 7, 10:08 AM · Patch-For-Review, MW-Interfaces-Team, ServiceOps new, OKR-Work
Clement_Goubert closed T418146: Reroute Core APIs through the REST gateway, a subtask of T413438: [Hypothesis] 5.2.2c: Reroute API Portal Endpoints, as Resolved.
Tue, Apr 7, 10:08 AM · ServiceOps-SharedInfra, ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)

Wed, Mar 25

Clement_Goubert updated the task description for T421233: Reroute Feed API endpoint.
Wed, Mar 25, 1:53 PM · ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert closed T418148: Reroute Link Recommendation APIs through the REST gateway, a subtask of T413438: [Hypothesis] 5.2.2c: Reroute API Portal Endpoints, as Resolved.
Wed, Mar 25, 1:35 PM · ServiceOps-SharedInfra, ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert closed T418148: Reroute Link Recommendation APIs through the REST gateway as Resolved.
Wed, Mar 25, 1:35 PM · User-Raine, MW-Interfaces-Team, ServiceOps new, OKR-Work
Clement_Goubert triaged T421233: Reroute Feed API endpoint as High priority.
Wed, Mar 25, 1:31 PM · ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert changed the status of T421233: Reroute Feed API endpoint from Open to In Progress.
Wed, Mar 25, 1:31 PM · ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert changed the status of T421233: Reroute Feed API endpoint, a subtask of T413438: [Hypothesis] 5.2.2c: Reroute API Portal Endpoints, from Open to In Progress.
Wed, Mar 25, 1:31 PM · ServiceOps-SharedInfra, ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert created T421233: Reroute Feed API endpoint.
Wed, Mar 25, 1:31 PM · ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert closed T418145: Configure ATS to allow fractional routing for api.wikimedia.org, a subtask of T413438: [Hypothesis] 5.2.2c: Reroute API Portal Endpoints, as Resolved.
Wed, Mar 25, 11:21 AM · ServiceOps-SharedInfra, ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert closed T418145: Configure ATS to allow fractional routing for api.wikimedia.org as Resolved.

New patch with fixed syntax merged, resolving

Wed, Mar 25, 11:21 AM · MW-Interfaces-Team, ServiceOps new, OKR-Work
Clement_Goubert added a comment to T421203: Bad ATS config led to large volume of 5xx from RESTBase.

The way I see it this was a case of "failure in depth".

  • We are still using restbase as the default fallback for requests to /api/rest_v1; we shouldn't do that anymore imho
Wed, Mar 25, 11:14 AM · Incident Severity 3, Traffic, Wikimedia-Incident

Tue, Mar 24

Clement_Goubert closed T418147: Reroute Device Analytics APIs through the REST gateway, a subtask of T413438: [Hypothesis] 5.2.2c: Reroute API Portal Endpoints, as Invalid.
Tue, Mar 24, 11:58 AM · ServiceOps-SharedInfra, ServiceOps new, Epic, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert closed T418147: Reroute Device Analytics APIs through the REST gateway as Invalid.

Endpoint is deprecated.

Tue, Mar 24, 11:58 AM · Patch-For-Review, ServiceOps new, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert updated the task description for T418146: Reroute Core APIs through the REST gateway.
Tue, Mar 24, 11:57 AM · Patch-For-Review, MW-Interfaces-Team, ServiceOps new, OKR-Work
Clement_Goubert updated the task description for T418148: Reroute Link Recommendation APIs through the REST gateway.
Tue, Mar 24, 11:57 AM · User-Raine, MW-Interfaces-Team, ServiceOps new, OKR-Work
Clement_Goubert updated the task description for T418146: Reroute Core APIs through the REST gateway.
Tue, Mar 24, 11:56 AM · Patch-For-Review, MW-Interfaces-Team, ServiceOps new, OKR-Work
Clement_Goubert updated the task description for T420967: wikikube-worker137[3-4] implementation tracking.
Tue, Mar 24, 11:35 AM · ServiceOps-Upgrades-Hardware, ServiceOps new
Clement_Goubert updated the task description for T420967: wikikube-worker137[3-4] implementation tracking.
Tue, Mar 24, 11:32 AM · ServiceOps-Upgrades-Hardware, ServiceOps new
Clement_Goubert updated the task description for T420967: wikikube-worker137[3-4] implementation tracking.
Tue, Mar 24, 11:29 AM · ServiceOps-Upgrades-Hardware, ServiceOps new
Clement_Goubert added a comment to T341560: Migrate mwmaint server functionality to mw-on-k8s.

Thanks for the edits, but in the future, you can just flag them to us, and we'll coordinate rewriting the documentation to be up to date.
I've had a quick pass to replace some mwscript calls with mwscript-k8s, but some of these need an actual rewrite, some are redundant with already written documentation, some require consulting with other teams before rewrite, and I'm wary of having documentation that has been "updated" with outdated procedures (for instance using old-style mwscript from deployment servers in use-cases where mwscript-k8s works, which is everything but importImages.php)

Tue, Mar 24, 11:16 AM · serviceops-deprecated, MW-on-K8s
Clement_Goubert updated subscribers of T418147: Reroute Device Analytics APIs through the REST gateway.
Tue, Mar 24, 10:57 AM · Patch-For-Review, ServiceOps new, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)
Clement_Goubert updated subscribers of T418148: Reroute Link Recommendation APIs through the REST gateway.
Tue, Mar 24, 10:57 AM · User-Raine, MW-Interfaces-Team, ServiceOps new, OKR-Work
Clement_Goubert updated subscribers of T418146: Reroute Core APIs through the REST gateway.
Tue, Mar 24, 10:56 AM · Patch-For-Review, MW-Interfaces-Team, ServiceOps new, OKR-Work
Clement_Goubert removed a project from T393054: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet: serviceops-deprecated.
Tue, Mar 24, 10:35 AM · SRE, ops-codfw, DC-Ops
Clement_Goubert removed a project from T393053: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet: serviceops-deprecated.
Tue, Mar 24, 10:35 AM · SRE, ops-eqiad, DC-Ops

Mon, Mar 23

Clement_Goubert updated the task description for T420967: wikikube-worker137[3-4] implementation tracking.
Mon, Mar 23, 5:25 PM · ServiceOps-Upgrades-Hardware, ServiceOps new
Clement_Goubert triaged T420967: wikikube-worker137[3-4] implementation tracking as Medium priority.
Mon, Mar 23, 5:24 PM · ServiceOps-Upgrades-Hardware, ServiceOps new
Clement_Goubert created T420967: wikikube-worker137[3-4] implementation tracking.
Mon, Mar 23, 5:24 PM · ServiceOps-Upgrades-Hardware, ServiceOps new
Clement_Goubert updated the task description for T418147: Reroute Device Analytics APIs through the REST gateway.
Mon, Mar 23, 5:15 PM · Patch-For-Review, ServiceOps new, OKR-Work, [MWI] FY2025-26 Q3, MW-Interfaces-Team (MWI-Roadmap)