After some finagling with lua patterns, everything on our side is now in place to start migrating traffic once rate limits for Lift Wing are implemented. Marking stalled pending resolution of T413448: [5.2.2c Epic]: Support Higher Rate Limits for Lift Wing.
Yesterday
Mon, Apr 20
tyvm :)
Please ping @Blake when the rate limits are enforced so they can start migrating traffic, thanks.
Depooled and downtimed for 30 days, all yours.
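For reference, a minimal sketch of the depool + downtime steps (host name is an example, and the cookbook flags are an assumption from memory, so double-check before reuse):
sudo confctl select 'name=example1001.eqiad.wmnet' set/pooled=no
# 30-day downtime via the sre.hosts.downtime cookbook; exact flags may differ
sudo cookbook sre.hosts.downtime --days 30 -r "handing host over for maintenance" 'example1001*'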
In T422804#11833678, @KartikMistry wrote:@Clement_Goubert Hi, do we need to change anything on the recommendation-api side for this change?
In T419976#11835707, @daniel wrote:In T419976#11835431, @aaron wrote:I *think* that api-gateway/Ratelimit is just storing ephemeral counters. You'd want to check with @daniel though in case there is some token/session stuff (though we try to keep that stateless, with expiration, to avoid storage lookups). I haven't been involved in the ongoing API rate limiter work.
For the ratelimit service we don't care about losing state. A reset of all counters is fine.
As far as I know, the ratelimit counters are the only thing the gateway(s) use Redis for, but I am not 100% sure. Best check with @Clement_Goubert and @hnowlan.
Thu, Apr 16
Routes merged into the rest-gateway, initial tests look good:
$ curl -H 'Host: api.wikimedia.org' https://rest-gateway.discovery.wmnet:4113/service/lw/inference/v1/models/enwiki-goodfaith:predict -X POST -d '{"rev_id": 12345}' -H "Content-type: application/json"
{"enwiki":{"models":{"goodfaith":{"version":"0.5.1"}},"scores":{"12345":{"goodfaith":{"score":{"prediction":true,"probability":{"false":0.07396339218373627,"true":0.9260366078162637}}}}}}}
In T422955#11830027, @Scott_French wrote:@Clement_Goubert @JMeybohm - Does this sound reasonable to you? I think it should be relatively low effort (e.g., exporter rule -> task-severity alert), and is a surprising enough gap in our monitoring that it probably makes sense to prioritize soon (i.e., errorlog is basically /dev/null).
In T406745#11829487, @jcrespo wrote:I have no idea what this is, @FCeratto-WMF is this something you set up?
This one failed because the mesh didn't come up quickly enough.
maintenance/getLagTimes.php: Start run
The service mesh is unavailable, which can lead to unexpected results.
Failed job deleted.
[pod/update-flaggedrev-stats-29604968-r4d4n/mediawiki-main-app] 2026-04-16T00:10:58.775367153Z fawiki ValidationStatistics Wikimedia\Rdbms\DBConnectionError from line 1125 of /srv/mediawiki/php-1.46.0-wmf.23/includes/libs/Rdbms/LoadBalancer/LoadBalancer.php: Cannot access the database: MySQL server has gone away (db1181)
Wed, Apr 15
SGTM
It's in a weird state where connections hang, but the kubernetes scheduler still thinks it's ok to schedule workloads there; they then never start or never terminate, breaking deployments.
I had to manually force delete all non-daemonset workloads on it to unblock.
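For reference, a rough sketch of that cleanup, assuming standard kubectl and jq (node name is an example; this is not the exact command that was run):
# list pods on the node that are not owned by a DaemonSet, then force-delete them
NODE=wikikube-worker1234.eqiad.wmnet
kubectl get pods -A --field-selector spec.nodeName="$NODE" -o json \
  | jq -r '.items[] | select((.metadata.ownerReferences // []) | all(.kind != "DaemonSet")) | "\(.metadata.namespace)/\(.metadata.name)"' \
  | while IFS=/ read -r ns pod; do
      kubectl -n "$ns" delete pod "$pod" --force --grace-period=0
    done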
In T423395#11823551, @JMeybohm wrote:In T423395#11823542, @Clement_Goubert wrote:
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
/sys/devices/system/cpu/cpu0/cpuidle/state2/name:C1E
/sys/devices/system/cpu/cpu0/cpuidle/state3/name:C6
C6 seems to be enabled
I'd still leave it like that. This is the first occurrence, and we already have a couple of Supermicros. We can try disabling C6 if we see this again, I'd say.
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
/sys/devices/system/cpu/cpu0/cpuidle/state2/name:C1E
/sys/devices/system/cpu/cpu0/cpuidle/state3/name:C6
C6 seems to be enabled
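Following up on the suggestion above to try disabling C6 if this recurs, a minimal sketch assuming the standard Linux cpuidle sysfs interface (state3 maps to C6 per the grep output above; not persistent across reboots):
for f in /sys/devices/system/cpu/cpu*/cpuidle/state3/disable; do
  echo 1 | sudo tee "$f" >/dev/null   # 1 = disable this idle state for that CPU
done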
Tue, Apr 14
That seems to have lowered the total number of TKOs, or at least some of the volatility, but the overall error level still bothers me.
Mon, Apr 13
In T420223#11815390, @elukey wrote:@Clement_Goubert wikikube-worker1273 shows the majority of TKOs, and it is a big outlier compared to the other ones. Can we depool it to see what the TKO impact looks like afterwards?
Thumbor has been upgraded to haproxy 3.2 on trixie. Resolving.
Sun, Apr 12
Temporary fix was to extend the root LV by 20GB. This should hold us over until Monday when collaboration-services and Release-Engineering-Team can take a deeper look.
cgoubert@gerrit2003:/var/log/apache2$ sudo lvextend -L+20G -r /dev/vg0/root
  Size of logical volume vg0/root changed from 74.50 GiB (19073 extents) to 94.50 GiB (24193 extents).
  Logical volume vg0/root successfully resized.
resize2fs 1.47.0 (5-Feb-2023)
Filesystem at /dev/mapper/vg0-root is mounted on /; on-line resizing required
old_desc_blocks = 10, new_desc_blocks = 12
The filesystem on /dev/mapper/vg0-root is now 24773632 (4k) blocks long.
Fri, Apr 10
Tagging @JTweed-WMF for awareness.
In T422455#11806562, @Scott_French wrote:Ah, thanks for highlighting that @JMeybohm!
I'd not noticed the in-drops previously. Those definitely correlate with when coredns was scheduled there, and appear to hover in the 0 - 50 mpps range (so a couple of lost packets per minute), though peaking at up to ~ 5x that. I do wonder to what degree that's a contributing cause vs. more an effect of a slowdown in software (e.g., something that slows processing of inbound datagrams).
That's also an interesting observation about the conntrack entries - indeed, the same can be seen on other hosts (example) where coredns was scheduled at the time. No doubt these are rather "popular" pods (both for workloads that would likely "reuse" the same entries and for those where the source port would constantly churn), but the effect is still quite impressive.
Next steps
As much as I would like to see it, I suspect we're unlikely to get to the bottom of the specific issue that struck this time around.
In which case, I think it makes sense to think about what else we do next. Thoughts:
- Cleanup - We should decide whether to revert https://gerrit.wikimedia.org/r/1268573. On the one hand, it seems to have no effect on the lingering rate of timeouts. On the other, I'm not opposed to over-provisioning such a critical cluster-wide service.
I'd rather err on the side of over- than under-provisioning. There's a fairly good chance that having more pods would also lessen both the amount of traffic and the size of the conntrack table per host, which could help with the packet loss.
- Observability - This went on for weeks, and we didn't really notice until periodic jobs started failing more frequently.
- One aspect of that is that the timeouts happen early enough that the associated warnings never make it into mediawiki-errors logstash (unless they result in an unhandled ConfigException); they instead end up in the errorlog. And that's only in the PHP-FPM case; for CLI, we only have stderr, unless they happen during LBFactory::autoReconfigure.
- I'm wondering whether we should have an alert for this, though I'm not sure off hand what the standard pattern is for alerting on a logs-based signal.
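As an illustration only (not the standard alerting pattern, and the log path and file name on the mwlog host are assumptions), the raw signal we'd want is roughly the count of these warnings in the errorlog, which an exporter rule could then turn into a task-severity alert:
# hypothetical: count mesh-unavailability warnings landing in the errorlog
grep -c 'The service mesh is unavailable' /srv/mw-log/error.log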
Thu, Apr 9
In T420336#11798416, @Jgiannelos wrote:Sounds good.
- I will run another round of tests to see where we are with concurrency after bumping the resources.
- I will keep using X-Wikimedia-Debug: k8s-mw-parsoid-eqiad for the immediate future
- I will keep using ssh mw-experimental.eqiad.wmnet for the immediate future
That said, is there a better way to get the current active env than the following?
curl -s --user-agent <UA> "https://meta.wikimedia.org/w/api.php?action=query&meta=siteinfo&format=json" | jq -r '.query.general["wmf-config"].wmfMasterDatacenter'
Wed, Apr 8
I agree with a longer soak, but I'm also not opposed to doubling the replicas even if the current situation holds. My rationale is that we keep having infrequent issues with coredns related to scaling up the number of clients making DNS queries, and that would give us more margin without having to revisit this once again. Thoughts?
It's the same error with the same cause, but it's a different run of the job (started yesterday, failed this morning). Phaultfinder will refile a task for a failure if the previous one was closed, which is the case here. If we leave this task open, Phaultfinder will *update it* with the next alert.
That runs every 3 days; we'll see whether it runs correctly next time. Logs show a DB issue: dewiki Error: 2006 MySQL server has gone away
In T420223#11794423, @Clement_Goubert wrote:In order to try and exclude the new switches with old vlan as a possible cause, I've pooled a host that is on the new vlan wikikube-worker1347.eqiad.wmnet, and am in the process of renumbering an existing one wikikube-worker1273.eqiad.wmnet.
Once workloads get scheduled on them, we should see whether they exhibit the issue or not.
Tue, Apr 7
In order to try and exclude the new switches with old vlan as a possible cause, I've pooled a host that is on the new vlan wikikube-worker1347.eqiad.wmnet, and am in the process of renumbering an existing one wikikube-worker1273.eqiad.wmnet.
Once workloads get scheduled on them, we should see whether they exhibit the issue or not.
We should create a spreadsheet for this otherwise we'll lose track of which servers are renumbered.
For wikikube workers, I have a patch up to fix sre.k8s.renumber-node https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1268568
I think that whole category of jobs should be run manually if they fail, but we should also check whether we need to be careful about what time of day we choose to do that. As an aside, maybe we should start documenting which of the cronjobs need a manual run in these cases and which don't.
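For reference, a minimal sketch of such a manual re-run, assuming the jobs are plain Kubernetes CronJobs (namespace and names are examples only):
# create a one-off Job from the CronJob spec and follow its logs
kubectl -n mw-cron-example create job --from=cronjob/update-flaggedrev-stats update-flaggedrev-stats-manual-1
kubectl -n mw-cron-example logs -f job/update-flaggedrev-stats-manual-1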
In T416623#11757466, @kostajh wrote:In T416623#11752681, @MLechvien-WMF wrote:Yes, I think we could do that.
All done, thanks @JMeybohm !
Wed, Mar 25
New patch with fixed syntax merged, resolving
In T421203#11747752, @Joe wrote:The way I see it this was a case of "failure in depth".
- We are still using restbase as the default fallback for requests to /api/rest_v1; we shouldn't do that anymore imho
Tue, Mar 24
Endpoint is deprecated.
Thanks for the edits, but in the future, you can just flag them to us, and we'll coordinate rewriting the documentation to be up to date.
I've done a quick pass to replace some mwscript calls with mwscript-k8s, but some of these need an actual rewrite, some are redundant with already-written documentation, and some require consulting other teams before a rewrite. I'm wary of having documentation that has been "updated" with outdated procedures (for instance, using old-style mwscript from deployment servers in use cases where mwscript-k8s works, which is everything except importImages.php).