
Significant increase in wikifeeds latency and mobileapps error rate since 2025/11/13
Closed, Resolved · Public · Production Error

Description

From around 14:30 UTC on the 13th of November, p99 and p95 latency in Wikifeeds increased dramatically and have stayed high since. The change coincides with a significant increase in timeouts of requests to wikifeeds itself, as can be seen on this dashboard. Wikifeeds is also alerting frequently in #wikimedia-operations for swagger probe failures.

I don't see a significant increase in requests to the service, although there were some aberrant bumps in request volume around the time the issue emerged that haven't persisted.

There has been a slight increase in CPU throttling in the wikifeeds service itself, and the service's own metrics show a similar spike in latency, so this points to either a change in the service or a change in a service it pulls from, which is a known pattern for wikifeeds. Memory and CPU use have taken on sawtooth patterns since this issue emerged and remain unusual. Could this signify workers dying and restarting? The service itself shows no OOMKill-related restarts.
From a quick check, it doesn't look like mobileapps has seen a significant change around the same time, although it has historically been a source of wikifeeds choppiness.

Event Timeline


Wikifeeds logs quite heavily in general, but it's hard to find signal. There has been a solid increase in internal 504 and 500 errors, but the error messages carry no further context. There are also 4xx internal errors, which don't appear to be trending upwards.

This panel shows a spike in RX traffic corresponding to the increase in latencies, etc.

This panel shows an increase in traffic to the wikifeeds_featured endpoint as well as wikifeeds_onthisday endpoints. And this panel shows an increase in upstream request timeout for those two endpoints.

So, my first guess is that there has been an increase in traffic to two of these endpoints, which may have existing inefficiencies that need investigation and fixing, rather than something new happening to the service itself.

I happened to notice some KubernetesDeploymentUnavailableReplicas alert noise for mobileapps in codfw in -operations today.

Looking back over the past week or so, it seems that also started on the 13th around 14:00 UTC and correlates pretty strongly - compare the fraction of unavailable replicas for mobileapps with the rate of upstream timeouts and latency variation at rest-gateway for wikifeeds requests.

The only thing that pops out at me in the SAL is a deployment to mobileapps ~ 30 minutes prior (SAL), which presumably picked up https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1204849. Not sure if it's plausible that deployment could be involved, particularly given the delayed effect.

Edit: The only potentially curious thing I can see in the commit range difference between 2025-10-13-122439-production and 2025-11-13-122825-production is the upgrade from node 18 to 22.

Also, looking at mobileapps specifically there are definitely some curious spikes in traffic that correlate, but I'm not sure whether that's cause vs. effect (also wow, what are these orders of magnitude? is the app calling itself via internal handlers?).


There's no appreciable increase in traffic to these endpoints based on turnilo data or rest-gateway hits from what I can see - for example, this is for all onthisday endpoints:

image.png (1×1 px, 58 KB)

I'd say Scott is on the money if mobileapps is also in trouble. We've seen some fairly wacky performance rabbit holes with the service before that we never got to the bottom of (see T397750). Funnily enough, looking at it from Envoy, wikifeeds sees a *drop* in overall latency from mobileapps, but a marked increase in 5xx errors.

Looping in @Dbrant in case he might have any insight on the changes that went out in that image. Can/should we roll back the mobileapps image to see if there is any improvement?

Very curious - I don't think we've made any changes to either wikifeeds or mobileapps that directly affect the type(s) or quantities of requests made.
Looking at the logs more closely, a good majority of 504 errors have to do with zhwiki, where it's trying to get the /feed/featured route, which in turn tries to fetch page/summary/Special:Search for some reason (?).
The other thing we know about zhwiki is that it does special handling of language variants. But that's also been the case for quite a while, not just last week...

Change #1206878 had a related patch set uploaded (by Dbrant; author: Dbrant):

[mediawiki/services/wikifeeds@master] Explicitly filter out non-article namespaces from most-read list.

https://gerrit.wikimedia.org/r/1206878

Just as a datapoint - I roll-restarted mobileapps and it had an immediate impact on wikifeeds: https://grafana.wikimedia.org/goto/lmB4-hmvg?orgId=1

This is almost certainly going to return in the next few hours though - it bought us about 30 minutes of normal performance. We've also been seeing persistent issues with unavailable mobileapps replicas since this issue first emerged on wikifeeds:

15:49 <+jinxer-wm> Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas

Change #1206878 merged by jenkins-bot:

[mediawiki/services/wikifeeds@master] Explicitly filter out non-article namespaces from most-read list.

https://gerrit.wikimedia.org/r/1206878

Mentioned in SAL (#wikimedia-operations) [2025-11-18T16:55:59Z] <hnowlan> silenced wikifeeds codfw swagger alert for 24h T410296

Mentioned in SAL (#wikimedia-operations) [2025-11-19T11:30:49Z] <claime> Roll restarting mobileapps in codfw - unavailable replicas - T410296

hnowlan renamed this task from Significant increase in wikifeeds latency since 2025/11/13 to Significant increase in wikifeeds latency and mobileapps error rate since 2025/11/13.Nov 19 2025, 5:06 PM

^ after deploying the above patch, the 504 errors have all but disappeared. Not sure if this has any bearing on the unavailable replicas issues, but it's something!

Nice! Good to have that issue out of the picture. The latency issues are still present - we did two roll-restarts today at 11:30 and 17:10, which have suppressed the timeouts in wikifeeds to a degree, but they're still there, albeit less prominently. It'd be very interesting (albeit confusing) if the reduction in mobileapps error rate has reduced the wikifeeds timeouts somewhat!

It looks like mobileapps in codfw is still experiencing this "oscillating" unavailability (last 6 hours; note the effect of the rolling restarts that @hnowlan mentioned).

Going back to the start on 13th, I want to follow up on one aspect:

Edit: The only potentially curious thing I can see in the commit range difference between 2025-10-13-122439-production and 2025-11-13-122825-production is the upgrade from node 18 to 22.

Is it at all feasible to try reverting mobileapps to the 2025-10-13-122439-production image (on a temporary basis) to rule out this being an artifact of something that changed with the node upgrade?

The reason I ask is that, if you look at the fraction of pods becoming unavailable vs. average or max CPU, or average or max memory usage, the service does seem to be behaving wildly differently in terms of resource usage across that image update.

While that correlation could be spurious (e.g., due to an "unlucky" change in workload), it would probably be handy to confirm or refute that.

I wouldn't have an issue with (temporarily) reverting the image. There was only one nontrivial change since that time, and it's not super critical.

Change #1207271 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] mobileapps: revert to 2025-10-13-122439-production

https://gerrit.wikimedia.org/r/1207271

Mentioned in SAL (#wikimedia-operations) [2025-11-20T12:21:37Z] <claime> roll-restart of mobileapps codfw - T410296

Change #1207271 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: revert to 2025-10-13-122439-production

https://gerrit.wikimedia.org/r/1207271

Roughly 11 hours have passed since the revert to 2025-10-13-122439-production was deployed to mobileapps in codfw (13:35 UTC) and the situation is looking much better:

While we should let this soak longer to get a bit more confidence, this really feels like the trigger is a change somewhere between 2025-10-13-122439-production and 2025-11-13-122825-production.

Reviewing that commit range again, https://gerrit.wikimedia.org/r/1194966 (Bump base image to node 22) is the only patch in that range that could plausibly affect the performance of the app like this.

For example, I'm wondering if the jump from 18 to 22 requires some amount of memory management tuning (e.g., heap size, which could also explain the impact on CPU) to get back to a stable configuration.

Yep, the upgrade from 18 to 22 definitely looks like the culprit. I'm guessing the next step would be for us to revert that change in the repo, so we could continue normal development, and in the meantime do some sort of profiling (?) under node 22 to see what's going on with this service. Looping in @Jgiannelos since Content-Transform-Team owns that task (T393434)

Would it make sense to try to upgrade from node 18 to 20 to see if the behavior is similar?

@Jgiannelos - Yes, I think that would be highly informative. I suspect this is related to the container-aware heap-sizing changes that landed in Node 20, which would have cut the old-gen size limit down to 512MiB (in a 1GiB container). If we see the same effect there, then we probably want to explore tuning --max-old-space-size (and possibly --max-semi-space-size, since that will have shrunk as well vs. Node 18).

From some quick research, I agree that this is the memory management change introduced in Node 20. We had similar issues here:
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1163823

cc @hnowlan who was involved with handling the node 20 related issue.

I think it's worth trying the flags you mention, @Scott_French (--max-old-space-size and --max-semi-space-size).

Ah, interesting! Thanks @Jgiannelos - I didn't realize there had already been an attempt to upgrade to Node 20. Indeed, if that hit similar issues, then no need to reproduce that experiment.

In that case, maybe we can start with --max-old-space-size alone. In Node 18, it would have been 2GiB (i.e., larger than the 1GiB container limit). Perhaps we could try something like 90% of the limit in order to keep it out of OOMKill territory?

Another thing to try (ideally a second step) would be setting --max-semi-space-size to the static value it would have had on Node 18 (16MiB). If I understand correctly, it seems from Node 20 onward this will be dynamically set to 4MiB in a 1GiB container, which could maybe be pushing objects out into the old-gen prematurely.

Change #1218287 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/mobileapps@master] Revert "Bump base image to node 22"

https://gerrit.wikimedia.org/r/1218287

Change #1218287 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] Revert "Bump base image to node 22"

https://gerrit.wikimedia.org/r/1218287

Change #1227799 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Set limits on memory usage to avoid latency increase

https://gerrit.wikimedia.org/r/1227799

Change #1227799 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: Set limits on memory usage to avoid latency increase

https://gerrit.wikimedia.org/r/1227799

Change #1229572 had a related patch set uploaded (by Scott French; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Use max-old-space-size instead

https://gerrit.wikimedia.org/r/1229572

Change #1229572 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: Use max-old-space-size instead

https://gerrit.wikimedia.org/r/1229572

A couple of hours after @Jgiannelos set --max-old-space-size (and deployed the new node 22-based image), we're again seeing cyclic latency excursions (as measured from Envoy's view on the wikifeeds side) that seem to correlate with bumps in CPU and memory (note: these are totals, not per-pod behavior).

Now, this isn't nearly as severe as what we were seeing the last time mobileapps was running on node 22 - e.g., we're not (at least yet) seeing widespread pod unavailability (compare with T410296#11394696).

Proposal:

  1. Would it make sense to try setting --max-semi-space-size=16, in order to restore what I believe would have been the node 18 default? As noted in T410296#11408049, I'm wondering if the cyclic behavior we're still seeing is a result of pushing objects into the old-gen space too aggressively.
  2. If not, or if #1 isn't effective, would it make sense to revert the pair of changes (--max-old-space-size and the image bump) before the weekend? My vote is "yes please."

Change #1230877 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Define max-semi-space-size for node

https://gerrit.wikimedia.org/r/1230877

MLechvien-WMF edited projects, added ServiceOps new; removed serviceops.
MLechvien-WMF changed the subtype of this task from "Task" to "Production Error".

Change #1230877 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: Define max-semi-space-size for node

https://gerrit.wikimedia.org/r/1230877

Change #1230917 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Revert to last known working state (node18)

https://gerrit.wikimedia.org/r/1230917

@Scott_French I reverted to the previous working state. The max-semi-space-size didn't do the trick.

I think the next step is to:

  • Introspect a running mobileapps image in production
  • Get the exact memory flags
  • Reapply these values to the node22 images/deployment-charts

Something like:

$ node --v8-options
$ node -e "console.log(require('v8').getHeapStatistics())"

Change #1230917 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: Revert to last known working state (node18)

https://gerrit.wikimedia.org/r/1230917

I did some research in a local minikube env running our node22 and node18 images in pods with the same resource limits as prod, and here is my understanding of the problem:

  • Node 18 doesn't respect cgroup limits
    • This means it *doesn't* respect the k8s request/limits on memory usage
  • Node 22 does respect cgroup limits
    • This means it *does* respect the k8s request/limits
  • The default heap size node 18 sees is something like: max_old_space_size ≈ min(physical_memory * 0.5, 4GB) // for 64-bit systems
    • In our case, 4G
  • When we enforced the limits using node 22 in production, the environment was effectively far more restricted than the node 18-based one
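The heuristic in the first bullet can be written out directly; a sketch of the 64-bit Node 18 behaviour described above:

```javascript
// Node 18's approximate default old-space sizing on 64-bit systems:
// half of perceived physical memory, capped at 4 GiB. Because Node 18
// ignores cgroup limits, "physical memory" is the host's RAM, not the pod's.
const GiB = 1024 ** 3;

function node18DefaultOldSpace(physicalBytes) {
  return Math.min(physicalBytes / 2, 4 * GiB);
}

// On a 64 GiB k8s host, a pod with a 1 GiB limit still gets the 4 GiB cap:
console.log(node18DefaultOldSpace(64 * GiB) / GiB); // 4
```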

Here is an example from a pod running service-runner with the same config as prod in minikube:
pod.yaml: https://phabricator.wikimedia.org/P87969

  • it practically defines 2 pods (node18 and node22 based) using mobileapps prod images
  • it uses a dummy app.js that logs heap statistics
  • it uses the same memory/concurrency/logging config as prod
  • it uses the same resource limits as prod

node 18:

Heap statistics: {
  "total_heap_size": 12697600,
  "total_heap_size_executable": 1048576,
  "total_physical_size": 12455936,
  "total_available_size": 4334806312,
  "used_heap_size": 11239520,
  "heap_size_limit": 4345298944,
  "malloced_memory": 270416,
  "peak_malloced_memory": 2281568,
  "does_zap_garbage": 0,
  "number_of_native_contexts": 1,
  "number_of_detached_contexts": 0,
  "total_global_handles_size": 8192,
  "used_global_handles_size": 4064,
  "external_memory": 2174990
}

And here is the same running node 22:

Heap statistics: {
  "total_heap_size": 13287424,
  "total_heap_size_executable": 524288,
  "total_physical_size": 12509184,
  "total_available_size": 538900824,
  "used_heap_size": 10301672,
  "heap_size_limit": 549453824,
  "malloced_memory": 278624,
  "peak_malloced_memory": 1499936,
  "does_zap_garbage": 0,
  "number_of_native_contexts": 1,
  "number_of_detached_contexts": 0,
  "total_global_handles_size": 8192,
  "used_global_handles_size": 2784,
  "external_memory": 2151487
}

The two important metrics from above are:

  • heap_size_limit:
    • ~4G vs ~512M between two node versions
  • total_available_size:
    • a similar difference to the above (~4G vs ~512M)

I am still trying to understand why node 18 doesn't OOM given that its heap limit is way larger than the container limit, but I assume the app simply never hits the k8s resource limit.

The other interesting part is the worker_heap_limit_mb: 750 in the service runner config.
In theory this should limit the total heap to 2 x 750M (concurrency is 2). Maybe that's the reason the pod never OOMs (?)

Overall:
I think that beyond the flags we said we need to enforce above (--max-old-space-size and --max-semi-space-size), we need to reconsider the sizing of the pod, because currently it doesn't mean much for the node 18 runtime.
In practice node believes it has ~4G to use, and it doesn't OOM because it maxes out below the k8s limit.
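A back-of-envelope check of that conclusion, using the figures quoted in this thread (2 workers, worker_heap_limit_mb: 750, 1 GiB container; units assumed to be MiB here, though MB gives the same answer):

```javascript
// Why the kernel OOM killer, not service-runner's HeapWatch, wins:
// the aggregate HeapWatch budget across workers already exceeds the pod limit.
const MiB = 1024 ** 2;

const workers = 2;
const heapWatchBudget = workers * 750 * MiB; // 1500 MiB aggregate ceiling
const containerLimit = 1024 * MiB;           // 1 GiB k8s memory limit

console.log(heapWatchBudget > containerLimit); // true: the pod OOMs first
```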

Thanks for the additional investigation, @Jgiannelos.

As requested in T410296#11548686, here is the result of v8.getHeapStatistics() from an arbitrary production pod (v18.20.4):

{
  total_heap_size: 5009408,
  total_heap_size_executable: 524288,
  total_physical_size: 5480448,
  total_available_size: 4341505280,
  used_heap_size: 3665984,
  heap_size_limit: 4345298944,
  malloced_memory: 254104,
  peak_malloced_memory: 443544,
  does_zap_garbage: 0,
  number_of_native_contexts: 1,
  number_of_detached_contexts: 0,
  total_global_handles_size: 8192,
  used_global_handles_size: 2208,
  external_memory: 1277583
}

This would be consistent with what you've seen in your local testing via minikube: heap_size_limit is 4345298944 bytes, which translates into the sum of:

  • 4 GiB inferred (i.e., at this physical memory size) max old-space size
  • 3x the 16 MiB default max semi-space size
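That decomposition reconciles exactly with both sets of numbers in this thread (a sketch; the "old-space plus three semi-spaces" reading is inferred from these figures rather than taken from v8 documentation):

```javascript
// heap_size_limit = max old-space size + 3 x max semi-space size:
const MiB = 1024 ** 2;

const node18Limit = 4096 * MiB + 3 * 16 * MiB; // 4 GiB old-space, 16 MiB semi-space
const node22Limit = 512 * MiB + 3 * 4 * MiB;   // 512 MiB old-space, 4 MiB semi-space

console.log(node18Limit); // 4345298944 - matches the node 18 pods
console.log(node22Limit); // 549453824  - matches the node 22 minikube pod
```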

The other interesting part is the worker_heap_limit_mb: 750 in the service runner config.
In theory this should limit the total heap to 2 x 750M (concurrency is 2). Maybe that's the reason the pod never OOMs (?)

Oh, wow! I'd completely missed that mobileapps is running with 2 workers. So, it turns out that OOMs are indeed happening roughly all the time, e.g. in codfw right now:

$ kubectl get pods -l release=production
NAME                                     READY   STATUS                   RESTARTS          AGE
mobileapps-production-777fdf9686-2c82j   3/3     Running                  331 (11m ago)     6d23h
mobileapps-production-777fdf9686-2clb4   3/3     Running                  337 (4m2s ago)    6d23h
mobileapps-production-777fdf9686-2lqrm   3/3     Running                  328 (82s ago)     6d23h
mobileapps-production-777fdf9686-2zfm6   3/3     Running                  329 (21m ago)     6d23h
[ ... ]
$ kubectl describe pod mobileapps-production-777fdf9686-2c82j
[ ... ]
Containers:
  mobileapps-production:
    [ ... ]
    State:          Running
      Started:      Mon, 02 Feb 2026 16:29:55 +0000
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 02 Feb 2026 16:08:42 +0000
      Finished:     Mon, 02 Feb 2026 16:29:55 +0000
    [ ... ]

With the container limit at 1 GiB and these parameters, and assuming both workers experience a similar rate of heap growth, my expectation is that we'll never see service-runner's HeapWatch restart trigger, as the container will have OOM'd well before that.


Given all of that, I wonder if the next "simplest" experiment is to set --max-old-space-size=4096 (in addition to --max-semi-space-size=16) to fully restore the node 18 defaults.

While this feels wrong, in that such a limit is clearly not achievable given the container memory, it would probably give us good insight into whether the node 18 status quo performance can be restored purely via v8 heap flags.

If the answer is yes, then the question becomes what needs to be done to make that sustainable (e.g., raising the container limits to make OOMs less frequent, or tuning worker_heap_limit_mb if that restart path is indeed preferable to an OOM).

I can try setting the flags so we reproduce the state of the node 18 env and see how it goes. That said, I believe we should reconsider the sizing as a combination of:

  • pod memory request/limit
  • worker count per service runner
  • service runner heap limit

Overall I think we should fall back to the defaults and just configure the resources via the k8s config.

Change #1236323 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/mobileapps@master] Bump base image to node 22

https://gerrit.wikimedia.org/r/1236323

Change #1236323 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] Bump base image to node 22

https://gerrit.wikimedia.org/r/1236323

Change #1236332 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] pcs: Configure node runtime memory limits

https://gerrit.wikimedia.org/r/1236332

Change #1236332 merged by jenkins-bot:

[operations/deployment-charts@master] pcs: Configure node runtime memory limits

https://gerrit.wikimedia.org/r/1236332

Roughly 9h after switching back to node 22 with --max-old-space-size=4096 and --max-semi-space-size=16, we're seeing some interesting results:

While we saw some latency oscillations initially (e.g., measured at mobileapps or downstream at wikifeeds), they appear to (1) be far less severe than what we saw on the last attempt and (2) have diminished by the ~6h mark, though tail latency remains elevated. Similarly, we see far more modest variation in aggregate CPU and memory usage. The rate of OOMs also seems to have stabilized at something similar to what we see on node 18 (roughly one per hour, per container).

Performance is clearly not the same as node 18 (e.g., elevated tail latency), but it seems as if restoring the "clearly unreasonable" 4 GiB old-space limit has in turn relaxed some GC behavior [0] that had a significant performance impact on previous attempts (at the cost of periodic OOMs, similar to what we see on node 18).

This seems stable enough to leave as-is for now to assess how things evolve and consider what to measure / change next.

@Jgiannelos - Agreed with your take in T410296#11575396. Even if this has been informative, it's not a solution on its own, and tuning these explicitly is brittle. Do you happen to know how we arrived at the 2-worker per-instance concurrency or the HeapWatch limit? I'm wondering if it makes sense to increase the container limit while running more workers per instance and fewer pods overall (i.e., relying to some degree on "multiplexing" of memory demand across workers).

[0] Aside: I wish there were some GC metrics to corroborate this, rather than relying on secondary signals (e.g., the correlated CPU usage oscillations). From a quick check of the /metrics handlers exposed by mobileapps and the statsd-exporter, it doesn't look like we export any, though I realize there may be some service-runner subtleties to that (e.g., what would that even show me, when there are multiple worker subprocesses?).

Within the scope of the node 22 upgrade, it looks like we are in good enough shape. I think @hnowlan might have some insight on the worker/heap limit sizing from the last round of the same incident we had on the node 18 -> node 20 upgrade.
Overall I believe that:

  • the whole paradigm of multiple workers per service-runner predates the k8s deployments
    • maybe using one worker per instance and bumping the pod count is more compatible with the current architecture
  • having both HeapWatch in service-runner and node being aware of cgroup/k8s limits, on top of k8s enforcing the memory limit itself, is a bit too complicated
  • with a bit of trial and error we can keep the default nodejs memory behaviour and fine-tune the combination of:
    • k8s pod count
    • k8s memory limits
    • service runner heapwatch limits

I think the work on our side (content transform team) is complete, and this should unblock you to proceed with the memory resources config.

Change #1237880 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] wikifeeds: Use flags for memory compatibility with node18

https://gerrit.wikimedia.org/r/1237880

Change #1237880 abandoned by Jgiannelos:

[operations/deployment-charts@master] wikifeeds: Use flags for memory compatibility with node18

https://gerrit.wikimedia.org/r/1237880

Closing this one for now with wikifeeds and mobileapps running node 22. I think that ideally we should spend some time in the future fine-tuning the sizing of the pods, but for now production doesn't look problematic.

Jgiannelos claimed this task.