User Details
- User Since
- Nov 7 2023, 9:56 PM (109 w, 5 d)
- Availability
- Available
- LDAP User
- Unknown
- MediaWiki User
- GChoi-WMF [ Global Accounts ]
Wed, Dec 10
From Slack thread w/ @akosiaris (thank you!)
kube_env wikifunctions codfw
kubectl get pods
Tue, Dec 9
Some of these are pending local reproducibility and will be completed in follow-up tasks; the new error returns addressed in the linked MR are resolved
I believe this is ready to be resolved, at least from my end/experience; will move to 'Needs sign-off' for consensus!
resolved!
@Jdforrester-WMF that sounds good! Linking this task
Added instructions to our cheatsheet
This is a likely side effect of a larger issue of excessive function invocation; closing for the quarter, as this will be an ongoing investigation
Mon, Dec 8
Memory leaks were detected this quarter; created tools to track Node garbage collection and memory usage using Docker stats.
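A minimal sketch of the kind of tracking described above: sample the Node process's memory usage, optionally forcing a garbage collection first. The function names and units are illustrative assumptions, not the actual tooling.

```javascript
// Illustrative sketch (not the actual tool): report the process's current
// memory usage in MiB, using Node's built-in process.memoryUsage().
function sampleMemory() {
  const { rss, heapTotal, heapUsed } = process.memoryUsage();
  const toMiB = (bytes) => (bytes / 1024 / 1024).toFixed(1);
  return { rss: toMiB(rss), heapTotal: toMiB(heapTotal), heapUsed: toMiB(heapUsed) };
}

// When the process is started with --expose-gc, global.gc() forces a
// collection before sampling, which helps distinguish a genuine leak
// from ordinary heap churn that GC would reclaim anyway.
function sampleAfterGc() {
  if (typeof global.gc === 'function') global.gc();
  return sampleMemory();
}

console.log(sampleAfterGc());
```

Sampling on an interval and diffing `heapUsed` across GC cycles gives a rough leak signal that can be correlated with the container-level numbers from `docker stats`.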
Thu, Dec 4
Wed, Dec 3
Sat, Nov 29
Thu, Nov 27
got the script and PNG working locally
Wed, Nov 26
@DSmit-WMF I tried removing all of the filters from our WL Logstash and found something but I can't see that id in the backend logs yet 👀
Good call! We've been keeping an eye on this through our logging and monitoring lately, especially as we're focused on our system load and memory usage. Generally, 503s can happen occasionally even when traffic is low; our server may have been briefly unable to handle the requests at that particular moment. This can happen during small internal delays, pod restarts, or brief network hiccups. We don't wish to see this trending upward, however, so thank you for raising this!
Tue, Nov 25
- In core docker-compose.override.yml, add two new keys under function-orchestrator:
  entrypoint: []
  command: "npm start"
  and add under its environment:
  NODE_OPTIONS: --expose-gc
- In the Orchestrator package.json, replace the "start" script with "node --trace-gc server.js", or, for more verbose GC logs, "node --trace-gc --trace-gc-nvp server.js".
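Taken together, the compose override from the steps above might look like the following sketch (the `services:` nesting and file layout are assumptions about the local setup):

```yaml
# core docker-compose.override.yml (sketch; keys per the steps above)
services:
  function-orchestrator:
    entrypoint: []
    command: "npm start"
    environment:
      NODE_OPTIONS: --expose-gc
```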
Sat, Nov 22
Thu, Nov 20
Hello @99of9~ server errors such as this can be expected as it indicates our servers are currently overloaded, hopefully brief! Is this still occurring for you?
Wed, Nov 19
Reopening as per COW investigation into Orchestrator OOMs today
Resolving now since:
- original implementation has changed
- maxSimultaneousExecutions has been configured and increased
Tue, Nov 18
Resolving this hypothesis as the last occurrence was more than two months ago.
Have not found clear evidence of memory leakage; resolving this hypothesis so I can move on to others
I'm not sure why the Evaluator revert is linked here but this Rust patch is complete
I've since added logs to show if/where the rate-limiting middleware was rate-limiting requests from the Evaluator, and can confirm that mystery is solved. Evaluator errors have also since been initiated.
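A self-contained sketch of the kind of visibility described above, assuming a simple per-client counter; the function name, limit, and log shape are illustrative, not the actual middleware:

```javascript
// Illustrative sketch (not the actual code): track request counts per
// client and log whenever a request exceeds the configured limit, so the
// logs show if/where rate-limiting is kicking in.
function makeRateLimitLogger(limit) {
  const counts = new Map();
  return function record(clientId, path) {
    const n = (counts.get(clientId) || 0) + 1;
    counts.set(clientId, n);
    const limited = n > limit;
    if (limited) {
      console.warn(`rate-limited request from ${clientId}: ${path} (count=${n})`);
    }
    return limited;
  };
}
```

Wired into the request path, this makes it easy to grep the logs for `rate-limited` and confirm (or rule out) the middleware as the source of rejected Evaluator requests.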
Invalid json errors are handled now.
Marking as resolved. This was a temporary band-aid, but other occurrences in the code produced similar error logs down the road. Additionally, we were recently encouraged not to use optional chaining.
Mon, Nov 17
Testing this again and I see the expected behavior; marking as resolved, TY
Sat, Nov 15
Nov 14 2025
This task was an action item from one of our team meetings following the Py Evaluator outage; IIRC, it was @DSantamaria's proposal to set up a user-facing system availability endpoint.
Still testing the new queries in alerting but the setup is complete
Nov 13 2025
It may be possible for us to set up a user-facing page that behaves as a type of health check; we could reuse the scripts from our deploy process that ping our services and ensure they are alive and responsive. Starting off, the page would be sparse, albeit usable. What do you think about wiring this up and then beautifying it next year (next fiscal quarter(s))?
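A minimal sketch of what such a page could aggregate, assuming Node 18+ (for built-in `fetch`); the service names, URLs, and ports are illustrative assumptions, not the real endpoints:

```javascript
// Illustrative sketch of a user-facing availability check: ping each
// service and roll the results up into one page-level status.
// Service names/URLs below are placeholders, not the production config.
const SERVICES = [
  { name: 'function-orchestrator', url: 'http://localhost:6254/_info' },
  { name: 'function-evaluator', url: 'http://localhost:6927/_info' },
];

// Ping one service; a non-OK response, timeout, or network error counts as down.
async function checkService({ name, url }) {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(2000) });
    return { name, up: res.ok };
  } catch (err) {
    return { name, up: false };
  }
}

// Aggregate individual check results into a single status for the page.
function summarizeHealth(results) {
  const down = results.filter((r) => !r.up).map((r) => r.name);
  return { status: down.length === 0 ? 'available' : 'degraded', down };
}
```

The deploy-time liveness scripts could plug in as the `checkService` step, with `summarizeHealth` rendering the sparse-but-usable page.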
Nov 6 2025
Nov 3 2025
Linked to this MR
Nov 2 2025
Oct 31 2025
Oct 29 2025
Oct 27 2025
Oct 22 2025
Oct 21 2025
Oct 18 2025
Oct 16 2025
In fetchObject.js, these blocks allow errors to be thrown without a try-catch. Given this function underwent some recent changes, could someone remind me why the errors are being allowed to be thrown this way?
Oct 15 2025
Oct 13 2025
Closing this since we have confirmed this in Production and are now successfully able to surface them on Grafana.

