Following up from the meeting, one source of truth for running/reachable instances is in the node_debian_version Prometheus metric, just as an example:
Fri, Jun 12
Thu, Jun 11
In T428867#12008912, @Volans wrote: The systemd unit is clearly failed:
● backup_cinder_volumes.service loaded failed failed backup cinder volumes
● remove_dangling_cinder_snapshots.service loaded failed failed backup cinder volumes
As for why it didn't alert, I think it might be related to team-wmcs/general_systemd_unit_down.yaml
# deploy-tag: ops
# deploy-site: eqiad
Wed, Jun 10
@taavi do you reckon there's anything in this task that's not covered by checks in T328502: Move WMCS off of Icinga and introduce alertmanager?
Tue, Jun 9
This is done: the labs-ip-alias-dump Icinga check has been removed in Ib8d290, and the flavor property check is tracked in the parent task
Mon, Jun 8
Following up from the team meeting: only the k8s etcd checks are still relevant to port to AM, the rest can be ditched
I'll call this done and good enough™: I tackled the low-hanging fruit, namely lowering the webservice defaults significantly, which has increased memory utilization. A side effect is that new tools get the lower defaults, which slows the overall growth of memory requests we have to satisfy at all times.
Thu, May 28
Wed, May 27
This is done
This is done: toolsbeta alerts get rewritten at deploy time to both change severity and strip the page tag from annotations
Now roll-restart works as expected
I just did it™ in toolsbeta:
Tue, May 26
In T427204#11955072, @Sebastian_Berlin-WMSE wrote: Looks like you linked to the "secret" panel again 🙂 I took a shot at removing the "-rw" at the start and that worked.
In T427204#11954411, @Sebastian_Berlin-WMSE wrote: That's quite possible. I haven't really checked CPU usage. I was just hoping more CPUs means more faster 🙂 I'd guess that the CPU usage fluctuates for the different steps in the process that the tool runs.
My ulimit -n 4096 bandaid is deployed in cloud/cicd/gitlab-ci now; I don't know whether there's appetite to ship a smaller ulimit -n as a whole on gitlab workers. Either way I'm untagging wmcs, feel free to resolve/decline as you see fit
Thank you for the detailed explanation @cmooney, definitely TIL things about BFD I didn't know! It's hard for me to say whether this is rare enough; metric queries of the form bird_bfd_session_up{instance=~"^cloudlb.*"} == 0 seem to confirm this is the first time we've seen it on cloudlb. There's likely something we can do in terms of alerts to at least detect the issue, as a temporary measure until we get new switches
Thank you for the detailed report @Sebastian_Berlin-WMSE!
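For illustration only, here is a minimal Python sketch of what the `== 0` query above selects, assuming samples have already been fetched as simple mappings. The function name, the sample layout, and the hostnames are hypothetical and not taken from the task; PromQL regex matchers are fully anchored, which `fullmatch` mirrors.

```python
import re

def sessions_down(samples, instance_pattern=r"^cloudlb.*"):
    """Return instances matching the pattern whose BFD session gauge is 0.

    `samples` is a hypothetical list of {"instance": str, "value": float}
    mappings standing in for bird_bfd_session_up results.
    """
    regex = re.compile(instance_pattern)
    return [
        s["instance"]
        for s in samples
        # fullmatch mirrors PromQL's fully anchored regex matching
        if regex.fullmatch(s["instance"]) and s["value"] == 0
    ]

# Hypothetical sample data, for illustration only
example = [
    {"instance": "cloudlb-example-1", "value": 1},
    {"instance": "cloudlb-example-2", "value": 0},
    {"instance": "other-example-1", "value": 0},
]
print(sessions_down(example))
```

An alert built on the same idea would fire whenever this list is non-empty for longer than some grace period.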
Mon, May 25
Yes, I'm happy with pointing rabbit clients to the whole cluster and letting client-side logic handle failures/retries; the latest rounds of reboots have been successful
This is deployed, designate now uses zk for coordination. A rolling restart of cloudcontrol is dealt with by tooz as expected, e.g.
Fri, May 22
Thu, May 21
To test this theory I changed webservice-cli gitlab-ci to lower the open files ulimit
The problem seems to be the combination of fakeroot and a huge ulimit -n: fakeroot spends all its time closing file descriptors up to the limit
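A minimal sketch of why the limit matters, assuming the time goes into a loop that calls close() on every possible descriptor number up to the soft limit (the exact mechanism inside fakeroot is not confirmed here, and `descriptors_to_close` is a hypothetical helper):

```python
import resource

def descriptors_to_close(soft_limit: int, first_fd: int = 3) -> int:
    """Number of close() calls a naive 'close everything' loop would make.

    Descriptors 0-2 (stdin/stdout/stderr) are assumed to be left open.
    """
    return max(0, soft_limit - first_fd)

if __name__ == "__main__":
    # Soft limit of the current process, i.e. what `ulimit -n` reports
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        print("soft limit is unlimited")
    else:
        print(f"current soft limit {soft}: {descriptors_to_close(soft)} close() calls")
    print(f"with ulimit -n 4096: {descriptors_to_close(4096)} close() calls")
```

The work grows linearly with the limit, so lowering it to 4096 bounds the loop at a few thousand calls no matter how high the worker default is.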
Wed, May 20
Status update: I tried deploying the 64MB memory request change, but toolsbeta rejected it because of the limitrange
In T377568#11931844, @Andrew wrote: After some discussion today, I propose that we just switch off and decom cloudnet200[78]-dev.
Mon, May 18
Verified on lima-kilo on Linux, nuked the VM when ./start-devenv.sh asked and ran the verification commands
Fri, May 15
+infra-foundations JFYI / for visibility and feedback, not urgent in any shape or form
@Jclark-ctr once T426180 is resolved and hosts can be reimaged, please rack as follows
May 14 2026
Thank you for following up @tappof! I'll give the implementation a go in the next couple of weeks and report back in case I need help; I'll reach out for reviews for sure
The first reduction in default memory requests has been deployed; as expected, we're now under the alerting threshold for memory requests (from ~88% to ~76%)
May 13 2026
+1
May 8 2026
May 7 2026
I'm adding o11y folks for their input, both on the idea as a whole and on the proposed implementation. For context, this is not urgent on the Toolforge side; it's more of a "nice to have", though the current behavior has confused folks looking at Toolforge alerts. If the idea looks sane I can take on the implementation (modulo the usual work scheduling)
May 6 2026
May 5 2026
May 2 2026
Apr 30 2026
Apr 29 2026
Upstream bug at https://bugs.launchpad.net/oslo.messaging/+bug/2150632
Apr 28 2026
I crunched some numbers today to see the resource distribution across racks:
Apr 27 2026
Apr 24 2026
Apr 23 2026
I started from webservice-cli limits, and was thinking of the following deployment plan:
Hello and thank you for reaching out. Wikimedia Foundation offers and supports a platform for running managed user workloads called Toolforge (https://wikitech.wikimedia.org/wiki/Portal:Toolforge). In practice it means having containers running on k8s, though with fewer per-container resources (e.g. 4GB). Could WISE run distributed/partitioned/sharded across multiple containers, and thus run on Toolforge? Running on Toolforge, among other benefits, would mean not having to operate/admin a VM and instead running on a supported platform. Please let us know!
I have listed the blockers below, and I'm having a hard time understanding why Toolforge would not work with a fully containerized deployment. Would you mind expanding on the details of each blocker and how Toolforge does not satisfy it? Thank you.
Hello, thank you for your interest in Cloud VPS. From the description it seems the software can be containerized and run on Toolforge, which is the preferred and supported way to run software. What are the expected resource requirements? And could you provide example interactions with the software you intend to run? If the code is already available, please provide links to the source code as well. Thank you!
Apr 20 2026
In T423598#11837669, @fgiunchedi wrote: I think we should be considering importing osbpo.debian.net apt repo as an upstream into aptrepo puppet module (i.e. modules/aptrepo/files/updates and related) and serve it locally from apt.w.o.
Is archive.ubuntu.com working now? Was it the only host failing?
In T423598#11838641, @MoritzMuehlenhoff wrote:
In T423598#11837669, @fgiunchedi wrote: I think we should be considering importing osbpo.debian.net apt repo as an upstream into aptrepo puppet module (i.e. modules/aptrepo/files/updates and related) and serve it locally from apt.w.o.
How big is that component in total?
I took a look at the codfw setup and there is one thing to change: the tooz backend url should list all zk servers, so we can safely roll-restart zookeeper.service as well and designate/tooz will fail over
In T422916#11830798, @bd808 wrote:
In T422916#11828441, @fgiunchedi wrote: That's fair re: user confusion concerns. From my SRE POV I was surprised to find that the CSP report url we announce filters the feed of legitimate, albeit confusing to tool maintainers, reports. I am thinking of a middle ground where we collect all reports and present the report firehose unfiltered only on demand. The known-domains retention of course can be short as we don't really care for it except for operational problems. What do you think?
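As a rough illustration of listing all zk servers, the sketch below builds a backend URL from a list of hosts. It assumes the tooz ZooKeeper driver accepts a comma-separated host list in the URL, which should be checked against the tooz documentation for the deployed version; the helper name, hostnames, and port are hypothetical.

```python
def tooz_zookeeper_url(hosts, port=2181):
    """Build a coordination backend URL naming every ZooKeeper server.

    Assumption: the driver accepts "host:port,host:port" after the scheme.
    """
    if not hosts:
        raise ValueError("at least one ZooKeeper host is required")
    netloc = ",".join(f"{host}:{port}" for host in hosts)
    return f"zookeeper://{netloc}"

# Hypothetical hostnames, for illustration only
print(tooz_zookeeper_url(["zk1.example.org", "zk2.example.org", "zk3.example.org"]))
```

With every server listed, the client can reconnect to a surviving member while one zookeeper.service instance restarts.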
I'm not sure in the current pipeline where I would store reports with a different retention pattern or where I would insert output filtering to screen that noise from the maintainer facing interface. I don't have any objections to someone figuring those things out and implementing them if it feels like it would add value for administrative investigations.
Thank you for letting us know @YochayCO, appreciate it!