Fri, Jun 15
List of metrics at https://phabricator.wikimedia.org/P7262; I'll remove those if the list looks good.
Thu, Jun 14
Wed, Jun 13
I'm resolving this task since we're now alerting on uncorrectable memory errors found by EDAC. Uncorrectable errors result in either a kernel panic or a SIGBUS to the affected process. See T197084: Report problems found in server's IPMI SEL and, more importantly, T197086: Report problems found by mcelog for followups.
I researched the "panic on uncorrectable errors" behavior a bit, and it turns out it's not edac but the machine check framework that already takes care of panicking (or SIGBUS'ing the process) when uncorrectable errors are reported.
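For reference, the EDAC counters can also be inspected by hand via sysfs (standard edac kernel module paths, nothing host-specific):

# corrected (ce) and uncorrected (ue) error counts per memory controller
grep -H . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count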
EDAC correctable memory errors have been reported for this host; raising priority to high since the CPU temperature alerts also persist.
Tue, Jun 12
A bigger nail in the coffin for GET requests is also going to be enabling caching in Apache; at least for listinfo the information doesn't change frequently, so we can safely cache it for 30min or so.
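A sketch of what that could look like (stock mod_cache/mod_cache_disk directives; the path and TTL are placeholders):

sudo a2enmod cache cache_disk
# then in the vhost, something along these lines:
#   CacheEnable disk /mailman/listinfo
#   CacheRoot /var/cache/apache2/mod_cache_disk
#   CacheDefaultExpire 1800   # 30min
sudo systemctl reload apache2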
@Cmjohnson ok! Thanks, I'll be removing the machine from swift tomorrow
Looks like the high load is back, with a whole lot of listinfo requests
The latest rsyslog release containing the fix is already packaged in Debian unstable; it'd be easier to backport that to stretch instead of jessie. Once we have a replacement for lithium in place (T195416) and running stretch, I'll test the backport there.
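The backport itself should be the usual recipe, roughly (assuming a deb-src entry for unstable and devscripts installed):

apt-get source rsyslog/unstable
cd rsyslog-*
dch --bpo 'Rebuild for stretch'
dpkg-buildpackage -us -uc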
Thanks @Cmjohnson! Please treat this with urgency; do you know if there's an ETA? If it's more than a couple of days I'll remove the machine from swift.
Yeah, I think it might have been the controller barfing and the disk is actually ok. I couldn't find related logs on lithium though, so it's hard to know for sure. The disk can be sent back; we'll order it again if need be.
Mon, Jun 11
Sun, Jun 10
May 16 2018
WRT the ms-fe servers (ms-fe1008 and ms-fe1007), please move them to asw2 and reallocate them to two different physical racks.
Ditto for some Thumbor headers:
May 15 2018
May 14 2018
Resolving; the swift (sys)log was fixed a while ago but this task was never resolved.
May 11 2018
For sure! It means the drive(s) are not healthy according to smartmontools. I'll add some details about this to https://wikitech.wikimedia.org/wiki/SMART but tl;dr smartctl --health /dev/bus/0 -d <DEVICE> will show why.
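e.g. for a drive behind a megaraid controller (the -d argument selects the device on the controller; the device number here is just an example):

sudo smartctl --health /dev/bus/0 -d megaraid,5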
May 10 2018
Nice work! Looking forward to see this working in beta.
For swift / ms servers the requirements are as follows:
- ms-fe* to be depooled and moved one at a time (depool sketch below).
- ms-be* to be moved one at a time, just a clean poweroff is enough, no depooling needed.
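A depool sketch (conftool syntax from memory; the host name is just an example):

sudo confctl select 'name=ms-fe1005.eqiad.wmnet' set/pooled=no
# move the host, then repool:
sudo confctl select 'name=ms-fe1005.eqiad.wmnet' set/pooled=yes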
Agreed with @ayounsi; please spread said servers across racks in row C as much as possible. I'll be on vacation starting Thurs 17th, but I can assist with the move before then.
May 9 2018
4 | May-06-2018 | 04:46:06 | Mem ECC Warning | Memory | transition to Non-Critical from OK ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 40h
wtp2013:~$ sudo ipmi-sel
ID | Date        | Time     | Name            | Type                   | Event
1  | Jan-15-2015 | 23:04:45 | SEL             | Event Logging Disabled | Log Area Reset/Cleared
2  | Dec-21-2016 | 01:41:38 | Mem ECC Warning | Memory                 | transition to Non-Critical from OK ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
3  | Dec-21-2016 | 01:41:39 | Mem ECC Warning | Memory                 | transition to Critical from less severe ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
4  | Dec-14-2017 | 07:21:25 | Mem ECC Warning | Memory                 | transition to Non-Critical from OK ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
5  | Dec-14-2017 | 07:21:25 | Mem ECC Warning | Memory                 | transition to Critical from less severe ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
6  | Feb-22-2018 | 01:27:39 | Mem ECC Warning | Memory                 | transition to Non-Critical from OK ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
7  | Feb-22-2018 | 03:08:43 | Mem ECC Warning | Memory                 | transition to Critical from less severe ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
8  | Apr-24-2018 | 17:56:42 | Mem ECC Warning | Memory                 | transition to Non-Critical from OK ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
9  | Apr-24-2018 | 20:07:26 | Mem ECC Warning | Memory                 | transition to Critical from less severe ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
May 8 2018
The correctable errors check has been deployed and it is already yielding some results. @herron and I took a look at the list of hosts and there seem to be a few different "classes" or "states" (a quick way to bucket a host by hand is sketched after the list):
- high count of CEs and recent kernel messages
- low count of CEs and no recent kernel messages
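A quick way to bucket a host by hand (EDAC sysfs counters plus recent kernel messages):

grep -H . /sys/devices/system/edac/mc/mc*/ce_count
journalctl -k --since '7 days ago' | grep -i edac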
Rebalance has completed, resolving
See preliminary comments inline. Something else to keep in mind wrt big files: swift is limited by default to 5GB for a single object. Going over that means using either SLOs or DLOs: https://docs.openstack.org/swift/latest/overview_large_objects.html
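e.g. with the stock swift CLI the segmentation is handled client-side (container/file names are placeholders):

# uploads 1GB segments plus a DLO manifest; add --use-slo for a static large object
swift upload --segment-size 1073741824 <container> <big-file>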
@thcipriani for sure! The package is built; LMK when it's available and we'll deploy it
May 7 2018
Upstream has fixed the issue, should be included in the next rsyslog release. When that happens we'll try it out on the central syslog servers.
May 4 2018
In a Prometheus world CPU utilization is calculated from the number of seconds each CPU has spent in each mode, from the numbers in /proc/stat. e.g. https://grafana.wikimedia.org/dashboard/db/host-overview uses that for its CPU utilization panel, divided by the number of cores to normalize the graph to 100%. There's more information at https://www.robustperception.io/understanding-machine-cpu-usage/. AFAICS the graphs in labs-capacity-planning are using graphite/diamond as their source; were you looking to port the dashboard to Prometheus instead?
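For reference, the expression boils down to something like this (a sketch: the node_cpu metric name matches node_exporter of this era, and the Prometheus URL is a placeholder):

# averaging across cores is the same as dividing by the number of cores
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=100 - avg by (instance) (rate(node_cpu{mode="idle"}[5m])) * 100'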
Thanks for kickstarting this! +1, having syslogs in ELK would be very useful indeed. Some partial answers to the things to figure out:
That's the current behavior of the check, i.e. when things are ok it exits 0 with no output. We can change it to print "OK" or something similar, and perhaps the values/thresholds too.
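i.e. roughly the standard Nagios plugin convention (a sketch; $ce_count is a hypothetical value the check would have computed):

if [ "$ce_count" -eq 0 ]; then
    echo "OK: no correctable errors found (ce_count=${ce_count})"
    exit 0
else
    echo "WARNING: correctable errors found (ce_count=${ce_count})"
    exit 1
fi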
May 3 2018
Thanks for the feedback!
May 2 2018
@Gilles done! Should be good to go
Done! For reference, the commands I used (note this package has -2 as its Debian revision, thus the upstream source is already uploaded; we are only changing the built packages).
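A sketch of a binaries-only build/upload rather than the verbatim listing (the dput target and .changes file name are placeholders):

# build binary packages only (-b); the -2 revision means the source is untouched
dpkg-buildpackage -b -us -uc
dput <target> python-logstash_<version>-2_amd64.changes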
@Gilles sounds good, can you send a gerrit review against operations/debs/python-logstash instead since I've imported the package there after the first upload?
There was a spike of 500s yesterday in codfw, seemingly from search.wikimedia.org (tracked at T193600)
Apr 30 2018
I've put together a sample dashboard to play around with some concepts/ideas that emerged in this task at https://grafana.wikimedia.org/dashboard/db/dashboard-redesign-proposal . Notably missing is the navigation story among different dashboards; the tl;dr is that it would be based on dashboard tags to create dropdowns. Which groupings/dropdown menus make sense is still TBD.
Upstream has merged the changes I submitted, the Debian package has been uploaded to stretch-wikimedia and the puppetization merged. Resolving for now.
Apr 27 2018
Apr 26 2018
Apr 25 2018
While investigating cronspam from recent reimages I took a look at mw1247 (for example) and noticed it has two disks but no software raid (T106381). I think we should also fix that while we're reimaging with Stretch anyway.
I sent some changes upstream that I think would be beneficial, https://github.com/Dev25/mcrouter_exporter/pull/3
Apr 24 2018
I've gone ahead and reimaged restbase1010; all Cassandra instances are masked ATM, but the host is otherwise good to be tested again.
@Cmjohnson: confirmed, the raid config is the same on all of those. I rebooted the hosts showing the incorrect order, and indeed upon reboot the order is as expected:
We're not backing up graphite's data directory, though metrics are mirrored to codfw too, so we can copy back from there. Which files do you need?
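e.g. pulling a subtree back from the codfw graphite host would look roughly like this (host name and metric path are illustrative):

rsync -av graphite2001.codfw.wmnet:/var/lib/carbon/whisper/<metric-path>/ \
      /var/lib/carbon/whisper/<metric-path>/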
Host being setup in T191896: Rack and setup ms-be1040-1043
Looks like 3 out of 4 hosts have sda or sdb as one of the HDDs, not the SSDs. The remaining host has sda/sdb as SSDs and two additional mdadm raid arrays.
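For reference, a quick way to check which block devices are rotational (ROTA 1 = HDD, 0 = SSD):

lsblk -d -o NAME,ROTA,SIZE,MODEL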
No planned upgrades ATM, though a newer upstream version might help with understanding (hopefully fixing) T192456: Prometheus metrics missing for some hosts too, so definitely welcome!
@Cmjohnson restbase1010 is powered down and ready to have all of its SSDs swapped
Apr 23 2018
+1 to remove atop as a daemon/cron, possibly the package altogether too
Alerted today; a real but short-lived issue. Note that the alert is a single one even though its text can change over time (e.g. when more sites alert), so icinga needs to be instructed to re-alert whenever the text changes. Other improvements include printing the "worst" value found among all metrics that match the query.
I'll be helping with mcrouter_exporter packaging/setup/etc. I tried it and it looks like it is doing the right thing (though it asks mcrouter directly rather than using stats files)