fgiunchedi (Filippo Giunchedi)
/* No comment */

User Details

User Since
Oct 3 2014, 8:06 AM (602 w, 2 h)
Availability
Available
IRC Nick
godog
LDAP User
Filippo Giunchedi
MediaWiki User
FGiunchedi (WMF)

Recent Activity

Yesterday

fgiunchedi closed T423378: Do not share shm files for openstack uwsgi processes as Resolved.

This is completed

Thu, Apr 16, 2:14 PM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS
fgiunchedi closed T423378: Do not share shm files for openstack uwsgi processes, a subtask of T421054: Move all openstack rabbitmq queues to quorum, as Resolved.
Thu, Apr 16, 2:14 PM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS
fgiunchedi updated the task description for T423560: Consider splitting tf-infra-test into multiple "test suites".
Thu, Apr 16, 9:07 AM · cloud-services-team
fgiunchedi created T423560: Consider splitting tf-infra-test into multiple "test suites".
Thu, Apr 16, 9:06 AM · cloud-services-team
fgiunchedi added a comment to T422916: CSP violations with known domains in the blocked-uri are not collected by csp-report.

That's fair re: user confusion concerns. From my SRE POV I was surprised to find that the CSP report URL we announce filters out legitimate reports, confusing as they may be to tool maintainers. I am thinking of a middle ground where we collect all reports and present the unfiltered report firehose only on demand. The known-domains retention can of course be short, as we don't really care about it except for operational problems. What do you think?

Thu, Apr 16, 8:22 AM · Tools

Wed, Apr 15

fgiunchedi created T423378: Do not share shm files for openstack uwsgi processes.
Wed, Apr 15, 7:52 AM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS

Tue, Apr 14

fgiunchedi added a comment to T420565: Audit tools memory requests vs actual usage.

Thinking about this problem a little more: we would be lowering the default memory request while leaving the limit untouched, so I think it should be safe to do: many tools are already exceeding their requests today, though not hitting the limit. I'm for testing a 128mb default memory request and taking it from there, what do you think?

Tue, Apr 14, 9:46 AM · tools-platform-team, cloud-services-team, Toolforge
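A minimal sketch of the reasoning in the comment above, assuming the usual Kubernetes semantics where the request drives scheduling and the limit is the enforced cap. The function name and the sample figures are hypothetical and do not come from the task.

```python
def status_after_lowering(usage_mib, limit_mib, new_request_mib=128):
    """Classify a tool's memory usage against a lowered default request.

    The limit is left untouched, so only usage at or above the limit is
    a hard failure; exceeding the request is allowed.
    """
    if usage_mib >= limit_mib:
        return "at or above limit"
    if usage_mib > new_request_mib:
        return "above request, below limit"
    return "within request"


if __name__ == "__main__":
    # Hypothetical usage figures in MiB, with an assumed 512 MiB limit.
    for usage in (90, 300, 512):
        print(usage, status_after_lowering(usage, limit_mib=512))
```

Tools in the middle bucket keep running after the change; they are the ones worth watching during a test of the lower default.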
fgiunchedi renamed T422646: memcache is a SPOF for designate/tooz coordination from Designate API timing out to memcache is a SPOF for designate/tooz coordination.
Tue, Apr 14, 8:40 AM · Cloud-VPS, cloud-services-team (FY2025/2026-Q3-Q4)
fgiunchedi added a comment to T422646: memcache is a SPOF for designate/tooz coordination.

The two failures I can identify are: oslo.messaging not failing over (T422820) and tooz lamenting that memcached is unavailable. I'm going to rename this task to address the latter.

Tue, Apr 14, 8:38 AM · Cloud-VPS, cloud-services-team (FY2025/2026-Q3-Q4)
fgiunchedi closed T418444: Increased openstack latency and rabbitmq rolling restarts on certificate update as Resolved.

This is done, rabbit/openstack are now able to survive a rabbit host or server process going down (i.e. all durable queues) and automatically reload certs without a restart

Tue, Apr 14, 7:13 AM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS
fgiunchedi closed T421054: Move all openstack rabbitmq queues to quorum, a subtask of T418444: Increased openstack latency and rabbitmq rolling restarts on certificate update, as Resolved.
Tue, Apr 14, 7:09 AM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS
fgiunchedi closed T421054: Move all openstack rabbitmq queues to quorum as Resolved.

This is done in eqiad and codfw

Tue, Apr 14, 7:09 AM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS
fgiunchedi closed T421857: Move trove DB instances to rabbitmq transient quorum queues, a subtask of T421054: Move all openstack rabbitmq queues to quorum, as Resolved.
Tue, Apr 14, 7:04 AM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS
fgiunchedi closed T421857: Move trove DB instances to rabbitmq transient quorum queues as Resolved.
Tue, Apr 14, 7:04 AM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS
fgiunchedi added a comment to T421857: Move trove DB instances to rabbitmq transient quorum queues.

Done in codfw too. I left the security groups in place also in light of T422801

Tue, Apr 14, 7:04 AM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS

Mon, Apr 13

fgiunchedi added a comment to T421857: Move trove DB instances to rabbitmq transient quorum queues.

This is completed in eqiad1, codfw next

Mon, Apr 13, 9:59 AM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS
fgiunchedi added a comment to T421857: Move trove DB instances to rabbitmq transient quorum queues.

I have finished applying the configuration change to all eqiad1 trove instances, this time around more or less manually. Next up is deleting the exchanges and restarting guest-agent

Mon, Apr 13, 9:41 AM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS
fgiunchedi edited P90475 (An Untitled Masterwork).
Mon, Apr 13, 9:08 AM
fgiunchedi created P90475 (An Untitled Masterwork).
Mon, Apr 13, 9:07 AM
fgiunchedi added a comment to T421911: Keystone logs no longer appearing in logstash.

There isn't very much more info though I opened T422830: Openstack uwsgi logging to '<frozen importlib._bootstrap>.log'

Mon, Apr 13, 7:33 AM · Cloud-VPS, User-aborrero, cloud-services-team

Fri, Apr 10

fgiunchedi added a comment to T422820: oslo.messaging does not failover to the next rabbit host on traffic blackhole situations.

I went back through the cloudcontrol1007 logs to see how extensive this problem is, P90364 contains logs across openstack components sorted by time and filtered for when they reconnected to rabbit. It looks like some components did reconnect as expected when cloudrabbit1001 went down, I haven't looked deep into why/how though

Fri, Apr 10, 2:18 PM · Cloud-VPS, cloud-services-team (FY2025/2026-Q3-Q4)
fgiunchedi updated the task description for T422646: memcache is a SPOF for designate/tooz coordination.
Fri, Apr 10, 1:13 PM · Cloud-VPS, cloud-services-team (FY2025/2026-Q3-Q4)
fgiunchedi added a comment to T422646: memcache is a SPOF for designate/tooz coordination.

I saved all openstack component logs from /var/log to /root/filippo-T417393 on cloudcontrol1007 and cloudcontrol1011, to temporarily save them from rotation and allow further investigation.

Fri, Apr 10, 1:06 PM · Cloud-VPS, cloud-services-team (FY2025/2026-Q3-Q4)
fgiunchedi updated the task description for T422646: memcache is a SPOF for designate/tooz coordination.
Fri, Apr 10, 10:52 AM · Cloud-VPS, cloud-services-team (FY2025/2026-Q3-Q4)
fgiunchedi created T422916: CSP violations with known domains in the blocked-uri are not collected by csp-report.
Fri, Apr 10, 9:21 AM · Tools

Thu, Apr 9

fgiunchedi created T422830: Openstack uwsgi logging to '<frozen importlib._bootstrap>.log'.
Thu, Apr 9, 1:52 PM · tools-platform-team, Cloud-VPS
fgiunchedi created T422829: Toolforge HTML head links sometimes are issued as http://<tool>.toolforge:443.
Thu, Apr 9, 1:27 PM · Toolforge, cloud-services-team
fgiunchedi updated the task description for T422820: oslo.messaging does not failover to the next rabbit host on traffic blackhole situations.
Thu, Apr 9, 1:01 PM · Cloud-VPS, cloud-services-team (FY2025/2026-Q3-Q4)
fgiunchedi added a comment to T417393: Carry out controlled network switch down tests in cloud.

cloudcontrol nodes not in C8 (i.e. 1006/1007), though, didn't seem to give up trying to connect to rabbitmq01.eqiad1.wikimediacloud.org:5671, whereas cloudcontrol1011 stopped trying to talk to rabbitmq01 as expected.

I tuned oslo settings for rabbitmq timeout &c but it was a long time ago, probably before our current rabbitmq setup. So we should do some new testing and reviewing of those settings. This blog post is very old but somewhat relevant to the topic: https://medium.com/@george.shuklin/rabbit-heartbeat-timeouts-in-openstack-fa5875e0309a

Thu, Apr 9, 12:56 PM · Cloud-VPS, cloud-services-team (FY2025/2026-Q3-Q4)
fgiunchedi created T422820: oslo.messaging does not failover to the next rabbit host on traffic blackhole situations.
Thu, Apr 9, 12:56 PM · Cloud-VPS, cloud-services-team (FY2025/2026-Q3-Q4)
fgiunchedi updated the task description for T422801: Consider allowing cumin access to all Cloud VPS VMs.
Thu, Apr 9, 10:15 AM · tools-platform-team, Cloud-VPS
fgiunchedi created T422801: Consider allowing cumin access to all Cloud VPS VMs.
Thu, Apr 9, 10:09 AM · tools-platform-team, Cloud-VPS
fgiunchedi added a comment to T422515: wmcs cookbook "--project" arg is ambiguous, could mean project id or project name.

The other aspect to consider, which was the culprit in this case, is OS_PROJECT_ID vs OS_PROJECT_NAME usage (+OS_PROJECT_DOMAIN_NAME, which is default AFAICT).

Thu, Apr 9, 8:46 AM · Cloud-VPS, cloud-services-team
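To illustrate the ambiguity described in the comment above, here is a small sketch that reports whether a project scope is selected by ID or by name. The function name is hypothetical, raising when both variables are set is a choice of this sketch rather than documented client behaviour, and the "default" domain fallback is an assumption taken from the comment.

```python
def project_scope(env):
    """Describe how the OpenStack project is selected in an environment mapping."""
    project_id = env.get("OS_PROJECT_ID")
    project_name = env.get("OS_PROJECT_NAME")
    if project_id and project_name:
        # Sketch choice: refuse to guess which one should win.
        raise ValueError("both OS_PROJECT_ID and OS_PROJECT_NAME are set")
    if project_id:
        return {"by": "id", "project": project_id}
    if project_name:
        # Assumed fallback, per the comment above ("default AFAICT").
        domain = env.get("OS_PROJECT_DOMAIN_NAME", "default")
        return {"by": "name", "project": project_name, "domain": domain}
    raise ValueError("no project scope set")


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(project_scope({"OS_PROJECT_NAME": "example-project"}))
    print(project_scope({"OS_PROJECT_ID": "0123abcd"}))
```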
fgiunchedi added a comment to T422509: Cloud init and unattended upgrades while bootstrapping Trixie VMs.

Alternatively we can ask unattended-upgrades to not do anything until cloud-init has finished, though I'd rather avoid fixing up the fix up

Thu, Apr 9, 8:33 AM · Cloud-VPS, cloud-services-team
fgiunchedi added a comment to T422509: Cloud init and unattended upgrades while bootstrapping Trixie VMs.

Indeed under normal circumstances cloud-init will try to bring back puppet to 7 after the first puppet run (from modules/openstack/templates/nova/vendordata.txt.erb)

Thu, Apr 9, 8:30 AM · Cloud-VPS, cloud-services-team

Wed, Apr 8

fgiunchedi added a comment to T422646: memcache is a SPOF for designate/tooz coordination.

Following up from IRC: stopping memcached on all cloudcontrols, together with all designate servers, then restarting memcached and designate seems to have brought things back

Wed, Apr 8, 2:22 PM · Cloud-VPS, cloud-services-team (FY2025/2026-Q3-Q4)
fgiunchedi added a comment to T417393: Carry out controlled network switch down tests in cloud.

FWIW the oslo timeout issue looks to me a whole lot like https://bugs.launchpad.net/oslo.messaging/+bug/2096926

Wed, Apr 8, 2:21 PM · Cloud-VPS, cloud-services-team (FY2025/2026-Q3-Q4)
fgiunchedi created T422646: memcache is a SPOF for designate/tooz coordination.
Wed, Apr 8, 1:32 PM · Cloud-VPS, cloud-services-team (FY2025/2026-Q3-Q4)
fgiunchedi added a comment to T417393: Carry out controlled network switch down tests in cloud.

Today cloudrabbit1001 and cloudcontrol1011 were tested:

Wed, Apr 8, 9:38 AM · Cloud-VPS, cloud-services-team (FY2025/2026-Q3-Q4)

Tue, Apr 7

fgiunchedi added a comment to T422515: wmcs cookbook "--project" arg is ambiguous, could mean project id or project name.

For context / more info

Tue, Apr 7, 3:25 PM · Cloud-VPS, cloud-services-team
fgiunchedi created T422515: wmcs cookbook "--project" arg is ambiguous, could mean project id or project name.
Tue, Apr 7, 3:00 PM · Cloud-VPS, cloud-services-team
fgiunchedi updated the task description for T421857: Move trove DB instances to rabbitmq transient quorum queues.
Tue, Apr 7, 10:12 AM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS
fgiunchedi added a comment to T421054: Move all openstack rabbitmq queues to quorum.

FS utilization has stabilized as segments are reclaimed

Tue, Apr 7, 7:08 AM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS

Fri, Apr 3

fgiunchedi renamed T421025: Add PTR record for azwikimedia (mail.wikimedia.az) from Add PTR record for azwikimedia to Add PTR record for azwikimedia (mail.wikimedia.az).
Fri, Apr 3, 10:05 AM · cloud-services-team, Cloud-VPS (Quota-requests)
fgiunchedi added a comment to T421025: Add PTR record for azwikimedia (mail.wikimedia.az).

opentofu-infra-diff.service is failing on cloudcontrol1007 wrt this:

Fri, Apr 3, 10:04 AM · cloud-services-team, Cloud-VPS (Quota-requests)

Wed, Apr 1

fgiunchedi added a comment to T411248: Plan to make clouddumps more resilient and easier to operate.

Update from network sync meeting: hosting read-only NFS behind LVS, like we're going to do for http/rsync (T306550), should be explored as a solution, as it is going to be easier and simpler to maintain than the shared IP. I need to investigate/test more, though at least conceptually read-only NFS should be fine from the client's POV

Wed, Apr 1, 1:28 PM · cloud-services-team (FY2025/2026-Q3-Q4), Data-Services, Cloud-VPS
fgiunchedi added a comment to T421054: Move all openstack rabbitmq queues to quorum.

I did some digging and the space is used by raft quorum logs via shared wal -> per-queue segments. The segments are single files on the filesystem and not actually deleted until the segment is full: https://www.rabbitmq.com/docs/quorum-queues#resource-use

Wed, Apr 1, 8:18 AM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS

Tue, Mar 31

fgiunchedi reopened T347017: rabbitmq: missing heartbeats issue as "Open".

Unfortunately I was too hasty: we do still see this, though it is unclear to me what the impact (if any) is

Tue, Mar 31, 3:32 PM · User-dcaro, Cloud-VPS, User-aborrero, cloud-services-team
fgiunchedi added a comment to T420993: Rotate discovery intermediate certificate (expires 2026-05-03).

Plan LGTM, I haven't tested the code changes though Pontoon can help with that for sure.

Tue, Mar 31, 3:11 PM · ServiceOps new, Infrastructure-Foundations, Patch-For-Review
fgiunchedi added a comment to T421054: Move all openstack rabbitmq queues to quorum.

Note that despite the graph we have ~300G free on the VG, plus the vg0/srv LV is not really used and we can reclaim its space in case it is needed

Is it a leftover from an old partman recipe or something?

Tue, Mar 31, 2:51 PM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS
fgiunchedi created T421857: Move trove DB instances to rabbitmq transient quorum queues.
Tue, Mar 31, 9:38 AM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS
fgiunchedi closed T347017: rabbitmq: missing heartbeats issue as Invalid.

No longer observed

Tue, Mar 31, 8:33 AM · User-dcaro, Cloud-VPS, User-aborrero, cloud-services-team
fgiunchedi created T421832: wmcs.openstack.restart_openstack attempts to restart services on decom cloudcontrol1005.
Tue, Mar 31, 8:08 AM · Cloud-VPS, tools-infrastructure-team, cloud-services-team (FY2025/2026-Q3-Q4)
fgiunchedi added a comment to T421054: Move all openstack rabbitmq queues to quorum.

Ok now all queues but trove-guestagent are using quorum/durable, I'll be looking into deploying that too next.

Tue, Mar 31, 7:37 AM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS

Mon, Mar 30

fgiunchedi added a comment to T421054: Move all openstack rabbitmq queues to quorum.

The final bits of flipping neutron-l3-agent to quorum queues will be done tomorrow at 7 UTC within a scheduled window. The actual work to be performed:

Mon, Mar 30, 12:29 PM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS
fgiunchedi added a comment to T421054: Move all openstack rabbitmq queues to quorum.

This took a bunch of tries today and despite my best attempts to mess with rabbit and oslo, openstack reacted reasonably well IMHO.

Mon, Mar 30, 11:44 AM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS
fgiunchedi closed T420923: rabbitmqctl list_queues in eqiad/codfw times out after 60s, a subtask of T418444: Increased openstack latency and rabbitmq rolling restarts on certificate update, as Resolved.
Mon, Mar 30, 9:19 AM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS
fgiunchedi closed T420923: rabbitmqctl list_queues in eqiad/codfw times out after 60s as Resolved.

This is fixed now, both rabbitmq server and CLI use ipv6 for erlang distribution protocol and its ports for CLI tools are open between rabbitmq servers

Mon, Mar 30, 9:19 AM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS

Fri, Mar 27

fgiunchedi added a comment to T420565: Audit tools memory requests vs actual usage.

Of immediate note is the fact that 50% of namespaces/tools use less than 20%, i.e. we could be reducing their requests by 4-5x

Can we check how many of the ones that use more memory are actually setting the limits themselves vs using the default values?
It would be great if we can reduce the defaults without needing any user to add extra config to their tools.

Fri, Mar 27, 11:29 AM · tools-platform-team, cloud-services-team, Toolforge
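The arithmetic behind the "4-5x" figure above can be sketched as follows: a tool using 20% of its request could have that request cut by a factor of 5, and one using 25% by a factor of 4. The tool names and numbers below are hypothetical and are not taken from the audit paste.

```python
from statistics import median

# Hypothetical (usage MiB, request MiB) pairs; not real audit data.
SAMPLE = {
    "tool-a": (100, 500),
    "tool-b": (60, 512),
    "tool-c": (400, 512),
}


def utilization(usage_mib, request_mib):
    """Fraction of the memory request that is actually used."""
    return usage_mib / request_mib


def reduction_factor(usage_mib, request_mib):
    """How many times smaller the request could be while still covering usage."""
    return request_mib / usage_mib


def audit(sample):
    """Map each tool to its utilization, rounded to two decimals."""
    return {name: round(utilization(u, r), 2) for name, (u, r) in sample.items()}


if __name__ == "__main__":
    report = audit(SAMPLE)
    print(report)
    print("median utilization:", median(report.values()))
    print("tool-a reduction factor:", reduction_factor(*SAMPLE["tool-a"]))
```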
fgiunchedi created P89954 (An Untitled Masterwork).
Fri, Mar 27, 11:28 AM
fgiunchedi created P89953 audit-deployments-T420565.
Fri, Mar 27, 11:27 AM

Wed, Mar 25

fgiunchedi added a comment to T421054: Move all openstack rabbitmq queues to quorum.

The oslo setting I mentioned rabbit_transient_quorum_queue refers to the transient queues (reply, fanout) that openstack manages on rabbit, as opposed to the "service" queues which are indeed already quorum.

Wed, Mar 25, 10:01 AM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS
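A hedged sketch of how the two kinds of queues map to configuration. Only rabbit_transient_quorum_queue is named in the comment above; the rabbit_quorum_queue option for the "service" queues and the [oslo_messaging_rabbit] section are my assumptions about oslo.messaging, and the False fallback is a choice of this sketch. The sample config is illustrative, not the deployed one.

```python
import configparser

# Illustrative config fragment, not copied from any deployed file.
SAMPLE = """
[oslo_messaging_rabbit]
rabbit_quorum_queue = true
rabbit_transient_quorum_queue = true
"""


def quorum_settings(text):
    """Report which queue classes are configured as quorum queues."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = "oslo_messaging_rabbit"
    return {
        # Long-lived per-service queues (assumed option name).
        "service_queues": parser.getboolean(
            section, "rabbit_quorum_queue", fallback=False
        ),
        # Reply and fanout queues, as described in the comment.
        "transient_queues": parser.getboolean(
            section, "rabbit_transient_quorum_queue", fallback=False
        ),
    }


if __name__ == "__main__":
    print(quorum_settings(SAMPLE))
```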
fgiunchedi edited projects for T418831: grafana.wmcloud.org unavailable - failed db migration, added: cloud-services-team (FY2025/2026-Q3-Q4); removed cloud-services-team.
Wed, Mar 25, 8:41 AM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS

Tue, Mar 24

fgiunchedi created T421054: Move all openstack rabbitmq queues to quorum.
Tue, Mar 24, 11:45 AM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS
fgiunchedi added a comment to T420923: rabbitmqctl list_queues in eqiad/codfw times out after 60s.

After some investigation I found the following:

  1. rabbitmq cli utils use tcp ports 35672-35682 to communicate with nodes. In "reverse" in the sense that the host running e.g. rabbitmqctl acts as a temporary server to which other nodes connect. These ports will need to be open on the firewall.
  2. erlang doesn't implement happy eyeballs for dualstack hosts thus even if the ports are open they need to bind to v6, thus we'll need at least RABBITMQ_CTL_ERL_ARGS="-proto_dist inet6_tcp" to make cli tools DTRT
Tue, Mar 24, 8:45 AM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS
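The two findings above can be sketched as a small helper: one function enumerating the port range to open, and one building the environment for a CLI invocation. The port range and the RABBITMQ_CTL_ERL_ARGS value come from the comment; the function names are hypothetical, and the sketch only prepares the environment without invoking rabbitmqctl.

```python
import os

# Port range quoted in the comment above (inclusive on both ends).
CLI_DIST_PORT_MIN = 35672
CLI_DIST_PORT_MAX = 35682


def cli_dist_ports():
    """Return the TCP ports the CLI tools may listen on."""
    return list(range(CLI_DIST_PORT_MIN, CLI_DIST_PORT_MAX + 1))


def rabbitmqctl_env(base_env=None):
    """Return a copy of the environment with IPv6 distribution set for CLI tools."""
    env = dict(os.environ if base_env is None else base_env)
    env["RABBITMQ_CTL_ERL_ARGS"] = "-proto_dist inet6_tcp"
    return env


if __name__ == "__main__":
    ports = cli_dist_ports()
    print(f"{len(ports)} ports to open: {ports[0]}-{ports[-1]}")
    print(rabbitmqctl_env({})["RABBITMQ_CTL_ERL_ARGS"])
```

The returned mapping could be passed as the `env` argument of `subprocess.run` when calling the CLI from a script.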

Mon, Mar 23

fgiunchedi created T420923: rabbitmqctl list_queues in eqiad/codfw times out after 60s.
Mon, Mar 23, 1:36 PM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS
fgiunchedi added a comment to T418444: Increased openstack latency and rabbitmq rolling restarts on certificate update.

Change deployed and rabbit roll-restarted:

Mon, Mar 23, 12:54 PM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS

Thu, Mar 19

fgiunchedi added a comment to T419967: Add --min-uptime to cookbooks.

FWIW I found some prior art / ideas here T367592: hadoop rolling reboot cookbook: add start-datetime flag

Thu, Mar 19, 10:58 AM · SRE-tools, serviceops-radar, Infrastructure-Foundations
fgiunchedi created T420565: Audit tools memory requests vs actual usage.
Thu, Mar 19, 10:28 AM · tools-platform-team, cloud-services-team, Toolforge
fgiunchedi created P89886 (An Untitled Masterwork).
Thu, Mar 19, 9:33 AM
fgiunchedi closed T419824: Add new k8s toolforge workers to cater for memory requests, a subtask of T414513: Add new alerts for Toolforge cluster high load, as Resolved.
Thu, Mar 19, 9:16 AM · cloud-services-team, Toolforge
fgiunchedi closed T419824: Add new k8s toolforge workers to cater for memory requests as Resolved.

This is done, however 32GB barely made a dent in the % requests vs available. Resolving and will follow up in parent task.

Thu, Mar 19, 9:16 AM · cloud-services-team (FY2025/2026-Q3-Q4), Toolforge

Wed, Mar 18

fgiunchedi added a comment to T418444: Increased openstack latency and rabbitmq rolling restarts on certificate update.

Today during T417393: Carry out controlled network switch down tests in cloud the same failure happened, namely cloudrabbit1001 was disconnected from the network and a partition was formed. When the host came back stopping and starting rabbit on the host eventually made things recover.

Wed, Mar 18, 10:28 AM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS
fgiunchedi added a comment to T417393: Carry out controlled network switch down tests in cloud.

Tests today went significantly better: cloud vps networking stayed intact, I did start with failing over cloudgw which meant hosts using anycast addresses already failed over: lb, services. Rabbit suffered from network partition (i.e. T418444) though stopping and starting rabbit on cloudrabbit1001 eventually made things recover. cloudcontrol1011 was the last host and not tested yet

Wed, Mar 18, 10:23 AM · Cloud-VPS, cloud-services-team (FY2025/2026-Q3-Q4)

Mar 17 2026

fgiunchedi added a comment to T419967: Add --min-uptime to cookbooks.

I also was wondering about a resumable rolling reboot feature for cookbooks and found this task, and of course I'm +1! The way I understand the feature currently is the following:

Mar 17 2026, 2:19 PM · SRE-tools, serviceops-radar, Infrastructure-Foundations
fgiunchedi created T420360: Add proxy support to cumin openstack backend.
Mar 17 2026, 2:08 PM · Infrastructure-Foundations, SRE-tools, Cumin
fgiunchedi closed T419996: cloudcumin not able to communicate with openstack.eqiad1.wikimediacloud.org:25000 anymore as Resolved.

This is fixed! Thank you all for your help, and will follow up with another task to get http proxy support for spicerack openstack backend (and other related can of worms!)

Mar 17 2026, 2:06 PM · Cloud-VPS, cloud-services-team
fgiunchedi created P89870 (An Untitled Masterwork).
Mar 17 2026, 8:03 AM

Mar 16 2026

fgiunchedi added a comment to T419996: cloudcumin not able to communicate with openstack.eqiad1.wikimediacloud.org:25000 anymore.

We discussed this in the team meeting today: to restore functionality I have https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1253574 out. I'll be following up with a specific spicerack task for the openstack backend to be able to use an http proxy.

Mar 16 2026, 5:11 PM · Cloud-VPS, cloud-services-team
fgiunchedi added a comment to T418444: Increased openstack latency and rabbitmq rolling restarts on certificate update.

Confirmed that rabbitmq reloads certs without a restart:

Mar 16 2026, 2:45 PM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS
fgiunchedi updated subscribers of T419996: cloudcumin not able to communicate with openstack.eqiad1.wikimediacloud.org:25000 anymore.

I agree cloudcumin talking via prod http proxy like any other client is the right fix here. @Volans what do you think of the above idea? namely get cumin O backend to talk through prod proxies to the openstack api? from a quick look at wmcs-cookbooks spicerack shouldn't be affected in the sense that openstack interaction happens through CLI anyways and thus works

Mar 16 2026, 2:17 PM · Cloud-VPS, cloud-services-team
fgiunchedi updated subscribers of T419996: cloudcumin not able to communicate with openstack.eqiad1.wikimediacloud.org:25000 anymore.

@taavi mentioned that https://gerrit.wikimedia.org/r/c/operations/homer/public/+/970275 might have broken this communication, which seems likely. Short of reverting that change what's the right approach here to make sure cloudcumin can talk to the openstack api? cc @ayounsi @cmooney

Mar 16 2026, 8:53 AM · Cloud-VPS, cloud-services-team

Mar 13 2026

fgiunchedi created T419996: cloudcumin not able to communicate with openstack.eqiad1.wikimediacloud.org:25000 anymore.
Mar 13 2026, 2:16 PM · Cloud-VPS, cloud-services-team

Mar 12 2026

fgiunchedi created T419877: Permanently set 'noout' for cloudceph.
Mar 12 2026, 3:43 PM · Ceph, Cloud-VPS, cloud-services-team (FY2025/2026-Q3-Q4)
fgiunchedi added a comment to T417393: Carry out controlled network switch down tests in cloud.

Thank you, I looked at cloudvirt.drain though I couldn't find an option specifically to make sure the destination host is not in the rack we are draining. Maybe not a huge issue though? The scenario I'm thinking about is we're draining a cloudvirt and all/most VMs migrate to another cloudvirt in the same rack, of course things would converge eventually at the risk of moving VMs a bunch of times.

Correct, those cookbooks are not rack aware at all. The most efficient process would be to run the set_maintenance script on all cloudvirts in a given rack and then drain the individual cloudvirts; that would avoid moving any VMs more than once.

Mar 12 2026, 3:39 PM · Cloud-VPS, cloud-services-team (FY2025/2026-Q3-Q4)
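The ordering suggested in the reply above (mark every cloudvirt in the rack as in maintenance first, then drain them one by one) can be sketched as a plan generator. The function name, the step labels, and the host and rack names are hypothetical; the sketch only produces an ordered list and does not call any cookbook.

```python
def rack_drain_plan(hosts_by_rack, rack):
    """Return ordered (action, host) steps for draining one rack.

    All hosts are put in maintenance before any drain starts, so no VM
    should be migrated onto a host that is about to be drained itself.
    """
    hosts = sorted(hosts_by_rack[rack])
    steps = [("set_maintenance", host) for host in hosts]
    steps += [("drain", host) for host in hosts]
    return steps


if __name__ == "__main__":
    # Hypothetical inventory for illustration only.
    inventory = {
        "rack-x": ["cloudvirt-b", "cloudvirt-a"],
        "rack-y": ["cloudvirt-c"],
    }
    for action, host in rack_drain_plan(inventory, "rack-x"):
        print(action, host)
```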
fgiunchedi added a comment to T419824: Add new k8s toolforge workers to cater for memory requests.

I'm also assuming that most requests come from the nfs workers above, to be verified once an nfs worker is added by checking how it changes the memory requests %

Mar 12 2026, 11:11 AM · cloud-services-team (FY2025/2026-Q3-Q4), Toolforge
fgiunchedi added a comment to T419824: Add new k8s toolforge workers to cater for memory requests.

Following the docs at https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes#Add_a_worker this is what I'm planning on running, then wait for completion, observe memory requests percentage at https://grafana.wmcloud.org/goto/cffsdds3j2juod?orgId=1 and repeat as needed to bring reservation % to say 70%

Mar 12 2026, 11:02 AM · cloud-services-team (FY2025/2026-Q3-Q4), Toolforge
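The comment above describes an iterative approach (add a worker, observe, repeat). As a rough companion, here is a sketch that estimates up front how many workers would bring the memory reservation down to a target percentage. The function name and all figures are hypothetical; the 70% target is the one mentioned in the comment.

```python
import math


def workers_needed(requested_gb, capacity_gb, worker_gb, target=0.70):
    """Estimate how many extra workers bring requests/capacity to the target."""
    if requested_gb / capacity_gb <= target:
        return 0
    # Capacity at which the requested memory is exactly the target fraction.
    needed_capacity = requested_gb / target
    return math.ceil((needed_capacity - capacity_gb) / worker_gb)


if __name__ == "__main__":
    # Hypothetical cluster: 900 GB requested out of 1000 GB, 32 GB per worker.
    print(workers_needed(900, 1000, 32))  # prints 9
```

With these made-up numbers, 8 workers would still leave the reservation just above 71%, which is why the estimate rounds up to 9.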
fgiunchedi created T419824: Add new k8s toolforge workers to cater for memory requests.
Mar 12 2026, 10:47 AM · cloud-services-team (FY2025/2026-Q3-Q4), Toolforge
fgiunchedi added a comment to T417393: Carry out controlled network switch down tests in cloud.

Plan is to grab another announced maint window on Tues March 17th to resume the testing.

I have also opened subtasks for the remaining racks, one notable difference is that those do contain cloudvirt hosts. @Andrew what's the recommended procedure to temporarily drain a rack of VMs and then put them back? So far I found wmcs.openstack.cloudvirt.drain cookbook mentioned on wikitech

wmcs.openstack.cloudvirt.drain should be what you need -- it will migrate VMs off the host and also mark the host as in maintenance. Then to repool you'll use wmcs.openstack.cloudvirt.unset_maintenance

Mar 12 2026, 10:44 AM · Cloud-VPS, cloud-services-team (FY2025/2026-Q3-Q4)

Mar 11 2026

fgiunchedi added a comment to T414513: Add new alerts for Toolforge cluster high load.

Following up from T419674: ToolforgeKubernetesCapacity alert actionability

Mar 11 2026, 1:45 PM · cloud-services-team, Toolforge
fgiunchedi closed T419674: ToolforgeKubernetesCapacity alert actionability as Invalid.

Thank you @dcaro for the pointer to T404726! I went through it again and it was a good read; I'm resolving this one in favor of T414513: Add new alerts for Toolforge cluster high load and will follow up there

Mar 11 2026, 1:31 PM · cloud-services-team (FY2025/2026-Q3-Q4), Toolforge
fgiunchedi created T419674: ToolforgeKubernetesCapacity alert actionability.
Mar 11 2026, 10:26 AM · cloud-services-team (FY2025/2026-Q3-Q4), Toolforge
fgiunchedi added a comment to T417393: Carry out controlled network switch down tests in cloud.

Plan is to grab another announced maint window on Tues March 17th to resume the testing.

Mar 11 2026, 8:18 AM · Cloud-VPS, cloud-services-team (FY2025/2026-Q3-Q4)
fgiunchedi created T419658: Controlled cloudsw down tests for F4.
Mar 11 2026, 8:08 AM · Cloud-VPS, cloud-services-team (FY2025/2026-Q3-Q4)
fgiunchedi created T419657: Controlled cloudsw down tests for E4.
Mar 11 2026, 8:06 AM · Cloud-VPS, cloud-services-team (FY2025/2026-Q3-Q4)
fgiunchedi updated the task description for T419656: Controlled cloudsw down tests for D5.
Mar 11 2026, 8:02 AM · Cloud-VPS, cloud-services-team (FY2025/2026-Q3-Q4)
fgiunchedi created T419656: Controlled cloudsw down tests for D5.
Mar 11 2026, 7:59 AM · Cloud-VPS, cloud-services-team (FY2025/2026-Q3-Q4)

Mar 10 2026

fgiunchedi updated the task description for T419508: Debug and understand why bringing down cloud net/gw/lb resulted in cloud vps network down.
Mar 10 2026, 2:16 PM · Cloud-VPS, cloud-services-team (FY2025/2026-Q3-Q4)
fgiunchedi added a comment to T419508: Debug and understand why bringing down cloud net/gw/lb resulted in cloud vps network down.

Thanks @fgiunchedi. The other aspects we should monitor are the keepalived operations on both the cloudnet and cloudgw nodes, to make sure they are failing over. I think if we test each element one at a time we can be set up and logged on to all hosts and monitoring events, so hopefully we can isolate exactly what parts aren't working as expected.

Mar 10 2026, 11:12 AM · Cloud-VPS, cloud-services-team (FY2025/2026-Q3-Q4)
fgiunchedi updated the task description for T419508: Debug and understand why bringing down cloud net/gw/lb resulted in cloud vps network down.
Mar 10 2026, 11:07 AM · Cloud-VPS, cloud-services-team (FY2025/2026-Q3-Q4)