
Volans (Riccardo Coccioli)
SRE

User Details

User Since
Feb 10 2016, 11:25 AM (539 w, 3 d)
Availability
Available
IRC Nick
volans
LDAP User
Volans
MediaWiki User
RCoccioli (WMF) [ Global Accounts ]

Recent Activity

Yesterday

Volans added a comment to T428867: Openstack cinder volumes backups are broken.

Finally a full round of backups from cloudbackup2003 has completed:

Jun 13 17:09:35 cloudbackup2003 systemd[1]: backup_cinder_volumes.service: Consumed 22h 21min 17.189s CPU time, 467.8G memory peak.
Sat, Jun 13, 10:49 PM · cloud-services-team, Cloud-VPS
Volans added a comment to T428867: Openstack cinder volumes backups are broken.

The cloudbackup2004 round of backups completed successfully tonight after running for ~2 days (started on 2026-06-11 11:25:52, not too bad); the next run will start at 07 UTC.

Jun 13 02:19:53 cloudbackup2004 systemd[1]: backup_cinder_volumes.service: Consumed 2d 59min 968ms CPU time, 463.9G memory peak.
Sat, Jun 13, 6:02 AM · cloud-services-team, Cloud-VPS

Fri, Jun 12

Volans added a comment to T428867: Openstack cinder volumes backups are broken.

Cleanup completed; the next backup run will start in less than 2h and will hopefully run all the way through.

Fri, Jun 12, 5:14 PM · cloud-services-team, Cloud-VPS
Volans closed T428939: Request creation of osbpo VPS project as Resolved.

Project created. @Andrew, resolving this task; feel free to add Zigo once the account has been created.

Fri, Jun 12, 3:16 PM · cloud-services-team, Cloud-VPS (Project-requests)
Volans added a comment to T428939: Request creation of osbpo VPS project.

Creating the project

Fri, Jun 12, 3:06 PM · cloud-services-team, Cloud-VPS (Project-requests)
Volans added a comment to T427731: Quota increase request for project wlm-it-visual.

@Ferdi2005 is my understanding correct that some of the current workload of wlmitvisual01 will go into the new wlm-job host? If that's the case, it doesn't seem you need all those additional resources, but I might be missing some context. In particular, if the same pre-existing load will be split, the CPU already seems largely unused, and the RAM doesn't seem to require any particular increase either if part of the RAM used on the old host moves to the new job host.

Fri, Jun 12, 2:55 PM · Cloud-VPS (Quota-requests)
Volans added a comment to T428867: Openstack cinder volumes backups are broken.

Agreed on IRC to clear the unprotected snapshots created by cloudbackup in 2026. Extracted the list with:

$ while read -r vol; do
    sudo rbd --pool eqiad1-cinder snap ls "$vol" \
        | grep -v " yes " | grep _cloudbackup | grep 2026 \
        | awk -v vol="$vol" '{print "sudo rbd --pool eqiad1-cinder snap remove --snap-id " $1 " " vol}'
done < volumes_to_clear.txt
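
For readers who prefer Python, the same filtering can be sketched as below. This is only an illustration, not what was run: the function name `removal_commands` and the sample `rbd snap ls` output (column layout and snapshot names) are assumptions made up for the example. The three filters mirror the `grep` stages above, and the first whitespace-separated field is taken as the snapshot ID, as `awk` does with `$1`.

```python
def removal_commands(volume, snap_ls_output, pool="eqiad1-cinder"):
    """Build 'rbd snap remove' commands for unprotected 2026 cloudbackup snapshots."""
    commands = []
    for line in snap_ls_output.splitlines():
        if " yes " in line:  # same as: grep -v " yes " (skip protected snapshots)
            continue
        if "_cloudbackup" not in line or "2026" not in line:
            continue  # same as: grep _cloudbackup | grep 2026
        snap_id = line.split()[0]  # same as awk's $1
        commands.append(
            f"sudo rbd --pool {pool} snap remove --snap-id {snap_id} {volume}"
        )
    return commands


# Hypothetical output, invented for this example only.
SAMPLE = """SNAPID  NAME                              SIZE    PROTECTED  TIMESTAMP
   101  2026-03-01T07:00:00_cloudbackup   10 GiB             Sun Mar  1 07:00:05 2026
   102  2026-03-02T07:00:00_cloudbackup   10 GiB  yes        Mon Mar  2 07:00:05 2026
   103  2025-12-31T07:00:00_cloudbackup   10 GiB             Wed Dec 31 07:00:05 2025
   104  manual-snapshot                   10 GiB             Tue Mar  3 07:00:05 2026
"""

for command in removal_commands("volume-example", SAMPLE):
    print(command)
```

Like the shell loop, this only prints the commands, so the list can be reviewed before anything is deleted.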
Fri, Jun 12, 2:02 PM · cloud-services-team, Cloud-VPS
Volans added a comment to T428867: Openstack cinder volumes backups are broken.

I've extracted from backy2 ls the list of volumes whose last backup is from March, and I passed that to rbd to list the snapshots. What I can do is pre-emptively delete all the associated snapshots, which should force a full backup in all cases and get us out of this impasse. Thoughts?

Fri, Jun 12, 12:29 PM · cloud-services-team, Cloud-VPS
Volans added a comment to T428867: Openstack cinder volumes backups are broken.

The full run on cloudbackup2003 failed again for volume-9ec2e481-26dd-41d1-81bf-24b43abcee12, due to T428995. I've manually cleaned the snapshots for that volume, but I'm not sure what the best action is here. We depend on unmaintained software that has bugs.

Fri, Jun 12, 12:12 PM · cloud-services-team, Cloud-VPS
Volans added a comment to T428865: Openstack root disk backups log errors.

Partial duplicate of T397105, which was closed without fixing the problem.

Fri, Jun 12, 11:44 AM · cloud-services-team, Cloud-VPS
Volans closed T428830: Urgent: Backup restore request (Toolforge) as Declined.

@Green_Cardamom Unfortunately we don't have those files available in this particular case. Toolforge users are responsible for keeping backups of their data, see https://wikitech.wikimedia.org/wiki/Help:Toolforge/Tool_accounts#Backup_Toolforge_data . Related issue T428867.

Fri, Jun 12, 9:19 AM · cloud-services-team, Data-Services, Toolforge
Volans updated the task description for T429003: mwopenstackclients.py reliability.
Fri, Jun 12, 9:18 AM · cloud-services-team, Cloud-VPS
Volans created T429003: mwopenstackclients.py reliability.
Fri, Jun 12, 8:15 AM · cloud-services-team, Cloud-VPS
Volans created T429002: wmcs-backup should not delete backups before taking the new one.
Fri, Jun 12, 7:56 AM · cloud-services-team, Cloud-VPS
Volans added a comment to T428867: Openstack cinder volumes backups are broken.

I've manually removed the snapshot for the failed volume, which should force a full backup for it, and restarted the process on backup2003.

Fri, Jun 12, 6:41 AM · cloud-services-team, Cloud-VPS
Volans created T428995: wmcs-backup fails to create differential when the data changed too much and fails the whole run.
Fri, Jun 12, 6:34 AM · cloud-services-team, Cloud-VPS
Volans added a comment to T428867: Openstack cinder volumes backups are broken.

cloudbackup2004 backups are still going, currently backing up the tool-nfs volume (17.9% 160.0MB/sØ ETA 14h53m). I'll keep an eye on it to make sure the 07:00 UTC timer doesn't overlap.

Fri, Jun 12, 6:09 AM · cloud-services-team, Cloud-VPS

Thu, Jun 11

Volans added a comment to T428867: Openstack cinder volumes backups are broken.

Backups on cloudbackup2003 have started and old backups were kept as expected. New ones so far have been all full backups.

Thu, Jun 11, 7:49 PM · cloud-services-team, Cloud-VPS
Volans added a comment to T428867: Openstack cinder volumes backups are broken.

The cleanup (which unfortunately was set up to run before the backup and not after) completed a while ago and the actual backups are running fine so far. I'll keep monitoring and leave the task open until successful completion.

Thu, Jun 11, 5:36 PM · cloud-services-team, Cloud-VPS
Volans added a comment to T428939: Request creation of osbpo VPS project.

SGTM +1

Thu, Jun 11, 5:09 PM · cloud-services-team, Cloud-VPS (Project-requests)
Volans added a comment to T427731: Quota increase request for project wlm-it-visual.

@Andrew current RAM usage for the two VMs

image.png (3,138×1,378 px, 253 KB)

Thu, Jun 11, 4:51 PM · Cloud-VPS (Quota-requests)
Volans triaged T427731: Quota increase request for project wlm-it-visual as Medium priority.
Thu, Jun 11, 4:43 PM · Cloud-VPS (Quota-requests)
Volans created T428893: Add monitoring for backy2 openstack backups.
Thu, Jun 11, 12:35 PM · cloud-services-team, Cloud-VPS
Volans lowered the priority of T428867: Openstack cinder volumes backups are broken from Unbreak Now! to Medium.

With the above patches merged, the remote backups are now working again. I've manually triggered the ones from cloudbackup2004, which are currently running. The ones from cloudbackup2003 will trigger automatically at 19 UTC tonight; I'll check later that they start properly.

Thu, Jun 11, 12:12 PM · cloud-services-team, Cloud-VPS
Volans added a comment to T428832: db1262 crashed.

Should the alert on icinga be acked/downtimed? What about the open incident on splunk? (If not resolved it would alert again tomorrow IIRC)

Thu, Jun 11, 10:59 AM · SRE, DC-Ops, ops-eqiad, Data-Persistence, DBA
Volans added a comment to T428867: Openstack cinder volumes backups are broken.

The systemd unit is clearly failed:

Thu, Jun 11, 9:32 AM · cloud-services-team, Cloud-VPS
Volans added a comment to T428867: Openstack cinder volumes backups are broken.

cloudbackup1* hosts don't seem to be affected, only because they have a leg in the cloud-private vlan:

Thu, Jun 11, 9:23 AM · cloud-services-team, Cloud-VPS
Volans added a comment to T428867: Openstack cinder volumes backups are broken.

Do we have any concern that after fixing the issue the next round of backups might cause harm to the infrastructure because of the backlog?

Thu, Jun 11, 9:07 AM · cloud-services-team, Cloud-VPS
Volans created T428867: Openstack cinder volumes backups are broken.
Thu, Jun 11, 8:57 AM · cloud-services-team, Cloud-VPS
Volans created T428865: Openstack root disk backups log errors.
Thu, Jun 11, 8:53 AM · cloud-services-team, Cloud-VPS

Thu, Jun 4

Volans created T428164: Possible webproxy misconfiguration for beta-iabot and iabot-beta.
Thu, Jun 4, 1:03 PM · Cyberbot
Volans created T428163: Possible webproxy misconfiguration for iiab-med.
Thu, Jun 4, 1:01 PM · Internet In A Box

Wed, Jun 3

Volans added a comment to T422801: Consider allowing cumin access to all Cloud VPS VMs.

...also @Volans what are your thoughts about key rotation? Is it fine with you if I simply ignore that question for our first pass?

Wed, Jun 3, 2:09 PM · tools-infrastructure-team, Cloud-VPS
Volans added a comment to T422801: Consider allowing cumin access to all Cloud VPS VMs.

No, I meant using sudo for the command to be executed (sudo lsb_release -d); it will work on all hosts, as it works for root too.
Cumin's connection is managed by an ssh config file, see:
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/cumin/cloud_ssh_config.erb

Wed, Jun 3, 2:02 PM · tools-infrastructure-team, Cloud-VPS
Volans added a comment to T427780: Provide downtime duration information in sre.mysql cookbooks.

Would adding the duration to the existing log messages in [1], [2] and [3] be enough?

Wed, Jun 3, 9:41 AM · Infrastructure-Foundations, Spicerack, SRE-tools, DBA
Volans created T428026: Statuspage API keys needs to be rotated.
Wed, Jun 3, 8:04 AM · SRE Observability

Thu, May 28

Volans added a comment to T427465: Move thumbnail caching from upload cluster to text.

Some random questions/comments:

Thu, May 28, 8:03 AM · Patch-For-Review, Data-Persistence, Traffic

Mon, May 25

Volans added a comment to T422801: Consider allowing cumin access to all Cloud VPS VMs.

So we just need to use sudo. For another task related to cumin I had Moritz check if there was any side effect of using sudo when connected from the root user and it was deemed ok.
There will be the added benefit of having more audit logs on the target host, as all sudo commands are logged. The other angle is that we would depend on sudo for those commands, but if sudo is broken on a host where we connect as root, we can always fix it simply by not using sudo and then use everything as usual.

Mon, May 25, 9:48 AM · tools-infrastructure-team, Cloud-VPS

Tue, May 19

Volans created T426727: Automate cluster reboots with cookbooks.
Tue, May 19, 8:57 AM · cloud-services-team

May 5 2026

Volans added a comment to T425387: Quota increase for development metrics.

+1 for me

May 5 2026, 5:40 PM · Cloud-VPS (Quota-requests)
Volans added a comment to T425420: Increase quota for wikitextexp project.

+1 for me

May 5 2026, 5:39 PM · Content-Transform-Team, Cloud-VPS (Quota-requests)
Volans edited projects for T285709: Some SAL log entries (e.g. switchdc, scap backport) are getting cut off because long lines are being split over IRC, added: SRE; removed Infrastructure-Foundations, Spicerack, SRE-tools.

Given the new title, removing the SRE Infrastructure Foundations tag, as tcpircbot/logmsgbot is not owned/managed by the team, along with the related SRE-tools/Spicerack tags.

May 5 2026, 7:33 AM · SRE, Patch-Needs-Improvement

May 2 2026

Volans added a project to T425239: page error 404 missing when url ends with .php: ServiceOps-Mediawiki.
May 2 2026, 11:11 PM · ServiceOps-Mediawiki, Wikimedia-Hackathon-2026, SRE
Volans added a comment to T424658: Ensure cloudvirt capacity is more evenly spread out among racks.

I don't have access

May 2 2026, 8:11 PM · Cloud-VPS, cloud-services-team (FY2025/2026-Q3-Q4)
Volans closed T424068: Request creation of wise VPS project as Resolved.

Access problem was resolved. Resolving this task as the project is now working.

May 2 2026, 6:03 PM · WISE, Cloud-VPS (Project-requests)
Volans closed T424068: Request creation of wise VPS project, a subtask of T424092: Semantic image search for Wikimedia Commons, as Resolved.
May 2 2026, 6:03 PM · WISE, Wikimedia-Hackathon-2026, Technical-Tool-Request
Volans added a comment to T424068: Request creation of wise VPS project.

New flavor available: g4.cores8.ram64.disk20. You can select this flavor in the horizon UI. You can also switch the flavor of an existing instance, once it's stopped, by going to the edit instance page.

May 2 2026, 12:12 PM · WISE, Cloud-VPS (Project-requests)
Volans added a comment to T424068: Request creation of wise VPS project.

As discussed at the hackathon with Volans, we're requesting the 64GB instance for now

May 2 2026, 9:58 AM · WISE, Cloud-VPS (Project-requests)

Apr 30 2026

Volans added a comment to T424068: Request creation of wise VPS project.

Project created. @Carandraug, @Gopavasanth please verify that you have access and also make sure to join the cloud-announce mailing list.

Apr 30 2026, 11:30 AM · WISE, Cloud-VPS (Project-requests)
Volans added a comment to T424068: Request creation of wise VPS project.

thanks, +1 from my side, creating the project

Apr 30 2026, 9:18 AM · WISE, Cloud-VPS (Project-requests)
Volans added a comment to T424068: Request creation of wise VPS project.

Or you can add them later. But I need you to link your LDAP account to your Phabricator profile (see LDAP user in https://phabricator.wikimedia.org/p/Carandraug/ ).

Apr 30 2026, 8:55 AM · WISE, Cloud-VPS (Project-requests)
Volans added a comment to T424068: Request creation of wise VPS project.

@Carandraug could you confirm that we should also add @Gopavasanth as a project maintainer?

Apr 30 2026, 8:51 AM · WISE, Cloud-VPS (Project-requests)
Volans added a comment to T420993: Rotate discovery intermediate certificate (expires 2026-05-03).

To keep everyone up to date, this morning Moritz asked me to have a look at why debmonitor-client was failing on the two pki hosts (pki1001.eqiad.wmnet,pki2002.codfw.wmnet) with:

SSLError(SSLError(1, '[SSL: SSLV3_ALERT_CERTIFICATE_EXPIRED] sslv3 alert certificate expired (_ssl.c:2636)')): /hosts/pki1001.eqiad.wmnet/update
Apr 30 2026, 8:42 AM · ServiceOps new, Infrastructure-Foundations, Patch-For-Review

Apr 28 2026

Volans closed T423789: New upstream release for Pywikibot as Resolved.

Pywikibot has been updated in paws. The deployment was timing out for me but worked for Francesco; then the paws hub was down because the hub was not being deployed on the node where a required mountpoint was. Francesco solved it by adding an explicit nodeSelector.

Apr 28 2026, 1:28 PM · cloud-services-team, tools-platform-team, PAWS
Volans closed T423790: New upstream release for Pywikibot as Resolved.

New version of pywikibot has been deployed to toolforge.

Apr 28 2026, 1:01 PM · cloud-services-team, tools-platform-team, Toolforge
Volans moved T424443: Request creation of thelounge Cloud VPS project from Inbox to Feedback needed on the Cloud-VPS (Project-requests) board.
Apr 28 2026, 8:57 AM · WikiLounge, Cloud-VPS (Project-requests)
Volans closed T424564: Quota increase for development metrics as Resolved.

+1 executed cookbook wmcs.openstack.quota_increase --cores 8 --ram 34G --project development-metrics --cluster-name eqiad1. Limits are now increased.

Apr 28 2026, 8:56 AM · cloud-services-team, Cloud-VPS (Quota-requests)

Apr 27 2026

Volans added a comment to T419525: Configure vanity domain for lingualibre.

to horizon's hiera key profile::acme_chief::certificates in the puppet's prefix for project-proxy-acme-chief in the project-proxy project.
After the certificate was created correctly I've added:

Apr 27 2026, 12:59 PM · cloud-services-team, Cloud-VPS, Lingua-Libre
Volans moved T424068: Request creation of wise VPS project from Inbox to Feedback needed on the Cloud-VPS (Project-requests) board.
Apr 27 2026, 9:22 AM · WISE, Cloud-VPS (Project-requests)
Volans moved T424192: Request creation of wolf-a Cloud VPS project from Inbox to Feedback needed on the Cloud-VPS (Project-requests) board.
Apr 27 2026, 9:22 AM · Cloud-VPS (Project-requests)

Apr 14 2026

Volans added a comment to T395883: Ganeti: ensure primary/secondary node redundancy.

On @ayounsi's request I've added a check for same-host colocation and made it Arzhel-compatible 😉

Apr 14 2026, 3:55 PM · User-MoritzMuehlenhoff, Ganeti
Volans added a comment to T395883: Ganeti: ensure primary/secondary node redundancy.

The above output has been generated with the patch I just sent for a cookbook.

Apr 14 2026, 3:34 PM · User-MoritzMuehlenhoff, Ganeti
Volans added a comment to T395883: Ganeti: ensure primary/secondary node redundancy.

I made a quick check on all the PoPs:

Apr 14 2026, 3:24 PM · User-MoritzMuehlenhoff, Ganeti
Volans added a comment to T395883: Ganeti: ensure primary/secondary node redundancy.

Today with the esams router work I noticed some imbalance of VMs:

$ sudo cumin 'P:netbox::host%location ~ "BY.*esams" and ganeti*' 2> /dev/null
ganeti[3005,3007].esams.wmnet
$ sudo cumin 'P:netbox::host%location ~ "BW.*esams" and ganeti*' 2> /dev/null
ganeti[3006,3008].esams.wmnet
$ sudo cumin -i 'F:lldp.parent ~ "ganeti300[5,7]\..*"' 2> /dev/null
bast3007.wikimedia.org,doh3005.wikimedia.org,hcaptcha-proxy[3001-3002].wikimedia.org,install3004.wikimedia.org,ncredir3006.esams.wmnet,netflow3004.esams.wmnet,prometheus3004.esams.wmnet
$ sudo cumin -i 'F:lldp.parent ~ "ganeti300[6,8]\..*"' 2> /dev/null
doh3006.wikimedia.org,durum[3005-3006].esams.wmnet,ncredir3005.esams.wmnet,tcp-proxy[3001-3002].esams.wmnet
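
As a rough illustration of how the output above could be checked automatically, here is a small Python sketch. It assumes, as one possible reading of the imbalance, that the concern is VMs of the same service all living on the same group of Ganeti nodes. The helper names (`split_top_level`, `expand`, `colocated_services`) are made up for this example, and the "service" is guessed by stripping trailing digits from the short hostname, which is a heuristic rather than anything cumin provides.

```python
import re
from collections import defaultdict


def split_top_level(expr):
    """Split a comma-separated host expression, ignoring commas inside [...]."""
    parts, current, depth = [], [], 0
    for char in expr:
        if char == "[":
            depth += 1
        elif char == "]":
            depth -= 1
        if char == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(char)
    if current:
        parts.append("".join(current))
    return parts


def expand(expr):
    """Expand e.g. 'durum[3005-3006].esams.wmnet' into individual hostnames."""
    hosts = []
    for part in split_top_level(expr):
        match = re.fullmatch(r"(.*?)\[([^\]]+)\](.*)", part)
        if match is None:
            hosts.append(part)
            continue
        prefix, ranges, suffix = match.groups()
        for item in ranges.split(","):
            start, _, end = item.partition("-")
            for number in range(int(start), int(end or start) + 1):
                hosts.append(f"{prefix}{number:0{len(start)}d}{suffix}")
    return hosts


def colocated_services(groups):
    """Return services with more than one VM whose VMs all sit in one group."""
    placement = defaultdict(list)
    for group, expr in groups.items():
        for host in expand(expr):
            # Heuristic: service name = short hostname without trailing digits.
            service = re.sub(r"\d+$", "", host.split(".")[0])
            placement[service].append(group)
    return sorted(
        service
        for service, where in placement.items()
        if len(where) > 1 and len(set(where)) == 1
    )


# The two VM lists from the cumin output above, keyed by their parent nodes.
GROUPS = {
    "ganeti300[5,7]": "bast3007.wikimedia.org,doh3005.wikimedia.org,"
    "hcaptcha-proxy[3001-3002].wikimedia.org,install3004.wikimedia.org,"
    "ncredir3006.esams.wmnet,netflow3004.esams.wmnet,prometheus3004.esams.wmnet",
    "ganeti300[6,8]": "doh3006.wikimedia.org,durum[3005-3006].esams.wmnet,"
    "ncredir3005.esams.wmnet,tcp-proxy[3001-3002].esams.wmnet",
}

print(colocated_services(GROUPS))
```

A real check would more likely rely on a proper node-set library and on explicit service metadata than on hostname patterns.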
Apr 14 2026, 2:15 PM · User-MoritzMuehlenhoff, Ganeti
Volans added a comment to T423282: Timeouts on puppetserver1002 past reboot.

Some exception examples from the puppetserver logs (cut out as they are pretty long):

Apr 14 2026, 1:12 PM · Puppet-Infrastructure, Infrastructure-Foundations, SRE

Apr 9 2026

Volans added a comment to T422801: Consider allowing cumin access to all Cloud VPS VMs.

+1 to add the keys from my pov. Cumin aliases can help to group/exclude those hosts from common selections.

Apr 9 2026, 3:08 PM · tools-infrastructure-team, Cloud-VPS

Apr 8 2026

Volans added a comment to T422261: sre.hosts.reboot-single cookbook removes any and all downtimes after reboot.

I don't think that's possible in Icinga due to Icinga "APIs"; for the Alertmanager downtime it already removes only the downtime created by the cookbook itself.

Apr 8 2026, 5:12 PM · Infrastructure-Foundations, Traffic

Apr 7 2026

Volans added a comment to T422261: sre.hosts.reboot-single cookbook removes any and all downtimes after reboot.

IMHO the solution here is to create a dedicated cookbook for the LVS hosts that has all the logic needed for LVS reboots (reboot the secondaries first, then the primaries; start from the lower-traffic, less important ones and move to the more critical ones; disable puppet, stop pybal, etc...).
It could be either a cookbook that targets a given traffic-level+datacenter or a rolling-restart-reboot one that can go over all of them or a subset of them.
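
The ordering part of that logic could look roughly like the sketch below. It is only an illustration under stated assumptions: the `LvsHost` class, its `role` and `traffic_rank` fields and the host names are all hypothetical, and the actual reboot steps (disabling puppet, stopping pybal, and so on) are left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LvsHost:
    """Hypothetical description of an LVS host, for illustration only."""

    name: str
    role: str  # "secondary" or "primary"
    traffic_rank: int  # lower means less critical traffic


def reboot_order(hosts):
    """Secondaries first, then primaries; within each, least critical first."""
    role_rank = {"secondary": 0, "primary": 1}
    return sorted(hosts, key=lambda host: (role_rank[host.role], host.traffic_rank))


# Made-up hosts, not real hostnames.
HOSTS = [
    LvsHost("lvs-primary-high", "primary", 2),
    LvsHost("lvs-secondary", "secondary", 1),
    LvsHost("lvs-primary-low", "primary", 1),
]

for host in reboot_order(HOSTS):
    print(host.name)
```

Keeping the ordering in one pure function would make it easy to reuse for both variants mentioned above, the per-datacenter cookbook and the rolling one.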

Apr 7 2026, 8:48 AM · Infrastructure-Foundations, Traffic

Apr 2 2026

Volans closed T420360: Add proxy support to cumin openstack backend as Resolved.

The cloudcumin hosts are now using the webproxies to connect to the openstack APIs and the firewall rule has been reverted. Resolving.

Apr 2 2026, 2:36 PM · Infrastructure-Foundations, SRE-tools, Cumin
Volans triaged T422115: Missing includes in DNS repo from Netbox-generated snippets as Medium priority.

I've merged and released the fix. Do you want to keep the task open to implement some form of check to prevent this from happening?

Apr 2 2026, 12:27 PM · Traffic, SRE, Infrastructure-Foundations, DNS, netops, netbox
Volans added a comment to T330997: Support locking cookbooks run except for switchover related cookbooks.

Given this has been moved to the backlog I'll leave here a comment for our future selves: it could be useful to also adjust the proposed feature to support a slightly different use case, that of a maintenance on a cumin host.
The operator should be able to set a lock that prevents any cookbook from running on the given cumin host but allows them to run on the other cumin host(s) (and says so in the lock message for the user).

Apr 2 2026, 10:24 AM · Patch-For-Review, ServiceOps new, SRE-tools, Infrastructure-Foundations, Datacenter-Switchover, SRE
Volans edited projects for T422114: Prometheus doing requests via proxy that gets 400, added: SRE Observability; removed Observability-Metrics.
Apr 2 2026, 9:31 AM · SRE Observability (FY2025/2026-Q4)
Volans updated the task description for T422115: Missing includes in DNS repo from Netbox-generated snippets.
Apr 2 2026, 9:27 AM · Traffic, SRE, Infrastructure-Foundations, DNS, netops, netbox
Volans created T422115: Missing includes in DNS repo from Netbox-generated snippets.
Apr 2 2026, 9:25 AM · Traffic, SRE, Infrastructure-Foundations, DNS, netops, netbox
Volans created T422114: Prometheus doing requests via proxy that gets 400.
Apr 2 2026, 9:21 AM · SRE Observability (FY2025/2026-Q4)

Mar 26 2026

Volans added a comment to T421348: Add tox-uv support to the tox-v{3,4} Docker images.

I made a quick try to do an opt-in approach installing tox-uv via tox.ini, but it's not working easily because we manage the python versions with pyenv and that doesn't play well with uv.
I think a new image where we manage python versions with uv and have tox-uv instead and the opt-in is done via the integration/config repo is probably the cleanest way to do it.

Mar 26 2026, 11:43 AM · Release-Engineering-Team, Infrastructure-Foundations

Mar 23 2026

Volans claimed T420360: Add proxy support to cumin openstack backend.
Mar 23 2026, 11:29 AM · Infrastructure-Foundations, SRE-tools, Cumin
Volans added a comment to T420676: Allow Prometheus query beyond 375 days in Grafana/Thanos.

One quick workaround could be to view the metric in Grafana explore with 2 queries, one of them with a 1y offset; see example (IIRC you must be logged in to Grafana to open the link).

Mar 23 2026, 11:05 AM · Regression, Observability-Metrics, observability, Grafana
Volans added a comment to T420676: Allow Prometheus query beyond 375 days in Grafana/Thanos.

AFAIK that was explicitly set in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1006006 for T356788

Mar 23 2026, 10:42 AM · Regression, Observability-Metrics, observability, Grafana
Volans added a comment to T420623: netbox report error for puppetdb serial versus netbox serial for backup1012.

That's what's in puppetdb and what's reported by facter on the host though:

$ sudo facter -p serialnumber
S480845X3505676
Mar 23 2026, 10:38 AM · collaboration-services, SRE, ops-eqiad, DC-Ops

Mar 16 2026

Volans added a comment to T419996: cloudcumin not able to communicate with openstack.eqiad1.wikimediacloud.org:25000 anymore.

Conceptually that could work for me, but I fear that we might need to patch cumin for that. Given that keystoneauth1 uses Python's requests library AFAIK, we could just set the environment variables for the proxy in cumin's config, but then the proxy would be used for all requests-based backends, like PuppetDB.
So if we want to add proxy support specifically for the openstack backend we might need to patch it specifically. For additional context cumin has now a large refactor merged in master that will be released soon but it can't be released in a hurry given the major refactor.

Mar 16 2026, 2:29 PM · Cloud-VPS, cloud-services-team
Volans edited projects for T419919: Consider reducing verbosity of IRC logging, added: SRE; removed Cumin.

Removing the cumin tag as cumin doesn't log to IRC at all. Adding the SRE one as this is not a technical problem but a workflow one that involves everyone touching production (not only SREs).

Mar 16 2026, 11:34 AM · SRE, Infrastructure-Foundations
Volans added a comment to T419877: Permanently set 'noout' for cloudceph.

I brought this up to the ceph working group and the only comments I got were from Ben, who was a bit surprised about our strategy and wanted to know more about the past failure scenarios. It was mentioned that ceph has quite a few tunables for the rebalance and that tweaking them might lead to a better strategy than going with noout, norebalance or nobackfill.

Mar 16 2026, 11:10 AM · Ceph, Cloud-VPS, cloud-services-team (FY2025/2026-Q3-Q4)

Mar 9 2026

Volans updated the task description for T419457: dse-k8s control plane OOM.
Mar 9 2026, 8:29 PM · Incident Severity 3, Data-Platform-SRE (2026-03-06 - 2026-03-27), Wikimedia-Incident
Volans renamed T419457: dse-k8s control plane OOM from k8s-dse control plane OOM to dse-k8s control plane OOM.
Mar 9 2026, 8:28 PM · Incident Severity 3, Data-Platform-SRE (2026-03-06 - 2026-03-27), Wikimedia-Incident
Volans closed T419457: dse-k8s control plane OOM as Resolved.

Resolving the task as the follow up work is being tracked separately by the Data Engineering team, but feel free to re-open or reference this task if needed.

Mar 9 2026, 8:13 PM · Incident Severity 3, Data-Platform-SRE (2026-03-06 - 2026-03-27), Wikimedia-Incident
Volans updated the task description for T419457: dse-k8s control plane OOM.
Mar 9 2026, 7:29 PM · Incident Severity 3, Data-Platform-SRE (2026-03-06 - 2026-03-27), Wikimedia-Incident

Mar 6 2026

Volans added a comment to T414411: cp5022 is unreachable.

This host is still marked as Active in Netbox but disappeared from PuppetDB, if it's still broken please fix its status in Netbox.

Mar 6 2026, 4:30 PM · SRE, DC-Ops, ops-eqsin, Traffic

Mar 5 2026

Volans updated subscribers of T419166: Audit production for systemd parse warnings.

Sigh... it looks like we're mostly ok; apart from the ones fixed by your patch, we have some ganeti hosts complaining about ConditionPathExists (cc @MoritzMuehlenhoff )

Mar 5 2026, 10:24 PM · Infrastructure-Foundations, Patch-For-Review, Security, SRE

Mar 3 2026

Volans added a comment to T418813: Quota increases for gitlab-runners.

Given the size of the request, this will need a discussion in the next WMCS team sync-up on Thu. I'm not aware of any prior coordinated plans for this specific migration; if there were any, could you please link them here to add context?

Mar 3 2026, 9:37 AM · User-dcaro, Cloud-VPS (Quota-requests)
Volans moved T418813: Quota increases for gitlab-runners from Inbox to Discussion needed on the Cloud-VPS (Quota-requests) board.
Mar 3 2026, 9:31 AM · User-dcaro, Cloud-VPS (Quota-requests)
Volans added a comment to T418831: grafana.wmcloud.org unavailable - failed db migration.

Seems related to https://github.com/grafana/grafana/issues/118836

Mar 3 2026, 8:46 AM · cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS

Feb 24 2026

Volans added a comment to T418210: Adding the Tools Infrastructure / Tools Platform folks as Bitu Account Managers.

{done}

Feb 24 2026, 2:50 PM · Infrastructure-Foundations

Feb 23 2026

Volans added a comment to T330997: Support locking cookbooks run except for switchover related cookbooks.

Not a fan of negative names, I think check_global_lock is descriptive enough. That would result in something like this in run()?

try:
    check_global_lock(cookbook_fqdn)
except LockError:
    ...  # log and exit
Feb 23 2026, 2:26 PM · Patch-For-Review, ServiceOps new, SRE-tools, Infrastructure-Foundations, Datacenter-Switchover, SRE
Volans added a comment to T330997: Support locking cookbooks run except for switchover related cookbooks.

I think I'd be inclined to prefer the more-defensive option (maybe @Clement_Goubert has a preference here?).

Just to make sure I understand - does this mean that we'd add a function in locking.py (something like check_global_lock()), which will then check if the global lock exists, and if it does, will check whether the cookbook that executed the function is on the allowlist? The client in this case is _menu.py?
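
To make the flow being discussed concrete, here is a minimal Python sketch. `check_global_lock` and `LockError` are the names used in this thread; everything else is an assumption for illustration. In particular, the `global_lock` and `allowlist` parameters are hypothetical stand-ins for whatever locking.py would actually read from the lock backend and the configuration, and the cookbook names are only examples.

```python
class LockError(Exception):
    """Raised when a cookbook must not run because of the global lock."""


def check_global_lock(cookbook_name, *, global_lock=None, allowlist=frozenset()):
    """Raise LockError if a global lock is set and the cookbook is not allowed.

    global_lock: the lock message if a global lock exists, otherwise None
                 (hypothetical; a real implementation would query the backend).
    allowlist:   names of cookbooks allowed to run while the lock is held.
    """
    if global_lock is None:
        return  # no global lock, nothing to enforce
    if cookbook_name in allowlist:
        return  # e.g. switchover-related cookbooks
    raise LockError(
        f"Cookbook runs are locked ({global_lock}); "
        f"{cookbook_name} is not on the allowlist"
    )


def run(cookbook_name, **lock_state):
    """Sketch of the caller side: check the lock before starting the cookbook."""
    try:
        check_global_lock(cookbook_name, **lock_state)
    except LockError as error:
        print(f"Refusing to run: {error}")
        return 1
    print(f"Running {cookbook_name}")
    return 0
```

With this shape the allowlist decision lives in one place, and the caller only needs to handle a single exception type.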

Feb 23 2026, 10:02 AM · Patch-For-Review, ServiceOps new, SRE-tools, Infrastructure-Foundations, Datacenter-Switchover, SRE

Feb 20 2026

Volans added a comment to T330997: Support locking cookbooks run except for switchover related cookbooks.

@Blake I think we should do that check inside the logic that fires up the cookbook, namely inside CookbookItem.run() [1].
Depending on what the API in locking will be there are different options of course.

Feb 20 2026, 9:15 AM · Patch-For-Review, ServiceOps new, SRE-tools, Infrastructure-Foundations, Datacenter-Switchover, SRE

Feb 18 2026

Volans added a comment to T415199: Toolforge NFS tracing misses some dumps events.

For the paths with .., we are still catching the ones that have it after the mount prefix right? (like /mnt/nfs/home/data/public/something/../../somethingelse)

The issue is with the ones that have it in the prefix itself that we don't catch right? (like /usr/../mnt/home/data/public/..)
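
The difference between the two cases can be shown with a small Python sketch. It is purely illustrative: the `/mnt/nfs` prefix and the helper names are assumptions, the normalization is only lexical (it does not follow symlinks), and the real tracing may well be unable to normalize paths like this where it runs.

```python
import posixpath

MOUNT_PREFIX = "/mnt/nfs"  # hypothetical mount prefix for this example


def raw_match(path, prefix=MOUNT_PREFIX):
    """Match on the path exactly as it was passed in."""
    return path == prefix or path.startswith(prefix + "/")


def normalized_match(path, prefix=MOUNT_PREFIX):
    """Collapse '.' and '..' lexically before matching the prefix."""
    return raw_match(posixpath.normpath(path), prefix)


# '..' after the prefix: caught by both checks.
print(raw_match("/mnt/nfs/home/a/../b"), normalized_match("/mnt/nfs/home/a/../b"))

# '..' inside the prefix itself: missed by the raw check only.
print(raw_match("/usr/../mnt/nfs/home/b"), normalized_match("/usr/../mnt/nfs/home/b"))
```

Normalizing first also avoids the opposite mistake, where a path that starts with the prefix climbs back out of the mount through `..`.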

Feb 18 2026, 2:46 PM · Toolforge, cloud-services-team
Volans claimed T399313: Add tracing to understand Toolforge and CloudVPS usage and dependencies.
Feb 18 2026, 8:59 AM · cloud-services-team (FY2025/2026-Q3-Q4), Patch-For-Review, Cloud-VPS, Toolforge
Volans placed T361306: Decommission cookbook: stop when user inputs "abort" up for grabs.

Unassigning myself as I'm not working on this.

Feb 18 2026, 8:58 AM · Patch-For-Review, Infrastructure-Foundations, Data-Platform-SRE
Volans closed T379259: Outdated cookbooks cleanup as Resolved.

Resolving this old task, there was some cleanup done at the time, in case it's deemed necessary to do a new pass a new task should be created.

Feb 18 2026, 8:58 AM · Infrastructure-Foundations, SRE-tools