User Details
- User Since
- Feb 10 2016, 11:25 AM (539 w, 3 d)
- Availability
- Available
- IRC Nick
- volans
- LDAP User
- Volans
- MediaWiki User
- RCoccioli (WMF) [ Global Accounts ]
Yesterday
Finally a full round of backups from cloudbackup2003 has completed:
Jun 13 17:09:35 cloudbackup2003 systemd[1]: backup_cinder_volumes.service: Consumed 22h 21min 17.189s CPU time, 467.8G memory peak.
The cloudbackup2004 round of backups completed successfully tonight after running for ~2 days (started on 2026-06-11 11:25:52, not too bad). The next run will start at 07 UTC.
Jun 13 02:19:53 cloudbackup2004 systemd[1]: backup_cinder_volumes.service: Consumed 2d 59min 968ms CPU time, 463.9G memory peak.
Fri, Jun 12
Cleanup completed, the next backup run will start in less than 2h, hopefully it will run all the way.
Project created. @Andrew resolving this task, feel free to add Zigo once the account is created.
Creating the project
@Ferdi2005 is my understanding correct that some of the current workload of wlmitvisual01 will go into the new wlm-job host? If that's the case, it doesn't seem you need all those additional resources, but I might be missing some context. In particular, if the same pre-existing load will be split, CPU already seems largely unused and RAM doesn't seem to require a particular increase either, given that part of the RAM used on the old host will move to the new job host.
Agreed on irc to clear the unprotected snapshots created by cloudbackup in 2026. Extracted the list with:
$ while read -r vol; do
    sudo rbd --pool eqiad1-cinder snap ls "$vol" \
      | grep -v " yes " | grep _cloudbackup | grep 2026 \
      | awk -v vol="$vol" '{print "sudo rbd --pool eqiad1-cinder snap remove --snap-id " $1 " " vol}'
  done < volumes_to_clear.txt
I've extracted from backy2 ls the list of volumes whose last backup is from March, and passed that to rbd to list the snapshots. What I can do is pre-emptively delete all the associated snapshots, which will probably force a full backup in all cases and get us out of this impasse. Thoughts?
The full run on cloudbackup2003 failed again for volume-9ec2e481-26dd-41d1-81bf-24b43abcee12, due to T428995. I've manually cleaned snapshots for that volume. But I'm not sure what's the best action here. We depend on an unmaintained software that has bugs.
Partial duplicate of T397105, which was closed without fixing the problem.
@Green_Cardamom Unfortunately we don't have those files available in this particular case. Toolforge users are responsible for keeping backups of their data, see https://wikitech.wikimedia.org/wiki/Help:Toolforge/Tool_accounts#Backup_Toolforge_data . Related issue T428867.
I've manually removed the snapshot for the failed volume so that it should force it to make a full backup and restarted the process on backup2003.
cloudbackup2004 backups are still going, currently backing up the tool-nfs volume (17.9% 160.0MB/sØ ETA 14h53m). I'll keep an eye that the 07:00 UTC timer doesn't overlap.
Thu, Jun 11
Backups on cloudbackup2003 have started and old backups were kept as expected. New ones so far have been all full backups.
The cleanup (that unfortunately was set up to run before the backup and not after) completed a while ago and the actual backups are running fine so far. I'll keep monitoring and leave the task open until successful completion.
SGTM +1
@Andrew current RAM usage for the two VMs
With the above patches merged the remote backups are now working again. I've manually triggered the ones from cloudbackup2004 that are currently running. The ones from cloudbackup2003 will trigger automatically at 19UTC tonight, I'll check later on that they will start properly.
Should the alert on icinga be acked/downtimed? What about the open incident on splunk? (If not resolved it would alert again tomorrow IIRC)
The systemd unit is clearly failed:
cloudbackup1* hosts don't seem to be affected just because they have a leg in the cloud-private vlan:
Do we have any concern that after fixing the issue the next round of backups might cause harm to the infrastructure because of the backlog?
Thu, Jun 4
Wed, Jun 3
No, I meant use sudo for the command to be executed (sudo lsb_release -d); it will work on all hosts, as it works for root too.
Cumin's connection is managed by an ssh config file, see:
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/cumin/cloud_ssh_config.erb
Would adding the duration to the existing log messages in [1], [2] and [3] be enough?
Thu, May 28
Some random questions/comments:
Mon, May 25
So we just need to use sudo. For another task related to cumin I had Moritz check if there was any side effect of using sudo when connected from the root user and it was deemed ok.
There will be an added benefit of having more audit logs on the target host, as all sudo commands are logged. The other angle is that we would depend on sudo for those commands, but if sudo is broken on a host where we connect as root we can always fix it by simply not using sudo and then be able to use everything as usual.
Tue, May 19
May 5 2026
+1 for me
+1 for me
Given the new title, removing the SRE Infrastructure Foundations tag, as tcpircbot/logmsgbot is not owned/managed by the team, along with the related SRE-tools/Spicerack tags.
May 2 2026
I don't have access
Access problem was resolved. Resolving this task as the project is now working.
New flavor available: g4.cores8.ram64.disk20. You can select this flavor in the horizon UI. You can also switch the flavor of an existing instance, once it's stopped, by going to the edit instance page.
Apr 30 2026
Project created. @Carandraug, @Gopavasanth please verify that you have access and also make sure to join the cloud-announce mailing list.
thanks, +1 from my side, creating the project
Or you can add them later. But I need you to link your ldap account into your phabricator profile (see LDAP user in https://phabricator.wikimedia.org/p/Carandraug/ ).
@Carandraug could you confirm that we should also add @Gopavasanth as a project maintainer?
To keep everyone up to date, this morning Moritz asked me to have a look at why debmonitor-client was failing on the two pki hosts (pki1001.eqiad.wmnet,pki2002.codfw.wmnet) with:
SSLError(SSLError(1, '[SSL: SSLV3_ALERT_CERTIFICATE_EXPIRED] sslv3 alert certificate expired (_ssl.c:2636)')): /hosts/pki1001.eqiad.wmnet/update
Apr 28 2026
Pywikibot has been updated in paws. The deployment was timing out for me, it worked for Francesco, but then paws hub was down because the hub was not being deployed in the node where a required mountpoint was. Francesco solved it adding an explicit nodeSelector.
New version of pywikibot has been deployed to toolforge.
+1 executed cookbook wmcs.openstack.quota_increase --cores 8 --ram 34G --project development-metrics --cluster-name eqiad1. Limits are now increased.
Apr 27 2026
to horizon's hiera key profile::acme_chief::certificates in the puppet's prefix for project-proxy-acme-chief in the project-proxy project.
After the certificate was created correctly I've added:
Apr 14 2026
On @ayounsi request I've added a check for same-host colocation and made it Arzhel-compatible 😉
The above output has been generated with the patch I just sent for a cookbook.
I made a quick check on all the PoPs:
Today with the esams router work I noticed some imbalance of VMs:
$ sudo cumin 'P:netbox::host%location ~ "BY.*esams" and ganeti*' 2> /dev/null
ganeti[3005,3007].esams.wmnet
$ sudo cumin 'P:netbox::host%location ~ "BW.*esams" and ganeti*' 2> /dev/null
ganeti[3006,3008].esams.wmnet
$ sudo cumin -i 'F:lldp.parent ~ "ganeti300[5,7]\..*"' 2> /dev/null
bast3007.wikimedia.org,doh3005.wikimedia.org,hcaptcha-proxy[3001-3002].wikimedia.org,install3004.wikimedia.org,ncredir3006.esams.wmnet,netflow3004.esams.wmnet,prometheus3004.esams.wmnet
$ sudo cumin -i 'F:lldp.parent ~ "ganeti300[6,8]\..*"' 2> /dev/null
doh3006.wikimedia.org,durum[3005-3006].esams.wmnet,ncredir3005.esams.wmnet,tcp-proxy[3001-3002].esams.wmnet
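To put a number on the imbalance, the two VM lists returned by the lldp.parent queries can be counted. This is a minimal sketch, not part of cumin: count_hosts is a hypothetical helper, and it assumes the list is comma-separated with at most one simple [start-end] range per entry (it does not handle comma lists inside brackets such as ganeti[3005,3007]).

```shell
# Hypothetical helper: count the hosts in a compact, comma-separated host list
# read from stdin. Assumes at most one simple [start-end] range per entry.
count_hosts() {
  tr ',' '\n' | awk '
    # Entry with a numeric range: add the size of the range.
    match($0, /\[[0-9]+-[0-9]+\]/) {
      range = substr($0, RSTART + 1, RLENGTH - 2)
      split(range, bounds, "-")
      n += bounds[2] - bounds[1] + 1
      next
    }
    # Plain entry: one host.
    NF { n += 1 }
    END { print n + 0 }'
}
```

Feeding it the two lists above gives 8 VMs behind ganeti300[5,7] and 6 behind ganeti300[6,8].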
Some exception examples from the puppetserver logs (cut out as they are pretty long):
Apr 9 2026
+1 to add the keys from my pov. Cumin aliases can help to group/exclude those hosts from common selections.
Apr 8 2026
I don't think that's possible in Icinga due to Icinga "APIs". For the Alertmanager downtime, it already removes only the downtime created by the cookbook itself.
Apr 7 2026
IMHO the solution here is to create a dedicated cookbook for the LVS hosts that has all the logic needed for LVS reboots (reboot the secondaries first, then the primaries, start from the lower-traffic, less important ones and move to the more critical ones, disable puppet, stop pybal, etc...).
It could be either a cookbook that targets a given traffic-level+datacenter or a rolling-restart-reboot one that can go over all of them or a subset of them.
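The ordering part of that logic could be sketched as below. This is only an illustration under stated assumptions, not the cookbook itself: lvs_reboot_order is a hypothetical helper, and the input format (one "host role traffic_rank" line per host, with a lower rank meaning less critical traffic) is made up for the example. The puppet and pybal steps are left out.

```shell
# Hypothetical helper: read "host role traffic_rank" lines from stdin and print
# the hosts in reboot order: secondaries first, then primaries, and within each
# group from the lowest traffic rank to the highest.
lvs_reboot_order() {
  awk '{ role_key = ($2 == "secondary") ? 0 : 1; print role_key, $3, $1 }' \
    | sort -k1,1n -k2,2n \
    | awk '{ print $3 }'
}
```

For example, with made-up host names:

    printf '%s\n' 'lvs-a primary 1' 'lvs-b secondary 2' 'lvs-c secondary 1' 'lvs-d primary 2' | lvs_reboot_order

prints lvs-c, lvs-b, lvs-a, lvs-d, one per line.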
Apr 2 2026
The cloudcumin hosts are now using the webproxies to connect to the openstack APIs and the firewall rule has been reverted. Resolving.
I've merged and released the fix. Do you want to keep the task open to implement some form of check to prevent this from happening?
Given this has been moved to the backlog I'll leave here a comment for our future selves: it could be useful to also adjust the proposed feature to support a slightly different use case, that of a maintenance on a cumin host.
The operator should be able to set a lock that prevents any cookbook from running on the given cumin host but allows them to run on the other cumin host(s) (and says so in the lock message for the user).
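The behaviour described above can be illustrated with a file-based lock. This is a hypothetical sketch only and not how Spicerack implements locking: LOCK_DIR, the file naming and both function names are assumptions made up for the example.

```shell
# Hypothetical sketch of a per-cumin-host maintenance lock stored as a file.
# LOCK_DIR and the "<host>.lock" naming are assumptions, not a real mechanism.
LOCK_DIR="${LOCK_DIR:-/tmp/cookbook-maintenance-locks}"

# usage: set_maintenance_lock <cumin-host> <message for the user>
set_maintenance_lock() {
  mkdir -p "$LOCK_DIR" && printf '%s\n' "$2" > "$LOCK_DIR/$1.lock"
}

# usage: check_maintenance_lock <cumin-host>
# Returns 0 if cookbooks may run on that host, 1 (printing the lock message
# to stderr) if the host is locked.
check_maintenance_lock() {
  if [ -f "$LOCK_DIR/$1.lock" ]; then
    echo "Cookbooks are locked on $1: $(cat "$LOCK_DIR/$1.lock")" >&2
    return 1
  fi
  return 0
}
```

A lock set for one host only blocks that host, so the message can point the user to the other cumin host(s).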
Mar 26 2026
I made a quick try to do an opt-in approach installing tox-uv via tox.ini, but it's not working easily because we manage the python versions with pyenv and that doesn't play well with uv.
I think a new image where we manage python versions with uv and have tox-uv instead and the opt-in is done via the integration/config repo is probably the cleanest way to do it.
Mar 23 2026
One quick workaround could be to view the metric in grafana explore with 2 queries, one of them with a 1y offset, see example (you must be logged in to grafana IIRC to open the link).
AFAIK that was explicitly set in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1006006 for T356788
That's what's in puppetdb and what's reported by facter on the host though:
$ sudo facter -p serialnumber
S480845X3505676
Mar 16 2026
Conceptually that could work for me, but I fear that we might need to patch cumin for that. Given that keystoneauth1 uses Python's requests library AFAIK, we could just set the proxy environment variables in cumin's config, but then they would be used for all requests-based backends, like PuppetDB.
So if we want to add proxy support specifically for the openstack backend we might need to patch it specifically. For additional context cumin has now a large refactor merged in master that will be released soon but it can't be released in a hurry given the major refactor.
Removing the cumin tag as cumin doesn't log to IRC at all. Adding the SRE one as this is not a technical problem but a workflow one that involves everyone touching production (not only SREs).
I brought this up to the ceph working group and the only comments I got were from Ben, who was a bit surprised about our strategy and wanted to know more about the past failure scenarios. It was mentioned that ceph has quite a few tunables for the rebalance and maybe tweaking them could lead to a better strategy than going with noout or norebalance or nobackfill.
Mar 9 2026
Resolving the task as the follow up work is being tracked separately by the Data Engineering team, but feel free to re-open or reference this task if needed.
Mar 6 2026
This host is still marked as Active in Netbox but disappeared from PuppetDB, if it's still broken please fix its status in Netbox.
Mar 5 2026
Sigh... it looks like we're mostly ok: apart from the ones fixed by your patch we have some ganeti hosts complaining about ConditionPathExists (cc @MoritzMuehlenhoff )
Mar 3 2026
Given the size of the request this will need a discussion in the next WMCS team sync-up on Thu. I'm not aware of any prior coordinated plans for this specific migration; if there were any, could you please link them here to add additional context?
Seems related to https://github.com/grafana/grafana/issues/118836
Feb 24 2026
{done}
Feb 23 2026
Feb 20 2026
@Blake I think we should do that check inside the logic that fires up the cookbook, namely inside CookbookItem.run() [1].
Depending on what the API in locking will be there are different options of course.
Feb 18 2026
Un-assigning myself as I'm not working on this.
Resolving this old task, there was some cleanup done at the time, in case it's deemed necessary to do a new pass a new task should be created.
