Volans (Riccardo Coccioli)
SRE

User Details

User Since
Feb 10 2016, 11:25 AM (521 w, 6 d)
Availability
Available
IRC Nick
volans
LDAP User
Volans
MediaWiki User
RCoccioli (WMF) [ Global Accounts ]

Recent Activity

Yesterday

Volans added a comment to T356296: confd setup left without configuration doesn't stop confd.

@Scott_French from a quick check, unless I made some lazy mistake, I'd say around 733
(the second grep is just to let cumin aggregate them all by stripping the time, host and PID info)

Tue, Feb 10, 10:30 PM · ServiceOps new, Infrastructure-Foundations, SRE
Volans added a comment to T415237: etherpad table size is 233GB.

Given the hot topic I'll add my 2 cents. I think the proposed solution will most likely dissatisfy everyone: on one side it will disrupt the existing workflows of those who use etherpad regularly, and on the other it will still require WMF to keep maintaining it over time. It really only solves the DB table size problem, assuming no excessive abuse happens before the next truncation.

Tue, Feb 10, 8:34 PM · User-notice, collaboration-services, Wikimedia-Etherpad, Data-Persistence

Thu, Feb 5

Volans added a comment to T416567: Cache expirations (?) every UTC midnight.

@jijiki out of curiosity, have you already looked at/excluded logrotate from the picture? AFAICT it runs fleet-wide a few seconds after midnight, and it could be that some service, upon being notified of the log rotation, does a less-than-smooth reload?

Thu, Feb 5, 1:12 PM · User-jijiki, ServiceOps new

Tue, Feb 3

Volans added a comment to T388874: Update Kubernetes library version in spicerack.

The cumin hosts are on v22.6.0-2 as reported by https://debmonitor.wikimedia.org/packages/python3-kubernetes

Tue, Feb 3, 12:33 PM · ServiceOps-Upgrades-Hardware, ServiceOps new, Datacenter-Switchover

Fri, Jan 23

Volans added a comment to T415199: Toolforge NFS tracing misses some dumps events.

After quite a bit of testing, the LSM_PROBE route turned out not to be feasible because of a missing kernel feature that AFAICT can't just be enabled but requires a recompiled kernel rather than the standard Debian one. I tried various hooks, including file_open and vfs_open, without success. I've come up with a solution that keeps the current approach and sent the above patch for it.

Fri, Jan 23, 9:56 PM · Toolforge, cloud-services-team

Thu, Jan 22

Volans added a comment to T415199: Toolforge NFS tracing misses some dumps events.

Thanks for the report, that's why I was asking whether we knew of tools using, for example, dumps. It looks like the pre-existing code had an optimization to exclude anything that doesn't start with /mnt/nfs, and I naively assumed it had already been verified that the data at that point was a resolved path, but it's not.
It looks like we could move the hook from the open/openat syscall to LSM_PROBE (I've quickly checked and we seem to have support for it, but I'll check more extensively on the whole fleet of workers to be sure), which should get hooked after the kernel has resolved the path.
I'm working on a patch to test the above assumptions.
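
As a userspace illustration of the pitfall described above (not the actual tracing code), the sketch below shows how a prefix check on the raw argument of open/openat can miss files under /mnt/nfs when the path is relative. The function names, the working directory and the sample paths are hypothetical, and lexical normalization only approximates what the kernel does when it resolves a path (symlinks are not followed here).

```python
import posixpath

NFS_PREFIX = "/mnt/nfs"


def naive_is_nfs(raw_path: str) -> bool:
    """Prefix check on the raw syscall argument, before any resolution."""
    return raw_path.startswith(NFS_PREFIX)


def resolved_is_nfs(raw_path: str, cwd: str) -> bool:
    """Prefix check after joining with the working directory and normalizing."""
    full = posixpath.normpath(posixpath.join(cwd, raw_path))
    return full == NFS_PREFIX or full.startswith(NFS_PREFIX + "/")


if __name__ == "__main__":
    cwd = "/mnt/nfs/dumps"  # hypothetical working directory of the traced process
    for raw in ("/mnt/nfs/dumps/a.xml", "a.xml", "../other/b.xml", "/tmp/c.txt"):
        print(raw, naive_is_nfs(raw), resolved_is_nfs(raw, cwd))
```

A relative path such as `a.xml` fails the naive check even though it lives under /mnt/nfs, which would explain events being dropped when the filter runs before path resolution.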

Thu, Jan 22, 3:36 PM · Toolforge, cloud-services-team
Volans claimed T415199: Toolforge NFS tracing misses some dumps events.
Thu, Jan 22, 3:10 PM · Toolforge, cloud-services-team
Volans added a comment to T375014: Support listing pooled / active authdns hosts (rather than all).

@MLechvien-WMF I haven't worked on this since the original unplanned effort that generated the above patch. As I have also been moved to a different team since then, I don't think I will be able to work on it. The existing patch can surely be taken and carried over by interested parties in the affected teams (serviceops, traffic, I/F) if that approach still makes sense.

Thu, Jan 22, 11:29 AM · Patch-For-Review, Infrastructure-Foundations, Spicerack, SRE-tools

Thu, Jan 15

Volans closed T414648: Request to increase quotas for logging project as Resolved.

Limits increased.

Thu, Jan 15, 4:29 PM · Observability-Logging, Cloud-VPS (Quota-requests)

Mon, Jan 12

Volans triaged T414240: [wikireplicas] Create views for new wiki kaiwiki as Medium priority.
Mon, Jan 12, 11:16 AM · cloud-services-team (FY2025/2026-Q3-Q4), Data-Services
Volans triaged T414249: No new changesets added to cloud/instance-puppet.git since df63d60 as Medium priority.
Mon, Jan 12, 11:14 AM · Cloud-VPS, cloud-services-team
Volans triaged T414199: Add datetime versions of timestamp fields to Wikireplica databases as Medium priority.

Adding Data-Persistence and Data-Engineering for visibility and feasibility feedback.

Mon, Jan 12, 11:14 AM · Data-Persistence, Data-Engineering, cloud-services-team, Data-Services
Volans triaged T414229: jobs-api does not properly handle quota errors when restarting a job as Medium priority.
Mon, Jan 12, 11:10 AM · Toolforge, cloud-services-team
Volans triaged T414106: Add end-to-end monitoring for PAWS as Medium priority.
Mon, Jan 12, 11:08 AM · cloud-services-team, PAWS
Volans triaged T410948: New upstream release for Pywikibot as Medium priority.
Mon, Jan 12, 11:06 AM · Toolforge (Toolforge iteration 25), cloud-services-team (FY2025/2026-Q3-Q4), tools-platform-team
Volans triaged T413793: lima-kilo: Migrate away from overlay2 storage as Medium priority.
Mon, Jan 12, 11:06 AM · cloud-services-team, Toolforge
Volans triaged T413782: lima-kilo: INJECT_FACTS_AS_VARS default to `True` is deprecated as Medium priority.
Mon, Jan 12, 11:06 AM · cloud-services-team, Toolforge

Jan 8 2026

Volans added a comment to T413097: Raise quota on wikiqlever so that an instance with 256 GB RAM and 3 x 4 TB SSD can be launched.

@Physikerwelt if you need NFS you can refer to https://wikitech.wikimedia.org/wiki/Help:Shared_storage

Jan 8 2026, 9:40 AM · cloud-services-team, Cloud-VPS (Quota-requests), WikiCite, Wikidata, Wikidata-Query-Service

Jan 7 2026

Volans added a comment to T413097: Raise quota on wikiqlever so that an instance with 256 GB RAM and 3 x 4 TB SSD can be launched.

Limits increased to 1TB of disk and 32G of RAM.

Jan 7 2026, 5:09 PM · cloud-services-team, Cloud-VPS (Quota-requests), WikiCite, Wikidata, Wikidata-Query-Service
Volans added a comment to T413097: Raise quota on wikiqlever so that an instance with 256 GB RAM and 3 x 4 TB SSD can be launched.

While I don't know the specs of the machine running WDQS

Jan 7 2026, 5:05 PM · cloud-services-team, Cloud-VPS (Quota-requests), WikiCite, Wikidata, Wikidata-Query-Service
Volans added a comment to T413097: Raise quota on wikiqlever so that an instance with 256 GB RAM and 3 x 4 TB SSD can be launched.

@Physikerwelt Do you also need the VCPU limit to be increased? I see that they are all used right now.

Jan 7 2026, 4:58 PM · cloud-services-team, Cloud-VPS (Quota-requests), WikiCite, Wikidata, Wikidata-Query-Service
Volans triaged T413985: Alert on metricsinfra prometheus storing a few days of metrics as Medium priority.
Jan 7 2026, 4:22 PM · cloud-services-team, Cloud-VPS

Dec 18 2025

Volans added a comment to T413080: Design and build the next generation of container-registry service for the WMF production realm.

There is also an opportunity here to try to consolidate the technology used for container registries around the foundation; for example, there is also the Toolforge one. If in the long term we end up using the same stack for both, it should simplify their operations and maintenance and allow more people to become familiar with the stack.

Dec 18 2025, 2:22 PM · ServiceOps new, Epic, Ceph, Kubernetes, Infrastructure-Foundations, Data-Platform-SRE, Machine-Learning-Team

Dec 15 2025

Volans added a comment to T412700: Improve release process of Spicerack and service catalog.

I've sent a proposal patch with a possible approach that keeps the benefits of the @dataclass approach and allows extra arguments to be ignored when present, preventing errors.
It would still be nice to have a check in the Puppet repo to remind the user that a patch for spicerack is needed, given that the warning comment at the top of the file goes unnoticed.
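
One common way to get that behaviour, shown here only as a minimal sketch and not as the content of the patch, is an alternate constructor that filters the input against the declared fields. The `ServiceEntry` class, its fields and the `from_dict` name are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass(frozen=True)
class ServiceEntry:
    """Hypothetical catalog entry; the real Spicerack classes differ."""

    name: str
    port: int
    description: str = ""

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "ServiceEntry":
        """Build an instance, dropping keys that are not declared fields."""
        known = {field.name for field in fields(cls)}
        return cls(**{key: value for key, value in raw.items() if key in known})


if __name__ == "__main__":
    # "new_key" stands for an attribute added on the Puppet side first.
    print(ServiceEntry.from_dict({"name": "example", "port": 443, "new_key": True}))
```

Calling `ServiceEntry(**raw)` directly would raise a `TypeError` on the unknown key, while `from_dict` silently ignores it, so a reminder check on the producing side remains useful.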

Dec 15 2025, 2:09 PM · ServiceOps new, Patch-For-Review, Infrastructure-Foundations

Nov 27 2025

Volans added a comment to T253986: update bacula-sd config so that it listens on IPv6.

If you want to bind to any address, from the docs at [1] it seems that you can just omit the setting and not specify either SDAddresses or SDAddress. SDPort seems optional as we're using the default port.

Nov 27 2025, 1:30 PM · Data-Persistence-Backup, SRE, IPv6

Nov 25 2025

Volans added a comment to T337422: Add default Cloud VPS project alerts for low disk space and low inode count.

The problem is not the space, that disk is out of inodes:

# df -hi /srv/
Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/sda         5.0M  5.0M     0  100% /srv
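
For an alert along these lines, inode usage can be read with `os.statvfs`, the same data that `df -i` reports. This is a minimal sketch; the function names and the 90% threshold are assumptions, not an existing check.

```python
import os


def inode_usage_percent(total: int, free: int) -> float:
    """Return the percentage of inodes in use given total and free counts."""
    if total == 0:  # some filesystems do not report inode counts
        return 0.0
    return 100.0 * (total - free) / total


def inodes_exhausted(path: str, threshold: float = 90.0) -> bool:
    """Return True when inode usage on the filesystem holding path exceeds threshold."""
    stats = os.statvfs(path)
    return inode_usage_percent(stats.f_files, stats.f_ffree) > threshold


if __name__ == "__main__":
    print(inodes_exhausted("/"))
```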
Nov 25 2025, 8:22 PM · cloud-services-team, Cloud-VPS

Nov 21 2025

Volans added a comment to T409857: [toolsdb] Automatically terminate long transactions.

FYI in production there are already mechanisms to automatically kill queries, so we might just be able to reuse them.

Nov 21 2025, 6:36 PM · cloud-services-team, Toolforge
Volans added a comment to T410426: Requesting access to analytics-privatedata-users for dsmit.

@DSmit-WMF did you have a chance to go through the documentation to clarify which access you need to the analytics-privatedata-users group?
Perhaps just level 1 is already enough for you?

Nov 21 2025, 5:48 PM · SRE, SRE-Access-Requests
Volans added a project to T409707: Requesting access to Analytics_Privatedata for Chandra-WMDE: Data-Engineering.

@Milimetric / @Ahoelzl by any chance could one of you review this task for approval?

Nov 21 2025, 5:13 PM · Data-Engineering, SRE, SRE-Access-Requests
Volans added a comment to T410612: Requesting access to ops for blake.

@MoritzMuehlenhoff yes but I think Kwaku is covering for Kavitha while she's away in the next few days.

Nov 21 2025, 10:54 AM · SRE, SRE-Access-Requests

Nov 20 2025

Volans added a comment to T410572: Replace deprecated Phabricator Conduit API call by @ProdPasteBot with its stable equivalent.

Adding collaboration-services

Nov 20 2025, 5:45 PM · collaboration-services, Phabricator
Volans updated the task description for T410612: Requesting access to ops for blake.
Nov 20 2025, 11:48 AM · SRE, SRE-Access-Requests
Volans moved T410612: Requesting access to ops for blake from Untriaged to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Nov 20 2025, 11:47 AM · SRE, SRE-Access-Requests

Nov 19 2025

Volans added a comment to T410537: Add a --rack flag to sre.k8s.pool-depool-node.

Just for context referencing past ideas on the topic: T327300

Nov 19 2025, 9:59 PM · Infrastructure-Foundations, SRE-tools, serviceops
Volans triaged T410506: New SSH keys for effie as Medium priority.
Nov 19 2025, 2:30 PM · SRE, SRE-Access-Requests
Volans moved T410506: New SSH keys for effie from Untriaged to Patch in Review on the SRE-Access-Requests board.
Nov 19 2025, 2:30 PM · SRE, SRE-Access-Requests
Volans closed T409854: Requesting access to run queries on superset.wikimedia.org for Nik Gkountas as Resolved.
Nov 19 2025, 12:14 PM · SRE, SRE-Access-Requests
Volans moved T410473: Requesting access to analytics-privatedata-users for catrope from Patch in Review to Awaiting User Input on the SRE-Access-Requests board.

@Catrope patch merged, will be live within ~30 minutes. Kerberos principal created, you should have received an email about it with instructions on how to set up your password.
Once you've verified that everything works as expected please resolve this task. Thanks :)

Nov 19 2025, 9:37 AM · SRE, SRE-Access-Requests
Volans moved T410473: Requesting access to analytics-privatedata-users for catrope from Manager/NDA Approval/Confirmation to Patch in Review on the SRE-Access-Requests board.
Nov 19 2025, 9:24 AM · SRE, SRE-Access-Requests
Volans updated the task description for T410473: Requesting access to analytics-privatedata-users for catrope.
Nov 19 2025, 9:24 AM · SRE, SRE-Access-Requests
Volans updated the task description for T410473: Requesting access to analytics-privatedata-users for catrope.
Nov 19 2025, 9:24 AM · SRE, SRE-Access-Requests
Volans moved T410473: Requesting access to analytics-privatedata-users for catrope from Untriaged to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.

Adding Data-Engineering for visibility, no approval required for WMF staff.
Pending approval from @SCherukuwada

Nov 19 2025, 8:27 AM · SRE, SRE-Access-Requests
Volans updated the task description for T410473: Requesting access to analytics-privatedata-users for catrope.
Nov 19 2025, 8:24 AM · SRE, SRE-Access-Requests
Volans triaged T410473: Requesting access to analytics-privatedata-users for catrope as Medium priority.
Nov 19 2025, 8:22 AM · SRE, SRE-Access-Requests
Volans updated the task description for T409707: Requesting access to Analytics_Privatedata for Chandra-WMDE.
Nov 19 2025, 8:15 AM · Data-Engineering, SRE, SRE-Access-Requests

Nov 18 2025

Volans updated the task description for T409707: Requesting access to Analytics_Privatedata for Chandra-WMDE.
Nov 18 2025, 7:18 PM · Data-Engineering, SRE, SRE-Access-Requests
Volans updated subscribers of T409707: Requesting access to Analytics_Privatedata for Chandra-WMDE.

@WMDECyn, in case @Chandra-WMDE's position is a fixed term contract, could you provide us with the expiration date so that we can add it to data.yaml to track it?

Nov 18 2025, 7:18 PM · Data-Engineering, SRE, SRE-Access-Requests
Volans triaged T409707: Requesting access to Analytics_Privatedata for Chandra-WMDE as Medium priority.
Nov 18 2025, 6:20 PM · Data-Engineering, SRE, SRE-Access-Requests
Volans moved T410291: Grant Access to ops-limited for matthieulec from Manager/NDA Approval/Confirmation to Awaiting User Input on the SRE-Access-Requests board.

@MLechvien-WMF the patch has been merged, and then puppet-merged (an internal step required to make changes to the puppet repo live in production). Your user will be picked up and created at the next puppet run and will be on all hosts within 30 minutes from now. Once you've verified all works as expected please resolve this task, thanks.

Nov 18 2025, 6:06 PM · SRE-Access-Requests, SRE
Volans moved T410426: Requesting access to analytics-privatedata-users for dsmit from Untriaged to Awaiting User Input on the SRE-Access-Requests board.
Nov 18 2025, 5:15 PM · SRE, SRE-Access-Requests
Volans added a comment to T410426: Requesting access to analytics-privatedata-users for dsmit.

Please refer to https://wikitech.wikimedia.org/wiki/Data_Platform/Data_access#Access_Levels to clarify which access level you need to the analytics-privatedata-users group.

Nov 18 2025, 5:14 PM · SRE, SRE-Access-Requests
Volans triaged T410426: Requesting access to analytics-privatedata-users for dsmit as Medium priority.
Nov 18 2025, 5:12 PM · SRE, SRE-Access-Requests
Volans added a comment to T403153: Upgrade cloudcumin hosts to bookworm.

We've discussed this in the sre-tools-infrastructure team and there are no objections to the goal of keeping the cloudcumin hosts in sync with the cumin ones, nor to an in-place upgrade as suggested by Moritz offline.

Nov 18 2025, 2:40 PM · Cloud-VPS, cloud-services-team
Volans added a comment to T409279: Update to FIDO backed production SSH key for btullis.

@BTullis can you confirm all is working fine and we can resolve this task?

Nov 18 2025, 11:16 AM · SRE, SRE-Access-Requests
Volans moved T409854: Requesting access to run queries on superset.wikimedia.org for Nik Gkountas from Patch in Review to Awaiting User Input on the SRE-Access-Requests board.

@ngkountas patch merged, it should go live within 30 minutes from now. Once you've verified all works as expected please resolve this task.

Nov 18 2025, 9:53 AM · SRE, SRE-Access-Requests
Volans moved T410270: Update mfischerwmf ssh key from Patch in Review to Awaiting User Input on the SRE-Access-Requests board.

@MFischer patch merged, it should go live within 30 minutes from now. Once you've verified all works as expected please resolve this task.

Nov 18 2025, 9:52 AM · SRE, SRE-Access-Requests
Volans moved T409893: Requesting access to analytics-privatedata-users for AnkitaM from Patch in Review to Awaiting User Input on the SRE-Access-Requests board.

@MGerlach @AnkitaM: patch merged, it should go live within 30 minutes from now. Once you've verified all works as expected please resolve this task.

Nov 18 2025, 9:52 AM · Data-Engineering, SRE, SRE-Access-Requests
Volans closed T409894: Grant Access to ldap/nda for AnkitaM as Resolved.

Patch merged, resolving. @AnkitaM you're currently part of the LDAP nda group and should be able to access all the UIs that require this group. Feel free to re-open in case you encounter any issue.

Nov 18 2025, 9:51 AM · SRE, LDAP-Access-Requests
Volans closed T409894: Grant Access to ldap/nda for AnkitaM, a subtask of T406203: Start formal collaboration on understanding the use of maintenance templates, as Resolved.
Nov 18 2025, 9:51 AM · Research (FY2025-26-Research-October-December)

Nov 17 2025

Volans moved T409933: Requesting access to analytics-privatedata-users for lpintscher from Patch in Review to Awaiting User Input on the SRE-Access-Requests board.

@Lydia_Pintscher patch merged, it should go live within 30 minutes from now. Once you've verified all works as expected please resolve this task.

Nov 17 2025, 5:45 PM · SRE, SRE-Access-Requests
Volans moved T409409: Requesting access to analytics_privatedata_users and SQL Lab for Arian Bozorg (WMDE) from Patch in Review to Awaiting User Input on the SRE-Access-Requests board.
Nov 17 2025, 5:45 PM · SRE, SRE-Access-Requests
Volans moved T409933: Requesting access to analytics-privatedata-users for lpintscher from Manager/NDA Approval/Confirmation to Patch in Review on the SRE-Access-Requests board.

Whoops, my bad, it was not tracked in the summary and I missed it in the thread.

Nov 17 2025, 5:40 PM · SRE, SRE-Access-Requests
Volans updated the task description for T409933: Requesting access to analytics-privatedata-users for lpintscher.
Nov 17 2025, 5:39 PM · SRE, SRE-Access-Requests
Volans triaged T410291: Grant Access to ops-limited for matthieulec as Medium priority.

Pending approval from either @Kappakayala or @mark (from data.yaml approval list)

Nov 17 2025, 5:31 PM · SRE-Access-Requests, SRE
Volans triaged T409409: Requesting access to analytics_privatedata_users and SQL Lab for Arian Bozorg (WMDE) as Medium priority.

@Arian_Bozorg patch merged, it should go live within 30 minutes from now. Once you've verified all works as expected please resolve this task.

Nov 17 2025, 5:29 PM · SRE, SRE-Access-Requests
Volans updated the task description for T409409: Requesting access to analytics_privatedata_users and SQL Lab for Arian Bozorg (WMDE).
Nov 17 2025, 5:28 PM · SRE, SRE-Access-Requests
Volans triaged T409933: Requesting access to analytics-privatedata-users for lpintscher as Medium priority.

Pending approval from @WMDE-leszek

Nov 17 2025, 5:19 PM · SRE, SRE-Access-Requests
Volans updated the task description for T409933: Requesting access to analytics-privatedata-users for lpintscher.
Nov 17 2025, 5:18 PM · SRE, SRE-Access-Requests
Volans triaged T409854: Requesting access to run queries on superset.wikimedia.org for Nik Gkountas as Medium priority.
Nov 17 2025, 4:43 PM · SRE, SRE-Access-Requests
Volans added a comment to T409854: Requesting access to run queries on superset.wikimedia.org for Nik Gkountas.

@ngkountas just to be sure, when you say analytics-privatedata-users level 2 you mean with kerberos?

Nov 17 2025, 4:42 PM · SRE, SRE-Access-Requests
Volans updated the task description for T409854: Requesting access to run queries on superset.wikimedia.org for Nik Gkountas.
Nov 17 2025, 4:42 PM · SRE, SRE-Access-Requests
Volans moved T409893: Requesting access to analytics-privatedata-users for AnkitaM from Manager/NDA Approval/Confirmation to Patch in Review on the SRE-Access-Requests board.
Nov 17 2025, 4:16 PM · Data-Engineering, SRE, SRE-Access-Requests
Volans triaged T409893: Requesting access to analytics-privatedata-users for AnkitaM as Medium priority.
Nov 17 2025, 4:12 PM · Data-Engineering, SRE, SRE-Access-Requests
Volans triaged T410270: Update mfischerwmf ssh key as Medium priority.

Confirmed the ssh key out of band.

Nov 17 2025, 2:32 PM · SRE, SRE-Access-Requests
Volans moved T409893: Requesting access to analytics-privatedata-users for AnkitaM from Patch in Review to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Nov 17 2025, 2:23 PM · Data-Engineering, SRE, SRE-Access-Requests
Volans added a project to T409893: Requesting access to analytics-privatedata-users for AnkitaM: Data-Engineering.

Adding Data-Engineering for visibility and @Milimetric, @Ahoelzl for approval (either of them) from the data engineering side, as per docs.

Nov 17 2025, 2:21 PM · Data-Engineering, SRE, SRE-Access-Requests
Volans updated the task description for T409893: Requesting access to analytics-privatedata-users for AnkitaM.
Nov 17 2025, 1:53 PM · Data-Engineering, SRE, SRE-Access-Requests
Volans moved T409279: Update to FIDO backed production SSH key for btullis from Ready To Go to Awaiting User Input on the SRE-Access-Requests board.
Nov 17 2025, 1:46 PM · SRE, SRE-Access-Requests

Nov 10 2025

Volans added a comment to T409712: Wrong url on checking silence on alertmanager.

Possibly related to T328869

Nov 10 2025, 11:28 AM · SRE, Observability-Alerting, observability

Nov 6 2025

Volans moved T409365: Grant zuul project access to `fast-iops` volume type and `4xiops` instance flavor from Discussion needed to Approved on the Cloud-VPS (Quota-requests) board.
Nov 6 2025, 11:50 AM · Cloud-VPS (Quota-requests)
Volans added a comment to T409365: Grant zuul project access to `fast-iops` volume type and `4xiops` instance flavor.

If I'm not mistaken the above patch should grant the usage of the requested instance type. I'm not sure if there is a separate setting for the fast-iops bit or if it's included in that configuration.

Nov 6 2025, 11:49 AM · Cloud-VPS (Quota-requests)
Volans triaged T409365: Grant zuul project access to `fast-iops` volume type and `4xiops` instance flavor as Medium priority.
Nov 6 2025, 10:38 AM · Cloud-VPS (Quota-requests)
Volans moved T409365: Grant zuul project access to `fast-iops` volume type and `4xiops` instance flavor from Inbox to Discussion needed on the Cloud-VPS (Quota-requests) board.
Nov 6 2025, 10:37 AM · Cloud-VPS (Quota-requests)

Nov 3 2025

Volans added a comment to T409029: Flapping wikitech-static icinga alert.

From a quick check it seems that the host is randomly hammered by some traffic (to be investigated):

Nov 3 2025, 9:19 AM · wikitech.wikimedia.org, cloud-services-team

Oct 31 2025

Volans moved T408387: CloudVPS instance for ProVe from Inbox to Feedback needed on the Cloud-VPS (Project-requests) board.
Oct 31 2025, 8:46 AM · cloud-services-team (FY2025/2026-Q1-Q2), Cloud-VPS (Project-requests)

Oct 21 2025

Restricted Application added a project to T407787: Alertmanager triggers an alert on IRC and email after the alert has resolved: Infrastructure-Foundations.

For some related historical context on the lack of parity between the Icinga and AM APIs in Spicerack see also T293209 (look for optimal).

Oct 21 2025, 2:06 PM · Infrastructure-Foundations, SRE-tools, Spicerack, Traffic, Observability-Alerting

Oct 10 2025

Volans added a comment to T250415: Homer: add parallelization support.

[note for future self] If we can wait for Python 3.14 to be available on our systems, then we should also evaluate the new InterpreterPoolExecutor for this; it might be a good fit.

Oct 10 2025, 9:17 AM · User-Elukey, Infrastructure-Foundations, SRE-tools, homer

Oct 6 2025

Volans added a comment to T393692: transfer.py fails when handling nftables-configured firewall.

You can use SSH_AUTH_SOCK=/run/keyholder/proxy.sock scp [OPTIONS] cumin1002.eqiad.wmnet:/path/... /path/... from cumin1003 for example.

Oct 6 2025, 2:58 PM · database-backups, Infrastructure-Foundations
Volans edited P83603 Test custom fact from Gerrit patch.
Oct 6 2025, 10:54 AM
Volans created P83603 Test custom fact from Gerrit patch.
Oct 6 2025, 10:53 AM

Sep 24 2025

Volans closed T405434: PuppetFailure Puppet has failed on cloudcumin1001:9100 as Resolved.

Transient failure of git pull for the cloud/wmcs-cookbooks repository, self-resolved at the next puppet run.

Sep 24 2025, 7:40 AM · cloud-services-team

Sep 23 2025

Volans added a comment to T393600: sre.discovery cookbooks: refactor use of resolve_with_client_ip.

@Scott_French sorry for the trouble. The patch that added the timeout to the cookbook's version of the function was added ~1.5 years after the functionality landed in Spicerack, and somehow we failed to double-check the functional equivalence when migrating the cookbook to Spicerack's module function. Sorry about that.
I think we can set a reasonable timeout default without the need to make it tunable, at least as a first fix, that could go in at any time.
I doubt we'll have special needs for specific timeouts though, we're talking about DNS queries, not HTTP requests 😉

Sep 23 2025, 6:40 PM · serviceops

Sep 22 2025

Volans added a watcher for tools-infrastructure-team: Volans.
Sep 22 2025, 6:29 AM
Volans added a member for tools-infrastructure-team: Volans.
Sep 22 2025, 6:29 AM

Sep 11 2025

Volans added a comment to T404373: Log DNS queries from Cloud VPS clients.

Ideally sampled logs would be good enough, depending on how complex the setup to sample them is.
If there are no easy options for real sampling we could also consider alternative approaches:

  • a poor man's sampling, playing with log rotation and retention (e.g. rotate often and keep only 1 block every N rotated ones)
  • a size-based retention that limits the total size of the logs to a predictable amount (it will not help with issues in the past, but I guess that most issues where we need the data are live/ongoing)
  • deduplicating the logs to increase the signal in them
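
The first and third options could look roughly like the sketch below. It is only an illustration under stated assumptions: the file names, the helper names and the "keep one every N" rule are hypothetical and not tied to any specific logging setup.

```python
from collections import Counter


def rotated_files_to_keep(rotated, keep_every):
    """Return the rotated log names to retain: one every keep_every, in sorted order."""
    ordered = sorted(rotated)
    return [name for index, name in enumerate(ordered) if index % keep_every == 0]


def deduplicate(lines):
    """Collapse identical log lines into (line, count) pairs, most frequent first."""
    return Counter(lines).most_common()


if __name__ == "__main__":
    rotated = [f"dns.log.{number:03d}" for number in range(10)]  # hypothetical names
    print(rotated_files_to_keep(rotated, keep_every=4))
    print(deduplicate(["query a", "query b", "query a"]))
```

Anything not returned by `rotated_files_to_keep` would be deleted by the retention job, which keeps disk usage bounded at roughly 1/N of the unsampled volume.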
Sep 11 2025, 4:21 PM · Patch-For-Review, cloud-services-team, Cloud-VPS
Volans added a comment to T404300: Remove KernelErrors alerts.

+1 for me

Sep 11 2025, 9:06 AM · cloud-services-team (FY2025/2026-Q1-Q2)
Volans added a comment to T404282: KernelErrors Server cloudcephosd1041 logged kernel errors.

The problem is that there is no evidence in hardware logs and I doubt we'll get any replacement from Dell without them.

Sep 11 2025, 8:39 AM · cloud-services-team
Volans triaged T404282: KernelErrors Server cloudcephosd1041 logged kernel errors as Medium priority.
Sep 11 2025, 7:52 AM · cloud-services-team
Volans added a comment to T404282: KernelErrors Server cloudcephosd1041 logged kernel errors.

I've found this in kern.log/dmesg but nothing in racadm logs (both getsel and lclog):

Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786152] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786154] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786155] {1}[Hardware Error]: event severity: corrected
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786156] {1}[Hardware Error]:  Error 0, type: corrected
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786156] {1}[Hardware Error]:  fru_text: B1
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786157] {1}[Hardware Error]:   section_type: memory error
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786157] {1}[Hardware Error]:   error_status: 0x0000000000000400
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786158] {1}[Hardware Error]:   physical_address: 0x0000000f9ec05180
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786159] {1}[Hardware Error]:   node: 1 card: 0 module: 0 rank: 2 bank: 18 device: 6 row: 53592 column: 448
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786160] {1}[Hardware Error]:   error_type: 2, single-bit ECC
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786161] {1}[Hardware Error]:   DIMM location:  B1
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786168] [Firmware Warn]: GHES: Invalid error status block length!
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786185] soft_offline: 0xf9ec05: invalidated
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786192] mce: [Hardware Error]: Machine check events logged
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786194] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 255: 940000000000009f
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.794205] mce: [Hardware Error]: TSC 0 ADDR f9ec05180
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.799695] mce: [Hardware Error]: PROCESSOR 0:806f8 TIME 1757558012 SOCKET 0 APIC 0 microcode 2b000639
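
To summarize entries like the ones above, the `key: value` fields of the `[Hardware Error]` lines can be pulled out with a small parser. This is a minimal sketch based only on the line layout visible in this excerpt; the regular expression and function name are assumptions and other kernels or firmware may format these records differently.

```python
import re

# Matches "[Hardware Error]:   key: value" and captures the key and the value.
FIELD_RE = re.compile(r"\[Hardware Error\]:\s+(?P<key>[\w ]+?):\s+(?P<value>.+)$")


def parse_hardware_error(lines):
    """Collect 'key: value' fields from [Hardware Error] kernel log lines."""
    result = {}
    for line in lines:
        match = FIELD_RE.search(line)
        if match:
            result[match.group("key").strip()] = match.group("value").strip()
    return result


SAMPLE = [
    "Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786157] {1}[Hardware Error]:   section_type: memory error",
    "Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786160] {1}[Hardware Error]:   error_type: 2, single-bit ECC",
    "Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786161] {1}[Hardware Error]:   DIMM location:  B1",
]

if __name__ == "__main__":
    info = parse_hardware_error(SAMPLE)
    print(info["DIMM location"], info["error_type"])
```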
Sep 11 2025, 7:52 AM · cloud-services-team

Sep 10 2025

Volans added a comment to T404163: CodeSearch is unresponsive (2025-09-10).

Restart completed

Sep 10 2025, 9:13 AM · VPS-project-Codesearch