User Details
- User Since
- Feb 10 2016, 11:25 AM (521 w, 6 d)
- Availability
- Available
- IRC Nick
- volans
- LDAP User
- Volans
- MediaWiki User
- RCoccioli (WMF) [ Global Accounts ]
Yesterday
@Scott_French from a quick check, unless I made some careless mistake, I'd say around 733
(the second grep is just to allow cumin to aggregate them all, stripping the time, host, and PID info)
Given the hot topic I'll add my 2 cents. I think the proposed solution will most likely dissatisfy everyone: on one side it will disrupt the existing workflows of those who use Etherpad regularly, and on the other it will require WMF to keep maintaining it over time. It really just solves the DB table size problem, assuming no excessive abuse happens before the next truncation.
Thu, Feb 5
@jijiki out of curiosity, have you already looked at/excluded logrotate from the picture? AFAICT it runs fleetwide a few seconds after midnight, and it's possible that some service, upon being notified of the log rotation, does a less-than-smooth reload?
Tue, Feb 3
The cumin hosts are on v22.6.0-2 as reported by https://debmonitor.wikimedia.org/packages/python3-kubernetes
Fri, Jan 23
After quite a bit of testing, the LSM_PROBE route turned out not to be feasible because of a missing kernel feature that AFAICT can't simply be enabled but requires a recompilation of the standard Debian kernel. I tried various hooks, including file_open and vfs_open, without success. I've come up with a solution using the current approach and sent the above patch for it.
Thu, Jan 22
Thanks for the report, that's why I was asking whether we knew of tools using, for example, the dumps. It looks like the pre-existing code had an optimization to exclude anything that doesn't start with /mnt/nfs, and I naively assumed it had already been verified that the data at that point was a resolved path, but it's not.
It looks like we could move the hook from the open/openat syscall to LSM_PROBE (I've quickly checked and we seem to have support for it, but I'll check more extensively across the whole fleet of workers to be sure), which should get hooked after the kernel has resolved the path.
I'm working on a patch to test the above assumptions.
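For reference, this is roughly what an LSM-based hook could look like with BCC, assuming a kernel built with CONFIG_BPF_LSM and booted with the bpf LSM enabled (which, as noted in the later update above, turned out not to be the case for the standard Debian kernel). A minimal counting sketch, not the actual patch:

from bcc import BPF
import time

prog = r"""
#include <linux/fs.h>

BPF_HASH(counts, u32, u64);

/* file_open fires after the kernel has resolved the path, unlike the
   open/openat syscall entry points that only see the user-supplied string. */
LSM_PROBE(file_open, struct file *file) {
    u32 key = 0;
    counts.increment(key);
    return 0;  /* 0 = allow the open to proceed */
}
"""

b = BPF(text=prog)
time.sleep(10)
for k, v in b["counts"].items():
    print("file_open calls in the last 10s:", v.value)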
@MLechvien-WMF I haven't worked on this since the original unplanned effort that generated the above patch. As I have also moved to a different team since then, I don't think I will be able to work on this. The existing patch can surely be picked up and carried over by interested parties in the affected teams (serviceops, traffic, I/F) if that approach still makes sense.
Thu, Jan 15
Limits increased.
Mon, Jan 12
Adding Data-Persistence and Data-Engineering for visibility and feasibility feedback.
Jan 8 2026
@Physikerwelt if you need NFS you can refer to https://wikitech.wikimedia.org/wiki/Help:Shared_storage
Jan 7 2026
Limits increased to 1TB of disk and 32G of RAM.
@Physikerwelt Do you also need the vCPU limit to be increased? I see they are all in use right now.
Dec 18 2025
There is also an opportunity here to try to consolidate the technology used for container registries around the Foundation; for example, there is also the Toolforge one. If in the long term we end up using the same stack for both, it should simplify their operation and maintenance and allow more people to be familiar with the stack.
Dec 15 2025
I've sent a proposed patch with a possible approach that keeps the benefits of the @dataclass approach and allows extra arguments to be ignored when present, preventing errors.
It would still be nice to have a check in the Puppet repo to remind the user that a patch for spicerack is needed, given that the warning comment at the top of the file goes unnoticed.
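To illustrate the idea (a minimal sketch, not the actual spicerack patch; the class and field names are made up):

from dataclasses import dataclass, fields

@dataclass
class ModuleConfig:  # hypothetical name, for illustration only
    api_url: str
    token: str

    @classmethod
    def from_dict(cls, data: dict) -> "ModuleConfig":
        """Build the dataclass ignoring unknown keys instead of raising TypeError."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})

# Extra keys added on the Puppet side before the spicerack patch lands are simply ignored:
cfg = ModuleConfig.from_dict({"api_url": "https://example.org", "token": "x", "new_key": 1})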
Nov 27 2025
If you want to bind to any address, from the docs at [1] it seems you can just omit the setting and not specify either SDAddresses or SDAddress. SDPort seems optional as we're using the default port.
Nov 25 2025
The problem is not space; that disk is out of inodes:
# df -hi /srv/
Filesystem  Inodes  IUsed  IFree  IUse%  Mounted on
/dev/sda      5.0M   5.0M      0   100%  /srv
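To narrow down where the inodes are going, a quick throwaway helper along these lines can list the directories holding the most direct entries (a sketch, not an existing tool; the path is just the mountpoint from above):

import os
from collections import Counter

counts = Counter()
for root, dirs, files in os.walk("/srv"):
    # count only the direct entries of each directory, not cumulative totals
    counts[root] = len(dirs) + len(files)

for path, n in counts.most_common(10):
    print(f"{n:>10} {path}")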
Nov 21 2025
FYI in production there are already mechanisms to automatically kill queries, so we might just be able to reuse them.
@DSmit-WMF did you have a chance to go through the documentation to clarify which access level you need to the analytics-privatedata-users group?
Perhaps level 1 is already enough for you?
@Milimetric / @Ahoelzl could one of you by any chance review this task for approval?
@MoritzMuehlenhoff yes but I think Kwaku is covering for Kavitha while she's away in the next few days.
Nov 20 2025
Adding collaboration-services
Nov 19 2025
Just for context referencing past ideas on the topic: T327300
@Catrope patch merged, it will be live within ~30 minutes. Kerberos principal created; you should have received an email about it with instructions on how to set up your password.
Once you've verified that everything works as expected please resolve this task. Thanks :)
Adding Data-Engineering for visibility, no approval required for WMF staff.
Pending approval from @SCherukuwada
Nov 18 2025
@WMDECyn, in case @Chandra-WMDE's position is a fixed-term contract, could you provide us with the expiration date so that we can add it to data.yaml to track it?
@MLechvien-WMF the patch has been merged, and then puppet-merged (an internal step required to make changes to the puppet repo live in production). Your user will be picked up and created at the next puppet run and will be on all hosts within 30 minutes from now. Once you've verified all works as expected please resolve this task, thanks.
Please refer to https://wikitech.wikimedia.org/wiki/Data_Platform/Data_access#Access_Levels to clarify which access level you need to the analytics-privatedata-users group.
We've discussed this in the sre-tools-infrastructure team and there are no objections to the goal of keeping the cloudcumin hosts in sync with the cumin ones, nor to an in-place upgrade as suggested by Moritz offline.
@BTullis can you confirm all is working fine and we can resolve this task?
@ngkountas patch merged, it should be live within 30 minutes from now. Once you've verified all works as expected please resolve this task.
@MFischer patch merged, it should be live within 30 minutes from now. Once you've verified all works as expected please resolve this task.
Patch merged, resolving. @AnkitaM you're currently part of the LDAP nda group and should be able to access all the UIs that require this group. Feel free to re-open in case you encounter any issue.
Nov 17 2025
@Lydia_Pintscher patch merged, it should be live within 30 minutes from now. Once you've verified all works as expected please resolve this task.
Whoops, my bad, it was not tracked in the summary and I missed it in the thread.
Pending approval from either @Kappakayala or @mark (from data.yaml approval list)
@Arian_Bozorg patch merged, it should be live within 30 minutes from now. Once you've verified all works as expected please resolve this task.
Pending approval from @WMDE-leszek
@ngkountas just to be sure: when you say analytics-privatedata-users level 2, do you mean with Kerberos?
Confirmed the ssh key out of band.
Adding Data-Engineering for visibility and @Milimetric, @Ahoelzl for approval (either of them) from the data engineering side, as per docs.
Nov 10 2025
Possibly related to T328869
Nov 6 2025
If I'm not mistaken the above patch should grant the usage of the requested instance type. I'm not sure whether there is a separate setting for the fast-iops bit or it's included in that configuration.
Nov 3 2025
From a quick check it seems that the host is being randomly hammered by some traffic (to be investigated):
Oct 21 2025
For some related historical context on the lack of parity between the Icinga and AM APIs in Spicerack see also T293209 (look for optimal).
Oct 10 2025
[note for future self] If we can wait for Python 3.14 to be around on our systems then we should also evaluate the new InterpreterPoolExecutor for this; it might be a good fit.
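Roughly what that could look like (a sketch only; InterpreterPoolExecutor is new in Python 3.14 and check_host is a made-up placeholder for the real per-host work):

from concurrent.futures import InterpreterPoolExecutor

def check_host(hostname: str) -> str:
    # placeholder; each call runs in its own sub-interpreter instead of a thread/process
    return f"{hostname}: ok"

if __name__ == "__main__":
    with InterpreterPoolExecutor(max_workers=4) as pool:
        for result in pool.map(check_host, ["host1", "host2", "host3"]):
            print(result)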
Oct 6 2025
From cumin1003 for example, you can use:
SSH_AUTH_SOCK=/run/keyholder/proxy.sock scp [OPTIONS] cumin1002.eqiad.wmnet:/path/... /path/...
Sep 24 2025
Transient failure of git pull for the cloud/wmcs-cookbooks repository, self-resolved at the next puppet run.
Sep 23 2025
@Scott_French sorry for the trouble. The patch that added the timeout to the cookbook's version of the function went in ~1.5 years after the functionality landed in Spicerack, and somehow we missed double-checking the functional equivalence when migrating the cookbook to the Spicerack module's function. Sorry about that.
I think we can set a reasonable default timeout without making it tunable, at least as a first fix that could go in at any time.
I doubt we'll have special needs for specific timeouts though; we're talking about DNS queries, not HTTP requests 😉
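For illustration only (not the actual Spicerack code, which may wire this up differently), with dnspython a default can be set directly on the resolver so every query gives up after a bounded amount of time:

import dns.resolver

resolver = dns.resolver.Resolver()
resolver.timeout = 2.0   # per-nameserver timeout, in seconds
resolver.lifetime = 5.0  # total time budget for the whole query
answer = resolver.resolve("wikimedia.org", "A")
print([rr.to_text() for rr in answer])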
Sep 11 2025
Ideally sampled logs would be good enough, depending on how complex the setup to sample them is.
If there are no easy options for real sampling we could also consider alternative approaches:
- a poor man's sampling, playing with log rotation and retention (e.g. rotate often and keep only 1 block out of every N rotated ones)
- a size-based retention that limits the total size of the logs to a predictable amount (it will not help with issues in the past, but I guess most issues where we need the data are live/ongoing)
- deduplicating the logs to increase the signal (see the sketch after this list)
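As a rough sketch of the deduplication idea (the timestamp handling is an assumption; adjust to the real log format), something in the spirit of uniq -c but ignoring the changing prefix:

import sys
from collections import Counter

counts = Counter()
for line in sys.stdin:
    # assume the first whitespace-separated field is the timestamp; drop it
    parts = line.rstrip("\n").split(maxsplit=1)
    if not parts:
        continue
    counts[parts[-1]] += 1

# print the distinct messages with how many times each occurred
for msg, n in counts.most_common():
    print(f"{n:>8}  {msg}")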
+1 for me
The problem is that there is no evidence in the hardware logs, and I doubt we'll get any replacement from Dell without it.
I've found this in kern.log/dmesg but nothing in racadm logs (both getsel and lclog):
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786152] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786154] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786155] {1}[Hardware Error]: event severity: corrected
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786156] {1}[Hardware Error]: Error 0, type: corrected
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786156] {1}[Hardware Error]: fru_text: B1
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786157] {1}[Hardware Error]: section_type: memory error
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786157] {1}[Hardware Error]: error_status: 0x0000000000000400
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786158] {1}[Hardware Error]: physical_address: 0x0000000f9ec05180
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786159] {1}[Hardware Error]: node: 1 card: 0 module: 0 rank: 2 bank: 18 device: 6 row: 53592 column: 448
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786160] {1}[Hardware Error]: error_type: 2, single-bit ECC
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786161] {1}[Hardware Error]: DIMM location: B1
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786168] [Firmware Warn]: GHES: Invalid error status block length!
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786185] soft_offline: 0xf9ec05: invalidated
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786192] mce: [Hardware Error]: Machine check events logged
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786194] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 255: 940000000000009f
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.794205] mce: [Hardware Error]: TSC 0 ADDR f9ec05180
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.799695] mce: [Hardware Error]: PROCESSOR 0:806f8 TIME 1757558012 SOCKET 0 APIC 0 microcode 2b000639
Sep 10 2025
Restart completed