Wed, Jan 15
As per IRC chat it's ok as is, resolving.
Mon, Jan 13
I was made aware that the two comments above are contradictory. I don't recall the reasoning behind my comment above or any limitation of the two-certs approach. I agree they are separate services and should not depend on each other.
@faidon: I mainly opened this because it was the only DC without a rack group; even the network PoPs have one and use the raw name of the DC, not just '1', see https://netbox.wikimedia.org/dcim/rack-groups/
Fri, Jan 10
Wed, Jan 8
Thu, Jan 2
@Jclark-ctr by any chance do you have an ETA for this task? Just so I know and can plan something related accordingly.
Indeed, done :)
@ema maybe it could be related to NUMA utilization? Having a quick look at numastat (both -n and -m), there is a general imbalance between the two nodes (which I think is mostly on purpose due to our custom config), and the varnish process seems to be the main one responsible for it. But there was no spike in the graph either.
Tue, Dec 24
The issue with the DELETE has been fixed: I've successfully deleted the image docker-registry.wikimedia.org/python3-build-stretch:0.0.2 that was failing during the tests.
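For reference, a minimal sketch of how such a deletion can be exercised against a Docker Registry v2 API (illustrative only, not the exact test that was run; the registry URL and image name are taken from the comment above, and authentication is omitted):

```python
import requests

REGISTRY = "https://docker-registry.wikimedia.org"
IMAGE = "python3-build-stretch"
TAG = "0.0.2"

# The registry returns the manifest digest in the Docker-Content-Digest header
# when asked for the v2 manifest media type.
headers = {"Accept": "application/vnd.docker.distribution.manifest.v2+json"}
resp = requests.get(f"{REGISTRY}/v2/{IMAGE}/manifests/{TAG}", headers=headers)
resp.raise_for_status()
digest = resp.headers["Docker-Content-Digest"]

# Deletion is performed by digest, not by tag (auth omitted for brevity).
del_resp = requests.delete(f"{REGISTRY}/v2/{IMAGE}/manifests/{digest}")
print(del_resp.status_code)  # 202 Accepted on success
```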
Please ensure that the /upload endpoint still works as expected too.
Mon, Dec 23
Thanks, LGTM, feel free to proceed.
@crusnov thanks for the dry run, here are my comments:
Interesting; given that the new cookbook kills the hosts, that was unexpected, but the cookbook is very quick so I get why it happens.
My suggestion is to add a small sleep (with a log line to tell the user) before this line https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/hosts/decommission.py#167
Probably 10~30s should be enough to let any in-flight action complete before running the other actions.
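Something along these lines (a minimal sketch only, not tested against the actual cookbook; the constant name and the chosen value are made up):

```python
import logging
import time

logger = logging.getLogger(__name__)

# Grace period so that any in-flight action can complete before the
# decommission steps below kick in (value is illustrative, 10~30s range).
GRACE_SLEEP = 20

logger.info('Sleeping %d seconds to let any in-flight action complete', GRACE_SLEEP)
time.sleep(GRACE_SLEEP)
```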
Sat, Dec 21
Nothing in the host logs either. For the record, it crashed 7 minutes after cp3051 (see T241306) and both are part of the upload esams cluster.
The host crashed again today, nothing in racadm, checked both getsel and lclog view.
Nothing in racadm, checked both getsel and lclog view. Nothing in syslog & co.
Fri, Dec 20
@Aklapper yes, as the host got reimaged I think the page was not updated, but I cannot edit it unfortunately.
Dec 17 2019
Dec 12 2019
Dec 11 2019
Dec 10 2019
That's great. The idea of the task was to link the specific dashboard that has the same data, while sometimes we use data that is not shown on Grafana at all, or we link a generic dashboard and not a specific graph.
I don't know the current state of all those links though, so I'll leave it to your best judgement.
Dec 9 2019
So far so good, leaving it open for another week or two to ensure the issue is totally fixed.
Currently open CRs towards the netbox-reports repo should be checked to see if they need to be resent towards the new repo:
The OOM issue has been fixed and for now memory, disk and CPU seem to be under control.
Resolving it for now; we can re-open if this turns out to be required anyway.
I understand that this might seem confusing, but it was decided from the start that debmonitor should not keep track of those, because the notion of a specific Debian release is quite arbitrary, depending on which APT repositories you set up on the host and which packages you install.
Another way of looking at it is that a package version in a Debian repository is not tied to a specific release: a specific release uses that version, but the versions are independent of it.
CC @MoritzMuehlenhoff FYI
Dec 8 2019
Re-opening as this has not yet been solved at the md software RAID layer; Icinga is still critical and /proc/mdstat still reports the degraded status above.
Dec 6 2019
I've noticed that Phabricator emails are failing the SPF check; re-opening to add details. Feel free to move it to a separate task if needed.
Dec 5 2019
For the first one, the downtime cookbook failed to run Puppet on the active Icinga host to get the definitions of the reimaged hosts to downtime. Given how slow Puppet is on the Icinga host, if there are multiple runs at the same time it can happen that we hit the timeout even with --attempts 30.
My suggestion for running parallel reimages is to open 2~3 tmux sessions, run sequential reimages in each, and let them start a few minutes apart from each other.
Dec 4 2019
Dec 3 2019
Dec 2 2019
I've updated the mgmt DNS name record in Netbox that was still reporting wezen. I also have a patch to clean up the wezen record from DNS; I'll push it later today.
I've updated the mgmt interfaces' DNS names in Netbox that were still reporting the old cloudvirtan* names.
Not sure if it can be considered in scope for this task as the title is pretty generic.
Another check we need is to ensure that the hostname part of some DNS names matches the device name (see the sketch below), in particular:
- mgmt IP address
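A rough sketch of the kind of check meant here, assuming we have a device name and the DNS name of its mgmt IP at hand (the function and example values are illustrative, not actual Netbox report code):

```python
def mgmt_dns_matches_device(device_name: str, mgmt_dns_name: str) -> bool:
    """Return True if the hostname part of the mgmt DNS name matches the device name."""
    # e.g. device 'mw2231' should have mgmt DNS name 'mw2231.mgmt.codfw.wmnet'
    hostname = mgmt_dns_name.split('.', 1)[0]
    return hostname == device_name


assert mgmt_dns_matches_device('mw2231', 'mw2231.mgmt.codfw.wmnet')
assert not mgmt_dns_matches_device('mw2231', 'graphite2002.mgmt.codfw.wmnet')
```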
Forgot to mention that https://netbox.wikimedia.org/ipam/ip-addresses/687/ still had the old name; I've updated it.
Dec 1 2019
FYI This still happens in buster too, the Debian bug is still open.
We have 88 hosts that don't have /var/log/wtmp.1, and they generated cronspam today.
Nov 30 2019
@Papaul given that we're setting the DNS name of the IP address in Netbox, that one needs to be updated too; see the links above:
IP: 10.193.2.251/16 Assignment: mw2231 (mgmt) DNS Name: graphite2002.mgmt.codfw.wmnet
IP: 10.193.1.118/16 Assignment: mw2231 old (mgmt) DNS Name: mw2231.mgmt.codfw.wmnet
My understanding is that 10.193.1.118 is the mgmt IP assigned to the new mw2231 (but please double check it). In that case we should attach Netbox IP 10.193.1.118/16 to mw2231's mgmt interface and delete 10.193.2.251/16 if it's no longer used.
It seems that the IP address in Netbox has not been updated and still reports graphite2002 in the DNS name, see
The Netbox status is currently Decommissioning; if the host has been unracked it should be Offline.
ms-be2013 and ms-be2014 are marked as Decommissioning in Netbox; if they have been unracked, their status should be changed to Offline.
Re-opening as the DNS names of the interfaces attached to those hosts have not been updated in Netbox.
IP address: 10.193.1.23/16 Parent: logstash2020 DNS name: kafka2001.mgmt.codfw.wmnet
Nov 28 2019
It might be another occurrence of T238305 (model matches)
@jbond on the CI instances you have 3.4, 3.5, 3.6 and 3.7 available, although the system one is 3.5. Faidon did the packaging a while ago, and if you look at the CI jobs of many repos, they run all the environments from tox.
It looks like not all the Puppet code was made ensure=>'absent'. We might have many more small things still lying around as a result.
I've also removed the crontab entries for wmf_auto_restart_uwsgi-netbox and prometheus-postgres-exporter.
The Postgres user and related crontab are still present on the hosts and triggered a failure in the backup because there is no longer a DB to back up.
I've just removed the crontab for now.
I was able to debug the issue using tracemalloc:
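For the record, the general tracemalloc pattern for this kind of debugging looks roughly like this (illustrative only, not the exact code run on the host):

```python
import tracemalloc

# Keep up to 25 frames per allocation traceback.
tracemalloc.start(25)

snapshot_before = tracemalloc.take_snapshot()
# ... let the suspect code run for a while ...
snapshot_after = tracemalloc.take_snapshot()

# Show where the memory growth comes from, grouped by source line.
for stat in snapshot_after.compare_to(snapshot_before, 'lineno')[:10]:
    print(stat)
```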
I'm doing a quick debug attempt on acmechief-test2001
Nov 27 2019
I think the best way is to have it easily integrated in some form into the local workflow in our dev envs, so that when you run the tests locally they pass, and when you commit, you already commit the formatted version. Then CI just ensures that the code is already formatted according to the tool.
Otherwise the resulting workflow would be annoying: make a patch, send it to Gerrit, get V-1 from Jenkins, check the output, either run the tool locally or fix the formatting issues manually, then send a second PS.
Nov 26 2019
Nov 25 2019
If needed, full list of R440 available here: https://puppetboard.wikimedia.org/fact/productname/PowerEdge+R440 (intentionally not mentioning their count here)
No problem for me with 1 cert; it seems a reasonable approach.
Nov 23 2019
Nov 22 2019
Nov 21 2019
It's hard to reply to the description; there is no 'quote task description' button AFAIK.
The above patch has not yet been merged.
Nov 20 2019
@fgiunchedi I think it's a fair request, but given that we're in the process of auto-generating all mgmt and then server DNS records, this might have less benefit than in the current situation. Would it be ok to treat it as lower priority?
I don't mind the additional check, but again, I'm not sure how much of it is in scope for this task. If we do the extensive check then we should define a policy first, which is not strictly defined yet AFAIK.
Nov 19 2019
@herron @fgiunchedi I don't think that much; I guess you'd have to do the triggering part. I'm not super clear on what you have in mind, a script to run from somewhere or something else. I'd be careful with an email alias, as it could be easily abused.
I actually think that's not enough if we want to enforce a policy, although it's not clear if that's the scope of this task.
I've reverted the above patch as it was reporting most servers as false positives in the new name-coherence report. The regex was wrong and unable to match our current hostnames.
Also, the goal of this check is not totally clear to me, as the task was opened for the asset tag names specifically but the check was expanded to all hosts.
For asset tag hostnames, for example, we should check that the hostname matches the asset tag of the same device and is lowercase; that would already be enough.
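A minimal sketch of that check, assuming we have the hostname and the device's asset tag at hand (the function and the example values are made up for illustration, not actual Netbox report code):

```python
def valid_asset_tag_hostname(hostname: str, asset_tag: str) -> bool:
    """Return True if the hostname is the lowercased asset tag of the device."""
    return hostname == asset_tag.lower()


# Example values are hypothetical.
assert valid_asset_tag_hostname('wmf1234', 'WMF1234')
assert not valid_asset_tag_hostname('wmf1234', 'WMF5678')
```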
The debmonitor test didn't test much as the debmonitor client sends the puppet client cert (not the CA) and it's the server that validates it with the CA.
@jbond just to be on the safe side and to verify the theory, if possible make a quick test that the new cert in the CR is able to verify existing puppet node certs and cergen certs.
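A quick way to run that test could be something like the following (the paths are placeholders, not the real locations of the CR's cert or of the node/cergen certificates):

```python
import subprocess

NEW_CA_BUNDLE = '/tmp/new-ca-bundle.pem'     # placeholder: the cert from the CR
CERTS_TO_CHECK = [
    '/tmp/some-puppet-node-cert.pem',        # placeholder: an existing Puppet node cert
    '/tmp/some-cergen-cert.pem',             # placeholder: an existing cergen cert
]

for cert in CERTS_TO_CHECK:
    # 'openssl verify' exits non-zero if the cert does not validate against the bundle.
    result = subprocess.run(
        ['openssl', 'verify', '-CAfile', NEW_CA_BUNDLE, cert],
        capture_output=True, text=True)
    print(cert, 'OK' if result.returncode == 0 else 'FAILED', result.stdout.strip())
```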
Nov 15 2019
A simplified version could be to use a cookbook to couple stuff:
Nov 14 2019
Nov 13 2019
As we discussed a while ago, the easiest solution is to pick another port for the public TLS server on the debmonitor servers, as port 443 is already taken by the internal clients that report their package lists to it, and it's used to perform authz/n with the client certificate.