User Details
- User Since: Oct 3 2014, 8:06 AM (584 w, 4 d)
- Availability: Available
- IRC Nick: godog
- LDAP User: Filippo Giunchedi
- MediaWiki User: FGiunchedi (WMF)
Today
JFYI we can now proceed with cloudcephosd1052 too
Yesterday
Checking the kube-state-metrics container logs I get the following:
Fri, Dec 12
Something I noticed is that kube-state-metrics as a whole has lately been occasionally failing Prometheus scrapes: (long URL, toolforge.org isn't allowed on w.wiki) https://prometheus.svc.toolforge.org/tools/graph?g0.expr=sum_over_time(up%7Bjob%3D%22k8s-kube-state-metrics%22%7D%5B1h%5D)%2F%20count_over_time(up%7Bjob%3D%22k8s-kube-state-metrics%22%7D%5B1h%5D)&g0.tab=0&g0.stacked=0&g0.show_exemplars=0&g0.range_input=4w&g0.end_input=2025-12-12%2016%3A54%3A05&g0.moment_input=2025-12-12%2016%3A54%3A05
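For readability, the g0.expr parameter in that link can be URL-decoded, e.g.:

```python
from urllib.parse import unquote

# URL-encoded g0.expr parameter from the Prometheus graph link above
expr = ("sum_over_time(up%7Bjob%3D%22k8s-kube-state-metrics%22%7D%5B1h%5D)"
        "%2F%20count_over_time(up%7Bjob%3D%22k8s-kube-state-metrics%22%7D%5B1h%5D)")

# Decoded, it is the scrape success ratio over the last hour:
# sum_over_time(up{job="k8s-kube-state-metrics"}[1h]) / count_over_time(up{job="k8s-kube-state-metrics"}[1h])
print(unquote(expr))
```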
Thu, Dec 4
I took a look at why cloudcephosd1052 still has its second NIC up, currently:
This is done! I'll follow up with an announcement to sre@
Wed, Dec 3
I dug into this a little, currently:
Tue, Dec 2
Availability as seen by network probes:
Mon, Dec 1
ALSO: make sure grafana's default is UTC on new dashboards
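For the record, the timezone is a top-level field in Grafana's dashboard JSON model, so new dashboards can carry it explicitly rather than inheriting the browser default, e.g.:

```json
{
  "title": "Example dashboard",
  "timezone": "utc"
}
```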
I'm aware there is/was work going on on thanos/titan in T410152: Disk space saturation (/srv) on Titan hosts and perhaps related
Fri, Nov 28
To do the audit effectively, we can adapt search-grafana-dashboards.js from https://wikitech.wikimedia.org/wiki/Grafana#Search/audit_metrics_usage_across_dashboards
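A rough shape of that audit, sketched in Python rather than the JS linked above (the directory of exported dashboard JSON files is an assumption, not an existing tool):

```python
import re
from pathlib import Path

# Hedged sketch of the same idea as search-grafana-dashboards.js:
# scan exported dashboard JSON files for a metric name.
def dashboards_using(metric: str, dump_dir: str) -> list:
    """Return dashboard files whose JSON mentions `metric`."""
    pattern = re.compile(re.escape(metric))
    return sorted(p.name for p in Path(dump_dir).glob("*.json")
                  if pattern.search(p.read_text()))
```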
I briefly looked at the clouddumps1002 downtime from yesterday, and of course there was ~30m of downtime for dumps.w.o since 1002 serves those:
Thu, Nov 27
Something else to note: the alerts are deployed in eqiad only, not codfw
Leaving a suggestion for a workaround here for the record: while a native pg facility to detect password changes would be optimal, I think what we could do is write the (hashed, possibly salted with a salt we control) passwords to the filesystem (e.g. one per file), then use the following logic:
- if the password file doesn't exist, the user and its password need to be created in pg and on the fs
- if the password file does exist, compare its value with the current puppet password. If they differ, update pg and the fs. If they don't, there's nothing to do.
Something I wanted to add: I'm not very familiar with that part of the puppet codebase, but I was wondering if we can start referring to the networks and their attributes by name. That would let us write puppet code like "give me the subnets for VXLAN/IPv6-dualstack, either v4 or v6 or both", in other words more understandable and manageable code. Ditto for other attributes/properties of a given network, like its gateway, etc. Let me know what you think!
Wed, Nov 26
Looks good to me! Definitely good to refactor these bits
Tue, Nov 25
The logical work on the host side is done. Next up is deleting the hosts' interfaces from netbox and unplugging the network cables. I'll file subtasks
Mon, Nov 24
I can't currently reproduce the issue -- navigating to https://hub-paws.wmcloud.org/ spawns a container for me and works as expected. Is this still a problem @Sadeiiw67 ?
Fri, Nov 21
I'm calling this one done since we have a workaround in place; I'll follow up on the Debian bug for an actual fix
Wed, Nov 19
Reported to Debian as https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1121006
Tue, Nov 18
And lsblk -t for comparison:
As far as I understand the problem, lvm metadata size and alignment can depend on the data reported by the underlying block device, specifically the "optimal I/O size":
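The reported values can be read straight from sysfs; a small sketch (the sysfs argument exists only so the function is testable outside /sys, and the device name in the usage note is an example):

```python
from pathlib import Path

# Read the I/O limits the kernel reports for a block device; these
# are the values lvm consults for metadata sizing/alignment.
def io_limits(dev: str, sysfs: str = "/sys/block") -> dict:
    queue = Path(sysfs) / dev / "queue"
    return {attr: int((queue / attr).read_text())
            for attr in ("optimal_io_size", "minimum_io_size",
                         "physical_block_size", "logical_block_size")}
```

e.g. `io_limits("sda")` on the affected host should mirror the corresponding `lsblk -t` columns.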
I investigated this a little more today. The issue is grub_lvm_detect allocating memory based on the lvm metadata areas, of which locn[2] is of size 4293914624 (or 0xFFF00000)
I'm giving debugging this issue one more go; as part of this we now have pause-reboot.cfg included for cloudcontrol2010-dev, which avoids mucking around on apt1002
Something else I forgot: I'm assuming this also applies to codfw? i.e. we'll be moving those hosts to a single NIC as well
Mon, Nov 17
Thank you @cmooney ! FYI, per Andrew we really only care about cloudcephosd1035 through cloudcephosd1052, since the rest will be decom'd soon anyway
Nov 11 2025
Thank you for following up @Dwisehaupt and I'm glad to know you are making progress! These days I am no longer part of Observability, and will defer to @hnowlan to take it from here
Nov 10 2025
Yes please @cmooney, much appreciated! Note that this is currently not a blocker / not high priority, in the sense that I'll be test-driving the change on 1048 and 1049, which are configured correctly AFAICT; having it done and dusted would of course be great
Nov 6 2025
Thank you @cmooney for the summary, I'll add a few thoughts I had while working on the Toolforge on Metal project design document.
Mentioning it here as a followup to T409244: Toolforge outage: toolsdb out of space: it is important that we monitor the ability to read from and write to toolsdb, and possibly page on it
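The shape of such a probe could be: write a heartbeat row, read it back, and fail the check (eventually paging) on any error. A sketch using sqlite3 purely for illustration; a real check would use a MariaDB client against the actual toolsdb instance:

```python
import sqlite3
import time

# Minimal read/write probe shape: insert a heartbeat timestamp and
# read it back; any failure on either side fails the check.
def rw_probe(conn) -> bool:
    try:
        now = int(time.time())
        conn.execute("CREATE TABLE IF NOT EXISTS heartbeat (ts INTEGER)")
        conn.execute("DELETE FROM heartbeat")
        conn.execute("INSERT INTO heartbeat (ts) VALUES (?)", (now,))
        conn.commit()
        (ts,) = conn.execute("SELECT ts FROM heartbeat").fetchone()
        return ts == now
    except sqlite3.Error:
        return False
```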
Nov 5 2025
Doing a comparison with the replica on tools-db-6, there's ~800G free there:
disk space free trend for tools-db-4 over the last 30d
The tools-db-4 storage volume is out of space; I'll use this task for tracking
Oct 29 2025
I'm +1 on getting docker's puppetization to do the right thing, in other words detect the default route's MTU and set that value as docker's default.
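A sketch of what that detection could look like (standard Linux /proc and /sys paths; rendering the result into docker's daemon.json is an assumption about how the puppetization would apply it):

```python
import json
from pathlib import Path

# Find the interface holding the default route in /proc/net/route
# (destination 00000000), read its MTU from sysfs, and render the
# corresponding daemon.json fragment.
def default_route_iface(route_table: str) -> str:
    for line in route_table.splitlines()[1:]:  # skip header
        fields = line.split()
        if fields[1] == "00000000":  # destination 0.0.0.0/0
            return fields[0]
    raise RuntimeError("no default route found")

def docker_daemon_json() -> str:
    iface = default_route_iface(Path("/proc/net/route").read_text())
    mtu = int(Path(f"/sys/class/net/{iface}/mtu").read_text())
    return json.dumps({"mtu": mtu}, indent=2)
```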
Oct 27 2025
Alert is gone, optimistically resolving