User Details
- User Since
- Feb 10 2016, 11:25 AM (407 w, 3 d)
- Availability
- Available
- IRC Nick
- volans
- LDAP User
- Volans
- MediaWiki User
- RCoccioli (WMF)
Fri, Dec 1
@MoritzMuehlenhoff I see that ganeti[2009-2024] and ganeti[1009-1022] are lacking AAAA records while the rest have them. Can we add the records so they match the rest of the cluster?
Any update on the ms-be cluster, which is still mixed? Can it be migrated so that all hosts have IPv6?
Any update on this? The cluster is still mixed with some hosts having AAAA records and some without.
@akosiaris I see that:
- mw[1349-1413]
- mw[2259-2376]
- mc[2042-2055]
- parse[2001-2020]
Thu, Nov 30
@jhathaway ack, if we're not seeing any more failures in puppetboard let's close it and re-open in case they happen again.
I've commented on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/972724/4/includes/poolcounter/PoolCounterConnectionManager.php#84 about what looks like the possible issue with the patch.
This doesn't seem to be a widespread login problem at this time. (lowering the priority)
All indications so far point to a rate-limiting issue with multiple people sharing the same public IP.
Wed, Nov 29
@ssingh what's your timeline for switching to this new method of getting which DNS hosts are pooled? As you know we need to adjust spicerack/cookbooks accordingly.
Given no objections I went ahead and fixed ALL names and slugs to adhere to the standard. Triaging as low and leaving the task open to add a validator later.
Tue, Nov 28
I had a quick thought about the ENC++ problem, as you have named it. I think that in the end, given a Netbox device object (hostname + location + possibly other data) plus the hardware specs (auto-detected via Redfish?), we will need something that maps them to the following (see the rough sketch after this list):
- Puppet role (currently in site.pp)
- Hardware profile [BIOS virtualization + hardware RAID configuration] (currently manually set via cookbook argument and manually set up)
- Network profile [VLAN, skip IPv6, cassandra IPs, etc...] (currently manually set via Netbox provision script arguments)
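To make the idea concrete, here is a purely illustrative Python sketch of such a mapping; the class, function, rule and profile names are hypothetical and not part of any existing tool:

```
from dataclasses import dataclass, field


@dataclass
class Profiles:
    """The three outputs the mapping would need to produce."""
    puppet_role: str        # what currently lives in site.pp
    hardware_profile: str   # BIOS virtualization + hardware RAID configuration
    network_profile: dict = field(default_factory=dict)  # VLAN, skip IPv6, extra IPs, ...


def resolve_profiles(device, hw_specs):
    """Map a Netbox device (hostname, location, ...) plus hardware specs
    (e.g. auto-detected via Redfish) to the three profiles above."""
    # Hypothetical rule: Ganeti hosts get virtualization enabled and RAID1;
    # hw_specs would drive the BIOS/RAID choices in a real implementation.
    if device.name.startswith('ganeti'):
        return Profiles(
            puppet_role='ganeti',
            hardware_profile='virtualization-on+raid1',
            network_profile={'vlan': 'private', 'skip_ipv6': False},
        )
    raise NotImplementedError(f'no mapping defined for {device.name}')
```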
As all the above cookbooks are already listed in T317855 I'm resolving this as a duplicate.
Perfect, thanks for the update.
As the main blocker was resolved by giving more permissions to the bot in T314917, I'm lowering the priority for a general solution in the future.
As there is already a workaround to do that on demand in the cookbooks, and it will be even simpler with the cumin work mentioned, I'm declining this for now as it didn't get much traction. Happy to reopen it in the future if we feel it's necessary.
Untagged sre-tools and spicerack as I've created the dedicated sub-tasks for them.
We had only a couple of changes in the service.yaml schema in the last months, and both were sent to Spicerack before hitting production on the Puppet side, so nothing broke in those cases.
Instead of refactoring the whole thing in Spicerack, we were thinking it might be simpler to have a CI check in Puppet that verifies all the fields are there.
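As a rough sketch of what such a CI check could look like (the file path, the top-level key and the required field names below are assumptions for illustration, not the actual schema):

```
#!/usr/bin/env python3
"""Fail CI if any service in the catalog is missing a field Spicerack relies on."""
import sys

import yaml

# Hypothetical set of fields that Spicerack expects for every service.
REQUIRED_FIELDS = {'description', 'ip', 'port', 'sites', 'state'}


def main(path='hieradata/common/service.yaml'):  # assumed location of the catalog
    with open(path) as fh:
        catalog = yaml.safe_load(fh)

    errors = []
    for name, service in catalog.get('service::catalog', {}).items():
        missing = REQUIRED_FIELDS - service.keys()
        if missing:
            errors.append(f'{name}: missing fields {sorted(missing)}')

    if errors:
        print('\n'.join(errors), file=sys.stderr)
    return 1 if errors else 0


if __name__ == '__main__':
    sys.exit(main())
```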
As we got an email from VO about unassigned overrides, I think the issue here is that only one rotation was assigned, and not the one that actually pages:
Mon, Nov 27
@JMeybohm could you confirm the above or give me more context?
Interesting, I can confirm that on the netbox-next admin the user taavi doesn't have any associated groups and as such doesn't have the additional privileges.
But looking at the ops group in the same DB, taavi is listed under the Available users but not under the Chosen users, see https://netbox-next.wikimedia.org/admin/auth/group/8/change/
We got this today in the I/F IRC channel:
Sat, Nov 25
I see that on netbox-next you have 2 accounts: one with the username taavi and a personal email address, and one with your WMF email and the username you're reporting.
Given that next is for experimentation and the DB is cloned from production on demand from time to time, I took the liberty to delete both users.
Could you try to re-login and see if this time it works?
Thu, Nov 23
This was fixed in wmflib v1.2.1 released on Feb. 2nd.
The change has been merged and released with Spicerack v7.3.0 on Oct. 4th. Resolving.
Trying to run the import puppetdb script on cloudgw1002 is now a noop, but on cloudgw2002-dev it fails with this exception:
Thanks for reporting this. The issue was caused by a bug in one of the new custom validators that was hit only during the creation of a new device but not while editing an existing one.
The fix has been deployed to production. As an example this is a new test device created on netbox-next: https://netbox-next.wikimedia.org/dcim/devices/4642/
Some random additions:
When we introduced the sre.hosts.provision cookbook we envisioned
Piling many changes together simplifies the user interaction, but it leaves a lot of open questions about what to do in case of errors before the process can be automated:
Mon, Nov 20
I think that the problem is that the directory is defined in puppet with recurse=true in modules/prometheus/manifests/init.pp. Is that necessary? Could puppet just manage some subdirectories?
The hosts were set up in Netbox with a public VLAN and FQDN (wikimedia.org) while they should have been set up with the private one (eqiad.wmnet FQDNs).
The changes were not committed to the DNS (by running the sre.dns.netbox cookbook); as a result Icinga has been alerting for Uncommitted DNS changes in Netbox since Friday.
I've noticed that the provision cookbook was run for all the hosts, and failed for all of them. That's because the connection to the Redfish API of the iDRAC is via IP address, but the check that remote IPMI works uses DNS, and the management DNS records were not committed.
Great, thanks. Then I think T350656#9312531 should explain everything :)
Wed, Nov 15
There is no python2 in our setup of bullseye or bookworm. python3-phabricator is available in Debian (see https://packages.debian.org/bookworm/python3-phabricator )
Tue, Nov 14
This is now done.
yes, that's correct
Mon, Nov 13
Update: for the production side of things this is completed. Leaving open for now as the https://doc.wikimedia.org/spicerack/master/api/spicerack.puppet.html#spicerack.puppet.PuppetHosts.get_ca_servers method doesn't yet support SRV records but is currently used only in WMCS.
Thu, Nov 9
cp1108 completed: see T350179#9321006
I got cp1108 from Traffic to try. I ran a tcpdump in parallel on the install host (following https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage#DHCP_issues ) and there was NO incoming REQUEST matching ANY of the MAC addresses of the host:
- eno12399np0: 04:32:01:14:b5:80 (the active one)
- eno12409np1: 04:32:01:14:b5:81
- eno8303: b4:45:06:f6:5a:be
- eno8403: b4:45:06:f6:5a:bf
Wed, Nov 8
Another thing that is strictly related to Icinga at the moment is the raid_handler, which is triggered by any RAID alert and creates a task with the output of a script run on the fly via NRPE. See for example T316565
If you're not interested in the report of the results of the commands, you can set the worker.reporter property to the NullReporter (from cumin.transports.clustershell import NullReporter).
If you're not interested in the progress bars being printed for some commands, you can set worker.progress_bars = False.
Here is a full example:
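A minimal sketch of how those two settings fit into the usual cumin Config/Query/Transport flow; the query alias and the command are placeholders:

```
import cumin
from cumin import query, transport, transports
from cumin.transports.clustershell import NullReporter

config = cumin.Config()                                # load the default cumin configuration
hosts = query.Query(config).execute('A:placeholder')   # placeholder host selection
target = transports.Target(hosts)
worker = transport.Transport.new(config, target)
worker.commands = ['uptime']                           # placeholder command
worker.handler = 'sync'
worker.reporter = NullReporter                         # skip the final report of the results
worker.progress_bars = False                           # don't print the progress bars
exit_code = worker.execute()
```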
Indeed, do you have a dbctl host that is already out of production or not used for any reason? Otherwise we could pick a replica with low weight in the secondary DC.
But your history has this a little bit later in the file:
Tue, Nov 7
Does it make sense to migrate those fairly complex alerts, which report a lot of information in the alert itself, to alertmanager?
How many metrics would raid_megaraid, for example, need to generate to have the same level of information (per host and per disk)?
Ok, I think I found the problem: write_callback(self, callback, id, **args) doesn't pass any datacenter selection when calling obj = self.get(*id), while its signature allows for it: get(self, name, dc=None) (which in turn calls get_all(), but the dc is always propagated there).
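Purely as an illustration of the shape of the issue (a hypothetical simplification, not the actual code):

```
class WriteCallbackSketch:
    """Hypothetical simplification of the behaviour described above, not the real implementation."""

    def get_all(self, name, dc=None):
        ...  # would query the backend, filtering by datacenter when dc is given

    def get(self, name, dc=None):
        # dc is honoured here and propagated down to get_all()
        return self.get_all(name, dc=dc)

    def write_callback(self, callback, id, **args):
        obj = self.get(*id)  # <-- no datacenter selection is passed, although get() accepts dc
        # A possible fix would be to forward it, e.g.: self.get(*id, dc=args.get('dc'))
        return callback(obj, **args)
```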
There is no automatic expiration on any key written/edited by dbctl AFAIK.
As for the history content, it's a bit complicated, but from a quick look at the mirror logs I see that there was a db2103 key in eqiad at some point this morning:
The code is not checking whether the autoselection of the parent is None or not. That said, re-running the script now works fine. What was changed in the Netbox data to fix the issue?
I've also run while read line; do sudo dbctl instance "${line}" get; sleep 1; done < dblist (getting the dblist from etcd) and I couldn't repro the error.
@Dzahn once the above patch is merged you can proceed directly with running the reimage cookbook on the host, as the VM was correctly created and the last step was calling the reimage cookbook.
I was having a look: I checked etcd and I didn't find two records that could match the name db2103. I also can't repro it, on either cumin1001 or cumin2002:
Mon, Nov 6
@jbond I think that the decommission cookbook needs some adjustment too, both because it checks some git checkout on the puppetmaster's CA and because it also removes the certificate.
Nov 2 2023
My understanding is that all those hosts have already been reimaged into their related insetup::* role. I'm wondering why you need to re-image them again instead of just switching the role in site.pp and running puppet. The insetup role just installs the same base system that any other role would (assuming the appropriate insetup role was chosen).
Oct 31 2023
@cmooney thanks for the summary, a couple of questions:
Oct 27 2023
This hasn't happened in a long time. Resolving.
Resolving for now, feel free to re-open in case it happens again.
Oct 19 2023
Distributed locking is now live in Spicerack and used by the Cookbooks.
For a general overview see https://doc.wikimedia.org/spicerack/master/introduction.html#distributed-locking
Oct 18 2023
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, No puppet role has been assigned to this node. (file: /etc/puppet/manifests/site.pp, line: 2939, column: 9) on node db1229.eqiad.wmnet
Thanks for the task! I think another potential use case is the docker-reporter* units on the build host.
Oct 16 2023
The original idea for reporting images to debmonitor was that they should be reported at creation time and, given their immutability, there shouldn't be any need to report them again until deletion. Given the lack of a way to properly clean them up, the current implementation, as you know, is different.
Oct 12 2023
Mentioning T348525 too to avoid duplicate work.
Oct 11 2023
I really think that we need to find a solution for this. It has been pending for too long.
I also noticed that aux-k8s-ctrl.svc.eqiad.wmnet is missing the PTR record in the operations/dns repository.
FYI the service IPs are still allocated in Netbox:
https://netbox.wikimedia.org/ipam/ip-addresses/?q=kibana.svc
https://netbox.wikimedia.org/ipam/ip-addresses/?q=logstash.svc
FYI the SVC addresses are still allocated in Netbox: https://netbox.wikimedia.org/ipam/ip-addresses/?q=apple-search
I guess they should be removed. When doing so remember to run the sre.dns.netbox cookbook too.
FYI the service IPs in Netbox are still allocated to the service and probably need cleanup:
https://netbox.wikimedia.org/ipam/ip-addresses/?q=graphoid
@klausman the DNS step is marked as done, but I see that the ORES SVC records still exist in Netbox ( https://netbox.wikimedia.org/ipam/ip-addresses/?q=ores ). Is that a leftover or pending some other step? (when they're removed, a run of the sre.dns.netbox cookbook is needed)
Oct 9 2023
For the record, as Giuseppe is out, I had a chat with @CDanis going over the plan and the numbers, and we didn't find anything worrisome or any blockers. I'll proceed with the current implementation; in any case it will be off by default and switched on only with a puppet change to update the config file, which will also make it easy to stop using the locks in case there is any issue.
@cmooney adding a note here to not forget. We'll need to check how it will work for Ganeti VMs: in particular the makevm cookbook has knowledge of which DCs have per-rack subnets and treats them differently, but if it needs to be aware of rows it will require some refactoring, and possibly to get the information live instead of having it hardcoded.
Oct 4 2023
This should be resolved, feel free to re-open in case you have any issue.
IMHO we should stick to the agreed format from T284614#7214588 and T284614#7222919 and rename (and re-slug) the 3 non-matching ones into the format PowerEdge R440 - ConfigFundraising 202107 and so on. @wiki_willy what do you think?