User Details
- User Since
- Feb 10 2016, 11:25 AM (422 w, 5 d)
- Availability
- Available
- IRC Nick
- volans
- LDAP User
- Volans
- MediaWiki User
- RCoccioli (WMF)
Yesterday
We do have get_certificate_metadata() that raises spicerack.puppet.PuppetServerCheckError if the cert is not found (as opposed to other errors).
What I was suggesting is that we could do that check directly in destroy() in the puppetserver class so that it behaves the same as the old puppetmaster one.
That's indeed the current behaviour and clearly an error, thanks for reporting it!
The exit codes of the puppetserver ca clean command are not documented in Puppet, or at least I couldn't find them in the public docs/manpage/help messages/source code.
Ideally puppetserver should report two different sets of errors, the ones where there is a certificate but it failed to perform some cleaning operations and the ones where the certificate does not exist at all, but that doesn't seem to be the case.
Given that it doesn't, I think we shouldn't rely on the CLI's specific output messages and exit codes, as that could hide other errors now or in the future.
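A minimal sketch of how that check could look inside destroy() (get_certificate_metadata() and PuppetServerCheckError are the real spicerack names cited above; the surrounding class stub is only illustrative):

from spicerack.puppet import PuppetServerCheckError  # raised when the cert is missing

class PuppetServer:  # stub, just to show where the check would live
    def destroy(self, hostname: str) -> None:
        # get_certificate_metadata() raises PuppetServerCheckError only when
        # the certificate is not found, so the "no such cert" case surfaces
        # up front instead of being inferred from "puppetserver ca clean"
        # exit codes or output messages.
        self.get_certificate_metadata(hostname)
        # ...then proceed with the actual "puppetserver ca clean" call...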
Thu, Mar 7
But is this task still valid? The alert hosts were migrated to bookworm this week and puppet is running fine there.
Wed, Mar 6
Tue, Mar 5
Got the list of affected hosts with nodeset -S '","' -e "es20[35-40]" on a cumin host, then I ran the following code on Netbox:
>>> import uuid
>>> request_id = uuid.uuid4()
>>> user = User.objects.get(username='volans')
>>> def update(d):
...     ip = d.primary_ip6
...     ip.dns_name = ""
...     ip.save()
...     log = ip.to_objectchange('update')
...     log.request_id = request_id
...     log.user = user
...     log.save()
...
>>> devices = Device.objects.filter(name__in=["es2035","es2036","es2037","es2038","es2039","es2040"])
>>> len(devices)
6
>>> [d.name for d in devices]
['es2035', 'es2036', 'es2037', 'es2038', 'es2039', 'es2040']
>>> for device in devices:
...     update(device)
...
Re-opening as AAAA records were erroneously added to the hosts (AAAA records:N). I'll remove them programmatically.
Mon, Mar 4
Sounds good to me, let me know once done so that I can make the related changes to the report to include those too.
@wiki_willy yes, if we go that way then I guess a separate tab on the accounting sheet with both asset tags (chassis and motherboard), compiled only for the hosts that have had the motherboard replaced but the asset tag not reset, should give the report enough information to be adapted to include it.
Sat, Mar 2
FYI I've re-renamed PowerEdge R450 - Restbase-1G to PowerEdge R450 - ConfigRestbase-1G, as otherwise we'd have issues in the firmware upgrades, as outlined in T348036.
Fri, Mar 1
@RLazarus please also update requestctl-generator when you do it.
I've added an item to the task description. At the moment, because Superset is in the process of being migrated to k8s, the requestctl-generator file lives in two different places and needs to be modified in both, though the current prod one will disappear soon.
Thu, Feb 29
I don't think there is a clean solution if the iDRAC doesn't allow overriding the value on the motherboard when the replacement is done outside of warranty.
We could check if there is a way on the host to get both values and decide which one we want to export.
But if there isn't, then we'll need to keep both old and new values in at least one place to make sure we can cross-check them. That place IMHO could be either the accounting spreadsheet or Netbox, and then the report will be modified accordingly.
There is no mapping; the reported device types are just not following the correct naming scheme, as you can see here comparing with the others: https://netbox.wikimedia.org/dcim/device-types/?q=PowerEdge and as previously discussed in T348036.
Of course I don't want to add additional lag to any live-traffic data (pybal, mwconfig, dbctl), and if we deem that adding spicerack locks to the replication might cause that, let's find another solution. For example we could have a failover etcd cookbook that, when run, reads the active locks from the primary cluster and manually replicates them on the secondary one explicitly (see the sketch below). Or any other viable option.
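To make the idea concrete, a purely illustrative sketch of what such a cookbook could do (the etcd3 client calls are real, but the endpoints and the /spicerack/locks key prefix are assumptions):

import etcd3

LOCK_PREFIX = '/spicerack/locks/'  # hypothetical prefix for spicerack locks

# Placeholder endpoints for the primary and secondary etcd clusters.
primary = etcd3.client(host='conf-primary.example.wmnet', port=2379)
secondary = etcd3.client(host='conf-secondary.example.wmnet', port=2379)

# Read the active locks from the primary and re-create them explicitly on
# the secondary, without touching any live-traffic keys.
for value, metadata in primary.get_prefix(LOCK_PREFIX):
    secondary.put(metadata.key.decode(), value)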
Wed, Feb 28
Do we need to change anything on the sre.dns.netbox cookbook?
It currently runs:
cd {git} && utils/deploy-check.py -g {netbox} --deploy
Even from itself? As in what happens if an operator runs authdns-update on a depooled host?
But a second instance wouldn't prevent the current issue, right?
That etcdmirror is mirroring only the /conftool keys is totally news to me; I assumed it was replicating the whole content of etcd. But indeed it does not:
Tue, Feb 27
I've checked all the devices with names starting with db and es, and the only ones with IPv6 AAAA records are dbprov1004 and dbprov2004.
Cleanup completed, leaving the task open for DCOps to prevent this from happening again.
Got the list of affected hosts with nodeset -S '","' -e "db[2196-2220],es10[35-40]" on a cumin host, then I ran the following code on Netbox:
@Jhancock.wm @wiki_willy a few considerations here:
Is the 400 because of a missing cert? From a cumin host I get:
$ curl -G "https://puppetdb1003.eqiad.wmnet/pdb/query/v4/resources/Nagios_host" --data-urlencode 'query=["and", ["=", ["parameter", "ensure"], "present"], ["=", "exported", true]]'
<html>
<head><title>400 No required SSL certificate was sent</title></head>
<body>
<center><h1>400 Bad Request</h1></center>
<center>No required SSL certificate was sent</center>
<hr><center>nginx/1.22.1</center>
</body>
</html>
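For reference, that error is nginx telling us no client TLS certificate was sent; a hedged Python equivalent of the same query with the agent certificate attached (certificate paths are assumptions, typical Puppet agent defaults):

import json
import requests

query = ["and", ["=", ["parameter", "ensure"], "present"], ["=", "exported", True]]
response = requests.get(
    'https://puppetdb1003.eqiad.wmnet/pdb/query/v4/resources/Nagios_host',
    params={'query': json.dumps(query)},
    # Client cert/key and CA paths are assumptions; adjust per host.
    cert=('/var/lib/puppet/ssl/certs/host.pem',
          '/var/lib/puppet/ssl/private_keys/host.pem'),
    verify='/var/lib/puppet/ssl/certs/ca.pem',
)
response.raise_for_status()
print(response.json())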
Mon, Feb 26
@fnegri Thanks a lot for resuming this and taking care of it. As I've stated in the CR it looks good to me, but I would like an explicit approval from Search and some support in testing it after releasing it to make sure it all works fine.
@sbassett no objection from my side, feel free to unblock it.
Patch merged, this should be done.
FWIW the bulk of the bandwidth usage is generated by only 15 IPv6 addresses that are downloading mostly .webm videos; see the already-filtered Superset dashboard.
Sat, Feb 24
We do have the ability to run cookbooks for non-global roots via the secure-cookbook binary (see its usage in modules/admin/data/data.yaml), but the reimage does indeed require pwstore access for the management password.
Fri, Feb 23
The fact that the lookup is run on the puppetserver is by design; it's just to detect whether the profile::puppet::agent::force_puppet7 hiera key is set or not.
In this case that's failing with:
Does this mean that it will fire with only 2 hosts in a given PoP? (we have 35 hosts per PoP at the moment and the 3% threshold is 1.05, so any 2 hosts already cross it)
Maybe the threshold should be different between PoPs and core sites? In the core sites the current threshold is 25.74 and 35.01 in codfw and eqiad respectively.
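For context, assuming the core-site thresholds follow the same 3% rule (that assumption is mine), they back out to the implied fleet sizes:

# Implied host counts if the core-site thresholds are 3% of the fleet.
for site, threshold in (('codfw', 25.74), ('eqiad', 35.01)):
    print(site, round(threshold / 0.03))  # -> codfw 858, eqiad 1167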
Thu, Feb 22
I've tested that the cookbook works fine with the existing cached firmware images on the cumin hosts, and today I've synced the whole cache to cumin1002, so at least for standard upgrades this shouldn't be a blocker until it's fixed.
I've copied the logs in /var/log/{cumin,debdeploy,spicerack} on cumin1001 to /var/log/cumin1001 on cumin1002 and cumin2002 using SSH_AUTH_SOCK=/run/keyholder/proxy.sock scp....
Wed, Feb 21
The one-at-a-time part is what worries me a bit; in this last case there were 9 hosts and they needed 3 IPs per host, so 27 times...
The other bit is the name, which in the specific case of cassandra must follow the existing standard ('-a', '-b', ...) to match the puppet side of things.
Do they have 10G NICs? Is the NIC firmware at the correct version? See https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Urgent_Firmware_Revision_Notices:
@Eevans yes, we've done it already in T305568#7992643 :(
I've created the records for 3 cassandra instances (-a, -b and -c) in Netbox.
Changelog: https://netbox.wikimedia.org/extras/changelog/?request_id=b7d99b31-a005-43be-91a2-f53dfac1c597
Code executed:
>>> import uuid
>>> request_id = uuid.uuid4()
>>> user = User.objects.get(username='volans')
>>> def update(d):
...     prefix = Prefix.objects.get(prefix=d.primary_ip4.address.cidr)
...     for letter in ('a', 'b', 'c'):
...         extra_ip_address = prefix.get_first_available_ip()
...         extra_dns_name = f"{d.name}-{letter}.{d.site.slug}.wmnet"
...         address = IPAddress(
...             address=extra_ip_address,
...             status="active",
...             dns_name=extra_dns_name,
...             vrf=prefix.vrf.pk if prefix.vrf else None,
...             assigned_object=d.primary_ip4.assigned_object,
...             tenant=d.tenant,
...         )
...         address.save()
...         log = address.to_objectchange('create')
...         log.request_id = request_id
...         log.user = user
...         log.save()
...
>>> names = [f'restbase10{i}' for i in range(34, 43)]
>>> names
['restbase1034', 'restbase1035', 'restbase1036', 'restbase1037', 'restbase1038', 'restbase1039', 'restbase1040', 'restbase1041', 'restbase1042']
>>> devices = Device.objects.filter(name__in=names)
>>> len(devices)
9
>>> [d.name for d in devices]
['restbase1034', 'restbase1035', 'restbase1036', 'restbase1037', 'restbase1038', 'restbase1039', 'restbase1040', 'restbase1041', 'restbase1042']
>>> for d in devices:
...     update(d)
...
Re-opening as Puppet has been broken on ncmonitor1001 for 7.5 days.
Mon, Feb 19
Patch available at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1003112
Feb 16 2024
Feb 14 2024
Feb 13 2024
@Dzahn you can get a working console either by running the ssh command from the master node with the known-hosts file set to /dev/null and strict host key checking disabled (-o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no), or by running the actual command directly on the ganeti host where the VM is running (the part from /usr/lib/ganeti/tools/kvm-console-wrapper ...).
Should we have /var/lib/ganeti/known_hosts managed by Puppet?
I think there is some confusion, let me clarify some things:
The cookbook doesn't reboot the host once it's in the Debian Installer; it's the Debian Installer that reboots the host once the base installation is completed.
FYI the host has been up and running with the old OS but the new puppet role and with puppet disabled for 26 days; it has disappeared from puppetdb (because puppet is disabled) and hence from monitoring and everything else, making it effectively a ghost apart from being reported by a Netbox report.
Feb 12 2024
So, etcd ACLs are managed in two different ways: there are very few static users (like 2~3) with their passwords in the private repo, and then auto-generated users for each cluster with auto-generated passwords derived from a seed that is defined in hiera as etcd::autogen_pwd_seed. The value of the seed is different between production (private repo) and PCC:
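As a toy illustration of why the seed matters (this is not the actual derivation Puppet performs, only the general idea of deterministic, seed-based passwords):

import hashlib

def autogen_password(seed: str, username: str) -> str:
    # Same seed + username in, same password out: since production and PCC
    # use different seeds, they necessarily generate different passwords.
    return hashlib.sha256(f'{seed}:{username}'.encode()).hexdigest()[:24]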
We're not doing CSV dumps anymore, see T310615. Closing it.
Feb 7 2024
We could either catch the exception and retry or acquire a lock for all puppetserver ca operations.
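A rough sketch of the catch-and-retry option (exception type, retry count, and delay are placeholders, not the real spicerack code):

import time

def ca_operation_with_retry(operation, retries=3, delay=5):
    # Retry a "puppetserver ca" operation that can fail transiently when a
    # concurrent CA operation is in flight; the alternative is a shared lock
    # around all puppetserver ca operations, not shown here.
    for attempt in range(1, retries + 1):
        try:
            return operation()
        except RuntimeError:  # placeholder for the actual exception
            if attempt == retries:
                raise
            time.sleep(delay)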
@BTullis thanks for the follow-up on your hosts.
Feb 6 2024
For posterity I'd also like to mention how misleading the error message was, as the debian-installer UI looked like it was failing to get the proper netmask in the network configuration, and indeed on one of the failing hosts I found from ip address:
inet 10.64.32.125/22 scope global eno1
instead of a normal:
inet 10.64.32.125/22 brd 10.64.35.255 scope global eno1
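For reference, the expected broadcast address for that /22 can be double-checked with Python's ipaddress module:

import ipaddress

# 10.64.32.0/22 spans 10.64.32.0-10.64.35.255, hence brd 10.64.35.255.
print(ipaddress.ip_interface('10.64.32.125/22').network.broadcast_address)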
Ok the sretest1001 reimage is going through, I'll leave the task open until the reimage finishes. The issue was a typo (double ||) in the partman hiera configuration.
Feb 5 2024
Done.
FYI cloudcephosd1040 had a wrong WMF asset tag, wmf108805; I guess it was supposed to be wmf10805.
@wiki_willy @Jclark-ctr @RobH As I see that Rob is out this week, to unblock the rest of SREs with any DNS-related change in Netbox I'm running the sre.dns.netbox cookbook.
Feb 3 2024
Since yesterday there are pending DNS changes in Netbox, related to those hosts, that have not been committed to the auto-generated DNS repository:
Feb 2 2024
Feb 1 2024
All the hosts have stopped logging the error. Resolving.
Jan 31 2024
Sorry, moving it in the dashboard column automatically resolved it; re-opening.
All tests with @jcrespo for Netbox were successful and the patches are merged; this should be solved on the Netbox side of things.
@jcrespo we do have our first backup in netboxdb2002:
Thanks for the deep dive, the plan LGTM!
I guess we could try to see if there is a better way to flush the iptables+logging rules during the migration, but the risk of still leaving some partial config probably makes it not worth it, when with the reboot we're sure of the result.
Indeed the dhcrelay not working as expected is a bit annoying, also because if we run a dhcrelay for each VM we'd need to hook into VM shutdown too, to kill it; otherwise at the next startup on the same tap interface we'll get 2 instances running (unless the previous one crashes when the interface is deleted, but I doubt it).
Jan 30 2024
Adding Data-Engineering as the change will also affect an-db[1001-1002].eqiad.wmnet in addition to the netbox DBs.
@cmooney did you have a chance to test the above failure scenario? AFAICT it is still happening.
Jan 29 2024
Jan 25 2024
Jan 22 2024
Debugging this, it seems it was caused by a race condition: the run-puppet-agent check passed and said that puppet was not running, but by the time the puppet agent was run the puppet lock had already been re-created by another standby run.
Declining because of inactivity and an unclear line of action due to the opposing views. Feel free to re-open if you feel there is a clear direction to follow.
@bking Assigning it to you as you've already sent a patch for it. When you have a minute please resume it so we can complete and merge it.