User Details
- User Since: Feb 10 2016, 11:25 AM (428 w, 3 d)
- Availability: Available
- IRC Nick: volans
- LDAP User: Volans
- MediaWiki User: RCoccioli (WMF)
Thu, Apr 18
Some Juniper equipment relies on DHCP for ZTP as well, and there may be other uses of DHCP. Any idea if anything else relies on it?
I think that treating them as x2 with "omit_replicas_in_mwconfig": true might just work. The spares could be set either as candidate masters or as simple replicas for each section, given the above config. Upon promotion of a spare to master in one section it should be removed from all the other sections and replaced with the new spare.
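For illustration only, this is roughly what I mean per section (the exact dbctl schema and key names here are assumptions to double-check; only "omit_replicas_in_mwconfig" comes from the existing x2 handling):
# Illustrative sketch of the per-section data, not the real dbctl schema.
section_config = {
    "sX": {                                  # hypothetical section name
        "master": "db1234",                  # hypothetical current master
        "min_replicas": 0,
        "readonly": False,
        "omit_replicas_in_mwconfig": True,   # replicas kept out of the generated mediawiki-config
    },
}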
Tue, Apr 16
The change would not be very small: to make it general we would need to make cumin support multiple instances of each backend, each with its own settings, and also a way to select them via the query language. Definitely a breaking change for the existing config and query language.
Given the lack of interest in the last few years, I'm closing it as declined. It can be re-opened if there is renewed interest in working on this.
I see only one case where the implementation is straightforward and clean on the UI side, the one with --batch 1.
Cumin is currently working with the running user from the cuminunpriv1001 host (after a kinit) towards kerberized hosts, such as the install hosts.
See also T224097 for a similar use case.
The underlying logic is very similar to T164587, I'm merging this into that one.
The issue has been solved upstream in v1.9.0 and is included in the version in Debian bookworm. Nothing to do on our side specifically; it will automatically get fixed once the underlying hosts are upgraded.
Merging this with T346453 as the testing plan outlined in T346453#9713036 will also cover this use case.
Mon, Apr 15
With exclusive locking now in place for the sre.dns.netbox cookbook I think we can consider this resolved.
@JMeybohm Is this something still needed?
Removing the cumin and I/F tags as there isn't anything pending from this side.
Resolving then, thanks to all who contributed to the fix! Feel free to re-open if there is still any related issue for 3.11. For 3.12 we have a different one tracked in T354410.
With the above patch I think the issue should be solved and we can resolve the task. Could anyone try to repro it again?
Wed, Apr 10
De-assigning it from me as Brian is working on this.
Thanks for the summary, well outlined. I've spoken a bit with Arzhel and I think that the general idea of using Netbox data in a more streamlined way for DHCP is sound. There are some comments/concerns/caveats that I would like to highlight, but nothing is a hard blocker:
Tue, Apr 9
This specific failure is due to the special nature of the secondary Icinga host, which is not monitored by Icinga. The downtime is already performed best-effort by the cookbook. The issue should instead be solved in the Icinga puppetization, in T362137. Closing as declined.
Thanks a lot for the detailed plan outline. The plan looks sane to me; I agree that the in-place migration is probably the least risky path.
Just one nit: we need to give plenty of advance notice to avoid long-running scripts that might touch conftool, such as long-running cookbooks and long-running DBA scripts that call dbctl at random times.
Mon, Apr 8
Too much time has passed since then and it doesn't seem to happen anymore.
Since the last update we've removed the Netbox CSV dumps altogether. Resolving.
The host is alerting in Icinga, should it be downtimed?
Thu, Apr 4
Nice find!
Wed, Apr 3
Wow, that was quite an investigation for a /test key, thanks for the thorough analysis. As for the test2 value, that could have been me when deploying the spicerack locks. For example, I have this in my bash history from the now defunct cumin1001:
sudo etcdctl -C https://conf1008.eqiad.wmnet:4001 set /test/volans '{"test": "value"}'
which, although not the same, might have failed if /test was a key and not a directory (as it appears to be), and I might have retried it with different values.
This just to say that I think it's safe to remove /test.
Tue, Apr 2
I've sent a proposed implementation in the patch above.
@bking I'm not sure what you mean. As mentioned earlier in T345337#9658807, Debian has v5.8.1 for python3-elasticsearch-curator and that's the current version used in production. The dependency in setup.py is defined as elasticsearch-curator~=5.0.
This is a duplicate of T355422#9664626
Fri, Mar 29
Many things have changed since December on the Puppet7 migration and I don't think we're seeing the same issue anymore. Tentatively resolving it; feel free to re-open if it happens again.
I've looked at the logs and the code; some clarifications/questions/comments:
- Because the cookbook was prompting the user, it means it was already stopped, waiting for user input. If no answer had been entered the cookbook would have stayed there doing nothing, allowing the operator to investigate the situation.
- The abort there interrupts the execution of that command by raising an exception; what a cookbook does after that is outside the scope of the confirmation prompt. In particular, that confirmation was about whether or not to commit the interface changes on the switch.
- Was the decommission cookbook re-run on the aborted host (elastic2050) after the incident was resolved to ensure all the decom steps were performed?
- The current implementation of the decommission cookbook executes the decom on all selected hosts, catching any exception from each single-host run and reporting them at the end. It could of course be changed, for example to prompt the user for confirmation on whether to continue or not on error (see the sketch after this list).
- The current implementation doesn't catch a Ctrl+c, so that would have interrupted the cookbook execution altogether.
- The decommission cookbook performs destructive actions and as such already has various warnings and prompts for the user to make sure it is not run on the wrong hosts. The incident report (as of now) doesn't clarify why the cookbook was run on the wrong hosts and what could have prevented it.
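For the continue-on-error point above, the change could be roughly along these lines (just a sketch, not the actual cookbook code; decommission_host() is a hypothetical placeholder for the single-host decom logic):
from wmflib.interactive import AbortError, ask_confirmation

def decommission_all(hosts, decommission_host):
    """Run the decom on every host, prompting the user whether to continue after each failure."""
    failures = []
    for host in hosts:
        try:
            decommission_host(host)
        except Exception as exc:
            failures.append((host, exc))
            try:
                ask_confirmation(f"Decom of {host} failed with: {exc}. Continue with the remaining hosts?")
            except AbortError:
                break  # the operator chose to stop here
    return failures  # still reported at the end, as today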
Yeah, it's clearly a race condition that could be solved in both places (cookbook or spicerack), no strong opinion.
The problem is that, from the current code, it seems puppet doesn't return a proper exit code that would help understand the problem, and parsing the output is brittle.
We could add an @retry with a few attempts, or check the puppet lock file on error; a rough sketch of the former below.
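Something like this (minimal sketch assuming wmflib's retry decorator; the number of attempts, the delay and the wrapped call are illustrative placeholders):
from datetime import timedelta

from wmflib.decorators import retry

@retry(tries=3, delay=timedelta(seconds=15), backoff_mode='constant', exceptions=(RuntimeError,))
def regenerate_cert(host):
    """Hypothetical wrapper: retried a few times in case puppet is still running and holding the lock."""
    run_cert_regeneration(host)  # placeholder for the actual regeneration step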
For context, the certificate regeneration is seldom run in production; is it run more frequently in WMCS?
Mar 28 2024
Sorry, ignore my previous comments, there was some misunderstanding:
Yes, but to which endpoint is it trying to connect? Please try to use puppetdb-api.discovery.wmnet:8090 and let us know if that works or not (that's a proxy that allows only some queries and not others, so it might need tweaking based on which queries naggen does).
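A quick way to test it from Python could be something like this (just a sketch: it assumes the proxy exposes the standard PuppetDB v4 query API over HTTPS on that port, and the example query is arbitrary):
import requests

query = {"query": ["from", "nodes"]}  # arbitrary example query, adjust to what naggen actually sends
resp = requests.post("https://puppetdb-api.discovery.wmnet:8090/pdb/query/v4", json=query, timeout=10)
resp.raise_for_status()
print(len(resp.json()), "nodes returned")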
Mar 27 2024
The premise seems to mix different things. PuppetDB is a totally separate service from the PuppetMaster/PuppetServer ones and runs on its own hosts. Are you saying that naggen fails to connect to PuppetDB?
elastic2088 is unreachable and reported as missing from PuppetDB by the Netbox report. No host should be powered on with puppet disabled or not working for a long period of time. Please either reimage it now, or shut it down and reimage it at a later stage (before powering it back on).
elastic2037 is reported by Netbox as no longer being in PuppetDB; please either decommission it or shut it down. No host should be powered on without puppet running for an extended period of time.
What's the status of db2202? It has had puppet disabled for 22 days! Puppet should never be disabled for long periods, and now it's gone from puppetdb/monitoring/everything; it's a ghost host reported only by a Netbox report and it's also spamming root@ daily due to an expired cert for the debmonitor client.
Mar 26 2024
@BTullis indeed, that's another new device type created with the wrong slug. I've updated the slug in Netbox to fix it.
Mar 25 2024
Will you also take care of the Debian packaging for it and any required dependencies?
Because spicerack is deployed with Debian packages, and upstream Debian has 5.8.1 as the most recent release (not sure if that has anything to do with the licensing) and 7.17.6 for python3-elasticsearch.
Yes, correct.
Mar 21 2024
Mar 18 2024
We do have get_certificate_metadata() that raises spicerack.puppet.PuppetServerCheckError if the cert is not found (as opposed to other errors).
What I was suggesting is that we could do that check directly in destroy() in the puppetserver class so that it behaves the same as the old puppetmaster one; a rough sketch below.
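Something along these lines (rough sketch only, reusing the names mentioned above; the actual spicerack.puppet signatures and the cleanup call are assumptions):
from spicerack.puppet import PuppetServerCheckError

def destroy(self, hostname):
    """Clean the host certificate, treating a missing cert as a no-op like the old puppetmaster code."""
    try:
        self.get_certificate_metadata(hostname)
    except PuppetServerCheckError:
        return  # no certificate for this host: nothing to clean
    self._run_ca_clean(hostname)  # placeholder for the actual 'puppetserver ca clean' call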
That's indeed the current behaviour and clearly an error, thanks for reporting it!
The exit codes of the puppetserver ca clean command are not documented in Puppet, or at least I couldn't find them in the public docs/manpage/help messages/source code.
Ideally puppetserver should report two different sets of errors: one where there is a certificate but it failed to perform some cleaning operations, and one where the certificate does not exist at all, but that doesn't seem to be the case.
Given that it doesn't, I think we shouldn't rely on specific CLI output messages and exit codes, as that could hide other errors now or in the future.
Mar 7 2024
But is this task still valid? The alert hosts were migrated to bookworm this week and puppet is running fine there.
Mar 6 2024
Mar 5 2024
Got the list of affected hosts with nodeset -S '","' -e "es20[35-40]" on a cumin host, then I ran the following code in the Netbox shell:
>>> import uuid
>>> request_id = uuid.uuid4()
>>> user = User.objects.get(username='volans')
>>> def update(d):
...     ip = d.primary_ip6
...     ip.dns_name = ""
...     ip.save()
...     log = ip.to_objectchange('update')
...     log.request_id = request_id
...     log.user = user
...     log.save()
...
>>> devices = Device.objects.filter(name__in=["es2035","es2036","es2037","es2038","es2039","es2040"])
>>> len(devices)
6
>>> [d.name for d in devices]
['es2035', 'es2036', 'es2037', 'es2038', 'es2039', 'es2040']
>>> for device in devices:
...     update(device)
...
Re-opening as AAAA records were erroneously added to the hosts (AAAA records:N). I'll remove them programmatically.
Mar 4 2024
Sounds good to me, let me know once done so that I can make the related changes to the report to include those too.
@wiki_willy yes, if we go that way then I guess a separate tab on the accounting sheet with both asset tags (chassis and motherboard), compiled only for the hosts that have had the motherboard replaced but the asset tag not reset, should give enough information for the report to be adapted to include them.
Mar 2 2024
FYI I've re-renamed PowerEdge R450 - Restbase-1G to PowerEdge R450 - ConfigRestbase-1G or we'd have issues in the firmware upgrades as outlined in T348036.
Mar 1 2024
@RLazarus please also update requestctl-generator when you do it.
I've added an item to the task description. At the moment, because Superset is in the process of being migrated to k8s, the requestctl-generator file lives in two different places and needs to be modified in both; the current prod one will disappear soon though.
Feb 29 2024
I don't think there is a clean solution if the iDRAC doesn't allow overriding the value on the motherboard when the replacement is done outside of warranty.
We could check if there is a way on the host to get both values and decide which one we want to export.
But if there isn't, then we'll need to keep both old and new values in at least one place to make sure we can cross-check them. That place IMHO could be either the accounting spreadsheet or Netbox, and then the report will be modified accordingly.
There is no mapping; the reported device types simply don't follow the correct naming scheme, as you can see by comparing with the others here: https://netbox.wikimedia.org/dcim/device-types/?q=PowerEdge and as previously discussed in T348036
Of course I don't want to add additional lag to any live-traffic data (pybal, mwconfig, dbctl), and if we deem that adding spicerack locks to the replication might cause that, let's find another solution. For example we could have a failover etcd cookbook that, when run, reads the active locks from the primary cluster and manually replicates them on the secondary one explicitly (very rough sketch below). Or any other viable option.
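For the failover-cookbook option, assuming etcd v2 and python-etcd, it could boil down to something like this (the secondary endpoint and the /spicerack/locks prefix are hypothetical placeholders to verify):
import etcd

src = etcd.Client(host='conf1008.eqiad.wmnet', port=4001, protocol='https')  # primary cluster, as in the example above
dst = etcd.Client(host='SECONDARY-CLUSTER-HOST', port=4001, protocol='https')  # placeholder endpoint

locks = src.read('/spicerack/locks', recursive=True)  # hypothetical key prefix for the active locks
for node in locks.leaves:
    if not node.dir:  # copy only the actual lock keys, skip directories
        dst.write(node.key, node.value)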
Feb 28 2024
Do we need to change anything on the sre.dns.netbox cookbook?
It currently runs:
cd {git} && utils/deploy-check.py -g {netbox} --deploy
Even from itself? As in what happens if an operator runs authdns-update on a depooled host?
But a second instance wouldn't prevent the current issue, right?
That etcdmirror is mirroring only the /conftool keys is total news to me; I assumed it was replicating the whole content of etcd. But indeed it does not:
Feb 27 2024
I've checked all the devices with names starting in db and es and the only ones with IPv6 AAAA records are: dbprov1004 and dbprov2004
Cleanup completed; leaving the task open for DCOps to prevent this from happening again.