Great! Thanks a lot!
Other random things that need to be updated sooner or later. I hope you don't mind if I drop them here; feel free to move them to a dedicated task.
@cmooney I've invited you and added you to the SRE team. Please follow the instructions at https://wikitech.wikimedia.org/wiki/VictorOps#Set_up_as_a_new_user
Version is 4.1.0, currently latest on apt.w.o and PyPI and deployed to all cumin hosts in production.
See worker.reporter and worker.progress_bars on https://doc.wikimedia.org/cumin/master/introduction.html#library for an example of how to implement it.
Cumin 4.1.0 has been released to PyPI, resolving. Feel free to re-open in case there is still any issue.
The patch that adds the feature has been merged and deployed, resolving. I don't have any host to decommission right now, but I tested the dry-run mode and manually ran the specific command that the cookbook would run, and it seems to work fine.
Feel free to reopen if you encounter any issue.
Tue, May 11
The above patch fixes the issue. For reference this is what's in apt cache for this specific package.
So, after some digging, the issue here is that the source package reported by debmonitor-client is wrong because it is taken from the 'candidate' package instead of the version actually being installed (the candidate is the installed one in almost all cases in our infra), which in this case was specified manually.
I'm sending a patch to fix the client so that it correctly detects the right source package.
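Not debmonitor's actual code, just a minimal stdlib sketch of the selection logic the fix aims for (all names here are made up): prefer the version actually being installed over apt's candidate when looking up the source package.

```python
def source_package(versions, requested=None, candidate=None):
    """Pick the source package for the version being installed.

    `versions` maps binary package versions to their source package.
    Illustrative only: the real client gets this data from apt.
    """
    # Use the manually requested version when given, falling back to
    # apt's candidate version otherwise.
    version = requested if requested is not None else candidate
    return versions[version]

versions = {"1.0-1": "foo", "1.1-1~bpo10+1": "foo-backports"}
```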
I think I know what's happening here; I'll double-check a couple of things and update the task accordingly later.
Applying the above patch as a quick hot-patch I was able to get a full stacktrace:
[2021-05-11T10:04:11] Unable to update host 'cloudgw2001-dev.codfw.wmnet'
Traceback (most recent call last):
  File "/srv/deployment/debmonitor/venv/lib/python3.7/site-packages/django/db/models/query.py", line 486, in get_or_create
    return self.get(**lookup), False
  File "/srv/deployment/debmonitor/venv/lib/python3.7/site-packages/django/db/models/query.py", line 399, in get
    self.model._meta.object_name
bin_packages.models.PackageVersion.DoesNotExist: PackageVersion matching query does not exist.
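For context, the `DoesNotExist` in the traceback comes from the lookup half of Django's `get_or_create()`, which roughly follows this pattern (a dict-backed sketch, not Django's code):

```python
def get_or_create(store, key, defaults):
    """Dict-backed sketch of Django's QuerySet.get_or_create():
    try the lookup first and, on a miss, create the entry."""
    try:
        return store[key], False
    except KeyError:
        store[key] = defaults
        return store[key], True

packages = {}
obj, created = get_or_create(packages, ("pkg", "1.0-1"), {"source": "pkg"})
```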
Mon, May 10
@Legoktm sure, why not. Is there a canonical place where the repository is checked out at the commit that is currently in production?
I've deleted all the old instances, including all the ones that had Puppet broken. Sorry for the delay.
Thu, May 6
@JMeybohm this is now all supported.
We have a sre.hosts.remove-downtime cookbook which, when run with --force, will ask the user whether to proceed with the hosts without verifying them against PuppetDB:
This is now possible via spicerack in any cookbook using:
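A sketch of the call shape a cookbook would use; the stub below only mimics the relevant API surface for illustration (the real object comes from spicerack, not from this class):

```python
class IcingaHostsStub:
    """Stand-in mimicking the relevant bit of spicerack's IcingaHosts."""

    def __init__(self, hostnames):
        self.downtimed = set(hostnames)

    def remove_downtime(self):
        # The real implementation sends the matching external command to
        # Icinga for each host; here we just clear the local set.
        self.downtimed.clear()

icinga_hosts = IcingaHostsStub(["foo1001.eqiad.wmnet"])
icinga_hosts.remove_downtime()
```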
Wed, May 5
Thanks for the task. While it's true that it could have already been removed from Icinga, that might happen for other reasons too, and I'm not sure hiding it completely is a great choice. We could look into improving the check so it distinguishes between a missing host and other failures.
Do you happen to know why that step failed in this specific case? And if it was because the host was no longer in Icinga, do you know why?
In the usual workflow a host to be decommissioned should be in Icinga.
If possible it would be great if we could get a base bullseye image somehow, even if it's not auto-updated right from the start and is created some other way.
We already have a couple of hosts on bullseye, and I need to add a python-build bullseye image to the registry to build the wheels of a Python application included in the role of one of those hosts, which of course requires a base bullseye image.
FYI Moritz has opened T281984 for the long term solution.
Mon, May 3
Thu, Apr 29
Also if you want to support bullseye it would be good to add 3.9 support to tox and setup.py too.
In setup.py there is a very specific version of prospector from 2018 ('prospector[with_everything]==126.96.36.199').
This, in conjunction with the fact that the same virtualenv is used for all tox envs of a given Python version, means that the dependency tree must be resolved in a way that satisfies all the constraints, and I think this in turn pulls in older versions of many of the tools, although I didn't dig in to prove it.
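For the 3.9 part, the change is usually just a matter of listing the new interpreter; something along these lines (the env names are illustrative, not copied from the repo):

```ini
# tox.ini (sketch)
[tox]
envlist = py{37,39}-{flake8,unit}

# setup.py would additionally gain the classifier:
#   'Programming Language :: Python :: 3.9'
```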
Tue, Apr 27
Mon, Apr 26
Thanks for the confirmation and sorry about the trouble @jbond
What is the current status of eventlog1003?
It's reported by a cumin check that ensures that all hosts matching the alias A:all are part of one of the datacenters, and eventlog1003 is not part of the alias for A:eqiad.
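The A:all check works off cumin's aliases file; a hypothetical fragment (alias names and queries invented for illustration) looks like:

```yaml
# aliases.yaml (illustrative)
eqiad: '*.eqiad.wmnet'
codfw: '*.codfw.wmnet'
all: 'A:eqiad or A:codfw'
```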
AFAICT it's in PuppetDB but not assigned to any role in site.pp. See also https://puppetboard.wikimedia.org/node/eventlog1003.eqiad.wmnet
Thu, Apr 22
Which version was it trying to install?
The latest version of debmonitor-client does this very check, see https://gerrit.wikimedia.org/r/c/operations/software/debmonitor/+/681319
Wed, Apr 21
Tue, Apr 20
@ayounsi do we already have a plan for how to manage the swap in Netbox? Should we discuss it?
Adding @jbond, who might have some insights about it.
My 2 cents: historically we've used 15 as a safe batch size that should not cause issues (see for example https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed ). But that was a while ago, and I hope we can now revisit that number and maybe bump it a bit.
That said, some Puppet catalogs are bigger than others (the Icinga server's, for example), so we should use a number conservative enough not to overload the Puppet servers even when used with a cluster that is more demanding from the catalog-compilation point of view.
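The batching itself is trivial; sketched in plain Python for illustration (cumin's batch-size option does the equivalent when running the hosts):

```python
def batches(hosts, size=15):
    """Split hosts into fixed-size batches; 15 has historically been a
    safe batch size for parallel Puppet runs in our setup."""
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]

groups = batches([f"host{n:04d}" for n in range(40)], size=15)
```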
Mon, Apr 19
Assigning to @jbond, who was working on this.
Thu, Apr 15
Anytime is ok for debmonitor.
@Marostegui yes, I didn't mention all the other efforts because they're a bit off topic, but firmware upgrades are also in scope and on our roadmap. And once we have that, it totally makes sense to integrate it into the reimage workflow so that we naturally upgrade firmware along with OS upgrades.
Wed, Apr 14
@LSobanski I'll try to give you some context from the SRE I/F team side of things. Any feedback will be greatly appreciated, and will also help set the team's priorities around those items.
Apr 13 2021
From some quick tests the redirect seems to work fine. Just as a note, and possibly intended: if I'm logged in and open grafana.w.o, it remains "logged out" there.
Apr 1 2021
Current situation for me, with what I think is a valid SSO session (if I go to idp.wikimedia.org it redirects me to https://idp.wikimedia.org/login and says "Log In Successful" without me entering any credentials).
AFAICT, at least on ms-be hosts, I don't see the swift processes listening on v6; shouldn't that be addressed too?
Mar 31 2021
Once fixed it can be tested on gerrit1001, which has the same issue (the IPv6 address not marked as primary_ip6).
It also updated the existing VIP with the DNS entry and changed the assigned object type:
Mar 29 2021
I think it's another issue with the Netbox API cache. Basically we get all the data at the start of the cookbook and then merge it in Python to speed up the operations. I can add a try/except there to gather the data again on failure.
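A minimal sketch of the proposed try/except, dict-backed for illustration (the real data would be re-gathered from the Netbox API):

```python
def lookup(cache, key, refetch):
    """Return the cached record for key, re-gathering the data once on a
    miss instead of failing outright."""
    try:
        return cache[key]
    except KeyError:
        cache.update(refetch())  # cache went stale: fetch fresh data
        return cache[key]

cache = {"vip1": {"type": "A"}}
fresh = {"vip1": {"type": "A"}, "vip2": {"type": "AAAA"}}
record = lookup(cache, "vip2", lambda: fresh)
```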
Mar 25 2021
Mar 24 2021
@JMeybohm with the above patch, once merged and deployed, you'll be able to use icinga_hosts(["foo.bar.baz", ...], verbatim_hosts=True) to get an IcingaHosts instance that will not mangle the hostnames you give it, which at that point must be valid Icinga host definitions, but not necessarily hostnames.
Let me play devil's advocate here; I hope I haven't misunderstood your intentions, correct me if I'm wrong.
Mar 23 2021
Mar 22 2021
While the ownership of mailman within SRE is being discussed, I've raised this request in today's SRE meeting and it got approved, so that @Ladsgroup is unblocked.
I would guess that the addition of ladsgroup to the mailman3-roots group was implicitly approved too during the last SRE meeting that approved the creation of the group, but for audit-log purposes it's better to have the approval here on the task too.
Given that the group doesn't have an explicit approval list of people, I'm asking @mark or @faidon to approve the request here on the task.
Mar 19 2021
@CGlenn I've added you to the mobile domain too (am.m.wikipedia.org); I consider the approval valid for the whole "language". Resolving; feel free to reopen if there is any issue.
Patch merged, added user to the wmf group.
Mar 18 2021
Doh, I think we have a naming clash here :)
@Ottomata is there anything to be done on the analytics side to sync the user for the intended usage?
And feel free to resolve this task once it's all working as expected.
@cmassaro Kerberos activated and the patch with your access merged. Please follow https://wikitech.wikimedia.org/wiki/Production_access#Setting_up_your_access to set up your SSH configuration file and test your access to a bastion first, and then to one of the hosts you want to access. FYI it can take up to 30 minutes from merge time for the change to propagate to all bastions and allow you to SSH.
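For reference, the wikitech page boils down to an SSH config along these lines (host names below are placeholders for illustration, not our actual bastions; always follow the wiki for the real values):

```
# ~/.ssh/config (illustrative placeholders)
Host bastion.example.wmnet
    User your-shell-username

Host *.eqiad.wmnet
    User your-shell-username
    ProxyJump bastion.example.wmnet
```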
Created kerberos principal: