Wed, Jan 15
As per IRC chat it's ok as is, resolving.
Mon, Jan 13
I was made aware that the two comments above are contradictory. I don't recall the reasoning behind my comment above or any limitation of the two-certs approach. I agree they are separate services and should not depend on each other.
@faidon: I mainly opened this because it was the only DC without a rack group; even the network PoPs have one and use the raw name of the DC, not just '1', see https://netbox.wikimedia.org/dcim/rack-groups/
Fri, Jan 10
Wed, Jan 8
Thu, Jan 2
@Jclark-ctr by any chance do you have an ETA for this task? Just so I know and can plan something related accordingly.
Indeed, done :)
@ema maybe it could be related to NUMA utilization? Having a quick look at numastat (both -n and -m), there is a general imbalance between the two nodes (which I think is mostly on purpose due to our custom config), and the varnish process seems to be the main one responsible for it. But there was no spike in the graph either.
Tue, Dec 24
The issue with the DELETE has been fixed: I've successfully deleted the image docker-registry.wikimedia.org/python3-build-stretch:0.0.2 that was failing during the tests.
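For reference, a minimal sketch of how such a deletion can be exercised against a Docker Registry v2 API (illustrative only, not the exact test that was run; the registry URL and image name are taken from the comment above, and authentication is omitted):

```python
import requests

REGISTRY = "https://docker-registry.wikimedia.org"
IMAGE = "python3-build-stretch"
TAG = "0.0.2"

# The registry returns the manifest digest in the Docker-Content-Digest header
# when asked for the v2 manifest media type.
headers = {"Accept": "application/vnd.docker.distribution.manifest.v2+json"}
resp = requests.get(f"{REGISTRY}/v2/{IMAGE}/manifests/{TAG}", headers=headers)
resp.raise_for_status()
digest = resp.headers["Docker-Content-Digest"]

# Deletion is performed by digest, not by tag (auth omitted for brevity).
del_resp = requests.delete(f"{REGISTRY}/v2/{IMAGE}/manifests/{digest}")
print(del_resp.status_code)  # 202 Accepted on success
```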
Please ensure that the /upload endpoint still works as expected too.
Mon, Dec 23
Thanks, LGTM, feel free to proceed.
@crusnov thanks for the dry run, here are my comments:
Interesting; given that the new cookbook kills the hosts, that was unexpected, but the cookbook is very quick so I get why it happens.
My suggestion is to add a small sleep (with a log line to tell the user) before this line https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/hosts/decommission.py#167
Probably 10~30s should be enough to let any in-flight action complete before running the other actions.
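Something along these lines (a minimal sketch only, not tested against the actual cookbook; the constant name and the chosen value are made up):

```python
import logging
import time

logger = logging.getLogger(__name__)

# Grace period so that any in-flight action can complete before the
# decommission steps below kick in (value is illustrative, 10~30s range).
GRACE_SLEEP = 20

logger.info('Sleeping %d seconds to let any in-flight action complete', GRACE_SLEEP)
time.sleep(GRACE_SLEEP)
```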
Sat, Dec 21
Nothing in the host logs either. For the record, it crashed 7 minutes after cp3051 (see T241306) and both are part of the upload esams cluster.
The host crashed again today, nothing in racadm, checked both getsel and lclog view.
Nothing in racadm, checked both getsel and lclog view. Nothing in syslog & co.
Fri, Dec 20
@Aklapper yes, as the host got reimaged I think the page was not updated, but I cannot edit it unfortunately.
Dec 17 2019
Dec 12 2019
Dec 11 2019
Dec 10 2019
That's great. The idea of the task was to link the specific dashboard that has the same data, while sometimes we use data that is not shown on Grafana at all, or we link a generic dashboard and not a specific graph.
I don't know the current state of all those links though, so I'll leave it to your best judgement.
Dec 9 2019
So far so good, leaving it open for another week or two to ensure the issue is totally fixed.
Currently open CRs towards the netbox-reports repo should be checked to see if they need to be resent towards the new repo:
The OOM issue has been fixed and for now memory, disk and CPU seem to be under control.
Resolving it for now; we can re-open if this turns out to be required anyway.
I understand that this might seem confusing, but it was decided from the start that debmonitor should not keep track of those, because the notion of a specific Debian release is quite arbitrary, depending on which APT repositories you set up on the host and which packages you install.
Another way of looking at it is that a package version in a Debian repository is not tied to a specific release: a specific release uses that version, but the versions are independent of it.
CC @MoritzMuehlenhoff FYI
Dec 8 2019
Re-opening as this has not yet been solved at the md software RAID layer; Icinga is still critical and /proc/mdstat still reports the degraded status above.
Dec 6 2019
I've noticed that Phabricator emails are failing the SPF check; re-opening to add details. Feel free to move it to a separate task if needed.
Dec 5 2019
For the first one, the downtime cookbook failed to run Puppet on the active Icinga host to get the definitions of the reimaged hosts to downtime. Given how slow Puppet is on the Icinga host, if there are multiple runs at the same time it can happen that we hit the timeout even with --attempts 30.
My suggestion for running parallel reimages is to open 2~3 tmux sessions, run sequential reimages in each, and let them start a few minutes apart from each other.
Dec 4 2019
Dec 3 2019
Dec 2 2019
I've updated the mgmt DNS name record in Netbox that was still reporting wezen. I also have a patch to clean up the wezen record from DNS; I'll push it later today.
I've updated the mgmt interfaces' DNS names in Netbox that were still reporting the old cloudvirtan* names.
Not sure if it can be considered in scope for this task as the title is pretty generic.
Another check we need is to ensure that the hostname part of some DNS names matches the device name (see the sketch below), in particular:
- mgmt IP address
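A rough sketch of the kind of check meant here, assuming we have a device name and the DNS name of its mgmt IP at hand (the function and example values are illustrative, not actual Netbox report code):

```python
def mgmt_dns_matches_device(device_name: str, mgmt_dns_name: str) -> bool:
    """Return True if the hostname part of the mgmt DNS name matches the device name."""
    # e.g. device 'mw2231' should have mgmt DNS name 'mw2231.mgmt.codfw.wmnet'
    hostname = mgmt_dns_name.split('.', 1)[0]
    return hostname == device_name


assert mgmt_dns_matches_device('mw2231', 'mw2231.mgmt.codfw.wmnet')
assert not mgmt_dns_matches_device('mw2231', 'graphite2002.mgmt.codfw.wmnet')
```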
Forgot to mention that https://netbox.wikimedia.org/ipam/ip-addresses/687/ still had the old name; I've updated it.
Dec 1 2019
FYI This still happens in buster too, the Debian bug is still open.
We have 88 hosts that don't have /var/log/wtmp.1, and they generated cronspam today.
Nov 30 2019
@Papaul given that we're setting the DNS name of the IP address in Netbox, that one needs to be updated too; see the links above:
IP: 10.193.2.251/16 Assignment: mw2231 (mgmt) DNS Name: graphite2002.mgmt.codfw.wmnet
IP: 10.193.1.118/16 Assignment: mw2231 old (mgmt) DNS Name: mw2231.mgmt.codfw.wmnet
My understanding is that 10.193.1.118 is the mgmt IP assigned to the new mw2231 (but please double check it). In that case we should attach Netbox IP 10.193.1.118/16 to mw2231's mgmt interface and delete 10.193.2.251/16 if it's no longer used.
It seems that the IP address in Netbox has not been updated and still reports graphite2002 in the DNS name, see
The Netbox status is currently Decommissioning; if the host has been unracked it should be Offline.
ms-be2013 and ms-be2014 are marked as Decommissioning in Netbox; if they have been unracked, their status should be changed to Offline.
Re-opening as the DNS names of the interfaces attached to those hosts have not been updated in Netbox.
IP address: 10.193.1.23/16 Parent: logstash2020 DNS name: kafka2001.mgmt.codfw.wmnet
Nov 28 2019
It might be another occurrence of T238305 (model matches)
@jbond on the CI instances you have 3.4, 3.5, 3.6 and 3.7 available, although the system one is 3.5. Faidon did the packaging a while ago, and if you look at the CI jobs of many repos, they run all the environments from tox.
It looks like not all the Puppet code was made ensure=>'absent'. We might have many more small things still lying around as a result.
I've also removed the crontab entries for wmf_auto_restart_uwsgi-netbox and prometheus-postgres-exporter.
The Postgres user and related crontab are still present on the hosts and triggered a failure in the backup because there is no longer a DB to back up.
I've just removed the crontab for now.
I was able to debug the issue using tracemalloc:
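For the record, the general tracemalloc pattern for this kind of debugging looks roughly like this (illustrative only, not the exact code run on the host):

```python
import tracemalloc

# Keep up to 25 frames per allocation traceback.
tracemalloc.start(25)

snapshot_before = tracemalloc.take_snapshot()
# ... let the suspect code run for a while ...
snapshot_after = tracemalloc.take_snapshot()

# Show where the memory growth comes from, grouped by source line.
for stat in snapshot_after.compare_to(snapshot_before, 'lineno')[:10]:
    print(stat)
```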
I'm doing a quick debug attempt on acmechief-test2001
Nov 27 2019
I think the best way is to have it easily integrated in some form into the local workflow in our dev envs, so that when you run the tests locally they pass, and when you commit, you already commit the formatted version. Then CI just ensures that the code is already formatted according to the tool.
Otherwise the resulting workflow would be annoying: make a patch, send it to Gerrit, get V-1 from Jenkins, check the output, either run the tool locally or fix the formatting issues manually, then send a second PS.
Nov 26 2019
Nov 25 2019
If needed, full list of R440 available here: https://puppetboard.wikimedia.org/fact/productname/PowerEdge+R440 (intentionally not mentioning their count here)
No problem for me with 1 cert; it seems a reasonable approach.
Nov 23 2019
Nov 22 2019
Nov 21 2019
It's hard to reply to the description; there is no 'quote task description' button AFAIK.
The above patch has not yet been merged.
Nov 20 2019
@fgiunchedi I think it's a fair request, but given that we're in the process of auto-generating all mgmt and then server DNS records, this might have less benefit than in the current situation. Would it be ok to treat it as lower priority?
I don't mind the additional check, but again, I'm not sure how much of it is in scope for this task. If we do the extensive check then we should define a policy first, which is not strictly defined yet AFAIK.
Nov 19 2019
@herron @fgiunchedi I don't think that much; I guess you'd have to do the triggering part. I'm not super clear on what you have in mind, a script to run from somewhere or something else. I'd be careful with an email alias, as it could be easily abused.
I actually think that's not enough if we want to enforce a policy, although it's not clear if that's the scope of this task.
I've reverted the above patch as it was reporting most servers as false positives in the new name-coherence report. The regex was wrong and unable to match our current hostnames.
Also, the goal of this check is not totally clear to me, as the task was opened for the asset tag names specifically but the check was expanded to all hosts.
For asset tag hostnames, for example, we should check that the hostname matches the asset tag of the same device and is lowercase; that would already be enough.
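A minimal sketch of that check, assuming we have the hostname and the device's asset tag at hand (the function and the example values are made up for illustration, not actual Netbox report code):

```python
def valid_asset_tag_hostname(hostname: str, asset_tag: str) -> bool:
    """Return True if the hostname is the lowercased asset tag of the device."""
    return hostname == asset_tag.lower()


# Example values are hypothetical.
assert valid_asset_tag_hostname('wmf1234', 'WMF1234')
assert not valid_asset_tag_hostname('wmf1234', 'WMF5678')
```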
The debmonitor test didn't test much as the debmonitor client sends the puppet client cert (not the CA) and it's the server that validates it with the CA.
@jbond just to be on the safe side and to verify the theory, if possible make a quick test that the new cert in the CR is able to verify existing puppet node certs and cergen certs.
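A quick way to run that test could be something like the following (the paths are placeholders, not the real locations of the CR's cert or of the node/cergen certificates):

```python
import subprocess

NEW_CA_BUNDLE = '/tmp/new-ca-bundle.pem'     # placeholder: the cert from the CR
CERTS_TO_CHECK = [
    '/tmp/some-puppet-node-cert.pem',        # placeholder: an existing Puppet node cert
    '/tmp/some-cergen-cert.pem',             # placeholder: an existing cergen cert
]

for cert in CERTS_TO_CHECK:
    # 'openssl verify' exits non-zero if the cert does not validate against the bundle.
    result = subprocess.run(
        ['openssl', 'verify', '-CAfile', NEW_CA_BUNDLE, cert],
        capture_output=True, text=True)
    print(cert, 'OK' if result.returncode == 0 else 'FAILED', result.stdout.strip())
```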
Nov 15 2019
A simplified version could be to use a cookbook to couple stuff:
Nov 14 2019
Nov 13 2019
As we discussed a while ago, the easiest solution is to pick another port for the public TLS server on the debmonitor servers, as port 443 is already taken by the internal clients that report their package lists to it, and it's used to perform authz/n with the client certificate.