Ready for decom @RobH
We no longer have separate cassandra metrics hosts since moving to Prometheus.
This is completed, modulo ms-be2047 being diagnosed in T209921
Looks like the 504s started on Dec 3rd ~12:00
Thanks @Gilles for kickstarting this! For context these are the notes I took when we did the first round of cleanup a couple of years back: https://wikitech.wikimedia.org/wiki/Swift/Thumbnails_Cleanup
Given that the other hosts in this batch are fine and we've replaced the parts Dell wanted to replace, what's the next step?
@RobH looks like only ms-be1050 of these hosts is accessible from cumin atm? Ditto for logging in as my user via ssh
Tue, Dec 11
"fixed" for now by manually installing python-dnspython, following up on T209136 for a proper fix
Turns out depool-restbase isn't successful:
Mon, Dec 10
Resolving, we're onto new graphite hardware now with more resources.
Unassigning as I'm not going to work on this
I guess we should change puppet to create configs too and get rid of the placeholder
I recommend sending cronjob output to logstash (as well as files?). When cronjobs are logging to syslog you can opt in via ./modules/profile/files/rsyslog/lookup_table_output.json
Sat, Dec 8
Fri, Dec 7
Indeed, FWIW I tend to treat restbase and cassandra separately, so this will be done as soon as the cassandra reshape (T210843) is done.
Wed, Dec 5
I took a quick look at this as well and indeed openlog() seems the simplest way. Also, altering programname in rsyslog isn't allowed, so to fix this on the rsyslog side we'd have to use e.g. a different template, which isn't really worth it IMO. Plus the bug is supposedly fixed in php 7.3 anyway.
Tue, Dec 4
I've looked briefly at how to implement prefixing syslog json messages with @cee: and I'd say we could do it on the "syslog side" i.e. ./includes/debug/logger/monolog/SyslogHandler.php or "logstash side" i.e. ./includes/debug/logger/monolog/LogstashFormatter.php. I don't have strong opinions on either really!
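For illustration only, a minimal Python sketch of what the @cee: prefixing amounts to. The real change would be PHP in one of the two files above; the helper names to_cee and from_cee are hypothetical, and I'm assuming the receiving side just checks for the cookie and parses the rest as JSON.

```python
import json

# Cookie that marks the rest of the syslog message as structured JSON.
CEE_COOKIE = "@cee:"


def to_cee(record):
    """Serialize a dict as single-line JSON prefixed with the @cee: cookie."""
    payload = json.dumps(record, separators=(",", ":"), sort_keys=True)
    return CEE_COOKIE + " " + payload


def from_cee(line):
    """Return the parsed JSON payload, or None when the cookie is absent."""
    if not line.startswith(CEE_COOKIE):
        return None
    return json.loads(line[len(CEE_COOKIE):].lstrip())
```

Either place works for this: the formatter would emit the cookie as part of the formatted string, the handler would prepend it just before writing to syslog.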
Also please rack these systems across different rows, any combination of rows will do. The rest of the task LGTM
This is completed!
Looks like this is working as intended for systemd provider (/usr/lib/ruby/vendor_ruby/puppet/provider/service/systemd.rb)
@Papaul names replaced! thanks
LGTM on my side too, I've reenabled the event handler.
This is back, any chance for reseating or swapping memory @Papaul ?
Mon, Dec 3
All hosts had their first puppet run done, and restbase2013 is bootstrapping cassandra instances. On the remaining hosts I had to chmod a-x /usr/sbin/cassandra due to T211027: puppet (systemd::service) attempts to start manually masked units. We'll need to restore that one host at a time when bootstrapping time comes.
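For the record, a rough Python sketch of the chmod a-x workaround and its later restore. This was done by hand with chmod; the helper names below are hypothetical and the sketch assumes a POSIX host.

```python
import os
import stat

# All three execute bits (user, group, other), i.e. what "a-x" clears.
EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def remove_exec_bits(path):
    """Clear the execute bits on path and return the previous mode."""
    old_mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, old_mode & ~EXEC_BITS)
    return old_mode


def restore_mode(path, mode):
    """Put back a mode previously returned by remove_exec_bits."""
    os.chmod(path, mode)
```

Keeping the old mode around means the restore step doesn't have to guess what the permissions were before.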
tentatively resolving, graphite 0.9.15 is on labmon1001 (jessie) while production runs graphite 1.x on stretch
I can indeed reproduce the problem when fetching e.g. https://upload.wikimedia.org/wikipedia/commons/8/8e/Sunset_Toronto_Skyline_Panorama_from_Snake_Island.jpg
Is cleaning up swift global-math-render.* containers in scope for this? afaik with mathoid now these containers shouldn't be used anymore?
It does! +1
This is completed, thanks @Papaul and all involved.
Fri, Nov 30
I'll be preparing these hosts for cassandra to be bootstrapped there
Thanks @CDanis for looking into this! re: max() I have a hunch it might be due to having two prometheus servers backing the prometheus.svc endpoint in eqiad and codfw. To test this theory I tried looking for temperatures e.g. in esams. However, with esams selected and e.g. cp3007 selected I'm not seeing any temperatures at all.
Thu, Nov 29
Note we've been here before in T172921: Nrpe command_timeout and "Service Check Timed Out" errors and sadly the command check timeout can be changed only globally on the icinga side, not per-service.
Yes this has been fixed by me a few hours ago! I was doing tests on that VM and disabled puppet, resolving.