Change Details

Failing over graphite in T85909 and T157022 was neither fun nor fast. Doing it via DNS has several disadvantages, mostly because clients are long-running and won't pick up DNS changes by themselves.

I'm excluding service-runner based services, since DNS caching there is fixed by the latest service-runner version (T158338).

I'm listing here the services that needed manual restarts due to caching DNS records forever:

* ~~zuul (contint1001) | restarting it flushes the queue, check that https://integration.wikimedia.org/zuul/ is not too busy before restarting~~ Done https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474128/
* jmxtrans (on old kafka analytics cluster)
* navtiming (webperf)
* eventbus
* eventlogging
* ~~nodepool (labnodepool1001)~~ Service is being removed T209361
* ~~statsv~~
* ~~mwerrors (eventlog1001)~~
* ~~parsoid (wtp* / ruthenium)~~
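
For illustration only, here is a minimal sketch of how a long-running client could avoid caching a DNS record forever: it keeps the resolved address for a bounded time and looks the name up again afterwards. This is not how any of the services listed above is implemented. The class name `RefreshingResolver`, the helper `send_metric`, the 60 second TTL, the use of UDP and the hostname in the usage note are all assumptions made up for the example.

```python
import socket
import time


class RefreshingResolver:
    """Cache a hostname lookup for at most `ttl` seconds, then resolve again.

    Hypothetical helper: `lookup` and `clock` are injectable so the
    behaviour can be exercised without real DNS or real waiting.
    """

    def __init__(self, host, port, ttl=60.0,
                 lookup=socket.getaddrinfo, clock=time.monotonic):
        self.host = host
        self.port = port
        self.ttl = ttl
        self._lookup = lookup
        self._clock = clock
        self._cached = None      # (address family, sockaddr) of the last lookup
        self._expires = 0.0      # clock value after which we must re-resolve

    def address(self):
        """Return (family, sockaddr), re-resolving once the cache has expired."""
        now = self._clock()
        if self._cached is None or now >= self._expires:
            infos = self._lookup(self.host, self.port, type=socket.SOCK_DGRAM)
            family, _type, _proto, _canonname, sockaddr = infos[0]
            self._cached = (family, sockaddr)
            self._expires = now + self.ttl
        return self._cached


def send_metric(resolver, line):
    """Send one metric line over UDP to whatever the name resolves to right now."""
    family, sockaddr = resolver.address()
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), sockaddr)
```

A client would create one `RefreshingResolver("graphite.example", 2003)` at startup and call `send_metric(resolver, "some.metric 1 1500000000\n")` for each datapoint. With this pattern a DNS-based failover is picked up within roughly one TTL (plus whatever the local resolver itself caches) instead of requiring a restart.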