Failing over graphite in T85909 and T157022 was neither fun nor fast. Doing it via DNS has several disadvantages, mostly because clients are long-running and won't pick up DNS changes by themselves. I'm excluding service-runner based services, since DNS caching there is fixed by the latest service-runner version (T158338).
I'm listing here the services that needed manual restarts due to caching DNS records forever:
* zuul (contint1001) | restarting it flushes the queue; check that https://integration.wikimedia.org/zuul/ is not too busy before restarting
* jmxtrans (on old kafka analytics cluster)
* navtiming (webperf)
* eventbus
* eventlogging
* ~~nodepool (labnodepool1001)~~ Service is being removed T209361
* ~~statsv~~
* ~~mwerrors (eventlog1001)~~
* ~~parsoid (wtp* / ruthenium)~~
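
For illustration, here is a minimal Python sketch of what "not caching DNS forever" could look like in a long-running client that sends metrics over UDP. It is not taken from any of the services listed above. The class name `ReResolvingUDPSender`, the `ttl_seconds` parameter, and the hostname in the usage note are all hypothetical. The idea is simply to remember when the name was last resolved and to look it up again once that answer is older than a chosen interval.

```python
import socket
import time


class ReResolvingUDPSender:
    """Hypothetical UDP sender that re-resolves its target hostname.

    The resolved address is reused for at most ttl_seconds, so a DNS
    change is picked up without restarting the process.
    """

    def __init__(self, host, port, ttl_seconds=60.0,
                 resolver=socket.getaddrinfo, clock=time.monotonic):
        self.host = host
        self.port = port
        self.ttl_seconds = ttl_seconds
        self._resolver = resolver   # injectable for testing
        self._clock = clock         # injectable for testing
        self._target = None         # (family, sockaddr) from the last lookup
        self._resolved_at = None    # clock value at the last successful lookup

    def target(self):
        """Return (family, sockaddr), looking the name up again once stale."""
        now = self._clock()
        stale = (self._target is None
                 or now - self._resolved_at >= self.ttl_seconds)
        if stale:
            try:
                infos = self._resolver(self.host, self.port,
                                       type=socket.SOCK_DGRAM)
                family, _type, _proto, _canon, sockaddr = infos[0]
                self._target = (family, sockaddr)
                self._resolved_at = now
            except OSError:
                # Keep using the previous answer if the lookup fails;
                # re-raise only when there is nothing to fall back on.
                if self._target is None:
                    raise
        return self._target

    def send(self, payload):
        """Send one datagram (bytes) to the currently resolved address."""
        family, sockaddr = self.target()
        with socket.socket(family, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload, sockaddr)
```

A caller would do something like `ReResolvingUDPSender("metrics.example.org", 8125).send(b"foo:1|c")`, where the host and port are placeholders. Opening a socket per datagram keeps the sketch short; a real client would more likely keep one socket per address family. The interval here is an arbitrary fixed value and is not derived from the record's actual TTL, which `getaddrinfo` does not expose.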