Failing over graphite in T85909 and T157022 was neither fun nor fast. Doing it via DNS has several disadvantages, mostly because clients are long-running and won't pick up DNS changes by themselves. I'm excluding service-runner based services, since DNS caching there is fixed by the latest service-runner version (T158338).
I'm listing here the services that needed manual restarts due to caching DNS records forever:
- zuul (contint1001). Done: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474128/
- jmxtrans (on old kafka analytics cluster)
- navtiming (webperf)
- eventbus
- eventlogging
- nodepool (labnodepool1001). Service is being removed: T209361
- statsv
- mwerrors (eventlog1001)
- parsoid (wtp* / ruthenium)
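
For illustration, here is a minimal sketch of how a long-running client could avoid caching a DNS record forever: it re-resolves the hostname once a configurable interval has elapsed instead of resolving only at startup. This is not how any of the services above are implemented. The class name `ReResolvingUDPSender`, the `ttl` parameter, and the hostname and port in the usage note are hypothetical, and UDP is assumed only to keep the example short.

```python
import socket
import time


class ReResolvingUDPSender:
    """Hypothetical UDP metrics sender that re-resolves its target periodically."""

    def __init__(self, host, port, ttl=60.0,
                 resolver=socket.getaddrinfo, clock=time.monotonic):
        self.host = host
        self.port = port
        self.ttl = ttl              # seconds a resolved address is trusted
        self._resolver = resolver   # injectable so the logic can be tested
        self._clock = clock
        self._family = None
        self._addr = None
        self._resolved_at = None
        self._sock = None
        self._sock_family = None

    def current_address(self):
        """Return the target address, resolving again once `ttl` has expired."""
        now = self._clock()
        if self._addr is None or now - self._resolved_at >= self.ttl:
            infos = self._resolver(self.host, self.port, type=socket.SOCK_DGRAM)
            # Use the first answer; a real client might try each one in turn.
            self._family, _type, _proto, _canon, self._addr = infos[0]
            self._resolved_at = now
        return self._addr

    def send(self, payload):
        """Send one datagram to whatever the hostname currently resolves to."""
        addr = self.current_address()
        # Recreate the socket if the address family changed (e.g. IPv4 to IPv6).
        if self._sock is None or self._sock_family != self._family:
            if self._sock is not None:
                self._sock.close()
            self._sock = socket.socket(self._family, socket.SOCK_DGRAM)
            self._sock_family = self._family
        self._sock.sendto(payload, addr)
```

Usage would look like `ReResolvingUDPSender("graphite.example", 2003, ttl=30).send(b"foo.bar 1 1543000000\n")`, where the hostname and port are placeholders. With this pattern a DNS-based failover is picked up within `ttl` seconds and no manual restart is needed. Clients holding long-lived TCP connections would additionally have to reconnect for a new address to take effect.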