This is a follow-up to the 2024-04-17 mw-on-k8s eqiad outage, whose root cause was the number of DNS resolution requests that MediaWiki pods sent to CoreDNS in order to resolve mw-mcrouter's location, mcrouter-main.mw-mcrouter.svc.cluster.local.
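To mitigate the issue, we have added a trailing dot to the hostname to speed up the FQDN resolution, with good results on codfw. With the trailing dot the name is treated as fully qualified, so the resolver does not walk the search domains first.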
CoreDNS rps:
- Green: rps with mw-on-k8s using mcrouter-main.mw-mcrouter.svc.cluster.local
- At ~10:50 UTC we switched to mcrouter-main.mw-mcrouter.svc.cluster.local. (with the trailing dot)
- Yellow: baseline rps for CoreDNS
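The effect of the trailing dot can be illustrated with a minimal sketch of how a glibc-style stub resolver expands a name. The search list, the `ndots` value and the `mw-web` namespace below are assumptions modelled on common Kubernetes defaults, not values taken from our pods, and `candidate_queries` is a hypothetical helper.

```python
def candidate_queries(name, search_domains, ndots):
    """Return the names a stub resolver would try, in order (glibc-style rules)."""
    if name.endswith("."):
        # Trailing dot: the name is already fully qualified, so it is tried as-is.
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search_domains]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search domains.
        return [absolute] + expanded
    # Too few dots: every search domain is tried before the name itself.
    return expanded + [absolute]


# Assumed pod resolver settings; "mw-web" is a hypothetical namespace.
SEARCH = ["mw-web.svc.cluster.local", "svc.cluster.local", "cluster.local"]
NDOTS = 5

HOST = "mcrouter-main.mw-mcrouter.svc.cluster.local"

without_dot = candidate_queries(HOST, SEARCH, NDOTS)
with_dot = candidate_queries(HOST + ".", SEARCH, NDOTS)

print(len(without_dot), "candidate names without the trailing dot")
print(len(with_dot), "candidate name with the trailing dot")
```

Under these assumptions the hostname has only four dots, so without the trailing dot the search-domain candidates are queried (and fail) before the real name is tried. With the trailing dot a single name is queried.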
While this looks alright for now, we are unsure how it will hold up in times of high traffic or during deployments. For that reason, we would like to cache the IP address that mcrouter-main.mw-mcrouter.svc.cluster.local resolves to in APCu, with a TTL of 1s.
Using an environment variable to define mcrouter's location was first introduced a while back in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/973838
Notes:
- this task covers changes only when running under php-fpm, not the CLI
- with the use of mcrouter-main.mw-mcrouter.svc.cluster.local, Kubernetes knows where to route a request, in this case to the node-local mw-mcrouter pod.
- the reason we are asking to have this information stored in APCu is that fetching from APCu is faster than making a DNS request
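The caching idea described above can be sketched as follows. This is an illustration in Python rather than the actual mediawiki-config change: `TtlCache` is a hypothetical in-process stand-in for APCu, and `resolve_cached` and the cache key are made-up names. In PHP the same pattern would presumably be built on `apcu_fetch()`, `apcu_store()` and `gethostbyname()`.

```python
import socket
import time


class TtlCache:
    """Tiny in-process TTL cache, standing in for APCu in this sketch."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (expiry time, value)

    def fetch(self, key):
        entry = self._entries.get(key)
        if entry is None or entry[0] <= self._clock():
            return None  # missing or expired
        return entry[1]

    def store(self, key, value, ttl):
        self._entries[key] = (self._clock() + ttl, value)


def resolve_cached(host, cache, resolver=socket.gethostbyname, ttl=1.0):
    """Return the IP for host, asking DNS only when the cached entry has expired."""
    key = "mcrouter-ip:" + host  # hypothetical cache key
    ip = cache.fetch(key)
    if ip is None:
        ip = resolver(host)
        cache.store(key, ip, ttl)
    return ip


# Demo with a fake resolver and a fake clock, so no real DNS traffic is generated.
now = [0.0]
lookups = []


def fake_resolver(host):
    lookups.append(host)
    return "10.0.0.1"  # placeholder address


cache = TtlCache(clock=lambda: now[0])
HOST = "mcrouter-main.mw-mcrouter.svc.cluster.local."
for _ in range(1000):
    resolve_cached(HOST, cache, fake_resolver)
now[0] = 1.5  # advance past the 1s TTL
resolve_cached(HOST, cache, fake_resolver)
print(len(lookups), "DNS lookups for 1001 calls")
```

With a 1s TTL, a thousand calls within the same second trigger a single lookup, and the next call after expiry triggers one more. Because APCu is shared between the php-fpm workers of a pod, the expected effect is roughly one DNS request per pod per second instead of one per MediaWiki request, although concurrent workers hitting an expired entry may still resolve in parallel.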