This is a follow-up to the [[ https://wikitech.wikimedia.org/wiki/Incidents/2024-04-17_mw-on-k8s_eqiad_outage | 2024-04-17 mw-on-k8s eqiad outage ]], where the root cause was the number of DNS resolution requests that MediaWiki pods sent to CoreDNS in order to resolve mw-mcrouter's location, `mcrouter-main.mw-mcrouter.svc.cluster.local`.
To mitigate the issue, we added a [[ https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1020768/2/helmfile.d/services/_mediawiki-common_/global.yaml | trailing dot ]] to the hostname to speed up FQDN resolution, with good results on codfw.
CoreDNS requests per second (rps):
* Green: rps with mw-on-k8s using `mcrouter-main.mw-mcrouter.svc.cluster.local`
* At ~10:50 UTC we switched to `mcrouter-main.mw-mcrouter.svc.cluster.local.`
* Yellow: baseline rps for CoreDNS
{F48302605}
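For context on why the trailing dot reduces CoreDNS load, here is a minimal sketch of how a stub resolver expands a hostname using its search list. It assumes the common Kubernetes pod default of `ndots:5` and a three-entry search list; the namespace `example-ns` and the function `candidate_queries` are hypothetical, and the real pod `resolv.conf` may differ.

```python
# Sketch of resolv.conf-style search list expansion (assumed settings, not
# read from a real pod). A name ending in "." is absolute and is queried as-is.

# Assumed search list; "example-ns" is a placeholder namespace.
SEARCH_DOMAINS = [
    "example-ns.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
]


def candidate_queries(name, search_domains, ndots=5):
    """Return the names a resolver may try, in order, until one resolves."""
    if name.endswith("."):
        # Absolute name: the search list is skipped entirely.
        return [name]
    expanded = [f"{name}.{domain}." for domain in search_domains]
    absolute = name + "."
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search list.
        return [absolute] + expanded
    # Fewer dots than ndots: the search list is tried first.
    return expanded + [absolute]


if __name__ == "__main__":
    for host in (
        "mcrouter-main.mw-mcrouter.svc.cluster.local",
        "mcrouter-main.mw-mcrouter.svc.cluster.local.",
    ):
        print(host)
        for query in candidate_queries(host, SEARCH_DOMAINS):
            print("   ", query)
```

Under these assumptions, the name without the trailing dot has four dots, which is below `ndots`, so the resolver may send up to four lookups before the correct one succeeds. The name with the trailing dot yields a single lookup.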
While this looks alright for now, we are unsure how it will hold up under high traffic or during deployments. For that reason, we would like to cache in APCu the IP to which `mcrouter-main.mw-mcrouter.svc.cluster.local` resolves, with a TTL of 1s.
Using an environment variable to define mcrouter's location was first introduced a while back in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/973838
**Notes:**
* This task covers changes only when running under php-fpm, not the CLI.
* By using `mcrouter-main.mw-mcrouter.svc.cluster.local`, Kubernetes knows where to route a request: in this case, to the node-local mw-mcrouter pod.
* The reason we are asking to store this information in APCu is that fetching from APCu is faster than making a DNS request.