Cache mw-mcrouter service ClusterIP in apcu cache
Open, High, Public

Description

This task follows up on the 2024-04-17 mw-on-k8s eqiad outage, whose root cause was the number of DNS resolution requests that MediaWiki pods sent to CoreDNS in order to resolve mw-mcrouter's location, mcrouter-main.mw-mcrouter.svc.cluster.local

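To mitigate the issue, we have added a trailing dot to the name to speed up FQDN resolution (a trailing dot marks the name as fully qualified, so the resolver does not try the search domains first), with good results on codfw.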
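The effect of the trailing dot can be illustrated with a minimal sketch of how a stub resolver picks the names it queries. This is not taken from the task: the search list, the `ndots` value of 5 (the usual Kubernetes default for pods), and the `mw-web` namespace are assumptions for illustration, and `candidate_queries` is a hypothetical helper, not a real resolver API.

```python
# Sketch of stub-resolver name expansion (resolv.conf "search" + "ndots").
# Assumptions: ndots=5 and a Kubernetes-style search list; the "mw-web"
# namespace is hypothetical.

SEARCH = ["mw-web.svc.cluster.local", "svc.cluster.local", "cluster.local"]
NAME = "mcrouter-main.mw-mcrouter.svc.cluster.local"


def candidate_queries(name, search, ndots=5):
    """Return the names a stub resolver would try, in order."""
    if name.endswith("."):
        # Fully qualified: query exactly this name, no search-list expansion.
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search list.
        return [absolute] + expanded
    # Fewer dots than ndots: walk the search list first, then the name as-is.
    return expanded + [absolute]


if __name__ == "__main__":
    # Without the trailing dot the name has 4 dots (< 5), so the search
    # domains are tried before the real name.
    print(len(candidate_queries(NAME, SEARCH)))        # 4
    # With the trailing dot only one name is queried.
    print(len(candidate_queries(NAME + ".", SEARCH)))  # 1
```

Under these assumptions, each lookup without the trailing dot can cost several queries to CoreDNS before the correct name is tried, which is consistent with the drop in rps seen after the switch.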

CoreDNS rps:

  • Green: rps with mw-on-k8s using mcrouter-main.mw-mcrouter.svc.cluster.local
  • At ~10:50 UTC we switched to mcrouter-main.mw-mcrouter.svc.cluster.local. (note the trailing dot)
  • Yellow: baseline rps for CoreDNS

[image.png: graph of CoreDNS rps]

While this looks alright for now, we are unsure how it will hold up in times of high traffic or during deployments. For that reason, we would like to cache the IP address that mcrouter-main.mw-mcrouter.svc.cluster.local resolves to in APCu, with a TTL of 1s.
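The caching pattern proposed here can be sketched as follows. This is only an illustration in Python, not the MediaWiki change itself: the real implementation would live in PHP and use APCu (for example `apcu_fetch` and `apcu_store` with a TTL), whereas here a plain in-process dict stands in for APCu. The class name `TtlResolverCache`, its parameters, and the stub IP address are hypothetical.

```python
import socket
import time


class TtlResolverCache:
    """Cache hostname -> IP lookups for a short TTL (a stand-in for APCu)."""

    def __init__(self, ttl=1.0, resolve=socket.gethostbyname, clock=time.monotonic):
        self._ttl = ttl          # seconds an entry stays valid (1s in the task)
        self._resolve = resolve  # function doing the actual DNS lookup
        self._clock = clock      # injectable clock, useful for testing
        self._entries = {}       # hostname -> (ip, expires_at)

    def lookup(self, hostname):
        now = self._clock()
        entry = self._entries.get(hostname)
        if entry is not None and entry[1] > now:
            # Cache hit: no DNS request is made.
            return entry[0]
        # Cache miss or expired entry: resolve once and store the result.
        ip = self._resolve(hostname)
        self._entries[hostname] = (ip, now + self._ttl)
        return ip


if __name__ == "__main__":
    # Demo with a stub resolver so that no real DNS traffic is generated.
    lookups = []

    def stub_resolve(hostname):
        lookups.append(hostname)
        return "10.0.0.1"  # hypothetical ClusterIP

    cache = TtlResolverCache(ttl=1.0, resolve=stub_resolve)
    for _ in range(1000):
        cache.lookup("mcrouter-main.mw-mcrouter.svc.cluster.local.")
    print(f"{len(lookups)} DNS lookup(s) for 1000 cache lookups")
```

With a 1s TTL, the number of DNS requests per cache is bounded at roughly one per second regardless of request rate, while a changed ClusterIP would still be picked up within about a second.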

Defining mcrouter's location through an environment variable was first introduced in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/973838

Notes:

  • this task covers changes only for code running under php-fpm, not the CLI
  • by using mcrouter-main.mw-mcrouter.svc.cluster.local, Kubernetes knows where to route a request: in this case, to the node-local mw-mcrouter pod
  • the reason we are asking to have this information stored in APCu is that fetching from APCu is faster than making a DNS request

Event Timeline

jijiki triaged this task as High priority. Apr 24 2024, 8:41 AM

I am marking this as High Priority because the current status is:

  • codfw is using a mcrouter daemonset
  • eqiad is using the mcrouter container

We would like to consolidate those and move on to the next step, which is to remove the 2 mcrouter containers from the MediaWiki pod.

MSantos subscribed.

Moving to radar on our side. Please let me know if there's any action we should take on this ticket.

@MSantos It would be great if someone from MediaWiki-Engineering could take this on, as it requires changes in MediaWiki code.

As we move to using more services with a daemonset-like pattern that will require resolving DNS names, like T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s, it would really be appreciated if this could be prioritized.

Change #1039197 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/mediawiki-config@master] mc.php: if $_SERVER['MCROUTER_SERVER'] is set, resolve it

https://gerrit.wikimedia.org/r/1039197

Change #1039197 abandoned by Effie Mouzeli:

[operations/mediawiki-config@master] mc.php: if $MCROUTER_SERVER is set, resolve it

Reason:

bad idea

https://gerrit.wikimedia.org/r/1039197

Change #1039197 restored by Effie Mouzeli:

[operations/mediawiki-config@master] mc.php: if $MCROUTER_SERVER is set, resolve it

https://gerrit.wikimedia.org/r/1039197