
Investigate DNS query improvements in MediaWiki-on-k8s
Open, Medium, Public

Description

Background

Although the DNS resolution latency issue in T422455 was not the direct result of load on coredns, a typical resolution (i.e., one without dot-suffixing) fans out into a sequence of search-path-suffixed queries. As a result, when an issue affects some fraction of coredns queries, the probability that a given resolution is impacted increases with the number of queries it fans out to.
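To make the fan-out concrete, here is a minimal sketch of how a glibc-style stub resolver expands a name, together with the resulting probability that at least one query in the sequence is affected. The search list, the `ndots` value, and the 1% failure rate are illustrative assumptions (loosely modelled on Kubernetes pod defaults), not values taken from T422455. The sketch also assumes queries fail independently and ignores that A and AAAA lookups may each be issued per candidate.

```python
def candidate_queries(name, search, ndots=5):
    """Return the names a glibc-style stub resolver would try, in order."""
    if name.endswith("."):
        # Absolute (dot-suffixed) name: the search list is skipped entirely.
        return [name]
    suffixed = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search list.
        return [absolute] + suffixed
    # Too few dots: walk the search list first, then try the name as-is.
    return suffixed + [absolute]


def p_impacted(p_bad, n_queries):
    """Probability that at least one of n independent queries is affected."""
    return 1 - (1 - p_bad) ** n_queries


# Hypothetical search list and failure rate, for illustration only.
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
P_BAD = 0.01

for host in ("api.example.org", "api.example.org."):
    queries = candidate_queries(host, SEARCH)
    print(host, len(queries), round(p_impacted(P_BAD, len(queries)), 4))
```

Under these assumptions, an external name without the trailing dot walks all four candidates before succeeding, so a 1% per-query issue becomes roughly a 3.9% per-resolution issue, while the dot-suffixed form issues a single query.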

Proposal

We should look for low-hanging fruit where we can avoid these queries, which are often unnecessary (e.g., for an external name, where no search-path-suffixed candidate can match). Dot-suffixing is again one example solution. Of particular interest are queries that are high volume, critical, or unlikely to be cached.

Two potential examples (T422455#11808708): non-mesh services referenced in wmf-config/ProductionServices.php and envoy upstream cluster configuration rendered by the mesh.configuration module.
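As a rough illustration of what dot-suffixing a configured hostname could look like, the sketch below appends a trailing dot only to names under zones known to be fully qualified. The helper name `ensure_fqdn` and the `known_zones` values are hypothetical and do not come from wmf-config or the mesh.configuration module. IP literals and short or cluster-relative names are left untouched, on the assumption that the latter still rely on the search path.

```python
import ipaddress


def ensure_fqdn(host, known_zones=("example.org", "example.net")):
    """Append a trailing dot to hosts under known fully qualified zones."""
    if host.endswith("."):
        return host  # already absolute
    try:
        # IP literals (including bracketed IPv6) never trigger a DNS lookup.
        ipaddress.ip_address(host.strip("[]"))
        return host
    except ValueError:
        pass
    if any(host == zone or host.endswith("." + zone) for zone in known_zones):
        return host + "."
    # Possibly a short or cluster-relative name: keep the search path in effect.
    return host


for host in ("api.example.org", "10.0.0.1", "my-svc.my-ns"):
    print(host, "->", ensure_fqdn(host))
```

Restricting the change to an explicit list of zones is one conservative option: it avoids breaking names that are only resolvable through the search path.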

Event Timeline

Scott_French moved this task from Inbox to Needs Info / Blocked on the ServiceOps new board.

Similar to T422955, if this sounds reasonable, let's try to schedule it for this quarter.

@JMeybohm - Do you think this is something we could make meaningful progress on this quarter?

I'm thinking the monitoring improvement (T422955) might be slightly higher priority in the short term (i.e., if it's a choice between them), but I'm also not sure how much work some of the known low-hanging fruit here might be (e.g., mesh configuration).

I think we should at least do the mesh config changes (and chart module updates) ASAP since it takes quite some time for them to roll out naturally.

Great, thanks @JMeybohm - Do you have a sense of what the highest-priority changes to the mesh configuration may be? e.g., some combination of dns_refresh_rate and respect_dns_ttl?

I was more aiming towards adding the dot-suffix throughout the mesh config, eliminating a bunch of unnecessary DNS queries without changing the behavior otherwise.
As a second step, we could look at respect_dns_ttl, which feels like the more natural choice compared to changing the static dns_refresh_rate.

Ah, got it - thanks, @JMeybohm. For some reason, I thought there was something else beyond the dot-suffixing being proposed. In any case, +1 to starting with just the dot-suffixing, as it's clearly beneficial and implies minimal behavior change.