
Phabricator intermittently slow; db connection failures to m3-master.eqiad.wmnet with "Temporary failure in name resolution"
Closed, Declined (Public)

Description

We have had several reports of Phabricator responding slowly, and I have confirmed it. It seems intermittent. I'm seeing db connection errors in the logs with "Temporary failure in name resolution":

Got error 'PHP message: [2021-03-31 20:33:29] PHLOG: 'Retrying database connection to "m3-master.eqiad.wmnet" after connection failure (attempt 1; "AphrontConnectionQueryException"; error #2002): Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #2002: php_network_getaddresses: getaddrinfo failed: Temporary failure in name resolution.' at [/srv/deployment/phabricator/deployment-cache/revs/4547f31de8f69854e0cd9d3e0a802ce517360ee0/phabricator/src/infrastructure/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:135]'
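For context, "Temporary failure in name resolution" is the message the system resolver reports for the EAI_AGAIN error from getaddrinfo, i.e. a lookup that may succeed if retried. The log above shows Phabricator retrying the whole database connection; the following is only a rough Python sketch of retrying the lookup itself, not Phabricator's code. The function name `resolve_with_retry` and its parameters are hypothetical, and the resolver is injectable so the behaviour can be exercised without touching real DNS.

```python
import socket
import time


def resolve_with_retry(hostname, attempts=3, delay=0.1,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Return the sorted unique IP addresses for hostname.

    Only EAI_AGAIN ("temporary failure") is retried; any other resolver
    error, or EAI_AGAIN on the last attempt, is re-raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            infos = resolver(hostname, None)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # the address string is the first element of sockaddr.
            return sorted({info[4][0] for info in infos})
        except socket.gaierror as exc:
            if exc.errno != socket.EAI_AGAIN or attempt == attempts:
                raise
            sleep(delay)
```

A retry like this only hides short resolver hiccups; it does not address why the lookups fail in the first place.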

Event Timeline

Isn't this why, in MW land, we tend to use IP addresses rather than hostnames: because DNS resolution (especially inside PHP) can be flaky?

For reference:

$ host m3-master.eqiad.wmnet
m3-master.eqiad.wmnet is an alias for dbproxy1020.eqiad.wmnet.
dbproxy1020.eqiad.wmnet has address 10.64.32.179

@Reedy: perhaps? but we've had it configured that way forever and I assume db ops might want to change things like moving the dbs around without touching phabricator's config.

Just because something has worked fine for a long time doesn't mean it always will. Maybe we've been lucky that Phab hasn't fallen foul of it, or at least not at a high enough error rate to be noticeable.

As in MW, it wouldn't cause problems for a long time; then something would happen or change, sometimes while we were having other outages. And as doing the DNS lookups was also a performance overhead, in most cases we changed them to IP addresses.

I'm certainly not against doing it that way, I presume we could have puppet resolve the IP and that way dns changes would still get updated in the config with just a puppet run.
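To make the idea concrete, here is a minimal sketch of "resolve once when the config is rendered, then hand the application an IP". It is written in Python rather than Puppet and is not what the patch does. The function name `render_db_config` and the `resolved_from` key are made up for illustration, and `mysql.host` is assumed to be the setting the application reads.

```python
import socket


def render_db_config(hostname, resolve=socket.gethostbyname):
    """Resolve hostname once and return config values with the IP pinned."""
    ip = resolve(hostname)
    return {
        "mysql.host": ip,           # value the application would connect to
        "resolved_from": hostname,  # hypothetical key, kept so the origin stays documented
    }
```

The rendered value only changes when the rendering step runs again, so a DNS change is invisible to the application until then.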

Change 676154 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: use IP instead of host name for mysql host value

https://gerrit.wikimedia.org/r/676154

I'm not sure if we tend to use IP addresses directly in MediaWiki out of latency concerns, reliability concerns, PHP/other client bugginess concerns, or DNS recursor capacity concerns, or some mix of all of the above. (I've seen all of the above occur at different times.)

But while there is a long precedent of this, I personally consider it something of an antipattern. I think there is at least rough consensus that eventually we'd like to eliminate it in favor of better solutions like T171498: Implement machine-local forwarding DNS caches.
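As a rough illustration of why a cache helps, the sketch below keeps recent answers for a fixed time so that most lookups never reach the recursor. This is an in-process toy, not a machine-local forwarding cache as proposed in T171498. The class name `CachingResolver` and the fixed 30-second TTL are assumptions; a real DNS cache would honour the TTL carried by each record.

```python
import socket
import time


class CachingResolver:
    """Cache hostname -> IP answers for a fixed number of seconds."""

    def __init__(self, ttl_seconds=30.0, resolve=socket.gethostbyname,
                 clock=time.monotonic):
        self._ttl = ttl_seconds
        self._resolve = resolve
        self._clock = clock
        self._cache = {}  # hostname -> (expires_at, ip)

    def lookup(self, hostname):
        now = self._clock()
        entry = self._cache.get(hostname)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: no upstream query
        ip = self._resolve(hostname)  # miss or expired: ask upstream
        self._cache[hostname] = (now + self._ttl, ip)
        return ip
```

Unlike a hardcoded IP, a cached answer expires on its own, so a DNS change is picked up after at most one TTL.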

"I'm certainly not against doing it that way, I presume we could have puppet resolve the IP and that way dns changes would still get updated in the config with just a puppet run."

This is true, but still provides opportunities for stubbed toes on production changes -- Puppet only runs every half an hour, and there's no automatic coupling of DNS pushes to Puppet runs. And we don't always bother with this level of sophistication and sometimes hardcode things even when we could do better (see also T239862).

Now, that all being said: it does sound like it might be appropriate in this case. If you do implement such an override, please just make sure it is well-documented, and that there's a task to unroll it eventually. Oh, and make sure DBA is cool with it as well, as it is another piece of complexity for them to keep in mind when managing the dbproxies.

Hope this makes sense and helps :)

Change 676154 abandoned by Dzahn:

[operations/puppet@production] phabricator: use IP instead of host name for mysql host value

Reason:

we don't have the module providing this function

https://gerrit.wikimedia.org/r/676154

Marostegui subscribed.

I am not sure I like this idea.
We use m3-master.eqiad.wmnet so we can point everything to it, so the application doesn't need to know about specific hosts. We have one master and one slave behind the proxies so we can have HA in case the master fails.

In addition to this, we have a pair of proxies serving the same shard. For instance, for m3-master we have
dbproxy1020 as the active proxy and dbproxy1016 as the standby proxy in case the first one fails.
If that happens, we just need to change the DNS to point m3-master to the secondary proxy. This logic would break if we started pointing applications at specific IPs rather than the CNAME.

Let's not do this please.
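The failover concern can be shown with a toy model: a client that resolves the CNAME follows a DNS switch to the standby proxy, while a client holding a pinned IP keeps pointing at the failed one. The zone below is a plain dictionary, not real DNS. The dbproxy1020 address comes from the `host` output earlier in this task; the dbproxy1016 address is a made-up placeholder from the documentation range, since the real one is not given here.

```python
# Toy zone: names map either to another name (CNAME-like) or to an IP.
zone = {
    "m3-master.eqiad.wmnet": "dbproxy1020.eqiad.wmnet",  # CNAME to the active proxy
    "dbproxy1020.eqiad.wmnet": "10.64.32.179",           # from the host output above
    "dbproxy1016.eqiad.wmnet": "192.0.2.16",             # hypothetical placeholder IP
}


def resolve(name, zone):
    """Follow CNAME-like entries until a value that is not a zone key is reached."""
    seen = set()
    while name in zone:
        if name in seen:
            raise ValueError("CNAME loop at " + name)
        seen.add(name)
        name = zone[name]
    return name


# What a config with a hardcoded IP would hold, captured before the failover.
pinned_ip = resolve("m3-master.eqiad.wmnet", zone)

# Failover: only the DNS record is repointed at the standby proxy.
zone["m3-master.eqiad.wmnet"] = "dbproxy1016.eqiad.wmnet"

# A client that resolves the CNAME now reaches the standby; the pinned IP does not change.
via_cname = resolve("m3-master.eqiad.wmnet", zone)
```

After the switch, `via_cname` is the standby's address while `pinned_ip` still names the failed proxy, which is the breakage described above.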

What are the next steps here? Are we OK to maintain the current state until a fix (possibly the one mentioned by @CDanis) is available?

Removing the DBA tag and subscribing myself instead. Once there are specific actions for DBA please re-add us and/or @mention me.

T171498 implies solving this in more places than just Phabricator so I think we can close this task.