Phabricator intermittently slow; db connection failures to m3-master.eqiad.wmnet with "Temporary failure in name resolution"
Closed, Declined · Public

Assigned To

None

Authored By

	brennen
	Mar 31 2021, 8:49 PM

Description

Have had several reports of Phabricator responding slowly, and have confirmed. Seems intermittent. I'm seeing db connection errors in logs with "Temporary failure in name resolution":

Got error 'PHP message: [2021-03-31 20:33:29] PHLOG: 'Retrying database connection to "m3-master.eqiad.wmnet" after connection failure (attempt 1; "AphrontConnectionQueryException"; error #2002): Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #2002: php_network_getaddresses: getaddrinfo failed: Temporary failure in name resolution.' at [/srv/deployment/phabricator/deployment-cache/revs/4547f31de8f69854e0cd9d3e0a802ce517360ee0/phabricator/src/infrastructure/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:135]'

Details

	Subject	Repo	Branch	Lines +/-
	phabricator: use IP instead of host name for mysql host value	operations/puppet	production	+5 -3

Related Objects

		Status	Subtype	Assigned	Task
		Declined		None	T279013 Phabricator intermittently slow; db connection failures to m3-master.eqiad.wmnet with "Temporary failure in name resolution"
		Open		None	T171498 Implement machine-local forwarding DNS caches

Event Timeline

brennen created this task. · Mar 31 2021, 8:49 PM

Restricted Application added a subscriber: Aklapper. · Mar 31 2021, 8:50 PM

brennen added a project: serviceops. · Mar 31 2021, 8:53 PM

Isn't this why, in MW land, we tend to use IP addresses rather than hostnames: because DNS resolution (especially inside PHP) can be flaky?

For reference:

$ host m3-master.eqiad.wmnet
m3-master.eqiad.wmnet is an alias for dbproxy1020.eqiad.wmnet.
dbproxy1020.eqiad.wmnet has address 10.64.32.179
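
A single successful `host` lookup does not say much about an intermittent failure. As a rough sketch (not something run in this task), the lookup could be repeated from the Phabricator host to count how often it fails. The function name `probe_resolution`, the `resolver` parameter, and the use of port 3306 (the MySQL default) are assumptions made for illustration:

```python
import socket
import time


def probe_resolution(hostname, attempts=20, delay=0.0, resolver=socket.getaddrinfo):
    """Resolve `hostname` repeatedly and report how often resolution fails.

    `resolver` defaults to socket.getaddrinfo; it is a parameter only so the
    probe can be exercised without network access.
    """
    failures = 0
    slowest = 0.0
    for _ in range(attempts):
        started = time.monotonic()
        try:
            # 3306 is the default MySQL port; assumed here, not taken from the task.
            resolver(hostname, 3306)
        except socket.gaierror:
            # "Temporary failure in name resolution" surfaces as socket.gaierror.
            failures += 1
        slowest = max(slowest, time.monotonic() - started)
        if delay:
            time.sleep(delay)
    return {"attempts": attempts, "failures": failures, "slowest_seconds": slowest}
```

Calling `probe_resolution("m3-master.eqiad.wmnet", attempts=50, delay=0.1)` on the affected host would give a failure count and the slowest lookup time to compare against the errors in the log.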

@Reedy: perhaps? but we've had it configured that way forever and I assume db ops might want to change things like moving the dbs around without touching phabricator's config.

Just because something has worked fine for a long time doesn't mean it always will. Maybe we've just been lucky that Phab hasn't fallen foul of it, or at least not at a high enough error rate to be noticeable.

In MW it was the same: it wouldn't cause problems for a long time, then something would happen or change, sometimes while we were already having other outages. And since the DNS lookups were also a performance overhead, in most cases we changed them to IP addresses.

I'm certainly not against doing it that way, I presume we could have puppet resolve the IP and that way dns changes would still get updated in the config with just a puppet run.
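
The idea of resolving the name when the config is generated, rather than on every connection, can be sketched roughly as follows. This is Python rather than Puppet and is not the content of any patch on this task; `render_mysql_host`, the `resolver` parameter, and the output format are hypothetical:

```python
import socket


def render_mysql_host(hostname, resolver=socket.gethostbyname):
    """Return config lines that pin the address `hostname` resolves to right now.

    `resolver` is injectable only so the function can be tried without DNS.
    """
    address = resolver(hostname)
    return (
        f"# resolved from {hostname} when this file was generated\n"
        f"mysql.host: {address}\n"
    )
```

The pinned address is only as fresh as the last time the config was generated, so a DNS change takes effect on the next run rather than immediately.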

Change 676154 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: use IP instead of host name for mysql host value

https://gerrit.wikimedia.org/r/676154

gerritbot added a project: Patch-For-Review. · Mar 31 2021, 9:20 PM

I'm not sure whether we tend to use IP addresses directly in MediaWiki out of latency concerns, reliability concerns, PHP/other client bugginess concerns, DNS recursor capacity concerns, or some mix of these. (I've seen all of them occur at different times.)

But while there is a long precedent of this, I personally consider it something of an antipattern. I think there is at least rough consensus that eventually we'd like to eliminate it in favor of better solutions like T171498: Implement machine-local forwarding DNS caches.

In T279013#6962446, @mmodell wrote:

I'm certainly not against doing it that way, I presume we could have puppet resolve the IP and that way dns changes would still get updated in the config with just a puppet run.

This is true, but still provides opportunities for stubbed toes on production changes -- Puppet only runs every half an hour, and there's no automatic coupling of DNS pushes to Puppet runs. And we don't always bother with this level of sophistication and sometimes hardcode things even when we could do better (see also T239862).

Now, that all being said: it does sound like it might be appropriate in this case. If you do implement such an override, please just make sure it is well-documented, and that there's a task to unroll it eventually. Oh, and make sure DBA is cool with it as well, as it is another piece of complexity for them to keep in mind when managing the dbproxies.

Hope this makes sense and helps :)

Change 676154 abandoned by Dzahn:

[operations/puppet@production] phabricator: use IP instead of host name for mysql host value

Reason:

we don't have the module providing this function

https://gerrit.wikimedia.org/r/676154

Maintenance_bot removed a project: Patch-For-Review. · Mar 31 2021, 10:10 PM

I am not sure I like this idea.
We use m3-master.eqiad.wmnet so we can point everything to it, so the application doesn't need to know about specific hosts. We have one master and one slave behind the proxies so we can have HA in case the master fails.

In addition to this, we have a pair of proxies serving the same shard. For instance, for m3-master we have
dbproxy1020 as the active proxy and dbproxy1016 as the standby proxy in case the first one fails.
If that happens, we just need to change the DNS to point m3-master to the secondary proxy. This logic would be broken if we started pointing applications at specific IPs rather than the CNAME.
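
The failover concern can be illustrated with a toy model of the records involved. This is a simulation with a plain dictionary, not a real DNS query; the `resolve` helper is hypothetical, and the address given for dbproxy1016 is a documentation placeholder, not its real address:

```python
def resolve(records, name):
    """Follow CNAME-style aliases in `records` until reaching an address."""
    seen = set()
    while name in records:
        if name in seen:
            raise ValueError(f"alias loop at {name}")
        seen.add(name)
        name = records[name]
    return name


records = {
    "m3-master.eqiad.wmnet": "dbproxy1020.eqiad.wmnet",
    "dbproxy1020.eqiad.wmnet": "10.64.32.179",
    "dbproxy1016.eqiad.wmnet": "192.0.2.16",  # placeholder address, not the real one
}

# What a config with a hardcoded IP would hold.
pinned_ip = resolve(records, "m3-master.eqiad.wmnet")

# Failover: repoint the alias at the standby proxy.
records["m3-master.eqiad.wmnet"] = "dbproxy1016.eqiad.wmnet"

# A client using the name follows the alias to the standby.
by_name = resolve(records, "m3-master.eqiad.wmnet")
print(by_name)
# A client using the pinned IP still targets the old active proxy.
print(pinned_ip)
```

After the alias is repointed, only the client that resolves the name reaches the standby; the pinned address needs a separate config change.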

Let's not do this please.

jbond subscribed. · Apr 7 2021, 4:41 PM

Aklapper moved this task from To Triage to Infrastructure on the Phabricator board. · Apr 15 2021, 3:13 PM

What are the next steps here? Are we OK to maintain the current state until a fix (possibly the one mentioned by @CDanis) is available?

I've marked this as blocked by T171498: Implement machine-local forwarding DNS caches because that sounds like the right solution.

mmodell triaged this task as Low priority. · Apr 27 2021, 5:56 PM

Removing the DBA tag and subscribing myself instead. Once there are specific actions for DBA please re-add us and/or @mention me.

brennen moved this task from Backlog to Radar on the User-brennen board. · Jun 30 2021, 3:07 PM

jijiki moved this task from Incoming 🐫 to 🙈🙉🙊 Backlog on the serviceops board. · Sep 28 2022, 2:23 PM

jijiki edited projects, added collaboration-services; removed serviceops. · Dec 13 2022, 4:08 PM

T171498 implies solving this in more places than just Phabricator so I think we can close this task.
