
"Couldn't resolve host 'ms-fe.pmtpa.wmnet'" CloudFiles errors
Closed, Declined (Public)

Description

Lots of spam in swift-backend.log on fluorine, all coming from precise job runners (nothing else seems to be affected):

2012-09-25 19:21:46 mw8 zhwiki: InvalidResponseException in 'SwiftFileBackend::doGetFileStat' (given '{"src":"mwstore:\/\/local-swift\/timeline-render\/514fa51b83364f1ee36033b141081616.err"}'): Invalid response (0): (curl error: 6) Couldn't resolve host 'ms-fe.pmtpa.wmnet': Failed to obtain valid HTTP response.
2012-09-25 19:21:46 mw8 zhwiki: InvalidResponseException in 'SwiftFileBackend::doGetFileStat' (given '{"src":"mwstore:\/\/local-swift\/timeline-render\/514fa51b83364f1ee36033b141081616.err"}'): Invalid response (0): (curl error: 6) Couldn't resolve host 'ms-fe.pmtpa.wmnet': Failed to obtain valid HTTP response.
2012-09-25 19:21:46 mw8 zhwiki: InvalidResponseException in 'SwiftFileBackend::doGetFileStat' (given '{"src":"mwstore:\/\/local-swift\/timeline-render\/514fa51b83364f1ee36033b141081616.map"}'): Invalid response (0): (curl error: 6) Couldn't resolve host 'ms-fe.pmtpa.wmnet': Failed to obtain valid HTTP response.
2012-09-25 19:22:24 mw8 zhwiki: InvalidResponseException in 'SwiftFileBackend::doGetFileStat' (given '{"src":"mwstore:\/\/local-swift\/timeline-render\/42983552dc252c7e27508bb31d6940a3.err"}'): Invalid response (0): (curl error: 6) Couldn't resolve host 'ms-fe.pmtpa.wmnet': Failed to obtain valid HTTP response.

It seems somewhat random, in that running eval.php on those boxes and manually making Swift calls via CloudFiles works fine.


Version: wmf-deployment
Severity: normal

Details

Reference
bz40514

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 1:11 AM
bzimport added projects: DNS, acl*sre-team.
bzimport set Reference to bz40514.
bzimport added a subscriber: Unknown Object (MLST).

Looking at this and the "swift" log via fenari, it seems like these requests are not even hitting Swift (or, if they are, they must be dying hard enough that no response is given and nothing is logged).

The error log has died down since ms-be3 was pulled out. This might easily come back, at least until the replacement hardware is up.

I improved the error messages for auth requests; they now read "Couldn't resolve host 'ms-fe.pmtpa.wmnet'", so these are all the same error.

Switched auth URL to an IP to avoid dns lookups for auth requests. I'll see if this works around the dns problems or pushes them down the road.
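For illustration, a minimal sketch of that kind of workaround: swapping the hostname in the auth URL for a literal IP so no DNS lookup happens for that request. The helper name, URL, and IP below are placeholders (192.0.2.10 is an RFC 5737 documentation address), not the actual production values.

```python
from urllib.parse import urlsplit, urlunsplit

def pin_auth_url(url: str, ip: str) -> str:
    """Replace the hostname in a URL with a literal IP so the HTTP
    client never consults DNS for this request. Both arguments are
    placeholders; the real auth URL and frontend IP are config values."""
    parts = urlsplit(url)
    netloc = parts.netloc.replace(parts.hostname, ip)
    return urlunsplit(parts._replace(netloc=netloc))

# 192.0.2.10 is a documentation address, not the real Swift frontend.
print(pin_auth_url("http://ms-fe.pmtpa.wmnet/auth/v1.0", "192.0.2.10"))
# → http://192.0.2.10/auth/v1.0
```

One caveat with this simple swap: the Host header now carries the IP, which matters if the frontend virtual-hosts on the name; curl's `--resolve` option pins a name to an IP without changing the Host header.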

(In reply to comment #4)

Switched auth URL to an IP to avoid dns lookups for auth requests. I'll see if
this works around the dns problems or pushes them down the road.

The can is down the road :)

This really needs an ops person to look at.

Sorry for not updating the ticket earlier -- I've actually attempted to debug this and have chatted with Aaron last week or the one before that.

I've verified that at the time the errors were spawned, DNS replies were coming into the system. Also, it's peculiar that no other infrastructure seems to be affected, not even the application servers (this apparently affects only job runners). It has also manifested only recently, possibly after the Precise upgrade.

I have some suspicion that it may be curl-related (curl has an internal DNS cache that is enabled by default, so these are not just simple libc resolver calls).
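For context, curl error 6 (CURLE_COULDNT_RESOLVE_HOST) is curl's analogue of a failed getaddrinfo() call. A minimal Python sketch of the plain libc-level lookup path, without curl's caching layer, with placeholder hostnames:

```python
import socket

def resolve(host: str):
    """Look the host up via the system resolver (getaddrinfo), the
    libc path that curl's internal DNS cache sits in front of.
    Returns the addresses, or None on the equivalent of curl error 6."""
    try:
        infos = socket.getaddrinfo(host, 80, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos})
    except socket.gaierror:
        return None

print(resolve("localhost"))             # resolves on any box
print(resolve("no-such-host.invalid"))  # .invalid never resolves -> None
```

By default curl caches resolved entries for 60 seconds (CURLOPT_DNS_CACHE_TIMEOUT), so a transient resolver hiccup can behave differently under curl than under a direct libc call like the one above.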

I've asked Aaron to isolate the code in question and produce some kind of script that we can run repeatedly, reproduce the issue, and run under strace/gdb, rather than trying to attach to random jobrunners and hoping we catch it. The issue happens on jobrunners, so it's under PHP CLI; the environment won't be that different anyway.

(In reply to comment #6 by Faidon)

I've asked Aaron to isolate the code in question and produce some kind of
script that we can run repeatedly, reproduce the issue, and run under
strace/gdb, rather than trying to attach to random jobrunners and hoping we
catch it.

Faidon / Aaron: Has this happened yet?


Tried that a long time ago, didn't work.

This is still occurring from time to time :(

$ zgrep -c 'resolve host' swift-backend.log-201312*
swift-backend.log-20131201.gz:0
swift-backend.log-20131202.gz:0
swift-backend.log-20131203.gz:0
swift-backend.log-20131204.gz:0
swift-backend.log-20131205.gz:0
swift-backend.log-20131206.gz:0
swift-backend.log-20131207.gz:0
swift-backend.log-20131208.gz:0
swift-backend.log-20131209.gz:0
swift-backend.log-20131210.gz:51
swift-backend.log-20131211.gz:115
swift-backend.log-20131212.gz:35
swift-backend.log-20131213.gz:0
swift-backend.log-20131214.gz:0
swift-backend.log-20131215.gz:0
$