Service metrics starts crashing if non-resolvable logstash domain is provided
Open, Needs Triage, Public

Description

If the statsd metrics collector in a service-runner based service (like RESTBase) is configured to send metrics to a non-resolvable domain (or the domain becomes non-resolvable while the service is up), the service starts crashing with an error like:

[2019-12-20T18:25:11.751Z] FATAL: restbase/13462 on deployment-restbase01: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND labmon1001.eqiad.wmnet (err.code=ENOTFOUND, err.levelPath=fatal/service-runner/unhandled)
    Error: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND labmon1001.eqiad.wmnet
        at handleCallback (/srv/deployment/restbase/deploy-cache/revs/92acf1e5ae89accc351b9e7b08d3dc1d9590551b/node_modules/hot-shots/lib/statsd.js:357:32)
        at doSend (dgram.js:372:7)
        at afterDns (dgram.js:362:5)
        at /srv/deployment/restbase/deploy-cache/revs/92acf1e5ae89accc351b9e7b08d3dc1d9590551b/node_modules/dnscache/lib/index.js:136:58
        at Array.forEach (native)
        at /srv/deployment/restbase/deploy-cache/revs/92acf1e5ae89accc351b9e7b08d3dc1d9590551b/node_modules/dnscache/lib/index.js:136:34
        at GetAddrInfoReqWrap.asyncCallback [as callback] (dns.js:62:16)
        at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:76:17)

The service should recover from a condition like this instead of exiting with a fatal error.
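
One possible direction, as a minimal sketch rather than a proposed patch: route metrics send failures to a handler that logs a rate-limited warning and drops the sample, so a DNS failure such as ENOTFOUND never reaches the unhandled/fatal path. The helper name `makeMetricsErrorHandler`, its parameters, and the shape of the logged object are hypothetical. The commented usage assumes the hot-shots client accepts an `errorHandler` constructor option; whether and where service-runner could pass one through has not been checked here.

```javascript
// Sketch: downgrade metrics send failures to rate-limited warnings.
// `makeMetricsErrorHandler`, `logWarn`, `intervalMs` and `now` are
// hypothetical names introduced for illustration only.
function makeMetricsErrorHandler(logWarn, intervalMs = 60000, now = Date.now) {
    let lastLogged = -Infinity; // time of the last emitted warning
    let suppressed = 0;         // errors swallowed since that warning

    return function handleMetricsError(err) {
        const t = now();
        if (t - lastLogged < intervalMs) {
            // Within the quiet period: count the error, emit nothing.
            suppressed += 1;
            return;
        }
        try {
            logWarn({
                msg: 'Error sending metrics; dropping the sample',
                code: err && err.code,
                detail: err && err.message,
                suppressedSinceLastLog: suppressed
            });
        } catch (e) {
            // A failing logger must not turn a metrics error into a crash.
        }
        lastLogged = t;
        suppressed = 0;
    };
}

// Usage sketch (assumption: hot-shots `errorHandler` option):
// const StatsD = require('hot-shots');
// const client = new StatsD({
//     host: 'labmon1001.eqiad.wmnet',
//     errorHandler: makeMetricsErrorHandler((info) => console.warn(info))
// });
```

Rate limiting matters here because every metric sent while the domain is unresolvable fails the same way; without it, the log would be flooded with one warning per sample.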