
Eventbus does not gracefully handle changes in DNS recursors
Closed, Declined · Public

Description

Today, when Moritz depooled and rebooted acamar (a DNS recursor) in codfw, all of the kafka200[123] hosts failed to answer Pybal's ProxyFetch health checks, resulting in a brief outage.

Moritz rebooted acamar at ~11:38 UTC, and this is the first error in pybal logs on lvs2003:

[eventbus_8085 ProxyFetch] WARN: kafka2001.codfw.wmnet (enabled/up/pooled): Fetch failed, 30.000 s

The same happened to kafka2002; kafka2003 was already depooled due to the network maintenance performed earlier for the row-d switches.

@ema tried to depool acamar again from dns-rec-ln.codfw.wmnet (the resolver listed in the hosts' /etc/resolv.conf) at ~12:09 UTC, and the following entries appeared in the pybal logs:

Jul 19 12:09:37 lvs2003 pybal[45967]: [eventbus_8085 ProxyFetch] WARN: kafka2001.codfw.wmnet (enabled/up/pooled): Fetch failed, 6.060 s
Jul 19 12:09:37 lvs2003 pybal[45967]: [eventbus_8085] ERROR: Monitoring instance ProxyFetch reports server kafka2001.codfw.wmnet (enabled/up/pooled) down: Getting http://localhost/v1/topics took longer than 5 seconds.
Jul 19 12:09:37 lvs2003 pybal[45967]: [eventbus_8085 ProxyFetch] WARN: kafka2002.codfw.wmnet (enabled/up/pooled): Fetch failed, 5.001 s
Jul 19 12:09:37 lvs2003 pybal[45967]: [eventbus_8085] ERROR: Monitoring instance ProxyFetch reports server kafka2002.codfw.wmnet (enabled/up/pooled) down: Getting http://localhost/v1/topics took longer than 5 seconds.
Jul 19 12:09:37 lvs2003 pybal[45967]: [eventbus_8085] ERROR: Could not depool server kafka2002.codfw.wmnet because of too many down!
Jul 19 12:09:40 lvs2003 pybal[45967]: [eventbus_8085 ProxyFetch] WARN: kafka2003.codfw.wmnet (enabled/up/pooled): Fetch failed, 8.024 s
Jul 19 12:09:40 lvs2003 pybal[45967]: [eventbus_8085] ERROR: Monitoring instance ProxyFetch reports server kafka2003.codfw.wmnet (enabled/up/pooled) down: Getting http://localhost/v1/topics took longer than 5 seconds.
Jul 19 12:09:40 lvs2003 pybal[45967]: [eventbus_8085] ERROR: Could not depool server kafka2003.codfw.wmnet because of too many down!
Jul 19 12:09:47 lvs2003 pybal[45967]: [eventbus_8085] INFO: Server kafka2001.codfw.wmnet (enabled/partially up/not pooled) is up
Jul 19 12:09:55 lvs2003 pybal[45967]: [eventbus_8085 ProxyFetch] WARN: kafka2003.codfw.wmnet (enabled/partially up/pooled): Fetch failed, 5.001 s
Jul 19 12:09:55 lvs2003 pybal[45967]: [eventbus_8085 ProxyFetch] WARN: kafka2002.codfw.wmnet (enabled/partially up/pooled): Fetch failed, 8.040 s
Jul 19 12:10:10 lvs2003 pybal[45967]: [eventbus_8085 ProxyFetch] WARN: kafka2002.codfw.wmnet (enabled/partially up/pooled): Fetch failed, 5.001 s
Jul 19 12:10:12 lvs2003 pybal[45967]: [eventbus_8085 ProxyFetch] WARN: kafka2001.codfw.wmnet (enabled/up/pooled): Fetch failed, 5.001 s
Jul 19 12:10:12 lvs2003 pybal[45967]: [eventbus_8085] ERROR: Monitoring instance ProxyFetch reports server kafka2001.codfw.wmnet (enabled/up/pooled) down: Getting http://localhost/v1/topics took longer than 5 seconds.
Jul 19 12:10:12 lvs2003 pybal[45967]: [eventbus_8085] ERROR: Could not depool server kafka2001.codfw.wmnet because of too many down!
Jul 19 12:10:13 lvs2003 pybal[45967]: [eventbus_8085 ProxyFetch] WARN: kafka2003.codfw.wmnet (enabled/partially up/pooled): Fetch failed, 8.023 s
Jul 19 12:10:32 lvs2003 pybal[45967]: [eventbus_8085] INFO: Server kafka2001.codfw.wmnet (enabled/partially up/pooled) is up
Jul 19 12:10:32 lvs2003 pybal[45967]: [eventbus_8085] INFO: Leaving previously pooled but down server kafka2001.codfw.wmnet pooled
Jul 19 12:10:32 lvs2003 pybal[45967]: [eventbus_8085] INFO: Server kafka2002.codfw.wmnet (enabled/partially up/pooled) is up
Jul 19 12:10:32 lvs2003 pybal[45967]: [eventbus_8085] INFO: Leaving previously pooled but down server kafka2002.codfw.wmnet pooled
Jul 19 12:10:34 lvs2003 pybal[45967]: [eventbus_8085] INFO: Server kafka2003.codfw.wmnet (enabled/partially up/not pooled) is up

There might be an issue in the eventbus code in how it handles changes in DNS recursors.

Event Timeline

@ema and I set up an experiment: we added 10.64.32.155 statsd.eqiad.wmnet to /etc/hosts on kafka2002 (and not on the others) and then depooled acamar. Pybal registered ProxyFetch failures only for kafka200[13]. We repeated the experiment applying the hack to all of the hosts, and pybal didn't log anything bad.

The initial suspect was eventlogging's statsd writer, but @Ottomata told us that the writer is not used and that sprockets.mixins.statsd was the likely culprit. This is the plugin that Tornado uses to push metrics to statsd for each handled request. Digging a bit into its core dependency, sprockets.clients.statsd, we found these bits of code:

    global STATSD_ADDR
    connection_string = os.getenv('STATSD')
    if connection_string:
        url = urlparse.urlparse(connection_string)
        STATSD_ADDR = (url.hostname, url.port)
    else:
        STATSD_ADDR = (os.getenv('STATSD_HOST', 'localhost'),
                       int(os.getenv('STATSD_PORT', 8125)))

[...]

def _send(key, value, metric_type):
    """Send the specified value to the statsd daemon via UDP without a
    direct socket connection.
    :param str value: The properly formatted statsd counter value
    """
    if STATSD_PREFIX:
        key = '.'.join([STATSD_PREFIX, key])
    try:
        STATSD_SOCKET.sendto('{0}:{1}|{2}'.format(key,
                                                  value,
                                                  metric_type).encode(),
                             STATSD_ADDR)

We set Environment="STATSD_HOST=statsd.eqiad.wmnet" in eventbus' systemd unit, so the value is a hostname rather than a raw IP. The plugin apparently never resolves it up front, so sendto() ends up issuing a DNS lookup for every _send() call.
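A minimal illustration of why that matters (our own example, not taken from the sprockets source, assuming standard CPython socket behaviour): passing a hostname to a UDP socket's sendto() makes the socket module resolve it again on every call, whereas passing an IP needs no lookup.

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Hostname destination: every sendto() goes through getaddrinfo(), so each
# metric can trigger a DNS query against the configured recursor.
sock.sendto(b'eventbus.test:1|c', ('statsd.eqiad.wmnet', 8125))

# IP destination: no lookup is needed, the datagram is sent directly.
sock.sendto(b'eventbus.test:1|c', ('10.64.32.155', 8125))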

Change 366561 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventlogging@master] Resolve STATSD_HOST for sprockets.mixins.statsd to avoid too many DNS lookups

https://gerrit.wikimedia.org/r/366561

Change 366561 merged by Ottomata:
[eventlogging@master] Resolve STATSD_HOST for sprockets.mixins.statsd to avoid too many DNS lookups

https://gerrit.wikimedia.org/r/366561
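For reference, a minimal sketch of the approach the patch title describes (resolve STATSD_HOST once at service startup so sprockets builds STATSD_ADDR from an IP); this is an assumption about the shape of the fix, not the merged change itself:

import os
import socket

# Sketch only: resolve the configured statsd hostname once, before
# sprockets.clients.statsd reads the environment, so subsequent sendto()
# calls target a raw IP and no longer trigger per-metric DNS lookups.
statsd_host = os.environ.get('STATSD_HOST')
if statsd_host:
    try:
        os.environ['STATSD_HOST'] = socket.gethostbyname(statsd_host)
    except socket.gaierror:
        # If resolution fails at startup, keep the hostname and fall back
        # to the old per-call behaviour rather than crashing the service.
        pass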

FYI: caching the statsd name forever also has its problems when failing over statsd, since many services then need a restart (i.e. T88997). Caching for a limited amount of time, or better yet respecting the DNS TTL, would be best.


Yep, this will not be the last code fix; I'll follow up with upstream to fix the plugin, and then hopefully we'll remove the hack.

elukey lowered the priority of this task from High to Low. Aug 1 2017, 11:48 AM
elukey moved this task from In Progress to Stalled on the User-Elukey board.
elukey edited projects, added Analytics; removed Analytics-Kanban, Patch-For-Review.

The remaining step is to explore adding logic that caches the statsd IP only for a limited amount of time.
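A hypothetical sketch of what that could look like (the function name and the 300-second TTL are ours, purely for illustration): cache the resolved address and re-resolve it once the cached entry is older than the chosen TTL.

import socket
import time

CACHE_TTL = 300.0  # seconds; arbitrary value for illustration
_cached_addr = None
_cached_at = 0.0

def statsd_addr(hostname, port):
    """Return (ip, port), re-resolving the hostname once CACHE_TTL expires."""
    global _cached_addr, _cached_at
    now = time.time()
    if _cached_addr is None or now - _cached_at > CACHE_TTL:
        _cached_addr = (socket.gethostbyname(hostname), port)
        _cached_at = now
    return _cached_addr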

elukey removed elukey as the assignee of this task. Sep 4 2017, 2:25 PM
Milimetric raised the priority of this task from Low to Needs Triage. Apr 2 2018, 4:03 PM
Milimetric triaged this task as Low priority.
Milimetric moved this task from Wikistats to Event Platform on the Analytics board.

Eventbus is on its way to being decommissioned in favor of event-gate; I'd close this task since it's probably no longer relevant.