
Eventbus does not gracefully handle changes in DNS recursors
Closed, Declined · Public

Description

Today, when Moritz depooled and rebooted acamar (a DNS recursor) in codfw, all of the kafka200[123] hosts failed to answer Pybal's ProxyFetch health checks, resulting in a brief outage.

Moritz rebooted acamar at ~11:38 UTC, and this is the first error in pybal logs on lvs2003:

[eventbus_8085 ProxyFetch] WARN: kafka2001.codfw.wmnet (enabled/up/pooled): Fetch failed, 30.000 s

The same happened to kafka2002; kafka2003 was already depooled due to the network maintenance performed earlier for the row-d switches.

@ema tried to depool acamar again from dns-rec-ln.codfw.wmnet (the resolver listed in the hosts' /etc/resolv.conf) at ~12:09 UTC, and the following entries appeared in the pybal logs:

Jul 19 12:09:37 lvs2003 pybal[45967]: [eventbus_8085 ProxyFetch] WARN: kafka2001.codfw.wmnet (enabled/up/pooled): Fetch failed, 6.060 s
Jul 19 12:09:37 lvs2003 pybal[45967]: [eventbus_8085] ERROR: Monitoring instance ProxyFetch reports server kafka2001.codfw.wmnet (enabled/up/pooled) down: Getting http://localhost/v1/topics took longer than 5 seconds.
Jul 19 12:09:37 lvs2003 pybal[45967]: [eventbus_8085 ProxyFetch] WARN: kafka2002.codfw.wmnet (enabled/up/pooled): Fetch failed, 5.001 s
Jul 19 12:09:37 lvs2003 pybal[45967]: [eventbus_8085] ERROR: Monitoring instance ProxyFetch reports server kafka2002.codfw.wmnet (enabled/up/pooled) down: Getting http://localhost/v1/topics took longer than 5 seconds.
Jul 19 12:09:37 lvs2003 pybal[45967]: [eventbus_8085] ERROR: Could not depool server kafka2002.codfw.wmnet because of too many down!
Jul 19 12:09:40 lvs2003 pybal[45967]: [eventbus_8085 ProxyFetch] WARN: kafka2003.codfw.wmnet (enabled/up/pooled): Fetch failed, 8.024 s
Jul 19 12:09:40 lvs2003 pybal[45967]: [eventbus_8085] ERROR: Monitoring instance ProxyFetch reports server kafka2003.codfw.wmnet (enabled/up/pooled) down: Getting http://localhost/v1/topics took longer than 5 seconds.
Jul 19 12:09:40 lvs2003 pybal[45967]: [eventbus_8085] ERROR: Could not depool server kafka2003.codfw.wmnet because of too many down!
Jul 19 12:09:47 lvs2003 pybal[45967]: [eventbus_8085] INFO: Server kafka2001.codfw.wmnet (enabled/partially up/not pooled) is up
Jul 19 12:09:55 lvs2003 pybal[45967]: [eventbus_8085 ProxyFetch] WARN: kafka2003.codfw.wmnet (enabled/partially up/pooled): Fetch failed, 5.001 s
Jul 19 12:09:55 lvs2003 pybal[45967]: [eventbus_8085 ProxyFetch] WARN: kafka2002.codfw.wmnet (enabled/partially up/pooled): Fetch failed, 8.040 s
Jul 19 12:10:10 lvs2003 pybal[45967]: [eventbus_8085 ProxyFetch] WARN: kafka2002.codfw.wmnet (enabled/partially up/pooled): Fetch failed, 5.001 s
Jul 19 12:10:12 lvs2003 pybal[45967]: [eventbus_8085 ProxyFetch] WARN: kafka2001.codfw.wmnet (enabled/up/pooled): Fetch failed, 5.001 s
Jul 19 12:10:12 lvs2003 pybal[45967]: [eventbus_8085] ERROR: Monitoring instance ProxyFetch reports server kafka2001.codfw.wmnet (enabled/up/pooled) down: Getting http://localhost/v1/topics took longer than 5 seconds.
Jul 19 12:10:12 lvs2003 pybal[45967]: [eventbus_8085] ERROR: Could not depool server kafka2001.codfw.wmnet because of too many down!
Jul 19 12:10:13 lvs2003 pybal[45967]: [eventbus_8085 ProxyFetch] WARN: kafka2003.codfw.wmnet (enabled/partially up/pooled): Fetch failed, 8.023 s
Jul 19 12:10:32 lvs2003 pybal[45967]: [eventbus_8085] INFO: Server kafka2001.codfw.wmnet (enabled/partially up/pooled) is up
Jul 19 12:10:32 lvs2003 pybal[45967]: [eventbus_8085] INFO: Leaving previously pooled but down server kafka2001.codfw.wmnet pooled
Jul 19 12:10:32 lvs2003 pybal[45967]: [eventbus_8085] INFO: Server kafka2002.codfw.wmnet (enabled/partially up/pooled) is up
Jul 19 12:10:32 lvs2003 pybal[45967]: [eventbus_8085] INFO: Leaving previously pooled but down server kafka2002.codfw.wmnet pooled
Jul 19 12:10:34 lvs2003 pybal[45967]: [eventbus_8085] INFO: Server kafka2003.codfw.wmnet (enabled/partially up/not pooled) is up

There might be an issue in the eventbus code in how it handles changes in DNS recursors.

Event Timeline

@ema and I set up an experiment: we added 10.64.32.155 statsd.eqiad.wmnet to /etc/hosts on kafka2002 (and not on the others) and then depooled acamar. Pybal registered ProxyFetch failures only for kafka200[13]. We repeated the experiment applying the hack to all of the hosts, and pybal didn't log anything bad.

The initial suspect was eventlogging's statsd writer, but @Ottomata told us that the writer is not used and that sprockets.mixins.statsd was the likely culprit. This is the plugin that Tornado uses to push metrics to statsd for each handled request. Digging a bit into its core dependency, sprockets.clients.statsd, we found these bits of code:

    global STATSD_ADDR
    connection_string = os.getenv('STATSD')
    if connection_string:
        url = urlparse.urlparse(connection_string)
        STATSD_ADDR = (url.hostname, url.port)
    else:
        STATSD_ADDR = (os.getenv('STATSD_HOST', 'localhost'),
                       int(os.getenv('STATSD_PORT', 8125)))

[...]

def _send(key, value, metric_type):
    """Send the specified value to the statsd daemon via UDP without a
    direct socket connection.
    :param str value: The properly formatted statsd counter value
    """
    if STATSD_PREFIX:
        key = '.'.join([STATSD_PREFIX, key])
    try:
        STATSD_SOCKET.sendto('{0}:{1}|{2}'.format(key,
                                                  value,
                                                  metric_type).encode(),
                             STATSD_ADDR)

We set Environment="STATSD_HOST=statsd.eqiad.wmnet" in eventbus' systemd unit, so the value is a hostname rather than a raw IP. The plugin apparently never resolves it up front, so sendto() ends up issuing a DNS lookup for every _send() call.
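A minimal illustration of why that matters (our own example, not taken from the sprockets source, assuming standard CPython socket behaviour): passing a hostname to a UDP socket's sendto() makes the socket module resolve it again on every call, whereas passing an IP needs no lookup.

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Hostname destination: every sendto() goes through getaddrinfo(), so each
# metric can trigger a DNS query against the configured recursor.
sock.sendto(b'eventbus.test:1|c', ('statsd.eqiad.wmnet', 8125))

# IP destination: no lookup is needed, the datagram is sent directly.
sock.sendto(b'eventbus.test:1|c', ('10.64.32.155', 8125))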

Change 366561 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventlogging@master] Resolve STATSD_HOST for sprockets.mixins.statsd to avoid too many DNS lookups

https://gerrit.wikimedia.org/r/366561

Change 366561 merged by Ottomata:
[eventlogging@master] Resolve STATSD_HOST for sprockets.mixins.statsd to avoid too many DNS lookups

https://gerrit.wikimedia.org/r/366561
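For reference, a minimal sketch of the approach the patch title describes (resolve STATSD_HOST once at service startup so sprockets builds STATSD_ADDR from an IP); this is an assumption about the shape of the fix, not the merged change itself:

import os
import socket

# Sketch only: resolve the configured statsd hostname once, before
# sprockets.clients.statsd reads the environment, so subsequent sendto()
# calls target a raw IP and no longer trigger per-metric DNS lookups.
statsd_host = os.environ.get('STATSD_HOST')
if statsd_host:
    try:
        os.environ['STATSD_HOST'] = socket.gethostbyname(statsd_host)
    except socket.gaierror:
        # If resolution fails at startup, keep the hostname and fall back
        # to the old per-call behaviour rather than crashing the service.
        pass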

FYI: caching the statsd name forever also has its problems when failing over statsd, since many services then need a restart (i.e. T88997). Caching for a limited amount of time, or better yet respecting the DNS TTL, would be best.


Yep, this will not be the last code fix; I'll follow up with upstream to fix the plugin, and then hopefully we'll remove the hack.

elukey lowered the priority of this task from High to Low. Aug 1 2017, 11:48 AM
elukey moved this task from In Progress to Stalled on the User-Elukey board.
elukey edited projects, added Analytics; removed Analytics-Kanban, Patch-For-Review.

The remaining step is to explore adding logic that caches the statsd IP only for a limited amount of time.
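A hypothetical sketch of what that could look like (the function name and the 300-second TTL are ours, purely for illustration): cache the resolved address and re-resolve it once the cached entry is older than the chosen TTL.

import socket
import time

CACHE_TTL = 300.0  # seconds; arbitrary value for illustration
_cached_addr = None
_cached_at = 0.0

def statsd_addr(hostname, port):
    """Return (ip, port), re-resolving the hostname once CACHE_TTL expires."""
    global _cached_addr, _cached_at
    now = time.time()
    if _cached_addr is None or now - _cached_at > CACHE_TTL:
        _cached_addr = (socket.gethostbyname(hostname), port)
        _cached_at = now
    return _cached_addr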

elukey removed elukey as the assignee of this task. Sep 4 2017, 2:25 PM
Milimetric raised the priority of this task from Low to Needs Triage. Apr 2 2018, 4:03 PM
Milimetric triaged this task as Low priority.
Milimetric moved this task from Wikistats to Event Platform on the Analytics board.

Eventbus is on its way to being decommissioned in favor of event-gate; I'd close this task since it's probably no longer relevant.