
TCP traffic increase for DNS over TLS breached a low limit for max open files on authdns1001/2001
Open, Medium, Public

Description

TCP traffic ramped up during the past couple of days on authdns1001 and authdns2001, leading to the following error (logged in /var/log/daemon.log):

authdns2001 gdnsd[3268]: TCP DNS: accept() failed: Too many open files

We only noticed the problem because of the root partition disk space alerts :(

The max open files settings for gdnsd were too low:

Max open files            1024                 524288               files
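The line above is in the format of /proc/<pid>/limits. A minimal sketch of checking the effective limit of a running daemon, assuming a Linux host; the helper names `parse_max_open_files` and `read_limits` are made up for illustration:

```python
import re
from typing import Optional, Tuple


def parse_max_open_files(limits_text: str) -> Optional[Tuple[str, str]]:
    """Return (soft, hard) from the 'Max open files' row, or None if absent.

    Values are kept as strings because a limit may read 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns are separated by runs of spaces: name, soft, hard, units.
            fields = re.split(r"\s{2,}", line.strip())
            return fields[1], fields[2]
    return None


def read_limits(pid: int) -> str:
    """Read the limits table of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as fh:
        return fh.read()


# The row quoted in this task:
sample = "Max open files            1024                 524288               files"
print(parse_max_open_files(sample))  # ('1024', '524288')
```

Checking the running process rather than the unit file matters here, because a new LimitNOFILE only takes effect after the daemon is restarted.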

@Vgutierrez applied a hot fix adding LimitNOFILE=500000 to the gdnsd unit and restarting the daemons, with puppet disabled.

This task has been created to track two things:

  1. Permanent fix for LimitNOFILE=500000 (puppet override since the systemd unit is shipped with the package?)
  2. Review dns boxes alerting/monitoring and figure out if we need more alarms.

Event Timeline

elukey triaged this task as High priority. Thu, Oct 29, 8:24 AM
elukey created this task.
Restricted Application added a project: Operations. Thu, Oct 29, 8:24 AM
Restricted Application added a subscriber: Aklapper.

Change 637359 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] gdnsd: add systemd::service to apply a custom LimitNOFILE

https://gerrit.wikimedia.org/r/637359

Change 637359 merged by Elukey:
[operations/puppet@production] gdnsd: add systemd::service to apply a custom LimitNOFILE

https://gerrit.wikimedia.org/r/637359

Mentioned in SAL (#wikimedia-operations) [2020-10-29T09:52:18Z] <elukey> add gdnsd.service to all gdnsd hosts (with LimitNOFILE=infinity as override) - no daemon restart done - T266746

Current status:

  • dns* nodes have LimitNOFILE=infinity but since daemons have not been restarted, they are running with DefaultLimitNOFILE=1024:524288
  • authdns1001/authdns2001 nodes have LimitNOFILE=infinity but they are running with Valentin's hot fix LimitNOFILE=500000

Mentioned in SAL (#wikimedia-operations) [2020-10-29T13:29:32Z] <bblack> staggered restart of gdnsd on dns[12345]002 (1/2 recursors in each DC) - T266746

Mentioned in SAL (#wikimedia-operations) [2020-10-29T13:38:03Z] <bblack> staggered restart of gdnsd on dns[12345]001 (1/2 recursors in each DC) - T266746

Mentioned in SAL (#wikimedia-operations) [2020-10-29T13:46:05Z] <bblack> authdns2001 - restart gdnsd - T266746

Mentioned in SAL (#wikimedia-operations) [2020-10-29T13:52:04Z] <bblack> authdns1001 - restart gdnsd - T266746

All the authdns hosts have been restarted with the infinite limit applied. There's been some IRC discussion about a few possible spinoff tickets here:

  • For all hosts:
    • alerting on bad trends towards disk-full before they happen: volans> yeah, I was also thinking of a more general disk alert that will alert if the disk will be full in N days/hours and the trend doesn't change after N days/hours based on the prediction of being full
    • there's probably some kernel level socket error metric that was incrementing dramatically from the accept failures, maybe we should be graphing and/or alerting on that as well?
  • for gdnsd-the-software:
    • self-ratelimiting log outputs to some sanity level would be a win against insane spam (I saw ~5k/sec from this case in our syslog output)
    • on seeing specifically EMFILE from accept, the daemon could at least attempt to close the most-idle connection to make room, when possible. It already has code to do this; as it stands, just looping back to accept() causes it to ceaselessly spam the same error for a single connection.
    • Add stats counters for acceptfail of some kind, because right now that's a black hole in gdnsd's stats (we have stats for new conns, and for all conditions that lead to conn-close, but not for acceptfail).
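The three gdnsd-side ideas above (rate-limited logging, evicting the most-idle connection on EMFILE, and an acceptfail counter) can be sketched together. This is an illustration in Python, not gdnsd's actual C implementation; the class `ConnTracker` and all of its members are hypothetical:

```python
import errno
import time
from typing import Callable, Dict, List, Optional


class ConnTracker:
    """Hypothetical helper: tracks connections by last-activity time."""

    def __init__(self, log_interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.last_active: Dict[str, float] = {}
        self.acceptfail = 0            # stats counter for failed accept() calls
        self.suppressed = 0            # messages dropped since the last emitted one
        self.log_lines: List[str] = []  # stand-in for syslog output
        self._log_interval = log_interval
        self._clock = clock
        self._last_log: Optional[float] = None

    def touch(self, conn_id: str) -> None:
        """Record activity on a connection."""
        self.last_active[conn_id] = self._clock()

    def _log_ratelimited(self, msg: str) -> None:
        """Emit at most one message per log_interval, counting the rest."""
        now = self._clock()
        if self._last_log is not None and now - self._last_log < self._log_interval:
            self.suppressed += 1
            return
        if self.suppressed:
            msg += f" ({self.suppressed} similar messages suppressed)"
        self.log_lines.append(msg)
        self._last_log = now
        self.suppressed = 0

    def on_accept_error(self, err: OSError) -> Optional[str]:
        """Count and log the failure; on EMFILE, pick the most-idle victim.

        Returns the id of the connection the caller should close before
        retrying accept(), or None if there is nothing to evict.
        """
        self.acceptfail += 1
        self._log_ratelimited(f"TCP DNS: accept() failed: {err.strerror}")
        if err.errno == errno.EMFILE and self.last_active:
            victim = min(self.last_active, key=self.last_active.get)
            del self.last_active[victim]
            return victim
        return None
```

In a real accept loop, the caller would close the returned connection's socket to free a file descriptor and then call accept() again, instead of looping on the same error.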
Vgutierrez lowered the priority of this task from High to Medium. Thu, Oct 29, 3:12 PM
ema moved this task from Triage to Bug Reports on the Traffic board. Fri, Oct 30, 9:59 AM
BBlack added a comment. Sun, Nov 8, 2:53 PM
  • for gdnsd-the-software:

The gdnsd bits are addressed in a handful of commits here (not yet merged to master, still double-checking them, feedback welcome!)

https://github.com/gdnsd/gdnsd/pull/198

Change 640219 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] authdns: raise tcp_clients_per_thread to 4K

https://gerrit.wikimedia.org/r/640219

Change 640219 merged by BBlack:
[operations/puppet@production] authdns: raise tcp_clients_per_thread to 4K

https://gerrit.wikimedia.org/r/640219

Various related gdnsd fixes were deployed to production with upstream version 3.4.1.

What remains here is possibly opening side tickets for these items mentioned earlier:

alerting on bad trends towards disk-full before they happen: volans> yeah, I was also thinking of a more general disk alert that will alert if the disk will be full in N days/hours and the trend doesn't change after N days/hours based on the prediction of being full
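One way to read the predictive disk alert described above is a linear extrapolation of free space, similar in spirit to PromQL's `predict_linear`. The sketch below is only an illustration under that assumption; the function names and the sample layout are made up:

```python
from typing import Optional, Sequence, Tuple


def seconds_until_full(samples: Sequence[Tuple[float, float]]) -> Optional[float]:
    """Estimate seconds from the last sample until free space reaches zero.

    samples: (unix_time, free_bytes) pairs in time order.
    Uses an ordinary least-squares line fit. Returns None when there are
    too few samples or free space is not shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_f = sum(f for _, f in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (f - mean_f) for t, f in samples) / var_t
    if slope >= 0:
        return None  # flat or growing free space: no predicted fill time
    intercept = mean_f - slope * mean_t
    t_zero = -intercept / slope  # time at which the fitted line hits zero
    return max(0.0, t_zero - samples[-1][0])


def should_alert(samples: Sequence[Tuple[float, float]], horizon_s: float) -> bool:
    """True if the disk is predicted to fill within horizon_s seconds."""
    eta = seconds_until_full(samples)
    return eta is not None and eta <= horizon_s


# Free space dropping by 1 byte per second from 100 bytes:
print(seconds_until_full([(0, 100), (10, 90), (20, 80)]))  # 80.0
```

A trend-based check like this would have fired while the log spam was filling the root partition, rather than only once a fixed usage threshold was crossed.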

there's probably some kernel level socket error metric that was incrementing dramatically from the accept failures, maybe we should be graphing and/or alerting on that as well?
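Plausible candidates for such a metric are the `ListenOverflows` and `ListenDrops` counters in the `TcpExt` section of /proc/net/netstat, which increase when a listening socket's accept queue is full. Whether they actually moved during this incident is an assumption, not something confirmed in this task. A minimal sketch for reading them, with a hypothetical `parse_netstat` helper and made-up sample values:

```python
from typing import Dict


def parse_netstat(text: str) -> Dict[str, Dict[str, int]]:
    """Parse /proc/net/netstat, which alternates header and value lines.

    Each pair looks like:
        TcpExt: NameA NameB ...
        TcpExt: 1 2 ...
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    result: Dict[str, Dict[str, int]] = {}
    for header, values in zip(lines[0::2], lines[1::2]):
        section, names = header.split(":", 1)
        value_section, numbers = values.split(":", 1)
        if section != value_section:
            continue  # unexpected layout: skip rather than mis-pair columns
        result[section] = dict(zip(names.split(), map(int, numbers.split())))
    return result


# Synthetic sample for illustration only (not real host data):
sample = (
    "TcpExt: SyncookiesSent ListenOverflows ListenDrops\n"
    "TcpExt: 0 12 34\n"
    "IpExt: InNoRoutes InTruncatedPkts\n"
    "IpExt: 0 0\n"
)
stats = parse_netstat(sample)
print(stats["TcpExt"]["ListenOverflows"], stats["TcpExt"]["ListenDrops"])  # 12 34
```

On a real host the input would come from reading /proc/net/netstat, and an alert would look at the rate of change of these counters rather than their absolute values.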