
TCP traffic increase for DNS over TLS breached a low limit for max open files on authdns1001/2001
Closed, Resolved · Public

Description

TCP traffic ramped up over the past couple of days on authdns1001 and authdns2001, leading to the following errors (logged in /var/log/daemon.log):

authdns2001 gdnsd[3268]: TCP DNS: accept() failed: Too many open files

We only noticed the problem because of the root partition disk space alerts :(

The max open files limit for gdnsd was too low:

Limit                     Soft Limit           Hard Limit           Units
Max open files            1024                 524288               files
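
For reference, the soft limit (1024) is what the running process actually hits; a quick way to check this on a live daemon looks roughly like the following (illustrative commands, not taken from the task):

grep 'Max open files' /proc/$(pidof gdnsd)/limits   # effective limits of the running process
ls /proc/$(pidof gdnsd)/fd | wc -l                  # number of fds it currently has open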

@Vgutierrez applied a hotfix, adding LimitNOFILE=500000 to the gdnsd unit and restarting the daemons with puppet disabled.
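
An override like that is typically shipped as a systemd drop-in; a rough sketch is below (the exact file the hotfix touched isn't recorded in this task, so treat the path as illustrative):

# /etc/systemd/system/gdnsd.service.d/override.conf  (illustrative drop-in path)
[Service]
LimitNOFILE=500000

systemctl daemon-reload    # re-read unit files and drop-ins
systemctl restart gdnsd    # the new limit only applies to processes started afterwards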

This task has been created to track two things:

  1. Permanent fix for LimitNOFILE=500000 (a puppet override, since the systemd unit is shipped with the package?)
  2. Review dns boxes alerting/monitoring and figure out if we need more alarms.

Event Timeline

elukey triaged this task as High priority. Oct 29 2020, 8:24 AM
elukey created this task.
Restricted Application added a subscriber: Aklapper.

Change 637359 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] gdnsd: add systemd::service to apply a custom LimitNOFILE

https://gerrit.wikimedia.org/r/637359

Change 637359 merged by Elukey:
[operations/puppet@production] gdnsd: add systemd::service to apply a custom LimitNOFILE

https://gerrit.wikimedia.org/r/637359

Mentioned in SAL (#wikimedia-operations) [2020-10-29T09:52:18Z] <elukey> add gdnsd.service to all gdnsd hosts (with LimitNOFILE=infinity as override) - no daemon restart done - T266746

Current status:

  • dns* nodes have LimitNOFILE=infinity in their unit override, but since the daemons have not been restarted they are still running with the default DefaultLimitNOFILE=1024:524288
  • authdns1001/authdns2001 nodes have LimitNOFILE=infinity but are still running with Valentin's hotfix value of LimitNOFILE=500000 (configured vs. effective limits can be compared as sketched below)
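
The configured vs. effective values can be compared roughly like this (illustrative commands; a changed LimitNOFILE only takes effect when the service is restarted):

systemctl show gdnsd -p LimitNOFILE                 # what the unit/override currently asks for
grep 'Max open files' /proc/$(pidof gdnsd)/limits   # what the running process actually has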

Mentioned in SAL (#wikimedia-operations) [2020-10-29T13:29:32Z] <bblack> staggered restart of gdnsd on dns[12345]002 (1/2 recursors in each DC) - T266746

Mentioned in SAL (#wikimedia-operations) [2020-10-29T13:38:03Z] <bblack> staggered restart of gdnsd on dns[12345]001 (1/2 recursors in each DC) - T266746

Mentioned in SAL (#wikimedia-operations) [2020-10-29T13:46:05Z] <bblack> authdns2001 - restart gdnsd - T266746

Mentioned in SAL (#wikimedia-operations) [2020-10-29T13:52:04Z] <bblack> authdns1001 - restart gdnsd - T266746

All of the authdns hosts have been restarted with the infinite limit applied. There's been some IRC discussion about a few possible spinoff tickets here:

  • For all hosts:
    • alerting on bad trends towards disk-full before they happen: volans> yeah, I was also thinking of a more general disk alert that will alert if the disk will be full in N days/hours and the trend doesn't change after N days/hours based on the prediction of being full
    • there's probably some kernel level socket error metric that was incrementing dramatically from the accept failures, maybe we should be graphing and/or alerting on that as well?
  • for gdnsd-the-software:
    • self-ratelimiting its log output to some sane level would be a win against log spam (I saw ~5k messages/sec from this case in our syslog output)
    • on seeing specifically EMFILE from accept(), the daemon could at least attempt to close the most-idle connection to make room, when possible. It already has code to do this, and just looping back to accept() causes it to ceaselessly spam the same error for a single connection.
    • Add stats counters for acceptfail of some kind, because right now that's a black hole in gdnsd's stats (we have stats for new conns, and for all conditions that lead to conn-close, but not for acceptfail).
Vgutierrez lowered the priority of this task from High to Medium. Oct 29 2020, 3:12 PM

For the gdnsd-the-software items:

The gdnsd bits are addressed in a handful of commits here (not yet merged to master; still double-checking them, feedback welcome!):

https://github.com/gdnsd/gdnsd/pull/198

Change 640219 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] authdns: raise tcp_clients_per_thread to 4K

https://gerrit.wikimedia.org/r/640219

Change 640219 merged by BBlack:
[operations/puppet@production] authdns: raise tcp_clients_per_thread to 4K

https://gerrit.wikimedia.org/r/640219
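
For context, tcp_clients_per_thread caps how many TCP client connections each gdnsd listener thread will keep open, so the daemon's file-descriptor needs scale roughly with the number of TCP listener threads times this value, which is why it interacts with the NOFILE limit above. The change corresponds roughly to a stanza like the following in the gdnsd config; the real layout lives in our puppet templates and is only sketched here:

options => {
    tcp_clients_per_thread => 4096
}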

Various related gdnsd fixes were deployed to production with version 3.4.1 of upstream.

What remains here is possibly spinning up some side-tickets about the items mentioned earlier:

alerting on bad trends towards disk-full before they happen: volans> yeah, I was also thinking of a more general disk alert that will alert if the disk will be full in N days/hours and the trend doesn't change after N days/hours based on the prediction of being full
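
One possible shape for such an alert, assuming the node_exporter filesystem metrics already scraped into Prometheus (metric/label names and thresholds here are assumptions, not an agreed design):

# fires if, extrapolating the last 6h trend, the filesystem would be full within 4 hours
predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[6h], 4 * 3600) < 0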

there's probably some kernel level socket error metric that was incrementing dramatically from the accept failures, maybe we should be graphing and/or alerting on that as well?
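
As a starting point, the kernel's listen-queue counters are the likely candidates; they can be read roughly like this (counter availability varies by kernel, so treat this as a sketch of where to look):

nstat -az TcpExtListenOverflows TcpExtListenDrops   # these climb when accept() keeps failing and the backlog fills
netstat -s | grep -i listen                         # the same counters in human-readable form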

There are some anomalies in the network graphs on authdns1001 that I hadn't noticed until today, going all the way back to Oct 26, which is probably around when this started. I'm not sure whether they're artificial or not (nothing seems to be wrong), but I'm going to do a precautionary reboot anyway. More likely than not it's something to do with stats reporting itself, which may have become a bit confused while the root disk was out of space and never truly recovered, since we never rebooted.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

@BBlack Is there a desire to create those side-tickets (disk-full trend alerting and kernel socket-error metrics)? It sounds like that's the only reason this ticket was kept open after the fixes were applied, so it'd be good to capture them before closing this ticket if they're wanted.

BCornwall claimed this task.

Setting this as resolved, as the fixes were applied. If there's a desire for general disk-trend alerting or kernel-level socket error metrics, please file new tickets. Thanks!