nscd does not cache localhost causing high CPU usage when localhost is often resolved
Closed, Declined, Public

Description

While optimizing CPU usage of the Swift backends on the beta cluster (T160990), I found that nscd was taking a noticeable share of the CPU.

Swift processes emit metrics to the statsd host localhost. However, it seems nscd always tries to resolve that name, which keeps it busy.

Workaround

Change the statsd host from localhost to 127.0.0.1: https://gerrit.wikimedia.org/r/#/c/358799/

swift::proxy::statsd_host: '127.0.0.1'
swift::storage::statsd_host: '127.0.0.1'
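
Why the IP literal helps can be sketched in Python, the language Swift is written in. This is a minimal illustration, not Swift's actual statsd client: the helper names `is_ip_literal` and `resolve_once` are hypothetical. It assumes the cost comes from resolving the host name repeatedly, and shows that an IP literal can be passed to `getaddrinfo()` with `AI_NUMERICHOST`, which forbids any name-service lookup.

```python
import ipaddress
import socket


def is_ip_literal(host):
    """Return True when host is already an IPv4 or IPv6 literal."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def resolve_once(host, port):
    """Resolve host a single time and return an (address, port) pair (IPv4, UDP).

    For IP literals AI_NUMERICHOST is set, so no name-service lookup
    (and hence no nscd request) is needed. Hypothetical helper.
    """
    flags = socket.AI_NUMERICHOST if is_ip_literal(host) else 0
    infos = socket.getaddrinfo(
        host, port, socket.AF_INET, socket.SOCK_DGRAM, 0, flags
    )
    # Each entry is (family, type, proto, canonname, sockaddr).
    return infos[0][4]


if __name__ == "__main__":
    # Resolve once at startup, then reuse the address for every sendto().
    statsd_addr = resolve_once("127.0.0.1", 8125)
    print(statsd_addr)
```

A client that calls something like `resolve_once()` at startup and reuses the result would avoid a lookup per metric even when configured with a host name.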

Traces

From nscd debug log:

Tue 13 Jun 2017 07:14:14 PM UTC - 30055: handle_request: request received (Version = 2) from PID 32224
Tue 13 Jun 2017 07:14:14 PM UTC - 30055: 	GETHOSTBYNAME (localhost)
Tue 13 Jun 2017 07:14:14 PM UTC - 30055: Haven't found "localhost" in hosts cache!

It does not cache localhost :-(
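
To gauge how often this happens, the misses can be tallied from the debug log. The following is a minimal sketch, assuming the log lines keep exactly the format shown above; the function name `count_cache_misses` is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as: Haven't found "localhost" in hosts cache!
MISS_RE = re.compile(r'Haven\'t found "([^"]+)" in (\w+) cache!')


def count_cache_misses(lines):
    """Count cache misses per (database, key) pair in nscd debug log lines."""
    misses = Counter()
    for line in lines:
        match = MISS_RE.search(line)
        if match:
            key, database = match.group(1), match.group(2)
            misses[(database, key)] += 1
    return misses


if __name__ == "__main__":
    sample = [
        'Tue 13 Jun 2017 07:14:14 PM UTC - 30055: \tGETHOSTBYNAME (localhost)',
        'Tue 13 Jun 2017 07:14:14 PM UTC - 30055: Haven\'t found "localhost" in hosts cache!',
    ]
    for (database, key), count in count_cache_misses(sample).items():
        print(database, key, count)
```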

PID 32224 being swift-container-replicator. From strace:

connect(10<socket:[1600099473]>, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
getsockname(10<socket:[1600099473]>, {sa_family=AF_INET, sin_port=htons(19341), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
sendto(8<socket:[1600099470]>, "swift.deployment-prep.deployment-ms-be04.container-replicator.timing:76.3509273529|ms", 85, 0, {sa_family=AF_INET, sin_port=htons(8125), sin_addr=inet_addr("127.0.0.1")}, 16) = 85

I.e. socket connections to localhost port 8125, which is statsd.

The Swift configuration had:

# egrep -R '^[^#].*8125' /etc/swift
/etc/swift/account-server.conf:log_statsd_port = 8125
/etc/swift/container-server.conf:log_statsd_port = 8125
/etc/swift/object-server.conf:log_statsd_port = 8125

Additionally, nscd checks the freshness of:

  • /etc/resolv.conf by using stat()
  • /etc/hosts via open() + fstat()

The corresponding strace output:

[pid 31946] read(15, "\2\0\0\0\4\0\0\0\n\0\0\0", 12) = 12
[pid 31946] read(15, "localhost\0", 10) = 10
[pid 31946] stat("/etc/resolv.conf", {st_mode=S_IFREG|0444, st_size=276, ...}) = 0
[pid 31946] open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 16
[pid 31946] fstat(16, {st_mode=S_IFREG|0644, st_size=295, ...}) = 0
[pid 31946] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fe4cb049000
[pid 31946] read(16, "127.0.0.1\tlocalhost\n::1\t\tlocalho"..., 4096) = 295
[pid 31946] read(16, "", 4096)          = 0
[pid 31946] close(16)                   = 0

This is due to /etc/nscd.conf having:

check-files           hosts    yes
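
Which databases have this file check enabled can be listed with a few lines of Python. This is a minimal sketch, assuming the usual `check-files <service> <yes|no>` line format of nscd.conf with `#` comments; the function name `parse_check_files` is hypothetical.

```python
def parse_check_files(conf_text):
    """Return {service: bool} for every check-files line in nscd.conf text."""
    settings = {}
    for raw in conf_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        parts = line.split()
        if len(parts) == 3 and parts[0] == "check-files":
            settings[parts[1]] = parts[2].lower() == "yes"
    return settings


if __name__ == "__main__":
    with open("/etc/nscd.conf") as conf:
        for service, enabled in parse_check_files(conf.read()).items():
            print(service, "yes" if enabled else "no")
```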

Event Timeline

Change 358799 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] swift: save nscd CPU by using IP address

https://gerrit.wikimedia.org/r/358799

For Beta-Cluster-Infrastructure the workaround https://gerrit.wikimedia.org/r/#/c/358799/ is applied on the local puppet master.

hashar closed this task as Declined. Mar 30 2018, 8:28 PM

No time to look into it, so let's archive this task.

Change 358799 abandoned by Hashar:
swift: save nscd CPU by using IP address

Reason:
Replacing 'localhost' by '127.0.0.1' definitely lowered the load on labs.

As for figuring out exactly why the nscd/labs DNS config is at fault, I have no idea. There is the archived task T171745 for history purposes.

https://gerrit.wikimedia.org/r/358799