deployment-ms-be03.deployment-prep and deployment-ms-be04.deployment-prep have high load / system CPU
Closed, ResolvedPublic

Description

The beta cluster swift backend instances show high load and system CPU usage. There must be something wrong with them.

Symptoms

Jessie / new
deployment-ms-be03.deployment-prep: Prometheus (1 day)
deployment-ms-be04.deployment-prep: Prometheus (1 day)

Load:

deployment-ms-be03.deployment-prep.eqiad.wmflabs:
 08:54:01 up 12 days, 19:40,  0 users,  load average: 5.27, 3.87, 3.67
deployment-ms-be04.deployment-prep.eqiad.wmflabs:
 08:54:20 up 12 days, 21:11,  0 users,  load average: 7.59, 6.35, 5.61

Maybe the Swift services have too many workers for the labs instances?

deployment-ms-be01.deployment-prep.eqiad.wmflabs:
    /etc/swift/account-server.conf:workers = 8
    /etc/swift/container-server.conf:workers = 8
    /etc/swift/object-server.conf:workers = 100
deployment-ms-be02.deployment-prep.eqiad.wmflabs:
    /etc/swift/account-server.conf:workers = 8
    /etc/swift/container-server.conf:workers = 8
    /etc/swift/object-server.conf:workers = 100
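As a quick experiment, one backend could be given fewer object-server workers and reloaded. This is only a sketch of a one-off manual test: the workers value is managed statically by puppet, so a hand edit would be reverted on the next puppet run.

# hypothetical one-off test on a single backend, e.g. deployment-ms-be01
sudo sed -i 's/^workers = 100$/workers = 30/' /etc/swift/object-server.conf
sudo swift-init object reload   # graceful reload of the object server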

Fix candidates

The summary:

  • number of containers

There are ~20k containers, which cause the replicator to issue a lot of stat() calls and similar. Low-hanging fruit are the -deleted ones: 1296 containers per wiki.

That is controlled from MediaWiki config which uses a shard level of 2 pretty much everywhere with the exception of $wgLocalFileRepo which uses 3 levels for deleted files:

wmf-config/filebackend.php
$wgLocalFileRepo = [
    'class'             => 'LocalRepo',
    'name'              => 'local',
    'backend'           => 'local-multiwrite',
    'url'               => $wgUploadBaseUrl ? $wgUploadBaseUrl . $wgUploadPath : $wgUploadPath,
    'scriptDirUrl'      => $wgScriptPath,
    'hashLevels'        => 2,
...
    'deletedHashLevels' => 3,
...
];

On beta it would be nice to lower it to two. The question is whether we have a way to migrate containers. Then again, given it is beta and they are just deleted files, we can probably just delete them all.
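If we go the delete route, a rough sketch with the swift CLI could look like the following. This is hypothetical and destructive: it assumes the *-local-deleted.* shard naming shown in the container listing further below, and that nothing on beta still needs those deleted files.

. /etc/swift/account_AUTH_mw.env
# remove every sharded *-local-deleted.* container together with its objects
for c in $(swift list | grep -- '-local-deleted\.'); do
    swift delete "$c"
done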

  • lower replication passes

Running the replication less often would help. There is a puppet patch to let us tweak Swift configs via hiera and, on beta, change the interval between passes to 300 seconds with only 1 concurrent process. That has been applied on beta and nicely reduced the load. https://gerrit.wikimedia.org/r/344387
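For reference, the effective settings on the backends should end up looking roughly like this once the hiera override is applied. This is a sketch only; note that older Swift releases spell the object replicator interval option run_pause, with interval accepted later as an alias.

# /etc/swift/container-server.conf (similarly in the [object-replicator] section of object-server.conf)
[container-replicator]
interval = 300
concurrency = 1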

  • migrate Swift to Jessie

On beta the three Swift instances are on Ubuntu Trusty while production has switched to Jessie. That has to be done eventually and might bring optimizations in the replication pass. Filed T162247.

  • instances on different labvirt

Both ms-be instances are on the same labvirt (T161083). Though if we create Jessie instances they will most probably end up on different labvirts.

  • nscd eating CPU

gethostbyname('localhost') is not cached by nscd and causes a lot of disk access. The lookups come from statsd metrics. Worked around on beta by disabling statsd entirely.
Potential fix is to use 127.0.0.1 https://gerrit.wikimedia.org/r/#/c/358799/

Restricted Application added a subscriber: Aklapper. Mar 21 2017, 12:41 PM

@hashar indeed the workers = 100 value is static in puppet; I'm not going to have time for it, but please feel free to play with the setting and see if that makes a difference!

Mentioned in SAL (#wikimedia-releng) [2017-03-22T08:43:12Z] <hashar> deployment-ms-be01: swift-init reload object - T160990

Mentioned in SAL (#wikimedia-releng) [2017-03-22T08:45:24Z] <hashar> deployment-ms-be01: swift-init reload container - T160990

Mentioned in SAL (#wikimedia-releng) [2017-03-22T08:48:00Z] <hashar> deployment-ms-be01: swift-init reload all - T160990

On deployment-ms-be01 I reloaded the object server with 30 workers, which might have helped. There is apparently some replication going on between the two back ends. Will let them settle.

Twist regarding the per-process CPU %:

  • htop shows it relative to a single CPU, e.g. 80%.
  • top defaults to a percentage of the whole host, so with 8 CPUs that same 80% shows as 10%; in Irix mode (toggled with I) it is relative to a single CPU.

As best I can tell:

  • lowering the number of workers might help
  • most of the time is spent in these processes:
    • swift-container-replicator
    • swift-object-replicator

So my guess is that both processes keep scanning all the files and attempting to replicate them over and over. Per the Swift Deployment Guide there are settings to tweak ionice and the replication interval. That might help.
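A quick way to see what is currently configured on a backend (a sketch; options left at their defaults are usually simply absent from the files):

grep -E '^(interval|run_pause|concurrency|ionice)' /etc/swift/*-server.conf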

Change 344387 had a related patch set uploaded (by Hashar):
[operations/puppet@production] swift: lower replication interval for beta

https://gerrit.wikimedia.org/r/344387

Mentioned in SAL (#wikimedia-releng) [2017-03-23T14:02:57Z] <hashar> deployment-ms-be01 and deployment-ms-be02 : Lower Swift replicator on, upgrade package, reboot hosts. T160990

Each instance uses 300% user CPU and 100% system CPU, so potentially 8 cores out of labvirt1004's 24 cores. All that apparently just for replicating via rsync 23000 sqlite files that are barely changing.

Will revisit the Prometheus graph later on and see how it has evolved after the patches above.

From what I understand, Swift replication ends up continuously stat()ing all the container and object sqlite files.

We have 18990 containers based on: . /etc/swift/account_AUTH_mw.env && swift list

deployment-ms-fe01$ . /etc/swift/account_AUTH_mw.env && swift list|cut -d\. -f1|uniq -c|egrep -v ' 1 '
    256 global-data-math-render
      2 wikimedia-commons-local-public
   1296 wikipedia-commons-local-deleted
    256 wikipedia-commons-local-public
    256 wikipedia-commons-local-temp
    256 wikipedia-commons-local-thumb
    256 wikipedia-commons-local-transcoded
   1296 wikipedia-de-local-deleted
    256 wikipedia-de-local-public
    256 wikipedia-de-local-temp
    256 wikipedia-de-local-thumb
    256 wikipedia-de-local-transcoded
   1296 wikipedia-en-local-deleted
    256 wikipedia-en-local-public
    256 wikipedia-en-local-temp
    256 wikipedia-en-local-thumb
    256 wikipedia-en-local-transcoded
   1296 wikipedia-he-local-deleted
    256 wikipedia-he-local-public
    256 wikipedia-he-local-temp
    256 wikipedia-he-local-thumb
    256 wikipedia-he-local-transcoded
   1296 wikipedia-ja-local-deleted
    256 wikipedia-ja-local-public
    256 wikipedia-ja-local-temp
    256 wikipedia-ja-local-thumb
    256 wikipedia-ja-local-transcoded
   1296 wikipedia-ru-local-deleted
    256 wikipedia-ru-local-public
    256 wikipedia-ru-local-temp
    256 wikipedia-ru-local-thumb
    256 wikipedia-ru-local-transcoded
   1296 wikipedia-uk-local-deleted
    256 wikipedia-uk-local-public
    256 wikipedia-uk-local-temp
    256 wikipedia-uk-local-thumb
    256 wikipedia-uk-local-transcoded
   1296 wikipedia-zh-local-deleted
    256 wikipedia-zh-local-public
    256 wikipedia-zh-local-temp
    256 wikipedia-zh-local-thumb
    256 wikipedia-zh-local-transcoded

More than half (10394) are *local-deleted*.
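A quick way to recount those (assuming the same environment file as above; the figure should come out close to the 10394 mentioned):

. /etc/swift/account_AUTH_mw.env && swift list | grep -c -- '-local-deleted'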

@godog is there a way to purge those deleted containers somehow? On beta maybe we could just use a single sqlite database instead of the namespace entries aa - zz? That would save a lot of files and hence stat() calls.

hashar updated the task description. Mar 24 2017, 1:03 PM

Also found out via swift list --lh that most containers are actually empty. Most probably due to container-server.conf having db_preallocation = on.

fgiunchedi added a comment (edited). Mar 27 2017, 1:45 PM

[...]
@godog is there a way to purge those deleted containers somehow? On beta maybe we could just use a single sqlite database instead of the namespace entries aa - zz? That would save a lot of files and hence stat() calls.

Not on the swift side, the containers are managed (created) by mediawiki entirely

edit: Also I think with your workers change the load average is now at an acceptable level; further optimizations are likely not worth it, unless it is somehow a problem?

Not on the swift side, the containers are managed (created) by mediawiki entirely

Maybe the hash level. I guess that is determined in the wgFileBackend settings of mediawiki-config; then I don't know how we could merge the containers on the swift side.

Also I think with your workers change the load average is now at an acceptable level; further optimizations are likely not worth it, unless it is somehow a problem?

Yup, the load is way nicer and under the number of CPUs, so that is definitely an improvement. I guess I can polish up the puppet patch so we can change the number of Swift workers via hiera.

What puzzles me is that the Prometheus probe reports 150% user CPU and 50+% system CPU, while on the instance it is at 25% usage of 8 virtual CPUs. So that is two labs CPUs being used per instance, or a waste of 4 CPUs on the labs hosts.

From top:

         user      system
%Cpu0  : 13.2 us,  5.3 sy,  0.0 ni, 81.1 id,  0.4 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 29.7 us, 15.0 sy,  0.0 ni, 53.1 id,  1.0 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu2  : 20.7 us,  4.6 sy,  0.0 ni, 74.3 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
%Cpu3  : 18.2 us,  6.7 sy,  0.0 ni, 75.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  : 15.0 us,  6.3 sy,  0.0 ni, 77.7 id,  1.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  : 21.1 us,  4.9 sy,  0.0 ni, 63.0 id,  0.0 wa,  0.4 hi, 10.6 si,  0.0 st
%Cpu6  : 15.3 us,  5.9 sy,  0.0 ni, 78.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  : 22.5 us, 13.0 sy,  0.0 ni, 64.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

All of that due to the object-replicator and container-replicator services :(

Another thing that might improve utilization is moving to jessie's version of swift (though I don't have any concrete data on that), which we are doing in production anyway.


hashar updated the task description. May 2 2017, 7:20 PM

T162247 created two new Swift backends using Jessie. I have updated the task description to add Prometheus links. Something to note: both new instances have high "steal" CPU %, most probably an indication that the underlying labvirt is saturated ( T161118 ).

hashar renamed this task from deployment-ms-be01.deployment-prep and deployment-ms-be02.deployment-prep have high load / system CPU to deployment-ms-be03.deployment-prep and deployment-ms-be04.deployment-prep have high load / system CPU. May 9 2017, 8:55 AM
hashar updated the task description.

The Swift servers have been rebuilt to use Jessie.

object-server workers went from 100 to 12.

As a result of both actions, load went from ~20 to 3 or 6, which is lower than the 8 CPUs the instances have.

They both have high steal CPU usage (that is KVM being busy with other instances) and overall high CPU usage: ms-be03 at ~ 40% and ms-be04 at ~ 50%.

:(

hashar updated the task description. May 9 2017, 9:04 AM

Seems the load on deployment-ms-be04 is enough to overload labvirt1006 ( T165753 ) :(

Looking at top on each of the instances, ms-be04 has more load (a load of 6 for 8 CPUs) and CPU usage at 60% or so.

/usr/sbin/nscd CPU usage is quite high.

Swift processes with high CPU:

  • swift-container-replicator
  • swift-object-replicator

ms-be04 additionally has swift-object-auditor using CPU.

hashar added a comment (edited). Jun 13 2017, 7:28 PM

nscd:

Tue 13 Jun 2017 07:14:14 PM UTC - 30055: handle_request: request received (Version = 2) from PID 32224
Tue 13 Jun 2017 07:14:14 PM UTC - 30055: 	GETHOSTBYNAME (localhost)
Tue 13 Jun 2017 07:14:14 PM UTC - 30055: Haven't found "localhost" in hosts cache!

PID 32224 being swift-container-replicator. From strace:

connect(10<socket:[1600099473]>, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
getsockname(10<socket:[1600099473]>, {sa_family=AF_INET, sin_port=htons(19341), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
sendto(8<socket:[1600099470]>, "swift.deployment-prep.deployment-ms-be04.container-replicator.timing:76.3509273529|ms", 85, 0, {sa_family=AF_INET, sin_port=htons(8125), sin_addr=inet_addr("127.0.0.1")}, 16) = 85

I.e. socket connections to localhost port 8125, which is statsd. And we have:

# egrep -R '^[^#].*8125' /etc/swift
/etc/swift/account-server.conf:log_statsd_port = 8125
/etc/swift/container-server.conf:log_statsd_port = 8125
/etc/swift/object-server.conf:log_statsd_port = 8125

nscd additionally checks the freshness of /etc/resolv.conf using stat() and of /etc/hosts via open() + fstat():

[pid 31946] read(15, "\2\0\0\0\4\0\0\0\n\0\0\0", 12) = 12
[pid 31946] read(15, "localhost\0", 10) = 10
[pid 31946] stat("/etc/resolv.conf", {st_mode=S_IFREG|0444, st_size=276, ...}) = 0
[pid 31946] open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 16
[pid 31946] fstat(16, {st_mode=S_IFREG|0644, st_size=295, ...}) = 0
[pid 31946] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fe4cb049000
[pid 31946] read(16, "127.0.0.1\tlocalhost\n::1\t\tlocalho"..., 4096) = 295
[pid 31946] read(16, "", 4096)          = 0
[pid 31946] close(16)                   = 0

Due to /etc/nscd.conf having:

check-files           hosts    yes

I have dropped the related settings from hiera https://wikitech.wikimedia.org/w/index.php?title=Hiera:Deployment-prep&diff=1761898&oldid=1761816

That causes puppet to drop the statsd settings:

/etc/swift/account-server.conf
-log_statsd_host = localhost
-log_statsd_port = 8125
-log_statsd_metric_prefix = swift.deployment-prep.deployment-ms-be03
-log_statsd_sample_rate_factor = 1

I have restarted swift services and nscd is happy now. Will send a patch to switch production to use 127.0.0.1.
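The change is effectively the following in each server config (a sketch of the intent only; the actual patch applies it through the puppet swift templates):

/etc/swift/object-server.conf (likewise account-server.conf and container-server.conf)
-log_statsd_host = localhost
+log_statsd_host = 127.0.0.1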

Change 358799 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] swift: save nscd CPU by using IP address

https://gerrit.wikimedia.org/r/358799

faidon added a subscriber: faidon. Jun 13 2017, 11:08 PM

That sounds like an issue with nscd and probably needs to be solved on that layer, not on the Swift side. Queries for localhost shouldn't be that expensive.

Gave it a quick try. I have added in /etc/nscd.conf:

debug-level 9
logfile  /var/log/nscd.log

Stopped nscd and deleted the cache file /var/cache/nscd/hosts.

Then I started nscd and did a ping localhost to trigger a gethostbyname(); on the second ping nscd still complains it can't find it in the cache:

handle_request: request received (Version = 2) from PID 11841
	GETHOSTBYNAME (localhost)
Haven't found "localhost" in hosts cache!

Apparently it does not cache localhost, or possibly any of the entries in /etc/hosts:

/etc/hosts
127.0.0.1	localhost
::1		localhost ip6-localhost ip6-loopback
ff02::1		ip6-allnodes
ff02::2		ip6-allrouters

10.68.22.228	jenkinstest.integration.eqiad.wmflabs

So whatever happens inside nscd, localhost is special-cased somehow. I did a quick search in the GNU libc source code but nothing stood out. So I guess using 127.0.0.1 is the straightforward, easy solution.
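For anyone poking at this later, a couple of generic commands help exercise and inspect the cache (a sketch; getent goes through the same NSS path as the replicator's gethostbyname() call):

getent hosts localhost   # resolve localhost through NSS / nscd
sudo nscd -g             # print nscd configuration and statistics, including the hosts cache counters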

There are potentially other VMs across labs/cloud doing the same expensive resolution of localhost; it seems to me it would be more useful to understand the root cause instead.

I tried a few things with nscd and dug into the code a bit, but nothing stood out. I am not going to try to find the issue in the GNU libc / nscd source code. The hack of switching to 127.0.0.1 in https://gerrit.wikimedia.org/r/#/c/358799/ is good enough for the beta cluster. Feel free to open a new task and reuse my previous comments.


In "swift: lower replication interval for beta" I have bumped the replication delay to 20 / 30 minutes. The passes still take a while to run:

Jun 15 18:45:12 deployment-ms-be03 container-replicator:
Beginning replication run
Jun 15 18:55:46 deployment-ms-be03 container-replicator:
Replication run OVER
                                    vvv
Attempted to replicate 24236 dbs in 634.24600 seconds (38.21230/s)
                                    ^^^
Removed 0 dbs
24236 successes, 0 failures
no_change:24236 ts_repl:0 diff:0 rsync:0 diff_capped:0 hashmatch:0 empty:0
Jun 15 18:37:19 deployment-ms-be03 object-replicator:
38510/38510 (100.00%) partitions replicated in 702.83s (54.79/sec, 0s remaining)
38510 successes, 0 failures
49179 suffixes checked - 0.00% hashed, 0.00% synced
Partition times: max 1.0279s, min 0.0029s, med 0.0136s
                              vvvvv
Object replication complete. (11.72 minutes)
                              ^^^^^

A way to make it lighter is to reduce the number of containers ( T160990#3128087 ). We have 256 containers per wiki plus 1296 deleted ones per wiki. I have no idea how we could migrate them, though for the deleted ones we can most probably just trash them?

hashar triaged this task as Normal priority. Jun 16 2017, 10:55 AM
hashar closed this task as Resolved. Jun 27 2017, 9:04 PM
hashar claimed this task.

Load is mostly gone since the reboot of the labs machines on June 21st. That also solved labvirt1006 being overly busy ( T165753 ).

System load over 10 days:

CPU usage over 10 days:

So as far as I am concerned, that is solved.