
deployment-ms-be03.deployment-prep and deployment-ms-be04.deployment-prep have high load / system CPU
Closed, Resolved · Public

Description

The beta cluster swift backend instances show high load and system CPU usage. There must be something wrong with them.

Symptoms

Jessie / new:
  • deployment-ms-be03.deployment-prep (Prometheus, 1 day)
  • deployment-ms-be04.deployment-prep (Prometheus, 1 day)

Load:

deployment-ms-be03.deployment-prep.eqiad.wmflabs:
 08:54:01 up 12 days, 19:40,  0 users,  load average: 5.27, 3.87, 3.67
deployment-ms-be04.deployment-prep.eqiad.wmflabs:
 08:54:20 up 12 days, 21:11,  0 users,  load average: 7.59, 6.35, 5.61

Maybe the Swift services have too many workers for the labs instances?

deployment-ms-be01.deployment-prep.eqiad.wmflabs:
    /etc/swift/account-server.conf:workers = 8
    /etc/swift/container-server.conf:workers = 8
    /etc/swift/object-server.conf:workers = 100
deployment-ms-be02.deployment-prep.eqiad.wmflabs:
    /etc/swift/account-server.conf:workers = 8
    /etc/swift/container-server.conf:workers = 8
    /etc/swift/object-server.conf:workers = 100
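
For reference, the values above come straight from the server configs and can be re-checked on a backend with a plain grep:

    grep '^workers' /etc/swift/account-server.conf \
                    /etc/swift/container-server.conf \
                    /etc/swift/object-server.conf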

Fix candidates

The summary:

  • number of containers

There are ~20k containers, which causes the replicator to issue a lot of stat() calls and similar. A low-hanging fruit is the -deleted containers: 1296 per wiki.

That is controlled from MediaWiki config which uses a shard level of 2 pretty much everywhere with the exception of $wgLocalFileRepo which uses 3 levels for deleted files:

wmf-config/filebackend.php
$wgLocalFileRepo = [
    'class'             => 'LocalRepo',
    'name'              => 'local',
    'backend'           => 'local-multiwrite',
    'url'               => $wgUploadBaseUrl ? $wgUploadBaseUrl . $wgUploadPath : $wgUploadPath,
    'scriptDirUrl'      => $wgScriptPath,
    'hashLevels'        => 2,
...
    'deletedHashLevels' => 3,
...
];

On beta it would be nice to lower it to two. The question is whether we have a way to migrate containers. Then again, given it is beta and they are just deleted files, we can probably just delete them all.

  • lower replication passes

Running the replication less often would help. There is a puppet patch to let us tweak Swift configs via hiera and, on beta, change the interval between passes to 300 seconds with only 1 concurrent process. That is applied on beta and nicely reduced the load. https://gerrit.wikimedia.org/r/344387

  • migrate Swift to Jessie

On beta the three Swift instances are on Ubuntu Trusty while production has switched to Jessie. That has to be done eventually and might bring optimizations in the replication pass. Filed T162247.

  • instances on different labvirt

Both ms-be instances are on the same labvirt (T161083). Though if we create Jessie instances they will most probably end up on different labvirts.

  • nscd eating CPU

gethostbyname('localhost') is not cached by nscd and causes a lot of disk access. The lookups come from statsd metrics. Worked around on beta by disabling statsd entirely.
A potential fix is to use 127.0.0.1: https://gerrit.wikimedia.org/r/#/c/358799/

Event Timeline


@hashar indeed the workers = 100 value is static in puppet. I'm not going to have time for it, but please feel free to play with the setting and see if that makes a difference!

Mentioned in SAL (#wikimedia-releng) [2017-03-22T08:43:12Z] <hashar> deployment-ms-be01: swift-init reload object - T160990

Mentioned in SAL (#wikimedia-releng) [2017-03-22T08:45:24Z] <hashar> deployment-ms-be01: swift-init reload container - T160990

Mentioned in SAL (#wikimedia-releng) [2017-03-22T08:48:00Z] <hashar> deployment-ms-be01: swift-init reload all - T160990

On deployment-ms-be01 I reloaded the object server with 30 workers, which might have helped. There is apparently some replication going on between the two back ends. Will let them settle.
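
(A rough sketch of that manual experiment, assuming sudo on the backend; since the worker count is still hard-coded in puppet at this point, puppet will likely put it back on its next run:)

    sudo sed -i 's/^workers = 100$/workers = 30/' /etc/swift/object-server.conf
    sudo swift-init reload object    # same invocation as in the SAL entries above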

Twist: about the per-process CPU %:

  • htop shows it relative to a single CPU, e.g. 80%.
  • top defaults to a percentage of the whole host, so on 8 CPUs that same 80% shows up as 10%. In Irix mode (toggled with I) top also reports relative to a single CPU.

As best I can tell:

  • lowering the number of workers might help
  • most of the time is spent in the processes:
    • swift-container-replicator
    • swift-object-replicator

So my guess is that both processes keep scanning all the files and attempt to replicate them over and over. Per the Swift Deployment Guide there are settings to tweak ionice and the replication interval. That might help.
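
(A minimal sketch of what turning those knobs by hand could look like, assuming the standard interval / concurrency options in the replicator sections and that the crudini helper is installed; on a puppet-managed host the real change belongs in puppet/hiera, which is what the patch below does:)

    # Slow both replicators down to one pass every 5 minutes, single process.
    # Note: older object-replicator releases call the interval option 'run_pause'.
    sudo crudini --set /etc/swift/container-server.conf container-replicator interval 300
    sudo crudini --set /etc/swift/container-server.conf container-replicator concurrency 1
    sudo crudini --set /etc/swift/object-server.conf object-replicator interval 300
    sudo crudini --set /etc/swift/object-server.conf object-replicator concurrency 1
    # then restart the *-replicator services so the new values are picked up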

Change 344387 had a related patch set uploaded (by Hashar):
[operations/puppet@production] swift: lower replication interval for beta

https://gerrit.wikimedia.org/r/344387

Mentioned in SAL (#wikimedia-releng) [2017-03-23T14:02:57Z] <hashar> deployment-ms-be01 and deployment-ms-be02 : Lower Swift replicator on, upgrade package, reboot hosts. T160990

Each instance uses 300% user CPU and 100% system CPU, so potentially 8 cores out of labvirt1004's 24 cores. All that apparently just for replicating, via rsync, 23000 SQLite files that are barely changing.

Will revisit the Prometheus graph later on and see how it has evolved after the patches above.

From what I understand, Swift replication ends up continuously stat()ing all the container and object SQLite files.

We have 18990 containers, based on: . /etc/swift/account_AUTH_mw.env && swift list

deployment-ms-fe01$ . /etc/swift/account_AUTH_mw.env && swift list|cut -d\. -f1|uniq -c|egrep -v ' 1 '
    256 global-data-math-render
      2 wikimedia-commons-local-public
   1296 wikipedia-commons-local-deleted
    256 wikipedia-commons-local-public
    256 wikipedia-commons-local-temp
    256 wikipedia-commons-local-thumb
    256 wikipedia-commons-local-transcoded
   1296 wikipedia-de-local-deleted
    256 wikipedia-de-local-public
    256 wikipedia-de-local-temp
    256 wikipedia-de-local-thumb
    256 wikipedia-de-local-transcoded
   1296 wikipedia-en-local-deleted
    256 wikipedia-en-local-public
    256 wikipedia-en-local-temp
    256 wikipedia-en-local-thumb
    256 wikipedia-en-local-transcoded
   1296 wikipedia-he-local-deleted
    256 wikipedia-he-local-public
    256 wikipedia-he-local-temp
    256 wikipedia-he-local-thumb
    256 wikipedia-he-local-transcoded
   1296 wikipedia-ja-local-deleted
    256 wikipedia-ja-local-public
    256 wikipedia-ja-local-temp
    256 wikipedia-ja-local-thumb
    256 wikipedia-ja-local-transcoded
   1296 wikipedia-ru-local-deleted
    256 wikipedia-ru-local-public
    256 wikipedia-ru-local-temp
    256 wikipedia-ru-local-thumb
    256 wikipedia-ru-local-transcoded
   1296 wikipedia-uk-local-deleted
    256 wikipedia-uk-local-public
    256 wikipedia-uk-local-temp
    256 wikipedia-uk-local-thumb
    256 wikipedia-uk-local-transcoded
   1296 wikipedia-zh-local-deleted
    256 wikipedia-zh-local-public
    256 wikipedia-zh-local-temp
    256 wikipedia-zh-local-thumb
    256 wikipedia-zh-local-transcoded

More than half (10394) are *local-deleted*.

@godog is there a way to purge those deleted containers somehow? On beta maybe we could just use a single SQLite database instead of the namespaced aa–zz entries? That would save a lot of files and hence stat() calls.

Also found out via swift list --lh that most containers are actually empty. Most probably due to container-server.conf having db_preallocation = on.
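
For the record, if simply trashing those containers on beta turns out to be acceptable, a rough sketch of it (reusing the env file sourced above; swift delete removes a container together with all of its objects, so this is destructive and beta-only; drop the echo to actually run it):

    . /etc/swift/account_AUTH_mw.env
    swift list | grep -- '-local-deleted' | while read -r container; do
        echo swift delete "$container"
    done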

[...]
@godog is there a way to purge those deleted containers somehow? On beta maybe we could just use a single SQLite database instead of the namespaced aa–zz entries? That would save a lot of files and hence stat() calls.

Not on the swift side, the containers are managed (created) by mediawiki entirely

edit: Also I think that with your workers change the load average is now at an acceptable level; further optimizations are likely not worth it, unless it is somehow a problem?

Not on the swift side, the containers are managed (created) by mediawiki entirely

Maybe the hash level. I guess that is determined in the wgFileBackend config of mediawiki-config, but then I don't know how we could merge the containers on the Swift side.

Also I think that with your workers change the load average is now at an acceptable level; further optimizations are likely not worth it, unless it is somehow a problem?

Yup, the load is way nicer and under the number of CPUs, so that is definitely an improvement. I guess I can polish up the puppet patch so we can change the number of Swift workers via hiera.

What puzzles me is that the Prometheus probe reports 150% user CPU and 50+% system CPU, while on the instance it shows as 25% usage of the 8 virtual CPUs. So that is two labs CPUs being used per instance, or a waste of 4 CPUs on the labs hosts.

From top:

         user      system
%Cpu0  : 13.2 us,  5.3 sy,  0.0 ni, 81.1 id,  0.4 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 29.7 us, 15.0 sy,  0.0 ni, 53.1 id,  1.0 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu2  : 20.7 us,  4.6 sy,  0.0 ni, 74.3 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
%Cpu3  : 18.2 us,  6.7 sy,  0.0 ni, 75.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  : 15.0 us,  6.3 sy,  0.0 ni, 77.7 id,  1.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  : 21.1 us,  4.9 sy,  0.0 ni, 63.0 id,  0.0 wa,  0.4 hi, 10.6 si,  0.0 st
%Cpu6  : 15.3 us,  5.9 sy,  0.0 ni, 78.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  : 22.5 us, 13.0 sy,  0.0 ni, 64.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

All of that due to the object-replicator and container-replicator services :(

Another thing that might improve utilization is moving to jessie's version of swift (though I don't have any concrete data on that), which we are doing in production anyway.

The summary

number of containers

There are ~20k containers, which causes the replicator to issue a lot of stat() calls and similar. A low-hanging fruit is the -deleted containers: 1296 per wiki.

That is controlled from MediaWiki config which uses a shard level of 2 pretty much everywhere with the exception of $wgLocalFileRepo which uses 3 levels for deleted files:

wmf-config/filebackend.php
$wgLocalFileRepo = [
    'class'             => 'LocalRepo',
    'name'              => 'local',
    'backend'           => 'local-multiwrite',
    'url'               => $wgUploadBaseUrl ? $wgUploadBaseUrl . $wgUploadPath : $wgUploadPath,
    'scriptDirUrl'      => $wgScriptPath,
    'hashLevels'        => 2,
...
    'deletedHashLevels' => 3,
...
];

On beta it would be nice to lower it to two. The question is whether we have a way to migrate containers. Then again, given it is beta and they are just deleted files, we can probably just delete them all.

lower replication passes

Running the replication less often would help. There is a puppet patch to let us tweak Swift configs via hiera and, on beta, change the interval between passes to 300 seconds with only 1 concurrent process. That is applied on beta and nicely reduced the load. https://gerrit.wikimedia.org/r/344387

migrate Swift to Jessie

On beta the three Swift instances are on Ubuntu Trusty while production has switched to Jessie. That has to be done eventually and might bring optimizations in the replication pass. Filed T162247.

instances on different labvirt

Both ms-be instances are on the same labvirt (T161083). Though if we create Jessie instances they will most probably end up on different labvirts.

T162247 created two new Swift backends using Jessie. I have updated the task description to add Prometheus links. Something to note: both new instances have high "steal" CPU %, most probably an indication that the underlying labvirt is saturated (T161118).

hashar renamed this task from "deployment-ms-be01.deployment-prep and deployment-ms-be02.deployment-prep have high load / system CPU" to "deployment-ms-be03.deployment-prep and deployment-ms-be04.deployment-prep have high load / system CPU". May 9 2017, 8:55 AM
hashar updated the task description.

The Swift servers have been rebuilt to use Jessie.

object-server workers went from 100 to 12.

As a result of both actions, load went from ~20 down to 3–6, which is below the 8 CPUs the instances have.

They both have high steal CPU usage (that is, the KVM host being busy with other instances) and overall high CPU usage: ms-be03 at ~40% and ms-be04 at ~50%.

:(

Seems the load on deployment-ms-be04 is enough to overload labvirt1006 ( T165753 ) :(

Looking at top on each of the instances: ms-be04 has more load (a load of 6 for 8 CPUs) and CPU usage at 60% or so.

/usr/sbin/nscd is quite high up in the list.

Swift processes with high CPU:

  • swift-container-replicator
  • swift-object-replicator

ms-be04 additionally has a swift-object-auditor.

nscd:

Tue 13 Jun 2017 07:14:14 PM UTC - 30055: handle_request: request received (Version = 2) from PID 32224
Tue 13 Jun 2017 07:14:14 PM UTC - 30055: 	GETHOSTBYNAME (localhost)
Tue 13 Jun 2017 07:14:14 PM UTC - 30055: Haven't found "localhost" in hosts cache!

PID 32224 being swift-container-replicator. From strace:

connect(10<socket:[1600099473]>, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
getsockname(10<socket:[1600099473]>, {sa_family=AF_INET, sin_port=htons(19341), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
sendto(8<socket:[1600099470]>, "swift.deployment-prep.deployment-ms-be04.container-replicator.timing:76.3509273529|ms", 85, 0, {sa_family=AF_INET, sin_port=htons(8125), sin_addr=inet_addr("127.0.0.1")}, 16) = 85

I.e. socket connections to localhost port 8125, which is statsd. And we have:

# egrep -R '^[^#].*8125' /etc/swift
/etc/swift/account-server.conf:log_statsd_port = 8125
/etc/swift/container-server.conf:log_statsd_port = 8125
/etc/swift/object-server.conf:log_statsd_port = 8125

nscd additionally checks the freshness of /etc/resolv.conf via stat() and of /etc/hosts via open() + fstat():

[pid 31946] read(15, "\2\0\0\0\4\0\0\0\n\0\0\0", 12) = 12
[pid 31946] read(15, "localhost\0", 10) = 10
[pid 31946] stat("/etc/resolv.conf", {st_mode=S_IFREG|0444, st_size=276, ...}) = 0
[pid 31946] open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 16
[pid 31946] fstat(16, {st_mode=S_IFREG|0644, st_size=295, ...}) = 0
[pid 31946] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fe4cb049000
[pid 31946] read(16, "127.0.0.1\tlocalhost\n::1\t\tlocalho"..., 4096) = 295
[pid 31946] read(16, "", 4096)          = 0
[pid 31946] close(16)                   = 0

Due to /etc/nscd.conf having:

check-files           hosts    yes
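
(An alternative, purely nscd-side mitigation would be to stop nscd from re-checking /etc/hosts on each lookup. A sketch, not the route taken here; it only avoids the repeated stat()/open() of /etc/hosts, not the cache misses themselves, and puppet may well manage /etc/nscd.conf:)

    # flip 'check-files hosts yes' to 'no' and restart nscd
    sudo sed -i 's/^\(\s*check-files\s\+hosts\s\+\)yes/\1no/' /etc/nscd.conf
    sudo service nscd restart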

I have dropped the related settings from hiera https://wikitech.wikimedia.org/w/index.php?title=Hiera:Deployment-prep&diff=1761898&oldid=1761816

That causes puppet to drop the statsd settings:

/etc/swift/account-server.conf
-log_statsd_host = localhost
-log_statsd_port = 8125
-log_statsd_metric_prefix = swift.deployment-prep.deployment-ms-be03
-log_statsd_sample_rate_factor = 1

I have restarted swift services and nscd is happy now. Will send a patch to switch production to use 127.0.0.1.
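
(Conceptually the patch boils down to this change in the three server configs, sketched here as a one-off edit; the real change goes through the puppet templates:)

    sudo sed -i 's/^log_statsd_host = localhost$/log_statsd_host = 127.0.0.1/' \
        /etc/swift/account-server.conf \
        /etc/swift/container-server.conf \
        /etc/swift/object-server.conf
    # then reload the swift services so they pick up the new value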

Change 358799 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] swift: save nscd CPU by using IP address

https://gerrit.wikimedia.org/r/358799

That sounds like an issue with nscd and probably needs to be solved on that layer, not on the Swift side. Queries for localhost shouldn't be that expensive.

Gave it a quick try. I added the following to /etc/nscd.conf:

debug-level 9
logfile  /var/log/nscd.log

Stopped nscd and deleted the cache file /var/cache/nscd/hosts.

Then I started nscd and did a ping localhost to trigger a gethostbyname(); on the second ping nscd still complains it can't find it in the cache:

handle_request: request received (Version = 2) from PID 11841
	GETHOSTBYNAME (localhost)
Haven't found "localhost" in hosts cache!
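
(For reference, the whole experiment end to end, as a shell sketch assuming sudo on the instance:)

    printf 'debug-level 9\nlogfile /var/log/nscd.log\n' | sudo tee -a /etc/nscd.conf
    sudo service nscd stop
    sudo rm -f /var/cache/nscd/hosts
    sudo service nscd start
    ping -c1 localhost
    ping -c1 localhost    # the second lookup should be a cache hit, but is not
    sudo grep -E 'GETHOSTBYNAME|hosts cache' /var/log/nscd.log | tail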

It seems to never cache localhost, nor the other entries in /etc/hosts:

/etc/hosts
127.0.0.1	localhost
::1		localhost ip6-localhost ip6-loopback
ff02::1		ip6-allnodes
ff02::2		ip6-allrouters

10.68.22.228	jenkinstest.integration.eqiad.wmflabs

So whatever happens in nscd, localhost is special-cased somehow. I did a quick search of the GNU libc source code but nothing stood out. So I guess using 127.0.0.1 is the straightforward, easy solution.

There are potentially other VMs across labs/cloud doing the same expensive resolution of localhost; it seems to me it'd be more useful to understand the root cause instead.

I tried a few things with nscd and dug into the code a bit, but I could not find anything standing out. I am not going to try to find the issue in the GNU libc / nscd source code. The hack to switch to 127.0.0.1 in https://gerrit.wikimedia.org/r/#/c/358799/ is good enough for the beta cluster. Feel free to open a new task and reuse my previous comments.


In "swift: lower replication interval for beta" I have bumped the replication interval to 20 / 30 minutes. The runs take a while:

Jun 15 18:45:12 deployment-ms-be03 container-replicator:
Beginning replication run
Jun 15 18:55:46 deployment-ms-be03 container-replicator:
Replication run OVER
                                    vvv
Attempted to replicate 24236 dbs in 634.24600 seconds (38.21230/s)
                                    ^^^
Removed 0 dbs
24236 successes, 0 failures
no_change:24236 ts_repl:0 diff:0 rsync:0 diff_capped:0 hashmatch:0 empty:0
Jun 15 18:37:19 deployment-ms-be03 object-replicator:
38510/38510 (100.00%) partitions replicated in 702.83s (54.79/sec, 0s remaining)
38510 successes, 0 failures
49179 suffixes checked - 0.00% hashed, 0.00% synced
Partition times: max 1.0279s, min 0.0029s, med 0.0136s
                              vvvvv
Object replication complete. (11.72 minutes)
                              ^^^^^

A way to make it lighter is to reduce the number of containers (T160990#3128087). We have 256 containers per wiki and 1024 deleted containers per wiki. I have no idea how we could migrate them, though for the deleted ones we can most probably just trash them?

hashar triaged this task as Medium priority. Jun 16 2017, 10:55 AM
hashar claimed this task.

Load is mostly gone since the reboot of the labs machines on June 21st. That also solved labvirt1006 being overly busy (T165753).

System load over 10 days:

Screenshot_20170627_230136.png (246×616 px, 36 KB)

Screenshot_20170627_230202.png (244×623 px, 44 KB)

CPU usage over 10 days:

Screenshot_20170627_230126.png (284×618 px, 64 KB)

So as far as I am concerned, that is solved.

Screenshot_20170627_230151.png (278×607 px, 76 KB)

Change 358799 abandoned by Hashar:
swift: save nscd CPU by using IP address

Reason:
Replacing 'localhost' by '127.0.0.1' definitely lowered the load on labs.

As for figuring out exactly why nscd or the labs DNS config is at fault, I have no idea. There is the archived task T171745 for history purposes.

https://gerrit.wikimedia.org/r/358799

It has come up while reviewing cherry-picks used in deployment-prep that https://gerrit.wikimedia.org/r/c/344387 was not merged. Is that still needed? Either we should reopen this task or remove that cherry-pick...

Swift triggers a lot of name resolutions of localhost, and nscd always attempts to resolve them, causing high disk/CPU usage. That is either a bug in nscd or a bad DNS configuration in labs. T171745 tracked that, but I had and still have no time to look into it.

https://gerrit.wikimedia.org/r/c/344387 changes the configuration from localhost to 127.0.0.1 to work around the name resolution CPU usage. The patch got rejected in favor of tracking down the actual root cause, but I gave up: it is too complicated / too much time for me.

One can potentially try to drop the cherry-pick, run puppet on the Swift servers, and observe the CPU / load usage. Another way is to disable the statsd metrics in Swift, which are what triggers the localhost resolution.

Reopening this two-year-old task in the interest of eventually getting the related puppet patches merged.

The main slowdown of the Swift instances on beta is that the replication occurs too often given the disk/CPU available. https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/344387/ makes the puppet classes for Swift more configurable and uses hiera to tweak parameters.

Change 344387 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] swift: lower replication interval for beta

https://gerrit.wikimedia.org/r/344387

Change 513053 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] swift: hiera-ize object-replicator interval

https://gerrit.wikimedia.org/r/513053

Change 513054 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] beta: tweak swift replicator

https://gerrit.wikimedia.org/r/513054

Change 513058 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] swift: hiera-ize object server number of workers

https://gerrit.wikimedia.org/r/513058

Change 513059 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] beta: lower swift server workers

https://gerrit.wikimedia.org/r/513059

Change 513062 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] swift: hierarize container_replicator settings

https://gerrit.wikimedia.org/r/513062

Change 513063 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] beta: slow down swift container replication

https://gerrit.wikimedia.org/r/513063

Change 344387 abandoned by Hashar:
swift: lower replication interval for beta

Reason:
I have split this huge commit into smaller chunks that would be easier to review and get merged:

https://gerrit.wikimedia.org/r/#/q/owner:self project:operations/puppet bug:T160990

https://gerrit.wikimedia.org/r/344387

I have abandoned the years-old patch, which was doing too much at once. Instead I have split it up into small atomic chunks, which should be easier to review and deploy for production.

Three settings are affected. For each, I have first sent a patch to make the Swift setting configurable via hiera, based on @CDanis's prior work (Gerrit #507386); that should be a no-op for production. Then a second patch tweaks the setting for Beta-Cluster-Infrastructure.

The list of changes is at: https://gerrit.wikimedia.org/r/#/q/bug:T160990+NOT+is:abandoned

Each went through the puppet compiler before the SRE summit against the production hosts ms-fe1005.eqiad.wmnet and ms-be1040.eqiad.wmnet.

So I guess it is just pending a final review / compiler run / deploy by @fgiunchedi / @CDanis?

All the patches to operations/puppet.git are attached to this bug and can be found in Gerrit via the search query [[ https://gerrit.wikimedia.org/r/#/q/bug:T160990+NOT+is:abandoned | bug:T160990 ]].

Change 513053 merged by CDanis:
[operations/puppet@production] swift: hiera-ize object-replicator interval

https://gerrit.wikimedia.org/r/513053

Change 513054 merged by CDanis:
[operations/puppet@production] beta: tweak swift replicator

https://gerrit.wikimedia.org/r/513054

Change 513058 merged by CDanis:
[operations/puppet@production] swift: hiera-ize object server number of workers

https://gerrit.wikimedia.org/r/513058

Change 513059 merged by CDanis:
[operations/puppet@production] beta: lower swift server workers

https://gerrit.wikimedia.org/r/513059

Change 513062 merged by CDanis:
[operations/puppet@production] swift: hierarize container_replicator settings

https://gerrit.wikimedia.org/r/513062

Change 513063 merged by CDanis:
[operations/puppet@production] beta: slow down swift container replication

https://gerrit.wikimedia.org/r/513063

Hurrah. Thank you @CDanis & @godog :]