deployment-ms-be01.deployment-prep and deployment-ms-be02.deployment-prep have high load / system CPU
Open · Needs Triage · Public

Description

The beta cluster swift backend instances show high load and system CPU usage. There must be something wrong with them.

deployment-ms-be01.deployment-prep — Prometheus (1 day)
deployment-ms-be02.deployment-prep — Prometheus (1 day)

Load:

deployment-ms-be01.deployment-prep.eqiad.wmflabs:
     12:34:20 up 139 days, 15:22,  0 users,  load average: 24.88, 22.79, 21.10
deployment-ms-be02.deployment-prep.eqiad.wmflabs:
     12:34:20 up 139 days, 15:22,  0 users,  load average: 13.54, 22.26, 22.68

Maybe the Swift services have too many workers for the labs instances?

deployment-ms-be01.deployment-prep.eqiad.wmflabs:
    /etc/swift/account-server.conf:workers = 8
    /etc/swift/container-server.conf:workers = 8
    /etc/swift/object-server.conf:workers = 100
deployment-ms-be02.deployment-prep.eqiad.wmflabs:
    /etc/swift/account-server.conf:workers = 8
    /etc/swift/container-server.conf:workers = 8
    /etc/swift/object-server.conf:workers = 100
hashar created this task. · Mar 21 2017, 12:41 PM

@hashar indeed, workers = 100 is hard-coded in puppet. I'm not going to have time for it, but please feel free to play with the setting and see if it makes a difference!

Mentioned in SAL (#wikimedia-releng) [2017-03-22T08:43:12Z] <hashar> deployment-ms-be01: swift-init reload object - T160990

Mentioned in SAL (#wikimedia-releng) [2017-03-22T08:45:24Z] <hashar> deployment-ms-be01: swift-init reload container - T160990

Mentioned in SAL (#wikimedia-releng) [2017-03-22T08:48:00Z] <hashar> deployment-ms-be01: swift-init reload all - T160990

On deployment-ms-be01 I reloaded the object server with 30 workers; that might have helped. There is apparently some replication going on between the two backends. Will let them settle.
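For the record, the change boils down to something like this (a rough sketch rather than the literal commands; the exact edit depends on how puppet lays out the config, and puppet will eventually revert a manual tweak like this):

    # sketch: lower the object-server worker count from 100 to 30,
    # then gracefully reload so the new count takes effect
    sudo sed -i 's/^workers = 100/workers = 30/' /etc/swift/object-server.conf
    sudo swift-init object reload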

A twist regarding the per-process CPU%:

  • htop shows it relative to a single CPU, e.g. 80%.
  • top defaults to a percentage of the whole host, so on 8 CPUs that becomes 10%. In Irix mode (toggled with I) it is relative to a single CPU (see the worked example below).
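For example, a replicator worker using 80% of one core shows up as 80% in htop (and in top's Irix mode), but only as 80 / 8 = 10% in top's default view on these 8-vCPU instances.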

As best I can tell:

  • lowering the number of workers might help
  • most of the time is spent in these processes:
    • swift-container-replicator
    • swift-object-replicator

So my guess is that both processes keep scanning all the files and attempt to replicate them over and over. According to the Swift deployment guide there are settings to tweak ionice and the replication interval; that might help.

Change 344387 had a related patch set uploaded (by Hashar):
[operations/puppet@production] swift: lower replication interval for beta

https://gerrit.wikimedia.org/r/344387

Mentioned in SAL (#wikimedia-releng) [2017-03-23T14:02:57Z] <hashar> deployment-ms-be01 and deployment-ms-be02 : Lower Swift replicator on, upgrade package, reboot hosts. T160990

Each instance uses 300% user CPU and 100% system CPU, so potentially 8 cores out of labvirt1004's 24. All that apparently just to replicate, via rsync, 23000 sqlite files that barely change.
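For reference, that file count can be re-checked on a backend with a quick find; a sketch assuming the sqlite databases live under /srv/swift-storage (adjust to the actual devices mount point):

    # count the account/container sqlite databases on this storage node
    sudo find /srv/swift-storage -name '*.db' | wc -l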

Will revisit the Prometheus graph later on and see how it has evolved after the patches above.

From what I understand, Swift replication continuously stat()s all the container and object sqlite files.
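A quick way to confirm that on one of the backends (hypothetical commands, nothing that was actually run here) is to attach strace to a replicator and look at the syscall summary:

    # attach to the running object replicator, let it run a few seconds,
    # then Ctrl-C: the -c summary should be dominated by stat()/lstat() calls
    sudo strace -c -f -p "$(pgrep -of swift-object-replicator)"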

We have 18990 containers, based on: . /etc/swift/account_AUTH_mw.env && swift list

deployment-ms-fe01$ . /etc/swift/account_AUTH_mw.env && swift list|cut -d\. -f1|uniq -c|egrep -v ' 1 '
    256 global-data-math-render
      2 wikimedia-commons-local-public
   1296 wikipedia-commons-local-deleted
    256 wikipedia-commons-local-public
    256 wikipedia-commons-local-temp
    256 wikipedia-commons-local-thumb
    256 wikipedia-commons-local-transcoded
   1296 wikipedia-de-local-deleted
    256 wikipedia-de-local-public
    256 wikipedia-de-local-temp
    256 wikipedia-de-local-thumb
    256 wikipedia-de-local-transcoded
   1296 wikipedia-en-local-deleted
    256 wikipedia-en-local-public
    256 wikipedia-en-local-temp
    256 wikipedia-en-local-thumb
    256 wikipedia-en-local-transcoded
   1296 wikipedia-he-local-deleted
    256 wikipedia-he-local-public
    256 wikipedia-he-local-temp
    256 wikipedia-he-local-thumb
    256 wikipedia-he-local-transcoded
   1296 wikipedia-ja-local-deleted
    256 wikipedia-ja-local-public
    256 wikipedia-ja-local-temp
    256 wikipedia-ja-local-thumb
    256 wikipedia-ja-local-transcoded
   1296 wikipedia-ru-local-deleted
    256 wikipedia-ru-local-public
    256 wikipedia-ru-local-temp
    256 wikipedia-ru-local-thumb
    256 wikipedia-ru-local-transcoded
   1296 wikipedia-uk-local-deleted
    256 wikipedia-uk-local-public
    256 wikipedia-uk-local-temp
    256 wikipedia-uk-local-thumb
    256 wikipedia-uk-local-transcoded
   1296 wikipedia-zh-local-deleted
    256 wikipedia-zh-local-public
    256 wikipedia-zh-local-temp
    256 wikipedia-zh-local-thumb
    256 wikipedia-zh-local-transcoded

More than half (10394) are *local-deleted*.
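A rough way to recount those, reusing the credentials file sourced above:

    # count every container whose name contains "-local-deleted"
    . /etc/swift/account_AUTH_mw.env && swift list | grep -c -- '-local-deleted'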

@godog is there a way to purge those deleted containers somehow? On beta maybe we could just use a single sqlite file instead of the sharded entries aa - zz? That would save a lot of files and hence a lot of stat() calls.

hashar edited the task description. · Mar 24 2017, 1:03 PM

Also found out via swift list --lh that most containers are actually empty. Most probably due to container-server.conf having db_preallocation = on.
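A sketch of how the empty ones can be counted, assuming swift list --lh prints the object count in the first column:

    # containers holding zero objects
    . /etc/swift/account_AUTH_mw.env && swift list --lh | awk '$1 == 0' | wc -l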

fgiunchedi added a comment (edited). · Mar 27 2017, 1:45 PM

[...]
@godog is there a way to purge those deleted containers somehow ? On beta maybe we could just use a single sqlite instead of the namespaces entries aa - zz ? That would save a lot of files and hence stat().

Not on the Swift side; the containers are managed (created) by MediaWiki entirely.

edit: Also, I think with your workers change the load average is now at an acceptable level; further optimizations are likely not worth it, unless it is somehow a problem?

Not on the Swift side; the containers are managed (created) by MediaWiki entirely.

Maybe via the hash level? I guess that is determined in the wgFileBackend settings of mediawiki-config, but then I don't know how we could merge the containers on the Swift side.

Also, I think with your workers change the load average is now at an acceptable level; further optimizations are likely not worth it, unless it is somehow a problem?

Yup, the load is way nicer and now under the number of CPUs, so that is definitely an improvement. I guess I can polish up the puppet patch so we can change the number of Swift workers via hiera.

What puzzles me is that the Prometheus probe reports 150% user CPU and 50+% system CPU, while on the instance it is at 25% usage of 8 virtual CPUs. So that is two labs CPUs being used per instance, or a waste of 4 CPUs on the labs hosts.

From top:

         user      system
%Cpu0  : 13.2 us,  5.3 sy,  0.0 ni, 81.1 id,  0.4 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 29.7 us, 15.0 sy,  0.0 ni, 53.1 id,  1.0 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu2  : 20.7 us,  4.6 sy,  0.0 ni, 74.3 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
%Cpu3  : 18.2 us,  6.7 sy,  0.0 ni, 75.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  : 15.0 us,  6.3 sy,  0.0 ni, 77.7 id,  1.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  : 21.1 us,  4.9 sy,  0.0 ni, 63.0 id,  0.0 wa,  0.4 hi, 10.6 si,  0.0 st
%Cpu6  : 15.3 us,  5.9 sy,  0.0 ni, 78.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  : 22.5 us, 13.0 sy,  0.0 ni, 64.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

All of that is due to the object-replicator and container-replicator services :(

Another thing that might improve utilization is moving to jessie's version of Swift (though I don't have any concrete data on that), which we are doing in production anyway.

hashar added a comment. · Wed, Apr 5, 9:15 AM

The summary

number of containers

There are ~20k containers, which causes the replicator to issue a lot of stat() calls and the like. Low-hanging fruit: the -deleted ones, 1296 containers per wiki.

That is controlled from the MediaWiki config, which uses a shard level of 2 pretty much everywhere, with the exception of $wgLocalFileRepo which uses 3 levels for deleted files:

wmf-config/filebackend.php
$wgLocalFileRepo = [
    'class'             => 'LocalRepo',
    'name'              => 'local',
    'backend'           => 'local-multiwrite',
    'url'               => $wgUploadBaseUrl ? $wgUploadBaseUrl . $wgUploadPath : $wgUploadPath,
    'scriptDirUrl'      => $wgScriptPath,
    'hashLevels'        => 2,
...
    'deletedHashLevels' => 3,
...
];

On beta it would be nice to lower it to two. The question is whether we have a way to migrate containers. Then again, given it is beta and they are just deleted files, we can probably just delete them all.
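If we go the delete-them-all route on beta, a minimal (and destructive, beta-only) sketch would be something along these lines:

    # DANGER: removes the containers and everything in them; beta cluster only
    . /etc/swift/account_AUTH_mw.env
    swift list | grep -- '-local-deleted' | while read -r container; do
        swift delete "$container"
    done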

fewer replication passes

Running the replication less often helps. There is a puppet patch that lets us tweak the Swift configs via hiera; on beta it changes the interval between passes to 300 seconds with only 1 concurrent replication process. That has been applied on beta and nicely reduced the load. https://gerrit.wikimedia.org/r/344387
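Roughly, the override boils down to something like the following in the replicator sections of the server configs (a sketch of the intent rather than the literal patch contents; older Swift releases spell the interval option run_pause, so the exact key depends on the version running on these Trusty instances):

    # /etc/swift/object-server.conf
    [object-replicator]
    interval = 300     # seconds between replication passes
    concurrency = 1    # a single replication worker

    # /etc/swift/container-server.conf
    [container-replicator]
    interval = 300
    concurrency = 1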

migrate Swift to Jessie

On beta the three Swift instances are on Ubuntu Trusty, while production has switched to Jessie. That migration has to be done eventually and might bring optimizations to the replication pass. Filed T162247.

instances on different labvirt

Both ms-be instances are on the same labvirt (T161083). Though if we create Jessie instances, they will most probably end up on different labvirts.