
High load on some webgrid nodes
Closed, Resolved · Public

Description

valhallasw@tools-bastion-03:~$ qhost | grep -e 'pd-14' -e 'MEM'

tools-webgrid-lighttpd-1401.eqiad.wmflabs        lx26-amd64  4  10.88  7.8G    1.1G  23.9G  551.1M
tools-webgrid-lighttpd-1403.eqiad.wmflabs        lx26-amd64  4   5.53  7.8G    1.3G  23.9G     0.0
tools-webgrid-lighttpd-1404.eqiad.wmflabs        lx26-amd64  4   1.39  7.8G    2.2G  23.9G  461.7M
tools-webgrid-lighttpd-1405.eqiad.wmflabs        lx26-amd64  4   8.44  7.8G    2.4G  23.9G     0.0
tools-webgrid-lighttpd-1406.eqiad.wmflabs        lx26-amd64  4  10.88  7.8G    1.4G  23.9G     0.0
tools-webgrid-lighttpd-1407.eqiad.wmflabs        lx26-amd64  4   1.21  7.8G    1.8G  23.9G     0.0
tools-webgrid-lighttpd-1408.eqiad.wmflabs        lx26-amd64  4   2.97  7.8G    1.7G  23.9G     0.0
tools-webgrid-lighttpd-1409.eqiad.wmflabs        lx26-amd64  4   4.72  7.8G    1.4G  23.9G     0.0
tools-webgrid-lighttpd-1410.eqiad.wmflabs        lx26-amd64  4   8.36  7.8G    1.2G  23.9G     0.0
tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs  lx26-amd64  4  13.67  7.8G    1.9G  23.9G    1.1G
tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs  lx26-amd64  4  11.27  7.8G    2.0G  23.9G  312.3M
tools-webgrid-lighttpd-1413.tools.eqiad.wmflabs  lx26-amd64  4  13.04  7.8G    1.9G  24.5G  578.5M
tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs  lx26-amd64  4  11.43  7.8G    1.1G  23.9G  805.9M
tools-webgrid-lighttpd-1415.tools.eqiad.wmflabs  lx26-amd64  4   0.01  7.8G  558.5M  23.9G     0.0
So: by my 'load > ncpu' metric, only three hosts (1404, 1407 and 1408), plus the nearly idle 1415, are //not// overloaded; the other ten all have a load above their 4 CPUs.
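
The 'load > ncpu' check can be applied mechanically rather than by eye. The following is a minimal sketch, not a tool that exists on the bastion: `overloaded_hosts` is a hypothetical helper name, and it assumes the eight-column `qhost` layout seen above (hostname, arch, ncpu, load, then memory and swap columns).

```shell
# Print "hostname load" for every host whose load average exceeds its CPU count.
# Reads qhost-style rows on stdin; header or placeholder rows with non-numeric
# NCPU/LOAD fields are skipped.
overloaded_hosts() {
    awk '$3 ~ /^[0-9]+$/ && $4 ~ /^[0-9.]+$/ && $4 + 0 > $3 + 0 { print $1, $4 }'
}
```

It could be used as `qhost | grep -e 'pd-14' | overloaded_hosts`.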

Event Timeline

Restricted Application added a subscriber: Aklapper.
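# Show the top CPU consumers on each webgrid host (bash brace expansion keeps the zero padding).
for i in {01..14}; do
    echo "tools-webgrid-lighttpd-14$i"
    echo "---------------------------------------"
    # Two top iterations, sorted ascending by %CPU; tail | tac lists the heaviest processes first.
    ssh "tools-webgrid-lighttpd-14$i" 'top -b -n 2 -d 3 -S -o "-%CPU" | tail | tac'
done | tee heavyuserlog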

This shows that tools.jembot has php-cgi processes running on every host (well, except for the three hosts that are not overloaded), and that they are using massive amounts of CPU.

From strace, the jobs seem to be in a busy-wait loop:

gettimeofday({1460909928, 640933}, NULL) = 0
gettimeofday({1460909928, 640985}, NULL) = 0
gettimeofday({1460909928, 641043}, NULL) = 0
gettimeofday({1460909928, 641099}, NULL) = 0
gettimeofday({1460909928, 641154}, NULL) = 0
gettimeofday({1460909928, 641205}, NULL) = 0
gettimeofday({1460909928, 641261}, NULL) = 0
gettimeofday({1460909928, 641313}, NULL) = 0
gettimeofday({1460909928, 641369}, NULL) = 0
gettimeofday({1460909928, 641415}, NULL) = 0
gettimeofday({1460909928, 641472}, NULL) = 0
gettimeofday({1460909928, 641530}, NULL) = 0
gettimeofday({1460909928, 641587}, NULL) = 0
gettimeofday({1460909928, 641642}, NULL) = 0
gettimeofday({1460909928, 641700}, NULL) = 0
gettimeofday({1460909928, 641755}, NULL) = 0
gettimeofday({1460909928, 641814}, NULL) = 0
gettimeofday({1460909928, 641869}, NULL) = 0
gettimeofday({1460909928, 641926}, NULL) = 0
gettimeofday({1460909928, 641973}, NULL) = 0
gettimeofday({1460909928, 642028}, NULL) = 0

I don't have time to investigate this further at the moment; I will kill the jobs, and hope that they won't respawn too quickly.
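
A quick way to confirm that a trace like the one above is dominated by a single call is to count syscall names. This is only a sketch: `syscall_histogram` is a hypothetical helper, and it assumes plain `strace -p PID` output with one call per line and no `[pid N]` prefixes.

```shell
# Count how often each syscall name appears in strace output read from stdin,
# printing "count name" with the most frequent call first.
syscall_histogram() {
    awk 'match($0, /^[a-z_][a-z0-9_]*\(/) {
             name = substr($0, 1, RLENGTH - 1)   # drop the trailing "("
             count[name]++
         }
         END { for (name in count) print count[name], name }' | sort -rn
}
```

A process that reports almost nothing but `gettimeofday` while burning CPU is consistent with polling the clock in a tight loop rather than sleeping.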

valhallasw claimed this task.

Cleaned up with

valhallasw@tools-bastion-03:~$ sudo become jembot
tools.jembot@tools-bastion-03:~$ for i in {01..14}; do ssh tools-webgrid-lighttpd-14$i killall -v -9 -u tools.jembot php-cgi; done
tools.jembot@tools-bastion-03:~$ webservice restart
Restarting..
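
To see whether the killed processes respawn, the per-host process count can be checked after the restart. The sketch below is an assumption-laden illustration, not what was actually run: `count_leftovers` and the `RUN_REMOTE` hook are hypothetical names, and it assumes `pgrep -c` is available on the grid hosts.

```shell
# Print "host count" for php-cgi processes still owned by a user on each host.
# RUN_REMOTE is a hypothetical hook (default: ssh) so the loop can be dry-run.
count_leftovers() {
    user=$1
    shift
    for host in "$@"; do
        # pgrep -c prints the match count; it exits non-zero when the count is 0.
        n=$(${RUN_REMOTE:-ssh} "$host" pgrep -c -u "$user" php-cgi) || true
        echo "$host ${n:-0}"
    done
}
```

For example: `count_leftovers tools.jembot tools-webgrid-lighttpd-14{01..14}`.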