The /var partition on tools-webgrid-04 is nearly full (98% used, 37M free), which triggers frequent Shinken alerts:
scfc@tools-webgrid-04:~$ df -h
Filesystem                                       Size  Used Avail Use% Mounted on
/dev/vda1                                        7.6G  3.8G  3.5G  53% /
udev                                             7.9G   12K  7.9G   1% /dev
tmpfs                                            1.6G  2.2M  1.6G   1% /run
none                                             5.0M     0  5.0M   0% /run/lock
none                                             7.9G     0  7.9G   0% /run/shm
/dev/vda2                                        1.9G  1.8G   37M  98% /var
labstore.svc.eqiad.wmnet:/project/tools/project   40T   11T   30T  26% /data/project
labstore.svc.eqiad.wmnet:/project/tools/home      40T   11T   30T  26% /home
labstore.svc.eqiad.wmnet:/keys                   960M   49M  911M   6% /public/keys
labstore1003.eqiad.wmnet:/dumps                   44T   12T   33T  26% /public/dumps
labstore.svc.eqiad.wmnet:/scratch                7.3T  1.4T  6.0T  19% /data/scratch
scfc@tools-webgrid-04:~$
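For reference, a minimal sketch of how the fill level of a mount point could be checked from a script. This is not what Shinken runs; the function name `usage_percent` and the 90% threshold are assumptions chosen for illustration. It relies on `df -P`, which keeps each filesystem on a single line even when the device name is long.

```shell
#!/bin/sh
# Sketch only: usage_percent and THRESHOLD are hypothetical, not part of
# the existing monitoring setup.

THRESHOLD=90

# Print the usage of the filesystem holding the given path as an integer
# percentage (the "Capacity" column of POSIX df output, without the % sign).
usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when the given path is at or above the threshold.
check_mount() {
    used=$(usage_percent "$1")
    if [ "$used" -ge "$THRESHOLD" ]; then
        echo "WARNING: $1 is ${used}% full"
    fi
}
```

Running `check_mount /var` on the host would print a warning for the situation shown in the df output.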
I have disabled the queue on this host with:
scfc@tools-bastion-01:~$ qmod -d webgrid-lighttpd@tools-webgrid-04.eqiad.wmflabs
scfc@tools-bastion-01.eqiad.wmflabs changed state of "webgrid-lighttpd@tools-webgrid-04.eqiad.wmflabs" (disabled)
scfc@tools-bastion-01:~$
In a few hours I will:
- reschedule the jobs on that host (qhost -h tools-webgrid-04 -j, qmod -rj),
- delete the instance,
- create a new Precise instance,
- copy the configuration settings from tools-webgrid-04,
- copy /etc/hosts,
- copy /usr/local/bin/gridengine-mailer (cf. T63160),
- check that SGE recognizes the new host, and
- re-enable the queue (qmod -e webgrid-lighttpd@tools-webgrid-04.eqiad.wmflabs).
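The Grid Engine parts of this plan could be collected in a small dry-run script, roughly as sketched here. The helper names (`run`, `list_jobs`, `reschedule_jobs`, `enable_queue`) and the `DRY_RUN` variable are assumptions made for illustration; only the qhost and qmod invocations are taken from the steps above. The job IDs are meant to be read by hand from the qhost output, and deleting and recreating the instance is left out on purpose.

```shell
#!/bin/sh
# Dry-run sketch of the Grid Engine steps from the plan.
# Assumption: the helper names and DRY_RUN are hypothetical; only the
# qhost/qmod command lines come from the plan itself.

HOST="tools-webgrid-04"
QUEUE="webgrid-lighttpd@${HOST}.eqiad.wmflabs"

# Print the command instead of executing it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Show the jobs currently running on the host.
list_jobs() {
    run qhost -h "$HOST" -j
}

# Reschedule the given job IDs (taken manually from the list_jobs output).
reschedule_jobs() {
    for job_id in "$@"; do
        run qmod -rj "$job_id"
    done
}

# Re-enable the queue once SGE recognizes the rebuilt host.
enable_queue() {
    run qmod -e "$QUEUE"
}
```

By default the functions only print the commands they would run; setting `DRY_RUN=0` on the bastion would execute them for real.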