Resetup tools-webgrid-04 due to /var being too small
Closed, Resolved (Public)

Description

The free space on /var on tools-webgrid-04 is on the brink of collapsing causing frequent Shinken alerts:

scfc@tools-webgrid-04:~$ df -h
Filesystem                                       Size    Used Avail  Use% Mounted on
/dev/vda1                                        7,6G    3,8G  3,5G   53% /
udev                                             7,9G     12K  7,9G    1% /dev
tmpfs                                            1,6G    2,2M  1,6G    1% /run
none                                             5,0M       0  5,0M    0% /run/lock
none                                             7,9G       0  7,9G    0% /run/shm
/dev/vda2                                        1,9G    1,8G   37M   98% /var
labstore.svc.eqiad.wmnet:/project/tools/project   40T     11T   30T   26% /data/project
labstore.svc.eqiad.wmnet:/project/tools/home      40T     11T   30T   26% /home
labstore.svc.eqiad.wmnet:/keys                   960M     49M  911M    6% /public/keys
labstore1003.eqiad.wmnet:/dumps                   44T     12T   33T   26% /public/dumps
labstore.svc.eqiad.wmnet:/scratch                7,3T    1,4T  6,0T   19% /data/scratch
scfc@tools-webgrid-04:~$
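A check like the one behind the alerts can be sketched in a few lines of POSIX shell. This is only an illustration, not the actual Shinken check: the function names and the 95% threshold are assumptions. It forces the C locale and uses df -P so that the column layout is predictable regardless of the login locale.

```shell
#!/bin/sh
# Print the capacity column (without the % sign) from `df -P` output on stdin.
use_percent() {
    awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}

# Succeed when the mount point is at or above the threshold.
# $1: mount point, $2: percentage threshold (both illustrative).
is_nearly_full() {
    used=$(LC_ALL=C df -P "$1" 2>/dev/null | use_percent)
    [ -n "$used" ] && [ "$used" -ge "$2" ]
}

if is_nearly_full /var 95; then
    echo "/var is at or above 95% usage"
fi
```

Run on the host in the state shown above, the /var line (98%) would trip a 95% threshold.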

I have disabled the queue on this host with:

scfc@tools-bastion-01:~$ qmod -d webgrid-lighttpd@tools-webgrid-04.eqiad.wmflabs
scfc@tools-bastion-01.eqiad.wmflabs changed state of "webgrid-lighttpd@tools-webgrid-04.eqiad.wmflabs" (disabled)
scfc@tools-bastion-01:~$

In a few hours I will:

  • reschedule the jobs on that host (qhost -h tools-webgrid-04 -j, qmod -rj),
  • delete the instance,
  • create a new Precise instance,
  • copy the configuration settings from tools-webgrid-04,
  • copy /etc/hosts,
  • copy /usr/local/bin/gridengine-mailer (cf. T63160),
  • check that SGE recognizes the new host, and
  • reenable the queue (qmod -e webgrid-lighttpd@tools-webgrid-04.eqiad.wmflabs).
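The grid engine part of the plan above can be sketched as a dry-run script. This is an assumption-laden outline rather than a tested procedure: the run helper, the drain_and_reenable function and the DRY_RUN variable are made up for illustration, and the job list passed to qmod -rj has to be read off the qhost listing by hand. By default the script only prints the commands.

```shell
#!/bin/sh
# Echo a command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1: short host name; $2 (optional): job list to hand to qmod -rj,
# taken by the operator from the `qhost -j` listing.
drain_and_reenable() {
    host=$1
    job_list=${2:-}
    queue="webgrid-lighttpd@${host}.eqiad.wmflabs"
    run qmod -d "$queue"            # stop new jobs landing on the host
    run qhost -h "$host" -j         # list the jobs still running there
    if [ -n "$job_list" ]; then
        run qmod -rj "$job_list"    # reschedule those jobs
    fi
    # Deleting and recreating the instance and copying the settings
    # are manual steps and deliberately not scripted here.
    run qmod -e "$queue"            # re-enable the queue afterwards
}

drain_and_reenable tools-webgrid-04
```

Keeping the destructive steps out of the script and defaulting to dry-run makes it safe to paste the printed commands one at a time.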

Event Timeline

scfc claimed this task.
scfc raised the priority of this task to High.
scfc updated the task description.
scfc added a project: Toolforge.
scfc moved this task to In Progress on the Toolforge board.
scfc added a subscriber: scfc.
scfc set Security to None.

Added the host with qconf -mhgrp @webgrid and restarted execd with service gridengine-exec restart. Checked with diff -u <(qconf -sq webgrid-lighttpd@tools-webgrid-07.eqiad.wmflabs) <(qconf -sq webgrid-lighttpd@tools-webgrid-08.eqiad.wmflabs) that only the hostname differs, then added the host as a submit host with qconf -as tools-webgrid-08.eqiad.wmflabs. However, webservice put a job on tools-webgrid-02 instead of tools-webgrid-08. Perhaps a resource for Precise is missing?
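The "only the hostname differs" check can also be done without reading the diff by eye. The following is a minimal POSIX shell sketch that works on saved qconf -sq dumps; the function names and the file names in the usage comment are hypothetical.

```shell
#!/bin/sh
# Escape the dots in a host name so sed treats them literally.
escape_dots() {
    printf '%s\n' "$1" | sed 's/[.]/\\./g'
}

# Succeed when two saved `qconf -sq` dumps are identical once each
# host name is masked. $1/$2: dump files; $3/$4: the matching host names.
same_apart_from_hostname() {
    masked_a=$(sed "s/$(escape_dots "$3")/HOST/g" "$1") || return 2
    masked_b=$(sed "s/$(escape_dots "$4")/HOST/g" "$2") || return 2
    [ "$masked_a" = "$masked_b" ]
}

# Example (file names are placeholders):
#   qconf -sq webgrid-lighttpd@tools-webgrid-07.eqiad.wmflabs > q07.txt
#   qconf -sq webgrid-lighttpd@tools-webgrid-08.eqiad.wmflabs > q08.txt
#   same_apart_from_hostname q07.txt q08.txt \
#       tools-webgrid-07.eqiad.wmflabs tools-webgrid-08.eqiad.wmflabs
```

The exit status is 0 when nothing but the host name differs, so the check can be chained with && before the next step.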

diff -u <(qconf -se tools-webgrid-01.eqiad.wmflabs) <(qconf -se tools-webgrid-08.eqiad.wmflabs) shows the difference between the resources.

And that was the Gordian knot:

tools.typoscan@tools-bastion-01:~$ qstat -xml
<?xml version='1.0'?>
<job_info  xmlns:xsd="http://gridengine.sunsource.net/source/browse/*checkout*/gridengine/source/dist/util/resources/schemas/qstat/qstat.xsd?revision=1.11">
  <queue_info>
    <job_list state="running">
      <JB_job_number>9799151</JB_job_number>
      <JAT_prio>0.30001</JAT_prio>
      <JB_name>lighttpd-typoscan</JB_name>
      <JB_owner>tools.typoscan</JB_owner>
      <state>r</state>
      <JAT_start_time>2015-04-10T02:02:09</JAT_start_time>
      <queue_name>webgrid-lighttpd@tools-webgrid-08.eqiad.wmflabs</queue_name>
      <slots>1</slots>
    </job_list>
  </queue_info>
  <job_info>
  </job_info>
</job_info>
tools.typoscan@tools-bastion-01:~$
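To check the placement without reading the XML by hand, the queue name can be pulled out with sed. This is a rough sketch that relies on qstat printing each element on its own line, as in the output above; it is not a real XML parser, and the function names are made up for illustration. The here-document is an excerpt of the output above so that the example runs without a grid.

```shell
#!/bin/sh
# Print the text of every <queue_name> element found on stdin, one per line.
queue_names() {
    sed -n 's:.*<queue_name>\(.*\)</queue_name>.*:\1:p'
}

# Succeed when some job in the qstat XML on stdin runs in the given
# queue instance. $1: full name in the form queue@host.
runs_in_queue() {
    queue_names | grep -q -F -x "$1"
}

# Excerpt of the `qstat -xml` output shown above.
if runs_in_queue webgrid-lighttpd@tools-webgrid-08.eqiad.wmflabs <<'EOF'
    <job_list state="running">
      <JB_name>lighttpd-typoscan</JB_name>
      <queue_name>webgrid-lighttpd@tools-webgrid-08.eqiad.wmflabs</queue_name>
    </job_list>
EOF
then
    echo "job is running on tools-webgrid-08"
fi
```

On a bastion, the here-document would be replaced by piping qstat -xml into runs_in_queue.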

It was not necessary to enable the queue for this host, as that apparently happened as part of qconf -mhgrp.

Why is it called 8 and not 4? This mixes up the Trusty and Precise numbering progression and causes confusion...

Removed tools-webgrid-04 as a submit host with qconf -ds tools-webgrid-04.eqiad.wmflabs, from the queue with qconf -mhgrp @webgrid, and as an exec host with qconf -de tools-webgrid-04.eqiad.wmflabs. Finally, ran rm -f /data/project/.system/store/*-tools-webgrid-04.eqiad.wmflabs to remove the instance's host key from ssh.

scfc updated the task description.

> Why is it called 8 and not 4? This mixes up the Trusty and Precise numbering progression and causes confusion...

Because I didn't want to have to deal with host keys lingering somewhere as well. Also, I simply wasn't aware that there was an underlying scheme to the numbering; whenever I needed to know, I looked at the individual instance. With multiple prefixes (exec, webgrid, webgrid-generic) I couldn't (and can't) remember which OS release goes with which number anyhow.

Fair enough. We should start phasing Precise out soon anyway.