Resetup tools-webgrid-04 due to /var being too small
Closed, ResolvedPublic

Assigned To

Authored By

	scfc
	Apr 9 2015, 10:46 AM

Description

Free space on /var on tools-webgrid-04 is nearly exhausted (98% used, 37M left), causing frequent Shinken alerts:

scfc@tools-webgrid-04:~$ df -h
Filesystem                                       Size    Used Avail  Use% Mounted on
/dev/vda1                                        7,6G    3,8G  3,5G   53% /
udev                                             7,9G     12K  7,9G    1% /dev
tmpfs                                            1,6G    2,2M  1,6G    1% /run
none                                             5,0M       0  5,0M    0% /run/lock
none                                             7,9G       0  7,9G    0% /run/shm
/dev/vda2                                        1,9G    1,8G   37M   98% /var
labstore.svc.eqiad.wmnet:/project/tools/project   40T     11T   30T   26% /data/project
labstore.svc.eqiad.wmnet:/project/tools/home      40T     11T   30T   26% /home
labstore.svc.eqiad.wmnet:/keys                   960M     49M  911M    6% /public/keys
labstore1003.eqiad.wmnet:/dumps                   44T     12T   33T   26% /public/dumps
labstore.svc.eqiad.wmnet:/scratch                7,3T    1,4T  6,0T   19% /data/scratch
scfc@tools-webgrid-04:~$
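
As a rough manual check for the same condition, the df output can be filtered for file systems above a usage threshold. This is only a sketch: the function name full_mounts and the 90% threshold are assumptions for illustration, and the parsing expects POSIX-format df output (df -P) with mount points that contain no spaces.

```shell
#!/bin/sh
# Print every mount point whose usage is at or above a threshold.
# Reads `df -P` output on stdin, so the parsing does not depend on the host.
# `full_mounts` and the default of 90 are illustrative assumptions.
full_mounts() {
    threshold="${1:-90}"
    awk -v limit="$threshold" '
        NR > 1 {                      # skip the header line
            pct = $5                  # "Capacity" column, e.g. "98%"
            sub(/%$/, "", pct)
            if (pct + 0 >= limit) print $6, pct "%"
        }'
}
```

It could be used as LC_ALL=C df -P | full_mounts 90; setting LC_ALL=C keeps the column layout and number format independent of the user's locale.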

I have disabled the queue on this host with:

scfc@tools-bastion-01:~$ qmod -d webgrid-lighttpd@tools-webgrid-04.eqiad.wmflabs
scfc@tools-bastion-01.eqiad.wmflabs changed state of "webgrid-lighttpd@tools-webgrid-04.eqiad.wmflabs" (disabled)
scfc@tools-bastion-01:~$

In a few hours I will:

reschedule the jobs on that host (qhost -h tools-webgrid-04 -j, qmod -rj),
delete the instance,
create a new Precise instance,
copy the configuration settings from tools-webgrid-04,
copy /etc/hosts,
copy /usr/local/bin/gridengine-mailer (cf. T63160),
check that SGE recognizes the new host, and
reenable the queue (qmod -e webgrid-lighttpd@tools-webgrid-04.eqiad.wmflabs).
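
The grid engine part of the plan above can be written down as a dry run that only prints the commands for review. This is a sketch under assumptions: drain_plan is a hypothetical helper name, the .eqiad.wmflabs suffix is taken from the host names in this task, and the instance deletion, recreation, and file copying steps are left as a comment because they are done outside the grid engine.

```shell
#!/bin/sh
# Dry-run sketch: print the grid engine commands for draining and later
# re-enabling one queue instance. Nothing is executed here.
# `drain_plan` is a hypothetical helper name.
drain_plan() {
    host="$1"     # short host name, e.g. tools-webgrid-04
    queue="$2"    # queue name, e.g. webgrid-lighttpd
    fqdn="$host.eqiad.wmflabs"
    echo "qmod -d $queue@$fqdn"
    echo "qhost -h $host -j"
    echo "# for each job ID listed by qhost: qmod -rj <job-id>"
    echo "# delete and recreate the instance, copy configuration and files"
    echo "qmod -e $queue@$fqdn"
}

drain_plan tools-webgrid-04 webgrid-lighttpd
```

Printing instead of executing keeps the sketch safe to run anywhere; the commands would be run by hand on a bastion once each step has been checked.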

Related Objects

Mentioned In: T109417: 'new exec node' checklist
T97904: Make a decommissioning checklist for toollabs VMs
Mentioned Here: T63160: Error mails from SGE are encoded as application/octet-stream

Event Timeline

scfc created this task. Apr 9 2015, 10:46 AM

scfc claimed this task.

scfc raised the priority of this task to High.

scfc updated the task description.

scfc added a project: Toolforge.

scfc moved this task to In Progress on the Toolforge board.

scfc subscribed.

Restricted Application added a subscriber: Aklapper. Apr 9 2015, 10:46 AM

scfc updated the task description. Apr 9 2015, 11:19 PM

scfc set Security to None.

scfc updated the task description. Apr 10 2015, 1:31 AM

Added the host to the host group with qconf -mhgrp \@webgrid, restarted execd with service gridengine-exec restart, and checked with diff -u <(qconf -sq webgrid-lighttpd@tools-webgrid-07.eqiad.wmflabs) <(qconf -sq webgrid-lighttpd@tools-webgrid-08.eqiad.wmflabs) that only the hostname differs. Then added it as a submit host with qconf -as tools-webgrid-08.eqiad.wmflabs. However, webservice still put a job on tools-webgrid-02 instead of tools-webgrid-08. Perhaps a resource for Precise is missing?

diff -u <(qconf -se tools-webgrid-01.eqiad.wmflabs) <(qconf -se tools-webgrid-08.eqiad.wmflabs) shows the difference between the resources.
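
The same kind of comparison can be made less noisy by masking the host names before diffing, so that only real differences (such as resources) remain. A minimal sketch, assuming the two configurations were first saved to plain files, for example with qconf -se HOST > HOST.conf; diff_ignoring_host is a hypothetical helper, not an existing tool.

```shell
#!/bin/sh
# Compare two saved configuration dumps after replacing each host name with
# a common placeholder, so the expected hostname difference is not reported.
# `diff_ignoring_host` is a hypothetical helper name.
diff_ignoring_host() {
    file_a="$1"; host_a="$2"
    file_b="$3"; host_b="$4"
    tmp_a=$(mktemp) || return 2
    tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 2; }
    # Host names contain dots; escape them so sed matches them literally.
    pat_a=$(printf '%s' "$host_a" | sed 's/\./\\./g')
    pat_b=$(printf '%s' "$host_b" | sed 's/\./\\./g')
    sed "s/$pat_a/HOST/g" "$file_a" > "$tmp_a"
    sed "s/$pat_b/HOST/g" "$file_b" > "$tmp_b"
    diff -u "$tmp_a" "$tmp_b"
    status=$?                 # 0: identical, 1: differences found
    rm -f "$tmp_a" "$tmp_b"
    return "$status"
}
```

An exit status of 0 then means the two hosts differ only in their names, while 1 means the printed diff shows something else that needs attention.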

And that was the Gordian knot:

tools.typoscan@tools-bastion-01:~$ qstat -xml
<?xml version='1.0'?>
<job_info  xmlns:xsd="http://gridengine.sunsource.net/source/browse/*checkout*/gridengine/source/dist/util/resources/schemas/qstat/qstat.xsd?revision=1.11">
  <queue_info>
    <job_list state="running">
      <JB_job_number>9799151</JB_job_number>
      <JAT_prio>0.30001</JAT_prio>
      <JB_name>lighttpd-typoscan</JB_name>
      <JB_owner>tools.typoscan</JB_owner>
      <state>r</state>
      <JAT_start_time>2015-04-10T02:02:09</JAT_start_time>
      <queue_name>webgrid-lighttpd@tools-webgrid-08.eqiad.wmflabs</queue_name>
      <slots>1</slots>
    </job_list>
  </queue_info>
  <job_info>
  </job_info>
</job_info>
tools.typoscan@tools-bastion-01:~$
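
To see at a glance which queue instance each job landed on, the two relevant elements can be pulled out of the XML. This is a line-based sketch, not a real XML parser: it assumes one element per line with JB_name before queue_name in every job, as in the output above, and job_queues is a hypothetical helper name.

```shell
#!/bin/sh
# Print "<job name> <queue instance>" for each job found in `qstat -xml`
# output read from stdin. Line-based parsing: assumes one element per line
# and that JB_name precedes queue_name within each job.
job_queues() {
    sed -n \
        -e 's:.*<JB_name>\(.*\)</JB_name>.*:\1:p' \
        -e 's:.*<queue_name>\(.*\)</queue_name>.*:\1:p' |
    paste -d ' ' - -          # join each name line with its queue line
}
```

It could be used as qstat -xml | job_queues.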

It was not necessary to enable the queue for this host, as that apparently happened as part of qconf -mhgrp.

Why is it called 8 and not 4? This mixes up trusty and precise progression
and causes confusion...

Removed tools-webgrid-04 as a submit host with qconf -ds tools-webgrid-04.eqiad.wmflabs, from the queue with qconf -mhgrp \@webgrid, and as an exec host with qconf -de tools-webgrid-04.eqiad.wmflabs. Finally, ran rm -f /data/project/.system/store/*-tools-webgrid-04.eqiad.wmflabs to remove the instance from ssh.

scfc closed this task as Resolved. Apr 10 2015, 2:20 AM

scfc updated the task description.

In T95537#1196614, @yuvipanda wrote:

Why is it called 8 and not 4? This mixes up trusty and precise progression
and causes confusion...

Because I didn't want to have to deal with host keys lingering somewhere as well. Also, I simply wasn't aware that there was an underlying scheme to the numbering. I always looked at the individual instance when that was necessary to know. With multiple prefixes (exec, webgrid, webgrid-generic) I couldn't (and can't) remember OS releases relating to numbers anyhow.

Fair enough. We should start phasing precise out soon anyway.

scfc mentioned this in T97904: Make a decommissioning checklist for toollabs VMs. May 2 2015, 5:55 PM

scfc mentioned this in T109417: 'new exec node' checklist. Aug 18 2015, 5:49 PM

Phabricator_maintenance removed a subscriber: yuvipanda. Jun 7 2017, 6:52 PM

Restricted Application added a project: Cloud-Services. Jun 7 2017, 6:52 PM
