
contint1001 store docker images on separate partition or disk
Open, High, Public

Description

Currently contint1001 stores all docker images on the root partition (which is only 50GB). It seems like this could cause trouble if we build one too many docker images and fill the whole disk (which seems like an easy thing to do in the current setup).

Not sure how difficult it would be to add storage to this machine or to reallocate the existing storage; filing this task to find out.

On T178663#3699074 @hashar wrote:

Looks like profile::docker::storage has the logic to set up a partition, with parameters:

# list of physical volumes to use.
$physical_volumes = hiera('profile::docker::storage::physical_volumes'),
# Volume group to substitute.
$vg_to_remove = hiera('profile::docker::storage::vg_to_remove'),

It seems to create a new volume group docker with logical volumes data and metadata.

profile::docker::storage::physical_volumes would be the physical volume.
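
For orientation, the layout described above (a volume group named docker holding logical volumes data and metadata) corresponds roughly to the LVM commands below. This is only a sketch: the function name, the /dev/sdX device, and the 95%/1% sizing are assumptions made for illustration, and the commands the Puppet profile actually runs are not shown in this task. The function prints the commands rather than executing them.

```shell
# Print (do not run) LVM commands that would build a "docker" volume group
# with "data" and "metadata" logical volumes on the given physical volume.
# Hypothetical helper: the sizes and the device name are assumptions.
docker_storage_plan() {
    pv="$1"
    if [ -z "$pv" ]; then
        echo "usage: docker_storage_plan <physical-volume>" >&2
        return 1
    fi
    echo "pvcreate $pv"
    echo "vgcreate docker $pv"
    echo "lvcreate --name data --extents 95%VG docker"
    echo "lvcreate --name metadata --extents 1%VG docker"
}

# Example (prints four lines, changes nothing on disk):
#   docker_storage_plan /dev/sdX
```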

contint1001 has a 1TB disk, and the whole physical volume / volume group is already allocated:

# pvdisplay
  --- Physical volume ---
  PV Name               /dev/md2
  VG Name               contint1001-vg
  PV Size               883.89 GiB / not usable 3.00 MiB
  Allocatable           yes (but full)
# vgdisplay 
  --- Volume group ---
  VG Name               contint1001-vg
  Format                lvm2
  VG Status             resizable
  Cur LV                1
  VG Size               883.89 GiB
root@contint1001:~# lvdisplay 
  --- Logical volume ---
  LV Path                /dev/contint1001-vg/data
  LV Name                data
  VG Name                contint1001-vg
  LV Size                883.89 GiB

Seems to me we would have to shrink the logical volume /dev/contint1001-vg/data and the volume group contint1001-vg.


Usage as of January 25th 2019

$ df -h / /srv
Filesystem                        Size  Used Avail Use% Mounted on
/dev/md0                           46G   39G  5.0G  89% /
/dev/mapper/contint1001--vg-data  870G  544G  283G  66% /srv

/ has Docker images (via /var/lib/docker) which is the concern: docker can fill the root partition.

/srv has Jenkins build results (large), zuul-merge repositories (~ 25GB), integration.wikimedia.org docroot (small)

We would want to shrink the volume group and create a new one for Docker images which would be at /srv/docker.
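
To get a feel for how much room a shrink would leave, the df figures above can be plugged into a small calculation. This is a rough sketch: the function name is hypothetical, the 256G amount is the figure floated later in this thread, and filesystem overhead is ignored.

```shell
# Integer percentage of a filesystem that would be in use after shrinking it.
# Arguments: current size, used space, amount to give away (all in GB).
usage_after_shrink() {
    size_gb="$1"
    used_gb="$2"
    shrink_gb="$3"
    new_size=$((size_gb - shrink_gb))
    if [ "$new_size" -le 0 ] || [ "$used_gb" -gt "$new_size" ]; then
        echo "shrinking by ${shrink_gb}G does not fit" >&2
        return 1
    fi
    echo $((used_gb * 100 / new_size))
}

# With the /srv numbers above (870G size, 544G used), giving away 256G:
#   usage_after_shrink 870 544 256   # prints 88
```

In other words, carving 256G out of /srv would take it from 66% to roughly 88% used, which is worth weighing against the growth of Jenkins build results.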

Event Timeline

jijiki added a subscriber: jijiki. Oct 23 2018, 3:45 PM
jijiki triaged this task as Normal priority. Oct 26 2018, 7:55 AM

@RobH would you know if it's possible to add physical storage to this machine? If not we'll have to work out a different solution.

RobH added a comment. Nov 16 2018, 9:02 PM

contint1001 was purchased on T130738, and has dual 1TB SATA disks.

We generally don't store anything in the / root partition, but toss all our things in /srv, which is a larger partition.

robh@contint1001:~$ df -h
Filesystem                        Size  Used Avail Use% Mounted on
udev                               10M     0   10M   0% /dev
tmpfs                              13G  1.4G   12G  11% /run
/dev/md0                           46G   33G   11G  76% /
tmpfs                              32G     0   32G   0% /dev/shm
tmpfs                             5.0M     0  5.0M   0% /run/lock
tmpfs                              32G     0   32G   0% /sys/fs/cgroup

@thcipriani: I'd recommend changing your docker storage configuration to store images under the /srv directory (the larger partition).
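
As a sketch of what that configuration change could look like: on Docker releases that support it, the daemon's storage directory is set with the data-root key in daemon.json (older releases used graph instead). The helper name below is hypothetical, and which Docker version contint1001 runs is not stated in this task.

```shell
# Emit a minimal Docker daemon configuration that relocates image storage.
# "data-root" is the dockerd setting for the storage directory; the target
# path is supplied by the caller (for example /srv/docker).
docker_daemon_json() {
    data_root="$1"
    if [ -z "$data_root" ]; then
        echo "usage: docker_daemon_json <directory>" >&2
        return 1
    fi
    printf '{\n  "data-root": "%s"\n}\n' "$data_root"
}

# Example: write the fragment to a scratch file for review.
#   docker_daemon_json /srv/docker > daemon.json
```

Existing images under /var/lib/docker would not move by themselves; they would need to be copied over or rebuilt after the daemon is restarted with the new setting.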

Dzahn added a subscriber: Dzahn. Jan 18 2019, 11:05 PM

This happened today. Was about to make a ticket for it and found this.

17:38 < icinga-wm> PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - free space: / 2510 MB (5% inode=62%)
17:39 < mutante> !log contint1001 - apt-get clean - disk space low

17:42 < Krinkle> mutante: thanks, I guess my docker-pkg rebuilding is contributing somehow
17:42 < Krinkle> it's rebuilding a lot of them, due to a change in the ci-stretch base image.
17:42 < Krinkle> actually, no it isn't. Nevermind.

17:52 < mutante> Krinkle: /var/lib/docker/overlay2 is like 31G of 46 total

17:53 < Krinkle> mutante: The change was pretty small, so I guess it's just slow build up
17:53 * Krinkle tries to find graphs
17:54 < Krinkle> we should probably have a strategy for cleaning up older versions of CI images that aren't used.
17:54 < Krinkle> if we don't already have something for that - assuming that's where they are stored, I don't know.
17:55 < Krinkle> https://grafana.wikimedia.org/d/000000377/host-overview?panelId=12&fullscreen&orgId=1&var-server=contint1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=ci&from=1546904126553&to=1547852116992

17:57 < mutante> !log contint1001 - moved zuul logs from 2018 and gzipped zuul logs from /var/log/zuul to /srv/logs/zuul to free disk space on /

18:00 < icinga-wm> RECOVERY - Disk space on contint1001 is OK: DISK OK

18:00 < mutante> !log contint1001 - gzipping more files in /var/log/zuul/

18:00 < mutante> Krinkle: yep, slow build up. happened before i think
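
One possible cleanup strategy for the concern raised in the log above, assuming the stale CI images live in the local Docker image store (the log leaves that open). The function name and the 720-hour default are illustrative assumptions; the helper only prints the command unless asked to run it.

```shell
# Build (and optionally run) a command that removes unused images older
# than a given number of hours. Dry-run by default: it only prints.
prune_old_images() {
    max_age_hours="${1:-720}"   # 720h = 30 days; an arbitrary example value
    mode="${2:-dry-run}"
    case "$max_age_hours" in
        ''|*[!0-9]*)
            echo "age must be a whole number of hours" >&2
            return 1
            ;;
    esac
    cmd="docker image prune --all --force --filter until=${max_age_hours}h"
    if [ "$mode" = "run" ]; then
        $cmd
    else
        echo "$cmd"
    fi
}

# Example:
#   prune_old_images 720        # show what would be executed
#   prune_old_images 720 run    # actually prune
```

Pruning by age removes any unused image past the cutoff, so images that are rarely used but still needed would have to be rebuilt or pulled again.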

hashar updated the task description. Jan 25 2019, 2:23 PM

updated to integrate my comments from T178663#3699074

Could /srv be shrunk a bit, with a new partition for Docker images at /srv/docker?

Dzahn added a comment. Jan 25 2019, 3:23 PM

Could /srv be shrunk a bit, with a new partition for Docker images at /srv/docker?

Quoting Tyler from the other ticket, though: "We don't want to resize /srv/ as that's already in use for zuul-merger (and is currently using > half the available disk space). Ideally we'd be able to add a disk to this machine just to store images."

Dzahn added a comment. Jan 25 2019, 3:24 PM

Let's ask dcops instead and request a new disk to be added?

Let's ask dcops instead and request a new disk to be added?

@RobH (since you chimed in earlier) is it possible to add an additional disk to contint1001? Ideally, I'd like to avoid using /srv since zuul-merger is already there and using 65% of the storage, and we've only just started the pipeline project (building docker images on contint1001) -- probably more disk space usage in future for docker. Looks like the Dell PowerEdge R430 has 4 drive bays(?).

Since this is recurring, can we check whether we can add a couple of disks to the machine? I guess 256G would be sufficient.

An alternative is to shrink the existing volume group for /srv. It is reasonably busy, but we can look at optimizing the current disk usage (keep fewer artifacts, compress logs, etc.).

Dzahn assigned this task to RobH. Apr 17 2019, 10:11 PM

just assigning for the question in the 2 comments above

Dzahn added a comment. Wed, Apr 24, 5:33 PM

Icinga alerting again:

contint1001 - Disk space
CRITICAL 2019-04-24 17:29:39 0d 1h 10m 17s 3/3 DISK CRITICAL - free space: / 2644 MB (5% inode=65%):

[#wikimedia-oper] !log contint1001 - apt-get clean for 1% more disk space
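
For ad hoc checks between alerts, the usage percentage can be read straight out of df output. The sketch below is not the Icinga check; the function names and any threshold passed in are assumptions made for illustration.

```shell
# Read `df -P` style output on stdin and print the Use% value (without the
# percent sign) for the given mount point. Prints nothing if it is absent.
use_pct_for_mount() {
    mount_point="$1"
    awk -v m="$mount_point" 'NR > 1 && $NF == m { v = $(NF-1); sub(/%/, "", v); print v }'
}

# Succeed if the mount point is below the given usage limit (in percent).
# The argument must be an actual mount point such as / or /srv.
mount_below_threshold() {
    mount_point="$1"
    limit="$2"
    pct="$(df -P "$mount_point" | use_pct_for_mount "$mount_point")"
    [ -n "$pct" ] && [ "$pct" -lt "$limit" ]
}

# Example:
#   mount_below_threshold / 90 || echo "root partition is getting full"
```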

Dzahn raised the priority of this task from Normal to High. Wed, Apr 24, 5:38 PM

Mentioned in SAL (#wikimedia-operations) [2019-04-24T17:52:38Z] <mutante> contint1001 - for logfile in $(find /var/log/zuul/ ! -name "*.gz"); do gzip $logfile; done to get more disk space (T207707)
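
A variant of the loop logged above, as a sketch. The loop as written also hands the /var/log/zuul/ directory itself to gzip and would split file names containing spaces; restricting find to regular files and letting it invoke gzip directly avoids both. The function name is hypothetical.

```shell
# Compress every regular file under a directory that is not already gzipped.
# Unlike a `for f in $(find ...)` loop, this skips directories and copes
# with file names that contain spaces.
compress_plain_logs() {
    log_dir="$1"
    if [ ! -d "$log_dir" ]; then
        echo "not a directory: $log_dir" >&2
        return 1
    fi
    find "$log_dir" -type f ! -name '*.gz' -exec gzip -- {} +
}

# Example:
#   compress_plain_logs /var/log/zuul
```

Note that this also compresses the log file currently being written to, just as the original loop did, so it is best run only on a directory of rotated files.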

Dzahn lowered the priority of this task from High to Normal. Wed, Apr 24, 5:57 PM

Gzipping all files in /var/log/zuul that were not already gzipped saved almost 10G. Usage of / is back to 79% from 95%.

hashar added a comment. Mon, May 6, 9:42 AM

I eventually unzipped them. The reason is that log rotation is handled by Python logging, not by logrotate, so when we gzip the files, logging does not delete the old .gz files :-/
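
If gzipped copies were kept anyway, something outside Python logging would have to expire them. A minimal sketch of such a cleanup, with a hypothetical function name and an arbitrary 30-day default retention (neither comes from this task):

```shell
# Delete gzipped logs last modified more than a given number of days ago.
# Stands in for the expiry that the logging rotation no longer performs
# once rotated files have been renamed to *.gz.
purge_old_gz_logs() {
    log_dir="$1"
    keep_days="${2:-30}"
    if [ ! -d "$log_dir" ]; then
        echo "not a directory: $log_dir" >&2
        return 1
    fi
    find "$log_dir" -type f -name '*.gz' -mtime +"$keep_days" -delete
}

# Example:
#   purge_old_gz_logs /var/log/zuul 30
```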

Dzahn removed a subscriber: Dzahn. Tue, May 7, 3:25 PM
Dzahn reassigned this task from RobH to hashar. Tue, May 14, 5:26 PM

Mentioned in SAL (#wikimedia-operations) [2019-05-14T17:32:46Z] <mutante> contint1001 - mkdir /srv/zuul-logs ; mv /var/log/zuul/debug.log* /srv/zuul-logs/ to prevent CI running out of disk again (T207707)

Dzahn added a comment. Edited Tue, May 14, 5:36 PM

As before, /var/log/zuul is many gigabytes and a large percentage of /, and debug logging is enabled.

Linked previous duplicate tickets. Raising priority.

Pinging @RobH for T207707#4937008

@hashar Because of T207707#5159292, this time I made a /srv/zuul-logs and moved the debug logs there. See above. Also, do we really need (that many) debug logs on a constant basis?

Dzahn raised the priority of this task from Normal to High. Tue, May 14, 5:36 PM
RobH added a subscriber: Dzahn. Tue, May 14, 5:54 PM

Let's ask dcops instead and request a new disk to be added?

@RobH (since you chimed in earlier) is it possible to add an additional disk to contint1001? Ideally, I'd like to avoid using /srv since zuul-merger is already there and using 65% of the storage, and we've only just started the pipeline project (building docker images on contint1001) -- probably more disk space usage in future for docker. Looks like the Dell PowerEdge R430 has 4 drive bays(?).

Cost of this has to be discussed, and it cannot be done on this public task.

I'll create a private subtask for the pricing discussion on adding disks to this system.

RobH mentioned this in Unknown Object (Task). Tue, May 14, 6:24 PM