
Fix partitions on CI slaves, some are missing /var/lib/docker
Open, Medium, Public

Description

Some integration-slave-docker instances have been provisioned with a dedicated /var/lib/docker partition, but others have not. It is a bit confusing.

$ sudo cumin --trace --force 'name:docker' 'mount -t ext4|sort'
13 hosts will be targeted:
integration-slave-docker-[1021,1034,1037,1040-1041,1043,1048-1054].integration.eqiad.wmflabs
===== NODE GROUP =====
(3) integration-slave-docker-[1021,1049,1053].integration.eqiad.wmflabs
----- OUTPUT of 'mount -t ext4|sort' -----
/dev/mapper/vd-second--local--disk on /srv type ext4 (rw,relatime,data=ordered)
/dev/vda3 on / type ext4 (rw,relatime,data=ordered)
===== NODE GROUP =====
(10) integration-slave-docker-[1034,1037,1040-1041,1043,1048,1050-1052,1054].integration.eqiad.wmflabs
----- OUTPUT of 'mount -t ext4|sort' -----
/dev/mapper/vd-docker on /var/lib/docker type ext4 (rw,relatime,data=ordered)
/dev/mapper/vd-second--local--disk on /srv type ext4 (rw,relatime,data=ordered)
/dev/vda3 on / type ext4 (rw,relatime,data=ordered)
================

Or in short, the following instances lack a dedicated /var/lib/docker partition (a quick per-host check is sketched after the list):

integration-slave-docker-1021.integration.eqiad.wmflabs
integration-slave-docker-1049.integration.eqiad.wmflabs
integration-slave-docker-1053.integration.eqiad.wmflabs
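
A quicker way to spot these than eyeballing the grouped mount output is to ask each host whether /var/lib/docker is a mount point at all. This is only a sketch and assumes mountpoint(1) from util-linux is installed on the instances:

$ sudo cumin --force 'name:docker' 'mountpoint -q /var/lib/docker || echo "no dedicated /var/lib/docker"'

Since cumin groups hosts by identical output, the instances missing the partition end up together in one node group.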

Maybe that is because they predate the introduction of /var/lib/docker and Puppet is unable to magically shuffle the partitions for us, in which case we would have to provision new instances and delete the old ones.

Event Timeline

I have added two Jenkins slaves with role::ci::slave::labs::docker and $docker_lvm_volume = true. One got a /var/lib/docker sub-partition, the other did not :]
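
For reference, a minimal way to check a freshly created agent after a Puppet run; this is a sketch assuming the standard puppet agent and util-linux tools on the instance, not the exact procedure used:

$ sudo puppet agent --test
$ findmnt /var/lib/docker    # prints nothing if no dedicated mount exists
$ sudo lvs vd                # list logical volumes in the "vd" volume group

If findmnt prints nothing and lvs shows no docker volume, Puppet did not create the logical volume on that run.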

Jdforrester-WMF assigned this task to hashar.
Jdforrester-WMF subscribed.

It appears this got fixed in the rebuild for stretch:

jforrester@integration-cumin-01:~$ sudo cumin --trace --force 'name:docker' 'mount -t ext4|sort'
16 hosts will be targeted:
integration-agent-docker-[1001-1014,1016].integration.eqiad.wmflabs,integration-agent-puppet-docker-1002.integration.eqiad.wmflabs
FORCE mode enabled, continuing without confirmation
===== NODE GROUP =====
(5) integration-agent-docker-[1001-1005].integration.eqiad.wmflabs
----- OUTPUT of 'mount -t ext4|sort' -----
/dev/mapper/vd-docker on /var/lib/docker type ext4 (rw,relatime,data=ordered)
/dev/mapper/vd-second--local--disk on /srv type ext4 (rw,relatime,data=ordered)
/dev/vda3 on / type ext4 (rw,relatime,data=ordered)
===== NODE GROUP =====
(11) integration-agent-docker-[1006-1014,1016].integration.eqiad.wmflabs,integration-agent-puppet-docker-1002.integration.eqiad.wmflabs
----- OUTPUT of 'mount -t ext4|sort' -----
/dev/mapper/vd-docker on /var/lib/docker type ext4 (rw,relatime,data=ordered)
/dev/mapper/vd-second--local--disk on /srv type ext4 (rw,relatime,data=ordered)
/dev/vda2 on / type ext4 (rw,relatime,data=ordered)
================
PASS:  |████████████████████████████████████████████████████████████████████████| 100% (16/16) [00:00<00:00, 19.79hosts/s]
FAIL:  |                                                                                 |   0% (0/16) [00:00<?, ?hosts/s]
100.0% (16/16) success ratio (>= 100.0% threshold) for command: 'mount -t ext4|sort'.
100.0% (16/16) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

It is a race condition / improper ordering somewhere in our Puppet manifests. The reason our instances are all fine now is that I manually fixed them upon creation :]
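
For the record, the manual fix on an affected instance looks roughly like the following. This is a sketch rather than the exact commands used: the volume group name "vd" comes from the mount output above, while the size and the data-copy steps are assumptions.

$ sudo systemctl stop docker
$ sudo lvcreate -n docker -L 24G vd        # size is a guess for illustration
$ sudo mkfs.ext4 /dev/vd/docker
$ sudo mv /var/lib/docker /var/lib/docker.old
$ sudo mkdir /var/lib/docker
$ echo '/dev/mapper/vd-docker /var/lib/docker ext4 defaults 0 2' | sudo tee -a /etc/fstab
$ sudo mount /var/lib/docker
$ sudo cp -a /var/lib/docker.old/. /var/lib/docker/    # keep existing images/layers
$ sudo rm -rf /var/lib/docker.old
$ sudo systemctl start docker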