mediawiki-quibble docker jobs fails due to disk full
Closed, Resolved | Public | Production Error

Description

On integration-slave-docker-1026

There was 1 error:

1) MediaWiki\Tests\Storage\NoContentModelRevisionStoreDbTest::testNewRevisionFromArchiveRow_legacyEncoding
Wikimedia\Rdbms\DBQueryError: A database query error has occurred. Did you forget to run your application's database schema updater after upgrading? 
Query: INSERT  INTO `unittest_logging` (log_type,log_action,log_timestamp,log_namespace,log_title,log_page,log_params,log_comment,log_user,log_user_text) VALUES ('create','create','20180821204148','0','MediaWiki\\Tests\\Storage\\RevisionStoreDbTestBase::testNewRevisionFromArchiveRow_legacyEncoding','1','a:1:{s:17:\"associated_rev_id\";i:1;}','MediaWiki\\Tests\\Storage\\RevisionStoreDbTestBase::testNewRevisionFromArchiveRow_legacyEncoding','0','127.0.0.1')
Function: ManualLogEntry::insert
Error: 1114 The table 'unittest_logging' is full (/tmp/quibble-mysql-v3yd_ue9/socket)

https://integration.wikimedia.org/ci/job/mediawiki-quibble-vendor-mysql-hhvm-docker/4710/console

May be related:

Event Timeline

integration-slave-docker-1026 is a new Docker host that has several Jenkins executors. It was pooled yesterday as part of T201972.

Quibble configures the MySQL data directory under a temporary directory, which happens to be /tmp. That is on the root partition, which is 20GB and holds the system and all the Docker images.

dduvall triaged this task as High priority.

I've reduced the number of executors to 5 for integration-slave-docker-1026. However, as a long-term fix, it would be better to remove the bottleneck by having Quibble set up MySQL with its datadir under the workspace, since the workspace is mounted on the host's secondary LVM volume.

It is not only MySQL:

20:41:00 error: unable to write file oyejorge/less.php/lib/Less/Parser.php
20:41:00 fatal: cannot create directory at 'oyejorge/less.php/lib/Less/SourceMap': No space left on device
20:41:00 warning: Clone succeeded, but checkout failed.
20:41:00 You can inspect what was checked out with 'git status'
20:41:00 and retry the checkout with 'git checkout -f HEAD'

https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php70-docker/8208/console

Change 454706 had a related patch set uploaded (by Dduvall; owner: Dduvall):
[integration/config@master] Use a volume under workspace for /tmp in docker containers

https://gerrit.wikimedia.org/r/454706

Quibble configures the MySQL data directory under a temporary directory, which happens to be /tmp. That is on the root partition, which is 20GB and holds the system and all the Docker images.

My comment quoted above is misleading: it suggests that Quibble inside the container writes to the host /tmp (the / partition). There is an intermediate step that causes / to fill:

  • /tmp is inside the container, so it is handled by Docker
  • the CI instances run Docker with the overlay2 storage driver, which stores its data in /var/lib/docker and thus on the host / partition
  • MySQL writes to the container's /tmp, which fills the storage allocated by Docker. That storage is on /

So yeah, writing to /tmp in the container indirectly fills the host /.
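
To check that chain on a given host, a minimal Python sketch along these lines could be used. It assumes the paths mentioned in this task (/var/lib/docker and /srv) and compares device IDs, which is a reasonable but not foolproof way to tell whether two paths share a filesystem. The helper names are made up for illustration.

```python
import os
import shutil


def same_filesystem(path_a, path_b):
    """Return True when both paths report the same device ID."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def percent_free(path):
    """Percentage of free bytes on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


if __name__ == "__main__":
    # Paths taken from this task; they may not exist on other machines.
    for path in ("/var/lib/docker", "/srv"):
        if os.path.exists(path):
            print(
                path,
                "on the same filesystem as /:", same_filesystem(path, "/"),
                "| free: %.1f%%" % percent_free(path),
            )
```

If /var/lib/docker reports the same device as / while /srv does not, container writes land on the root partition, as described above.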


On the host, the jobs run in /srv/jenkins-workspace/workspace/ which is the extended disk partition /srv.

In https://gerrit.wikimedia.org/r/454706 @dduvall proposes creating a tmp directory in the workspace and mounting it inside the container as /tmp. That solves it.
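
As a rough illustration of that idea (not the actual content of the change, which lives in the integration/config job definitions), here is a Python sketch that creates a tmp directory under a workspace and builds a `docker run` command line mounting it at /tmp. The function name and the example image are placeholders.

```python
import os
import tempfile


def docker_run_args(workspace, image):
    """Build a `docker run` argv that mounts <workspace>/tmp as /tmp."""
    host_tmp = os.path.join(workspace, "tmp")
    # The directory lives in the workspace, so it sits on the same
    # partition as the workspace rather than in Docker's own storage.
    os.makedirs(host_tmp, exist_ok=True)
    return [
        "docker", "run", "--rm",
        "--volume", "%s:/tmp" % host_tmp,
        image,
    ]


if __name__ == "__main__":
    # A throwaway directory stands in for the Jenkins workspace here.
    workspace = tempfile.mkdtemp()
    print(" ".join(docker_run_args(workspace, "example/image:latest")))
```

With a bind mount like this, whatever MySQL writes to /tmp inside the container ends up under the workspace on the host.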

Eventually we want Docker to be on the extended disk space instead of the root partition. That is T178663: Switch CI Docker Storage Driver to its own partition and to use devicemapper.

Quibble paths and the volume mounts we do in CI jobs are a bit of a mess. It would be way easier to just mount the whole workspace instead of individual directories (cache, log, src).

Yesterday was the first time I'd seen all the various docker builders in integration/config/jjb/macros-docker.yaml so this is only a first impression, but it seemed like they could be greatly simplified if the directory creation and cleanup were refactored in the way you're describing.

Solved / worked around by https://gerrit.wikimedia.org/r/454706

I guess we can mark this task resolved and file another one to rethink the partitioning of labs instances (based on above comment T202457#4525849 and the discussion we had on IRC)?

Change 454706 merged by jenkins-bot:
[integration/config@master] Use a volume under workspace for /tmp in docker containers

https://gerrit.wikimedia.org/r/454706

Krinkle renamed this task from mediawiki-quibble jobs fails due to disk full (sql insert failed) to mediawiki-quibble docker jobs fails due to disk full. Sep 1 2018, 12:48 AM
Krinkle removed a project: Patch-For-Review.

From https://integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-php70-docker/432/console

npm ERR! tar.unpack untar error /cache/npm/lodash/4.17.10/package.tgz
...
npm ERR! Linux 4.9.0-0.bpo.7-amd64
npm ERR! argv "/usr/bin/nodejs" "/usr/local/bin/npm" "install"
npm ERR! node v6.11.0
npm ERR! npm  v3.8.3
npm ERR! code ENOSPC
npm ERR! errno -28
npm ERR! syscall write

npm ERR! nospc ENOSPC: no space left on device, write

From https://integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-php70-docker/433/console

Building remotely on integration-slave-docker-1026 (stats-T201972.bigmem blubber DebianJessieDocker m4executor) in workspace /srv/jenkins-workspace/workspace/wmf-quibble-core-vendor-mysql-php70-docker@2
...
npm ERR! tar.unpack untar error /cache/npm/globby/5.0.0/package.tgz
npm WARN install:globby@5.0.0 ENOSPC: no space left on device, mkdir '/workspace/src/node_modules/.staging/globby-76a46398'

Both builds from integration-slave-docker-1026 (stats-T201972.bigmem blubber DebianJessieDocker m4executor) in workspace /srv/jenkins-workspace/workspace/wmf-quibble-core-vendor-mysql-php70-docker@2

<shinken-wm> PROBLEM - Free space - all mounts on integration-slave-docker-1026 is CRITICAL: CRITICAL: integration.integration-slave-docker-1026.diskspace.root.byte_percentfree (<22.22%)

I've depooled this slave for now. But it seems like something that should be avoided by making sure that the disk size matches the space needed for the number of concurrent jobs configured. And where multiple types of jobs can run on a host (but aren't currently running), their old workspaces should typically be removed. This is sometimes not done so that the workspace's git directory can serve as a cache to speed up the next build of the same type, but afaik zuul-cloner's cache is used for that instead.

https://integration.wikimedia.org/ci/job/wmf-quibble-vendor-mysql-hhvm-docker/5033/console

Building remotely on integration-slave-docker-1026 
...
00:00:32.803 fatal: cannot create directory at 'wikimedia/timestamp': No space left on device
00:00:32.803 warning: Clone succeeded, but checkout failed.

EDIT, and again.
https://integration.wikimedia.org/ci/job/wmf-quibble-vendor-mysql-hhvm-docker/5034/consoleFull

00:02:35.918   Failed to clone                                                                            
00:02:35.920   - https://github.com/JetBrains/phpstorm-stubs.git                                                    
00:02:35.922     Cloning into '/workspace/src/vendor/jetbrains/phpstorm-stubs'...                                   
00:02:35.923     fatal: write error: No space left on device

Each running quibble container takes up space (sometimes a lot of space). This might explain some of the problem.

thcipriani@integration-slave-docker-1026:~$ sudo docker ps -s
CONTAINER ID   IMAGE                                                             COMMAND                  CREATED         STATUS         PORTS   NAMES                     SIZE
372446a0b264   docker-registry.wikimedia.org/releng/quibble-stretch:0.0.21-7     "/usr/local/bin/quib…"   2 minutes ago   Up 2 minutes           compassionate_blackwell   559MB (virtual 1.58GB)
96aa93483f5e   docker-registry.wikimedia.org/releng/quibble-stretch:0.0.23       "/usr/local/bin/quib…"   5 minutes ago   Up 5 minutes           goofy_meitner             491MB (virtual 1.51GB)
bca763831d72   docker-registry.wikimedia.org/releng/quibble-stretch:0.0.23       "/usr/local/bin/quib…"   5 minutes ago   Up 5 minutes           elastic_ardinghelli       491MB (virtual 1.51GB)
1adfed95df69   docker-registry.wikimedia.org/releng/quibble-jessie-hhvm:0.0.23   "/usr/local/bin/quib…"   7 minutes ago   Up 7 minutes           peaceful_jackson          1.2GB (virtual 2.21GB)
eae9ba3a1459   docker-registry.wikimedia.org/releng/quibble-jessie-hhvm:0.0.23   "/usr/local/bin/quib…"   5 hours ago     Up 5 hours             adoring_lumiere           945MB (virtual 1.96GB)
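
To put a number on it, the SIZE column can be summed. In `docker ps -s` output the first figure is the container's writable layer and the "virtual" figure also counts the read-only image layers, which are shared between containers of the same image, so only the first figure is added up here. A minimal Python sketch, assuming Docker prints decimal units (the helper name is made up):

```python
import re

# Decimal multipliers; assumed to match how Docker prints sizes.
UNITS = {"B": 1, "kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}

SIZE_FIELD = re.compile(
    r"([\d.]+)\s*([kMGT]?B)\s*\(virtual\s+([\d.]+)\s*([kMGT]?B)\)"
)


def parse_size_field(field):
    """Turn '559MB (virtual 1.58GB)' into (writable_bytes, virtual_bytes)."""
    match = SIZE_FIELD.search(field)
    if match is None:
        raise ValueError("unrecognised SIZE field: %r" % field)
    num, unit, vnum, vunit = match.groups()
    return round(float(num) * UNITS[unit]), round(float(vnum) * UNITS[vunit])


# SIZE values copied from the listing above.
SIZES = [
    "559MB (virtual 1.58GB)",
    "491MB (virtual 1.51GB)",
    "491MB (virtual 1.51GB)",
    "1.2GB (virtual 2.21GB)",
    "945MB (virtual 1.96GB)",
]

writable_total = sum(parse_size_field(size)[0] for size in SIZES)
print("writable layers: %.2f GB" % (writable_total / 10**9))
```

For the five containers listed, the writable layers add up to roughly 3.7 GB of the 20GB root partition, on top of the images themselves.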

There is also a problem where sometimes a container that is taking up space is not stopped or destroyed; see T198517: Quibble docker instance running on CI instance for 6 hours.

Just to reiterate what was talked about in IRC (#wikimedia-releng), one long term solution to the Docker disk space issue might be to:

  1. Free up LVM extents on /dev/vda4 currently used by the "second-local-disk" logical volume mounted at /srv. (Modify profile::labs::lvm::srv to specify a size much less than "100%FREE", the default.)
  2. Configure dockerd to use the device mapper storage driver and /dev/vda4.

An alternative to switching storage drivers would be to divvy up the volume group into two logical volumes, and mount them at separate directories for both the Jenkins workspace and Docker (e.g. /srv/jenkins, and /srv/docker).

Change 457918 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Wipe src upon Quibble jobs completion

https://gerrit.wikimedia.org/r/457918

Change 457918 merged by jenkins-bot:
[integration/config@master] Wipe src upon Quibble jobs completion

https://gerrit.wikimedia.org/r/457918

The Quibble jobs now delete the src directory when the build is completed. So at least /srv on the hosts will no longer be filled by that.
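
The cleanup pattern is roughly the following, shown here as a Python sketch rather than the actual job configuration from the change. `run_build_then_wipe` and the `build` callback are hypothetical names; the point is only that src is removed whether the build passes or fails.

```python
import os
import shutil
import tempfile


def run_build_then_wipe(workspace, build):
    """Run build(src_dir), then delete <workspace>/src no matter what."""
    src = os.path.join(workspace, "src")
    os.makedirs(src, exist_ok=True)
    try:
        return build(src)
    finally:
        # Runs on success and on failure, so no checkout is left behind.
        shutil.rmtree(src, ignore_errors=True)


if __name__ == "__main__":
    workspace = tempfile.mkdtemp()  # stands in for the Jenkins workspace

    def fake_build(src_dir):
        # Pretend to check out some code into src.
        with open(os.path.join(src_dir, "checkout.txt"), "w") as handle:
            handle.write("placeholder")
        return "done"

    print(run_build_then_wipe(workspace, fake_build))
    print("src still present:", os.path.exists(os.path.join(workspace, "src")))
```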

hashar lowered the priority of this task from High to Medium. Sep 11 2018, 7:53 PM

Each of the 90 Quibble jobs left behind at least one full copy of mediawiki/core. With 5 concurrent builds happening on the new slaves, that would mean up to 450 copies of mediawiki/core lying around idle. The jobs now delete src upon build completion.

The subtasks still have to be completed.
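
The 450 figure comes from Jenkins giving each concurrent build of a job its own workspace directory (the "@2" suffix in the workspace path quoted earlier in this task), so one host can hold up to one workspace per job per executor. A small Python sketch of that arithmetic, with made-up function names:

```python
def workspace_dirs(job_name, executors):
    """Workspace names one job can occupy on a host: job, job@2, ... job@N."""
    return [job_name] + [
        "%s@%d" % (job_name, n) for n in range(2, executors + 1)
    ]


def max_leftover_copies(job_count, executors):
    """Upper bound on leftover checkouts when every job used every slot."""
    return job_count * executors


if __name__ == "__main__":
    print(workspace_dirs("wmf-quibble-core-vendor-mysql-php70-docker", 5))
    print(max_leftover_copies(90, 5))
```

This is a worst case: it is only reached if every job has at some point run in every executor slot of the host.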

Grafana trends for the past day show that disk usage on both logical volumes for Docker and Jenkins workspaces is stable.

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:09 PM