Maniphest T202457

mediawiki-quibble docker jobs fails due to disk full
Closed, Resolved · Public · PRODUCTION ERROR

Authored By

	Krinkle
	Aug 21 2018, 8:58 PM

Description

On integration-slave-docker-1026

There was 1 error:

1) MediaWiki\Tests\Storage\NoContentModelRevisionStoreDbTest::testNewRevisionFromArchiveRow_legacyEncoding
Wikimedia\Rdbms\DBQueryError: A database query error has occurred. Did you forget to run your application's database schema updater after upgrading? 
Query: INSERT  INTO `unittest_logging` (log_type,log_action,log_timestamp,log_namespace,log_title,log_page,log_params,log_comment,log_user,log_user_text) VALUES ('create','create','20180821204148','0','MediaWiki\\Tests\\Storage\\RevisionStoreDbTestBase::testNewRevisionFromArchiveRow_legacyEncoding','1','a:1:{s:17:\"associated_rev_id\";i:1;}','MediaWiki\\Tests\\Storage\\RevisionStoreDbTestBase::testNewRevisionFromArchiveRow_legacyEncoding','0','127.0.0.1')
Function: ManualLogEntry::insert
Error: 1114 The table 'unittest_logging' is full (/tmp/quibble-mysql-v3yd_ue9/socket)

https://integration.wikimedia.org/ci/job/mediawiki-quibble-vendor-mysql-hhvm-docker/4710/console

May be related:

Details

	Subject	Repo	Branch	Lines +/-
	Wipe src upon Quibble jobs completion	integration/config	master	+19 -0
	Use a volume under workspace for /tmp in docker containers	integration/config	master	+31 -0

Related Objects

Status	Subtype	Assigned	Task
Resolved	PRODUCTION ERROR	• dduvall	T202457 mediawiki-quibble docker jobs fails due to disk full
Resolved		• dduvall	T203841 Provide dedicated storage space to Docker for images/containers
Declined		• dduvall	T203842 Free up LVM extents for Docker devicemapper on new Jenkins Agents

Event Timeline

Krinkle created this task. · Aug 21 2018, 8:58 PM

Restricted Application added a subscriber: Aklapper. · Aug 21 2018, 8:58 PM

hashar updated the task description. · Aug 21 2018, 9:01 PM

integration-slave-docker-1026 is a new Docker host that has several Jenkins executors. It was pooled yesterday as part of T201972.

Quibble configures the MySQL data directory using a temporary directory, which happens to be under /tmp. That is on the root partition, which is 20GB and holds the system and all the Docker images.

I've reduced the number of executors to 5 for integration-slave-docker-1026. However, for a long term fix, it would be better to remove the bottleneck by having quibble set up mysql with its datadir under the workspace, since the workspace is mounted on the host's secondary LVM volume.

Restricted Application edited projects, added Release-Engineering-Team (Kanban); removed Release-Engineering-Team. · Aug 21 2018, 9:14 PM

It is not only MySQL:

20:41:00 error: unable to write file oyejorge/less.php/lib/Less/Parser.php
20:41:00 fatal: cannot create directory at 'oyejorge/less.php/lib/Less/SourceMap': No space left on device
20:41:00 warning: Clone succeeded, but checkout failed.
20:41:00 You can inspect what was checked out with 'git status'
20:41:00 and retry the checkout with 'git checkout -f HEAD'

https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php70-docker/8208/console

thcipriani mentioned this in T201224: Jenkins should auto-depool nodes if they run out of disk space on specific partitions. · Aug 22 2018, 3:16 PM

Krinkle moved this task from Untriaged to In progress on the Wikimedia-production-error (ARCHIVED -- Shared Build Failure) board. · Aug 22 2018, 7:52 PM

Change 454706 had a related patch set uploaded (by Dduvall; owner: Dduvall):
[integration/config@master] Use a volume under workspace for /tmp in docker containers

https://gerrit.wikimedia.org/r/454706

gerritbot added a project: Patch-For-Review. · Aug 22 2018, 10:03 PM

Quibble configures the MySQL data directory using a temporary directory, which happens to be under /tmp. That is on the root partition, which is 20GB and holds the system and all the Docker images.

My comment is misleading: it suggests that Quibble inside the container writes to the host's /tmp (on the / partition). There is an intermediate step that causes / to fill:

/tmp is inside the container, whose filesystem is handled by Docker.
The CI instances run Docker with the overlay2 storage driver, which stores its data in /var/lib/docker and thus on the host's / partition.
MySQL writes to the container's /tmp, which fills the storage allocated by Docker. That storage is on /.

So yes, writing to /tmp in the container indirectly fills the host's /.
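
The chain above can be checked on a host by comparing device IDs. This is a minimal sketch, not something taken from the task: the helper name `same_filesystem` is hypothetical, and `/var/lib/docker` is assumed to be Docker's data root (the default location, and the one named in this comment).

```python
import os


def same_filesystem(path_a: str, path_b: str) -> bool:
    """Return True when both paths live on the same mounted filesystem.

    Two paths share a filesystem when stat() reports the same device ID.
    """
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


if __name__ == "__main__":
    docker_root = "/var/lib/docker"  # assumed Docker data root
    if os.path.exists(docker_root):
        # True means container writes (including /tmp) consume space on /.
        print(same_filesystem(docker_root, "/"))
```

Run on the host rather than inside a container. If it prints True, anything written to a container's writable layer counts against the root partition.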

On the host, the jobs run in /srv/jenkins-workspace/workspace/ which is the extended disk partition /srv.

In https://gerrit.wikimedia.org/r/454706 @dduvall offers to create a tmp directory in the workspace and mount that inside the container as /tmp. That solves it.
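
As a rough illustration of that idea (not the content of the actual patch, which lives in the integration/config job definitions), the sketch below creates a tmp directory under the workspace and builds a `docker run` command line that bind-mounts it at /tmp. The function name `docker_run_args` and the image name in the usage note are hypothetical.

```python
import os


def docker_run_args(workspace: str, image: str) -> list:
    """Build a `docker run` argv that bind-mounts <workspace>/tmp at /tmp.

    The directory is created first so that it exists, and is owned by the
    calling user, before Docker mounts it.
    """
    tmp_dir = os.path.join(workspace, "tmp")
    os.makedirs(tmp_dir, exist_ok=True)
    return [
        "docker", "run", "--rm",
        "--volume", f"{tmp_dir}:/tmp",  # container /tmp now lives under the workspace
        image,
    ]
```

For example, `docker_run_args("/srv/jenkins-workspace/workspace/some-job", "example/image:latest")` returns a list that can be passed to `subprocess.run`. With this mount, MySQL's temporary data directory lands on /srv instead of the container's writable layer on /.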

Eventually we want Docker to be on the extended disk space instead of the root partition. That is T178663: Switch CI Docker Storage Driver to its own partition and to use devicemapper.

Quibble paths and the volume mounts we do in CI jobs are a bit of a mess. It would be way easier to just mount the whole workspace instead of individual directories (cache, log, src).

In T202457#4525849, @hashar wrote:

Quibble paths and the volume mounts we do in CI jobs are a bit of a mess. It would be way easier to just mount the whole workspace instead of individual directories (cache, log, src).

Yesterday was the first time I'd seen all the various docker builders in integration/config/jjb/macros-docker.yaml so this is only a first impression, but it seemed like they could be greatly simplified if the directory creation and cleanup was refactored in the way you're describing.

Solved / worked around by https://gerrit.wikimedia.org/r/454706

I guess we can mark this task resolved and file another one to rethink the partitioning of labs instances (based on the above comment T202457#4525849 and the discussion we had on IRC)?

Change 454706 merged by jenkins-bot:
[integration/config@master] Use a volume under workspace for /tmp in docker containers

https://gerrit.wikimedia.org/r/454706

Krinkle moved this task from In progress to Resolved on the Wikimedia-production-error (ARCHIVED -- Shared Build Failure) board. · Aug 27 2018, 3:42 AM

Krinkle renamed this task from "mediawiki-quibble jobs fails due to disk full (sql insert failed)" to "mediawiki-quibble docker jobs fails due to disk full". · Sep 1 2018, 12:48 AM

Krinkle moved this task from Resolved to In progress on the Wikimedia-production-error (ARCHIVED -- Shared Build Failure) board.

Krinkle removed a project: Patch-For-Review.

From https://integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-php70-docker/432/console

npm ERR! tar.unpack untar error /cache/npm/lodash/4.17.10/package.tgz
...
npm ERR! Linux 4.9.0-0.bpo.7-amd64
npm ERR! argv "/usr/bin/nodejs" "/usr/local/bin/npm" "install"
npm ERR! node v6.11.0
npm ERR! npm  v3.8.3
npm ERR! code ENOSPC
npm ERR! errno -28
npm ERR! syscall write

npm ERR! nospc ENOSPC: no space left on device, write

From https://integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-php70-docker/433/console

Building remotely on integration-slave-docker-1026 (stats-T201972.bigmem blubber DebianJessieDocker m4executor) in workspace /srv/jenkins-workspace/workspace/wmf-quibble-core-vendor-mysql-php70-docker@2
...
npm ERR! tar.unpack untar error /cache/npm/globby/5.0.0/package.tgz
npm WARN install:globby@5.0.0 ENOSPC: no space left on device, mkdir '/workspace/src/node_modules/.staging/globby-76a46398'

Both builds ran on integration-slave-docker-1026 (stats-T201972.bigmem blubber DebianJessieDocker m4executor) in workspace /srv/jenkins-workspace/workspace/wmf-quibble-core-vendor-mysql-php70-docker@2

<shinken-wm> PROBLEM - Free space - all mounts on integration-slave-docker-1026 is CRITICAL: CRITICAL: integration.integration-slave-docker-1026.diskspace.root.byte_percentfree (<22.22%)

I've depooled this slave for now. But this seems like something that should be avoided by making sure that the disk size matches the space needed for the number of concurrent jobs configured. And for cases where multiple types of jobs can run (but aren't running), their old workspaces should typically be removed. This is sometimes not done so that the workspace's git directory can be used as a cache to speed up the next build of the same type, but afaik zuul-cloner's cache is used for that instead.
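
A free-space check of the kind that could drive such a depool decision might look like the sketch below. The function names and any threshold passed in are hypothetical. Nothing here reflects how Jenkins or Shinken actually implement their checks.

```python
import shutil


def percent_free(path: str) -> float:
    """Return the percentage of free bytes on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def below_threshold(path: str, min_percent_free: float) -> bool:
    """Return True when free space on `path` has dropped below the threshold."""
    return percent_free(path) < min_percent_free
```

A caller would run `below_threshold("/", some_limit)` for each partition of interest and take the node out of rotation when it returns True.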

Krinkle merged tasks: T189616: Docker-based Jenkins jobs failing due to "No space left on device", T189361: Error: file write error (No space left on device). · Sep 1 2018, 1:51 AM

Krinkle added subscribers: SamanthaNguyen, Jayprakash12345, Paladox, Legoktm.

https://integration.wikimedia.org/ci/job/wmf-quibble-vendor-mysql-hhvm-docker/5033/console

Building remotely on integration-slave-docker-1026 
...
00:00:32.803 fatal: cannot create directory at 'wikimedia/timestamp': No space left on device
00:00:32.803 warning: Clone succeeded, but checkout failed.

EDIT: and again.
https://integration.wikimedia.org/ci/job/wmf-quibble-vendor-mysql-hhvm-docker/5034/consoleFull

00:02:35.918   Failed to clone
00:02:35.920   - https://github.com/JetBrains/phpstorm-stubs.git
00:02:35.922     Cloning into '/workspace/src/vendor/jetbrains/phpstorm-stubs'...
00:02:35.923     fatal: write error: No space left on device

Umherirrender mentioned this in T203649: quibble-vendor-mysql-hhvm-docker no space left on device, write. · Sep 6 2018, 4:54 PM

Krinkle reopened this task as Open. · Sep 6 2018, 4:54 PM

Krinkle closed this task as a duplicate of T203649: quibble-vendor-mysql-hhvm-docker no space left on device, write.

Krinkle merged a task: T203649: quibble-vendor-mysql-hhvm-docker no space left on device, write.

Krinkle added a subscriber: Physikerwelt.

Each running quibble container takes up space (sometimes a lot of space). This might explain some of the problem.

thcipriani@integration-slave-docker-1026:~$ sudo docker ps -s
CONTAINER ID  IMAGE                                                            COMMAND                 CREATED        STATUS        PORTS  NAMES                    SIZE
372446a0b264  docker-registry.wikimedia.org/releng/quibble-stretch:0.0.21-7    "/usr/local/bin/quib…"  2 minutes ago  Up 2 minutes         compassionate_blackwell  559MB (virtual 1.58GB)
96aa93483f5e  docker-registry.wikimedia.org/releng/quibble-stretch:0.0.23      "/usr/local/bin/quib…"  5 minutes ago  Up 5 minutes         goofy_meitner            491MB (virtual 1.51GB)
bca763831d72  docker-registry.wikimedia.org/releng/quibble-stretch:0.0.23      "/usr/local/bin/quib…"  5 minutes ago  Up 5 minutes         elastic_ardinghelli      491MB (virtual 1.51GB)
1adfed95df69  docker-registry.wikimedia.org/releng/quibble-jessie-hhvm:0.0.23  "/usr/local/bin/quib…"  7 minutes ago  Up 7 minutes         peaceful_jackson         1.2GB (virtual 2.21GB)
eae9ba3a1459  docker-registry.wikimedia.org/releng/quibble-jessie-hhvm:0.0.23  "/usr/local/bin/quib…"  5 hours ago    Up 5 hours           adoring_lumiere          945MB (virtual 1.96GB)
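
To put a number on the listing above, the first figure in each SIZE value (the container's writable layer) can be summed. This is a small sketch under two assumptions: the sizes use decimal units (1 GB = 10^9 bytes), and the strings are copied by hand from the output rather than parsed from a live `docker ps` call. The names `parse_size` and `UNITS` are hypothetical.

```python
import re

# Decimal multipliers (assumption: the sizes shown are powers of 1000).
UNITS = {"B": 1, "kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
SIZE_RE = re.compile(r"^\s*([\d.]+)\s*([kMGT]?B)\b")


def parse_size(text: str) -> int:
    """Return the leading size of a value like '559MB (virtual 1.58GB)' in bytes."""
    match = SIZE_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(round(float(match.group(1)) * UNITS[match.group(2)]))


# SIZE column values copied from the listing above.
sizes = [
    "559MB (virtual 1.58GB)",
    "491MB (virtual 1.51GB)",
    "491MB (virtual 1.51GB)",
    "1.2GB (virtual 2.21GB)",
    "945MB (virtual 1.96GB)",
]

total = sum(parse_size(size) for size in sizes)
print(f"{total / 10**9:.2f} GB")  # prints "3.69 GB"
```

So these five containers alone account for roughly 3.7 GB of writable-layer data, on top of the images themselves.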

There is also a problem where a container that is taking up space is sometimes not stopped or destroyed, see T198517: Quibble docker instance running on CI instance for 6 hours.

Just to reiterate what was talked about in IRC (#wikimedia-releng), one long term solution to the Docker disk space issue might be to:

Free up LVM extents on /dev/vda4 currently used by the "second-local-disk" logical volume mounted at /srv. (Modify profile::labs::lvm::srv to specify a size much less than "100%FREE", the default.)
Configure dockerd to use the device mapper storage driver and /dev/vda4.

An alternative to switching storage drivers would be to divvy up the volume group into two logical volumes, and mount them at separate directories for both the Jenkins workspace and Docker (e.g. /srv/jenkins, and /srv/docker).

• dduvall mentioned this in T201972: Add some more m4executor docker slaves for Jenkins. · Sep 10 2018, 4:50 PM

Change 457918 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Wipe src upon Quibble jobs completion

https://gerrit.wikimedia.org/r/457918

gerritbot added a project: Patch-For-Review. · Sep 11 2018, 8:43 AM

Change 457918 merged by jenkins-bot:
[integration/config@master] Wipe src upon Quibble jobs completion

https://gerrit.wikimedia.org/r/457918

The Quibble jobs now delete the src directory when the build is completed. So at least /srv on the hosts will no longer be filled by that.

Each of the 90 Quibble jobs left behind at least one full copy of mediawiki/core. With 5 concurrent builds happening on the new slaves, that would mean up to 450 copies of mediawiki/core lying around idle. The jobs now delete src upon build completion.
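
The arithmetic and the cleanup behaviour can be sketched as follows. The real change was made in the Jenkins job configuration (integration/config), not in Python. The function names and the `build` callable are hypothetical and only illustrate the "always wipe src when the build ends" pattern.

```python
import os
import shutil


def worst_case_copies(job_count: int, concurrent_builds: int) -> int:
    """One leftover src checkout per job per concurrent workspace."""
    return job_count * concurrent_builds


def run_build_then_wipe_src(workspace: str, build):
    """Run `build(src_dir)` and delete <workspace>/src afterwards.

    The finally clause removes src whether the build succeeds or raises.
    """
    src = os.path.join(workspace, "src")
    try:
        return build(src)
    finally:
        shutil.rmtree(src, ignore_errors=True)


print(worst_case_copies(90, 5))  # prints 450
```

Putting the removal in a `finally` clause matters because failed builds are exactly the ones most likely to leave large, half-written checkouts behind.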

The subtasks still have to be completed.

• dduvall closed subtask T203841: Provide dedicated storage space to Docker for images/containers as Resolved. · Sep 13 2018, 5:52 PM

• dduvall closed subtask T203842: Free up LVM extents for Docker devicemapper on new Jenkins Agents as Declined.

Grafana trends for the past day show that disk usage on both logical volumes for Docker and Jenkins workspaces is stable.

Krinkle awarded a token. · Sep 19 2018, 4:14 PM

zeljkofilipin mentioned this in Blog Post: Production Excellence #3: September 2018. · Sep 26 2018, 12:17 PM

• dduvall mentioned this in T205902: Disk-space related issues still occurring for Docker based CI jobs. · Oct 1 2018, 5:12 PM

• mmodell changed the subtype of this task from "Task" to "Production Error". · Aug 28 2019, 11:09 PM
