integration-slave-trusty-1014 and integration-slave-trusty-1017 instances can't boot anymore, ended up corrupted. Need rebuild
Closed, Resolved, Public

Authored By

	hashar
	Aug 24 2015, 3:40 PM

Description

We have two instances that refuse to boot entirely, even after a few hard reboots via the Horizon dashboard. The console shows nothing, and the instances apparently end up in the 'paused' state.

The instances are Jenkins slaves:

integration-slave-trusty-1014.integration.eqiad.wmflabs	log	jenkins node
integration-slave-trusty-1017.integration.eqiad.wmflabs	log	jenkins node
integration-slave-precise-1014

They ended up corrupted because labvirt1007 ran out of disk space.
Delete instances
Build new instances

Related Objects

Status	Assigned	Task
Resolved	hashar	T110052 integration-slave-trusty-1014 and integration-slave-trusty-1017 instances can't boot anymore, ended up corrupted. Need rebuild
Resolved	hashar	T109752 disk space on labvirt1007
Resolved	hashar	T110303 Beta cluster puppetmaster lost cherry picked patches due to disk corruption
Resolved	hashar	T110506 Fix tox installation on new Precise slaves
Resolved	hashar	T110512 mediawiki-core-phplint clone the whole repo from Zuul and times out

Event Timeline

hashar created this task. Aug 24 2015, 3:40 PM

hashar set the priority of this task to Needs Triage.

hashar updated the task description. (Show Details)

hashar added projects: Cloud-Services, Cloud-VPS, Continuous-Integration-Infrastructure.

hashar subscribed.

Restricted Application added a subscriber: Aklapper. Aug 24 2015, 3:40 PM

hashar moved this task from Untriaged to Externally Blocked on the Continuous-Integration-Infrastructure board. Aug 24 2015, 3:40 PM

hashar updated the task description. Aug 24 2015, 3:52 PM

hashar set Security to None.

@Andrew do you have any spare time to look at them, please? :-}

This is probably because they are hosted on labvirt1007 which no longer has space to expand instance drives. I'm working on a fix for this and may be able to resolve it later today... if you need things working in the next few hours it's best to destroy and rebuild them.
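The failure mode described above (a virtualization host with no room left for instance drives to grow) is the kind of condition a simple free-space check can flag early. The following is only a minimal sketch: the helper names and the 10% threshold are hypothetical and are not taken from the actual labvirt monitoring.

```python
import shutil


def free_fraction(path):
    """Return the fraction of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def is_low_on_space(path, min_free=0.10):
    """True when less than `min_free` (hypothetical default: 10%) is free."""
    return free_fraction(path) < min_free


if __name__ == "__main__":
    # "." stands in for the directory that stores the instance disk images.
    print("free fraction:", round(free_fraction("."), 3))
    print("low on space:", is_low_on_space("."))
```

Run against the directory that holds the instance images, a check like this would report the shortage before instances fail to write to their disks.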

At worst I will recreate them tomorrow morning, Europe time :-) Thanks Andrew!

Self note: manual steps are listed at https://wikitech.wikimedia.org/wiki/Nova_Resource:Integration/Setup?redirect=no#Patches

hashar added a subtask: T109752: disk space on labvirt1007. Aug 24 2015, 7:51 PM

integration-slave-precise-1014 was affected as well apparently.

Disk space has been freed on labvirt1007 and the instances managed to boot. I pooled them back in Jenkins.

hashar mentioned this in T109752: disk space on labvirt1007. Aug 25 2015, 8:51 AM

So apparently some git repos in the Jenkins workspace ended up being corrupted:

jenkins-deploy@integration-slave-trusty-1017:/mnt/jenkins-workspace/workspace/mwext-qunit/src/extensions/VisualEditor$ git fsck
error: object file .git/objects/00/3124078a04dce9370de7f1eca062437ef88fe2 is empty
error: object file .git/objects/00/3124078a04dce9370de7f1eca062437ef88fe2 is empty
fatal: loose object 003124078a04dce9370de7f1eca062437ef88fe2 (stored in .git/objects/00/3124078a04dce9370de7f1eca062437ef88fe2) is corrupt

Potentially the file was created but content could not be written because the host disk was full :/ Going to purge workspaces on all three instances.
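The `git fsck` error above points at a zero-byte loose object, which fits the "file created but never written" theory. A minimal sketch of how such files could be located before purging a workspace is shown below. It assumes a non-bare repository with a `.git/objects` directory, and `find_empty_loose_objects` is a hypothetical helper name. It only detects empty files; it does not replace `git fsck`, which also catches truncated or otherwise damaged objects.

```python
from pathlib import Path


def find_empty_loose_objects(repo_dir):
    """Return loose object files under .git/objects that are zero bytes long."""
    objects_dir = Path(repo_dir) / ".git" / "objects"
    empty = []
    # Loose objects live in two-hex-digit fan-out directories,
    # e.g. .git/objects/00/3124078a...; "pack" and "info" are skipped.
    for fanout in sorted(objects_dir.glob("[0-9a-f][0-9a-f]")):
        if not fanout.is_dir():
            continue
        for obj in sorted(fanout.iterdir()):
            if obj.is_file() and obj.stat().st_size == 0:
                empty.append(obj)
    return empty


if __name__ == "__main__":
    for path in find_empty_loose_objects("."):
        print("empty object file:", path)
```

Any hit means the clone cannot be trusted, so wiping the workspace and recloning is the safe option.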

integration-slave-trusty-1014 on boot reports:

/bin/sh: 1: exec: cloud-init: not found

On integration-slave-trusty-1017, running an unknown command gave:

/usr/bin/python: can't find '__main__' module in '/usr/share/command-not-found'

integration-slave-precise-1014 probably suffers from a similar issue.

I have unpooled all three instances and am deleting them. We need to recreate them from scratch.

I have poked the Release-Engineering-Team mailing list to pair the rebuild with someone.

hashar renamed this task from integration-slave-trusty-1014 and integration-slave-trusty-1017 instances can't boot anymore to integration-slave-trusty-1014 and integration-slave-trusty-1017 instances can't boot anymore, ended up corrupted. Need rebuild. Aug 25 2015, 6:58 PM

hashar updated the task description. (Show Details)

hashar merged a task: T110184: Jenkins job can't clone repos: git/objects/* is corrupt.

hashar mentioned this in T110184: Jenkins job can't clone repos: git/objects/* is corrupt.

hashar added subscribers: SBisson, Krinkle, Mattflaschen-WMF.

hashar updated the task description. Aug 26 2015, 8:09 AM

Zeljko rebuilt integration-slave-precise-1014 :-}

In T110052#1575146, @hashar wrote:

Zeljko rebuilt integration-slave-precise-1014 :-}

I just marked https://integration.wikimedia.org/ci/computer/integration-slave-precise-1014/ offline. It seems to be having problems cloning git://zuul.eqiad.wmnet/mediawiki/core that are very repeatable (even with stock git from the cli). See https://integration.wikimedia.org/ci/job/mediawiki-core-phplint/9469/console and other failed mediawiki-core-phplint jobs.

In T110052#1575146, @hashar wrote:

Zeljko rebuilt integration-slave-precise-1014 :-}

It's not working properly; @bd808 marked it as offline in Jenkins for now. See https://integration.wikimedia.org/ci/job/mediawiki-core-phplint/9468/console and https://integration.wikimedia.org/ci/job/tox-flake8/8196/console for example.

Seems it takes more than 10 minutes to clone mediawiki/core from zuul.eqiad.wmnet :-/ Maybe the git-daemon serving them on gallium.wikimedia.org is overloaded, or gallium itself has disk I/O troubles.

Would need to reproduce with GIT_TRACE=1 GIT_TRACE_PACKET=1.

Gave it a try on integration-slave-precise-1014:

$ git init .
$ GIT_TRACE=1 time git -c core.askpass=true fetch --tags --progress git://zuul.eqiad.wmnet/mediawiki/core +refs/heads/*:refs/remotes/origin/*
trace: built-in: git 'fetch' '--tags' '--progress' 'git://zuul.eqiad.wmnet/mediawiki/core' '+refs/heads/*:refs/remotes/origin/*'
trace: run_command: 'rev-list' '--verify-objects' '--stdin' '--not' '--all' '--quiet'
remote: Counting objects: 578493, done.
remote: Compressing objects: 100% (105780/105780), done.
trace: run_command: 'index-pack' '--stdin' '-v' '--fix-thin' '--keep=fetch-pack 23014 on integration-slave-precise-1014' '--pack_header=2,578493'
trace: exec: 'git' 'index-pack' '--stdin' '-v' '--fix-thin' '--keep=fetch-pack 23014 on integration-slave-precise-1014' '--pack_header=2,578493'
trace: built-in: git 'index-pack' '--stdin' '-v' '--fix-thin' '--keep=fetch-pack 23014 on integration-slave-precise-1014' '--pack_header=2,578493'
remote: Total 578493 (delta 478370), reused 570493 (delta 470938)
Receiving objects: 100% (578493/578493), 310.03 MiB | 23.57 MiB/s, done.
Resolving deltas: 100% (478370/478370), done.
trace: run_command: 'rev-list' '--verify-objects' '--stdin' '--not' '--all'
trace: exec: 'git' 'rev-list' '--verify-objects' '--stdin' '--not' '--all'
trace: built-in: git 'rev-list' '--verify-objects' '--stdin' '--not' '--all'

At this point it takes roughly 10 minutes to write information to the pack file :-/

667.51user 9.14system 11:29.68elapsed 98%CPU (0avgtext+0avgdata 1603184maxresident)k
0inputs+672856outputs (0major+2249250minor)pagefaults 0swaps

So 667 seconds of user CPU time, or about 11.5 minutes of wall-clock time :-/
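The `time` summary reports CPU time and wall-clock time separately, and the two are easy to mix up. The sketch below converts that summary line into seconds. It assumes GNU time's default output format (`[hours:]minutes:seconds` for the elapsed field), and `parse_gnu_time` is a hypothetical helper name.

```python
import re


def parse_gnu_time(line):
    """Parse the first line of GNU time's default summary into seconds."""
    m = re.search(
        r"([\d.]+)user\s+([\d.]+)system\s+(?:(\d+):)?(\d+):([\d.]+)elapsed", line
    )
    if m is None:
        raise ValueError("not a GNU time summary line")
    user, system = float(m.group(1)), float(m.group(2))
    hours = int(m.group(3) or 0)  # the hours field is optional
    elapsed = hours * 3600 + int(m.group(4)) * 60 + float(m.group(5))
    return {"user": user, "system": system, "elapsed": elapsed}


summary = parse_gnu_time("667.51user 9.14system 11:29.68elapsed 98%CPU")
# Share of wall-clock time spent on CPU (user + system).
cpu_share = (summary["user"] + summary["system"]) / summary["elapsed"]
print(round(summary["elapsed"], 2), round(cpu_share, 2))
```

For the run above this gives roughly 690 seconds of wall-clock time with about 98% of it spent on CPU, matching the `98%CPU` figure that `time` printed. Together with the 310 MiB received at 23.57 MiB/s, this suggests the time went into local processing on the slave rather than into waiting on the network, though that reading is an inference from these numbers only.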

hashar mentioned this in T110512: mediawiki-core-phplint clone the whole repo from Zuul and times out. Aug 27 2015, 12:08 PM

hashar closed subtask T109752: disk space on labvirt1007 as Resolved. Aug 27 2015, 6:24 PM

hashar closed subtask T110512: mediawiki-core-phplint clone the whole repo from Zuul and times out as Resolved. Sep 14 2015, 2:08 PM

Status? T110506 seems to be resolved. We're on low capacity without these two slaves.

hashar closed subtask T110506: Fix tox installation on new Precise slaves as Resolved. Sep 21 2015, 7:56 PM

Recreating integration-slave-trusty-1014 and integration-slave-trusty-1017

SBisson unsubscribed. Sep 23 2015, 10:48 AM

We have rebuilt:

integration-slave-precise-1014
integration-slave-trusty-1014
integration-slave-trusty-1017

(npm / grunt-cli are up-to-date)

greg added a project: Essential-Work. Jan 11 2016, 10:50 PM
