integration-slave-trusty-1014 and integration-slave-trusty-1017 instances can't boot anymore, ended up corrupted. Need rebuild
Closed, Resolved · Public

Description

We have two instances that refuse to boot entirely, even after a few hard reboots via the Horizon dashboard. The console shows nothing and apparently the instances end up in the 'paused' state.

The instances are Jenkins slaves:

integration-slave-trusty-1014.integration.eqiad.wmflabs (log, jenkins node)
integration-slave-trusty-1017.integration.eqiad.wmflabs (log, jenkins node)
integration-slave-precise-1014
  • Ended up corrupted due to labvirt1007 filling disk space.
  • Delete instances
  • Build new instances

Event Timeline

hashar created this task. Aug 24 2015, 3:40 PM
hashar raised the priority of this task from to Needs Triage.
hashar updated the task description.
hashar added a subscriber: hashar.
Restricted Application added a subscriber: Aklapper. Aug 24 2015, 3:40 PM
hashar updated the task description. Aug 24 2015, 3:52 PM
hashar set Security to None.
hashar added a subscriber: Andrew. Aug 24 2015, 3:57 PM

@Andrew do you have any spare time to look at them please ? :-}

This is probably because they are hosted on labvirt1007 which no longer has space to expand instance drives. I'm working on a fix for this and may be able to resolve it later today... if you need things working in the next few hours it's best to destroy and rebuild them.

hashar added a comment. Edited Aug 24 2015, 4:34 PM

At worst I will recreate them tomorrow morning, Europe time :-) Thanks Andrew!

Self note: manual steps are listed at https://wikitech.wikimedia.org/wiki/Nova_Resource:Integration/Setup?redirect=no#Patches

hashar closed this task as Resolved. Aug 25 2015, 8:51 AM
hashar claimed this task.

integration-slave-precise-1014 was affected as well apparently.

Disk space has been freed on labvirt1007 and the instances managed to boot. I pooled them back in Jenkins.

hashar reopened this task as Open. Aug 25 2015, 3:01 PM

So apparently some git repos in Jenkins workspace ended up being corrupted:

jenkins-deploy@integration-slave-trusty-1017:/mnt/jenkins-workspace/workspace/mwext-qunit/src/extensions/VisualEditor$ git fsck
error: object file .git/objects/00/3124078a04dce9370de7f1eca062437ef88fe2 is empty
error: object file .git/objects/00/3124078a04dce9370de7f1eca062437ef88fe2 is empty
fatal: loose object 003124078a04dce9370de7f1eca062437ef88fe2 (stored in .git/objects/00/3124078a04dce9370de7f1eca062437ef88fe2) is corrupt

Potentially the file was created but content could not be written because the host disk was full :/ Going to purge workspaces on all three instances.
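Before purging, the affected repositories could be located by looking for zero-byte loose objects, which is what git fsck complains about above. The following is only a rough sketch: the function name is made up for illustration, and it assumes ordinary non-bare clones whose .git is a directory (submodules using a .git file would be missed).

```python
import os


def find_empty_loose_objects(root):
    """Yield paths of zero-byte loose object files in .git directories under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        # Loose objects live in .git/objects/<2 hex chars>/<remaining hex chars>.
        parent, fanout = os.path.split(dirpath)
        if os.path.basename(parent) != "objects" or len(fanout) != 2:
            continue
        if os.path.basename(os.path.dirname(parent)) != ".git":
            continue
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) == 0:
                yield path


if __name__ == "__main__":
    # Workspace root taken from the shell prompt shown in the comment above.
    for path in find_empty_loose_objects("/mnt/jenkins-workspace/workspace"):
        print(path)
```

An empty result does not prove a repository is healthy; git fsck remains the real check, this only finds the specific "object file is empty" symptom quickly.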

integration-slave-trusty-1014 on boot reports:

/bin/sh: 1: exec: cloud-init: not found

On integration-slave-trusty-1017, running an unknown command gives:

/usr/bin/python: can't find '__main__' module in '/usr/share/command-not-found'

integration-slave-precise-1014 probably suffers from a similar issue.

I have unpooled all three instances and am deleting them. We need to recreate them from scratch.

I have poked the Release-Engineering-Team team mailing list to pair the rebuild with someone.

hashar renamed this task from integration-slave-trusty-1014 and integration-slave-trusty-1017 instances can't boot anymore to integration-slave-trusty-1014 and integration-slave-trusty-1017 instances can't boot anymore, ended up corrupted. Need rebuild. Aug 25 2015, 6:58 PM
hashar updated the task description.
hashar updated the task description. Aug 26 2015, 8:09 AM

Zeljko rebuilt integration-slave-precise-1014 :-}

bd808 added a subscriber: bd808. Aug 26 2015, 4:40 PM

I just marked https://integration.wikimedia.org/ci/computer/integration-slave-precise-1014/ offline. It seems to be having problems cloning git://zuul.eqiad.wmnet/mediawiki/core that are very repeatable (even with stock git from the cli). See https://integration.wikimedia.org/ci/job/mediawiki-core-phplint/9469/console and other failed mediawiki-core-phplint jobs.

integration-slave-precise-1014 is not working properly; @bd808 marked it as offline in Jenkins for now. See https://integration.wikimedia.org/ci/job/mediawiki-core-phplint/9468/console and https://integration.wikimedia.org/ci/job/tox-flake8/8196/console for example.

Seems it takes more than 10 minutes to clone mediawiki/core from zuul.eqiad.wmnet :-/ Maybe the git-daemon serving them on gallium.wikimedia.org is overloaded, or gallium itself has disk I/O troubles.

Would need to reproduce with GIT_TRACE=1 GIT_TRACE_PACKET=1.

Gave it a try on integration-slave-precise-1014:

$ git init .
$ GIT_TRACE=1 time git -c core.askpass=true fetch --tags --progress git://zuul.eqiad.wmnet/mediawiki/core +refs/heads/*:refs/remotes/origin/*
trace: built-in: git 'fetch' '--tags' '--progress' 'git://zuul.eqiad.wmnet/mediawiki/core' '+refs/heads/*:refs/remotes/origin/*'
trace: run_command: 'rev-list' '--verify-objects' '--stdin' '--not' '--all' '--quiet'
remote: Counting objects: 578493, done.
remote: Compressing objects: 100% (105780/105780), done.
trace: run_command: 'index-pack' '--stdin' '-v' '--fix-thin' '--keep=fetch-pack 23014 on integration-slave-precise-1014' '--pack_header=2,578493'
trace: exec: 'git' 'index-pack' '--stdin' '-v' '--fix-thin' '--keep=fetch-pack 23014 on integration-slave-precise-1014' '--pack_header=2,578493'
trace: built-in: git 'index-pack' '--stdin' '-v' '--fix-thin' '--keep=fetch-pack 23014 on integration-slave-precise-1014' '--pack_header=2,578493'
remote: Total 578493 (delta 478370), reused 570493 (delta 470938)
Receiving objects: 100% (578493/578493), 310.03 MiB | 23.57 MiB/s, done.
Resolving deltas: 100% (478370/478370), done.
trace: run_command: 'rev-list' '--verify-objects' '--stdin' '--not' '--all'
trace: exec: 'git' 'rev-list' '--verify-objects' '--stdin' '--not' '--all'
trace: built-in: git 'rev-list' '--verify-objects' '--stdin' '--not' '--all'

At this point it takes roughly 10 minutes to write information to the pack file :-/

667.51user 9.14system 11:29.68elapsed 98%CPU (0avgtext+0avgdata 1603184maxresident)k
0inputs+672856outputs (0major+2249250minor)pagefaults 0swaps

So 667 seconds of user CPU time, and about 11 and a half minutes of wall-clock time :-/
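To keep CPU time and wall-clock time apart when reading the time output above, here is a small sketch that parses that summary line. It assumes the default GNU time format seen here (user, system, then elapsed as m:ss or h:mm:ss); the function name is hypothetical.

```python
import re


def parse_gnu_time(line):
    """Parse a GNU time summary line into user, system and elapsed seconds."""
    m = re.search(r"([\d.]+)user\s+([\d.]+)system\s+([\d:.]+)elapsed", line)
    if m is None:
        raise ValueError("unrecognised time output: %r" % line)
    user, system = float(m.group(1)), float(m.group(2))
    # Elapsed time is colon-separated (m:ss.ss or h:mm:ss); fold it into seconds.
    elapsed = 0.0
    for part in m.group(3).split(":"):
        elapsed = elapsed * 60 + float(part)
    return {"user": user, "system": system, "elapsed": elapsed}


if __name__ == "__main__":
    t = parse_gnu_time("667.51user 9.14system 11:29.68elapsed 98%CPU")
    cpu_share = (t["user"] + t["system"]) / t["elapsed"]
    print("elapsed: %.2fs, CPU share: %.0f%%" % (t["elapsed"], cpu_share * 100))
```

For the run above, 11:29.68 works out to about 690 seconds elapsed, nearly all of it spent on CPU, which suggests the fetch was CPU-bound on the slave rather than waiting on the network or on gallium.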

Krinkle triaged this task as High priority. Sep 17 2015, 5:27 AM
Krinkle updated the task description.

Status? T110506 seems to be resolved. We're on low capacity without these two slaves.

Recreating integration-slave-trusty-1014 and integration-slave-trusty-1017

SBisson removed a subscriber: SBisson. Sep 23 2015, 10:48 AM
hashar closed this task as Resolved. Sep 23 2015, 12:35 PM

We have rebuilt:

  • integration-slave-precise-1014
  • integration-slave-trusty-1014
  • integration-slave-trusty-1017

(npm / grunt-cli are up-to-date)