
/ on gallium is read only, breaking jenkins
Closed, ResolvedPublic

Description

legoktm@gallium:~$ touch test
touch: cannot touch `test': Read-only file system

https://integration.wikimedia.org/ci/log/WARNING/ also has complaints about not being able to do anything due to a read-only FS.

[19:53:22] <yuvipanda> legoktm: does this completely kill jenkins?
[19:53:39] <legoktm> it's still running, but it can't do anything
[19:53:49] <legoktm> because triggering jobs requires writing to the file system
[19:53:53] <yuvipanda> legoktm: hmm, I *think* it might be hardware failure
[19:53:59] <yuvipanda> I see an madm alert for it
[19:54:00] <legoktm> well...shit.
[19:54:23] <legoktm> lemme file a bug then
[19:54:26] <yuvipanda> legoktm: yeah
[19:54:36] <yuvipanda> legoktm: I don't want to reboot in that state since it might not come up at all

Related Objects

Event Timeline


I see an mdadm alert for it:

This is an automatically generated mail message from mdadm
running on gallium

A Fail event had been detected on md device /dev/md0.

It could be related to component device /dev/sda2.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdc2[1] sda2[2](F)
      480573376 blocks [2/1] [_U]

unused devices: <none>

fsck completed with:

root@gallium:/home/yuvipanda# fsck.ext3 -n /dev/md0 | tee fsck
tee: fsck: Read-only file system
e2fsck 1.42 (29-Nov-2011)
Warning: skipping journal recovery because doing a read-only filesystem check.
/dev/md0 has gone 972 days without being checked, check forced.
Pass 1: Checking inodes, blocks, and sizes


Inodes that were part of a corrupted orphan linked list found.  Fix? no

Inode 17425807 was part of the orphaned inode list.  IGNORED.
Inode 17425841 was part of the orphaned inode list.  IGNORED.
Inode 17425842 was part of the orphaned inode list.  IGNORED.
Inode 17425847 was part of the orphaned inode list.  IGNORED.
Inode 17425851 was part of the orphaned inode list.  IGNORED.
Inode 17425871 was part of the orphaned inode list.  IGNORED.
Inode 17425876 was part of the orphaned inode list.  IGNORED.
Inode 22495243 was part of the orphaned inode list.  IGNORED.
Inode 22495368 was part of the orphaned inode list.  IGNORED.
Inode 23093406 was part of the orphaned inode list.  IGNORED.
Inode 23094308 was part of the orphaned inode list.  IGNORED.
Inode 23118262 was part of the orphaned inode list.  IGNORED.
Deleted inode 25651949 has zero dtime.  Fix? no

Inode 27476014 was part of the orphaned inode list.  IGNORED.
Inode 27476089 was part of the orphaned inode list.  IGNORED.
Inode 27476097 was part of the orphaned inode list.  IGNORED.
Inode 27476115 was part of the orphaned inode list.  IGNORED.
Inode 27476124 was part of the orphaned inode list.  IGNORED.
Inode 27476140 was part of the orphaned inode list.  IGNORED.
Inode 27476141 was part of the orphaned inode list.  IGNORED.
Inode 27476144 was part of the orphaned inode list.  IGNORED.
Inode 27476403 was part of the orphaned inode list.  IGNORED.
Inode 27476808 was part of the orphaned inode list.  IGNORED.
Inode 27476809 was part of the orphaned inode list.  IGNORED.
Inode 27476810 was part of the orphaned inode list.  IGNORED.
Inode 27476811 was part of the orphaned inode list.  IGNORED.
Inode 27476817 was part of the orphaned inode list.  IGNORED.
Inode 27476818 was part of the orphaned inode list.  IGNORED.
Inode 27476819 was part of the orphaned inode list.  IGNORED.
Inode 27476821 was part of the orphaned inode list.  IGNORED.
Inode 27477353 was part of the orphaned inode list.  IGNORED.
Inode 27477354 was part of the orphaned inode list.  IGNORED.
Inode 27477372 was part of the orphaned inode list.  IGNORED.
Inode 27477377 was part of the orphaned inode list.  IGNORED.
Inode 27477379 was part of the orphaned inode list.  IGNORED.
Inode 28352687 was part of the orphaned inode list.  IGNORED.
Inode 28352793 was part of the orphaned inode list.  IGNORED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -(69736114--69736133) -(69739355--69739370) -(69739379--69739398) -89982989 -89983112 -(92401767--92401784) -(92748806--92748918) -92755994 -92758057 -(92758060--92758061) -(92758063--92758064) -(92758066--92758071) -(92758078--92758092) -(92758098--92758100) -(92875384--92875390) -102628679 -(102628977--102629014) -(109914308--109914311) -(109914317--109914330) -(109914819--109914823) -(109915084--109915094) -(109915246--109915284) -(109916026--109916042) -(109916051--109916057) -109916059 -(109916128--109916140) -(109916144--109916159) -109919492 -109919593 -(109919595--109919609) -(109919611--109919645) -(109919942--109919943) -(109919945--109919958) -(109919964--109919967) -(109919982--109920094) -(109920160--109920255) -(109920572--109920573) -(109920576--109920577) -109920751 -(109920753--109920765) -(109920888--109920916) -(109920931--109920948) -(109921022--109921069) -(109921081--109921095) -(109921123--109921127) -(109921755--109921759) -(109922036--109922040) -(109922044--109922047) -(109922056--109922146) -(109922192--109922207) -(109924352--109924355) -(109924816--109924817) -(109924856--109924857) -(109924862--109924863) -109924881 -109924890 -(109926016--109926030) -(109927993--109928087) -(109928188--109928447) -(109928539--109928585) -(109929015--109929184) -(109929193--109929196) -(110239775--110239782) -(110239787--110239789) -(110241793--110241796) -(110241798--110241799) -(110243111--110243114) -(110243118--110243119) -(110243123--110243126) -(110243144--110243145) -(110243153--110243156) -(110243159--110243166) -(110243168--110243169) -113637986 -(113637988--113637990) -113637995 -113638001 -(113638014--113638015) -113653913 -113654043 -(113654054--113654055) -(113655477--113655478) -(113656933--113656936) -(113656938--113656941) -113656944 -113656946 -113656948 -(113657866--113657868) -113679315 -113679317 -(113679319--113679320) -113679322 -(113688587--113688590) -(113688596--113688598) -113688601 -(113688608--113688615) -113696313 -(113721244--113721245) -(113721247--113721264)
Fix? no

Free blocks count wrong for group #22 (1972, counted=1971).
Fix? no

Free blocks count wrong for group #889 (0, counted=1).
Fix? no

Free blocks count wrong for group #928 (2, counted=4).
Fix? no

Free blocks count wrong for group #929 (2246, counted=2256).
Fix? no

Free blocks count wrong for group #2745 (376, counted=382).
Fix? no

Free blocks count wrong for group #2755 (2311, counted=2312).
Fix? no

Free blocks count wrong for group #2760 (1034, counted=1043).
Fix? no

Free blocks count wrong for group #3526 (348, counted=349).
Fix? no

Free blocks count wrong (46687680, counted=38869639).
Fix? no

Inode bitmap differences:  -17425807 -(17425841--17425842) -17425847 -17425851 -17425871 -17425876 -22495243 -22495368 -23093406 -23094308 -23118262 -25651949 -27476014 -27476089 -27476097 -27476115 -27476124 -(27476140--27476141) -27476144 -27476403 -(27476808--27476811) -(27476817--27476819) -27476821 -(27477353--27477354) -27477372 -27477377 -27477379 -28352687 -28352793
Fix? no

Free inodes count wrong for group #22 (7648, counted=7649).
Fix? no

Free inodes count wrong for group #889 (6053, counted=6052).
Fix? no

Free inodes count wrong for group #927 (7804, counted=7809).
Fix? no

Directories count wrong for group #927 (120, counted=118).
Fix? no

Free inodes count wrong (22960164, counted=27714007).
Fix? no


/dev/md0: ********** WARNING: Filesystem still has errors **********

/dev/md0: 7079900/30040064 files (3.0% non-contiguous), 73455664/120143344 blocks

I suspect rebooting + an fsck on reboot will fix this, but I'm also aware that I haven't done this before and that gallium isn't fully puppetized, so I'm going to hold off and wait for someone else to show up. @Legoktm and I also decided not to page people for this, since nobody noticed it for ~3h due to the timing of the break (SF evening).
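For reference, on a sysvinit-style system a full check could be forced on the next boot roughly like this; this is only a sketch of the option, nothing here has been run (and with / read-only the flag file cannot even be created):

touch /forcefsck        # flag file picked up by the boot scripts; fails while / is read-only
shutdown -rF now        # sysvinit shutdown's -F also requests an fsck on reboot
fsck.ext3 -f /dev/md0   # or run the check by hand from a rescue/initramfs shell, with md0 unmounted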

mdadm shows /dev/sda2 as failed, so it needs to be removed from /dev/md0 and replaced. Let's wait for Antoine to appear before rebooting; the next MediaWiki deployment is still ten hours away, and for puppet we can use PCC to double-check.
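For the record, the usual mdadm sequence for swapping out a failed RAID1 member looks roughly like the following; this is only a sketch (the replacement disk is assumed to get the same partition layout), not something that has been run here:

mdadm /dev/md0 --fail /dev/sda2      # already marked faulty in this case, shown for completeness
mdadm /dev/md0 --remove /dev/sda2    # drop the failed member from the array
# ...physically replace the disk and recreate the partition table...
mdadm /dev/md0 --add /dev/sda2       # re-add; the array then resyncs from the surviving member
cat /proc/mdstat                     # watch the rebuild progress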

Please don't reboot the machine. While /dev/sda2 seems to be failing, we also have /dev/sdc reporting I/O errors:

[Mon Jun  6 17:46:50 2016] sd 1:0:0:0: [sdc] Unhandled sense code
[Mon Jun  6 17:46:50 2016] sd 1:0:0:0: [sdc]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Mon Jun  6 17:46:50 2016] sd 1:0:0:0: [sdc]  Sense Key : Medium Error [current] [descriptor]
[Mon Jun  6 17:46:50 2016] sd 1:0:0:0: [sdc]  Add. Sense: Unrecovered read error - auto reallocate failed
[Mon Jun  6 17:46:50 2016] sd 1:0:0:0: [sdc] CDB: Read(10): 28 00 2c 62 ee 38 00 00 08 00
[Mon Jun  6 17:46:50 2016] end_request: I/O error, dev sdc, sector 744681016

I honestly fear that Linux md marked the wrong disk as failed, too. The alternative is that we lost one disk without being alerted, and have now lost both.

mdadm --detail /dev/md0 
/dev/md0:
        Version : 0.90
  Creation Time : Thu Aug 25 21:30:22 2011
     Raid Level : raid1
     Array Size : 480573376 (458.31 GiB 492.11 GB)
  Used Dev Size : 480573376 (458.31 GiB 492.11 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Wed Jun  8 07:27:19 2016
          State : clean, degraded 
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           UUID : 02787912:1cc83bcf:e7d252f1:6ccb01d9
         Events : 0.5093

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       34        1      active sync   /dev/sdc2

       2       8        2        -      faulty spare   /dev/sda2

So it seems we are out of luck here. If valuable things are present on gallium, we should just salvage those.

Entirely my fault for not having prepared a proper backup of gallium T80385 and not having moved gallium to another host sooner :(

For backups (a rough copy sketch follows the list):

  • /var/lib/zuul: Zuul private data
  • /srv: /srv/ssd can be skipped
  • /var/lib/jenkins: Jenkins data, a lot of that can be purged before saving
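A rough copy sketch for those paths (hypothetical invocation; the destination paths are made up, root ssh from the backup host is assumed, and /srv/ssd is taken as the only part of /srv to skip):

rsync -aH --numeric-ids gallium.wikimedia.org:/var/lib/zuul/    backup/gallium/var-lib-zuul/
rsync -aH --numeric-ids gallium.wikimedia.org:/var/lib/jenkins/ backup/gallium/var-lib-jenkins/
rsync -aH --numeric-ids --exclude='/ssd/' gallium.wikimedia.org:/srv/ backup/gallium/srv/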

I think our best bet at the moment is installing a new system to replace gallium.

@hashar suggested moving to Jessie directly; I will take a look at the possibilities.

I don't have rights to edit the spares allocation spreadsheet, so I can't comment there, but I am thinking of allocating WMF4723 as a gallium replacement.

This will happen a bit out of process, but I'd like it to be available ASAP.

I am rebuilding/testing the Zuul deb package for Jessie (T137279).

I have created a placeholder incident report at https://wikitech.wikimedia.org/wiki/Incident_documentation/20160608-gallium-disk-failure . It will be completed later on.

The host I chose was already allocated to maps100*, so we are now targeting wmf4746 instead.

smartctl status for both disks:
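(The output itself is not reproduced here; the checks would be along these lines, assuming smartmontools is installed:)

smartctl -H /dev/sda    # overall SMART health verdict
smartctl -a /dev/sda    # full attributes, self-test and error logs
smartctl -H /dev/sdc
smartctl -a /dev/sdc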

Change 293278 had a related patch set uploaded (by Giuseppe Lavagetto):
Add darmstadtium.eqiad.wmnet (Eqiad Row a private)

https://gerrit.wikimedia.org/r/293278

Change 293279 had a related patch set uploaded (by Giuseppe Lavagetto):
contint: add darmstadtium as gallium replacement

https://gerrit.wikimedia.org/r/293279

@MoritzMuehlenhoff is taking care of adding the Jenkins 1.652.2 Debian packages for jessie-wikimedia.

The Zuul package for Jessie seems to work fine based on rough testing on labs (T137279).

Change 293278 merged by Giuseppe Lavagetto:
Add contint1001.eqiad.wmnet

https://gerrit.wikimedia.org/r/293278

Change 293279 merged by Giuseppe Lavagetto:
contint: add contint1001 as gallium replacement

https://gerrit.wikimedia.org/r/293279

Change 293288 had a related patch set uploaded (by Hashar):
zuul.eqiad.wmnet is no more of any use

https://gerrit.wikimedia.org/r/293288

Change 293300 had a related patch set uploaded (by Hashar):
gallium is replaced by contint1001.eqiad.wmnet

https://gerrit.wikimedia.org/r/293300

Change 293301 had a related patch set uploaded (by Hashar):
contint: hiera conf for contint1001.eqiad.wmnet

https://gerrit.wikimedia.org/r/293301

Change 293301 merged by Giuseppe Lavagetto:
contint: hiera conf for contint1001.eqiad.wmnet

https://gerrit.wikimedia.org/r/293301

Change 293302 had a related patch set uploaded (by Giuseppe Lavagetto):
contint1001: add role::ci::master

https://gerrit.wikimedia.org/r/293302

Change 293302 merged by Giuseppe Lavagetto:
contint1001: add role::ci::master

https://gerrit.wikimedia.org/r/293302

Change 293313 had a related patch set uploaded (by Paladox):
zuul status: notice about ongoing outage

https://gerrit.wikimedia.org/r/293313

Change 293313 abandoned by Paladox:
zuul status: notice about ongoing outage

Reason:
No point since gallium is read only.

https://gerrit.wikimedia.org/r/293313

Change 293324 had a related patch set uploaded (by Giuseppe Lavagetto):
contint1001: add zuul

https://gerrit.wikimedia.org/r/293324

Change 293324 merged by Giuseppe Lavagetto:
contint1001: add zuul

https://gerrit.wikimedia.org/r/293324

@Joe got a new server and set up a nice LVM-based partition scheme. We had to polish up the puppet manifests, and eventually that was completed.

Zuul is at 2.1.0-95-g66c8e52-wmf1jessie1, but the service is masked in systemd to prevent it from starting.
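For context, masking and later unmasking the unit is just standard systemctl usage (the unit name zuul is assumed from the above; these lines are illustrative rather than a record of the exact commands run):

systemctl mask zuul      # links the unit to /dev/null so it cannot be started, even manually
systemctl unmask zuul    # undo once we are ready to bring the service up
systemctl start zuul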

Next steps:

  • restore jenkins config
  • open network flows

Status:

@jcrespo has taken backups and is dealing with the disk failure + RAID with guidance from Faidon/Mark and Chris on site

@Joe allocated a server and installed Jessie pairing with @hashar to polish up the puppet scripts.

It seems as if the RAID operations were successful, but the machine got stuck on boot:

The system may have suffered a hardware fault, such as a disk drive
failure. The root device may depend on the RAID devices being online. One
or more of the following RAID devices are degraded:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active (auto-read-on[ 95.706273] kjournald starting. Commit interval 5 seconds
[ 95.706438] EXT3-fs (md0): mounted filesystem with ordered data mode
ly) raid1 sda2[0]
      480573376 blocks [2/1] [U_]

unused devices: <none>
Attempting to start the RAID in degraded mode...
mdadm: CREATE user root not found
mdadm: CREATE group disk not found
Started the RAID in degraded mode.
done.
Begin: Running /scripts/local-bottom ... done.
done.
Begin: Running /scripts/init-bottom ... done.
[ 98.955733] Adding 7811068k swap on /dev/sda1. Priority:-1 extents:1 across:7811068k
[ 100.160075] EXT3-fs (md0): using internal journal
[ 101.332710] SGI XFS with ACLs, security attributes, realtime, large block/inode numbers, no debug enabled
[ 101.342921] SGI XFS Quota Management subsystem
[ 101.351317] XFS (sdb1): Mounting Filesystem
[ 101.582347] XFS (sdb1): Starting recovery (logdev: internal)
[ 101.689028] XFS (sdb1): Ending recovery (logdev: internal)

The RAID array is rebuilding on gallium; it should take roughly an hour and a half.
Puppet is disabled, Jenkins stopped and Zuul masked in systemd.

@jcrespo made a copy of the Jenkins data to db1085; it is now being copied over to the new host contint1001.

contint1001 has puppet disabled, Jenkins stopped and Zuul masked in systemd.

Once the RAID is rebuilt and confirmed to be in a sane state, we can bring the Jenkins/Zuul service back up on gallium.

Tomorrow we will proceed with migrating the service to the new host contint1001, though there are a few blockers, such as figuring out firewall rules (T137323), making sure Zuul works on Jessie, and updating the IP address in Puppet and in the Jenkins jobs.

More details before I go:

There are several backups on db1085:/srv/backup/gallium.wikimedia.org

sda.img
sdb.img

which are raw dd copies of the sda and sdb disks (currently loopback-mounted on /mnt and subdirs).
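For anyone needing to poke around in those images, a raw dd image can be loop-mounted read-only roughly as follows (a sketch only; the loop device and partition numbers depend on what losetup assigns and on the original partition table):

losetup --find --show --read-only --partscan sda.img   # prints e.g. /dev/loop0, exposing /dev/loop0p1, /dev/loop0p2, ...
mount -o ro /dev/loop0p2 /mnt/sda2                      # hypothetical partition and mount point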

There are some unfinished backups on: einsteinium:/srv/backup/gallium.wikimedia.org

db1085 has a root screen session copying /var/lib/jenkins to the same location on contint1001 (which is also receiving that data in a root screen session). You can attach to either of those sessions to check what is being done, or just ls/du the appropriate dir.
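For reference, attaching to those root screen sessions is standard screen usage:

screen -ls         # list the available sessions
screen -x <name>   # attach in shared mode without detaching the other viewer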

gallium is still cloning its disk:

gallium:~$ cat /proc/mdstat 
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sdc2[2] sda2[0]
      480573376 blocks [2/1] [U_]
      [======>..............]  recovery = 31.6% (152319744/480573376) finish=69.0min speed=79240K/sec
      
unused devices: <none>

At 18:50, Mark poked me stating that the RAID rebuild is complete and gallium has been rebooted. He confirmed that the RAID/disk status is all clear as of now, so we can resume Jenkins/Zuul.
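Roughly, bringing the stack back amounts to the following; this is only an illustrative sketch (service names and ordering are assumptions), the actual steps are logged in SAL below:

puppet agent --enable      # let puppet reconverge the host again
systemctl unmask zuul      # undo the mask noted earlier
service jenkins start      # then start the services back up
service zuul start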

Note that gallium may well die again.

Mentioned in SAL [2016-06-08T19:18:57Z] <hashar> Bringing back Jenkins and Zuul on gallium T137265

Change 293288 abandoned by Hashar:
zuul.eqiad.wmnet is no more of any use

Reason:
Will keep the entry for now since it is used by Nodepool. For the contint1001 migration the change is https://gerrit.wikimedia.org/r/318249

https://gerrit.wikimedia.org/r/293288