
/ on gallium is read only, breaking jenkins
Closed, ResolvedPublic

Description

legoktm@gallium:~$ touch test
touch: cannot touch `test': Read-only file system

https://integration.wikimedia.org/ci/log/WARNING/ also has complaints about not being able to do anything due to a read-only FS.

[19:53:22] <yuvipanda> legoktm: does this completely kill jenkins?
[19:53:39] <legoktm> it's still running, but it can't do anything
[19:53:49] <legoktm> because triggering jobs requires writing to the file system
[19:53:53] <yuvipanda> legoktm: hmm, I *think* it might be hardware failure
[19:53:59] <yuvipanda> I see an madm alert for it
[19:54:00] <legoktm> well...shit.
[19:54:23] <legoktm> lemme file a bug then
[19:54:26] <yuvipanda> legoktm: yeah
[19:54:36] <yuvipanda> legoktm: I don't want to reboot in that state since it might not come up at all

Related Objects

Event Timeline


I see an mdadm alert for it:

This is an automatically generated mail message from mdadm
running on gallium

A Fail event had been detected on md device /dev/md0.

It could be related to component device /dev/sda2.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdc2[1] sda2[2](F)
      480573376 blocks [2/1] [_U]

unused devices: <none>

fsck completed with:

root@gallium:/home/yuvipanda# fsck.ext3 -n /dev/md0 | tee fsck
tee: fsck: Read-only file system
e2fsck 1.42 (29-Nov-2011)
Warning: skipping journal recovery because doing a read-only filesystem check.
/dev/md0 has gone 972 days without being checked, check forced.
Pass 1: Checking inodes, blocks, and sizes


Inodes that were part of a corrupted orphan linked list found.  Fix? no

Inode 17425807 was part of the orphaned inode list.  IGNORED.
Inode 17425841 was part of the orphaned inode list.  IGNORED.
Inode 17425842 was part of the orphaned inode list.  IGNORED.
Inode 17425847 was part of the orphaned inode list.  IGNORED.
Inode 17425851 was part of the orphaned inode list.  IGNORED.
Inode 17425871 was part of the orphaned inode list.  IGNORED.
Inode 17425876 was part of the orphaned inode list.  IGNORED.
Inode 22495243 was part of the orphaned inode list.  IGNORED.
Inode 22495368 was part of the orphaned inode list.  IGNORED.
Inode 23093406 was part of the orphaned inode list.  IGNORED.
Inode 23094308 was part of the orphaned inode list.  IGNORED.
Inode 23118262 was part of the orphaned inode list.  IGNORED.
Deleted inode 25651949 has zero dtime.  Fix? no

Inode 27476014 was part of the orphaned inode list.  IGNORED.
Inode 27476089 was part of the orphaned inode list.  IGNORED.
Inode 27476097 was part of the orphaned inode list.  IGNORED.
Inode 27476115 was part of the orphaned inode list.  IGNORED.
Inode 27476124 was part of the orphaned inode list.  IGNORED.
Inode 27476140 was part of the orphaned inode list.  IGNORED.
Inode 27476141 was part of the orphaned inode list.  IGNORED.
Inode 27476144 was part of the orphaned inode list.  IGNORED.
Inode 27476403 was part of the orphaned inode list.  IGNORED.
Inode 27476808 was part of the orphaned inode list.  IGNORED.
Inode 27476809 was part of the orphaned inode list.  IGNORED.
Inode 27476810 was part of the orphaned inode list.  IGNORED.
Inode 27476811 was part of the orphaned inode list.  IGNORED.
Inode 27476817 was part of the orphaned inode list.  IGNORED.
Inode 27476818 was part of the orphaned inode list.  IGNORED.
Inode 27476819 was part of the orphaned inode list.  IGNORED.
Inode 27476821 was part of the orphaned inode list.  IGNORED.
Inode 27477353 was part of the orphaned inode list.  IGNORED.
Inode 27477354 was part of the orphaned inode list.  IGNORED.
Inode 27477372 was part of the orphaned inode list.  IGNORED.
Inode 27477377 was part of the orphaned inode list.  IGNORED.
Inode 27477379 was part of the orphaned inode list.  IGNORED.
Inode 28352687 was part of the orphaned inode list.  IGNORED.
Inode 28352793 was part of the orphaned inode list.  IGNORED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -(69736114--69736133) -(69739355--69739370) -(69739379--69739398) -89982989 -89983112 -(92401767--92401784) -(92748806--92748918) -92755994 -92758057 -(92758060--92758061) -(92758063--92758064) -(92758066--92758071) -(92758078--92758092) -(92758098--92758100) -(92875384--92875390) -102628679 -(102628977--102629014) -(109914308--109914311) -(109914317--109914330) -(109914819--109914823) -(109915084--109915094) -(109915246--109915284) -(109916026--109916042) -(109916051--109916057) -109916059 -(109916128--109916140) -(109916144--109916159) -109919492 -109919593 -(109919595--109919609) -(109919611--109919645) -(109919942--109919943) -(109919945--109919958) -(109919964--109919967) -(109919982--109920094) -(109920160--109920255) -(109920572--109920573) -(109920576--109920577) -109920751 -(109920753--109920765) -(109920888--109920916) -(109920931--109920948) -(109921022--109921069) -(109921081--109921095) -(109921123--109921127) -(109921755--109921759) -(109922036--109922040) -(109922044--109922047) -(109922056--109922146) -(109922192--109922207) -(109924352--109924355) -(109924816--109924817) -(109924856--109924857) -(109924862--109924863) -109924881 -109924890 -(109926016--109926030) -(109927993--109928087) -(109928188--109928447) -(109928539--109928585) -(109929015--109929184) -(109929193--109929196) -(110239775--110239782) -(110239787--110239789) -(110241793--110241796) -(110241798--110241799) -(110243111--110243114) -(110243118--110243119) -(110243123--110243126) -(110243144--110243145) -(110243153--110243156) -(110243159--110243166) -(110243168--110243169) -113637986 -(113637988--113637990) -113637995 -113638001 -(113638014--113638015) -113653913 -113654043 -(113654054--113654055) -(113655477--113655478) -(113656933--113656936) -(113656938--113656941) -113656944 -113656946 -113656948 -(113657866--113657868) -113679315 -113679317 -(113679319--113679320) -113679322 -(113688587--113688590) -(113688596--113688598) -113688601 -(113688608--113688615) -113696313 -(113721244--113721245) -(113721247--113721264)
Fix? no

Free blocks count wrong for group #22 (1972, counted=1971).
Fix? no

Free blocks count wrong for group #889 (0, counted=1).
Fix? no

Free blocks count wrong for group #928 (2, counted=4).
Fix? no

Free blocks count wrong for group #929 (2246, counted=2256).
Fix? no

Free blocks count wrong for group #2745 (376, counted=382).
Fix? no

Free blocks count wrong for group #2755 (2311, counted=2312).
Fix? no

Free blocks count wrong for group #2760 (1034, counted=1043).
Fix? no

Free blocks count wrong for group #3526 (348, counted=349).
Fix? no

Free blocks count wrong (46687680, counted=38869639).
Fix? no

Inode bitmap differences:  -17425807 -(17425841--17425842) -17425847 -17425851 -17425871 -17425876 -22495243 -22495368 -23093406 -23094308 -23118262 -25651949 -27476014 -27476089 -27476097 -27476115 -27476124 -(27476140--27476141) -27476144 -27476403 -(27476808--27476811) -(27476817--27476819) -27476821 -(27477353--27477354) -27477372 -27477377 -27477379 -28352687 -28352793
Fix? no

Free inodes count wrong for group #22 (7648, counted=7649).
Fix? no

Free inodes count wrong for group #889 (6053, counted=6052).
Fix? no

Free inodes count wrong for group #927 (7804, counted=7809).
Fix? no

Directories count wrong for group #927 (120, counted=118).
Fix? no

Free inodes count wrong (22960164, counted=27714007).
Fix? no


/dev/md0: ********** WARNING: Filesystem still has errors **********

/dev/md0: 7079900/30040064 files (3.0% non-contiguous), 73455664/120143344 blocks

I suspect rebooting + an fsck on reboot will fix this, but I'm also aware that I haven't done this before and that gallium isn't fully puppetized, so I'm going to hold off and wait for someone else to show up. @Legoktm and I also decided not to page people for this, since nobody noticed it for ~3h due to the timing of the break (SF evening).
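For reference, on a sysvinit-style system a full check could be forced on the next boot roughly like this; this is only a sketch of the option, nothing here has been run (and with / read-only the flag file cannot even be created):

touch /forcefsck        # flag file picked up by the boot scripts; fails while / is read-only
shutdown -rF now        # sysvinit shutdown's -F also requests an fsck on reboot
fsck.ext3 -f /dev/md0   # or run the check by hand from a rescue/initramfs shell, with md0 unmounted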

mdadm shows /dev/sda2 as failed, so it needs to be removed from /dev/md0 and replaced. Let's wait for Antoine to appear before rebooting; the next MediaWiki deployment is still ten hours away, and for puppet we can use PCC to double-check.
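For the record, the usual mdadm sequence for swapping out a failed RAID1 member looks roughly like the following; this is only a sketch (the replacement disk is assumed to get the same partition layout), not something that has been run here:

mdadm /dev/md0 --fail /dev/sda2      # already marked faulty in this case, shown for completeness
mdadm /dev/md0 --remove /dev/sda2    # drop the failed member from the array
# ...physically replace the disk and recreate the partition table...
mdadm /dev/md0 --add /dev/sda2       # re-add; the array then resyncs from the surviving member
cat /proc/mdstat                     # watch the rebuild progress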

Please don't reboot the machine. While /dev/sda2 seems to be failing, we also have /dev/sdc reporting I/O errors:

[Mon Jun  6 17:46:50 2016] sd 1:0:0:0: [sdc] Unhandled sense code
[Mon Jun  6 17:46:50 2016] sd 1:0:0:0: [sdc]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Mon Jun  6 17:46:50 2016] sd 1:0:0:0: [sdc]  Sense Key : Medium Error [current] [descriptor]
[Mon Jun  6 17:46:50 2016] sd 1:0:0:0: [sdc]  Add. Sense: Unrecovered read error - auto reallocate failed
[Mon Jun  6 17:46:50 2016] sd 1:0:0:0: [sdc] CDB: Read(10): 28 00 2c 62 ee 38 00 00 08 00
[Mon Jun  6 17:46:50 2016] end_request: I/O error, dev sdc, sector 744681016

I honestly fear that Linux md marked the wrong disk as failed, too. The alternative is that we lost one disk without being alerted, and have now lost both.

mdadm --detail /dev/md0 
/dev/md0:
        Version : 0.90
  Creation Time : Thu Aug 25 21:30:22 2011
     Raid Level : raid1
     Array Size : 480573376 (458.31 GiB 492.11 GB)
  Used Dev Size : 480573376 (458.31 GiB 492.11 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Wed Jun  8 07:27:19 2016
          State : clean, degraded 
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           UUID : 02787912:1cc83bcf:e7d252f1:6ccb01d9
         Events : 0.5093

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       34        1      active sync   /dev/sdc2

       2       8        2        -      faulty spare   /dev/sda2

So it seems we are out of luck here. If valuable things are present on gallium, we should just salvage those.

Entirely my fault for not having prepared a proper backup of gallium T80385 and not having moved gallium to another host sooner :(

For backups (a rough copy sketch follows the list):

  • /var/lib/zuul: Zuul private data
  • /srv: /srv/ssd can be skipped
  • /var/lib/jenkins: Jenkins data, a lot of that can be purged before saving
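A rough copy sketch for those paths (hypothetical invocation; the destination paths are made up, root ssh from the backup host is assumed, and /srv/ssd is taken as the only part of /srv to skip):

rsync -aH --numeric-ids gallium.wikimedia.org:/var/lib/zuul/    backup/gallium/var-lib-zuul/
rsync -aH --numeric-ids gallium.wikimedia.org:/var/lib/jenkins/ backup/gallium/var-lib-jenkins/
rsync -aH --numeric-ids --exclude='/ssd/' gallium.wikimedia.org:/srv/ backup/gallium/srv/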

I think our best bet at the moment is installing a new system to replace gallium.

@hashar suggested moving to Jessie directly; I will take a look at the possibilities.

I don't have rights to edit the spares allocation spreadsheet, so I can't comment there, but I am thinking of allocating WMF4723 as a gallium replacement.

This will happen a bit out of process, but I'd like it to be available ASAP.

I am rebuilding/testing the Zuul deb package for Jessie (T137279).

I have created a placeholder incident report at https://wikitech.wikimedia.org/wiki/Incident_documentation/20160608-gallium-disk-failure . It will be completed later on.

The host I chose was already allocated to maps100*, so we are now targeting wmf4746 instead.

smartctl status for both disks:
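(The output itself is not reproduced here; the checks would be along these lines, assuming smartmontools is installed:)

smartctl -H /dev/sda    # overall SMART health verdict
smartctl -a /dev/sda    # full attributes, self-test and error logs
smartctl -H /dev/sdc
smartctl -a /dev/sdc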

Change 293278 had a related patch set uploaded (by Giuseppe Lavagetto):
Add darmstadtium.eqiad.wmnet (Eqiad Row a private)

https://gerrit.wikimedia.org/r/293278

Change 293279 had a related patch set uploaded (by Giuseppe Lavagetto):
contint: add darmstadtium as gallium replacement

https://gerrit.wikimedia.org/r/293279

@MoritzMuehlenhoff is taking care of adding the Jenkins 1.652.2 Debian packages for jessie-wikimedia.

The Zuul package for Jessie seems to work fine based on rough testing on labs (T137279).

Change 293278 merged by Giuseppe Lavagetto:
Add contint1001.eqiad.wmnet

https://gerrit.wikimedia.org/r/293278

Change 293279 merged by Giuseppe Lavagetto:
contint: add contint1001 as gallium replacement

https://gerrit.wikimedia.org/r/293279

Change 293288 had a related patch set uploaded (by Hashar):
zuul.eqiad.wmnet is no more of any use

https://gerrit.wikimedia.org/r/293288

Change 293300 had a related patch set uploaded (by Hashar):
gallium is replaced by contint1001.eqiad.wmnet

https://gerrit.wikimedia.org/r/293300

Change 293301 had a related patch set uploaded (by Hashar):
contint: hiera conf for contint1001.eqiad.wmnet

https://gerrit.wikimedia.org/r/293301

Change 293301 merged by Giuseppe Lavagetto:
contint: hiera conf for contint1001.eqiad.wmnet

https://gerrit.wikimedia.org/r/293301

Change 293302 had a related patch set uploaded (by Giuseppe Lavagetto):
contint1001: add role::ci::master

https://gerrit.wikimedia.org/r/293302

Change 293302 merged by Giuseppe Lavagetto:
contint1001: add role::ci::master

https://gerrit.wikimedia.org/r/293302

Change 293313 had a related patch set uploaded (by Paladox):
zuul status: notice about ongoing outage

https://gerrit.wikimedia.org/r/293313

Change 293313 abandoned by Paladox:
zuul status: notice about ongoing outage

Reason:
No point since gallium is read only.

https://gerrit.wikimedia.org/r/293313

Change 293324 had a related patch set uploaded (by Giuseppe Lavagetto):
contint1001: add zuul

https://gerrit.wikimedia.org/r/293324

Change 293324 merged by Giuseppe Lavagetto:
contint1001: add zuul

https://gerrit.wikimedia.org/r/293324

@Joe got a new server and set up a nice LVM-based partition scheme. We had to polish up the puppet manifests, and eventually that was completed.

Zuul is at 2.1.0-95-g66c8e52-wmf1jessie1, but the service is masked in systemd to prevent it from starting.
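For context, masking and later unmasking the unit is just standard systemctl usage (the unit name zuul is assumed from the above; these lines are illustrative rather than a record of the exact commands run):

systemctl mask zuul      # links the unit to /dev/null so it cannot be started, even manually
systemctl unmask zuul    # undo once we are ready to bring the service up
systemctl start zuul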

Next steps:

  • restore jenkins config
  • open network flows

Status:

@jcrespo has taken backups and is dealing with the disk failure + RAID with guidance from Faidon/Mark and Chris on site

@Joe allocated a server and installed Jessie pairing with @hashar to polish up the puppet scripts.

It seems as if the RAID operations were successful, but the machine got stuck on boot:

The system may have suffered a hardware fault, such as a disk drive
failure. The root device may depend on the RAID devices being online. One
or more of the following RAID devices are degraded:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active (auto-read-on[ 95.706273] kjournald starting. Commit interval 5 seconds
[ 95.706438] EXT3-fs (md0): mounted filesystem with ordered data mode
ly) raid1 sda2[0]
      480573376 blocks [2/1] [U_]

unused devices: <none>
Attempting to start the RAID in degraded mode...
mdadm: CREATE user root not found
mdadm: CREATE group disk not found
Started the RAID in degraded mode.
done.
Begin: Running /scripts/local-bottom ... done.
done.
Begin: Running /scripts/init-bottom ... done.
[ 98.955733] Adding 7811068k swap on /dev/sda1. Priority:-1 extents:1 across:7811068k
[ 100.160075] EXT3-fs (md0): using internal journal
[ 101.332710] SGI XFS with ACLs, security attributes, realtime, large block/inode numbers, no debug enabled
[ 101.342921] SGI XFS Quota Management subsystem
[ 101.351317] XFS (sdb1): Mounting Filesystem
[ 101.582347] XFS (sdb1): Starting recovery (logdev: internal)
[ 101.689028] XFS (sdb1): Ending recovery (logdev: internal)

The RAID array is rebuilding on gallium; it should take roughly an hour and a half.
Puppet is disabled, Jenkins stopped and Zuul masked in systemd.

@jcrespo made a copy of the Jenkins data to db1085; it is now being copied over to the new host contint1001.

contint1001 has puppet disabled, Jenkins stopped and Zuul masked in systemd.

Once the RAID is rebuilt and confirmed to be in a sane state, we can bring the Jenkins/Zuul service back up on gallium.

Tomorrow we will proceed with migrating the service to the new host contint1001, though there are a few blockers, such as figuring out firewall rules (T137323), making sure Zuul works on Jessie, and updating the IP address in Puppet and in the Jenkins jobs.

More details before I go:

There are several backups on db1085:/srv/backup/gallium.wikimedia.org

sda.img
sdb.img

which are raw dd copies of the sda and sdb disks (currently loopback-mounted on /mnt and subdirs).
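For anyone needing to poke around in those images, a raw dd image can be loop-mounted read-only roughly as follows (a sketch only; the loop device and partition numbers depend on what losetup assigns and on the original partition table):

losetup --find --show --read-only --partscan sda.img   # prints e.g. /dev/loop0, exposing /dev/loop0p1, /dev/loop0p2, ...
mount -o ro /dev/loop0p2 /mnt/sda2                      # hypothetical partition and mount point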

There are some unfinished backups on: einsteinium:/srv/backup/gallium.wikimedia.org

db1085 has a root screen session copying /var/lib/jenkins to the same location on contint1001 (which is also receiving that data in a root screen session). You can attach to either of those sessions to check what is being done, or just ls/du the appropriate dir.
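For reference, attaching to those root screen sessions is standard screen usage:

screen -ls         # list the available sessions
screen -x <name>   # attach in shared mode without detaching the other viewer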

gallium is still cloning its disk:

gallium:~$ cat /proc/mdstat 
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sdc2[2] sda2[0]
      480573376 blocks [2/1] [U_]
      [======>..............]  recovery = 31.6% (152319744/480573376) finish=69.0min speed=79240K/sec
      
unused devices: <none>

At 18:50, Mark poked me stating that the RAID rebuild is complete and gallium has been rebooted. He confirmed that the RAID/disk status is all clear as of now, so we can resume Jenkins/Zuul.
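Roughly, bringing the stack back amounts to the following; this is only an illustrative sketch (service names and ordering are assumptions), the actual steps are logged in SAL below:

puppet agent --enable      # let puppet reconverge the host again
systemctl unmask zuul      # undo the mask noted earlier
service jenkins start      # then start the services back up
service zuul start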

Note that gallium may well die again.

Mentioned in SAL [2016-06-08T19:18:57Z] <hashar> Bringing back Jenkins and Zuul on gallium T137265

Change 293288 abandoned by Hashar:
zuul.eqiad.wmnet is no more of any use

Reason:
Will keep the entry for now since it is used by Nodepool. For the contint1001 migration the change is https://gerrit.wikimedia.org/r/318249

https://gerrit.wikimedia.org/r/293288