reinstall bast2001 with jessie
Closed, Resolved · Public

Description

bast2001 is trusty. we want jessie.

should be easy, no ganglia here or any other misc stuff, just bastionhost::general role

Event Timeline

Dzahn created this task. Mar 4 2016, 7:51 PM
Restricted Application added a subscriber: Aklapper. Mar 4 2016, 7:51 PM
Dzahn claimed this task. Mar 4 2016, 7:57 PM

Change 275045 had a related patch set uploaded (by Dzahn):
install-server: switch bast2001 to jessie

https://gerrit.wikimedia.org/r/275045

Change 275045 merged by Dzahn:
install-server: switch bast2001 to jessie

https://gerrit.wikimedia.org/r/275045

Mentioned in SAL [2016-03-04T21:01:47Z] <mutante> bast2001 - rebooting into PXE for T128899

Dzahn closed this task as Resolved. Mar 4 2016, 10:05 PM
Dzahn removed a project: Patch-For-Review.

21:58 mutante: bast2001 if your ssh client shows the fingerprint as base64 SHA256, the new default, you can ssh -o FingerprintHash=md5 bast2001.wikimedia.org to compare
21:29 mutante: bast2001 - reinstalled with jessie, fingerprints on https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/bast2001.wikimedia.org
21:17 mutante: bast2001 - revoke and sign new puppet cert / salt keys

restoring files from home dirs ..

faidon reopened this task as Open. Mar 7 2016, 11:16 AM
faidon triaged this task as Normal priority.
faidon added a subscriber: faidon.

We've been getting RAID failures. It looks like this:

[ 5040.656250] INFO: task apt-get:3558 blocked for more than 120 seconds.
[ 5040.663553]       Not tainted 3.19.0-2-amd64 #1
[ 5040.668633] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5040.677398] apt-get         D ffff88041ce456b0     0  3558   3557 0x00000000
[ 5040.677405]  ffff88041ce456b0 ffffea001047ac40 ffff88041d28e110 0000000000014180
[ 5040.677409]  ffff88041aefffd8 0000000000014180 ffff88041ce456b0 0000000000000246
[ 5040.677413]  ffff880419035000 0000000000004a9a ffff880419035088 ffff880419035024
[ 5040.677426] Call Trace:
[ 5040.677458]  [<ffffffffa016e6ad>] ? jbd2_log_wait_commit+0x9d/0x110 [jbd2]
[ 5040.677472]  [<ffffffff810a8fd0>] ? wait_woken+0x90/0x90
[ 5040.677485]  [<ffffffffa01ae9c3>] ? ext4_sync_file+0x283/0x310 [ext4]
[ 5040.677491]  [<ffffffff81187dbe>] ? SyS_msync+0x1fe/0x260
[ 5040.677499]  [<ffffffff815534cd>] ? system_call_fast_compare_end+0xc/0x11
[ 5158.260005] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 5158.260012] sd 0:0:0:0: [sda] CDB: 
[ 5158.260015] Write(10): 2a 00 00 c5 de 00 00 01 00 00
[ 5158.260026] blk_update_request: I/O error, dev sda, sector 12967424
[ 5158.267070] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 5158.267073] sd 0:0:0:0: [sda] CDB: 
[ 5158.267074] Write(10): 2a 00 00 c5 dd 00 00 01 00 00
[ 5158.267083] blk_update_request: I/O error, dev sda, sector 12967168
[ 5158.274094] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 5158.274097] sd 0:0:0:0: [sda] CDB: 
[ 5158.274099] Write(10): 2a 00 00 c5 dc 00 00 01 00 00
[ 5158.274107] blk_update_request: I/O error, dev sda, sector 12966912
[ 5158.281119] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 5158.281122] sd 0:0:0:0: [sda] CDB: 
[ 5158.281124] Write(10): 2a 00 00 c5 db 00 00 01 00 00
[ 5158.281132] blk_update_request: I/O error, dev sda, sector 12966656
[ 5158.288146] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 5158.288149] sd 0:0:0:0: [sda] CDB: 
[ 5158.288150] Write(10): 2a 00 00 c5 da 00 00 01 00 00
[ 5158.288161] blk_update_request: I/O error, dev sda, sector 12966400
[ 5158.295170] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 5158.295173] sd 0:0:0:0: [sda] CDB: 
[ 5158.295175] Write(10): 2a 00 00 c5 d9 00 00 01 00 00
[ 5158.295183] blk_update_request: I/O error, dev sda, sector 12966144
[ 5158.302198] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 5158.302201] sd 0:0:0:0: [sda] CDB: 
[ 5158.302204] Write(10): 2a 00 00 c5 d8 00 00 01 00 00
[ 5158.302213] blk_update_request: I/O error, dev sda, sector 12965888
[ 5158.309233] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 5158.309236] sd 0:0:0:0: [sda] CDB: 
[ 5158.309238] Write(10): 2a 00 00 c5 d7 00 00 01 00 00
[ 5158.309246] blk_update_request: I/O error, dev sda, sector 12965632
[ 5158.316263] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 5158.316266] sd 0:0:0:0: [sda] CDB: 
[ 5158.316267] Write(10): 2a 00 00 c5 d6 00 00 01 00 00
[ 5158.316278] blk_update_request: I/O error, dev sda, sector 12965376
[ 5158.323292] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 5158.323295] sd 0:0:0:0: [sda] CDB: 
[ 5158.323296] Write(10): 2a 00 00 c5 d5 00 00 01 00 00
[ 5158.323305] blk_update_request: I/O error, dev sda, sector 12965120
[ 5158.330738] md: super_written gets error=-5, uptodate=0
[ 5158.330744] md/raid1:md2: Disk failure on sda3, disabling device.
md/raid1:md2: Operation continuing on 1 devices.
[ 5158.346753] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 5158.347243] md: super_written gets error=-19, uptodate=0
root@bast2001:~# cat /proc/mdstat 
Personalities : [raid1] 
md2 : active raid1 sda3[0](F) sdb3[1]
      477511680 blocks super 1.2 [2/1] [_U]
      bitmap: 1/4 pages [4KB], 65536KB chunk

md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]
      	resync=PENDING
      
md0 : active raid1 sda1[0](F) sdb1[1]
      9756672 blocks super 1.2 [2/1] [_U]
      
unused devices: <none>
root@bast2001:~# fdisk -l /dev/sda
fdisk: cannot open /dev/sda: No such file or directory
root@bast2001:~# fdisk -l /dev/sdc

Disk /dev/sdc: 465.8 GiB, 500107862016 bytes, 976773168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x059b0dde

Device     Boot    Start       End   Sectors   Size Id Type
/dev/sdc1  *        2048  19531775  19529728   9.3G fd Linux raid autodetect
/dev/sdc2       19531776  21485567   1953792   954M fd Linux raid autodetect
/dev/sdc3       21485568 976771071 955285504 455.5G fd Linux raid autodetect

It looks like it's sdc now for some reason? @Dzahn, please investigate :)
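
A minimal sketch, in Python, of how the degraded state visible in the /proc/mdstat paste above could be picked up automatically. The function name `degraded_arrays` and the `SAMPLE_MDSTAT` constant are hypothetical, and the parser assumes the mdstat layout shown in this task: an `mdN :` header line followed by a `[n/m] [U_]` status line.

```python
import re

HEADER = re.compile(r"^(md\d+)\s*:\s*(.*)$")          # e.g. "md2 : active raid1 sda3[0](F) sdb3[1]"
FAILED = re.compile(r"(\w+)\[\d+\]\(F\)")             # members flagged (F)
STATUS = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")  # e.g. "[2/1] [_U]"

def degraded_arrays(mdstat_text):
    """Return {array name: [failed members]} for arrays with a missing member."""
    degraded = {}
    name, failed = None, []
    for line in mdstat_text.splitlines():
        header = HEADER.match(line)
        if header:
            name = header.group(1)
            failed = FAILED.findall(header.group(2))
            continue
        status = STATUS.search(line)
        # An underscore in the status string marks a slot without a working member.
        if name and status and "_" in status.group(3):
            degraded[name] = failed
    return degraded

# Sample copied from the paste in this task (hypothetical constant name).
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md2 : active raid1 sda3[0](F) sdb3[1]
      477511680 blocks super 1.2 [2/1] [_U]
      bitmap: 1/4 pages [4KB], 65536KB chunk

md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda1[0](F) sdb1[1]
      9756672 blocks super 1.2 [2/1] [_U]
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE_MDSTAT))
```

On the paste above this reports md2 (sda3) and md0 (sda1) as degraded, while md1 still shows [UU].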

Dzahn added a comment. Mar 7 2016, 4:13 PM

Uhm, OK. I saw no issue during the install. I didn't partition or set up anything manually and made no changes to partman; I just ran the jessie installer and rebooted.

Change 276022 had a related patch set uploaded (by Dzahn):
netboot: use same partman for all bastions

https://gerrit.wikimedia.org/r/276022

Change 276022 merged by Dzahn:
netboot: use same partman for all bastions

https://gerrit.wikimedia.org/r/276022

Dzahn added a comment. Mar 9 2016, 1:27 AM

After the above change, the installer now fails with:

│                    Failed to create a file system                     │
│ The ext3 file system creation in partition #1 of LVM VG bast2001-vg,  │
│ LV root failed.                                                       │

:/

Dzahn added a comment. Mar 9 2016, 1:51 AM
parted_server: OUT: 1	0-7998537727	7998537728	primary	linux-swap	/dev/mapper/bast2001--vg-swap_1	


parted_server: Partitions printed
parted_server: OUT: 


parted_server: Closing infifo and outfifo
/lib/partman/choose_partition/35crypto/choices: paragraph: 1	0-7998537727	7998537728	primary	linux-swap	/dev/mapper/bast2001--vg-swap_1
parted_server: main_loop: iteration 3107
parted_server: Opening infifo
/lib/partman/choose_partition/35crypto/choices: IN: PARTITIONS =dev=md0
parted_server: Read command: PARTITIONS
parted_server: command_partitions()
parted_server: Opening outfifo
parted_server: OUT: OK


parted_server: OUT: 1	0-9990832127	9990832128	primary	ext3	/dev/md0	


parted_server: Partitions printed
parted_server: OUT: 


parted_server: Closing infifo and outfifo
/lib/partman/choose_partition/35crypto/choices: paragraph: 1	0-9990832127	9990832128	primary	ext3	/dev/md0
parted_server: main_loop: iteration 3108
parted_server: Opening infifo
/lib/partman/choose_partition/35crypto/choices: IN: PARTITIONS =dev=md2
parted_server: Read command: PARTITIONS
parted_server: command_partitions()
parted_server: Opening outfifo
parted_server: OUT: OK


parted_server: OUT: 1	0-488971960319	488971960320	primary	xfs	/dev/md2	


parted_server: Partitions printed
parted_server: OUT: 


parted_server: Closing infifo and outfifo
/lib/partman/choose_partition/35crypto/choices: paragraph: 1	0-488971960319	488971960320	primary	xfs	/dev/md2
parted_server: main_loop: iteration 3109
parted_server: Opening infifo
/lib/partman/choose_partition/35crypto/choices: IN: PARTITIONS =dev=sda
parted_server: Read command: PARTITIONS
parted_server: command_partitions()
parted_server: Opening outfifo
parted_server: OUT: OK


parted_server: OUT: 1	1048576-299892735	298844160	primary	ext3	/dev/sda1	


parted_server: OUT: 5	300941312-500106788863	499805847552	logical	unknown	/dev/sda5	


parted_server: Partitions printed
parted_server: OUT: 


parted_server: Closing infifo and outfifo
/lib/partman/choose_partition/35crypto/choices: paragraph: 1	1048576-299892735	298844160	primary	ext3	/dev/sda1
/lib/partman/choose_partition/35crypto/choices: paragraph: 5	300941312-500106788863	499805847552	logical	unknown	/dev/sda5
parted_server: main_loop: iteration 3110
parted_server: Opening infifo
/lib/partman/choose_partition/35crypto/choices: IN: PARTITIONS =dev=sdb
parted_server: Read command: PARTITIONS
parted_server: command_partitions()
parted_server: Opening outfifo
parted_server: OUT: OK


parted_server: OUT: 1	1048576-10000269311	9999220736	primary	unknown	/dev/sdb1	


parted_server: OUT: 2	10000269312-11000610815	1000341504	primary	unknown	/dev/sdb2	


parted_server: OUT: 3	11000610816-500106788863	489106178048	primary	unknown	/dev/sdb3	


parted_server: Partitions printed
parted_server: OUT: 


parted_server: Closing infifo and outfifo
/lib/partman/choose_partition/35crypto/choices: paragraph: 1	1048576-10000269311	9999220736	primary	unknown	/dev/sdb1
/lib/partman/choose_partition/35crypto/choices: paragraph: 2	10000269312-11000610815	1000341504	primary	unknown	/dev/sdb2
/lib/partman/choose_partition/35crypto/choices: paragraph: 3	11000610816-500106788863	489106178048	primary	unknown	/dev/sdb3
parted_server: main_loop: iteration 3111
parted_server: Opening infifo
/lib/partman/choose_partition/60partition_tree/choices: *******************************************************
/bin/partman: IN: QUIT
parted_server: Read command: QUIT
parted_server: Quitting
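
A rough Python sketch for reading the `parted_server: OUT:` partition lines in the installer log above. The helper name `parse_partition_line` is hypothetical, and the field order (number, byte range, size in bytes, type, filesystem, path, tab-separated) is an assumption inferred from the paste, not taken from partman documentation.

```python
PREFIX = "parted_server: OUT:"

def parse_partition_line(line):
    """Parse one 'parted_server: OUT:' partition line, or return None."""
    if not line.startswith(PREFIX):
        return None
    fields = line[len(PREFIX):].strip().split("\t")
    # Status lines such as "OUT: OK" or an empty "OUT:" carry no partition data.
    if len(fields) < 6 or not fields[0].isdigit():
        return None
    start, end = (int(value) for value in fields[1].split("-"))
    return {
        "number": int(fields[0]),
        "start": start,
        "end": end,
        "size_bytes": int(fields[2]),
        "type": fields[3],
        "filesystem": fields[4],
        "path": fields[5],
    }

# Lines copied from the log in this task, with tabs written as \t.
SAMPLE_LINES = [
    "parted_server: OUT: 3\t11000610816-500106788863\t489106178048\tprimary\tunknown\t/dev/sdb3\t",
    "parted_server: OUT: 5\t300941312-500106788863\t499805847552\tlogical\tunknown\t/dev/sda5\t",
    "parted_server: OUT: OK",
]

if __name__ == "__main__":
    for raw in SAMPLE_LINES:
        part = parse_partition_line(raw)
        if part:
            print(part["path"], part["type"], round(part["size_bytes"] / 2**30, 1), "GiB")
```

Read this way, the log shows sda with partitions 1 and 5 (logical) and sdb with primary partitions 1 to 3, so the two disks apparently did not carry the same layout at that point.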

Change 276085 had a related patch set uploaded (by Dzahn):
netboot: bast2001->raid1-lvm, no more wildcard

https://gerrit.wikimedia.org/r/276085

Change 276085 merged by Dzahn:
netboot: bast2001->raid1-lvm, no more wildcard

https://gerrit.wikimedia.org/r/276085

Dzahn added a comment. Mar 9 2016, 4:36 AM

After this ^ change and another reinstall, the installer got past the partitioning and all looked good. Then the console output became garbled and extremely slow, though it was apparently still doing stuff. Later it just seemed to stop, with no output on the mgmt console. I tried to connect with the install-console key and ended up in busybox.

There, ps showed:

  221 root     50432 S    debconf -o d-i /usr/bin/main-menu
  227 root     10248 S    /usr/bin/main-menu
  467 root         0 SW   [kworker/2:0]
 4532 root      6320 S    log-output -t base-installer apt-install usbutils
 4533 root      4544 S    {apt-install} /bin/sh /bin/apt-install usbutils
 4534 root      4540 S    {in-target} /bin/sh /bin/in-target sh -c debconf-apt-progress --no-progress --log
 4583 root      6320 S    log-output -t in-target chroot /target sh -c debconf-apt-progress --no-progress -
 4584 root      4336 S    sh -c debconf-apt-progress --no-progress --logstderr --  apt-get -q -y --no-remov
 4585 root     60796 S    {frontend} /usr/bin/perl -w /usr/share/debconf/frontend /usr/bin/debconf-apt-prog
 4587 root     29184 S    {debconf-apt-pro} /usr/bin/perl -w /usr/bin/debconf-apt-progress --no-progress --
 4588 root     67712 S    apt-get -o APT::Status-Fd=4 -o APT::Keep-Fds::=5 -o APT::Keep-Fds::=6 -q -y --no-
 5040 root     17644 D    /usr/bin/dpkg --status-fd 16 --configure libusb-1.0-0:amd64 usbutils:amd64
..
15469 root      6460 S    udpkg --configure --force-configure bootstrap-base
15470 root      4544 S    {bootstrap-base.} /bin/sh /var/lib/dpkg/info/bootstrap-base.postinst configure
23184 root         0 SW   [kauditd]
..
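
In the ps listing above, dpkg (PID 5040) is in state D, which is uninterruptible sleep and usually means the process is waiting on I/O. A small Python sketch for picking such processes out of a listing; the name `blocked_processes` is hypothetical, and the column order PID, USER, VSZ, STAT, COMMAND is an assumption based on the paste, since the header line is not shown.

```python
def blocked_processes(ps_output):
    """Return (pid, command) for processes whose state starts with 'D'."""
    blocked = []
    for line in ps_output.splitlines():
        # Assumed columns: PID USER VSZ STAT COMMAND (command may contain spaces).
        fields = line.split(None, 4)
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # skips separators such as ".." and any header line
        pid, _user, _vsz, stat, command = fields
        if stat.startswith("D"):
            blocked.append((int(pid), command))
    return blocked

# Lines copied from the paste in this task.
SAMPLE_PS = """\
  227 root     10248 S    /usr/bin/main-menu
  467 root         0 SW   [kworker/2:0]
 5040 root     17644 D    /usr/bin/dpkg --status-fd 16 --configure libusb-1.0-0:amd64 usbutils:amd64
..
15469 root      6460 S    udpkg --configure --force-configure bootstrap-base
"""

if __name__ == "__main__":
    for pid, command in blocked_processes(SAMPLE_PS):
        print(pid, command)
```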
Dzahn added a comment. Mar 9 2016, 4:43 AM

"reboot" didn't work from here. "kill 1" finally got me out.

Additionally, "console com2" randomly stops working on this DRAC, and I had to reset it twice before it worked again.. server from hell :p

Dzahn added a comment (edited). Mar 9 2016, 6:21 AM
21:15 < mutante> tried again.. installer starts, works normal
                 and quick. at some random time half way thru
                 installing base system becomes super slow..
                 then moves on .. appears to freeze


21:16 < mutante> lol, just wanted to give up and it switched
                 from 58 to 59%
21:17 < mutante> super slow but not dead

I could watch it up to 79% this time, but then it got too late, and once I get disconnected I don't get console output anymore.

22:30 < mutante> !log bast2001 - still installing in snail mode

  • Please feel free to check whether it's done, and if so, re-add it to puppet so users get created. Thanks.

Just checked dmesg. Lots of medium errors; sda has clearly failed, which is what makes the installer so slow.
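
A sketch, in Python, of how kernel log lines like the ones pasted earlier in this task could be summarised per device, and how a WRITE(10) CDB from the same log maps to the reported sector. The function names are hypothetical. The CDB layout used here (opcode 0x2a, a 4-byte big-endian LBA in bytes 2 to 5, a 2-byte transfer length in bytes 7 to 8) is the standard SCSI WRITE(10) format.

```python
import re
from collections import Counter

IO_ERROR = re.compile(r"blk_update_request: I/O error, dev (\w+), sector (\d+)")

def count_io_errors(dmesg_text):
    """Count block-layer I/O errors per device name."""
    return Counter(match.group(1) for match in IO_ERROR.finditer(dmesg_text))

def decode_write10(cdb_hex):
    """Decode a SCSI WRITE(10) CDB into (logical block address, block count)."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")
    blocks = int.from_bytes(cdb[7:9], "big")
    return lba, blocks

# Lines copied from the kernel log earlier in this task.
SAMPLE_DMESG = """\
[ 5158.260015] Write(10): 2a 00 00 c5 de 00 00 01 00 00
[ 5158.260026] blk_update_request: I/O error, dev sda, sector 12967424
[ 5158.267074] Write(10): 2a 00 00 c5 dd 00 00 01 00 00
[ 5158.267083] blk_update_request: I/O error, dev sda, sector 12967168
"""

if __name__ == "__main__":
    print(dict(count_io_errors(SAMPLE_DMESG)))
    print(decode_write10("2a 00 00 c5 de 00 00 01 00 00"))
```

The decoded LBA of the first CDB is 12967424, the same sector the following blk_update_request line reports, with a transfer length of 256 blocks.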

Dzahn raised the priority of this task from Normal to High. Mar 31 2016, 1:03 AM
Dzahn added a comment. Mar 31 2016, 9:26 PM

reinstalled with jessie, re-signed puppet/salt. can be used again.

fingerprints

RSA

    MD5:3f:18:b6:2d:12:1c:81:93:74:a2:eb:86:2c:7c:80:41
    SHA256:saX7tsDLjsHCU67XroGcw+tAwVPuxVTLTaDmLij6Khc

ECDSA

    MD5:27:3f:74:4b:3c:10:7f:cf:64:bf:2e:34:8f:46:35:f0
    SHA256:4PtEoWeDScZkWM9EqM7Dbj1qy4P26h78EI0XOEBLsYU

ED25519

    MD5:c3:4a:14:94:03:ea:33:c6:6f:ff:a7:6a:fa:af:c7:12
    SHA256:s4hrueeAqnQjWNHIEMQ5PCHN5RFUSLJI+saN6v70M3c
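
For comparing the two fingerprint formats listed above: OpenSSH derives both from the same public key blob. The MD5 form is the colon-separated hex digest, and the SHA256 form is the base64 digest with trailing padding removed. A minimal Python sketch; the function name `fingerprints` is hypothetical and `PLACEHOLDER_KEY` is a made-up blob for illustration, not a real key of bast2001.

```python
import base64
import hashlib

def fingerprints(pubkey_line):
    """Return (MD5, SHA256) fingerprints for an OpenSSH public key line."""
    # A public key line looks like: "<type> <base64 blob> [comment]".
    key_blob = base64.b64decode(pubkey_line.split()[1])
    md5_hex = hashlib.md5(key_blob).hexdigest()
    md5_fp = "MD5:" + ":".join(md5_hex[i:i + 2] for i in range(0, len(md5_hex), 2))
    sha256_b64 = base64.b64encode(hashlib.sha256(key_blob).digest()).decode("ascii")
    sha256_fp = "SHA256:" + sha256_b64.rstrip("=")
    return md5_fp, sha256_fp

# Made-up blob (base64 of b"hello"), only to exercise the function.
PLACEHOLDER_KEY = "ssh-ed25519 aGVsbG8= placeholder"

if __name__ == "__main__":
    for fingerprint in fingerprints(PLACEHOLDER_KEY):
        print(fingerprint)
```

With a host's real public key line as input, the two results should correspond to the MD5 and SHA256 entries listed for that key type.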
Dzahn closed this task as Resolved. Mar 31 2016, 9:26 PM
Dzahn removed a project: Patch-For-Review.