
rack/setup/deploy aqs100[456]
Closed, ResolvedPublic

Description

This task will track the racking and setup/deployment of the three new AQS nodes ordered on T132067.

aqs100[123] are all racked in different racks. These three new hosts should also be placed in different racks from one another and, if possible, apart from the existing systems: aqs1001 in A2, aqs1002 in C7, and aqs1003 in D2.

aqs1004-1006

  • - receive in normally via T132067
  • - rack
  • - add mgmt dns entries for both asset tag and hostname
  • - add production dns entries
  • - setup network ports (description, enable, vlan)
  • - update install_server module
  • - install OS
  • - service implementation (hand off to @Ottomata for this as initial requestor on T124947)
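
As a side note, a minimal sketch of how the DNS-related items above could be sanity-checked once done (the hostnames below follow the usual eqiad naming and the asset-tag name is purely hypothetical):

host aqs1004.mgmt.eqiad.wmnet              # mgmt DNS entry for the hostname
host wmf0000.mgmt.eqiad.wmnet              # mgmt DNS entry for the asset tag (hypothetical tag)
host aqs1004.eqiad.wmnet                   # production DNS entry (forward)
host "$(dig +short aqs1004.eqiad.wmnet)"   # and the matching reverse entry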

Related Objects

Status      Assigned
Resolved    Ottomata
Resolved    RobH
Duplicate   mobrovac
Resolved    JAllemandou
Resolved    elukey

Event Timeline


Ok! We discussed partitioning today. We'd like the following:

  • /: a small (30G?) RAID 1 partition on the first 2 drives.
  • 2 RAID 10 partitions (probably ext4, asking to be sure), each across 4 of the disks, filling up the rest of the space.

Something like:

mount                  disks                     raid level
/                      sda1, sdb1                RAID 1
/var/lib/cassandra/a   sda2, sdb2, sdc1, sdd1    RAID 10
/var/lib/cassandra/b   sde1, sdf1, sdg1, sdh1    RAID 10

@JAllemandou or @Eevans can you confirm ext4 for cassandra partitions?

If it is easier to put the / partition RAID10 (or RAID1?) across the first 4 drives, that is fine too.

This comment was removed by RobH.

Ok, old comment was wrong, had bad disk info.

New suggestion:

mount                  disks                     raid level   size
/                      sda1, sdb1, sdc1, sdd1    raid10       30GB
/var/lib/cassandra/a   sda2, sdb2, sdc2, sdd2    raid10       remainder of space
/var/lib/cassandra/b   sde1, sdf1, sdg1, sdh1    raid10       all space on disks

No use of LVM is needed, since analytics wants to use the maximum capacity of these disks for the cassandra mounts.
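
For illustration, roughly what this layout corresponds to in mdadm terms if it were built by hand (a sketch only; the real setup is meant to go through the installer/partman):

mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1   # 30GB array for /
mdadm --create /dev/md1 --level=10 --raid-devices=4 /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2   # /var/lib/cassandra/a
mdadm --create /dev/md2 --level=10 --raid-devices=4 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1   # /var/lib/cassandra/b
mkfs.ext4 /dev/md1 && mkfs.ext4 /dev/md2
mkdir -p /var/lib/cassandra/a /var/lib/cassandra/b
mount /dev/md1 /var/lib/cassandra/a && mount /dev/md2 /var/lib/cassandra/b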

Racked one each in A4, C5, and D4.

Ok! We discussed partitioning today. We'd like the following:

  • /: a small (30G?) RAID 1 partition on the first 2 drives.
  • 2 RAID 10 partitions (probably ext4, asking to be sure), each across 4 of the disks, filling up the rest of the space.

Something like:

mount                  disks                     raid level
/                      sda1, sdb1                RAID 1
/var/lib/cassandra/a   sda2, sdb2, sdc1, sdd1    RAID 10
/var/lib/cassandra/b   sde1, sdf1, sdg1, sdh1    RAID 10

@JAllemandou or @Eevans can you confirm ext4 for cassandra partitions?

FYI, for restbase in production we are mounting /srv and putting all cassandra instances under that, not using per-instance mountpoints. The main reason is to allow free disk space to be shared (and therefore contended, heh!) while doing cleanups/compactions.

Interesting, @fgiunchedi. But what about the case of a failure, with two instances down?

@JAllemandou failure of which component? The other thing that's different for cassandra/restbase in production is that it maximizes available disk space, so the SSDs there for /srv are raid0 and / is raid1. Not sure if it makes sense for your use case though.

The reason RESTBase (and many other Cassandra users) are using RAID-0 or JBOD is that it tends to provide more resilience and throughput at a given data duplication ratio than RAID-10. The replication / fail-over already happens at the machine / rack level, which also covers other sources of failures.

Overall, my recommendation for AQS would be to go with RAID-0 and three-way replication (up from two-way right now).

The reason RESTBase (and many other Cassandra users) are using RAID-0 or JBOD is that it tends to provide more resilience and throughput at a given data duplication ratio than RAID-10. The replication / fail-over already happens at the machine / rack level, which also covers other sources of failures.

It's worth pointing out that they are provisioning 8T of usable space per node, to each of (only) 3 nodes. If (when, really) an array fails, that's quite a significant blast radius (both in terms of amount of data, and percentage of cluster). It's a much larger blast radius than we currently operate with in the RESTBase cluster.

On the other hand, losing one of only three machines is a larger blast radius than losing one of five or so, which when using RAID-0 cost about the same.

There is no question that for a given number of machines RAID-10 offers more reliability than RAID-0. The optimization question is more about reliability and performance per dollar, though, which means that the number of machines is not fixed.

See also: https://docs.datastax.com/en/cassandra/2.1/cassandra/planning/architecturePlanningHardware_c.html?scroll=concept_ds_a4q_x5l_fk__raid-on-disks

Ok! We discussed partitioning today. We'd like the following:

  • /: a small (30G?) RAID 1 partition on the first 2 drives.
  • 2 RAID 10 partitions (probably ext4, asking to be sure), each across 4 of the disks, filling up the rest of the space.

Something like:

mount                  disks                     raid level
/                      sda1, sdb1                RAID 1
/var/lib/cassandra/a   sda2, sdb2, sdc1, sdd1    RAID 10
/var/lib/cassandra/b   sde1, sdf1, sdg1, sdh1    RAID 10

@JAllemandou or @Eevans can you confirm ext4 for cassandra partitions?

FYI, for restbase in production we are mounting /srv and putting all cassandra instances under that, not using per-instance mountpoints. The main reason is to allow free disk space to be shared (and therefore contended, heh!) while doing cleanups/compactions.

I'll add a bit more context to what I meant here, which is essentially the question of how many filesystems to have for cassandra: one per instance or shared among instances.

In restbase production at the moment it is one filesystem shared among the instances; part of the motivation comes from having observed instances compete for disk space during the (de)commissioning of other instances. We've been doing a lot of such instance movement while converting to multi-instance or replacing hardware, and have been suffering from large sstables consuming all disk space (for a discussion of sstable size and the related compaction strategy see T126221). With one filesystem the available space is shared for the cleanups/compactions to happen; of course it also means that a disk failure compromises the machine, but for restbase this was already the case since it is using raid0, has more hardware, and is meant to tolerate a single row going offline.

The raid level discussion though I think is separate and depends on different requirements for availability/cost/performance/etc.

hope that helps!

Hm, we were planning on running 2 cassandra instances per node for a total of 6 instances.

Just stating the obvious here for my own benefit:

  • If we go with RAID 0 on a single partition, any disk failure will take down 1/3 of the cluster capacity.
  • If we go with RAID 0 and partitions per instance (2 per node), then any disk failure will take down 1/6 of the capacity.
  • If we go with RAID 10 on a single partition, then we'll have half the disk space, but no single disk failure will take down any instances.
  • If we go with RAID 10 and partitions per instance (2 per node), then we'll have half the disk space, but no single disk failure will take down any instance.

Perhaps RAID 0 with 2 partitions is a good compromise here? We'll have more disk space to work with, but will lose fewer instances when a disk fails than with just one partition. 1/6 drop in capacity feels more tolerable than 1/3, eh?

Q: Is there a benefit to RAID 0 vs just JBOD here? If not, then s/RAID 0/JBOD/g in the above. :)
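
As a rough illustration of the capacity side of these options (assuming the 8 x 1.6T SSDs per node mentioned further down, and ignoring the small / array; a back-of-the-envelope sketch, not exact figures):

# per node: 8 x 1.6T SSDs
# RAID 10        : 8 * 1.6 / 2 = ~6.4T usable; a disk failure degrades an array, no instance goes down
# RAID 0 / JBOD  : 8 * 1.6     = ~12.8T usable
#   one array per node     -> a disk failure stops both local instances: 2 of 6 = 1/3 of the cluster
#   one array per instance -> a disk failure stops one instance:         1 of 6 = 1/6 of the cluster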

Hm, we were planning on running 2 cassandra instances per node for a total of 6 instances.

Just stating the obvious here for my own benefit:

  • If we go with RAID 0 on a single partition, any disk failure will take down 1/3 of the cluster capacity.
  • If we go with RAID 0 and partitions per instance (2 per node), then any disk failure will take down 1/6 of the capacity.
  • If we go with RAID 10 on a single partition, then we'll have half the disk space, but no single disk failure will take down any instances.

I assume you mean s/partition/array/ above.

  • If we go with RAID 10 and partitions per instance (2 per node), then we'll have half the disk space, but no single disk failure will take down any instance.

It was my understanding that the hardware order was informed by a prior decision to cluster 3 machines with 8T of storage each, and that enough drives were ordered to configure storage as a RAID10. Which I guess is why I haven't thought of these drives as space lost, but instead as a redundancy that was consciously bought and paid for.

In a perfect world, we'd put every Cassandra instance/node on its own host, we'd keep them relatively small and use more of them to provide the right amount of storage, and we wouldn't care about host-level fault tolerance. We don't live in a perfect world though: Ops standardizes on certain hardware (which is not cheap/commodity), data-center rack space comes at a cost, and "small" machines by today's standards aren't really that small.

TL;DR There isn't really a Right or Wrong answer here, it comes down to balancing the right set of trade-offs. So I guess I would ask, what informed the decisions to order this hardware, and has something changed?

Perhaps RAID 0 with 2 partitions is a good compromise here? We'll have more disk space to work with, but will lose fewer instances when a disk fails than with just one partition. 1/6 drop in capacity feels more tolerable than 1/3, eh?

Q: Is there a benefit to RAID 0 vs just JBOD here? If not, then s/RAID 0/JBOD/g in the above. :)

JBOD is an option, but not a perfect one. For example: Compaction isn't smart enough to know when it would make sense to cross the device barrier, so each disk will need enough space for the largest compaction (i.e. you lose the economy of scale combining disks provides here). However, data at rest for a keyspace can span devices, which can cause obsolete or deleted data to be resurrected in the event of a disk failure and repair. The JBOD story is looking much better in 3.x, but we're not there yet.

I assume you mean s/partition/array/ above.

Indeed danke.

It was my understanding that the hardware order was informed by a prior decision to cluster 3 machines with 8T of storage each, and that enough drives were ordered to configure storage as a RAID10.

Which I guess is why I haven't thought of these drives as space lost, but instead as a redundancy that was consciously bought and paid for.

The original request did say this, but somehow along the way we ended up with 8 x 1.6T drives. In a mirrored RAID setup, this gives 6.5T of usable space. But ja, I don't mean we 'lose' space, since we did originally consider mirroring. Maybe I should rephrase my summary to talk about 'gaining' space in a non-mirrored setup :)

So I guess I would ask, what informed the decisions to order this hardware, and has something changed?

The original intention (iirc) was to get to about 8T of usable space after mirroring with RAID. 6 instances, each with more space and a higher replication factor (3), seems ok to me. Temporarily losing 1/6 of them due to a single disk failure sounds acceptable. I'd like @JAllemandou and maybe @Milimetric to weigh in though. If they'd really much prefer the sanity gained with mirroring, we should stick with it. @elukey, any thoughts?

@Ottomata: 6 instances with 4 disks each in RAID 0 works for me. As you said, losing 1 of 6 is acceptable, and having 6.5TB per instance seems fine in terms of empty space for compaction (we have currently maxed out at 3TB on the existing nodes).

Ok then, unless there are objections, let's go with that. Since they are mounting cassandra stuff under /srv elsewhere, let's do that here too.

mount              disks                     raid level   size
/                  sda1, sdb1, sdc1, sdd1    raid10       30G
/srv/cassandra-a   sda2, sdb2, sdc2, sdd2    raid0        remainder of space (should be about 6.43T)
/srv/cassandra-b   sde1, sdf1, sdg1, sdh1    raid0        all space on disks (should be about 6.55T)

@JAllemandou I think this RAID 0 setup assumes we will be using replication factor = 3. If we stick with 2, then I think RAID 0 like this is a little dangerous. If 2, it will be possible (right? maybe not...) for data to be replicated to 2 cassandra instances on the same node. Even though they use different disks, it is possible to lose the whole node. In this case we'd lose data. (unless some rack based replication policy keeps this from happening?)

Anyway, it sounds like we were going with replication factor 3 anyway, ja?

@Ottomata:
TL;DR: We already have replication factor of 3 :)
Details: Double-checked on cassandra-aqs: every keyspace we use has replication = {'class': 'NetworkTopologyStrategy', 'eqiad': '3'}, which means a replication factor of 3. We manually changed it to 2 a while ago, but at every restbase restart it changes back to 3. We talked with the services team a bit about that and then let it go, since drive space was less of an issue after having dropped the per-article hourly resolution.
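
For reference, a quick way to double-check (or set) this from one of the nodes; a sketch using a placeholder keyspace name rather than the real AQS keyspaces, and assuming the standard cqlsh tooling:

KEYSPACE="example_keyspace"   # placeholder, substitute the actual AQS keyspace
cqlsh -e "DESCRIBE KEYSPACE ${KEYSPACE}"   # the CREATE KEYSPACE line shows the replication map
cqlsh -e "ALTER KEYSPACE ${KEYSPACE} WITH replication = {'class': 'NetworkTopologyStrategy', 'eqiad': 3}"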

Ah ok cool.

@Cmjohnson we are good to go on these then!

mark closed subtask Unknown Object (Task) as Resolved. May 6 2016, 3:42 PM

@Ottomata: yes... just need to add the dhcpd and partman config, but feel free if you have time.

Change 287605 had a related patch set uploaded (by Elukey):
Reserve extra IP addresses for the new AQS hosts.

https://gerrit.wikimedia.org/r/287605

Change 287607 had a related patch set uploaded (by Elukey):
Add aqs100[345] DCHP configuration.

https://gerrit.wikimedia.org/r/287607

@Cmjohnson: tried to file a code review for the DHCP config, not sure if correct though!

Change 287605 merged by Elukey:
Reserve extra IP addresses for the new AQS hosts.

https://gerrit.wikimedia.org/r/287605

Change 287607 merged by Elukey:
Add aqs100[345] DCHP configuration.

https://gerrit.wikimedia.org/r/287607

@Cmjohnson: I tried to make a very simple partman recipe (https://phabricator.wikimedia.org/P3025) but I am sure that it is wrong on so many levels; would you mind giving me some hints? I am still not that familiar with partman, sadly.

Change 288184 had a related patch set uploaded (by Elukey):
Add partman receipe for new AQS hosts with SSDs.

https://gerrit.wikimedia.org/r/288184

Change 288184 merged by Elukey:
Add partman receipe for new AQS hosts with SSDs.

https://gerrit.wikimedia.org/r/288184

Change 288403 had a related patch set uploaded (by Elukey):
Add some suggestions to the aqs partman recipe.

https://gerrit.wikimedia.org/r/288403

Change 288403 merged by Elukey:
Add some suggestions to the aqs partman recipe.

https://gerrit.wikimedia.org/r/288403

Change 288410 had a related patch set uploaded (by Elukey):
Ported all the suggestions to the AQS recipe.

https://gerrit.wikimedia.org/r/288410

Change 288410 merged by Elukey:
Ported all the suggestions to the AQS recipe.

https://gerrit.wikimedia.org/r/288410

Change 288424 had a related patch set uploaded (by Elukey):
Switch the PXE installer to Trusty to check a boot bug after Jessie install.

https://gerrit.wikimedia.org/r/288424

Change 288424 merged by Elukey:
Switch the PXE installer to Trusty to check a boot bug after Jessie install.

https://gerrit.wikimedia.org/r/288424

Issue: post Jessie install, the system states it is booting off C:, and then fails to boot anything.

Troubleshooting done so far:

  • compared all bios settings to known good bios settings on a nearly identical HP machine that is working (elastic1047)
  • attempted an Ubuntu install, which has the same issue of booting and then showing no updates on screen.
    • had @Cmjohnson attach a physical console; he states he got nothing different in the output.
  • failed back to jessie, which now has a software error:
May 12 19:58:25 in-target: The following packages have unmet dependencies:
May 12 19:58:25 in-target:  bind9-host : Depends: libbind9-90 (= 1:9.9.5.dfsg-9+deb8u6) but it is not going to be installed
May 12 19:58:25 in-target:               Depends: libdns100 (= 1:9.9.5.dfsg-9+deb8u6) but it is not going to be installed
May 12 19:58:25 in-target:               Depends: libisccfg90 (= 1:9.9.5.dfsg-9+deb8u6) but it is not going to be installed
May 12 19:58:25 in-target:  dnsutils : Depends: libbind9-90 (= 1:9.9.5.dfsg-9+deb8u6) but it is not going to be installed
May 12 19:58:25 in-target:             Depends: libdns100 (= 1:9.9.5.dfsg-9+deb8u6) but it is not going to be installed
May 12 19:58:25 in-target:             Depends: libisccfg90 (= 1:9.9.5.dfsg-9+deb8u6) but it is not going to be installed
May 12 19:58:25 in-target:  rpcbind : Depends: libtirpc1 (>= 0.2.4-2~) but it is not installable
May 12 19:58:25 in-target: E: Unable to correct problems, you have held broken packages.
May 12 19:58:25 debconf: --> PROGRESS SET 950
May 12 19:58:25 debconf: <-- 0 OK

However, Jessie worked earlier today for @elukey so I'm not sure why it is suddenly now failing for me.

full log for the aqs1006 install: P3061 line 26707
full log for the aqs1005 install: P3062 line 27493

@Papaul states he installed aqs1004 without that error, but it has the cannot boot issue:

Attempting Boot From Hard Drive (C:)

after post and then nothing.

I'm not sure how he got aqs1004 installed, when I cannot get 1005 or 1006 with identical hardware to do so.

So @Dzahn was able to work around the dependency issue. I've asked him to post an update, but I'll attempt to paraphrase from IRC:

When the installer fails, go back one step in the process manually and configure apt; it will then take a long time, and eventually go through.

It shouldn't be failing at all though, so there is something odd where aqs1004 didn't have the issue, but aqs1005 and aqs1006 did.

Additionally, they still have the error of:

Attempting Boot From Hard Drive (C:)

When they should boot up the OS.

Yep, so the "install software" / tasksel step of the installer failed with the "packages have unmet dependencies" errors Rob pasted above. I went back to the installer step right before that, "configure apt", where apt itself is set up and sources are selected. That took a while, then it asked me whether I also want the security updates and backports repos; I did not change that and accepted the default, but usually you don't get asked. After that the installer continued through the install-software part without the errors and went on to install grub as normal. One guess is that the "configure apt" part was interrupted somehow during the first installer run (it may look stuck, and just hitting enter would mean "cancel" and go to the next step), or it timed out trying to fetch package lists from the remote repos.

Separately from all that, it now will not boot, but the install finished.

Change 288626 had a related patch set uploaded (by Elukey):
Remove aqs1006 partman configuration to test why boot is failing after os install.

https://gerrit.wikimedia.org/r/288626

Change 288626 merged by Elukey:
Remove aqs1006 partman configuration to test why boot is failing after os install.

https://gerrit.wikimedia.org/r/288626

So I disabled the second controller port and it boots into the OS. It seems the OS installs onto one of the ports, but the other port conflicts during POST. Perhaps the OS installs one as primary, while the BIOS posts the other?

I'm going to enable the other one again, and set it as the primary boot controller, and attempt reinstall.

Ok, so the boot-from-C issue is due to Jessie/Trusty detecting the second controller/port ahead of the primary controller/port.

The boot order has to be changed to boot from the secondary controller/port in bios, under boot options, legacy bios boot options, boot disk order.

Once that is done, the installer will work and install. Just keep in mind that for the 8 SSDs, it's detecting the disks in slots 5-8 before 1-4. So when a disk goes bad, that will matter.
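
Since the controllers are enumerated in that swapped order, here is a rough sketch of how to map a Linux device back to a physical slot/serial when a disk does fail (exact by-path strings depend on the controller, and this assumes smartmontools is installed):

ls -l /dev/disk/by-path/ | grep -v part    # controller/port path for each block device
smartctl -i /dev/sda | grep -i serial      # serial number, to match against the drive tray label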

Now one can set up a manual partition install of aqs1006 without incident. However, upon reboot with the new partman config, it fails to load. I've not yet investigated whether this is a bad partman recipe or not, but I assume so, since it boots and installs with all 8 SSDs in a manual config (or using the auto config of another recipe).

So now the recipe needs work.

I fixed the BIOS boot order on aqs100[456]; setting port #2 as primary allows the BIOS to boot in the order in which the jessie/trusty installer detects the controllers.

Please note this doesn't fix the broken recipe.

Tried to re-install Debian on aqs1006 and I was able to boot correctly, but indeed the recipe is not doing what I need:

root@aqs1006:~# cat /proc/mdstat
Personalities : [raid10] [raid0]
md2 : active raid0 sde2[0] sdh2[3] sdg2[2] sdf2[1]
      117121024 blocks super 1.2 512k chunks

md1 : active raid0 sda3[0] sdd3[3] sdc3[2] sdb3[1]
      6133534720 blocks super 1.2 512k chunks

md0 : active raid10 sda2[0] sdd2[3] sdc2[2] sdb2[1]
      58560512 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]

root@aqs1006:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             10M     0   10M   0% /dev
tmpfs            13G  9.2M   13G   1% /run
/dev/md0         55G  1.3G   51G   3% /
tmpfs            32G     0   32G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            32G     0   32G   0% /sys/fs/cgroup
/dev/md2        110G   60M  105G   1% /srv/cassandra-b
/dev/md1        5.7T   57M  5.4T   1% /srv/cassandra-a

Had a chat with @Volans, and after seeing what fdisk shows, the partman recipe looks wrong. Each disk has the following layout:

Device     Boot    Start        End    Sectors  Size Id Type
/dev/sdf1           2048       4095       2048    1M 83 Linux
/dev/sdf2           4096   58597375   58593280   28G fd Linux raid autodetect
/dev/sdf3       58597376 3125626879 3067029504  1.4T fd Linux raid autodetect

This should be due to the current d-i partman-auto/expert_recipe multiraid config that creates three partitions per disk. This means that the recipe:

  1. creates 8 * 30GB partitions (/dev/sdX2)
  2. creates 8 * 1MB partitions (/dev/sdX1)
  3. creates 8 * rest of the disk (/dev/sdX3)

Then the raid config uses sdX2 and sdX3 partitions, creating:

  1. md0 with 4 * 30GB partitions and raid 10 (correct for raid)
  2. md1 with 4 * 30GB partitions and raid 0 (incorrect, 110GB instead of 6TB) ==> /srv/cassandra-a
  3. md2 with 8 * rest of the disk space and raid 0 (incorrect) ==> /srv/cassandra-b

Is partman smart enough to avoid the above symmetry, or is it worth just creating a simple recipe for / and creating the mdX arrays manually after install?

Cc: @Papaul, @RobH
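
In case the simple-recipe-plus-manual-setup route were preferred, the post-install part would look roughly like the following (a sketch that assumes the installer only built the small / array and left the rest of the disks alone; device and partition numbers are illustrative):

mdadm --create /dev/md1 --level=0 --raid-devices=4 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3   # remainder partitions on the first 4 disks
mdadm --create /dev/md2 --level=0 --raid-devices=4 /dev/sde /dev/sdf /dev/sdg /dev/sdh       # the whole of the last 4 disks
mkfs.ext4 /dev/md1 && mkfs.ext4 /dev/md2
mkdir -p /srv/cassandra-a /srv/cassandra-b
echo '/dev/md1 /srv/cassandra-a ext4 defaults 0 2' >> /etc/fstab
echo '/dev/md2 /srv/cassandra-b ext4 defaults 0 2' >> /etc/fstab
mount -a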

Change 288921 had a related patch set uploaded (by Elukey):
Revised aqs-cassandra-8ssd-2srv.cfg partman recipe.

https://gerrit.wikimedia.org/r/288921

Change 288921 merged by Elukey:
Revised aqs-cassandra-8ssd-2srv.cfg partman recipe.

https://gerrit.wikimedia.org/r/288921

Change 288930 had a related patch set uploaded (by Elukey):
Fixed aqs-cassandra-8ssd-2srv.cfg partman config.

https://gerrit.wikimedia.org/r/288930

Change 288930 merged by Elukey:
Fixed aqs-cassandra-8ssd-2srv.cfg partman config.

https://gerrit.wikimedia.org/r/288930

New recipe:

  • RAID10 between 8 disks, 10GB partitions (~40GB in total)
  • RAID0 between 4 disks, 5.7TB total
  • RAID0 between 4 disks, 5.7TB total

root@aqs1006:~# cat /proc/mdstat
Personalities : [raid0] [raid10]
md2 : active raid0 sde2[0] sdg2[3] sdh2[2] sdf2[1]
      6211665920 blocks super 1.2 512k chunks

md0 : active raid10 sda1[0] sdg1[7] sdh1[6] sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1]
      39026688 blocks super 1.2 512K chunks 2 near-copies [8/8] [UUUUUUUU]

md1 : active raid0 sda2[0] sdd2[3] sdc2[2] sdb2[1]
      6211665920 blocks super 1.2 512k chunks

root@aqs1006:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             10M     0   10M   0% /dev
tmpfs            13G  9.2M   13G   1% /run
/dev/md0         37G  1.2G   34G   4% /
tmpfs            32G     0   32G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            32G     0   32G   0% /sys/fs/cgroup
/dev/md1        5.8T   57M  5.5T   1% /srv/cassandra-a
/dev/md2        5.8T   57M  5.5T   1% /srv/cassandra-b

Successfully installed aqs1004/5, but aqs1005 fails to boot with:

Loading Linux 4.4.0-1-amd64 ...
Loading initial ramdisk ...
[    0.113680] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)
Loading, please wait...
mdadm: No devices listed in conf file were found.
Gave up waiting for root device.  Common problems:
 - Boot args (cat /proc/cmdline)
   - Check rootdelay= (did the system wait long enough?)
   - Check root= (did the system wait for the right device?)
 - Missing modules (cat /proc/modules; ls /dev)
ALERT!  /dev/disk/by-uuid/d0e5e321-5437-4409-a9f9-02d71636487c does not exist.  Dropping to a shell!
modprobe: module ehci-orion not found in modules.dep


BusyBox v1.22.1 (Debian 1:1.22.0-9+deb8u1) built-in shell (ash)
Enter 'help' for a list of built-in commands.

/bin/sh: can't access tty; job control turned off
(initramfs)

Really weird, after rebooting a couple of times:

Loading Linux 4.4.0-1-amd64 ...
Loading initial ramdisk ...
[    0.113896] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)
Loading, please wait...
mdadm: /dev/md/1 has been started with 4 drives.
mdadm: /dev/md/0 has been started with 8 drives.
mdadm: /dev/md/2 has been started with 4 drives.
/dev/md0: clean, 41902/2441216 files, 493180/9756672 blocks
[    2.017374] power_meter ACPI000D:00: Ignoring unsafe software power cap!
[    2.259864] systemd-fsck[617]: /dev/md2: clean, 11/194117632 files, 12282460/1552916480 blocks
[    2.280750] systemd-fsck[618]: /dev/md1: clean, 11/194117632 files, 12282460/1552916480 blocks
[    2.451591] kvm: disabled by bios
[    2.514562] kvm: disabled by bios

Debian GNU/Linux 8 aqs1005 ttyS1

aqs1005 login:

All right, 1005 booted after restarts; it might be a problem of the md arrays taking too much time to assemble at boot?

Anyhow, after a chat with @RobH we decided to revise the recipe a bit to use only 4 disks for the raid10 rather than 8, to reduce the risk of high load:

root@aqs1004:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             10M     0   10M   0% /dev
tmpfs            13G  9.2M   13G   1% /run
/dev/md0         28G  1.2G   25G   5% /
tmpfs            32G     0   32G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            32G     0   32G   0% /sys/fs/cgroup
/dev/md2        5.8T   57M  5.5T   1% /srv/cassandra-b
/dev/md1        5.8T   57M  5.5T   1% /srv/cassandra-a

root@aqs1004:~# cat /proc/mdstat
Personalities : [raid10] [raid0]
md2 : active raid0 sdf2[0] sdh2[3] sdc2[2] sdg2[1]
      6192136192 blocks super 1.2 512k chunks

md1 : active raid0 sda2[0] sde2[3] sdd2[2] sdb2[1]
      6192136192 blocks super 1.2 512k chunks

md0 : active raid10 sda1[0] sde1[3] sdd1[2] sdb1[1]
      29278208 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]

This means that on 4 of the disks we'll end up with an unused 15GB partition, but we can live with that :)

Also, generic question: is 30GB for root enough?

Change 289157 had a related patch set uploaded (by Elukey):
Reduce root rad10 in aqs-cassandra-8ssd-2srv.cfg after chat with Rob.

https://gerrit.wikimedia.org/r/289157

Change 289157 merged by Elukey:
Reduce root rad10 in aqs-cassandra-8ssd-2srv.cfg after chat with Rob.

https://gerrit.wikimedia.org/r/289157

All the hosts have been re-installed and are working fine; the only issue seems to be that occasionally the md arrays are not available during boot.

30GB for root should be fine, we do that on many other servers.

We could put in a boot delay option (we used to have a similar issue on some models of Dells in the past). I recall it simply applying to the entire fleet though, not just one or two machines.

We did have a similar disk-detection timeout issue (it goes away on reboot) on a number of ulsfo's cp systems when I rebooted them a couple of weeks ago, so it's not unheard of.

The raid arrays issue might be related to T131961: Boot time race condition when assembling root raid device on cp1052, though that should be fixed already in puppet for jessie, modulo a rebuild of the initramfs.
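
For reference, the generic manual version of that fix is to make sure the arrays are listed in mdadm.conf and then rebuild the initramfs, roughly as follows (a sketch of the usual procedure, not the actual puppet change):

mdadm --detail --scan >> /etc/mdadm/mdadm.conf   # ensure all three arrays are recorded (check for duplicates first)
update-initramfs -u -k all                       # rebuild so the arrays assemble early at boot
# a rootdelay=N kernel argument is another option if the controller is simply slow to expose the disks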

I followed @fgiunchedi's advice and had a chat with @ema about this. His code updates the initramfs only after the first puppet run, whereas I had the delay issue only during the first boot, when puppet was still waiting to run for the first time. I rebooted aqs1004 several times (also checking the output of @ema's script, "Waiting for disks to show up (T131961)") and everything went fine.

The issue seems resolved, I would be inclined to close this task. @RobH what do you think?

elukey updated the task description.