Upgrade restbase100[7-9] to match restbase100[1-6] hardware
Closed, ResolvedPublic

Description

For historical reasons, restbase100[7-9] have different hardware than restbase100[1-6]. Without special efforts to adjust for the differing capacity, they are thus running out of disk space before other nodes do. They also perform significantly worse, with iowait for example consistently at around 4-5% compared to about 0.5% on restbase100[1-6].

I think it is not worth spending the effort to work around these limitations. We should upgrade the hardware to match that of restbase100[1-6].

We did plan for this possibility when we purchased those boxes. They are dual CPU Dell machines with only one socket populated.

Upgrading them would entail:

  • Adding a third SSD to match the disk space. This is the most urgent issue, and already tracked in T119659.
  • Adding a second CPU + RAM. We did consider this when we purchased the hardware, but I'm not sure how expensive & labor intensive this would be. Of the two, upgrading RAM would probably be the more urgent one.

Event Timeline

GWicke raised the priority of this task from to Needs Triage.
GWicke updated the task description.
GWicke added projects: SRE, procurement.
GWicke added a subscriber: GWicke.
GWicke renamed this task from Upgrade 100[7-9] to match codfw (and restbase100[1-6]) hardware to Upgrade restbase100[7-9] to match restbase100[1-6] hardware. Dec 1 2015, 5:45 AM
GWicke set Security to None.
GWicke added a subtask: Unknown Object (Task).
GWicke edited subscribers, added: faidon, fgiunchedi, RobH, mark; removed: Aklapper, StudiesWorld.
fgiunchedi triaged this task as Medium priority. Dec 1 2015, 4:22 PM
RobH mentioned this in Unknown Object (Task). Dec 1 2015, 6:15 PM

@Cmjohnson already checked the disk bays of restbase1007-1009 on T119896 and they do have room to accommodate more disks.

Confirmed that restbase1007-1009 will take up to 6 more disks. I have disk carriers on-site.

This seems to have a cross-over/duplication of disk expansion with T119659. As such, I think we should drop the disk requests for restbase1007-1009 off T119659, or off of this, but not keep it on both.

I've chatted with @GWicke in IRC. Having the disk space is useful even without the expansion of the cpu/ram. As such, the disk upgrade will be handled on the older task T119659.

Last week we processed the disk upgrades, and I need to follow up and get RAM/CPU options. This wasn't a high priority compared to the other orders, so it was shifted back (I had multiple open quotes/tasks with Dell already).

Quotes for the RAM/CPU upgrade are on private task T121255. Since it contains vendor pricing, I elected to make a private task rather than move this into a private space.

RobH closed subtask Unknown Object (Task) as Resolved. Dec 11 2015, 6:35 PM
RobH mentioned this in Unknown Object (Task).
RobH added a subtask: Unknown Object (Task).
RobH mentioned this in Unknown Object (Task). Dec 11 2015, 6:48 PM
RobH added a subtask: Unknown Object (Task).
RobH edited projects, added hardware-requests; removed procurement.
RobH moved this task from Backlog to In Discussion / Review on the hardware-requests board.

started today to expand restbase1007:

sfdisk -d /dev/sda | sfdisk /dev/sdc
mdadm --add /dev/md0 /dev/sdc1
mdadm --add /dev/md1 /dev/sdc2
mdadm --grow /dev/md2 --raid-devices=3 --add /dev/sdc3

note the last step is an intensive one: to resize a raid0 online, mdadm first converts the array to raid4 and then back to raid0

on restbase1007 the last step didn't go according to plan: while the resize was ongoing the machine's load went through the roof and I force-rebooted it. the raid array now shows up as raid4, but it is otherwise mountable and doesn't show obvious signs of not working; cassandra and restbase are both stopped

restbase1007:~$ cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] [raid1] 
md1 : active (auto-read-only) raid1 sda2[0] sdc2[2](S) sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]
      
md0 : active raid1 sdc1[2](S) sdb1[1] sda1[0]
      29279232 blocks super 1.2 [2/2] [UU]
      
md2 : active raid4 sdb3[1] sdc3[3] sda3[0]
      1939599360 blocks super 1.2 level 4, 512k chunk, algorithm 5 [4/3] [UU__]
      
unused devices: <none>
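
Since the grow passes through raid0 → raid4 → raid0, one way to tell which stage an array is in is to read its level from /proc/mdstat. The helper below is a minimal sketch and not something that was run on these hosts; the function name md_level is made up for illustration.

```shell
#!/bin/sh
# Print the raid level of one md array, reading /proc/mdstat-style text
# on stdin. Handles extra status words such as "(auto-read-only)".
md_level() {
    awk -v md="$1" '$1 == md && $2 == ":" {
        for (i = 3; i <= NF; i++)
            if ($i ~ /^raid[0-9]+$/) { print $i; exit }
    }'
}

# Intended use on a live host:
#   md_level md2 < /proc/mdstat
```

Fed the output above, this would report raid4 for md2, i.e. the array is still in the intermediate stage of the conversion.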

looking again /dev/sdc is marked as "spare, rebuilding" and I've convinced myself that data is readable by running

root@restbase1007:/srv/cassandra-a# find . -type f -iname '*.db' | parallel --xargs sstablemetadata >/dev/null
Exception in thread "main" java.util.NoSuchElementException
        at java.util.StringTokenizer.nextToken(StringTokenizer.java:349)
        at org.apache.cassandra.io.sstable.Descriptor.fromFilename(Descriptor.java:258)
        at org.apache.cassandra.io.sstable.Descriptor.fromFilename(Descriptor.java:224)
        at org.apache.cassandra.tools.SSTableMetadataViewer.main(SSTableMetadataViewer.java:50)
root@restbase1007:/srv/cassandra-a# echo $?
1

so only one sstablemetadata job failed, restarting cassandra so we can bootstrap restbase1004

update, I've tested locally growing a raid0 and was successful

# cat /proc/mdstat
Personalities : [raid0]
md0 : active raid0 sdc[1] sdb[0]
      16760832 blocks super 1.2 512k chunks

unused devices: <none>
# mdadm --grow /dev/md0 --raid-devices=3 --add /dev/sdd
mdadm: level of /dev/md0 changed to raid4
mdadm: added /dev/sdd
# cat /proc/mdstat
Personalities : [raid0] [raid6] [raid5] [raid4]
md0 : active raid4 sdd[3] sdc[1] sdb[0]
      16760832 blocks super 1.2 level 4, 512k chunk, algorithm 5 [4/3] [UU__]
      [=============>.......]  reshape = 65.3% (5475328/8380416) finish=0.6min speed=78862K/sec

unused devices: <none>
# cat /proc/mdstat
Personalities : [raid0] [raid6] [raid5] [raid4]
md0 : active raid0 sdd[3] sdc[1] sdb[0]
      25141248 blocks super 1.2 512k chunks

unused devices: <none>
# resize2fs /dev/md0
resize2fs 1.42.12 (29-Aug-2014)
Filesystem at /dev/md0 is mounted on /srv; on-line resizing required
old_desc_blocks = 1, new_desc_blocks = 2
The filesystem on /dev/md0 is now 6285312 (4k) blocks long.

I'll try a resize with activity on the filesystem to see if I can reproduce the kernel errors below that we've seen on restbase1007

Dec 21 15:54:05 restbase1007 kernel: [625207.406320] md/raid:md2: device sda3 operational as raid disk 0
Dec 21 15:54:05 restbase1007 kernel: [625207.406322] md/raid:md2: device sdb3 operational as raid disk 1
Dec 21 15:54:05 restbase1007 kernel: [625207.406644] md/raid:md2: allocated 0kB
Dec 21 15:54:05 restbase1007 kernel: [625207.415666] md/raid:md2: raid level 4 active with 2 out of 3 devices, algorithm 5
Dec 21 15:54:05 restbase1007 kernel: [625207.422739] general protection fault: 0000 [#1] SMP 
Dec 21 15:54:05 restbase1007 kernel: [625207.422759] Modules linked in: raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq nfnetlink_queue nfnetlink_log nfnetlink bluetooth rfkill binfmt_misc xt_pkttype nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables 8021q garp mrp stp llc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_devintf kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ttm aesni_intel drm_kms_helper aes_x86_64 lrw sb_edac gf128mul drm glue_helper ablk_helper iTCO_wdt i2c_algo_bit edac_core evdev iTCO_vendor_support dcdbas i2c_core cryptd pcspkr mei_me lpc_ich mei mfd_core shpchp wmi ipmi_si 8250_fintek ipmi_msghandler processor thermal_sys acpi_power_meter button autofs4 ext4 crc16 mbcache jbd2 dm_mod raid1 raid0 md_mod sg sd_mod crc32c_intel ahci ehci_pci libahci tg3 ehci_hcd ptp libata pps_core libphy usbcore scsi_mod usb_common
Dec 21 15:54:05 restbase1007 kernel: [625207.422768] CPU: 1 PID: 44943 Comm: java Not tainted 3.19.0-2-amd64 #1 Debian 3.19.3-9
Dec 21 15:54:05 restbase1007 kernel: [625207.422768] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.1.10 03/10/2015
Dec 21 15:54:05 restbase1007 kernel: [625207.422769] task: ffff880623eb3230 ti: ffff880856ca0000 task.ti: ffff880856ca0000
Dec 21 15:54:05 restbase1007 kernel: [625207.422774] RIP: 0010:[<ffffffffa008a244>]  [<ffffffffa008a244>] map_sector.isra.3+0x84/0xc0 [raid0]
Dec 21 15:54:05 restbase1007 kernel: [625207.422774] RSP: 0018:ffff880856ca3a20  EFLAGS: 00010202
Dec 21 15:54:05 restbase1007 kernel: [625207.422775] RAX: 0020002000200020 RBX: 000000000003d000 RCX: ffffffff812d1ad0
Dec 21 15:54:05 restbase1007 kernel: [625207.422775] RDX: 000000000008dfd5 RSI: 0000000000000218 RDI: ffff8808532eac00
Dec 21 15:54:05 restbase1007 kernel: [625207.422776] RBP: 00000000237f5618 R08: ffff880856ca3a30 R09: 0000000000000218
Dec 21 15:54:05 restbase1007 kernel: [625207.422776] R10: 0000000000000000 R11: 000000000008dfd5 R12: ffff880853327000
Dec 21 15:54:05 restbase1007 kernel: [625207.422777] R13: ffff880623df5b00 R14: ffff880856ca3aa8 R15: ffff88061ea2f008
Dec 21 15:54:05 restbase1007 kernel: [625207.422778] FS:  00007f9eaff96700(0000) GS:ffff88085e420000(0000) knlGS:0000000000000000
Dec 21 15:54:05 restbase1007 kernel: [625207.422778] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 21 15:54:05 restbase1007 kernel: [625207.422779] CR2: 0000233d8d51e100 CR3: 00000008550f0000 CR4: 00000000001407e0
Dec 21 15:54:05 restbase1007 kernel: [625207.422779] Stack:
Dec 21 15:54:05 restbase1007 kernel: [625207.422781]  ffffffffa008a3e1 ffff88085a4c7740 0000000000000218 000000000000000c
Dec 21 15:54:05 restbase1007 kernel: [625207.422781]  0000000000100000 ffff880623df5b00 ffff880852c1e088 ffff880852c1e088
Dec 21 15:54:05 restbase1007 kernel: [625207.422782]  ffff880856ca3bf0 0000000000000001 ffffffffa01557ab ffff880856ca3aa8
Dec 21 15:54:05 restbase1007 kernel: [625207.422783] Call Trace:
Dec 21 15:54:05 restbase1007 kernel: [625207.422787]  [<ffffffffa008a3e1>] ? raid0_mergeable_bvec+0x101/0x160 [raid0]
Dec 21 15:54:05 restbase1007 kernel: [625207.422793]  [<ffffffffa01557ab>] ? linear_merge+0x4b/0x60 [dm_mod]
Dec 21 15:54:05 restbase1007 kernel: [625207.422796]  [<ffffffffa014f96c>] ? dm_merge_bvec+0xac/0x110 [dm_mod]
Dec 21 15:54:05 restbase1007 kernel: [625207.422800]  [<ffffffff81298923>] ? __bio_add_page+0x1f3/0x280
Dec 21 15:54:05 restbase1007 kernel: [625207.422803]  [<ffffffff811fbc7f>] ? do_mpage_readpage+0x28f/0x670
Dec 21 15:54:05 restbase1007 kernel: [625207.422809]  [<ffffffffa0211030>] ? _ext4_get_block+0x1a0/0x1a0 [ext4]
Dec 21 15:54:05 restbase1007 kernel: [625207.422810]  [<ffffffff811fc132>] ? mpage_readpages+0xd2/0x120
Dec 21 15:54:05 restbase1007 kernel: [625207.422815]  [<ffffffffa0211030>] ? _ext4_get_block+0x1a0/0x1a0 [ext4]
Dec 21 15:54:05 restbase1007 kernel: [625207.422818]  [<ffffffff8109f88a>] ? dequeue_task_fair+0x9a/0xa70
Dec 21 15:54:05 restbase1007 kernel: [625207.422822]  [<ffffffff8119c181>] ? alloc_pages_current+0x91/0x110
Dec 21 15:54:05 restbase1007 kernel: [625207.422825]  [<ffffffff8115c4f3>] ? __do_page_cache_readahead+0x173/0x200
Dec 21 15:54:05 restbase1007 kernel: [625207.422827]  [<ffffffff8115c7cc>] ? ondemand_readahead+0x24c/0x260
Dec 21 15:54:05 restbase1007 kernel: [625207.422828]  [<ffffffff81151076>] ? generic_file_read_iter+0x486/0x5d0
Dec 21 15:54:05 restbase1007 kernel: [625207.422831]  [<ffffffff811beb61>] ? new_sync_read+0x71/0xa0
Dec 21 15:54:05 restbase1007 kernel: [625207.422833]  [<ffffffff811bfd71>] ? vfs_read+0x81/0x130
Dec 21 15:54:05 restbase1007 kernel: [625207.422835]  [<ffffffff811bfe62>] ? SyS_read+0x42/0xb0
Dec 21 15:54:05 restbase1007 kernel: [625207.422838]  [<ffffffff81552c0d>] ? system_call_fast_compare_end+0xc/0x11
Dec 21 15:54:05 restbase1007 kernel: [625207.422847] Code: 89 08 48 2b 17 48 63 4b 10 5b 5d 48 c1 fa 03 48 0f af d0 4c 89 d8 41 5c 4c 0f af d2 31 d2 48 f7 f1 48 8b 47 08 48 63 d2 4c 01 d2 <48> 8b 04 d0 c3 0f 1f 80 00 00 00 00 89 f1 83 ee 01 31 d2 21 c6 
Dec 21 15:54:05 restbase1007 kernel: [625207.422850] RIP  [<ffffffffa008a244>] map_sector.isra.3+0x84/0xc0 [raid0]
Dec 21 15:54:05 restbase1007 kernel: [625207.422850]  RSP <ffff880856ca3a20>
Dec 21 15:54:05 restbase1007 kernel: [625207.423693] general protection fault: 0000 [#2] SMP 
Dec 21 15:54:05 restbase1007 kernel: [625207.423703] Modules linked in: raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq nfnetlink_queue nfnetlink_log nfnetlink bluetooth rfkill binfmt_misc xt_pkttype nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter
Dec 21 15:54:05 restbase1007 kernel: [625207.423703] ---[ end trace c1b9bfb9ed7bc34a ]---

I can't reproduce it locally by growing a raid0 onto a third disk while there is disk activity, so I'm proposing the following:

  • expand the raid1s on restbase1008 via mdadm --add /dev/md0 /dev/sdc1 and mdadm --add /dev/md1 /dev/sdc2 and wait for completion
  • stop cassandra on restbase1008 so /srv is quiescent
  • run mdadm --grow /dev/md2 --raid-devices=3 --add /dev/sdc3 and wait for completion
    • this will be a long-running operation during which cassandra is stopped on the node. gc_grace_time is currently set to 24h IIRC, so there's potential to exceed that, even though read/write speed should be fairly high with no other activity on the raid array.
  • run resize2fs /dev/md2 and restart cassandra
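
The steps above can be sketched as a small script. This is an illustration only: the device names come from this task, while the `cassandra` service name and the DRY_RUN wrapper are assumptions. By default it only prints the commands it would run. Note that in the restbase1007 mdstat output earlier, partitions added to the two-device raid1 arrays with plain --add show up as spares (S), so the wait on md0/md1 may return immediately.

```shell
#!/bin/sh
# Sketch of the proposed offline expansion for one node.
# DRY_RUN=1 (the default) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

expand_node() {
    new_disk="$1"   # e.g. /dev/sdc, already partitioned like /dev/sda

    # 1. add the new partitions to the two raid1 arrays, wait for any sync
    run mdadm --add /dev/md0 "${new_disk}1"
    run mdadm --add /dev/md1 "${new_disk}2"
    run mdadm --wait /dev/md0 /dev/md1

    # 2. stop cassandra so /srv is quiescent (service name is an assumption)
    run service cassandra stop

    # 3. grow the raid0 to three devices and wait for the reshape to finish
    run mdadm --grow /dev/md2 --raid-devices=3 --add "${new_disk}3"
    run mdadm --wait /dev/md2

    # 4. grow the filesystem and bring cassandra back
    run resize2fs /dev/md2
    run service cassandra start
}

# Example: expand_node /dev/sdc
```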

ditto for restbase1009. for restbase1007 it should be possible to restart the reshape operation, although we could also decommission/bootstrap the node again

Let's coordinate with the CPU / memory upgrades in T121255 to avoid multiple downtimes:

  1. Shut down the cassandra node without decommissioning it, to avoid overloading other nodes.
  2. Upgrade cpu & memory, create a new linear LVM volume using the new disks (wiping the data). The LVM volume will make the next expansion (T121575) a lot quicker, at the price of no more striping. We think that over a longer time, the data / IO load will still be reasonably distributed across the disks by virtue of Cassandra's compactions being written to new locations before deleting the old data.
  3. Remove the dead node from the cluster (nodetool removenode), and bootstrap a new node with 256 tokens.

After 1007-1009 are updated: Run a repair.
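
For step 3, `nodetool removenode` takes the host ID of the dead node, which `nodetool status` lists for nodes in state DN. The helper below is a rough sketch for pulling that ID out of the status output; the function name down_host_ids is invented here, and the parsing assumes the usual column layout with a UUID-shaped Host ID.

```shell
#!/bin/sh
# Print "address host-id" for every node that `nodetool status` reports as
# down (state "DN"). Reads the status output on stdin so it can be tried on
# a saved copy first.
down_host_ids() {
    awk '$1 == "DN" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+$/)
                print $2, $i
    }'
}

# Intended use on a live node (not run here):
#   nodetool status | down_host_ids
#   nodetool removenode <host-id>
```

The replacement node would then be bootstrapped with `num_tokens: 256` set in its cassandra.yaml.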

Alternatively, it *might* be possible to use a decommission for 1007-1009 at this point (taking into account deletions & compaction after hand-off similar to what we saw with 1008), but it will still be very tight. Somebody would need to commit to monitoring & intervening while nodes are low on space.

Cmjohnson closed subtask Unknown Object (Task) as Resolved. Jan 22 2016, 4:56 PM

agreed, the decommission seems tight but doable. also, free disk space on nodes at any given time can be hard to predict (details in T126221)

re: LVM, we're trying to equalize the cluster and switching to lvm would also imply abandoning raid0, IOW reimaging all machines that use raid0 now. I'm not convinced it is a worthwhile idea before having experimented with extending the raid0 with cassandra stopped.

other than that, data wipe + reimage + removenode + rebootstrap with 256 tokens seems like the easiest path forward to me

Change 269394 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: provision restbase1007 with new hw specs

https://gerrit.wikimedia.org/r/269394

Change 269394 merged by Filippo Giunchedi:
cassandra: provision restbase1007 with new hw specs

https://gerrit.wikimedia.org/r/269394

status update: restbase1007-a has been bootstrapped and the old node removed via nodetool removenode. note this has involved a range movement, so cleanups are running on restbase1001 / restbase1002. starting next week the plan is also to bootstrap another instance on restbase1007 to further lessen the load on the others

Mentioned in SAL [2016-02-15T14:24:29Z] <godog> start restbase1007-b cassandra instance, bootstrapping T119935

Mentioned in SAL [2016-02-17T09:55:03Z] <godog> depool restbase1008 for raid expansion T119935

I've run the mdadm expansion on restbase1008, this time with restbase and cassandra stopped and /srv unmounted. it is currently rebuilding and the ETA is 1400-1600 minutes, i.e. 23-25h

root@restbase1008:~# cat /proc/mdstat 
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] 
md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]
      
md2 : active raid4 sdc3[3] sda3[0] sdb3[1]
      1939599360 blocks super 1.2 level 4, 512k chunk, algorithm 5 [4/3] [UU__]
      [>....................]  reshape =  0.5% (5310464/969799680) finish=1593.7min speed=10086K/sec
      
md0 : active raid1 sda1[0] sdb1[1]
      29279232 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>

after bumping /sys/block/md2/md/stripe_cache_size to 32470 speed has increased to ~20MB/s

Personalities : [raid1] [raid0] [raid6] [raid5] [raid4]
md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]

md2 : active raid4 sdc3[3] sda3[0] sdb3[1]
      1939599360 blocks super 1.2 level 4, 512k chunk, algorithm 5 [4/3] [UU__]
      [===>.................]  reshape = 15.2% (148013920/969799680) finish=650.8min speed=21043K/sec

md0 : active raid1 sda1[0] sdb1[1]
      29279232 blocks super 1.2 [2/2] [UU]

unused devices: <none>
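
For reference, the stripe cache bump and the ETA reported by mdstat can be sketched as below. The sysfs path is the one named above; the function names and the optional path argument (there so the helper can be tried on a scratch file) are assumptions. Per the kernel md documentation the stripe cache costs roughly page size × number of disks × stripe_cache_size in memory, so larger values are not free.

```shell
#!/bin/sh
# Write a new stripe cache size for an md array. The third argument lets
# the target path be overridden for a trial run; it defaults to sysfs.
set_stripe_cache() {
    md="$1"
    size="$2"
    sysfs="${3:-/sys/block/$md/md/stripe_cache_size}"
    echo "$size" > "$sysfs"
}

# Estimate remaining reshape time in minutes from the three numbers mdstat
# shows: blocks done, blocks total (both in KiB) and speed in K/sec.
reshape_eta_min() {
    awk -v d="$1" -v t="$2" -v s="$3" \
        'BEGIN { printf "%.1f\n", (t - d) / s / 60 }'
}

# Example (as root):
#   set_stripe_cache md2 32470
#   reshape_eta_min 148013920 969799680 21043
```

With the figures from the second mdstat output this works out to about 651 minutes, in line with the reported finish=650.8min.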

plan:

  • as soon as the rebuild has finished (~6h) remount /srv and restart cassandra-a
  • widen hint window by a safe margin e.g. 30h
  • tomorrow utc morning I'll shut cassandra-a and continue the rebuild with remaining disks

plan:

  • as soon as the rebuild has finished (~6h) remount /srv and restart cassandra-a
  • widen hint window by a safe margin e.g. 30h

this requires a cassandra rolling restart afaik, but see also T108611: Perform initial (manual) repair of Cassandra cluster for ways we could tolerate better nodes going down for a "long" time

given that the hinted handoff window was exceeded for ~8h yesterday during the rebuild anyway, I've tried the reshape online with the remaining two ssds on restbase1008 and cassandra running. speed has been tuned down to 6MB/s to avoid big latency impact and hinted handoff
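
The comment above doesn't say how the 6MB/s cap was applied. One common way, shown here as an assumption rather than a record of what was run, is the per-array sync_speed_max knob in sysfs, which takes a value in KiB/s (the system-wide equivalent is /proc/sys/dev/raid/speed_limit_max). The function name and optional path argument are made up for illustration.

```shell
#!/bin/sh
# Cap md resync/reshape speed for one array. Takes MiB/s and writes KiB/s.
# The third argument overrides the target path for a trial run.
throttle_md_sync() {
    md="$1"
    mb_per_s="$2"
    sysfs="${3:-/sys/block/$md/md/sync_speed_max}"
    echo $(( mb_per_s * 1024 )) > "$sysfs"
}

# Example (as root): throttle_md_sync md2 6
```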

widen hint window by a safe margin e.g. 30h

With 2.1, this might push it a bit. There are optimizations around hints in 3.0 that make 24 hours or more possible, but the general consensus for 2.1 seems to be to use a window of less than 24 hours. Of course, hint volume heavily depends on the change volume, so there is a large component of "it depends". 12 hours worth of hints are *probably* fine.
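
In Cassandra 2.1 the hint window is the `max_hint_window_in_ms` setting in cassandra.yaml, so a window given in hours has to be converted to milliseconds. A small sketch of that conversion (the function name is hypothetical):

```shell
#!/bin/sh
# Print the cassandra.yaml line for a hint window given in whole hours.
hint_window_yaml() {
    hours="$1"
    echo "max_hint_window_in_ms: $(( hours * 3600 * 1000 ))"
}

# Example: hint_window_yaml 12
```

For the 12 hours suggested above this gives 43200000 ms.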

@Cmjohnson we'd still need to upgrade ram/cpu on restbase1008 / restbase1009, let's coordinate that for today

update: restbase100[789] have ram/cpu upgraded and in line with codfw. restbase1009 is growing its raid0 and restbase1008-b will bootstrap shortly

@fgiunchedi During the last expansion (1008), we saw elevated 99p latencies. https://grafana-admin.wikimedia.org/dashboard/db/restbase?from=1455606380680&to=1456244323031&panelId=11&fullscreen

Do you have any suggestions for mitigating this this time around? Is it perhaps possible to further throttle the rebuild? Ideally we'd see 99p latencies closer to 1s than 3s (or spikes of 6s).

@Eevans rebuild is currently throttled at 6MB/s max and I'm seeing around 500ms p99. will keep an eye on that, and if it increases we can throttle further too

Great; Thanks!

update: restbase100[789] have ram/cpu upgraded and in line with codfw. restbase1009 is growing its raid0 and restbase1008-b will bootstrap shortly

This is great; Thanks!

Mentioned in SAL [2016-02-25T04:44:10Z] <urandom> decommissioning Cassandra on restbase1008-a.eqiad.wmnet T119935

Mentioned in SAL [2016-02-25T20:46:58Z] <urandom> starting bootstrap of restbase1008-a T119935

restbase1009 raid grow has finished and the FS has been expanded too, this is complete. all hosts have 128G ram, 5x ssd and 2x processors.

RobH closed subtask Unknown Object (Task) as Resolved. Jul 11 2016, 5:31 PM