
reinstall rdb100[56] with RAID
Closed, Resolved · Public

Description

rdb100[56].eqiad.wmnet don't have any RAID, per the parent task.

What is needed to reinstall them? Would we use hardware or software RAID? Which RAID level makes sense?

Event Timeline

Dzahn removed Dzahn as the assignee of this task. Jul 20 2016, 11:47 PM
elukey added a subscriber: elukey. Oct 20 2016, 1:28 PM
Dzahn added a comment. Dec 22 2016, 8:33 PM

@elukey I see that rdb1005, for example, has sda2 entirely used for /tmp, so 20G of temp space. That reminded me of the two videoscalers we reinstalled recently. Is this the same partman recipe issue here, maybe?

> @elukey I see that rdb1005, for example, has sda2 entirely used for /tmp, so 20G of temp space. That reminded me of the two videoscalers we reinstalled recently. Is this the same partman recipe issue here, maybe?

Yes, exactly! In netboot I can see "rdb100[1-6]) echo partman/mw.cfg ;; \", which is exactly the same recipe that I had to modify (creating mw-no-tmp.cfg).
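
For context, that line lives in a shell case statement keyed on the hostname; a minimal sketch of the pattern (only the entries quoted in this task, not the full production file):

#!/bin/sh
# Minimal sketch of the netboot.cfg lookup: a case on the short hostname
# echoes the partman recipe path that debian-installer should preseed.
hostname="rdb1005"    # hypothetical example host

case "$hostname" in
    rdb100[1-6]) echo partman/mw.cfg ;;
    rdb100[7-8]) echo partman/raid1.cfg ;;
esac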

Dzahn added a comment. Dec 23 2016, 8:04 PM

Ah, ok! But do we have RAID with mw-no-tmp.cfg? Or should we use "raid1.cfg", since rdb100[7-8] use that, as opposed to rdb100[1-6] using mw.cfg?

Dzahn added a comment. Dec 23 2016, 9:48 PM

I looked up rdb1001 in racktables and, as Rob points out there, you can find the linked RT ticket, which shows what they were ordered as.

So I checked them all:

host      RT ticket
rdb1001   4281
rdb1002   4281
rdb1003   4712
rdb1004   4712
rdb1005   ?
rdb1006   ?
rdb1007   527
rdb1008   527

> Ah, ok! But do we have RAID with mw-no-tmp.cfg? Or should we use "raid1.cfg", since rdb100[7-8] use that, as opposed to rdb100[1-6] using mw.cfg?

Sorry I didn't get the question the first time :) mw116[89] have only one disk IIRC!

Dzahn added a comment. Dec 23 2016, 9:52 PM

4281 - " 6 high performance misc servers, with SSDs in addition to the normal disks."

4712 - "2 x 500GB HDD"

527 - "large scale purchase for eqiad buildout"

Hmm, so we're missing the RT ticket for 1005/1006, which this task is about.

I installed "lshw" on rdb1005/1006. It shows we have an unused "sdb" with an NTFS partition, on both hosts. So yes, two 500GB HDDs each. rdb1001 is the same, but there both drives have Linux partitions. This is hardware RAID; all of them use the same partman recipe... hmm.
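
For reference, the kind of inspection described above can be reproduced with a few standard commands (run as root on the host; output obviously differs per machine):

# Rough sketch of the disk inspection described above.
lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT   # disks, partitions, filesystems, mounts
lshw -class disk -short                # short summary of the physical disks
fdisk -l /dev/sdb                      # partition table on the unused second disk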

RobH added a subscriber: RobH. Edited Apr 17 2017, 9:09 PM

I'd suggest the following:

Right now netboot has the following:

rdb100[1-6]) echo partman/mw.cfg ;; \  
rdb100[7-8]) echo partman/raid1.cfg ;; \

So it seems some of these hosts have the H310 hw raid controller, and someone used them as hw raid. This is not recommended, since it's a lackluster hw raid controller; it should be bypassed in favor of sw raid.

I'd suggest putting all the disks in the hosts with the H310 into JBOD mode and then using the sw raid recipe raid1-lvm-ext4-srv-noswap.cfg.

In the scope of this particular task (rdb100[56]), using mw.cfg would result in no raid, so don't use that. raid1.cfg sets up raid, but it's outdated; we don't use xfs or swap. So I'd suggest raid1-lvm-ext4-srv-noswap.cfg: it sets up the dual-disk hosts properly as sw raid1, with a small / partition and 80% of the remainder as an ext4 /srv inside an LVM that can grow into the rest of the disk if needed.
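
As a side note, after reinstalling with a sw raid1 + LVM recipe along those lines, the resulting layout can be sanity-checked with standard mdadm/LVM tooling (the device and volume group names below are assumptions; they depend on the recipe):

# Post-install sanity checks for a sw raid1 + LVM layout (generic commands).
cat /proc/mdstat            # the md arrays should show raid1 across both disks
mdadm --detail /dev/md0     # health/state of one array (name is an assumption)
pvs; vgs; lvs               # the LVM volumes backing /srv
df -h / /srv                # small /, large ext4 /srv
# If /srv later needs the unallocated ~20% of the volume group:
# lvextend -r -l +100%FREE /dev/<vg>/srv    # <vg> is a placeholder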

Change 348666 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] netboot: fix/adjust partman config for rdb servers

https://gerrit.wikimedia.org/r/348666

Dzahn added a comment. Apr 18 2017, 2:09 AM

> I'd suggest raid1-lvm-ext4-srv-noswap.cfg

Thanks @RobH! I also found that rdb200[1-6] use raid1-lvm-ext4-srv.cfg, but they appear under a different regex in netboot.cfg.

Moving that all into the same place would get us: https://gerrit.wikimedia.org/r/#/c/348666/1/modules/install_server/files/autoinstall/netboot.cfg

So that would be the same as you suggested, except for swap/noswap.

Change 348666 merged by Dzahn:
[operations/puppet@production] netboot: fix/adjust partman config for rdb servers

https://gerrit.wikimedia.org/r/348666

Dzahn added a comment. Apr 18 2017, 8:20 PM

@elukey any hints on what is needed to reinstall one of these? Any actions needed before it's ok to take one down? Of course this would be AFTER the DC switchover :)

Theoretically, once we have switched over, these job queue hosts will only be replicas of codfw, so it should be perfectly fine to just re-image them one at a time. Redis on these hosts will ask for a full sync when they come back up anyway, so a backup is probably not even needed (a quick replication check is sketched after this comment).

I am now a bit confused about how our rdb hosts are configured though:

rdb100[1-4]) echo partman/mw.cfg ;; \
rdb100[5-6]) echo partman/raid1-lvm-ext4-srv-noswap.cfg ;; \
rdb100[7-8]) echo partman/raid1.cfg ;; \
rdb200[1-6]) echo partman/raid1-lvm-ext4-srv.cfg ;; \

There is a big mix of configurations in here :)

Are we planning to have a standard config for all of them too?
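
Before taking one of these down, a quick way to confirm it really is only acting as a replica would be something like the following (generic redis-cli usage; the port and the lack of auth flags are deployment-specific assumptions):

# Pre-reimage check that a host is only a replica (generic redis-cli usage;
# port 6379 and no auth are assumptions about the local setup).
redis-cli -h rdb1005.eqiad.wmnet -p 6379 info replication
# Expect "role:slave" with the codfw master listed as "master_host" while
# codfw is active; "connected_slaves:0" means nothing depends on it downstream.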

RobH added a comment. Aug 11 2017, 9:06 PM

I'd think we want to push more of them to raid1-lvm-ext4-srv-noswap.cfg.

The only difference between that and raid1-lvm-ext4-srv.cfg is the use of a swap file. I'd suggest we eliminate swap, since we typically don't use it across most of the fleet (imu).

RobH added a comment. Aug 11 2017, 9:06 PM

If we are fine with that, I'm happy to re-image these hosts!

rdb1005 is a JobQueue Redis master (1006 is its local-dc slave), so it would be painful to take it out of production to reimage it (an mw-config change to move its traffic away, a Redis backup, etc.).

These hosts are going to be decommissioned as soon as the new JobQueue service is implemented by the Services team (they have already started the work), so I'd be in favor of not proceeding any further with this task, but I'd like to hear others' opinions as well :)
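
For completeness, if a Redis backup were needed before reimaging a master, the generic approach with redis-cli would look roughly like this (the port and dump path below are assumptions, not the actual configuration of these hosts):

# Rough sketch of a manual Redis backup before a reimage (generic commands;
# the instance port and dump path are assumptions, not the real config here).
redis-cli -p 6379 config get dir    # where the RDB dump is written
redis-cli -p 6379 bgsave            # trigger a background dump
redis-cli -p 6379 lastsave          # confirm the dump timestamp advanced
cp /srv/redis/dump.rdb /root/redis-backup-$(date +%F).rdb   # hypothetical path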

Dzahn added a comment. May 7 2018, 2:15 PM

This is now the only subtask that keeps T136562 open. @elukey Did anything change about the status of it since the last comment?

elukey added a comment. May 7 2018, 5:11 PM

A ton of jobs have been moved to Kafka, but there's still work to be done. In theory, once T190327 is finished, we could easily do the work :)

I see that T190327 is closed meanwhile. Did it actually become easy now? :)

Yes, definitely. Now only ChangeProp uses Redis, and it should be easy enough to reimage rdb100[56]. I added an overview of how our Redis cluster will look in T196685#4267110 (from what I gathered).

IIRC only rdb100[12] are now actively used (by ChangeProp to store counters) so in theory rdb100[56] could be reimaged without too many problems. Let's have a chat with @Joe about this :)

Dzahn added a comment. Aug 24 2018, 5:01 PM

Thanks! Could we continue on the ticket? That way it doesn't have to be in real time in the same timezone, which isn't that easy to organize for me.

Dzahn added a comment. Sep 10 2018, 8:15 PM

@Joe can rdb1005, rdb1006 be reimaged without too many problems?

faidon added a comment. Oct 3 2018, 8:17 AM

Had a chat with @Joe, apparently rdb1005/6 are currently unused and can be reimaged at any point in time. There are some longer-term goals here (rebuilding these with stretch, a newer version of Redis, which requires a Puppet module etc.) that could be coupled together with the reimage. He'll follow up with more soon.

As far as this task goes, I'd recommend fixing the partman recipe on Puppet (so that the next install gets it right) and resolving it.

Dzahn claimed this task. Oct 3 2018, 2:26 PM

Cool! Thanks. I will do that.

Dzahn added a comment. Oct 5 2018, 10:00 PM

re: partman recipes

rdb100[1-4]) echo partman/mw.cfg ;; \
rdb100[5-6]) echo partman/raid1-lvm-ext4-srv-noswap.cfg ;; \
rdb100[7-9]|rdb1010) echo partman/raid1-lvm-ext4-srv.cfg ;; \
rdb200[1-6]) echo partman/raid1-lvm-ext4-srv.cfg ;; \

1001-1004 have an LSI RAID controller. The others don't. (lspci | grep RAID, via cumin)

1005/1006, which this task is about, already have a raid1-lvm recipe, so that is effectively already fixed for the next install.

That being said, I will unify them so that none of them use the "noswap" one.
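
As a side note, the controller check mentioned above can be run across all the hosts in one go with cumin; roughly like this (the host-selection syntax is an assumption):

# The "lspci | grep RAID via cumin" check from above, roughly sketched
# (exact cumin query syntax is an assumption).
sudo cumin 'rdb100[1-8].eqiad.wmnet' 'lspci | grep RAID'
# Hosts with the LSI controller print a RAID controller line; the others
# return nothing (a non-zero exit from grep).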

Change 464919 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] partman: let rdb1005/1006 also use recipe with swap

https://gerrit.wikimedia.org/r/464919

Change 464919 merged by Dzahn:
[operations/puppet@production] partman: let rdb1005/1006 also use recipe with swap

https://gerrit.wikimedia.org/r/464919

Dzahn closed this task as Resolved. Oct 5 2018, 10:13 PM

> As far as this task goes, I'd recommend fixing the partman recipe on Puppet (so that the next install gets it right) and resolving it.

Done!