Replace labstore100[67] with clouddumps100[12]
Closed, Resolved (Public)

Description

The dumps workloads on labstore1006 and labstore1007 can be moved to the new clouddumps hosts. The new hosts have lots of built-in storage, so this move will let us decommission the old servers and the disk shelves used by labstore1006/1007.

Event Timeline

Andrew edited subscribers, added: ArielGlenn; removed: Jclark-ctr, cmooney, Volans and 3 others.

@ArielGlenn these two new servers should be ready; I'm hoping that you have the time to move the data and workload over.

Currently each has a gigantic sdb1 lvm partition that isn't mounted anywhere. If you want me to mess with partman more and get these set up some different way I'm happy to give it a go, but I'm pretty sure that moving ahead with puppetized or by-hand lvm commands is the right next step.

Not sure who should get this next, but it's not Hannah or me :-) I was never involved in the configuration and setup of the old labstore boxes, so I wouldn't have any idea how they need to be set up. I guess that was probably all Brooke's work.

When the servers are ready for rsync, you'll want to grab the data off the old boxes, then let me know and we can change a couple of names and sync to them instead of the current labstore1006 and labstore1007. It's just flipping a few variable names in puppet. But if we want to minimize extra syncing, the main rsync from the labstores should be run a few times for the bulk of the data and one final time on, say, the 17th of the month, when the sql/xml dumps are complete; then I can swap our end over to point to the new hosts with almost no additional copying needed.
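
For illustration only, the "several bulk passes, then one final pass" approach could be planned roughly as below. This is a sketch under assumptions, not the commands actually used: the host name, the path, the number of bulk passes, and the plain `host:path` rsync-over-ssh form are all hypothetical (the real setup may use rsync daemon modules and different options). Only the standard rsync flags `-a` and `--dry-run` are used.

```python
import shlex

# Hypothetical values: the thread does not give real FQDNs or paths.
OLD_HOST = "labstore1006.example.org"
DUMPS_PATH = "/srv/dumps/"


def rsync_pull(old_host: str, path: str, dry_run: bool = False) -> list:
    """Return the argv for pulling `path` from `old_host` into the same local path."""
    argv = ["rsync", "-a"]
    if dry_run:
        # --dry-run reports what would be copied without transferring anything.
        argv.append("--dry-run")
    argv += [f"{old_host}:{path}", path]
    return argv


def migration_plan(old_host: str, path: str, bulk_passes: int = 3) -> list:
    """Bulk passes move most of the data; the final pass only copies the delta."""
    plan = [("bulk", rsync_pull(old_host, path)) for _ in range(bulk_passes)]
    # The final pass is the same command, run once the monthly dumps are complete.
    plan.append(("final", rsync_pull(old_host, path)))
    return plan


if __name__ == "__main__":
    # Print the plan instead of executing it, so it can be reviewed first.
    for label, argv in migration_plan(OLD_HOST, DUMPS_PATH):
        print(label, shlex.join(argv))
```

Because every pass is the same command, the final run is cheap: rsync skips files that already match on the destination, which is what keeps the extra copying at switchover small.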

For the fetches from stats hosts and such, I guess those are separate settings in puppet that can just be swapped as well on your end, once the main rsyncs have gotten done.

Note that the open task about putting dumps.wikimedia.org behind our CDN (T306550) is related to this task.

This task does not require the DC-OPs tag. Once you have moved the data, please decommission the labstores and create a task for DC-OPs.

Change 802600 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] put clouddumps100[12] into service

https://gerrit.wikimedia.org/r/802600

Andrew added a subscriber: Cmjohnson.

I think this should be assigned to me, to put the new hosts into service. That's currently blocked by a lot of partman nonsense that Papaul is helping me with; that's documented on T302981.

Change 802600 merged by Andrew Bogott:

[operations/puppet@production] put clouddumps100[12] into service

https://gerrit.wikimedia.org/r/802600

Change 824513 had a related patch set uploaded (by Btullis; author: Btullis):

[labs/private@master] Add dummy keytabs for new clouddumps100* servers

https://gerrit.wikimedia.org/r/824513

Change 824513 merged by Btullis:

[labs/private@master] Add dummy keytabs for new clouddumps100* servers

https://gerrit.wikimedia.org/r/824513

Change 828102 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Dumps: switch to using clouddumps hosts rather than the old labstores.

https://gerrit.wikimedia.org/r/828102

Change 828103 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Dumps: stop mounting the old labstore100x servers on VMs

https://gerrit.wikimedia.org/r/828103

Change 830200 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/dns@master] Move dumps from labstore1006 to clouddumps1001

https://gerrit.wikimedia.org/r/830200

Change 828102 merged by Andrew Bogott:

[operations/puppet@production] Dumps: switch to using clouddumps hosts rather than the old labstores.

https://gerrit.wikimedia.org/r/828102

Change 830200 merged by Andrew Bogott:

[operations/dns@master] Move dumps from labstore1006 to clouddumps1001

https://gerrit.wikimedia.org/r/830200

Change 830246 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] openstack network tests: switch to check clouddumps1001 mounts

https://gerrit.wikimedia.org/r/830246

Change 830246 merged by Andrew Bogott:

[operations/puppet@production] openstack network tests: switch to check clouddumps1001 mounts

https://gerrit.wikimedia.org/r/830246

Change 835192 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Dumps: switch to using clouddumps hosts rather than the old labstores.

https://gerrit.wikimedia.org/r/835192

Change 835192 merged by Andrew Bogott:

[operations/puppet@production] Dumps: switch to using clouddumps hosts rather than the old labstores.

https://gerrit.wikimedia.org/r/835192

Change 828103 merged by Andrew Bogott:

[operations/puppet@production] Dumps: stop mounting the old labstore100x servers on VMs

https://gerrit.wikimedia.org/r/828103

This is now done. I'm going to gradually dismantle the old dumps servers but will probably leave their data intact for a few weeks before decom.

Thanks for all your work on this @Andrew.

I'm going to do a fleet-wide check to see if anything still references the labstore servers in /etc/fstab or if they're still mounted anywhere. I have a feeling that one or two servers might still have the mount active.
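
As a rough sketch of what such a check might look for on a single host, the function below scans fstab-style text for entries that still mention the old servers. It is an illustration under assumptions, not the check that was actually run: the function name and the sample host names and mount points are hypothetical, and running it across the fleet is left to whatever orchestration tool is in use.

```python
import re

# Matches the two old hosts named in this task, with or without a domain suffix.
LABSTORE = re.compile(r"\blabstore100[67]\b")


def labstore_entries(mount_table_text: str) -> list:
    """Return non-comment lines that still reference labstore1006 or labstore1007."""
    hits = []
    for line in mount_table_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and commented-out entries.
        if not stripped or stripped.startswith("#"):
            continue
        if LABSTORE.search(stripped):
            hits.append(stripped)
    return hits


if __name__ == "__main__":
    # /etc/fstab shows configured mounts; /proc/mounts shows what is mounted now.
    for table in ("/etc/fstab", "/proc/mounts"):
        with open(table) as handle:
            for entry in labstore_entries(handle.read()):
                print(f"{table}: {entry}")
```

Checking both files matters because a mount can stay active after its fstab entry has been removed, which is the case the comment above is worried about.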

Also you might like to update this section when convenient: https://wikitech.wikimedia.org/wiki/Dumps/Dump_servers#Hardware

Note that labstore1006 has some html dumps that didn't make it around to the other boxes, so please don't reimage until I get that sorted. Thanks!

Given the new machines' much larger capacity, I believe any pending requests for more space can now be reconsidered. @ArielGlenn do you know of any pending needs? Shall we discuss any potential use cases for the extra space on a separate ticket?

There's the Kiwix request to mirror more files; see T57503 which I believe you're aware of. I'm not sure of anything else at the moment.

The directory /srv/deployment/analytics had incorrect ownership on the new hosts, so our deployment failed.
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Deploy/Refinery#Deploying_to_clouddumps*_hosts

I did the following manual fix on clouddumps100[1-2]:

btullis@clouddumps1001:/srv/deployment/analytics$ ls -ld
drwxr-xr-x 3 root root 4096 Aug 23 01:47 .
btullis@clouddumps1001:/srv/deployment/analytics$ sudo chown analytics-deploy .
btullis@clouddumps1001:/srv/deployment/analytics$ ls -ld
drwxr-xr-x 3 analytics-deploy root 4096 Aug 23 01:47 .
btullis@clouddumps1001:/srv/deployment/analytics$

I will follow up with a patch to configure this in puppet.

Change 849192 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Dumps: remove a bunch of references to labstore1006 and labstore1007

https://gerrit.wikimedia.org/r/849192

Change 849193 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] rsync-via-primary.sh: replace labstore with clouddumps

https://gerrit.wikimedia.org/r/849193

Change 849193 merged by Andrew Bogott:

[operations/puppet@production] rsync-via-primary.sh: replace labstore with clouddumps

https://gerrit.wikimedia.org/r/849193

Change 849192 merged by Andrew Bogott:

[operations/puppet@production] Dumps: remove a bunch of references to labstore1006 and labstore1007

https://gerrit.wikimedia.org/r/849192