Goal:
- Rack and provision new Dumps servers
- Migrate NFS for VPS & Toolforge
- Migrate rsync
- Migrate dumps.wikimedia.org
- Migrate NFS for Analytics
Initial reading
- Planning task and brainstorm - https://phabricator.wikimedia.org/T118154#3133894
- Hardware specs for labstore1006/7 - https://phabricator.wikimedia.org/T161311
- Rack/setup/install task for labstore1006/7 - https://phabricator.wikimedia.org/T167984
Current status summary: (some of this text is borrowed from the phab tasks listed above)
- Dataset1001 is the canonical dumps storage server and handles all the client use cases. The Snapshot* hosts handle the actual dumps production.
- Dataset1001 and the Snapshot* hosts share data with each other through NFS mounts.
- There are 4 different client use cases:
  - NFS mounts for Cloud VPS projects
    - Dataset1001 mounts an NFS share from labstore1003 and copies the relevant files onto it for Cloud VPS consumption
  - NFS mounts for stat boxes for wikistats data generation
    - Analytics stat boxes mount a share on dataset1001 directly for consumption
  - Web service for public download access to dumps
    - Serves the copy on dataset1001 directly
  - Rsync mirrors
    - Sync from the copy on dataset1001 directly
- Dataset1001 is currently in the public VLAN and uses about 50TB of storage
- ms1001 is the backup server for dataset1001
- labstore1003 is a single point of failure (SPOF)
Where we're going (focusing a bit more on the Cloud Services team's work for upcoming quarters)
- Separate the dumps generation layer from the dumps serving layer
- Have one cluster (internal VLAN) - Dumpsdata* and Snapshot* - that handles all dumps generation, and one cluster (public VLAN) - Labstore1006 and 1007 - that handles all the client-serving use cases (Analytics, VPS, web, rsync mirrors)
- Labstore1006 and 1007 each have 72TB of storage available post-RAID 10 (36TB on internal drives and 36TB on external shelves)
- Labstore1006 and 1007 will get their data through a periodic rsync from the canonical dumps server (dataset/dumpsdata); see the rsync sketch after this list
- Labstore1006 and 1007 will be set up completely independently, each with its own copy of the data from the pristine source, and the client use cases will be sharded between them
- We are hoping to serve the Analytics NFS and VPS NFS use cases from one server, and the web and rsync mirror use cases from the other
- Labstore1006 and 1007 should be able to fail over client services between each other (i.e., provide all client use cases from a single server during maintenance if necessary)
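A minimal sketch of what the periodic rsync job could look like. The host roles come from the plan above, but the paths, rsync module, flags, and bandwidth cap are assumptions; the real job will be puppetized under T171541:

    #!/usr/bin/env python3
    # Hedged sketch of the periodic dumps rsync (T171541). Paths, module name,
    # and the bandwidth cap are assumptions, not the final config.
    import subprocess
    import sys

    SOURCE = "dumpsdata1001.eqiad.wmnet::data/xmldatadumps/public/"  # hypothetical module/path
    DEST = "/srv/dumps/xmldatadumps/public/"                         # hypothetical local path

    def sync_dumps():
        """Pull the latest dumps from the canonical server onto this labstore host."""
        cmd = [
            "rsync",
            "-a",               # archive mode: preserve perms, times, symlinks
            "--delete",         # mirror deletions so stale files don't accumulate
            "--bwlimit=80000",  # assumed cap (KB/s) to avoid saturating the link
            SOURCE,
            DEST,
        ]
        return subprocess.run(cmd).returncode

    if __name__ == "__main__":
        sys.exit(sync_dumps())

Each of labstore1006/7 would run this independently (e.g., from cron or a systemd timer), matching the "completely independent copies" design above.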
Task breakdown
- Finish rack/setup/install for labstore1006/7 - T167984
- Puppetize and set up the initial LVM volumes and directory structures - T171539
- Set up periodic rsync jobs from dataset1001/dumpsdata1001/2(?) to labstore1006 and 1007 - T171541
- Set up the NFS kernel server to serve dumps to VPS instances and stat boxes - T181431 (see the exports sketch after this list)
- Investigate alternatives to the showmount check at instance boot time, so that only TCP port 2049 needs to be opened from labstore1006/7 - T171508 (see the port-check sketch after this list)
- Figure out how NFS failover will work (cluster IP failover will not work in the absence of matching underlying DRBD volumes) - T171540
- Test mounting the shares on instances (see the mount-test sketch after this list)
- Test mounting the shares on stat boxes
- Investigate and setup the web service component that serves dumps to users - T188641
- Investigate the rsync mirrors setup - T188642
- Migrate Dumps WMCS NFS users from labstore1003 to labstore1006/7 - T188643
- Migrate the stat* mount from dataset1001 to labstore1006/7 (coordinate with Analytics/Ezachte) - T188644
- Get all the rsync mirror sites to switch over to labstore1006/7 - T188645
- Point dumps.wikimedia.org to new servers - T188646
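For the NFS kernel server task (T181431), the actual exports will come from puppet; purely for orientation, a hedged sketch of applying an export for instances and stat boxes, where the export path and client network range are assumptions:

    #!/usr/bin/env python3
    # Illustrative sketch only (T181431): the export path and client network
    # below are assumptions; the real values will be puppetized.
    import subprocess

    EXPORTS_LINE = "/srv/dumps 172.16.0.0/21(ro,sync,no_subtree_check)\n"  # hypothetical

    def apply_export():
        # Drop-in exports fragment; exportfs also reads /etc/exports.d/*.exports
        with open("/etc/exports.d/dumps.exports", "w") as f:
            f.write(EXPORTS_LINE)
        # Re-export all directories without restarting the NFS server
        subprocess.run(["exportfs", "-ra"], check=True)

    if __name__ == "__main__":
        apply_export()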
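For T171508, one candidate alternative to showmount (which needs rpcbind on port 111 and the mountd ports reachable) is a plain TCP probe against port 2049, the only port we want to open from labstore1006/7. A minimal sketch; the FQDN is an assumed example:

    #!/usr/bin/env python3
    # Sketch of a boot-time NFS reachability check (T171508). Unlike showmount,
    # this needs only TCP 2049 open. The host name is an assumed example.
    import socket

    def nfs_port_open(host, port=2049, timeout=5):
        """Return True if the NFS server accepts TCP connections on `port`."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        host = "labstore1006.wikimedia.org"  # assumed FQDN for illustration
        print(f"{host}:2049 reachable: {nfs_port_open(host)}")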
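For the mount tests on instances and stat boxes, a hedged sketch; the export path, mountpoint, and mount options are assumptions:

    #!/usr/bin/env python3
    # Sketch of a share mount test for instances/stat boxes. Export path and
    # mount options below are assumptions for illustration.
    import os
    import subprocess

    SHARE = "labstore1006.wikimedia.org:/dumps"  # hypothetical export
    MOUNTPOINT = "/mnt/dumps-test"

    def test_mount():
        os.makedirs(MOUNTPOINT, exist_ok=True)
        # Read-only, soft mount so a dead server fails fast instead of hanging
        subprocess.run(
            ["mount", "-t", "nfs4", "-o", "ro,soft,timeo=50", SHARE, MOUNTPOINT],
            check=True,
        )
        try:
            entries = os.listdir(MOUNTPOINT)
            print(f"OK: {len(entries)} top-level entries visible")
        finally:
            subprocess.run(["umount", MOUNTPOINT], check=True)

    if __name__ == "__main__":
        test_mount()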
Open questions
- How are the rsync mirrors set up?
- What are the QoS mechanisms for web and rsync mirrors?
- How are outages currently handled?