Migrate customer-facing Dumps endpoints to Cloud Services
Closed, Resolved, Public

Description

Goal:

  • Rack and provision new Dumps servers
  • Migrate NFS for VPS & Toolforge
  • Migrate rsync
  • Migrate dumps.wikimedia.org
  • Migrate NFS for Analytics

Initial reading

Current status summary: (some of this text is borrowed from the phab tasks listed above)

  • Dataset1001 is the canonical dump storage server, and handles all the client use cases. Snapshot* hosts handle actual dumps production.
  • Dataset1001 and the snapshot* hosts share data between each other through nfs mounts.
  • There are 4 different client use cases
    • NFS mounts for Cloud VPS projects
      • Dataset1001 has mounted an NFS share from labstore1003 and copies the relevant files to it for Labs consumption
    • NFS mounts for stat boxes for wikistats data generation
      • Analytics directly mounts a share on dataset1001 for consumption
    • Web service for public download access to dumps
      • Uses copy on dataset1001 directly
    • Rsync mirrors
      • Use copy on dataset1001 directly
  • Dataset1001 is currently in the public vlan, and uses about 50T of storage
  • ms1001 is the backup server for dataset1001
  • labstore1003 is a SPOF

Where we're going (focusing a bit more on the Cloud Services team's work for upcoming quarters)

  • Separate the dumps generation layer from the dumps serving layer
  • Have one cluster (internal vlan) - Dumpsdata* and Snapshot* - that handles all the generation, one cluster (public vlan) - Labstore1006 and 1007 - that handles all the client serving use cases (Analytics, VPS, Web, Rsync mirrors)
  • Labstore1006 and 7 each have 72TB (36TB in internal drives, and 36TB in external shelves) storage available post RAID 10
  • Labstore1006 and 7 will get their data through a periodic rsync from the canonical dumps server (dataset/dumpsdata)
  • Labstore1006 and 7 will be set up completely independently, with copies of data from pristine source, and the client use cases will be sharded between them
  • We are hoping to put the Analytics NFS mounts and VPS NFS mounts use cases on one server, and web and rsync mirrors on the other
  • Labstore1006 and 7 should be able to fail over client services between each other (that is, serve all client use cases from a single server during maintenance if necessary)
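
As a rough check on the storage figures above, the sketch below adds up the per-server capacity and compares it with the roughly 50T currently used on dataset1001. It assumes each labstore holds a full copy of the data (the independent-setup and failover bullets suggest this but do not state it outright) and treats "T" and "TB" as the same unit. The function and constant names are illustrative only.

```python
# Figures quoted in the task description; names are illustrative assumptions.
INTERNAL_TB = 36        # usable internal storage per server, post RAID 10
SHELF_TB = 36           # usable external-shelf storage per server, post RAID 10
CURRENT_USAGE_TB = 50   # approximate usage on dataset1001


def usable_tb(internal_tb: int, shelf_tb: int) -> int:
    """Total usable storage on one server."""
    return internal_tb + shelf_tb


def headroom_tb(usable: int, used: int) -> int:
    """Space left on one server after holding a full copy of the data."""
    return usable - used


per_server = usable_tb(INTERNAL_TB, SHELF_TB)
spare = headroom_tb(per_server, CURRENT_USAGE_TB)
print(f"usable per server: {per_server}TB, headroom: {spare}TB")
```

Under these assumptions a single server can hold the whole dataset with some room to grow, which is what serving every client use case from one box during maintenance would require.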

Task breakdown

  • Finish rack/setup/install for labstore1006/7 - T167984
  • Puppetize and set up initial lvms and directory structures - T171539
  • Set up periodic rsync jobs from dataset1001/dumpsdata1001/2(?) to labstore1006 and 1007 - T171541
  • Set up NFS kernel server to serve dumps to VPS instances and stat boxes - T181431
    • Investigate alternatives for showmount check at instance boot time, so we can only open up port 2049 TCP from labstore1006/7 - T171508
    • Figure out how NFS failovers will work (cluster IP failover will not work in the absence of matching underlying drbd volumes) - T171540
    • Test mounting the shares on instances
    • Test mounting the shares on stat boxes
  • Investigate and set up the web service component that serves dumps to users - T188641
  • Investigate the rsync mirrors setup - T188642
  • Migrate Dumps WMCS NFS users from labstore1003 to labstore1006/7 - T188643
  • Migrate the stat* mount from dataset1001 to labstore1006/7 (coordinate with Analytics/Ezachte) - T188644
  • Get all the rsync mirror sites to switch over to labstore1006/7 - T188645
  • Point dumps.wikimedia.org to new servers - T188646
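
One possible alternative to the showmount check mentioned under T171508 is a plain TCP connection test against port 2049, since showmount has to talk to rpcbind and mountd on other ports. The sketch below is only an illustration of that idea, not the approach the task settled on; the function name and the default timeout are assumptions.

```python
import socket

NFS_PORT = 2049  # the only port the task wants open from labstore1006/7


def nfs_port_reachable(host: str, port: int = NFS_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake;
        # the socket is closed again as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A boot-time script could call `nfs_port_reachable("labstore1006.wikimedia.org")` and skip the mount if it returns False. Note that a successful connect only shows the port is open; it does not confirm that a particular share is exported.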

Open questions

  • How are the rsync mirrors set up?
  • What are the QoS mechanisms for web and rsync mirrors?
  • How are outages currently handled?

Related Objects

Status  Subtype  Assigned  Task
Resolved  bd808
Resolved  ArielGlenn
Resolved  madhuvishy
Resolved  madhuvishy
Resolved  madhuvishy
Resolved  madhuvishy
Resolved  ArielGlenn
Resolved  ArielGlenn
Resolved  ezachte
Resolved  ArielGlenn
Resolved  madhuvishy
Resolved  madhuvishy
Resolved  madhuvishy
Resolved  madhuvishy
Resolved  madhuvishy
Resolved  madhuvishy
Resolved  madhuvishy
Resolved  madhuvishy
Resolved  madhuvishy
Resolved  madhuvishy
Resolved  madhuvishy

Event Timeline

bd808 added a subscriber: madhuvishy.

Assigning to @madhuvishy as the tech lead for this initiative. She will be responsible for creating a plan for this work and helping me report on it as the quarter progresses.

Notes from hangout chat with Ariel:

  • We should keep the relative file structure for dumps the same so we don't break links for web/rsync downloaders
  • Maintenance windows are currently handled by announcing a downtime and just taking the server down. The dumps jobs email the dumps alias, which only Ariel is currently on; sometimes users alert us to missing data even before the errors show up. The website is monitored by icinga.
  • There is ongoing work to make the dumps rsyncs incremental so they can run safely at short intervals, every 10 minutes or so. Once we get there, we can alternatively pull from the dumps servers to labstore1006 and 1007.
  • The monitor script will run on the dumpsdata boxes and generate the dumps index pages, which will be rsync-ed to labstores along with all the other data.
  • We have a small number of rsync mirrors that pull data from us every day. We can do some work to stagger them or schedule them so they don't all run at the same time if needed.
  • We keep about 6-8 historical copies of dumps. latest/ is a symlink to the latest available complete dump.
  • QoS considerations for web - currently extremely limited bandwidth, disk IO is often tight, and there is not enough memory. We might be able to lift the bandwidth caps with the new servers, but we should consider keeping the limit on the number of connections per IP/UA.
  • The dumps and dataset modules in puppet have all the relevant infrastructure code.
  • Dumps run twice a month, 8-9 days per run. That gives about 1-2 days in between for maintenance and migration.
  • Dumpsdata servers will push data via rsync periodically to labstores.
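
The note about staggering the daily mirror pulls could be approached along these lines. This is a minimal sketch that spreads start times evenly across a window; the mirror names and the 24-hour window are made up for illustration, and it says nothing about how the mirrors are actually scheduled today.

```python
from datetime import timedelta
from typing import Dict, Sequence


def staggered_offsets(mirrors: Sequence[str], window: timedelta) -> Dict[str, timedelta]:
    """Assign each mirror a start offset, evenly spaced across the window."""
    if not mirrors:
        return {}
    step = window / len(mirrors)
    # Sort the names so the assignment is stable from one run to the next.
    return {name: step * i for i, name in enumerate(sorted(mirrors))}


# Hypothetical mirror names, used only to show the shape of the result.
schedule = staggered_offsets(["mirror-b", "mirror-a", "mirror-c"], timedelta(hours=24))
for name, offset in schedule.items():
    print(name, offset)
```

Even spacing is the simplest policy. If some mirrors pull far more data than others, weighting the gaps by transfer size would be a natural refinement.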

A small correction to the above: there are indeed two dump runs per month, but the first run, starting on the 1st of the month, produces all historical revision content and takes 17-18 days. The second run, starting on the 20th of the month, skips this step, producing only current revision content, and takes 8-9 days.
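
To see what the corrected schedule leaves for maintenance, the sketch below counts the idle days after each run in a given month. It assumes a run that starts on day D and takes N days occupies days D through D+N-1, and it defaults to the longer durations quoted above (18 and 9 days); both the function name and that counting convention are assumptions.

```python
from datetime import date, timedelta


def idle_days(year: int, month: int, full_run_days: int = 18, partial_run_days: int = 9):
    """Return (idle days after the full run, idle days after the partial run)."""
    full_start = date(year, month, 1)       # full run starts on the 1st
    partial_start = date(year, month, 20)   # partial run starts on the 20th
    full_end = full_start + timedelta(days=full_run_days - 1)
    partial_end = partial_start + timedelta(days=partial_run_days - 1)
    # The next full run starts on the 1st of the following month.
    next_full_start = date(year + (month == 12), month % 12 + 1, 1)
    gap_after_full = (partial_start - full_end).days - 1
    gap_after_partial = (next_full_start - partial_end).days - 1
    return gap_after_full, gap_after_partial


print(idle_days(2018, 4))
print(idle_days(2018, 2))
```

With these defaults the gap before the run on the 20th is a single day, and the gap at the end of the month depends on the month's length, so a short month such as February can leave no idle day at all.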

Small note: new ports have become accessible from instances. This is intended, but it was noticed :)

OPEN 208.80.154.7 111 tcp 0 6 labstore1006.wikimedia.org
OPEN 208.80.155.106 111 tcp 0 6 labstore1007.wikimedia.org
bd808 renamed this task from Begin migrating customer-facing Dumps endpoints to Cloud Services to Migrate customer-facing Dumps endpoints to Cloud Services. Dec 18 2017, 3:59 AM
bd808 updated the task description.
bd808 moved this task from Q2 to Q3 on the cloud-services-team (FY2017-18) board.

Change 416502 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] dumps: Refactor profiles and hierakeys in web/

https://gerrit.wikimedia.org/r/416502

Change 416502 merged by ArielGlenn:
[operations/puppet@production] dumps: Refactor profiles and hierakeys in web/

https://gerrit.wikimedia.org/r/416502

Draft timeline for migration:

March 28 - Ensure all data fetches and syncs for labstore1006|7 are in place - T188726
Apr 2 - Starting 14:30 UTC - NFS clients migration to new servers - T188643 and T188644
April 4 - Starting 14:30 UTC - Web service & rsync mirrors migration to new servers - T188645 and T188646

Each window on April 2 & 4 will last at least 2 hours if things go exceedingly well, and possibly up to several hours.

madhuvishy closed subtask Restricted Task as Resolved. Mar 16 2018, 4:34 PM

Change 423040 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] dumps: Add nfs, load, network monitoring for dist servers

https://gerrit.wikimedia.org/r/423040

Change 423040 merged by Madhuvishy:
[operations/puppet@production] dumps: Add nfs, load, network monitoring for dist servers

https://gerrit.wikimedia.org/r/423040

Change 423359 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] dumps: Adjust network saturation thresholds for 10G interface

https://gerrit.wikimedia.org/r/423359

Change 423359 merged by Madhuvishy:
[operations/puppet@production] dumps: Adjust network saturation thresholds for 10G interface

https://gerrit.wikimedia.org/r/423359

The nginx logs from the active web server (labstore1007) need to get shipped over to stat1005 somehow, as we do with the dataset1001 logs for now in profile::dumps::web::xmldumps_active. If that happens already someplace, I could not find it.

Change 428864 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] dumps: Copy web server logs to stat host

https://gerrit.wikimedia.org/r/428864

Change 428864 merged by Madhuvishy:
[operations/puppet@production] dumps: Copy web server logs to stat host

https://gerrit.wikimedia.org/r/428864

Done :)