
Replicate data between codfw and eqiad
Closed, Resolved · Public

Description

User and project data (files, databases) need to be replicated between the two primary datacenters for redundancy. This data has two sources: the NFS server and the MySQL databases.

After discussion within the ops team, the solution picked for filesystems is simple hourly replication between sites, with the cross-site data available read-only in each location (as opposed to a shared filesystem that can be written to in both locations, which was considered too brittle). The codfw file server is already being set up with capacity for thin snapshots for that purpose.
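For illustration, the hourly cycle described above amounts to: take a thin snapshot, rsync it to the remote site, and tear the snapshot down. Below is a minimal sketch of that cycle; it is not the production implementation (which lives in Puppet), and all volume-group names, mount points, and hostnames are hypothetical.

```python
#!/usr/bin/env python3
"""Sketch of one hourly snapshot-and-replicate cycle (hypothetical names)."""

import datetime
import subprocess

VG = "labstore"                       # hypothetical volume group
ORIGIN = "others"                     # hypothetical thin origin volume
REMOTE = "labstore2001.codfw.wmnet"   # hypothetical remote file server
EXPORT = "/srv/others"                # hypothetical destination path


def run(cmd):
    """Run a command, raising on failure so a broken step aborts the cycle."""
    subprocess.run(cmd, check=True)


def replicate_once():
    stamp = datetime.datetime.utcnow().strftime("%Y%m%d%H")
    snap = f"{ORIGIN}-snap-{stamp}"
    mnt = f"/mnt/{snap}"

    # 1. Take a thin snapshot so we rsync a consistent point in time.
    run(["lvcreate", "--snapshot", "--name", snap, f"{VG}/{ORIGIN}"])
    run(["lvchange", "--activate", "y", "--ignoreactivationskip", f"{VG}/{snap}"])
    run(["mkdir", "-p", mnt])
    run(["mount", "-o", "ro", f"/dev/{VG}/{snap}", mnt])
    try:
        # 2. Push the snapshot to the remote site; --delete keeps the
        #    read-only replica an exact mirror of the source.
        run(["rsync", "-a", "--delete", f"{mnt}/", f"{REMOTE}:{EXPORT}/"])
    finally:
        # 3. Tear down the snapshot regardless of rsync's outcome.
        run(["umount", mnt])
        run(["lvremove", "--force", f"{VG}/{snap}"])


if __name__ == "__main__":
    replicate_once()
```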

Databases will be replicated via the normal MySQL replication mechanism.
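That mechanism boils down to pointing a replica at the primary's binary log. A minimal sketch follows, using the CHANGE MASTER TO syntax current at the time; the hostnames, credentials, and log coordinates are placeholders, not actual production values.

```python
"""Sketch: point a (hypothetical) codfw replica at the eqiad primary."""

import mysql.connector

# Connect to the replica in codfw (placeholder host and credentials).
replica = mysql.connector.connect(
    host="labsdb2001.codfw.wmnet", user="root", password="...")
cur = replica.cursor()

# Use coordinates read from SHOW MASTER STATUS on the primary
# (placeholder values here), then start replicating.
cur.execute("""
    CHANGE MASTER TO
        MASTER_HOST = 'labsdb1001.eqiad.wmnet',
        MASTER_USER = 'repl',
        MASTER_PASSWORD = '...',
        MASTER_LOG_FILE = 'mysql-bin.000042',
        MASTER_LOG_POS = 4
""")
cur.execute("START SLAVE")
cur.close()
replica.close()
```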

Event Timeline

coren raised the priority of this task to Needs Triage.
coren updated the task description. (Show Details)
coren added a project: Cloud-Services.
coren added subscribers: Aklapper, coren.

Rsync of hourly snapshots is pending on both storage clusters being operational.

coren triaged this task as Medium priority. Dec 31 2014, 3:00 PM
coren updated the task description. (Show Details)
coren set Security to None.

A point of note: doing so will require rejiggering storage in eqiad to use thin volumes as well (for snapshots), and will require extended downtime (24h or so).

24 hours of downtime is obviously very painful. Any way we could do it faster, or avoid it altogether?

When do you anticipate doing this? We'll need to plan and communicate about it well in advance...

I've already started the discussion on labs-l about scheduling it, and have revised the tentative schedule to address concerns from the beta team.

I've done some dry runs, and we may be able to do it with as little as 10h of downtime, but there are a *lot* of variables, so the window is 24h. Note that it's not /complete/ downtime: while /home and /data/project will be read-only during the interval, other networked filesystems will not be, so it's possible to work around it (as detailed in the email).

For reference:
https://lists.wikimedia.org/pipermail/labs-l/2014-December/003226.html

Updated:
https://lists.wikimedia.org/pipermail/labs-l/2015-January/003241.html

This is ready to start; the replicated copy will not be the live one until the filesystem switch needed for T85608 is done, but it does not depend on it.

What is a dependency is finishing tracking down the users with very large numbers of files (>20M) so that proper backup exclusions can be made for local caches and easily rebuildable data; otherwise even a dry run of the replication takes over 40h.
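The kind of scan involved might look like the sketch below: count files per project or home directory and flag the heavy ones as exclusion candidates. The threshold and paths are illustrative only, not the actual tooling used.

```python
"""Sketch: flag directories with very large file counts (illustrative)."""

import os

THRESHOLD = 20_000_000                # flag anything over ~20M files
ROOTS = ["/data/project", "/home"]    # the shares being replicated


def count_files(path):
    """Count regular files under path; os.walk does not follow symlinks."""
    total = 0
    for _, _, files in os.walk(path):
        total += len(files)
    return total


for root in ROOTS:
    for entry in sorted(os.listdir(root)):
        full = os.path.join(root, entry)
        if not os.path.isdir(full):
            continue
        n = count_files(full)
        if n > THRESHOLD:
            print(f"{full}: {n} files -- candidate for exclusion")
```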

Change 199267 had a related patch set uploaded (by coren):
WIP: Proper labs_storage class

https://gerrit.wikimedia.org/r/199267

Copy is complete, waiting for swap today.

Now that everything has been demonstrably stable over the Easter long weekend, we're ready to turn replication on with a bit of code review.

Change 199267 abandoned by coren:
WIP: Proper labs_storage class

Reason:
Superseded by https://gerrit.wikimedia.org/r/220618

https://gerrit.wikimedia.org/r/199267

Updates on this? I feel like this is a bit obsolete now and there are other tasks that this should be merged into.

This should be done now, right?

AFAIK, there are no machines for labsdb2 or toolsdb2 hosts. However, we do not guarantee a reliable user database service for what is mostly scratch data (obviously, production databases are already in codfw). There is redundancy within eqiad, though.

chasemp added a subscriber: chasemp.

Anything left here, I believe, could be considered part of T127567.