
Set up backup strategy for es clusters
Closed, Resolved · Public

Description

Author: ben

Description:

None of the ES clusters are currently backed up. We have data replication but
not backup.
* Create a process for backing up the ES clusters
* Document this process
* Implement this process
* Test this process
* Schedule the next test

Referred To By:
{T79674}

Details

Reference
rt1576
Project            Branch      Lines +/-   Subject
operations/puppet  production  +2 -2
operations/puppet  production  +9 -6
operations/puppet  production  +3 -3
operations/puppet  production  +6 -6
operations/puppet  production  +1 -1
operations/puppet  production  +2 -2
operations/puppet  production  +1 -1
operations/puppet  production  +4 -5
operations/puppet  production  +40 -7
operations/puppet  production  +235 -136
operations/puppet  production  +18 -3
operations/puppet  production  +9 -1
operations/puppet  production  +14 -5
operations/puppet  production  +1 -0
operations/puppet  production  +32 -4
operations/puppet  production  +105 -51
operations/puppet  production  +7 -6
operations/puppet  production  +28 -6
operations/puppet  production  +2 -2
operations/puppet  production  +27 -5

Event Timeline

There are a very large number of changes, so older changes are hidden.

Should this task still be assigned to @Springle? (Same question for other tasks.)

Marostegui moved this task from Triage to Meta/Epic on the DBA board.
Marostegui added subscribers: Springle, Marostegui.

I have moved it to the Meta/Epic column on the DBA dashboard, as it is a pretty big task and we should probably split it into smaller ones so we can track its progress.

This has been budgeted (as in budget requested, don't know if approved) and scheduled for FY2017-2018.

jcrespo raised the priority of this task from Medium to High. Apr 12 2017, 10:19 AM
fgiunchedi changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)". Jul 19 2017, 2:31 PM
fgiunchedi changed the edit policy from "WMF-NDA (Project)" to "All Users".

While there are full logical dumps of the es hosts, those are not integrated with the general metadata and misc backups. This task, once T244884 is done, will integrate them under the same process, although with slightly different logic.

Codfw hardware is now available: T248934 (75 TB total); it only needs puppetization.

Change 589263 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] database-backups: Set backup2002 as a generator of es* host dumps

https://gerrit.wikimedia.org/r/589263

Change 589266 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Move all backup config templates to its own subdir

https://gerrit.wikimedia.org/r/589266

Change 589263 merged by Jcrespo:
[operations/puppet@production] database-backups: Set backup2002 as a generator of es* host dumps

https://gerrit.wikimedia.org/r/589263

Change 589266 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Move all backup config templates to its own subdir

https://gerrit.wikimedia.org/r/589266

Mentioned in SAL (#wikimedia-operations) [2020-04-16T09:16:34Z] <jynus> starting es backups on backup2002 T79922

Change 589303 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Split mariadb::backups into 2 roles for backup2002

https://gerrit.wikimedia.org/r/589303

Change 589303 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Split mariadb::backups into 2 roles for backup2002

https://gerrit.wikimedia.org/r/589303

Change 589534 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Skip generation of backups for read-only es sections

https://gerrit.wikimedia.org/r/589534

Change 589534 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Skip generation of backups for read-only es sections

https://gerrit.wikimedia.org/r/589534

Proposed backup workflow strategy for content and metadata database backups:

backup_workflow.png (1×3 px, 648 KB)

Does that mean we won't be storing metadata/content backup within the same DC and only on the remote DC?

> Does that mean we won't be storing metadata/content backup within the same DC and only on the remote DC?

There will be backups available within the same DC (the freshest ones) on the dbprov* and backup[12]002 hosts, both snapshots AND dumps (dumps only for content). In any case, one could use the backups from the other DC locally (recovery could be done from the same DC or the other one); they should be 100% identical. The idea of going cross-DC is to have extra redundancy across the WAN (e.g. if eqiad becomes 100% unavailable, we still have the "offline" backups). Also, if one has to go to Bacula, there will be a wait anyway; normally you would go to the local dbprov host for the freshest copy, without needing Bacula.

Pros:

  • Cross-DC redundancy (we have both local and remote files on each DC)
  • Redundancy in bacula, not only dbprov
  • In case of a full DC outage, the backups are already local
  • Recovery can be done both from local and remote backups
  • We are actually *already doing this*, but only in one direction (backup1001 stores files generated on dbprov2* hosts), although there was no formal decision about it; we would only add the other direction (backup2001 storing the dbprov1* ones), plus the content backups following the same philosophy.

Cons:

  • Slower and less efficient cold backup (Bacula) process, as it happens over the WAN
  • It assumes that if the fast, local provisioning system didn't work, we are more likely to want the files from the other DC quickly

The main question is not really whether we should do this or not (we should have Bacula redundancy between DCs); the question is whether we should send the Bacula copies cross-DC or stay 100% independent and send only duplicated local copies. One option gives geographical process redundancy, the other geographical data redundancy. I lean towards the data-redundancy side (cross-DC sends) because of the two worst-case scenarios:

  • Backups are corrupted on 1 dc
  • A dc becomes 100% offline

Cross-DC is slightly more complex and less efficient when taking backups, but more efficient in those worst-case recoveries.
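
To make the placement and recovery preference above concrete, here is a minimal sketch; it is not actual WMF tooling, and the host names, data structure and ranking are illustrative placeholders mirroring the scheme described in this comment:

```python
"""Illustrative model only: ranks where to recover from, per the scheme above.
Host names and the BackupCopy structure are hypothetical placeholders."""
from dataclasses import dataclass


@dataclass
class BackupCopy:
    location: str    # e.g. "dbprov1001.eqiad.wmnet" or "bacula@backup2001"
    datacenter: str  # DC where the copy physically lives
    medium: str      # "dump", "snapshot" or "bacula"


def recovery_order(copies: list[BackupCopy], local_dc: str) -> list[BackupCopy]:
    """Prefer the freshest, directly usable copies on local provisioning
    hosts; fall back to Bacula (long-term, cross-WAN restore) last."""
    def rank(copy: BackupCopy) -> int:
        if copy.medium != "bacula":
            return 0 if copy.datacenter == local_dc else 1
        return 2
    return sorted(copies, key=rank)


if __name__ == "__main__":
    copies = [
        BackupCopy("bacula@backup2001.codfw.wmnet", "codfw", "bacula"),
        BackupCopy("dbprov1001.eqiad.wmnet", "eqiad", "dump"),
        BackupCopy("backup1002.eqiad.wmnet", "eqiad", "snapshot"),
    ]
    for copy in recovery_order(copies, local_dc="eqiad"):
        print(copy.location, copy.medium)
```

In other words, local dbprov/backup hosts are tried first because they hold the freshest, directly usable copies, and Bacula is the last resort because a restore from it crosses the WAN and takes longer.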

Thanks for the clarification; from the diagram it wasn't entirely clear to me whether the short-term backups would be kept locally (hence my question :-) ) or not.

I also prefer the idea of being data redundant, even if it is a bit more expensive in terms of complexity and efficiency. The likelihood of having corrupted data is quite small, but we'd better be fully sure, as the chance of having two independent copies of the data corrupted would be extremely minimal (unless something crashes at the same time and/or the corruption happens through replication and reaches both DCs).

> from the diagram it wasn't entirely clear

The summary is: "generate and keep short term locally, store long term on a geographically separate site."
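
As a purely declarative sketch of that summary (names are taken from the comments above; the authoritative definitions live in operations/puppet and the Bacula pool configuration):

```python
# Illustrative only: where each DC's content/metadata backups live, per the
# summary above. Host names/globs are taken from this task; not authoritative.
PLACEMENT = {
    # freshest copies: generated and kept in the same DC as the source
    "short_term_local": {
        "eqiad": ["dbprov1*", "backup1002"],
        "codfw": ["dbprov2*", "backup2002"],
    },
    # long-term copies: archived with Bacula in the *other* DC
    "long_term_remote": {
        "eqiad": "backup2001 (codfw)",
        "codfw": "backup1001 (eqiad)",
    },
}
```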

Yep, that is good - thanks for the clarification :-)

Change 591305 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Include es4/5 into backup checks and refactor

https://gerrit.wikimedia.org/r/591305

Change 591305 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Include es4/5 into backup checks and refactor

https://gerrit.wikimedia.org/r/591305

The last host needed, backup1002, is finally fully set up, HW- and OS-wise, and ready for implementing the last part of external storage backups (cross-DC redundancy).

Change 596255 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Add backup1002 as the eqiad host for ES db backups

https://gerrit.wikimedia.org/r/596255

Change 596255 merged by Jcrespo:
[operations/puppet@production] backups: Add backup1002 as the eqiad host for ES db backups

https://gerrit.wikimedia.org/r/596255

Change 598005 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] Add new pool DatabasesCodfw to backup data generated on eqiad

https://gerrit.wikimedia.org/r/598005

Change 602080 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Add snapshoting capabilities to content backup hosts

https://gerrit.wikimedia.org/r/602080

Change 602080 merged by Jcrespo:
[operations/puppet@production] mariadb: Add snapshotting capabilities to content backup hosts

https://gerrit.wikimedia.org/r/602080

Change 598005 merged by Jcrespo:
[operations/puppet@production] Add new pool DatabasesCodfw to backup data generated on eqiad

https://gerrit.wikimedia.org/r/598005

Change 620655 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] Backups: Fix storage definition for Databases on codfw

https://gerrit.wikimedia.org/r/620655

Change 620655 merged by Jcrespo:
[operations/puppet@production] Backups: Fix storage definition for Databases on codfw

https://gerrit.wikimedia.org/r/620655

Change 620654 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Enable eqiad backups, which are sent to codfw (backup2001)

https://gerrit.wikimedia.org/r/620654

Change 620654 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Enable eqiad backups, which are sent to codfw (backup2001)

https://gerrit.wikimedia.org/r/620654

The ES clusters are backed up, but currently only locally. We need to finish the cross-DC backups, hopefully in Q3.

jcrespo lowered the priority of this task from High to Medium. Oct 20 2020, 5:21 PM

Change 659952 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] Bacula: Create a new set of storage daemons dedicated to db ES backups

https://gerrit.wikimedia.org/r/659952

Change 661396 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] Bacula: Create a new set of storage daemons dedicated to db ES backups

https://gerrit.wikimedia.org/r/661396

Change 661691 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Enable External Store backups

https://gerrit.wikimedia.org/r/661691

Change 661396 merged by Jcrespo:
[operations/puppet@production] Bacula: Create a new set of storage daemons dedicated to db ES backups

https://gerrit.wikimedia.org/r/661396

Change 659952 merged by Jcrespo:
[operations/puppet@production] Bacula: Start using new storage/pools for es database content backups

https://gerrit.wikimedia.org/r/659952

Change 662639 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] Bacula: Fix depencency cycle on es backup storage setup

https://gerrit.wikimedia.org/r/662639

Change 662639 merged by Jcrespo:
[operations/puppet@production] Bacula: Fix dependency cycle on es backup storage setup

https://gerrit.wikimedia.org/r/662639

Change 662644 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Apply partial rename of jobdefaults without changing devicename

https://gerrit.wikimedia.org/r/662644

Change 662644 merged by Jcrespo:
[operations/puppet@production] backups: Apply partial rename of jobdefaults without changing devicename

https://gerrit.wikimedia.org/r/662644

Change 661691 merged by Jcrespo:
[operations/puppet@production] backups: Enable External Store backups

https://gerrit.wikimedia.org/r/661691

Change 662653 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] Fix reference to still not yet renamed pool Databases

https://gerrit.wikimedia.org/r/662653

Change 662653 merged by Jcrespo:
[operations/puppet@production] Bacula: Fix reference to still not yet renamed pool Databases

https://gerrit.wikimedia.org/r/662653

After deployment, the last operational steps are being done to ensure backups will be generated correctly; after that (and its documentation), we could consider this resolved.

codfw -> eqiad completed rather quickly:

305897  Full       5,561    3.078 T  OK       08-Feb-21 17:19 backup2002.codfw.wmnet-Monthly-1st-Wed-EsRwEqiad-mysql-srv-backups-dumps-latest

For some reason, eqiad -> codfw backups take 4x-7x longer :-( https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=8&orgId=1&from=1612780931730&to=1612809026255&var-server=backup1002&var-datasource=thanos&var-cluster=misc

Screenshot from 2021-02-08 19-31-35.png (940×2 px, 146 KB)

@ayounsi Could you think of a reason for this discrepancy at the network layer? I cannot think of one at the HW or software level. The only thing I see that is different is the amount of TCP retransmissions: https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=31&orgId=1&from=1612780931000&to=1612809644034&var-server=backup1002&var-datasource=thanos&var-cluster=misc

(I don't expect a deep analysis from you, just a quick assessment of things that could make transmission in one direction faster than in the other, before asking for dc-ops involvement.)

Overall there is more traffic in the eqiad->codfw direction, so that could be part of the reason.

Is it possible to know a bit more about the source, destination and traffic pattern?
Maybe we can also try iperf on a host pair to test it?
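
A minimal sketch of that kind of test, assuming iperf3 is installed on the backup hosts and an iperf3 server (`iperf3 -s`) is already listening on the peer; the peer name below is only an example and this is not existing tooling:

```python
#!/usr/bin/env python3
"""Compare TCP throughput in both directions between a pair of hosts.

Assumes iperf3 is installed locally and `iperf3 -s` is already running on
the remote peer; the peer host name below is only an example.
"""
import json
import subprocess

PEER = "backup2002.codfw.wmnet"  # example peer; run this e.g. from backup1002


def throughput_gbps(reverse: bool) -> float:
    """Run a 10-second iperf3 test; reverse=True measures peer -> local."""
    cmd = ["iperf3", "-c", PEER, "-t", "10", "-J"]
    if reverse:
        cmd.append("-R")
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return json.loads(out)["end"]["sum_received"]["bits_per_second"] / 1e9


if __name__ == "__main__":
    print(f"local -> {PEER}: {throughput_gbps(reverse=False):.2f} Gbit/s")
    print(f"{PEER} -> local: {throughput_gbps(reverse=True):.2f} Gbit/s")
```

Running it from one host of the pair measures both directions from a single point, which should show whether the asymmetry reproduces outside of Bacula.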

I will open a new task for this issue and add you there. While this is not a blocker for backup generation, it would be during an emergency, and we should understand why this is happening before we find ourselves in a rush.

Change 663005 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Temporarily enable read-only backups and disable rw backups of ES

https://gerrit.wikimedia.org/r/663005

Change 663005 merged by Jcrespo:
[operations/puppet@production] bacula: Temporarily enable read-only backups and disable rw backups of ES

https://gerrit.wikimedia.org/r/663005

Change 663877 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] dbbackups: disable all ES db bacula runs until next week

https://gerrit.wikimedia.org/r/663877

Change 663877 merged by Jcrespo:
[operations/puppet@production] dbbackups: disable all ES db bacula runs until next week

https://gerrit.wikimedia.org/r/663877

Change 664508 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] dbbackups: Reenable read-write backups, disable ro, document job ids

https://gerrit.wikimedia.org/r/664508

I've documented the new architecture to support ES backups, with the additional pools: https://wikitech.wikimedia.org/wiki/Bacula#Retention

Once the last backup (es3 from eqiad) finishes (it is in its last GBs), we could consider this resolved. CC @LSobanski

Change 664508 merged by Jcrespo:
[operations/puppet@production] dbbackups: Reenable read-write backups, disable ro, document job ids

https://gerrit.wikimedia.org/r/664508

jcrespo closed this task as Resolved (Edited). Feb 16 2021, 3:08 PM

This is now done: we have fully covered, regularly scheduled, geographically redundant ES cluster backups on Bacula.

Change 671114 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Reduce read-write es db backup retention to 60 days

https://gerrit.wikimedia.org/r/671114

Change 671114 merged by Jcrespo:
[operations/puppet@production] bacula: Reduce read-write es db backup retention to 60 days

https://gerrit.wikimedia.org/r/671114