Set up backups for es6 and es7, and archive old read-only backups
Closed, Resolved (Public)

Description

Set up regular backups of the new active read-write hosts and, once the old ones are in read-only mode, send those to the backup archive, where they are not refreshed regularly.

Steps:

  • Set up new config (including grants)
  • Add new sections to monitoring (icinga, pampinus)
  • Remove old ones
  • Archival of old section data

Event Timeline

Change #1025714 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] [WIP]dbbackups: Add backups for es6 and es7

https://gerrit.wikimedia.org/r/1025714

jcrespo changed the task status from Open to In Progress. Apr 30 2024, 10:42 AM
jcrespo updated the task description.

Mentioned in SAL (#wikimedia-operations) [2024-05-09T08:53:41Z] <jynus> deploy new grants for es6, es7 backups T363812

Change #1025714 merged by Jcrespo:

[operations/puppet@production] dbbackups: Add backups for es6 and es7

https://gerrit.wikimedia.org/r/1025714

@Marostegui es6 and es7 backups are enabled, and a first run was done here. They seem mostly empty, though:

image.png (558×2 px, 141 KB)

Thanks Jaime - yeah, those hosts only have the table schemas.

Change #1031387 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Start monitoring es6, es7 for regular backups produced

https://gerrit.wikimedia.org/r/1031387

Change #1031387 merged by Jcrespo:

[operations/puppet@production] dbbackups: Start monitoring es6, es7 for regular backups produced

https://gerrit.wikimedia.org/r/1031387

Change #1031397 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Update the list of valid sections to check for WMFbackups

https://gerrit.wikimedia.org/r/1031397

Change #1031397 merged by Jcrespo:

[operations/puppet@production] dbbackups: Update the list of valid sections to check for WMFbackups

https://gerrit.wikimedia.org/r/1031397

TODO:

  • Stop es4 and es5 backups
  • Generate a final full backup of clusterX and clusterY
  • Archive it into long term backups
  • Remove dump user

Change #1042163 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Stop backing up es4 and es5 hosts

https://gerrit.wikimedia.org/r/1042163

Change #1042163 merged by Jcrespo:

[operations/puppet@production] dbbackups: Stop backing up es4 and es5 hosts

https://gerrit.wikimedia.org/r/1042163

@Marostegui, in order to resolve this ticket, and now that I assume read activity is lower, do you think I could get one host each from es4 and es5 in both DCs depooled for a day, with exclusive usage, to take a final, archivable, full backup of those sections? It doesn't have to happen at the same time on the 4 hosts:

eqiad:
  es4:
    host: 'es1022.eqiad.wmnet'
  es5:
    host: 'es1025.eqiad.wmnet'
codfw:
  es4:
    host: 'es2022.codfw.wmnet'
  es5:
    host: 'es2025.codfw.wmnet'
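
A minimal sketch of how the host list above could be flattened into an ordered work plan, one depool-and-dump step per host. The names `BACKUP_SOURCES` and `plan_backups` are hypothetical and are not part of any existing backup tooling; the datacenter order is a parameter because the hosts do not need to be handled at the same time.

```python
# Hypothetical helper: the mapping mirrors the host list quoted above.
# Variable and function names are illustrative only.
BACKUP_SOURCES = {
    "eqiad": {"es4": "es1022.eqiad.wmnet", "es5": "es1025.eqiad.wmnet"},
    "codfw": {"es4": "es2022.codfw.wmnet", "es5": "es2025.codfw.wmnet"},
}


def plan_backups(sources, dc_order):
    """Return (dc, section, host) tuples, grouped by datacenter in dc_order."""
    plan = []
    for dc in dc_order:
        # Sort sections so the plan is deterministic (es4 before es5).
        for section, host in sorted(sources[dc].items()):
            plan.append((dc, section, host))
    return plan


if __name__ == "__main__":
    for dc, section, host in plan_backups(BACKUP_SOURCES, ["codfw", "eqiad"]):
        print(f"{dc}: depool {host} and take the final {section} dump")
```

Keeping the plan as plain tuples makes it easy to run one datacenter at a time and to skip a host that cannot be depooled on a given day.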

Absolutely. I'd suggest starting with codfw this week if you want, as there is some switch maintenance in eqiad.

Change #1047920 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Stop monitoring backups for es4 and es5

https://gerrit.wikimedia.org/r/1047920

Change #1047920 merged by Jcrespo:

[operations/puppet@production] dbbackups: Stop monitoring backups for es4 and es5

https://gerrit.wikimedia.org/r/1047920

I didn't get to this this week, but let's try to have it done next week (CC @Marostegui @ABran-WMF @Ladsgroup), as I will be gone in 2 weeks and that way everything will be where it should be.

Icinga downtime and Alertmanager silence (ID=62f74f53-c682-4a9c-81a0-1d4564004f32) set by jynus@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: full dump

es2022.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=ef0acfcf-1704-406b-885e-a332b6a1c050) set by jynus@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: full dump

es2025.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-06-26T07:59:47Z] <jynus@cumin1002> dbctl commit (dc=all): 'Depool es1022 for backups T363812', diff saved to https://phabricator.wikimedia.org/P65454 and previous config saved to /var/cache/conftool/dbconfig/20240626-075946-jynus.json

Mentioned in SAL (#wikimedia-operations) [2024-06-26T08:11:30Z] <jynus@cumin1002> dbctl commit (dc=all): 'Depool es1025 for backups T363812', diff saved to https://phabricator.wikimedia.org/P65458 and previous config saved to /var/cache/conftool/dbconfig/20240626-081130-jynus.json

This is the status now: es2025 completed and repooled, es2022 is about to finish (but still depooled), es1022 + es1025 have just started:

image.png (477×1 px, 127 KB)

Mentioned in SAL (#wikimedia-operations) [2024-06-26T10:25:23Z] <jynus@cumin1002> dbctl commit (dc=all): 'Repool es2022 after backup T363812', diff saved to https://phabricator.wikimedia.org/P65464 and previous config saved to /var/cache/conftool/dbconfig/20240626-102523-jynus.json

Mentioned in SAL (#wikimedia-operations) [2024-06-26T10:39:34Z] <jynus@cumin1002> dbctl commit (dc=all): 'Repool es2022 at 50% T363812', diff saved to https://phabricator.wikimedia.org/P65465 and previous config saved to /var/cache/conftool/dbconfig/20240626-103933-jynus.json

Mentioned in SAL (#wikimedia-operations) [2024-06-26T11:19:35Z] <jynus@cumin1002> dbctl commit (dc=all): 'Repool es2022 fully T363812', diff saved to https://phabricator.wikimedia.org/P65466 and previous config saved to /var/cache/conftool/dbconfig/20240626-111934-jynus.json

es2022 finished, all good. I am going to disable bacula for es hosts so it doesn't run while the ongoing db dumps finish, then reenable it and do the one-time read-only backup.

Change #1049921 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Disable regular es backups (es6, es7) while es4/5 run

https://gerrit.wikimedia.org/r/1049921

Change #1049921 merged by Jcrespo:

[operations/puppet@production] dbbackups: Disable regular es backups (es6, es7) while es4/5 run

https://gerrit.wikimedia.org/r/1049921

Mentioned in SAL (#wikimedia-operations) [2024-06-27T07:45:43Z] <jynus@cumin1002> dbctl commit (dc=all): 'Repool es1022 at 10% weight T363812', diff saved to https://phabricator.wikimedia.org/P65512 and previous config saved to /var/cache/conftool/dbconfig/20240627-074542-jynus.json

Mentioned in SAL (#wikimedia-operations) [2024-06-27T07:54:47Z] <jynus@cumin1002> dbctl commit (dc=all): 'Repool es1025 at 10% weight T363812', diff saved to https://phabricator.wikimedia.org/P65513 and previous config saved to /var/cache/conftool/dbconfig/20240627-075447-jynus.json

Mentioned in SAL (#wikimedia-operations) [2024-06-27T07:56:20Z] <jynus@cumin1002> dbctl commit (dc=all): 'Repool es1022 at 50% weight T363812', diff saved to https://phabricator.wikimedia.org/P65514 and previous config saved to /var/cache/conftool/dbconfig/20240627-075620-jynus.json

Mentioned in SAL (#wikimedia-operations) [2024-06-27T07:59:45Z] <jynus@cumin1002> dbctl commit (dc=all): 'Repool es1025 at 50% weight T363812', diff saved to https://phabricator.wikimedia.org/P65515 and previous config saved to /var/cache/conftool/dbconfig/20240627-075944-jynus.json

Mentioned in SAL (#wikimedia-operations) [2024-06-27T08:10:16Z] <jynus@cumin1002> dbctl commit (dc=all): 'Repool es1022 at 100% weight T363812', diff saved to https://phabricator.wikimedia.org/P65516 and previous config saved to /var/cache/conftool/dbconfig/20240627-081016-jynus.json

Mentioned in SAL (#wikimedia-operations) [2024-06-27T08:10:45Z] <jynus@cumin1002> dbctl commit (dc=all): 'Repool es1025 at 100% weight T363812', diff saved to https://phabricator.wikimedia.org/P65517 and previous config saved to /var/cache/conftool/dbconfig/20240627-081044-jynus.json
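
The entries above show each host being brought back in three stages (10%, 50%, 100% weight). A minimal sketch of generating the matching commit messages follows; `repool_messages` is a hypothetical helper that only formats strings in the style seen in the log and does not invoke dbctl or change any pooling state.

```python
def repool_messages(host, task, weights=(10, 50, 100)):
    """Build one commit message per repool stage, in increasing weight order.

    The default weights are the stages seen in the log entries above; the
    function itself is illustrative and performs no pooling changes.
    """
    return [f"Repool {host} at {w}% weight {task}" for w in sorted(weights)]


if __name__ == "__main__":
    for message in repool_messages("es1022", "T363812"):
        print(message)
```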

Change #1050259 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Reenable es backups, also enable ro ones for archival

https://gerrit.wikimedia.org/r/1050259

Change #1050259 merged by Jcrespo:

[operations/puppet@production] dbbackups: Reenable es backups, also enable ro ones for archival

https://gerrit.wikimedia.org/r/1050259

Change #1050265 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Fix backup name conflict, disable regular backups

https://gerrit.wikimedia.org/r/1050265

Change #1050265 merged by Jcrespo:

[operations/puppet@production] dbbackups: Fix backup name conflict, disable regular backups

https://gerrit.wikimedia.org/r/1050265

Mentioned in SAL (#wikimedia-operations) [2024-07-01T08:18:12Z] <jynus@cumin1002> dbctl commit (dc=all): 'Depool es1025 for backups T363812', diff saved to https://phabricator.wikimedia.org/P65562 and previous config saved to /var/cache/conftool/dbconfig/20240701-081811-jynus.json

Mentioned in SAL (#wikimedia-operations) [2024-07-02T08:57:34Z] <jynus@cumin1002> dbctl commit (dc=all): 'Repool es1025 at 10% weight T363812', diff saved to https://phabricator.wikimedia.org/P65643 and previous config saved to /var/cache/conftool/dbconfig/20240702-085733-jynus.json

Mentioned in SAL (#wikimedia-operations) [2024-07-02T09:15:09Z] <jynus@cumin1002> dbctl commit (dc=all): 'Repool es1025 at 50% weight T363812', diff saved to https://phabricator.wikimedia.org/P65644 and previous config saved to /var/cache/conftool/dbconfig/20240702-091508-jynus.json

Mentioned in SAL (#wikimedia-operations) [2024-07-02T10:06:36Z] <jynus@cumin1002> dbctl commit (dc=all): 'Repool es1025 at 100% weight T363812', diff saved to https://phabricator.wikimedia.org/P65645 and previous config saved to /var/cache/conftool/dbconfig/20240702-100636-jynus.json
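
A rough sketch, assuming the SAL entries keep the format shown above, of how the time a host spent out of full rotation could be computed from the log. The regular expression and the function names are assumptions made for illustration; the sample lines are shortened copies of the two es1025 entries above.

```python
import re
from datetime import datetime

# Matches the UTC timestamp and the quoted dbctl commit message of a SAL entry.
SAL_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z\].*?"
    r"'(?P<action>Depool|Repool) (?P<host>\w+)(?P<rest>[^']*)'"
)

SAMPLE = [
    "[2024-07-01T08:18:12Z] <jynus@cumin1002> dbctl commit (dc=all): "
    "'Depool es1025 for backups T363812'",
    "[2024-07-02T10:06:36Z] <jynus@cumin1002> dbctl commit (dc=all): "
    "'Repool es1025 at 100% weight T363812'",
]


def parse_sal(line):
    """Return (timestamp, action, host, rest) or None if the line does not match."""
    m = SAL_RE.search(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S")
    return ts, m.group("action"), m.group("host"), m.group("rest").strip()


def time_out_of_full_rotation(lines, host):
    """Time between the last depool of `host` and its later repool at 100%."""
    start = end = None
    for line in lines:
        parsed = parse_sal(line)
        if parsed is None or parsed[2] != host:
            continue
        ts, action, _, rest = parsed
        if action == "Depool":
            start, end = ts, None
        elif action == "Repool" and "100%" in rest and start is not None:
            end = ts
    if start is None or end is None:
        return None
    return end - start


if __name__ == "__main__":
    print(time_out_of_full_rotation(SAMPLE, "es1025"))
```

Note that this only recognises repool messages containing "100%"; an entry worded like 'Repool es2022 fully' would need an extra rule.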

Change #1051341 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Reenable es-readonly for one-time es5 section backup

https://gerrit.wikimedia.org/r/1051341

Change #1051341 merged by Jcrespo:

[operations/puppet@production] dbbackups: Reenable es-readonly for one-time es5 section backup

https://gerrit.wikimedia.org/r/1051341

es4 has already been archived in jobs 574899 and 574900; the two jobs for es5 are running now. When they finish, we will be able to close this ticket.

We will need to refresh the es1, es2 and es3 jobs (and probably redump them), but that is out of scope for this ticket.

Change #1051744 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Disable es read-only backups and reenable rw ones

https://gerrit.wikimedia.org/r/1051744

Change #1051744 merged by Jcrespo:

[operations/puppet@production] dbbackups: Disable es read-only backups and reenable rw ones

https://gerrit.wikimedia.org/r/1051744

I will skip the "Remove dump user" step, as I think the user may still be useful; we will decide how to leave it long term (with or without the user) when the es1, es2 & es3 backups are generated.

Other than that, all steps are completed; the remaining task (tracked elsewhere) is to refresh those older read-only dumps.