Puppetize media backups infrastructure
Closed, Resolved (Public)

Description

Design and implement all the logic necessary to run media backups on the orchestrator (ms-backup) and backup storage (backup) hosts. This includes installing and configuring the packages they need to run, as well as any recurring execution of the backup generation.

In the end, all the code needed to fully reimage and automatically set up the required hosts (on Debian buster, in both datacenters) should be deployed in the operations puppet repository, so that those hosts can manage (generate, recover) and store media backups.

Details

Related Changes in Gerrit:
Repo               Branch      Lines +/-
operations/puppet  production  +49 -1
operations/puppet  production  +1 -1
operations/puppet  production  +13 -13
operations/puppet  production  +68 -17
labs/private       master      +2 -0
operations/puppet  production  +2 -2
operations/puppet  production  +28 -28
operations/puppet  production  +14 -6
operations/puppet  production  +138 -11
labs/private       master      +4 -6
operations/puppet  production  +4 -3
operations/puppet  production  +24 -2
operations/puppet  production  +0 -13
operations/puppet  production  +2 -2
operations/puppet  production  +1 -1
operations/puppet  production  +1 -1
operations/puppet  production  +7 -0
operations/puppet  production  +42 -11
labs/private       master      +7 -0
operations/puppet  production  +1 -1
operations/puppet  production  +21 -2
operations/puppet  production  +2 -2
operations/puppet  production  +0 -2
operations/puppet  production  +64 -13
operations/puppet  production  +19 -6
operations/puppet  production  +61 -2
labs/private       master      +2 -3
labs/private       master      +3 -0

Event Timeline

Change 668380 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mediabackup: Initial setup for the swift media backup orchestator hosts

https://gerrit.wikimedia.org/r/668380

jcrespo triaged this task as High priority. (Mar 4 2021, 11:59 AM)

Change 681103 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mariadb: Setup 2 new host as temporary metadata database for media backups

https://gerrit.wikimedia.org/r/681103

Change 681117 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackup: Setup the storage hosts

https://gerrit.wikimedia.org/r/681117

Change 681602 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[labs/private@master] mediabackups: Set a dummy password on cloud repo to check validity

https://gerrit.wikimedia.org/r/681602

Change 681602 merged by Jcrespo:

[labs/private@master] mediabackups: Set a dummy password on cloud repo to check validity

https://gerrit.wikimedia.org/r/681602

Change 693355 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[labs/private@master] mediabackups: Move the database password to the common namespace

https://gerrit.wikimedia.org/r/693355

Change 693355 merged by Jcrespo:

[labs/private@master] mediabackups: Move the database password to the common namespace

https://gerrit.wikimedia.org/r/693355

Change 668380 merged by Jcrespo:

[operations/puppet@production] mediabackup: Initial setup for the media backup worker hosts

https://gerrit.wikimedia.org/r/668380

Change 681103 merged by Jcrespo:

[operations/puppet@production] mariadb: Setup 2 new host as temporary metadata database for media backups

https://gerrit.wikimedia.org/r/681103

Change 681117 merged by Jcrespo:

[operations/puppet@production] mediabackup: Setup the storage hosts

https://gerrit.wikimedia.org/r/681117

Change 693407 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mariadb: Reenable notifications on db1176, db2151 after setup

https://gerrit.wikimedia.org/r/693407

Change 693407 merged by Jcrespo:

[operations/puppet@production] mariadb: Reenable notifications on db1176, db2151 after setup

https://gerrit.wikimedia.org/r/693407

The first pass of puppetization is complete: we now have at least one server set up per type with a very basic class. Now it is a question of improving the profiles and classes until the entire functionality is replicable (to be done at the same time as T276445).

All codfw hardware is now fully set up.

Change 694332 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackup: Install minio on the storage hosts and open port 9000

https://gerrit.wikimedia.org/r/694332

Change 697586 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] install_server: Set backup2* hosts are non-reimagable

https://gerrit.wikimedia.org/r/697586

Change 697586 merged by Jcrespo:

[operations/puppet@production] install_server: Set backup2* hosts are non-reimagable

https://gerrit.wikimedia.org/r/697586

Change 694332 merged by Jcrespo:

[operations/puppet@production] mediabackup: Install minio on the storage hosts and open port 9000

https://gerrit.wikimedia.org/r/694332

Change 704510 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackup: Assign correct default group to minio-user

https://gerrit.wikimedia.org/r/704510

Change 704510 merged by Jcrespo:

[operations/puppet@production] mediabackup: Assign correct default group to minio-user

https://gerrit.wikimedia.org/r/704510

Change 704517 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[labs/private@master] mediabackup: Add dummy passwords for mediabackup storage keys

https://gerrit.wikimedia.org/r/704517

Change 704517 merged by Jcrespo:

[labs/private@master] mediabackup: Add dummy passwords for mediabackup storage keys

https://gerrit.wikimedia.org/r/704517

Change 704518 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackup: Update config dir behaviour, add root passwords

https://gerrit.wikimedia.org/r/704518

Change 704518 merged by Jcrespo:

[operations/puppet@production] mediabackup: Update config dir behaviour, add root passwords

https://gerrit.wikimedia.org/r/704518

Change 704580 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackup: Add monitoring to the minio storage server process

https://gerrit.wikimedia.org/r/704580

Change 704580 merged by Jcrespo:

[operations/puppet@production] mediabackup: Add monitoring to the minio storage server process

https://gerrit.wikimedia.org/r/704580

Change 704582 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackup: Update check name to not have spaces

https://gerrit.wikimedia.org/r/704582

Change 704582 merged by Jcrespo:

[operations/puppet@production] mediabackup: Update check name to not have spaces

https://gerrit.wikimedia.org/r/704582

Change 704585 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackup: Update the method of detecting minio server processes

https://gerrit.wikimedia.org/r/704585

Change 704585 merged by Jcrespo:

[operations/puppet@production] mediabackup: Update the method of detecting minio server processes

https://gerrit.wikimedia.org/r/704585

Change 704587 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mariabackup: Fix config dir variable interpolation on module

https://gerrit.wikimedia.org/r/704587

Change 704587 merged by Jcrespo:

[operations/puppet@production] mariabackup: Fix config dir variable interpolation on module

https://gerrit.wikimedia.org/r/704587

All hosts have been set up, including TLS. For now we are using Puppet's CA and certs, which may be enough for an internal service with internal IPs that should not be public, but we can very easily change to new certs under the same CA or a separate one.

Basic functionality works, and basic monitoring is set up (it checks that the minio server is running).

The main pending item is the prometheus exporter setup for minio-specific metrics. Aside from liveness and readiness checks, there is built-in prometheus support if MINIO_PROMETHEUS_AUTH_TYPE="public" is configured and we query /minio/v2/metrics/cluster or /minio/v2/metrics/node.
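
For illustration, here is a minimal Python sketch of how those two metrics endpoints could be queried and parsed by hand. It assumes the storage hosts serve minio over HTTPS on port 9000 (as set up above) and that the public auth type is enabled. The function names are made up for this example, and the parser only handles plain sample lines without timestamps.

```python
import urllib.request


def metrics_url(host, scope="cluster", port=9000, scheme="https"):
    """Build the URL of a minio v2 metrics endpoint ("cluster" or "node")."""
    if scope not in ("cluster", "node"):
        raise ValueError("scope must be 'cluster' or 'node'")
    return f"{scheme}://{host}:{port}/minio/v2/metrics/{scope}"


def parse_metrics(text):
    """Parse Prometheus text exposition output into {series: value}.

    Comment lines (# HELP, # TYPE) and blank lines are skipped.
    Assumption: samples carry no trailing timestamp, so the value is
    always the last space-separated token of the line.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def fetch_metrics(host, scope="cluster", timeout=10):
    """Fetch and parse one endpoint (needs network access and a trusted cert)."""
    with urllib.request.urlopen(metrics_url(host, scope), timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

In practice prometheus itself would scrape these paths; a script like this is only useful for a quick manual check that the endpoint answers.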

I will be focusing now on the worker setup.

Change 704599 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackup: Remove deleted directories and files

https://gerrit.wikimedia.org/r/704599

Change 704599 merged by Jcrespo:

[operations/puppet@production] mediabackup: Remove deleted directories and files

https://gerrit.wikimedia.org/r/704599

Change 705694 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] prometheus: Add hosts_only false on minio job

https://gerrit.wikimedia.org/r/705694

Change 705694 merged by Jcrespo:

[operations/puppet@production] prometheus: Add hosts_only=false on minio job

https://gerrit.wikimedia.org/r/705694

We need to fix the craziness of the current partitioning, which differs between servers:

# cumin 'P:mediabackup::storage' 'lsblk -b /dev/sdc'
8 hosts will be targeted:
backup[2004-2007].codfw.wmnet,backup[1004-1007].eqiad.wmnet
Ok to proceed on 8 hosts? Enter the number of affected hosts to confirm or "q" to quit 8
===== NODE GROUP =====                                                                                                                         
(1) backup1004.eqiad.wmnet                                                                                                                     
----- OUTPUT of 'lsblk -b /dev/sdc' -----                                                                                                      
NAME   MAJ:MIN RM            SIZE RO TYPE MOUNTPOINT                                                                                           
sdc      8:32   0 176021718433792  0 disk                                                                                                      
└─sdc1   8:33   0 176021717368320  0 part 
===== NODE GROUP =====                                                                                                                         
(3) backup[1005-1007].eqiad.wmnet                                                                                                              
----- OUTPUT of 'lsblk -b /dev/sdc' -----                                                                                                      
NAME   MAJ:MIN RM            SIZE RO TYPE MOUNTPOINT                                                                                           
sdc      8:32   0 176021718433792  0 disk                                                                                                      
├─sdc1   8:33   0     49999249408  0 part 
├─sdc2   8:34   0      1000341504  0 part 
└─sdc3   8:35   0 175970716745728  0 part 
===== NODE GROUP =====                                                                                                                         
(3) backup[2004,2006-2007].codfw.wmnet                                                                                                         
----- OUTPUT of 'lsblk -b /dev/sdc' -----                                                                                                      
NAME   MAJ:MIN RM            SIZE RO TYPE MOUNTPOINT                                                                                           
sdc      8:32   0 176021718433792  0 disk                                                                                                      
├─sdc1   8:33   0       157286400  0 part 
└─sdc2   8:34   0      2147483648  0 part 
===== NODE GROUP =====                                                                                                                         
(1) backup2005.codfw.wmnet                                                                                                                     
----- OUTPUT of 'lsblk -b /dev/sdc' -----                                                                                                      
NAME MAJ:MIN RM            SIZE RO TYPE MOUNTPOINT                                                                                             
sdc    8:32   0 176021718433792  0 disk                                                                                                        
================                                                                                                                               
PASS |███████████████████████████████████████████████████████████████████████████████████████| 100% (8/8) [00:00<00:00,  8.81hosts/s]          
FAIL |                                                                                               |   0% (0/8) [00:00<?, ?hosts/s]
100.0% (8/8) success ratio (>= 100.0% threshold) for command: 'lsblk -b /dev/sdc'.
100.0% (8/8) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
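
To make the differences easier to track while they are being fixed, the per-host output could be grouped by partition layout with something like the following Python sketch. It assumes the `lsblk -b` output of each host has already been collected into a dict keyed by hostname; the function names are hypothetical.

```python
from collections import defaultdict


def partition_sizes(lsblk_output):
    """Return the partition sizes (in bytes) listed in `lsblk -b` output.

    Expected columns: NAME MAJ:MIN RM SIZE RO TYPE [MOUNTPOINT].
    Only rows whose TYPE is "part" are kept, in the order shown.
    """
    sizes = []
    for line in lsblk_output.splitlines():
        fields = line.split()
        if len(fields) >= 6 and fields[5] == "part":
            sizes.append(int(fields[3]))
    return tuple(sizes)


def group_by_layout(outputs):
    """Map each distinct partition layout to the sorted hosts that have it."""
    groups = defaultdict(list)
    for host, text in outputs.items():
        groups[partition_sizes(text)].append(host)
    return {layout: sorted(hosts) for layout, hosts in groups.items()}
```

Once all hosts are repartitioned consistently, `group_by_layout` should return a single key.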

There is an initial grafana dashboard, but it will need a lot of work. It is almost unusable for now (I am not sure whether that is because of the lack of activity, the missing node data, or because it is outdated): https://grafana.wikimedia.org/d/pJnnS4hZz/minio-overview

The next step in the productionization of the workers is to set up the account for access to mw content on swift. This is documented at: https://wikitech.wikimedia.org/wiki/Swift/How_To#Create_a_new_swift_account_(Thanos_Cluster) While the info there is enough, it is focused on new usages of swift, and I would like to go deeper into the current usages of the mw repo and its design decisions, as well as any WIP work to take into account.

As far as I can see, my initial intention would be to have 2 new identities:

  • mw:backups or mw:mediabackups with read-only access to at least swift mw originals, potentially thumbnails in the future.
  • mw:recovery or mw:mediarecovery with read-write access, to be used in case of recovery (not active). Alternatively, in case of an emergency recovery, it could be justified to use the same access as regular mediawiki, provided manually (as long as the method is properly configured) to prevent accidental recoveries. To be discussed.

I read about swift authentication at https://docs.openstack.org/newton/user-guide/cli-swift-manage-access-swift.html; there seems to be support for read-only plus listing access, but not much granularity beyond that.
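
As a rough sketch of what this would mean per container: swift container ACLs are set through the X-Container-Read and X-Container-Write headers, each holding a comma-separated list of grantees. The Python below only builds the header values and does not talk to swift. The existing grantee `mw:media` is a placeholder for whatever ACL a container already has, the helper names are made up, and whether an account:user entry also grants listing depends on the auth middleware in use.

```python
def add_grantee(existing_acl, grantee):
    """Return a comma-separated ACL with `grantee` appended once.

    Setting an ACL header replaces the previous value, so the current
    grantees have to be carried over explicitly.
    """
    entries = [e.strip() for e in existing_acl.split(",") if e.strip()]
    if grantee not in entries:
        entries.append(grantee)
    return ",".join(entries)


def backup_acl_headers(current_read, current_write,
                       reader="mw:mediabackups", writer=None):
    """Build the ACL headers for one container.

    `reader` gets read access only; `writer` (for example the proposed
    mw:mediarecovery identity) gets both read and write access.
    """
    read_acl = add_grantee(current_read, reader)
    write_acl = current_write
    if writer is not None:
        read_acl = add_grantee(read_acl, writer)
        write_acl = add_grantee(current_write, writer)
    return {"X-Container-Read": read_acl, "X-Container-Write": write_acl}
```

For example, `backup_acl_headers("mw:media", "mw:media")` leaves the write ACL untouched and only adds the backup identity to the read ACL.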

There are a few open questions:

  • 1 or 2 new accounts?
  • Is the granularity of existing accounts good enough right now, or are there usages sharing the same credentials (something to be aware of for the future)?
  • What would change in the future regarding T279621?
  • Is the firewall granular enough? Is there tracking of services accessing the proxies directly?
  • Filippo mentioned to me a long time ago that the wiki creation script will likely have to be amended to add the backup (and recovery?) accounts for new wikis. This will likely mean updating WikimediaMaintenance:MASTER:filebackend/setZoneAccess.php, which is what addWiki.php calls, but I need to understand better what the mw filebackend API does, plus the proper way to share credentials between the backup generation hosts, the proxy (those 2 are clear) and the maintenance hosts.

The next step in the productionization of the workers is to set up the account for access to mw content on swift. This is documented at: https://wikitech.wikimedia.org/wiki/Swift/How_To#Create_a_new_swift_account_(Thanos_Cluster) While the info there is enough, it is focused on new usages of swift, and I would like to go deeper into the current usages of the mw repo and its design decisions, as well as any WIP work to take into account.

As far as I can see, my initial intention would be to have 2 new identities:

  • mw:backups or mw:mediabackups with read-only access to at least swift mw originals, potentially thumbnails in the future.
  • mw:recovery or mw:mediarecovery with read-write access, to be used in case of recovery (not active). Alternatively, in case of an emergency recovery, it could be justified to use the same access as regular mediawiki, provided manually (as long as the method is properly configured) to prevent accidental recoveries. To be discussed.

These accounts both look good to me; happy to start with mw:mediabackups for read-only access.

I read about swift authentication at https://docs.openstack.org/newton/user-guide/cli-swift-manage-access-swift.html; there seems to be support for read-only plus listing access, but not much granularity beyond that.

There are a few open questions:

  • 1 or 2 new accounts?

Two accounts make sense, I think.

  • Is the granularity of existing accounts good enough right now, or are there usages sharing the same credentials (something to be aware of for the future)?

No sharing for mw accounts AFAIK. In general mw only uses its own account for access, and e.g. thumbor does the same with its accounts to read from originals and write to thumbnails.

  • What would change in the future regarding T279621?

For the media backups nothing will change, as mediawiki will stay on the current ms swift cluster.

  • Is the firewall granular enough? Is there tracking of services accessing the proxies directly?

No tracking at the firewall level; the ms swift cluster is open to the whole infra.

  • Filippo mentioned to me a long time ago that the wiki creation script will likely have to be amended to add the backup (and recovery?) accounts for new wikis. This will likely mean updating WikimediaMaintenance:MASTER:filebackend/setZoneAccess.php, which is what addWiki.php calls, but I need to understand better what the mw filebackend API does, plus the proper way to share credentials between the backup generation hosts, the proxy (those 2 are clear) and the maintenance hosts.

Yes, that's correct; in general we let mediawiki handle ACLs to containers. In this case we'll have to do in setZoneAccess.php something similar to what we did for Thumbor: grant read-only access to the original containers (including private wikis) to the read-only account, and read-write access to the read-write account.

HTH!

Change 708473 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] backup: Move backup-related hosts from misc to new backup cluster

https://gerrit.wikimedia.org/r/708473

Change 708473 merged by Jcrespo:

[operations/puppet@production] backup: Move backup-related hosts from misc to new backup cluster

https://gerrit.wikimedia.org/r/708473

Change 711153 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackup: Puppetize the media backup workers

https://gerrit.wikimedia.org/r/711153

Change 711154 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[labs/private@master] mediabackup: Add dummy passwords and keys for worker hosts

https://gerrit.wikimedia.org/r/711154

Change 711154 merged by Jcrespo:

[labs/private@master] mediabackup: Add dummy passwords and keys for worker hosts

https://gerrit.wikimedia.org/r/711154

Change 711153 merged by Jcrespo:

[operations/puppet@production] mediabackup: Puppetize the media backup workers

https://gerrit.wikimedia.org/r/711153

Change 711564 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackups: Add python3 dependencies and misc changes for workers

https://gerrit.wikimedia.org/r/711564

Change 711564 merged by Jcrespo:

[operations/puppet@production] mediabackups: Add python3 dependencies and misc changes for workers

https://gerrit.wikimedia.org/r/711564

Change 711576 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackups: Hide diffs from logs on sensitive files and cleanup

https://gerrit.wikimedia.org/r/711576

Change 711576 merged by Jcrespo:

[operations/puppet@production] mediabackups: Hide diffs from logs on sensitive files and cleanup

https://gerrit.wikimedia.org/r/711576

Change 712993 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackups: Add mysql grants for mediabackups

https://gerrit.wikimedia.org/r/712993

Change 713473 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] backup: Update minio storage location to be /srv/objectstorage

https://gerrit.wikimedia.org/r/713473

Change 713473 merged by Jcrespo:

[operations/puppet@production] backup: Update minio storage location to be /srv/objectstorage

https://gerrit.wikimedia.org/r/713473

Change 714037 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[labs/private@master] Add dummy recovery access identities and keys for read only recoveries

https://gerrit.wikimedia.org/r/714037

Change 714037 merged by Jcrespo:

[labs/private@master] mediabackup:Add dummy recovery identity keys for read only recoveries

https://gerrit.wikimedia.org/r/714037

Change 714040 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackup: Add new conf and identity for read only recovery

https://gerrit.wikimedia.org/r/714040

Change 714040 merged by Jcrespo:

[operations/puppet@production] mediabackup: Add new conf and identity for read only recovery

https://gerrit.wikimedia.org/r/714040

Change 714351 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackup: Use file resource in lower case

https://gerrit.wikimedia.org/r/714351

Change 714351 merged by Jcrespo:

[operations/puppet@production] mediabackup: Use file resource in lower case

https://gerrit.wikimedia.org/r/714351

I am going to consider this as "done". Obviously, further changes will be required in the future, but the basic services are already set up, enough to have the minimum viable product.

Change 714588 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackup: Fix wrong version of s3 readandlist policy

https://gerrit.wikimedia.org/r/714588

Change 714588 merged by Jcrespo:

[operations/puppet@production] mediabackup: Fix wrong version of s3 readandlist policy

https://gerrit.wikimedia.org/r/714588
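
For context, a read-and-list policy for an S3-compatible store such as minio typically looks like the document built by the Python sketch below. This is not the policy from change 714588, which is not quoted in this task. It assumes that the "version" in the change title refers to the policy document's Version field, whose accepted value in the S3 policy language is the fixed string "2012-10-17". The bucket name is hypothetical.

```python
import json


def read_and_list_policy(bucket):
    """Build a policy document allowing listing and reading of one bucket."""
    return {
        # Policy language version, not a date to be updated.
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level actions apply to the bucket ARN itself.
                "Effect": "Allow",
                "Action": ["s3:GetBucketLocation", "s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Object-level actions apply to the keys under the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


def policy_json(bucket):
    """Serialize the policy as indented JSON, ready to be written to a file."""
    return json.dumps(read_and_list_policy(bucket), indent=2)
```

No write or delete action is listed, which is what makes an identity bound to such a policy suitable for read-only recoveries.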

Change 712993 merged by Jcrespo:

[operations/puppet@production] mediabackups: Add mysql grants for mediabackups

https://gerrit.wikimedia.org/r/712993