Followup to backup1001 bacula switchover (misc pending tasks)
Closed, ResolvedPublic

Assigned To

Authored By

	jcrespo
	Nov 12 2019, 9:34 AM

Description

After the helium -> backup1001 switchover, new backups and recoveries are happening on backup1001. However, several tasks are still pending before we can decommission helium and heze and consider this phase done, providing the same or better level of service than the old bacula infrastructure:

Setup and test recover from the archive pool on backup1001
Split database backups on its own separate pool/array
Increase retention period and/or available disk volumes
Populate backup2001 codfw mirror, probably through migrating jobs
Attach old sd hosts to new dirs, or wait for the expiration to be larger than the configured one
Cleanup old code for jessie
Rename 'production' to 'Production' pool (or Databases, Archive, etc. to lower case) for consistency (bacula is case sensitive and could cause issues)
Get rid of per-strategy, and per-day of the week jobdefaults, parametrize better the jobdefaults in bacula director host

Details

Subject	Repo	Branch	Lines +/-
bacula: Undo conditionals added while transitioning helium->backup1001	operations/puppet	production	+90 -149
Revert "bacula: Add temporary jobdefaults for 1 time Archive pool backup"	operations/puppet	production	+12 -6
bacula: Add temporary jobdefaults for 1 time Archive pool backup	operations/puppet	production	+5 -0
bacula: Perform a one-time backup of helium's archive pool	operations/puppet	production	+19 -1
bacula: Increase max total size of backups to 35 TB	operations/puppet	production	+3 -3
bacula: Setup weekly copy migrations until a first run happens	operations/puppet	production	+1 -1
bacula: Schedule hourly copies of production backups to the offsite pool	operations/puppet	production	+1 -0
bacula: Schedule hourly copies of production backups to the offsite pool	operations/puppet	production	+1 -0
bacula: Increase offsite backup retention to 90 days	operations/puppet	production	+1 -1
bacula: Increase production and Databases retention to 90 days	operations/puppet	production	+2 -2
database-backups: Fix database jobdefaults for database dumps	operations/puppet	production	+1 -1
database-backups: Use a different pool for database backups	operations/puppet	production	+3 -1
bacula: Setup new pool for databases as well as its configuration	operations/puppet	production	+30 -10
bacula: Rename schedule to monthlys., split schedule & jobdefaults	operations/puppet	production	+171 -91
bacula: Setup separate pool and defaults for database backups on backup1001	operations/puppet	production	+80 -34
backup: Move filesets to a separate file	operations/puppet	production	+185 -182

Related Objects

Status	Assigned	Task
Resolved	jcrespo	T229209 Strengthen backup infrastructure and support
Resolved	jcrespo	T238048 Followup to backup1001 bacula switchover (misc pending tasks)
Resolved	jcrespo	T260717 decom helium and heze
Resolved	jcrespo	T272686 print a list of backed up directories in the MOTD of production servers
Resolved	jcrespo	T273182 Revert OpenSSL min version configuration introduced for bacula compatibility
Resolved	jcrespo	T274809 Drop unused database "bacula" from m1

Event Timeline


Restricted Application added a subscriber: Aklapper. · Nov 12 2019, 9:34 AM

@akosiaris Could you take a quick look to see if this seems like the complete archive contents?
{P9597}

I can recover these files using a manual procedure (bootstrap file); however, I cannot do it automatically (restore) because the database still references these files as existing on helium, not locally on backup1001. So I have to edit the bootstrap file to change the names of the volumes and sd host, and then run, e.g.:

bextract -p -b /var/lib/bacula/backup1001.eqiad.wmnet.restore.1.bsr FileStorageArchive /srv/local

Otherwise, the automatic process says:

The Job will require the following (*=>InChanger):
   Volume(s)                 Storage(s)                SD Device(s)
===========================================================================
   
    archive0003               helium-FileStorage2       FileStorage2

However, because of encryption, I get an empty file (metadata is not encrypted).

I think the best option would be to run bscan or batch-edit the database to point to backup1001 and FileStorageArchive.
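
The manual bootstrap edit described above could be sketched as a small shell helper. This is only an illustration under stated assumptions: the sample .bsr lines and the new storage name `example-FileStorageArchive` are hypothetical, and only `helium-FileStorage2`, `FileStorage2`, `archive0003` and `FileStorageArchive` come from this task. It writes to a copy and never touches the real bootstrap file.

```shell
#!/bin/sh
# Sketch: rewrite the Storage and Device names in a copy of a Bacula
# bootstrap (.bsr) file. Assumes names contain no '/' or regex
# metacharacters, since they are passed straight into sed.

# rewrite_bsr IN OUT OLD_STORAGE NEW_STORAGE OLD_DEVICE NEW_DEVICE
rewrite_bsr() {
    sed -e "s/^Storage=\"$3\"\$/Storage=\"$4\"/" \
        -e "s/^Device=\"$5\"\$/Device=\"$6\"/" \
        "$1" > "$2"
}

# Demo on a throwaway file with hypothetical contents.
demo_dir=$(mktemp -d)
cat > "$demo_dir/in.bsr" <<'EOF'
Storage="helium-FileStorage2"
Volume="archive0003"
MediaType="File"
Device="FileStorage2"
EOF

# "example-FileStorageArchive" is a placeholder storage name, not the real one.
rewrite_bsr "$demo_dir/in.bsr" "$demo_dir/out.bsr" \
    helium-FileStorage2 example-FileStorageArchive \
    FileStorage2 FileStorageArchive

cat "$demo_dir/out.bsr"
rm -rf "$demo_dir"
```

Only lines that match the old names exactly are changed, so the Volume and MediaType records pass through untouched.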

Change 550671 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Setup separate pool and defaults for database backups on backup1001

https://gerrit.wikimedia.org/r/550671

gerritbot added a project: Patch-For-Review.Nov 13 2019, 12:45 PM

In T238048#5655633, @jcrespo wrote:

@akosiaris Could you take a quick look to see if this seems like the complete archive contents?
{P9597}

archive0055 seems to be missing sodium.wikimedia.org-Monthly-1st-Thu-production-var-lib-mailman which is however also duplicated on archive0003 (jobid 23,606), so maybe the db is wrong on that front. /var/lib/mailman is also present on fermium right now and hence fine if we lose the sodium thing. Otherwise, it seems fine to me.

I can recover these files using a manual procedure (bootstrap file); however, I cannot do it automatically (restore) because the database still references these files as existing on helium, not locally on backup1001. So I have to edit the bootstrap file to change the names of the volumes and sd host, and then run, e.g.:
bextract -p -b /var/lib/bacula/backup1001.eqiad.wmnet.restore.1.bsr FileStorageArchive /srv/local
Otherwise, the automatic process says:
The Job will require the following (*=>InChanger):
   Volume(s)                 Storage(s)                SD Device(s)
===========================================================================
   
    archive0003               helium-FileStorage2       FileStorage2
However, because of encryption, I get an empty file (metadata is not encrypted).

I think the best option would be to run bscan or batch-edit the database to point to backup1001 and FileStorageArchive.

Batch editing the DB. This should be simple enough. Something like `update media set storageid = X where storageid = Y;`

Change 553084 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backup: Move filesets to a separate file

https://gerrit.wikimedia.org/r/553084

Change 553084 merged by Jcrespo:
[operations/puppet@production] backup: Move filesets to a separate file

https://gerrit.wikimedia.org/r/553084

Batch editing the DB

The update should be:

UPDATE Media SET StorageId = 11 WHERE StorageId = 4;

root@db1135[bacula9]> SELECT * FROM Media WHERE StorageId = 4;
+---------+-------------+------+--------+-----------+-------------+-----------+---------------------+-------------------
| MediaId | VolumeName  | Slot | PoolId | MediaType | MediaTypeId | LabelType | FirstWritten        | LastWritten       
+---------+-------------+------+--------+-----------+-------------+-----------+---------------------+-------------------
|       3 | archive0003 |    0 |      3 | File      |           0 |         0 | 2013-08-27 22:18:26 | 2015-10-08 14:42:0
|      55 | archive0055 |    0 |      3 | File      |           0 |         0 | 2015-10-08 10:08:51 | 2015-10-08 14:55:4
+---------+-------------+------+--------+-----------+-------------+-----------+---------------------+-------------------

But I need to take a backup and check that there aren't other tables that have to be updated.

PS: No other reference to storage:

root@db1135[bacula9]> select * FROM information_schema.columns WHERE table_schema = 'bacula9' and column_name like '%torage%';
+---------------+--------------+------------+-------------+------------------+----------------+-------------+-----------
| TABLE_CATALOG | TABLE_SCHEMA | TABLE_NAME | COLUMN_NAME | ORDINAL_POSITION | COLUMN_DEFAULT | IS_NULLABLE | DATA_TYPE 
+---------------+--------------+------------+-------------+------------------+----------------+-------------+-----------
| def           | bacula9      | Device     | StorageId   |                4 | 0              | YES         | int       
| def           | bacula9      | Media      | StorageId   |               30 | 0              | YES         | int       
| def           | bacula9      | Storage    | StorageId   |                1 | NULL           | NO          | int       
+---------------+--------------+------------+-------------+------------------+----------------+-------------+-----------
3 rows in set (0.00 sec)

Device is empty, and Storage is the list of storage devices (old and new).
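
As a hedged sketch of how the batch edit could be made easier to review, the helper below only prints SQL; it does not connect to any database. The table and column names (Media, MediaId, VolumeName, StorageId) are the ones shown in the output above. Wrapping the change in a transaction is an assumption that only helps if the catalog tables use a transactional engine.

```shell
#!/bin/sh
# media_move_sql OLD_ID NEW_ID
# Prints (does not run) a transaction that lists the affected rows,
# moves them to the new StorageId, and lists them again, so the result
# can be inspected before committing by hand.
media_move_sql() {
    for id in "$1" "$2"; do
        case "$id" in
            ''|*[!0-9]*)
                echo "usage: media_move_sql OLD_ID NEW_ID (integers)" >&2
                return 1
                ;;
        esac
    done
    cat <<EOF
START TRANSACTION;
SELECT MediaId, VolumeName, StorageId FROM Media WHERE StorageId = $1;
UPDATE Media SET StorageId = $2 WHERE StorageId = $1;
SELECT MediaId, VolumeName, StorageId FROM Media WHERE StorageId = $2;
-- Review the output, then type COMMIT; or ROLLBACK; manually.
EOF
}

# The ids 4 and 11 are the ones quoted in this task.
media_move_sql 4 11
```

Rejecting non-numeric arguments keeps a typo from ending up inside the generated statement.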

Mentioned in SAL (#wikimedia-operations) [2019-11-27T15:42:19Z] <jynus> migrate db entries of archive Media to backup1001 T238048

root@db1135.eqiad.wmnet[bacula9]> UPDATE Media SET StorageId = 11 WHERE StorageId = 4;
Query OK, 2 rows affected (0.00 sec)
Rows matched: 2  Changed: 2  Warnings: 0

Mentioned in SAL (#wikimedia-operations) [2019-11-27T15:52:26Z] <jynus> disabling puppet on dbprov1001 to test bacula restore T238048

Error while trying to restore sodium contents:

29-Nov 13:04 backup1001.eqiad.wmnet JobId 162656: Start Restore Job RestoreFiles.2019-11-29_13.04.00_16
29-Nov 13:04 backup1001.eqiad.wmnet JobId 162656: Using Device "FileStorageArchive" to read.
29-Nov 13:04 backup1001.eqiad.wmnet-fd JobId 162656: Ready to read from volume "archive0003" on File device "FileStorageArchive" (/srv/archive).
29-Nov 13:04 backup1001.eqiad.wmnet-fd JobId 162656: Forward spacing Volume "archive0003" to addr=452438779297
29-Nov 13:04 dbprov2001.codfw.wmnet-fd JobId 162656: Error: openssl.c:78 TLS read/write failure.: ERR=error:0407109F:rsa routines:RSA_padding_check_PKCS1_type_2:pkcs decoding error
29-Nov 13:04 dbprov2001.codfw.wmnet-fd JobId 162656: Error: openssl.c:78 TLS read/write failure.: ERR=error:04065072:rsa routines:RSA_EAY_PRIVATE_DECRYPT:padding check failed
29-Nov 13:04 dbprov2001.codfw.wmnet-fd JobId 162656: Fatal error: restore.c:473 Data record error. ERR=Resource temporarily unavailable
29-Nov 13:06 backup1001.eqiad.wmnet-fd JobId 162656: Error: bsock.c:388 Wrote 65540 bytes to client:10.192.0.114:43914, but only 0 accepted.
29-Nov 13:06 backup1001.eqiad.wmnet-fd JobId 162656: Fatal error: read.c:176 Error sending data to Client. ERR=Connection reset by peer
29-Nov 13:06 backup1001.eqiad.wmnet-fd JobId 162656: Elapsed time=00:02:00, Transfer rate=15.68 K Bytes/second
29-Nov 13:06 backup1001.eqiad.wmnet-fd JobId 162656: Error: bsock.c:271 Socket has errors=1 on call to client:10.192.0.114:43914
29-Nov 13:06 backup1001.eqiad.wmnet JobId 162656: Error: Bacula backup1001.eqiad.wmnet 9.4.2 (04Feb19):
  Build OS:               x86_64-pc-linux-gnu debian buster/sid
  JobId:                  162656
  Job:                    RestoreFiles.2019-11-29_13.04.00_16
  Restore Client:         dbprov2001.codfw.wmnet-fd
  Where:                  /srv/tmp
  Replace:                Never
  Start time:             29-Nov-2019 13:04:02
  End time:               29-Nov-2019 13:06:03
  Elapsed time:           2 mins 1 sec
  Files Expected:         5,408,668
  Files Restored:         1
  Bytes Restored:         0 (0 B)
  Rate:                   0.0 KB/s
  FD Errors:              2
  FD termination status:  Error
  SD termination status:  Error
  Termination:            *** Restore Error ***

Restoring dbprov2002 content on dbprov2001 with the puppet master cert, however, works well.

Same for bast1001:

29-Nov 13:19 backup1001.eqiad.wmnet JobId 162657: Start Restore Job RestoreFiles.2019-11-29_13.18.59_50
29-Nov 13:19 backup1001.eqiad.wmnet JobId 162657: Using Device "FileStorageArchive" to read.
29-Nov 13:19 backup1001.eqiad.wmnet-fd JobId 162657: Ready to read from volume "archive0003" on File device "FileStorageArchive" (/srv/archive).
29-Nov 13:19 backup1001.eqiad.wmnet-fd JobId 162657: Forward spacing Volume "archive0003" to addr=323517936212
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 Encryption session provided an invalid symmetric key: ERR=error:0407109F:rsa routines:RSA_padding_check_PKCS1_type_2:pkcs decoding error
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 Encryption session provided an invalid symmetric key: ERR=error:04065072:rsa routines:RSA_EAY_PRIVATE_DECRYPT:padding check failed
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 Encryption session provided an invalid symmetric key: ERR=error:0607A082:digital envelope routines:EVP_CIPHER_CTX_set_key_length:invalid key length
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: restore.c:741 Failed to initialize decryption context for /srv/tmp/srv/home_pmtpa/aaron/.gitignore
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 OpenSSL digest Verify final failed: ERR=error:0407109F:rsa routines:RSA_padding_check_PKCS1_type_2:pkcs decoding error
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 OpenSSL digest Verify final failed: ERR=error:04065072:rsa routines:RSA_EAY_PRIVATE_DECRYPT:padding check failed
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 OpenSSL digest Verify final failed: ERR=error:04091077:rsa routines:INT_RSA_VERIFY:wrong signature length
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: restore.c:1680 Signature validation failed for file /srv/tmp/srv/home_pmtpa/aaron/.cache/motd.legal-displayed: ERR=Signature is invalid
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 Encryption session provided an invalid symmetric key: ERR=error:0407109F:rsa routines:RSA_padding_check_PKCS1_type_2:pkcs decoding error
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 Encryption session provided an invalid symmetric key: ERR=error:04065072:rsa routines:RSA_EAY_PRIVATE_DECRYPT:padding check failed

@akosiaris does this ring a bell? I find it hard to believe that a new version of openssl couldn't decrypt a file encrypted with an older version. Maybe I am using the wrong options.

Change 554257 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Rename schedule to monthlys., split schedule & jobdefaults

https://gerrit.wikimedia.org/r/554257

Change 554286 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Setup new pool for databases as well as its configuration

https://gerrit.wikimedia.org/r/554286

Change 554288 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] database-backups: Use a different pool for database backups

https://gerrit.wikimedia.org/r/554288

Change 550671 abandoned by Jcrespo:
bacula: Setup separate pool and defaults for database backups on backup1001

Reason:
Split into Ie081e459b35b787a837424b31ad8 and following ones

https://gerrit.wikimedia.org/r/550671

Change 554257 merged by Jcrespo:
[operations/puppet@production] bacula: Rename schedule to monthlys., split schedule & jobdefaults

https://gerrit.wikimedia.org/r/554257

Change 554286 merged by Jcrespo:
[operations/puppet@production] bacula: Setup new pool for databases as well as its configuration

https://gerrit.wikimedia.org/r/554286

jcrespo updated the task description. (Show Details)Dec 3 2019, 5:41 PM

Change 554288 merged by Jcrespo:
[operations/puppet@production] database-backups: Use a different pool for database backups

https://gerrit.wikimedia.org/r/554288

Change 554344 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] database-backups: Fix database jobdefaults for database dumps

https://gerrit.wikimedia.org/r/554344

Change 554344 merged by Jcrespo:
[operations/puppet@production] database-backups: Fix database jobdefaults for database dumps

https://gerrit.wikimedia.org/r/554344

Yay!

Full           Backup    10  04-Dec-19 02:05    dbprov2002.codfw.wmnet-Monthly-1st-Wed-Databases-mysql-srv-backups-dumps-latest *unknown*
Full           Backup    10  04-Dec-19 02:05    dbprov2001.codfw.wmnet-Monthly-1st-Wed-Databases-mysql-srv-backups-dumps-latest *unknown*

jcrespo updated the task description. (Show Details)Dec 3 2019, 6:09 PM

Maintenance_bot removed a project: Patch-For-Review.Dec 3 2019, 6:10 PM

Change 554485 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Increase production and Databases retention to 90 days

https://gerrit.wikimedia.org/r/554485

gerritbot added a project: Patch-For-Review.Dec 4 2019, 11:15 AM

"Offsite Job" seems to be correctly configured as "Copy", but it is not showing any activity. Needs checking.

In T238048#5701534, @jcrespo wrote:

Same for bast1001:

29-Nov 13:19 backup1001.eqiad.wmnet JobId 162657: Start Restore Job RestoreFiles.2019-11-29_13.18.59_50
29-Nov 13:19 backup1001.eqiad.wmnet JobId 162657: Using Device "FileStorageArchive" to read.
29-Nov 13:19 backup1001.eqiad.wmnet-fd JobId 162657: Ready to read from volume "archive0003" on File device "FileStorageArchive" (/srv/archive).
29-Nov 13:19 backup1001.eqiad.wmnet-fd JobId 162657: Forward spacing Volume "archive0003" to addr=323517936212
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 Encryption session provided an invalid symmetric key: ERR=error:0407109F:rsa routines:RSA_padding_check_PKCS1_type_2:pkcs decoding error
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 Encryption session provided an invalid symmetric key: ERR=error:04065072:rsa routines:RSA_EAY_PRIVATE_DECRYPT:padding check failed
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 Encryption session provided an invalid symmetric key: ERR=error:0607A082:digital envelope routines:EVP_CIPHER_CTX_set_key_length:invalid key length
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: restore.c:741 Failed to initialize decryption context for /srv/tmp/srv/home_pmtpa/aaron/.gitignore
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 OpenSSL digest Verify final failed: ERR=error:0407109F:rsa routines:RSA_padding_check_PKCS1_type_2:pkcs decoding error
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 OpenSSL digest Verify final failed: ERR=error:04065072:rsa routines:RSA_EAY_PRIVATE_DECRYPT:padding check failed
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 OpenSSL digest Verify final failed: ERR=error:04091077:rsa routines:INT_RSA_VERIFY:wrong signature length
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: restore.c:1680 Signature validation failed for file /srv/tmp/srv/home_pmtpa/aaron/.cache/motd.legal-displayed: ERR=Signature is invalid
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 Encryption session provided an invalid symmetric key: ERR=error:0407109F:rsa routines:RSA_padding_check_PKCS1_type_2:pkcs decoding error
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 Encryption session provided an invalid symmetric key: ERR=error:04065072:rsa routines:RSA_EAY_PRIVATE_DECRYPT:padding check failed

@akosiaris does this ring a bell? I find it hard to believe that a new version of openssl couldn't decrypt a file encrypted with an older version. Maybe I am using the wrong options.

We've had a couple of these in the past (~2013-2014?), but it was operator error IIRC. The message is openssl.c:78 Encryption session provided an invalid symmetric key: ERR=error:0607A082:digital envelope routines:EVP_CIPHER_CTX_set_key_length:invalid key length so it looks like the wrong key was used?

In T238048#5711820, @jcrespo wrote:

"Offsite Job" seems to be correctly configured as "Copy", but it is not showing any activity. Needs checking.

I don't think there is a schedule for that, that's why?

akosiaris awarded a token.Dec 4 2019, 1:05 PM

Change 554485 merged by Jcrespo:
[operations/puppet@production] bacula: Increase production and Databases retention to 90 days

https://gerrit.wikimedia.org/r/554485

Maintenance_bot removed a project: Patch-For-Review.Dec 4 2019, 3:10 PM

Hi, @akosiaris, thanks for the reviews and feedback. Could I have your further thoughts on T238048#5701519 and T238048#5701534? Normally I would just find a solution or workaround on my own, but the archive file copy was one of the parts in which I compromised on my suggested plan because you were quite confident in its forward compatibility :-/. On the other hand, most of those files seem to be around 5 years old, which may mean some should actually be purged. Let me know your thoughts.

After the update, the pools seem ok, although we should probably also increase the retention of the offsite one (creating a patch).

*list pool
+--------+------------+---------+---------+-----------------+--------------+---------+----------+-------------+
| PoolId | Name       | NumVols | MaxVols | MaxVolBytes     | VolRetention | Enabled | PoolType | LabelFormat |
+--------+------------+---------+---------+-----------------+--------------+---------+----------+-------------+
|      1 | Default    |       0 |       1 |               0 |  155,520,000 |       1 | Backup   | *           |
|      2 | production |      33 |      60 | 536,870,912,000 |    7,776,000 |       1 | Backup   | production  |
|      3 | Archive    |       2 |       5 | 536,870,912,000 |  157,680,000 |       1 | Backup   | archive     |
|      4 | offsite    |       0 |      60 | 536,870,912,000 |    2,592,000 |       1 | Backup   | offsite     |
|      5 | Databases  |       5 |      60 | 536,870,912,000 |    7,776,000 |       1 | Backup   | databases   |
+--------+------------+---------+---------+-----------------+--------------+---------+----------+-------------+
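
The VolRetention column above is in seconds and MaxVolBytes in bytes, which is hard to read at a glance. A minimal sketch of the conversion, using only the values from the table (the helper names are made up for illustration):

```shell
#!/bin/sh
# secs_to_days SECONDS: whole days in a retention value (86400 s per day).
secs_to_days() {
    echo $(( $1 / 86400 ))
}

# bytes_to_gib BYTES: whole GiB in a byte count (1 GiB = 1073741824 bytes).
bytes_to_gib() {
    echo $(( $1 / 1073741824 ))
}

# Pool name and VolRetention pairs copied from the "list pool" output.
for pair in Default:155520000 production:7776000 Archive:157680000 \
            offsite:2592000 Databases:7776000; do
    name=${pair%%:*}
    secs=${pair#*:}
    echo "$name: $(secs_to_days "$secs") days"
done

echo "MaxVolBytes: $(bytes_to_gib 536870912000) GiB per volume"
```

With these inputs, production and Databases come out at 90 days while offsite is still at 30 days, which is the gap that raising the offsite retention closes. Each volume is capped at 500 GiB.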

Change 554536 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Increase offsite backup retention to 90 days

https://gerrit.wikimedia.org/r/554536

gerritbot added a project: Patch-For-Review.Dec 4 2019, 3:41 PM

Change 554539 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Schedule hourly copies of production backups to the offsite pool

https://gerrit.wikimedia.org/r/554539

Retention change documented at: https://wikitech.wikimedia.org/wiki/Bacula#Modify_a_pool's_retention_(or_other_similar_properties)

jcrespo updated the task description. (Show Details)Dec 4 2019, 4:06 PM

Change 554536 merged by Jcrespo:
[operations/puppet@production] bacula: Increase offsite backup retention to 90 days

https://gerrit.wikimedia.org/r/554536

Change 554539 merged by Jcrespo:
[operations/puppet@production] bacula: Schedule hourly copies of production backups to the offsite pool

https://gerrit.wikimedia.org/r/554539

In T238048#5712471, @jcrespo wrote:

Hi, @akosiaris, thanks for the reviews and feedback. Could I have your further thoughts on T238048#5701519 and T238048#5701534?

I have provided them already in T238048#5711997

Normally I would just find a solution or workaround on my own, but the archive file copy was one of the parts in which I compromised on my suggested plan because you were quite confident in its forward compatibility :-/.

I never was confident for archive specifically (I was for the rest). archive pool has not seen a restore in a very long time, as far as I remember.

On the other hand, most of those files seem to be around 5 years old, which may mean some should actually be purged. Let me know your thoughts.

It's mostly historical data; we can stall this until someone requires a restore and try to solve it then. Overall, 5 years after the last restore, I doubt there will be a really critical request. That being said, it does look like a key issue; maybe they are just encrypted with a different key? We did a CA rollover some 4-5 years ago. Maybe they are encrypted with that key (which should be on the puppetmasters?)

I have provided them already in

Indeed, sorry, my mail client only showed your last comment.

Maybe they are encrypted with that key (which should be on the puppetmasters?)

I will try to find and use a previous key, thank you again for the suggestion. Very useful.

I accidentally scheduled the migration, not the copy.

Incremental    Migrate   20  04-Dec-19 17:00    Migrate Job

I think it wouldn't have run anyway due to the selection config, but I am reverting it and scheduling the right job instead.

Change 554563 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Schedule hourly copies of production backups to the offsite pool

https://gerrit.wikimedia.org/r/554563

jcrespo updated the task description. (Show Details)Dec 4 2019, 4:47 PM

Change 554563 merged by Jcrespo:
[operations/puppet@production] bacula: Schedule hourly copies of production backups to the offsite pool

https://gerrit.wikimedia.org/r/554563

Now it is ok:

Scheduled Jobs:
Level          Type     Pri  Scheduled          Job Name           Volume
===================================================================================
Incremental    Backup    10  04-Dec-19 17:00    gerrit1001.wikimedia.org-Hourly-Sun-production-srv-gerrit-git production0070
Incremental    Copy      20  04-Dec-19 17:00    Offsite Job         <======= this

Waiting 5 minutes to check performance and status.

Maintenance_bot removed a project: Patch-For-Review.Dec 4 2019, 5:11 PM

Change 556195 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Setup weekly copy migrations until a first run happens

https://gerrit.wikimedia.org/r/556195

gerritbot added a project: Patch-For-Review.Dec 10 2019, 2:32 PM

Change 556195 merged by Jcrespo:
[operations/puppet@production] bacula: Setup weekly copy migrations until a first run happens

https://gerrit.wikimedia.org/r/556195

Maintenance_bot removed a project: Patch-For-Review.Dec 10 2019, 3:10 PM

Copy jobs are running now; we will see how long a full copy takes.

For now, I set up copies to happen only once a week, because when I set them up to run every hour, bacula got overloaded with so many scheduled jobs. Also, many errors happen at the beginning, likely due to metadata of files that were not physically migrated, as expected. We may need to manually purge some of those old jobs if bacula doesn't do it automatically.

jcrespo updated the task description. (Show Details)Dec 18 2019, 6:23 PM

Apparently, the Databases pool got its retention extended, but the production one still purges after 1 month. It needs checking so that it is increased to 3 months too; there is space available for that.

For some reason, the pool was updated, but not every volume. I ran update pool from resource, and then "all volumes from pool", and it got applied.

Will need to monitor for space next, but this should be fixed.

Change 578489 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Increase max total size of backups to 35 TB

https://gerrit.wikimedia.org/r/578489

gerritbot added a project: Patch-For-Review.Mar 10 2020, 11:00 AM

Change 578489 merged by Jcrespo:
[operations/puppet@production] bacula: Increase max total size of backups to 35 TB

https://gerrit.wikimedia.org/r/578489

Maintenance_bot removed a project: Patch-For-Review.Mar 10 2020, 12:10 PM

jcrespo mentioned this in T229209: Strengthen backup infrastructure and support.Apr 16 2020, 8:12 AM

jcrespo added a subtask: T260717: decom helium and heze.Aug 19 2020, 2:23 PM

jcrespo mentioned this in T245161: Track down and replace very old HW.Sep 4 2020, 11:20 AM

LSobanski added a project: Data-Persistence-Backup.Oct 6 2020, 9:26 AM

jcrespo moved this task from Triage to In Progress on the Data-Persistence-Backup board.Jan 27 2021, 5:52 PM

Change 659046 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Undo conditionals added while transitioning helium->backup1001

https://gerrit.wikimedia.org/r/659046

gerritbot added a project: Patch-For-Review.Jan 27 2021, 5:53 PM

jcrespo added a subtask: T272686: print a list of backed up directories in the MOTD of production servers.Jan 28 2021, 10:25 AM

Change 659309 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Perform a one-time backup of helium's archive pool

https://gerrit.wikimedia.org/r/659309

Change 659309 merged by Jcrespo:
[operations/puppet@production] bacula: Perform a one-time backup of helium's archive pool

https://gerrit.wikimedia.org/r/659309

Change 659321 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Add temporary jobdefaults for 1 time Archive pool backup

https://gerrit.wikimedia.org/r/659321

jcrespo updated the task description. (Show Details)Jan 28 2021, 4:41 PM

Change 659321 merged by Jcrespo:
[operations/puppet@production] bacula: Add temporary jobdefaults for 1 time Archive pool backup

https://gerrit.wikimedia.org/r/659321

:-)

303254  Back Full          0         0  helium.eqiad.wmnet-Monthly-1st-Wed-Archive-archive-backup is running

Change 659276 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] Revert "bacula: Add temporary jobdefaults for 1 time Archive pool backup"

https://gerrit.wikimedia.org/r/659276

Mentioned in SAL (#wikimedia-operations) [2021-01-28T19:36:48Z] <jynus> extending backup1001 /dev/mapper/array1-archive partition to allocate enough space for helium backups T238048

yay

303254  Full           4    568.7 G  OK       28-Jan-21 20:16 helium.eqiad.wmnet-Monthly-1st-Wed-Archive-archive-backup

yay*2

303259  Restore        1    2.369 G  OK       28-Jan-21 20:21 RestoreFiles

$ diff /var/tmp/bacula-restores/srv/baculasd2/bacula.sql.gz /srv/baculasd2/bacula.sql.gz
$ echo $?
0
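
For repeated checks like the one above, a hedged sketch of a reusable comparison helper follows. It is not part of the procedure used in this task; the function name and the demo files are made up. It uses `cmp -s`, which stays silent on binary files such as the gzipped dump and reports the result only through its exit status.

```shell
#!/bin/sh
# same_content RESTORED ORIGINAL
# Succeeds only when both paths are regular files with identical bytes.
same_content() {
    [ -f "$1" ] && [ -f "$2" ] && cmp -s "$1" "$2"
}

# Demo with throwaway files standing in for the restored and original dump.
tmp=$(mktemp -d)
printf 'catalog dump\n' > "$tmp/original"
cp "$tmp/original" "$tmp/restored"

if same_content "$tmp/restored" "$tmp/original"; then
    echo "restore matches"
else
    echo "restore differs"
fi
rm -rf "$tmp"
```

The explicit file checks make a missing restore target count as a failure rather than relying on the comparison tool's own error handling.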

Change 659276 merged by Jcrespo:
[operations/puppet@production] Revert "bacula: Add temporary jobdefaults for 1 time Archive pool backup"

https://gerrit.wikimedia.org/r/659276

jcrespo updated the task description. (Show Details)Jan 29 2021, 9:31 AM

jcrespo closed subtask T260717: decom helium and heze as Resolved.Jan 29 2021, 11:12 AM

Change 659046 merged by Jcrespo:
[operations/puppet@production] bacula: Undo conditionals added while transitioning helium->backup1001

https://gerrit.wikimedia.org/r/659046

jcrespo updated the task description. (Show Details)Jan 29 2021, 11:46 AM

jcrespo closed subtask T272686: print a list of backed up directories in the MOTD of production servers as Resolved.

Regarding the last 2 points, we have, in a way, done the last one, "parametrize better the jobdefaults in bacula director host", but not in the originally intended way, "Get rid of per-strategy, and per-day of the week jobdefaults". Instead, we have removed duplicate jobdefaults and left only the ones for production. I think we shouldn't touch those until we have 2 "production" pools.

Regarding the rename, it is a bit confusing, but it has too much impact for little gain. We will revisit it if/when we create codfw-production. For now, resolving this, and leaving as pending only the TLS reversion T273182 and the database cleanup T274809.

jcrespo added a subtask: T274809: Drop unused database "bacula" from m1.Feb 15 2021, 6:15 PM

jcrespo closed subtask T274809: Drop unused database "bacula" from m1 as Resolved.Mar 2 2021, 2:49 PM

jcrespo closed subtask T273182: Revert OpenSSL min version configuration introduced for bacula compatibility as Resolved.Apr 29 2021, 8:37 AM
