
Followup to backup1001 bacula switchover (misc pending tasks)
Open, HighPublic

Description

After the helium -> backup1001 switchover, new backups and recoveries are happening on backup1001. However, several tasks remain before helium and heze can be decommissioned and this phase considered done, providing the same or a better level of service than the old bacula infrastructure:

  • Set up and test recovery from the archive pool on backup1001
  • Split database backups into their own separate pool/array
  • Increase retention period and/or available disk volumes
  • Populate backup2001 codfw mirror, probably through migrating jobs
  • Attach old sd hosts to new dirs, or wait for the expiration to be larger than the configured one
  • Cleanup old code for jessie
  • Rename 'production' to 'Production' pool (or Databases, Archive, etc. to lower case) for consistency (bacula is case sensitive and could cause issues)

Event Timeline

jcrespo created this task. Tue, Nov 12, 9:34 AM
Restricted Application added a subscriber: Aklapper. Tue, Nov 12, 9:34 AM

@akosiaris Could you take a quick look to see if this seems like the complete archive contents?
{P9597}

I can recover these files using a manual procedure (bootstrap file); however, I cannot do it automatically (restore) because the database still references these files as existing on helium, not locally on backup1001. So I have to edit the bootstrap file to change the volume names and the sd host, and then run, e.g.:

bextract -p -b /var/lib/bacula/backup1001.eqiad.wmnet.restore.1.bsr FileStorageArchive /srv/local

Otherwise, the automatic process says:

The Job will require the following (*=>InChanger):
   Volume(s)                 Storage(s)                SD Device(s)
===========================================================================
   
    archive0003               helium-FileStorage2       FileStorage2

However, because of encryption, I get an empty file (metadata is not encrypted).

I think the best option would be to run bscan or batch-edit the database to point to backup1001 and FileStorageArchive.
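The manual bootstrap-file edit described above could be scripted. The following is only a minimal sketch, assuming the .bsr file contains quoted `Keyword="value"` lines such as `Storage=`, `Volume=` and `Device=`. The function name `rewrite_bsr`, the sample fragment and its session value are illustrative, not taken from the real file.

```python
import re

# Matches quoted bootstrap entries such as: Device="FileStorage2"
_QUOTED = re.compile(r'^(\s*)(\w+)="([^"]*)"\s*$')


def rewrite_bsr(text, renames):
    """Return bootstrap text with selected quoted values replaced.

    renames maps a keyword (e.g. "Device") to an {old: new} dict.
    Lines that are not quoted Keyword="value" pairs are kept unchanged.
    """
    out = []
    for line in text.splitlines():
        m = _QUOTED.match(line)
        if m:
            indent, key, value = m.groups()
            value = renames.get(key, {}).get(value, value)
            line = f'{indent}{key}="{value}"'
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical fragment; the names come from the restore prompt above,
# the session value is illustrative only.
SAMPLE_BSR = (
    'Storage="helium-FileStorage2"\n'
    'Volume="archive0003"\n'
    'MediaType="File"\n'
    'Device="FileStorage2"\n'
    'VolSessionId=1\n'
)

# Only the device is renamed here; Storage or Volume entries can be
# added to the mapping in the same way once the target names are known.
RENAMES = {"Device": {"FileStorage2": "FileStorageArchive"}}

if __name__ == "__main__":
    print(rewrite_bsr(SAMPLE_BSR, RENAMES), end="")
```

Writing the result to a new file rather than editing in place keeps the original bootstrap file available for comparison.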

Change 550671 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Setup separate pool and defaults for database backups on backup1001

https://gerrit.wikimedia.org/r/550671

@akosiaris Could you take a quick look to see if this seems like the complete archive contents?
{P9597}

archive0055 seems to be missing sodium.wikimedia.org-Monthly-1st-Thu-production-var-lib-mailman which is however also duplicated on archive0003 (jobid 23,606), so maybe the db is wrong on that front. /var/lib/mailman is also present on fermium right now and hence fine if we lose the sodium thing. Otherwise, it seems fine to me.

I think the best option would be to run bscan or batch-edit the database to point to backup1001 and FileStorageArchive.

Batch editing the DB. This should be simple enough. Something like `update media set storageid = X where storageid = Y;`

Change 553084 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backup: Move filesets to a separate file

https://gerrit.wikimedia.org/r/553084

Change 553084 merged by Jcrespo:
[operations/puppet@production] backup: Move filesets to a separate file

https://gerrit.wikimedia.org/r/553084

jcrespo added a comment. Edited Tue, Nov 26, 5:22 PM

Batch editing the DB

The update should be:

UPDATE Media SET StorageId = 11 WHERE StorageId = 4;
root@db1135[bacula9]> SELECT * FROM Media WHERE StorageId = 4;
+---------+-------------+------+--------+-----------+-------------+-----------+---------------------+-------------------
| MediaId | VolumeName  | Slot | PoolId | MediaType | MediaTypeId | LabelType | FirstWritten        | LastWritten       
+---------+-------------+------+--------+-----------+-------------+-----------+---------------------+-------------------
|       3 | archive0003 |    0 |      3 | File      |           0 |         0 | 2013-08-27 22:18:26 | 2015-10-08 14:42:0
|      55 | archive0055 |    0 |      3 | File      |           0 |         0 | 2015-10-08 10:08:51 | 2015-10-08 14:55:4
+---------+-------------+------+--------+-----------+-------------+-----------+---------------------+-------------------

But I need to take a backup and check that there aren't other tables that have to be updated.

PS: No other reference to storage:

root@db1135[bacula9]> select * FROM information_schema.columns WHERE table_schema = 'bacula9' and column_name like '%torage%';
+---------------+--------------+------------+-------------+------------------+----------------+-------------+-----------
| TABLE_CATALOG | TABLE_SCHEMA | TABLE_NAME | COLUMN_NAME | ORDINAL_POSITION | COLUMN_DEFAULT | IS_NULLABLE | DATA_TYPE 
+---------------+--------------+------------+-------------+------------------+----------------+-------------+-----------
| def           | bacula9      | Device     | StorageId   |                4 | 0              | YES         | int       
| def           | bacula9      | Media      | StorageId   |               30 | 0              | YES         | int       
| def           | bacula9      | Storage    | StorageId   |                1 | NULL           | NO          | int       
+---------------+--------------+------------+-------------+------------------+----------------+-------------+-----------
3 rows in set (0.00 sec)

Device is empty, and Storage is the list of storage devices (old and new).
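The batch edit discussed above can be guarded so that it only commits when exactly the expected rows change. This is a minimal sketch, not the procedure actually used: it runs against an in-memory SQLite table standing in for the MariaDB catalog, the helper names are hypothetical, and the third sample row with its StorageId is illustrative. Taking the backup first is outside the sketch.

```python
import sqlite3


def demo_connection():
    """Build an in-memory stand-in for the catalog's Media table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Media (MediaId INTEGER PRIMARY KEY, "
        "VolumeName TEXT, StorageId INTEGER)"
    )
    conn.executemany(
        "INSERT INTO Media VALUES (?, ?, ?)",
        [
            (3, "archive0003", 4),
            (55, "archive0055", 4),
            (70, "production0070", 11),  # illustrative row, StorageId assumed
        ],
    )
    conn.commit()
    return conn


def repoint_media(conn, old_storage_id, new_storage_id, expected_rows):
    """Move Media rows to a new StorageId, committing only if the
    number of affected rows matches expected_rows."""
    cur = conn.cursor()
    cur.execute(
        "SELECT COUNT(*) FROM Media WHERE StorageId = ?", (old_storage_id,)
    )
    found = cur.fetchone()[0]
    if found != expected_rows:
        raise RuntimeError(f"expected {expected_rows} rows, found {found}")
    cur.execute(
        "UPDATE Media SET StorageId = ? WHERE StorageId = ?",
        (new_storage_id, old_storage_id),
    )
    if cur.rowcount != expected_rows:
        conn.rollback()
        raise RuntimeError(f"updated {cur.rowcount} rows, rolled back")
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    connection = demo_connection()
    print(repoint_media(connection, 4, 11, expected_rows=2))
```

The `?` placeholders are SQLite's parameter style; a MariaDB driver would typically use a different placeholder syntax.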

Mentioned in SAL (#wikimedia-operations) [2019-11-27T15:42:19Z] <jynus> migrate db entries of archive Media to backup1001 T238048

jcrespo triaged this task as High priority. Wed, Nov 27, 3:43 PM

root@db1135.eqiad.wmnet[bacula9]> UPDATE Media SET StorageId = 11 WHERE StorageId = 4;
Query OK, 2 rows affected (0.00 sec)
Rows matched: 2  Changed: 2  Warnings: 0

Mentioned in SAL (#wikimedia-operations) [2019-11-27T15:52:26Z] <jynus> disabling puppet on dbprov1001 to test bacula restore T238048

Error while trying to restore sodium contents:

29-Nov 13:04 backup1001.eqiad.wmnet JobId 162656: Start Restore Job RestoreFiles.2019-11-29_13.04.00_16
29-Nov 13:04 backup1001.eqiad.wmnet JobId 162656: Using Device "FileStorageArchive" to read.
29-Nov 13:04 backup1001.eqiad.wmnet-fd JobId 162656: Ready to read from volume "archive0003" on File device "FileStorageArchive" (/srv/archive).
29-Nov 13:04 backup1001.eqiad.wmnet-fd JobId 162656: Forward spacing Volume "archive0003" to addr=452438779297
29-Nov 13:04 dbprov2001.codfw.wmnet-fd JobId 162656: Error: openssl.c:78 TLS read/write failure.: ERR=error:0407109F:rsa routines:RSA_padding_check_PKCS1_type_2:pkcs decoding error
29-Nov 13:04 dbprov2001.codfw.wmnet-fd JobId 162656: Error: openssl.c:78 TLS read/write failure.: ERR=error:04065072:rsa routines:RSA_EAY_PRIVATE_DECRYPT:padding check failed
29-Nov 13:04 dbprov2001.codfw.wmnet-fd JobId 162656: Fatal error: restore.c:473 Data record error. ERR=Resource temporarily unavailable
29-Nov 13:06 backup1001.eqiad.wmnet-fd JobId 162656: Error: bsock.c:388 Wrote 65540 bytes to client:10.192.0.114:43914, but only 0 accepted.
29-Nov 13:06 backup1001.eqiad.wmnet-fd JobId 162656: Fatal error: read.c:176 Error sending data to Client. ERR=Connection reset by peer
29-Nov 13:06 backup1001.eqiad.wmnet-fd JobId 162656: Elapsed time=00:02:00, Transfer rate=15.68 K Bytes/second
29-Nov 13:06 backup1001.eqiad.wmnet-fd JobId 162656: Error: bsock.c:271 Socket has errors=1 on call to client:10.192.0.114:43914
29-Nov 13:06 backup1001.eqiad.wmnet JobId 162656: Error: Bacula backup1001.eqiad.wmnet 9.4.2 (04Feb19):
  Build OS:               x86_64-pc-linux-gnu debian buster/sid
  JobId:                  162656
  Job:                    RestoreFiles.2019-11-29_13.04.00_16
  Restore Client:         dbprov2001.codfw.wmnet-fd
  Where:                  /srv/tmp
  Replace:                Never
  Start time:             29-Nov-2019 13:04:02
  End time:               29-Nov-2019 13:06:03
  Elapsed time:           2 mins 1 sec
  Files Expected:         5,408,668
  Files Restored:         1
  Bytes Restored:         0 (0 B)
  Rate:                   0.0 KB/s
  FD Errors:              2
  FD termination status:  Error
  SD termination status:  Error
  Termination:            *** Restore Error ***

Restoring dbprov2002 content on dbprov2001 with the puppet master cert, however, works well.

Same for bast1001:

29-Nov 13:19 backup1001.eqiad.wmnet JobId 162657: Start Restore Job RestoreFiles.2019-11-29_13.18.59_50
29-Nov 13:19 backup1001.eqiad.wmnet JobId 162657: Using Device "FileStorageArchive" to read.
29-Nov 13:19 backup1001.eqiad.wmnet-fd JobId 162657: Ready to read from volume "archive0003" on File device "FileStorageArchive" (/srv/archive).
29-Nov 13:19 backup1001.eqiad.wmnet-fd JobId 162657: Forward spacing Volume "archive0003" to addr=323517936212
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 Encryption session provided an invalid symmetric key: ERR=error:0407109F:rsa routines:RSA_padding_check_PKCS1_type_2:pkcs decoding error
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 Encryption session provided an invalid symmetric key: ERR=error:04065072:rsa routines:RSA_EAY_PRIVATE_DECRYPT:padding check failed
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 Encryption session provided an invalid symmetric key: ERR=error:0607A082:digital envelope routines:EVP_CIPHER_CTX_set_key_length:invalid key length
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: restore.c:741 Failed to initialize decryption context for /srv/tmp/srv/home_pmtpa/aaron/.gitignore
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 OpenSSL digest Verify final failed: ERR=error:0407109F:rsa routines:RSA_padding_check_PKCS1_type_2:pkcs decoding error
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 OpenSSL digest Verify final failed: ERR=error:04065072:rsa routines:RSA_EAY_PRIVATE_DECRYPT:padding check failed
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 OpenSSL digest Verify final failed: ERR=error:04091077:rsa routines:INT_RSA_VERIFY:wrong signature length
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: restore.c:1680 Signature validation failed for file /srv/tmp/srv/home_pmtpa/aaron/.cache/motd.legal-displayed: ERR=Signature is invalid
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 Encryption session provided an invalid symmetric key: ERR=error:0407109F:rsa routines:RSA_padding_check_PKCS1_type_2:pkcs decoding error
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 Encryption session provided an invalid symmetric key: ERR=error:04065072:rsa routines:RSA_EAY_PRIVATE_DECRYPT:padding check failed

@akosiaris does this ring any bell? I find it hard to believe that a new version of openssl couldn't decrypt a file encrypted with an older version. Maybe I am using the wrong options.
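To see which failure dominates in job logs like the two above, the error lines can be grouped by daemon, source location and message. This is a rough sketch that assumes the line layout shown in those logs; `summarize_errors` is a hypothetical helper, not part of bacula.

```python
import re
from collections import Counter

# Matches lines such as:
# 29-Nov 13:19 host-fd JobId 162657: Error: openssl.c:78 <message>
_ERROR_LINE = re.compile(
    r"^\d{2}-\w{3} \d{2}:\d{2} (?P<daemon>\S+) JobId (?P<jobid>\d+): "
    r"(?P<level>Fatal error|Error): (?P<where>\S+\.c:\d+) (?P<message>.*)$"
)


def summarize_errors(log_text):
    """Count error lines per (daemon, source location, message)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = _ERROR_LINE.match(line.strip())
        if match is None:
            continue  # informational lines and the job summary are skipped
        # Drop the OpenSSL detail after ": ERR=" so repeats group together.
        summary = match["message"].split(": ERR=")[0]
        counts[(match["daemon"], match["where"], summary)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: pipe a saved job log into the script.
    for key, count in summarize_errors(sys.stdin.read()).most_common():
        print(count, *key, sep="\t")
```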

Change 554257 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Rename schedule to monthlys., split schedule & jobdefaults

https://gerrit.wikimedia.org/r/554257

Change 554286 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Setup new pool for databases as well as its configuration

https://gerrit.wikimedia.org/r/554286

Change 554288 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] database-backups: Use a different pool for dbatabase backups

https://gerrit.wikimedia.org/r/554288

Change 550671 abandoned by Jcrespo:
bacula: Setup separate pool and defaults for database backups on backup1001

Reason:
Split into Ie081e459b35b787a837424b31ad8 and following ones

https://gerrit.wikimedia.org/r/550671

Change 554257 merged by Jcrespo:
[operations/puppet@production] bacula: Rename schedule to monthlys., split schedule & jobdefaults

https://gerrit.wikimedia.org/r/554257

Change 554286 merged by Jcrespo:
[operations/puppet@production] bacula: Setup new pool for databases as well as its configuration

https://gerrit.wikimedia.org/r/554286

jcrespo updated the task description. Tue, Dec 3, 5:41 PM

Change 554288 merged by Jcrespo:
[operations/puppet@production] database-backups: Use a different pool for database backups

https://gerrit.wikimedia.org/r/554288

Change 554344 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] database-backups: Fix database jobdefaults for database dumps

https://gerrit.wikimedia.org/r/554344

Change 554344 merged by Jcrespo:
[operations/puppet@production] database-backups: Fix database jobdefaults for database dumps

https://gerrit.wikimedia.org/r/554344

Yay!

Full           Backup    10  04-Dec-19 02:05    dbprov2002.codfw.wmnet-Monthly-1st-Wed-Databases-mysql-srv-backups-dumps-latest *unknown*
Full           Backup    10  04-Dec-19 02:05    dbprov2001.codfw.wmnet-Monthly-1st-Wed-Databases-mysql-srv-backups-dumps-latest *unknown*

jcrespo updated the task description. Tue, Dec 3, 6:09 PM

Change 554485 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Increase production and Databases retention to 90 days

https://gerrit.wikimedia.org/r/554485

"Offsite Job" seems to be correctly configured as "Copy", but it is not showing any activity. Needs checking.

@akosiaris does this ring any bell? I find it hard to believe that a new version of openssl couldn't decrypt a file encrypted with an older version. Maybe I am using the wrong options.

We've had a couple of these in the past (~2013-2014?), but it was operator error IIRC. The message is "openssl.c:78 Encryption session provided an invalid symmetric key: ERR=error:0607A082:digital envelope routines:EVP_CIPHER_CTX_set_key_length:invalid key length", so it looks like the wrong key was used?

"Offsite Job" seems to be correctly configured as "Copy", but it is not showing any activity. Needs checking.

I don't think there is a schedule for that; maybe that's why?

Change 554485 merged by Jcrespo:
[operations/puppet@production] bacula: Increase production and Databases retention to 90 days

https://gerrit.wikimedia.org/r/554485

Hi, @akosiaris, thanks for the reviews and feedback. Could I have your further thoughts on T238048#5701519 and T238048#5701534? Normally I would just find a solution or workaround on my own, but the archive file copy was one of the parts where I compromised on my suggested plan because you were quite confident about its forward compatibility :-/. On the other hand, most of those files seem to be around 5 years old, which may mean some should actually be purged. Let me know your thoughts.

After the update, the pools seem OK, although we should probably also increase the offsite one (creating a patch).

*list pool
+--------+------------+---------+---------+-----------------+--------------+---------+----------+-------------+
| PoolId | Name       | NumVols | MaxVols | MaxVolBytes     | VolRetention | Enabled | PoolType | LabelFormat |
+--------+------------+---------+---------+-----------------+--------------+---------+----------+-------------+
|      1 | Default    |       0 |       1 |               0 |  155,520,000 |       1 | Backup   | *           |
|      2 | production |      33 |      60 | 536,870,912,000 |    7,776,000 |       1 | Backup   | production  |
|      3 | Archive    |       2 |       5 | 536,870,912,000 |  157,680,000 |       1 | Backup   | archive     |
|      4 | offsite    |       0 |      60 | 536,870,912,000 |    2,592,000 |       1 | Backup   | offsite     |
|      5 | Databases  |       5 |      60 | 536,870,912,000 |    7,776,000 |       1 | Backup   | databases   |
+--------+------------+---------+---------+-----------------+--------------+---------+----------+-------------+
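The VolRetention values in the table are easier to compare as days. The arithmetic below is a small sketch that assumes the column is expressed in seconds, which is consistent with the 90-day retention patch (7,776,000 s); the numbers are copied from the table.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# VolRetention per pool, as listed by "list pool" (assumed to be seconds).
VOL_RETENTION_SECONDS = {
    "Default": 155_520_000,
    "production": 7_776_000,
    "Archive": 157_680_000,
    "offsite": 2_592_000,
    "Databases": 7_776_000,
}


def retention_days(seconds):
    """Convert a retention period in seconds to days."""
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    for pool, seconds in VOL_RETENTION_SECONDS.items():
        print(f"{pool:<10} {retention_days(seconds):7.0f} days")
```

Under that assumption, production and Databases are at 90 days, Archive at 1825 days (5 years) and offsite at 30 days, which is why offsite is the one left to increase.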

Change 554536 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Increase offsite backup retention to 90 days

https://gerrit.wikimedia.org/r/554536

Change 554539 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Schedule hourly copies of production backups to the offsite pool

https://gerrit.wikimedia.org/r/554539

jcrespo updated the task description. Wed, Dec 4, 4:06 PM

Change 554536 merged by Jcrespo:
[operations/puppet@production] bacula: Increase offsite backup retention to 90 days

https://gerrit.wikimedia.org/r/554536

Change 554539 merged by Jcrespo:
[operations/puppet@production] bacula: Schedule hourly copies of production backups to the offsite pool

https://gerrit.wikimedia.org/r/554539

Hi, @akosiaris, thanks for the reviews and feedback. Could I have your further thoughts on T238048#5701519 and T238048#5701534?

I have provided them already in T238048#5711997

Normally I would just find a solution or workaround on my own, but the archive file copy was one of the parts where I compromised on my suggested plan because you were quite confident about its forward compatibility :-/.

I was never confident about archive specifically (I was about the rest). The archive pool has not seen a restore in a very long time, as far as I remember.

On the other hand, most of those files seem to be around 5 years old, which may mean some should actually be purged. Let me know your thoughts.

It's mostly historical data; we can stall this until someone requires a restore and try to solve it then. Overall, 5 years after the last restore, I doubt there will be a really critical request. That being said, it does look like a key issue; maybe they are just encrypted with a different key? We did a CA rollover some 4-5 years ago. Maybe they are encrypted with that key (which should be on the puppetmasters?)

I have provided them already in

Indeed, sorry, my mail client only showed your last comment.

Maybe they are encrypted with that key (which should be on the puppetmasters?)

I will try to find and use a previous key, thank you again for the suggestion. Very useful.

I accidentally scheduled the migration, not the copy.

Incremental    Migrate   20  04-Dec-19 17:00    Migrate Job

I think it wouldn't have run anyway due to the selection config, but I am reverting it and scheduling the right job instead.

Change 554563 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Schedule hourly copies of production backups to the offsite pool

https://gerrit.wikimedia.org/r/554563

jcrespo updated the task description. Wed, Dec 4, 4:47 PM

Change 554563 merged by Jcrespo:
[operations/puppet@production] bacula: Schedule hourly copies of production backups to the offsite pool

https://gerrit.wikimedia.org/r/554563

Now it is ok:

Scheduled Jobs:
Level          Type     Pri  Scheduled          Job Name           Volume
===================================================================================
Incremental    Backup    10  04-Dec-19 17:00    gerrit1001.wikimedia.org-Hourly-Sun-production-srv-gerrit-git production0070
Incremental    Copy      20  04-Dec-19 17:00    Offsite Job         <======= this

Waiting 5 minutes to check performance and status.

Change 556195 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Setup weekly copy migrations until a first run happens

https://gerrit.wikimedia.org/r/556195

Change 556195 merged by Jcrespo:
[operations/puppet@production] bacula: Setup weekly copy migrations until a first run happens

https://gerrit.wikimedia.org/r/556195

Copy jobs are running now; we will see how long a full copy takes.

For now I have set up the copies to happen only once a week, because when I set them up to run every hour, bacula got overloaded with so many scheduled jobs. Also, many errors happen at the beginning, likely due to metadata of files that were not physically migrated, as expected. We may need to manually purge some of those old jobs if bacula doesn't do it automatically.