
Switchover backup director service from helium to backup1001
Closed, ResolvedPublic

Description

Scheduled for 2019-09-29 11:30:00 UTC

https://docs.google.com/document/d/1rHFMeuxQ6qOaLumsENYhkkf7sNo6PNav63az71Ii0Tg/edit

The steps are in this exact order:

  • Fix the TODO Jaime mentioned so the iptables hole can be opened. [Alex is doing this]
  • Move archive pool files to backup1001/2001 so they are not lost even if helium/heze go down. (Production pools roll over, so losing them would be less problematic, as long as there is some backup.) DONE.
  • Test a full backup/restore cycle from a new remote host towards backup1001 to validate the new setup. [Jaime did this]
  • Change helium to an sd-only role under the new director, backup1001 (or, if for some reason that is not possible, to an easily revertible noop role); see the config sketch after this list. Start with either the upgraded current bacula db "bacula" (with a backup generated first) or an upgraded copy of it, "bacula9".
  • Update references to helium all over puppet. [Jaime did a preliminary check and didn't find anything relevant]
  • Check that new backup1001/2001 backups run as expected and can be recovered. [WIP]
  • Make sure we can recover the helium/heze pools from backup1001. Keep helium/heze around for 3 months so that new backups go to backup1001/2001 while old ones can still be recovered.
  • Reattach locally to backup1001(/2001) the helium archive files moved in step 2.
  • Remove the old director code for jessie/pointing to helium.
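For the sd-role step above, a minimal sketch of what helium's bacula-sd.conf could look like once it acts only as a storage daemon trusted by the new director (resource names, paths and the password are illustrative placeholders, not the puppet-managed values):

```
# Sketch of an sd-only role for helium under the new director.
Storage {
  Name = helium-sd
  SDPort = 9103
  WorkingDirectory = /var/lib/bacula
  Pid Directory = /var/run/bacula
}

# Only the new director may contact this storage daemon.
Director {
  Name = backup1001.eqiad.wmnet-dir
  Password = "REPLACE_WITH_SD_PASSWORD"
}

# Device serving the archive pool files kept on helium.
Device {
  Name = FileStorageArchive
  Media Type = File
  Archive Device = /srv/baculasd/archive
  LabelMedia = yes
  Random Access = yes
  AutomaticMount = yes
}
```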

Details

Related Gerrit Patches:

Event Timeline

jcrespo created this task.Thu, Oct 24, 4:24 PM
Restricted Application removed a project: Patch-For-Review. · View Herald TranscriptThu, Oct 24, 4:24 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
jcrespo triaged this task as High priority.Thu, Oct 24, 4:24 PM
jcrespo updated the task description. (Show Details)
akosiaris updated the task description. (Show Details)Thu, Oct 24, 4:26 PM
jcrespo updated the task description. (Show Details)Thu, Oct 24, 4:27 PM

Because buster clients and jessie storage daemons cannot talk to each other, we will have to alter the upgrade strategy slightly. There are several options; none of them is as urgent as being able to create new backups for buster hosts on backup1001.
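If, as the TLS errors later in this task suggest, the incompatibility is a protocol-version mismatch, the buster side of it comes from the stricter system-wide OpenSSL defaults; a quick way to see them on a buster client (stock Debian buster values shown in the comments, actual hosts may differ):

```
# On a buster host, the system openssl.cnf raises the TLS floor, which an old
# jessie-era storage daemon cannot meet:
grep -A2 'system_default_sect' /etc/ssl/openssl.cnf
# Typical output on stock buster:
#   [system_default_sect]
#   MinProtocol = TLSv1.2
#   CipherString = DEFAULT@SECLEVEL=2
```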

jcrespo moved this task from Triage to Next on the DBA board.Fri, Oct 25, 12:17 PM
jcrespo updated the task description. (Show Details)Mon, Oct 28, 8:41 AM

I've created a copy of the bacula database as bacula9, and then ran:

sudo -u bacula ./update_mysql_tables -h m1-master.eqiad.wmnet bacula9 -u bacula9 -pXXXXXXXXXXXXXXXXXXXXX

This script will update a Bacula MySQL database from version 12-16 to 16

Depending on the current version of your catalog,
you may have to run this script multiple times.

ERROR 1044 (42000) at line 1: Access denied for user 'bacula9'@'10.64.0.165' to database 'XXX_DBNAME_XXX'
Update of Bacula MySQL tables 15 to 16 succeeded.
ERROR 1044 (42000) at line 1: Access denied for user 'bacula9'@'10.64.0.165' to database 'XXX_DBNAME_XXX'
Update of Bacula MySQL tables 16 to 16 succeeded.

It seemed to work. You can ignore the XXX_DBNAME_XXX errors: the script doesn't fail when the USE statement is denied, it just continues.

This way we have an untouched Bacula 5 database we can always revert to, and a new, upgraded Bacula 9 one. We can delete and rename whichever one we don't need afterwards.
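A quick sanity check after the upgrade (assuming the standard Bacula catalog schema, which keeps the schema version in the Version table; connection parameters mirror the command above):

```sql
-- Hedged verification sketch: Bacula 9 uses catalog schema version 16, so the
-- upgraded copy should report 16 while the untouched "bacula" database should
-- still report the old Bacula 5-era version.
-- e.g. mysql -h m1-master.eqiad.wmnet -u bacula9 -p bacula9
SELECT VersionId FROM Version;
```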

akosiaris updated the task description. (Show Details)Tue, Oct 29, 12:01 PM

```
./check_bacula.py --bconsole=/usr/sbin/bconsole --verbose

== jobs_with_all_failures (76) ==

an-master1002.eqiad.wmnet-Monthly-1st-Mon-production-hadoop-namenode-backup
analytics1029.eqiad.wmnet-Monthly-1st-Fri-production-hadoop-namenode-backup
archiva1001.wikimedia.org-Monthly-1st-Fri-production-var-lib-archiva
bast1002.wikimedia.org-Monthly-1st-Fri-production-home
bast2002.wikimedia.org-Monthly-1st-Wed-production-home
bast4002.wikimedia.org-Monthly-1st-Wed-production-home
bast4002.wikimedia.org-Monthly-1st-Wed-production-srv-tftpboot
bast5001.wikimedia.org-Monthly-1st-Sun-production-home
bast5001.wikimedia.org-Monthly-1st-Sun-production-srv-tftpboot
bromine.eqiad.wmnet-Monthly-1st-Sun-production-bugzilla-backup
bromine.eqiad.wmnet-Monthly-1st-Sun-production-bugzilla-static
...
puppetmaster1001.eqiad.wmnet-Monthly-1st-Sat-production-var-lib-puppet-ssl
puppetmaster1001.eqiad.wmnet-Monthly-1st-Sat-production-var-lib-puppet-volatile
puppetmaster2001.codfw.wmnet-Monthly-1st-Fri-production-var-lib-puppet-ssl
puppetmaster2001.codfw.wmnet-Monthly-1st-Fri-production-var-lib-puppet-volatile
releases1001.eqiad.wmnet-Monthly-1st-Sat-production-srv-org-wikimedia
releases2001.codfw.wmnet-Monthly-1st-Sun-production-srv-org-wikimedia
seaborgium.wikimedia.org-Monthly-1st-Fri-production-openldap
serpens.wikimedia.org-Monthly-1st-Sun-production-openldap
torrelay1001.wikimedia.org-Monthly-1st-Fri-production-tor
vega.codfw.wmnet-Monthly-1st-Sat-production-bugzilla-backup
vega.codfw.wmnet-Monthly-1st-Sat-production-bugzilla-static
vega.codfw.wmnet-Monthly-1st-Sat-production-rt-static

== jobs_with_fresh_backups (0) ==


== jobs_with_no_backups (1) ==

webperf2002.codfw.wmnet-Monthly-1st-Thu-production-arclamp-application-data

== jobs_with_stale_backups (0) ==


== jobs_with_stale_full_backups (0) ==
```
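For readers unfamiliar with the check: a minimal sketch of how a freshness check like this could bucket jobs (this is not the actual check_bacula.py code; the threshold and data structure are illustrative only):

```python
# Hypothetical bucketing logic, loosely mirroring the categories printed above.
# Assumption: for each job we already have its recent attempts as
# (datetime, succeeded) pairs; the 8-day threshold is made up.
from datetime import datetime, timedelta

FRESHNESS = timedelta(days=8)

def bucket_jobs(jobs, now=None):
    """jobs: dict of job_name -> list of (datetime, bool) attempt tuples."""
    now = now or datetime.utcnow()
    buckets = {
        'jobs_with_all_failures': [],
        'jobs_with_no_backups': [],
        'jobs_with_fresh_backups': [],
        'jobs_with_stale_backups': [],
    }
    for name, attempts in jobs.items():
        successes = [ts for ts, ok in attempts if ok]
        if not attempts:
            buckets['jobs_with_no_backups'].append(name)
        elif not successes:
            buckets['jobs_with_all_failures'].append(name)
        elif now - max(successes) <= FRESHNESS:
            buckets['jobs_with_fresh_backups'].append(name)
        else:
            buckets['jobs_with_stale_backups'].append(name)
    return buckets
```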

//TODO: Move log from /var/lib/bacula/log to /var/log/bacula/log

jcrespo updated the task description. (Show Details)Tue, Oct 29, 3:28 PM

Change 546972 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] bacula: Move logs to /var/log/bacula

https://gerrit.wikimedia.org/r/546972

Ottomata assigned this task to jcrespo.Tue, Oct 29, 3:34 PM
Ottomata added a subscriber: Ottomata.

Jaime, assigning to you, feel free to undo or reassign if this is not correct.

Thanks, it is indeed correct, and this just happened today (even if Alex did most of the work). Not closing yet because it is still very much a work in progress.

For the record, the related patches were merged for this ticket.

I thought this error was from a backup attempt made before the patch. However, after running the job manually, it failed again:

29-Oct 19:21 backup1001.eqiad.wmnet JobId 158825: Start Backup JobId 158825, Job=install1002.wikimedia.org-Monthly-1st-Wed-production-srv-autoinstall.2019-10-29_19.21.10_14
29-Oct 19:21 backup1001.eqiad.wmnet JobId 158825: Using Device "FileStorageProduction" to write.
29-Oct 19:21 backup1001.eqiad.wmnet-fd JobId 158825: Error: openssl.c:68 Connect failure: ERR=error:14209102:SSL routines:tls_early_post_process_client_hello:unsupported protocol
29-Oct 19:21 backup1001.eqiad.wmnet-fd JobId 158825: Fatal error: bnet.c:75 TLS Negotiation failed.
29-Oct 19:21 backup1001.eqiad.wmnet-fd JobId 158825: Fatal error: TLS negotiation failed with FD at "208.80.154.22:38518"
29-Oct 19:21 backup1001.eqiad.wmnet-fd JobId 158825: Fatal error: Incorrect authorization key from File daemon at client rejected.
For help, please see: http://www.bacula.org/rel-manual/en/problems/Bacula_Frequently_Asked_Que.html
29-Oct 19:21 install1002.wikimedia.org-fd JobId 158825: Error: openssl.c:86 Connect failure: ERR=error:1409442E:SSL routines:ssl3_read_bytes:tlsv1 alert protocol version
29-Oct 19:21 backup1001.eqiad.wmnet-fd JobId 158825: Security Alert: Unable to authenticate File daemon
29-Oct 19:21 install1002.wikimedia.org-fd JobId 158825: Fatal error: TLS negotiation failed.
29-Oct 19:21 install1002.wikimedia.org-fd JobId 158825: Fatal error: Failed to authenticate Storage daemon.
29-Oct 19:21 backup1001.eqiad.wmnet JobId 158825: Fatal error: Bad response to Storage command: wanted 2000 OK storage, got 2902 Bad storage

Either the patch was not enough or there are additional issues.

Mentioned in SAL (#wikimedia-operations) [2019-10-29T19:32:46Z] <jynus> restarting bacula-fd on install1002 T236406

That wasn't enough:

158826  Full           0         0   Error    29-Oct-19 19:35 install1002.wikimedia.org-Monthly-1st-Wed-production-srv-autoinstall

Got it: it was the storage daemon that hadn't been restarted, not the clients (that is why the director could connect but the sd failed):

158827  Full         111    265.9 K  OK       29-Oct-19 19:45 install1002.wikimedia.org-Monthly-1st-Wed-production-srv-autoinstall
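For future reference, the fix boiled down to restarting the daemons that were still running with the old TLS configuration and re-running the job. A hedged sketch of that sequence (service names assume the standard Debian bacula packages; the job name is the one from the log above):

```
# Restart whichever daemons still had the old configuration loaded
# (director/storage daemon on the backup host, file daemon on the client if needed):
sudo systemctl restart bacula-sd bacula-dir        # on backup1001
sudo systemctl restart bacula-fd                   # on install1002, if its config changed too

# Then re-run the failed job from the director and watch it:
echo 'run job=install1002.wikimedia.org-Monthly-1st-Wed-production-srv-autoinstall yes' | sudo bconsole
echo 'status director' | sudo bconsole
```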

Mentioned in SAL (#wikimedia-operations) [2019-10-30T10:07:03Z] <jynus> restarting bacula-dir, bacula-sd on backup1001 T236406

Some jobs are getting stuck (single XML backups) with errors complaining about the sd daemon. Even cancel gets stuck (I would expect it to return immediately, even if it takes a few minutes to actually cancel). A restore showed SSL errors; I restarted the fd on the client and retried.

The large recovery worked as intended:

root@dbprov2001:/srv/backups/srv/backups/dumps/latest$ diff -r dump.s8.2019-10-29--02-43-52 /srv/backups/dumps/latest/dump.s8.2019-10-29--02-43-52
root@dbprov2001:/srv/backups/srv/backups/dumps/latest$ echo $?
0
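For anyone repeating this validation, the restore itself can be driven from bconsole along these lines (a hedged sketch with illustrative parameters; the exact restore options used here are not recorded in the ticket):

```
# On the director: restore the most recent backup for the client to an
# alternate location, then diff against the live data as shown above.
echo 'restore client=dbprov2001.codfw.wmnet where=/srv/backups select current all done yes' | sudo bconsole
```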

Status at the moment:

== jobs_with_all_failures (6) ==

an-master1002.eqiad.wmnet-Monthly-1st-Mon-production-hadoop-namenode-backup
analytics1029.eqiad.wmnet-Monthly-1st-Fri-production-hadoop-namenode-backup
bromine.eqiad.wmnet-Monthly-1st-Sun-production-bugzilla-backup
cloudweb2001-dev.wikimedia.org-Monthly-1st-Mon-production-a-backup
labweb1001.wikimedia.org-Monthly-1st-Sat-production-a-backup
vega.codfw.wmnet-Monthly-1st-Sat-production-bugzilla-backup

== jobs_with_fresh_backups (87) ==
...

== jobs_with_no_backups (0) ==


== jobs_with_stale_backups (1) ==

matomo1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups

== jobs_with_stale_full_backups (0) ==

Checking the logs now.

jcrespo added a comment.EditedMon, Nov 4, 10:55 AM
  • an-master1002.eqiad.wmnet-Monthly-1st-Mon-production-hadoop-namenode-backup: connectivity issue T237016 (a quick reachability check is sketched after this list)

bacula client:

Nov 04 04:35:10 an-master1002 bacula-fd[158469]: an-master1002.eqiad.wmnet-fd: job.c:1886-159474 Failed to connect to Storage daemon: backup1001.eqiad.wmnet:9103

sd server:

30-Oct 04:13 an-master1002.eqiad.wmnet-fd JobId 158838: Warning: bsock.c:107 Could not connect to Storage daemon on backup1001.eqiad.wmnet:9103. ERR=Connection timed out
Retrying ...
30-Oct 04:35 an-master1002.eqiad.wmnet-fd JobId 158838: Fatal error: bsock.c:113 Unable to connect to Storage daemon on backup1001.eqiad.wmnet:9103. ERR=Interrupted system call
30-Oct 04:35 an-master1002.eqiad.wmnet-fd JobId 158838: Fatal error: job.c:1886 Failed to connect to Storage daemon: backup1001.eqiad.wmnet:9103
30-Oct 04:35 backup1001.eqiad.wmnet JobId 158838: Fatal error: Bad response to Storage command: wanted 2000 OK storage, got 2902 Bad storage
  • analytics1029.eqiad.wmnet-Monthly-1st-Fri-production-hadoop-namenode-backup: connectivity issue, same as above T237016
  • bromine.eqiad.wmnet-Monthly-1st-Sun-production-bugzilla-backup: potential puppet client misconfiguration, or a host that is not in the production lifecycle; filed T237233
  • cloudweb2001-dev.wikimedia.org-Monthly-1st-Mon-production-a-backup: Filed T237237
30-Oct 05:11 cloudweb2001-dev.wikimedia.org-fd JobId 158852:      Could not stat "/a/backup": ERR=No such file or directory
  • labweb1001.wikimedia.org-Monthly-1st-Sat-production-a-backup: It is successful, but interpreted as a failure because it copies 0 bytes (only an empty dir). Probably not intended. Filed: T237237
  • vega.codfw.wmnet-Monthly-1st-Sat-production-bugzilla-backup: potential puppet client misconfiguration, or a host that is not in the production lifecycle; filed T237233
  • matomo1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups: seems to be working OK now; the problem is that it is configured as "Weekly" but only does fulls, with no daily incrementals
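As referenced in the first bullet above, a quick way to check raw reachability of the storage daemon port from an affected client (this only tests TCP connectivity, not the Bacula/TLS handshake; 9103 is the standard sd port):

```
# Run from e.g. an-master1002:
timeout 5 bash -c 'cat < /dev/null > /dev/tcp/backup1001.eqiad.wmnet/9103' \
  && echo 'sd port reachable' \
  || echo 'sd port blocked or timing out'
```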
jcrespo added a subscriber: elukey.Mon, Nov 4, 11:06 AM

@elukey @Ottomata Re: matomo1001, is there a reason not to have daily incrementals? If the reason is that it generates a full backup each time, maybe I can suggest using the Monthly policy (like we use for databases, and in reality it generates weekly backups)? Or alternatively, do daily incrementals if daily granularity is ok and efficient.

This is not a big deal, but having similar policies for all backed-up hosts, unless otherwise necessary, would be a huge simplification; it also simplifies monitoring and debugging of future issues and backup/recovery expectations.

Please also see unrelated root cause T237016.
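To make the alternatives concrete, here is roughly what the two shapes look like as Bacula Schedule resources (an illustrative sketch using upstream Schedule syntax, not the actual puppet-managed policies; names and times are made up):

```
# Fulls only on a fixed weekday (effectively what matomo1001 gets today):
Schedule {
  Name = "Weekly-Wed-Fulls"
  Run = Level=Full wed at 02:05
}

# A full/differential cycle with daily incrementals (the "daily incrementals"
# alternative mentioned above):
Schedule {
  Name = "Monthly-1st-Sun-Sketch"
  Run = Level=Full 1st sun at 02:05
  Run = Level=Differential 2nd-5th sun at 02:05
  Run = Level=Incremental mon-sat at 02:05
}
```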

jcrespo updated the task description. (Show Details)Mon, Nov 4, 11:07 AM
jcrespo updated the task description. (Show Details)

> @elukey @Ottomata Re: matomo1001, is there a reason not to have daily incrementals? If the reason is that it generates a full backup each time, maybe I can suggest using the Monthly policy (like we use for databases, and in reality it generates weekly backups)? Or alternatively, do daily incrementals if daily granularity is ok and efficient.
> This is not a big deal, but having similar policies for all backed-up hosts, unless otherwise necessary, would be a huge simplification; it also simplifies monitoring and debugging of future issues and backup/recovery expectations.

I'll dig into it, the reason is mostly due to my ignorance about backups :)

> Please also see unrelated root cause T237016.

Yep thanks already working on it! Hope to push the new rules to the firewalls asap

jcrespo added a comment.EditedMon, Nov 4, 11:47 AM

@elukey If it helps, I can try manually generating an incremental, which would give us a better-informed view of the storage size involved (it shouldn't take too long).

Also thanks for the quick response!

PS:

159573  Back Incr          0         0  matomo1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups is running

I am all for simplifying and standardizing confs, so no opposition to incrementals. Only one question: what would it change when trying to restore the database? This is basically my only concern at the moment. If it is as simple as doing a recovery via Bacula, I am all for it!

# check_bacula.py matomo1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups
2019-10-30 02:05:43: type: F, status: T, bytes: 782,728,416
2019-11-04 11:49:13: type: I, status: T, bytes: 516,049,808  <--- this is the one I generated manually

It seems it generates a full dump each time, even on incrementals:

root@matomo1001:/srv/backups$ ls -la
total 757576
drwx------ 2 root root      4096 Nov  4 11:48 .
drwxr-xr-x 4 root root      4096 Oct  1  2018 ..
-rw-r--r-- 1 root root    157099 Oct 23 02:05 mysql-201910230205.sql.gz
-rw-r--r-- 1 root root    157290 Oct 30 02:05 mysql-201910300205.sql.gz
-rw-r--r-- 1 root root    156662 Nov  4 11:48 mysql-201911041148.sql.gz
-rw-r--r-- 1 root root 259398314 Oct 23 02:05 piwik-201910230205.sql.gz
-rw-r--r-- 1 root root 261723186 Oct 30 02:05 piwik-201910300205.sql.gz
-rw-r--r-- 1 root root 254134972 Nov  4 11:49 piwik-201911041148.sql.gz

I believe this is the thing we wanted to change, and still have to, but that is another story. My suggestion for now is to apply the same policy as the production databases, the Monthly one, which will give you the same result as you have now: a weekly full backup. I know the name is misleading and confusing; we will fix that later.

> I am all for simplifying and standardizing confs, so no opposition to incrementals. Only one question: what would it change when trying to restore the database? This is basically my only concern at the moment. If it is as simple as doing a recovery via Bacula, I am all for it!

Yes, incrementals are transparent to Bacula. And even in this case, because a new file is created each time, from Bacula's point of view the runs will be incrementals, but from your point of view they will still be full backups, so there is no performance penalty either. It may even save some space, since the same file won't be stored multiple times. Note that we use this exact same policy for the production database backups and we trust it!
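To illustrate why those incrementals still contain full dumps (a conceptual sketch of time-based file selection, not Bacula source code; the path and cutoff mirror the listing above):

```python
# A time-based incremental picks up any file changed since the previous backup,
# so every new timestamped dump (e.g. piwik-201911041148.sql.gz) is included
# in full even on an "Incremental" run.
import os
import time

def files_for_incremental(directory, last_backup_ts):
    """Return the files a time-based incremental would include."""
    selected = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        st = os.stat(path)
        if max(st.st_mtime, st.st_ctime) > last_backup_ts:
            selected.append(path)
    return selected

# Anything written after the 2019-10-30 full would be selected:
last_full = time.mktime(time.strptime('2019-10-30 02:05', '%Y-%m-%d %H:%M'))
print(files_for_incremental('/srv/backups', last_full))
```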

Let me find where they are configured and I will send you a patch; later, feel free to ping me on IRC and I will show you how to restore them.

Change 548236 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] WIP: Move matomo to a Montly full schedule, like production dbs

https://gerrit.wikimedia.org/r/548236

@elukey I am sorry; after looking more closely at the policies, I realize I mistakenly assumed the schedule was wrong. I will abandon the policy patch (for now) and just amend the checking script to reflect the current reality. Also note that backups are running normally, so no worries on your side.

Change 548236 abandoned by Jcrespo:
WIP: Move matomo to a Montly full schedule, like production dbs

Reason:
The schedule was right, the assumptions about it were wrong.

https://gerrit.wikimedia.org/r/548236

Change 548244 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Fix hardcoded thresholds based on configured schedules

https://gerrit.wikimedia.org/r/548244

Change 548244 merged by Jcrespo:
[operations/puppet@production] bacula: Fix hardcoded thresholds based on configured schedules

https://gerrit.wikimedia.org/r/548244

The matomo false alert is now gone, as expected; only the 6 failures caused by the 3 tickets above (T236406#5630631) remain:

All failures: 6 (an-master1002, ...), Fresh: 88 jobs
elukey added a comment.Mon, Nov 4, 1:41 PM

Nice, thanks! I just pushed the new rules to the routers, so in theory the an-master1002 and analytics1029 failures should go away now! Let me know :)

jcrespo added a comment.EditedMon, Nov 4, 1:50 PM

Forcing a manual run on the 2 above for validation.

Looking good:

 JobId  Type Level     Files     Bytes  Name              Status
======================================================================
159576  Back Full          1    3.520 G an-master1002.eqiad.wmnet-Monthly-1st-Mon-production-hadoop-namenode-backup is running

:-)

All failures: 4 (bromine, ...), Fresh: 90 jobs

Unsubbing elukey and Otto to prevent unwanted spam (feel free to resubscribe).

Down to 2:

All failures: 2 (cloudweb2001-dev, ...), Fresh: 90 jobs

This should be fixed when the cloud patch is reviewed and deployed.

backup1001 Backup freshness OK 2019-11-05 08:55:20 0d 0h 0m 37s 1/3 Fresh: 92 jobs
jcrespo updated the task description. (Show Details)Tue, Nov 5, 8:58 AM
jcrespo updated the task description. (Show Details)
jcrespo closed this task as Resolved.Tue, Nov 12, 9:22 AM

I am going to consider the switchover done, and create a separate task for the follow-ups.