
Strengthen backup infrastructure and support
Open, Normal, Public

Description

Tracking task for the Q1 2019-2020 SRE goal

  • Deploy new Bacula hardware
  • Transfer ownership and knowledge of Bacula backup infrastructure
  • [stretch] Migrate general backup service from old to new host(s)

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Jul 29 2019, 8:50 AM
akosiaris triaged this task as Normal priority. · Jul 29 2019, 8:50 AM
Marostegui moved this task from Triage to Meta/Epic on the DBA board.Jul 29 2019, 8:51 AM

Change 537130 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backup: Reinstall backup1001 and backup2001 as buster

https://gerrit.wikimedia.org/r/537130

Change 537130 merged by Jcrespo:
[operations/puppet@production] backup: Reinstall backup1001 and backup2001 as buster

https://gerrit.wikimedia.org/r/537130

Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts:

['backup2001.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201909161507_jynus_27549.log.

Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts:

['backup2001.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201909161514_jynus_29861.log.

Got stuck at kernel boot, could it be the same issue as T216240?


Maybe. Even if it is not, it wouldn't hurt to get the latest firmware and BIOS updates before the host finally goes into production, as it will be much more difficult later.


+1, pretty sure it's the same issue as before in T216240.

On another note, I think the RAID controller now has a random device id, so the installer failed at boot. I am not sure we will be able to install it without a custom partman recipe, and it may show up with a different id on eqiad and codfw.

Sadly, I cannot set up the RAID remotely, because the server no longer boots and the mgmt interface says:

Unified Server Configurator does not support console redirection.

Change 537325 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] Add partman recipe equal to raid1-lvm-ext4-srv but with an additional hwraid

https://gerrit.wikimedia.org/r/537325

Change 537325 merged by Jcrespo:
[operations/puppet@production] Add partman recipe equal to raid1-lvm-ext4-srv but with an additional hwraid

https://gerrit.wikimedia.org/r/537325

Change 537336 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] install_server: Update comment on partman recipe

https://gerrit.wikimedia.org/r/537336

Change 537336 merged by Jcrespo:
[operations/puppet@production] install_server: Update partman recipe to set / on last disks

https://gerrit.wikimedia.org/r/537336

Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts:

['backup1001.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201909180842_jynus_136869.log.

Completed auto-reimage of hosts:

['backup1001.eqiad.wmnet']

and were ALL successful.

backup1001 was also set up; however, there is still a missing disk: T232882#5502241. Separating enclosures into different logical drives is going to pay off earlier than anticipated, as it may require rebuilding the virtual disk.

@akosiaris I would like to prevent accidental reimages of these servers (we suffered one when a board change reset the boot order, but it could also happen due to human error). We do this on databases with a special recipe that forces a puppet change to re-enable reimaging, but I'm open to better solutions.


Sounds fine to me.

jcrespo updated the task description. (Show Details) · Sep 19 2019, 8:35 AM

We may need some firmware updates, but the hardware is ready to go as soon as background RAID initialization finishes on array2 of backup1001. Hosts are installed with buster.

Next: Puppet.

Change 537928 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Apply no-srv-format recipe to backup hosts

https://gerrit.wikimedia.org/r/537928

Change 538042 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Make bacula db parameters configurable on hiera

https://gerrit.wikimedia.org/r/538042

Change 538042 merged by Jcrespo:
[operations/puppet@production] bacula: Make bacula db parameters configurable on hiera

https://gerrit.wikimedia.org/r/538042

Mentioned in SAL (#wikimedia-operations) [2019-09-20T08:52:35Z] <jynus> creating new database on m1 "bacula9" T229209

Change 538175 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Install bacula-sd instead of the sql variant on buster

https://gerrit.wikimedia.org/r/538175

Change 538175 merged by Jcrespo:
[operations/puppet@production] backups: Install bacula-sd instead of the sql variant on buster

https://gerrit.wikimedia.org/r/538175

Change 538236 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Fix wrong dependency on buster bacula storage servers

https://gerrit.wikimedia.org/r/538236

Change 538236 merged by Jcrespo:
[operations/puppet@production] backups: Fix wrong dependency on buster bacula storage servers

https://gerrit.wikimedia.org/r/538236

Change 537928 merged by Jcrespo:
[operations/puppet@production] backups: Apply no-srv-format recipe to backup hosts

https://gerrit.wikimedia.org/r/537928

Almost there:

Config error: Cannot open config file "/etc/bacula/bacula-sd.conf": Permission denied

Change 538239 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Change owner of /etc/bacula/bacula-sd.conf to bacula

https://gerrit.wikimedia.org/r/538239

Change 538239 merged by Jcrespo:
[operations/puppet@production] backups: Change file owner of bacula storage&director config

https://gerrit.wikimedia.org/r/538239
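This class of startup failure can be caught before bacula-sd is even restarted by checking that the daemon's user can read its config file. A minimal sketch (the `readable_by` helper is hypothetical and only checks owner/group/other permission bits, not ACLs or supplementary groups):

```python
import os
import pwd
import stat


def readable_by(path, username):
    """Rough check: can `username` read `path` via its owner/group/other bits?

    A sketch only -- ignores ACLs and supplementary group membership.
    """
    st = os.stat(path)
    user = pwd.getpwnam(username)
    if st.st_uid == user.pw_uid:
        return bool(st.st_mode & stat.S_IRUSR)
    if st.st_gid == user.pw_gid:
        return bool(st.st_mode & stat.S_IRGRP)
    return bool(st.st_mode & stat.S_IROTH)
```

For example, `readable_by("/etc/bacula/bacula-sd.conf", "bacula")` returning False would reproduce the "Permission denied" error above before the daemon does.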

Change 541205 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Add conditional storage device setup

https://gerrit.wikimedia.org/r/541205

Change 541205 merged by Jcrespo:
[operations/puppet@production] bacula: Add conditional storage device setup

https://gerrit.wikimedia.org/r/541205

Change 541209 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Remove old storage setup layout and increase concurrency

https://gerrit.wikimedia.org/r/541209

Change 541517 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Change pool/storage names for new bacula director

https://gerrit.wikimedia.org/r/541517

Change 541517 merged by Jcrespo:
[operations/puppet@production] bacula: Change pool/storage names for new bacula director

https://gerrit.wikimedia.org/r/541517

Change 541523 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Force install bacula-director, not a dependency on buster

https://gerrit.wikimedia.org/r/541523

I finally got the director running, but sadly it won't start without any devices or clients provisioned, so I created duplicates of the ones puppet may create:

root@backup1001:/srv/local$ bconsole
Connecting to Director backup1001.eqiad.wmnet:9101
1000 OK: 103 backup1001.eqiad.wmnet Version: 9.4.2 (04 February 2019)
Enter a period to cancel a command.
*status
Status available for:
     1: Director
     2: Storage
     3: Client
     4: Scheduled
     5: Network
     6: All
Select daemon type for status (1-6): 1
backup1001.eqiad.wmnet Version: 9.4.2 (04 February 2019) x86_64-pc-linux-gnu debian buster/sid
Daemon started 08-Oct-19 11:06, conf reloaded 08-Oct-2019 11:06:21
 Jobs: run=0, running=0 mode=0,0
 Heap: heap=356,352 smbytes=151,818 max_bytes=152,174 bufs=772 max_bufs=779
 Res: njobs=3 nclients=1 nstores=4 npools=4 ncats=1 nfsets=53 nscheds=21

Scheduled Jobs:
Level          Type     Pri  Scheduled          Job Name           Volume
===================================================================================
No Scheduled Jobs.
====

Running Jobs:
Console connected using TLS at 08-Oct-19 11:07
No Jobs running.
====
No Terminated Jobs.
====

In order for it to work without a test config, directly with puppet, we need to point the director at both the new storages and at least one client. I will check whether our management setup allows for more than one director in parallel.

jcrespo added a comment. · Edited · Fri, Oct 11, 9:25 AM

@akosiaris We have reached an impasse. We should:

  • Run puppet with the new permissions on the current bacula host, fix any issues found.
  • Plan for the migration; we may have different philosophies on how we want to achieve it. Let's sync on IRC on how to follow up.

Conditionals in puppet are getting harder and harder to manage, and we should do at least one of the two above soon.

This is also something you will want as it will finally free you from this OKR at last! :-D

jcrespo added a comment. · Edited · Wed, Oct 16, 10:51 AM

Reminder:

# TODO The IPv6 IP should be converted into a DNS AAAA resolve once we
# enabled the DNS record on the director

Also I wonder if naming the job file configuration:

cat vega.codfw.wmnet-rt-static-Monthly-1st-Sat-production.conf

but the name of the jobs:

vega.codfw.wmnet-rt-static-Monthly-1st-Sat-production.conf
Name = "vega.codfw.wmnet-Monthly-1st-Sat-production-rt-static"

was on purpose.

Sorry I missed that, thanks for pinging me on T234900.

> @akosiaris We have reached an impasse. We should:
>
>   • Run puppet with the new permissions on the current bacula host, fix any issues found.

Sure, sounds fine to me.

>   • Plan for the migration; we may have different philosophies on how we want to achieve it. Let's sync on IRC on how to follow up.
>
> Conditionals in puppet are getting harder and harder to manage, and we should do at least one of the two above soon.

Probably both; let's indeed sync up on IRC.

> Reminder:
>
> # TODO The IPv6 IP should be converted into a DNS AAAA resolve once we
> # enabled the DNS record on the director

Heh, that's an old TODO. Arguably it should be moved into hiera and augmented to cover both directors. I can help with that.
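For reference, resolving the director's AAAA record instead of hardcoding the IPv6 address could look roughly like this (a sketch only; the injectable `resolver` parameter exists just to make it testable offline, and the address shown is a documentation placeholder, not a real host):

```python
import socket


def resolve_aaaa(host, resolver=socket.getaddrinfo):
    """Return the first IPv6 address for `host`, or None if it has no AAAA record."""
    try:
        results = resolver(host, None, socket.AF_INET6)
    except socket.gaierror:
        return None
    for family, _socktype, _proto, _canonname, sockaddr in results:
        if family == socket.AF_INET6:
            # For AF_INET6, sockaddr is (address, port, flowinfo, scope_id)
            return sockaddr[0]
    return None
```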

> Also I wonder if naming the job file configuration:
>
> cat vega.codfw.wmnet-rt-static-Monthly-1st-Sat-production.conf
>
> but the name of the jobs:
>
> vega.codfw.wmnet-rt-static-Monthly-1st-Sat-production.conf
> Name = "vega.codfw.wmnet-Monthly-1st-Sat-production-rt-static"
>
> was on purpose.

No, I don't think so. From what I see it has been there since at least 2014 (a38742ee86b) and is the result of passing

"${name}-${real_jobdefaults}"

instead of

"${real_jobdefaults}-${name}"

in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/backup/manifests/set.pp#12

Since bacula::client::job does not use that for anything other than the name of the file, we can swap the two above with zero impact on the system. One file will disappear and another will appear with the exact same contents, and since the names of the files are unimportant it should be just fine.

Mentioned in SAL (#wikimedia-operations) [2019-10-16T13:56:32Z] <jynus> reenabling puppet on helium T229209

Change 541523 merged by Jcrespo:
[operations/puppet@production] bacula: Force install bacula-director, not a dependency on buster

https://gerrit.wikimedia.org/r/541523

Change 543489 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Fix error on bacula director install for older hosts

https://gerrit.wikimedia.org/r/543489

Change 543489 merged by Jcrespo:
[operations/puppet@production] bacula: Fix error on bacula director install for older hosts

https://gerrit.wikimedia.org/r/543489

I have discussed a plan with Alex; there is a preliminary but timid suggestion of steps in the design (more like diary) document.

For now I have left running on cumin1001:

transfer.py --no-compress --no-encrypt helium.eqiad.wmnet:/srv/baculasd2 backup1001.eqiad.wmnet:/srv/archive

https://grafana.wikimedia.org/d/000000274/prometheus-machine-stats?panelId=8&fullscreen&orgId=1&var-server=helium&var-datasource=eqiad%20prometheus%2Fops&from=1571221426124&to=1571270399999
https://grafana.wikimedia.org/d/000000274/prometheus-machine-stats?panelId=8&fullscreen&orgId=1&from=1571221426124&to=1571270399999&var-server=backup1001&var-datasource=eqiad%20prometheus%2Fops

That is a relatively safe way to copy, as the data is already encrypted, and transfer.py will verify the md5 sums after finishing.

The copy finished correctly and actually exposed a bug in transfer.py:

ERROR: Original checksum

c89dcd766fa3072718753b9ab0bdfb7d  baculasd2/archive0055
bac1b34fd88623746fb6f7230cd375fd  baculasd2/MegaSAS.log
04188e643713b33a2b8b724dfed5fe0a  baculasd2/archive0003

on helium.eqiad.wmnet is different than checksum 

bac1b34fd88623746fb6f7230cd375fd  baculasd2/MegaSAS.log
c89dcd766fa3072718753b9ab0bdfb7d  baculasd2/archive0055
04188e643713b33a2b8b724dfed5fe0a  baculasd2/archive0003

on backup1001.eqiad.wmnet
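The two lists above contain identical checksum entries in a different order, so the reported mismatch is a comparison bug rather than real corruption: the outputs were compared verbatim instead of order-independently. A minimal sketch of a robust comparison (the `checksums_match` helper is hypothetical, not the actual transfer.py code):

```python
def checksums_match(local_output, remote_output):
    """Compare two md5sum-style outputs ("<hash>  <path>" per line),
    ignoring line order and blank lines."""
    def normalize(text):
        # Split each line into (hash, path) tokens and sort the entries,
        # so the comparison does not depend on enumeration order.
        return sorted(line.split() for line in text.splitlines() if line.strip())
    return normalize(local_output) == normalize(remote_output)
```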

Change 543877 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] profile::backup::host: Add the ability to configure ferm rules

https://gerrit.wikimedia.org/r/543877

Change 543877 merged by Alexandros Kosiaris:
[operations/puppet@production] profile::backup::host: Add the ability to configure ferm rules

https://gerrit.wikimedia.org/r/543877

This may be interesting for our physical migration, in a worst-case scenario:

2.5 Maintaining a Valid Bootstrap File

By using a WriteBootstrap record in each of your Director's Job resources, you can constantly maintain a bootstrap file that will enable you to recover the state of your system as of the last backup without having the Bacula catalog. This permits you to more easily recover from a disaster that destroys your Bacula catalog. When a Job resource has a WriteBootstrap record, Bacula will maintain the designated file (normally on another system but mounted by NFS) with up to date information necessary to restore your system. For example, in my Director's configuration file, I have the following record:

Write Bootstrap = "/mnt/deuter/files/backup/client-name.bsr"

We may have to do the migration earlier than we thought due to T235838.

Change 544665 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backup: Migrate bacula director from helium to backup1001

https://gerrit.wikimedia.org/r/544665