
Strengthen backup infrastructure and support
Open, Normal, Public

Description

Tracking task for the Q1 2019-2020 SRE goal

  • Deploy new Bacula hardware
  • Transfer ownership and knowledge of Bacula backup infrastructure
  • [stretch] Migrate general backup service from old to new host(s)

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Jul 29 2019, 8:50 AM
akosiaris triaged this task as Normal priority. · Jul 29 2019, 8:50 AM
Marostegui moved this task from Triage to Meta/Epic on the DBA board.Jul 29 2019, 8:51 AM

Change 537130 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backup: Reinstall backup1001 and backup2001 as buster

https://gerrit.wikimedia.org/r/537130

Change 537130 merged by Jcrespo:
[operations/puppet@production] backup: Reinstall backup1001 and backup2001 as buster

https://gerrit.wikimedia.org/r/537130

Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts:

['backup2001.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201909161507_jynus_27549.log.

Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts:

['backup2001.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201909161514_jynus_29861.log.

Got stuck at kernel boot, could it be the same issue as T216240?


Maybe. Even if it is not, it wouldn't hurt to get the latest firmware and BIOS updates before the host finally goes into production, as it will be much more difficult later.


+1, pretty sure it's the same issue as before in T216240.

On another note, I think the RAID controller now has a random device id, so the installer failed at boot. I am not sure we will be able to install it without a custom partman recipe, and it may show up with a different id on eqiad and codfw.

Sadly, I cannot set up the RAID remotely, because the server no longer boots and the mgmt interface says:

Unified Server Configurator does not support console redirection.

Change 537325 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] Add partman recipe equal to raid1-lvm-ext4-srv but with an additional hwraid

https://gerrit.wikimedia.org/r/537325

Change 537325 merged by Jcrespo:
[operations/puppet@production] Add partman recipe equal to raid1-lvm-ext4-srv but with an additional hwraid

https://gerrit.wikimedia.org/r/537325

Change 537336 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] install_server: Update comment on partman recipe

https://gerrit.wikimedia.org/r/537336

Change 537336 merged by Jcrespo:
[operations/puppet@production] install_server: Update partman recipe to set / on last disks

https://gerrit.wikimedia.org/r/537336

Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts:

['backup1001.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201909180842_jynus_136869.log.

Completed auto-reimage of hosts:

['backup1001.eqiad.wmnet']

and were ALL successful.

backup1001 was also set up; however, there is still a missing disk: T232882#5502241. Separating enclosures into different logical drives is going to pay off earlier than anticipated, as it may require rebuilding the virtual disk.

@akosiaris I would like to prevent accidental reimages of these servers (we suffered one when a board change reset the boot order, but it could also happen due to human error). We do this on databases with a special recipe that forces a puppet change to re-enable reimaging, but I'm open to better solutions.


Sounds fine to me.

jcrespo updated the task description. (Show Details) · Sep 19 2019, 8:35 AM

We may need some firmware updates, but the hardware is ready to go as soon as background RAID initialization finishes on array2 of backup1001. Hosts are installed with buster.

Next: Puppet.

Change 537928 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Apply no-srv-format recipe to backup hosts

https://gerrit.wikimedia.org/r/537928

Change 538042 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Make bacula db parameters configurable on hiera

https://gerrit.wikimedia.org/r/538042

Change 538042 merged by Jcrespo:
[operations/puppet@production] bacula: Make bacula db parameters configurable on hiera

https://gerrit.wikimedia.org/r/538042

Mentioned in SAL (#wikimedia-operations) [2019-09-20T08:52:35Z] <jynus> creating new database on m1 "bacula9" T229209

Change 538175 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Install bacula-sd instead of the sql variant on buster

https://gerrit.wikimedia.org/r/538175

Change 538175 merged by Jcrespo:
[operations/puppet@production] backups: Install bacula-sd instead of the sql variant on buster

https://gerrit.wikimedia.org/r/538175

Change 538236 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Fix wrong dependency on buster bacula storage servers

https://gerrit.wikimedia.org/r/538236

Change 538236 merged by Jcrespo:
[operations/puppet@production] backups: Fix wrong dependency on buster bacula storage servers

https://gerrit.wikimedia.org/r/538236

Change 537928 merged by Jcrespo:
[operations/puppet@production] backups: Apply no-srv-format recipe to backup hosts

https://gerrit.wikimedia.org/r/537928

Almost there:

Config error: Cannot open config file "/etc/bacula/bacula-sd.conf": Permission denied

Change 538239 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Change owner of /etc/bacula/bacula-sd.conf to bacula

https://gerrit.wikimedia.org/r/538239

Change 538239 merged by Jcrespo:
[operations/puppet@production] backups: Change file owner of bacula storage&director config

https://gerrit.wikimedia.org/r/538239
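This class of startup failure can be caught before bacula-sd is even restarted by checking that the daemon's user can read its config file. A minimal sketch (the `readable_by` helper is hypothetical and only checks owner/group/other permission bits, not ACLs or supplementary groups):

```python
import os
import pwd
import stat


def readable_by(path, username):
    """Rough check: can `username` read `path` via its owner/group/other bits?

    A sketch only -- ignores ACLs and supplementary group membership.
    """
    st = os.stat(path)
    user = pwd.getpwnam(username)
    if st.st_uid == user.pw_uid:
        return bool(st.st_mode & stat.S_IRUSR)
    if st.st_gid == user.pw_gid:
        return bool(st.st_mode & stat.S_IRGRP)
    return bool(st.st_mode & stat.S_IROTH)
```

For example, `readable_by("/etc/bacula/bacula-sd.conf", "bacula")` returning False would reproduce the "Permission denied" error above before the daemon does.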

Change 541205 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Add conditional storage device setup

https://gerrit.wikimedia.org/r/541205

Change 541205 merged by Jcrespo:
[operations/puppet@production] bacula: Add conditional storage device setup

https://gerrit.wikimedia.org/r/541205

Change 541209 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Remove old storage setup layout and increase concurrency

https://gerrit.wikimedia.org/r/541209

Change 541517 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Change pool/storage names for new bacula director

https://gerrit.wikimedia.org/r/541517

Change 541517 merged by Jcrespo:
[operations/puppet@production] bacula: Change pool/storage names for new bacula director

https://gerrit.wikimedia.org/r/541517

Change 541523 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Force install bacula-director, not a dependency on buster

https://gerrit.wikimedia.org/r/541523

I finally got the director running, but sadly it won't start without any devices or clients provisioned, so I created duplicates of the ones puppet may create:

root@backup1001:/srv/local$ bconsole
Connecting to Director backup1001.eqiad.wmnet:9101
1000 OK: 103 backup1001.eqiad.wmnet Version: 9.4.2 (04 February 2019)
Enter a period to cancel a command.
*status
Status available for:
     1: Director
     2: Storage
     3: Client
     4: Scheduled
     5: Network
     6: All
Select daemon type for status (1-6): 1
backup1001.eqiad.wmnet Version: 9.4.2 (04 February 2019) x86_64-pc-linux-gnu debian buster/sid
Daemon started 08-Oct-19 11:06, conf reloaded 08-Oct-2019 11:06:21
 Jobs: run=0, running=0 mode=0,0
 Heap: heap=356,352 smbytes=151,818 max_bytes=152,174 bufs=772 max_bufs=779
 Res: njobs=3 nclients=1 nstores=4 npools=4 ncats=1 nfsets=53 nscheds=21

Scheduled Jobs:
Level          Type     Pri  Scheduled          Job Name           Volume
===================================================================================
No Scheduled Jobs.
====

Running Jobs:
Console connected using TLS at 08-Oct-19 11:07
No Jobs running.
====
No Terminated Jobs.
====

In order for it to work without a test config, directly with puppet, we need to point the director at both the new storages and at least one client. I will check whether our management setup allows for more than one director in parallel.

jcrespo added a comment. · Edited · Fri, Oct 11, 9:25 AM

@akosiaris We have reached an impasse. We should:

  • Run puppet with the new permissions on the current bacula host, fix any issues found.
  • Plan for the migration; we may have different philosophies on how we want to achieve it. Let's sync on IRC on how to follow up.

Conditionals in puppet are getting harder and harder to manage, and we should do at least one of the two above soon.

This is also something you will want as it will finally free you from this OKR at last! :-D

jcrespo added a comment. · Edited · Wed, Oct 16, 10:51 AM

Reminder:

# TODO The IPv6 IP should be converted into a DNS AAAA resolve once we
# enabled the DNS record on the director

Also I wonder if naming the job file configuration:

cat vega.codfw.wmnet-rt-static-Monthly-1st-Sat-production.conf

but the name of the jobs:

vega.codfw.wmnet-rt-static-Monthly-1st-Sat-production.conf
Name = "vega.codfw.wmnet-Monthly-1st-Sat-production-rt-static"

was on purpose.

Sorry I missed that, thanks for pinging me on T234900.

> @akosiaris We have reached an impasse. We should:
>
>   • Run puppet with the new permissions on the current bacula host, fix any issues found.

Sure, sounds fine to me.

>   • Plan for the migration; we may have different philosophies on how we want to achieve it. Let's sync on IRC on how to follow up.
>
> Conditionals in puppet are getting harder and harder to manage, and we should do at least one of the two above soon.

Probably both; let's indeed sync up on IRC.

> Reminder:
>
> # TODO The IPv6 IP should be converted into a DNS AAAA resolve once we
> # enabled the DNS record on the director

Heh, that's an old TODO. Arguably it should be moved into hiera and augmented to cover both directors. I can help with that.
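For reference, resolving the director's AAAA record instead of hardcoding the IPv6 address could look roughly like this (a sketch only; the injectable `resolver` parameter exists just to make it testable offline, and the address shown is a documentation placeholder, not a real host):

```python
import socket


def resolve_aaaa(host, resolver=socket.getaddrinfo):
    """Return the first IPv6 address for `host`, or None if it has no AAAA record."""
    try:
        results = resolver(host, None, socket.AF_INET6)
    except socket.gaierror:
        return None
    for family, _socktype, _proto, _canonname, sockaddr in results:
        if family == socket.AF_INET6:
            # For AF_INET6, sockaddr is (address, port, flowinfo, scope_id)
            return sockaddr[0]
    return None
```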

> Also I wonder if naming the job file configuration:
>
> cat vega.codfw.wmnet-rt-static-Monthly-1st-Sat-production.conf
>
> but the name of the jobs:
>
> vega.codfw.wmnet-rt-static-Monthly-1st-Sat-production.conf
> Name = "vega.codfw.wmnet-Monthly-1st-Sat-production-rt-static"
>
> was on purpose.

No, I don't think so. From what I see it has been there since at least 2014 (a38742ee86b) and is the result of passing

"${name}-${real_jobdefaults}"

instead of

"${real_jobdefaults}-${name}"

in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/backup/manifests/set.pp#12

Since bacula::client::job does not use that for anything other than the name of the file, we can swap the two above with zero impact on the system. One file will disappear and another will appear with the exact same contents, and since the names of the files are unimportant it should be just fine.

Mentioned in SAL (#wikimedia-operations) [2019-10-16T13:56:32Z] <jynus> reenabling puppet on helium T229209

Change 541523 merged by Jcrespo:
[operations/puppet@production] bacula: Force install bacula-director, not a dependency on buster

https://gerrit.wikimedia.org/r/541523

Change 543489 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Fix error on bacula director install for older hosts

https://gerrit.wikimedia.org/r/543489

Change 543489 merged by Jcrespo:
[operations/puppet@production] bacula: Fix error on bacula director install for older hosts

https://gerrit.wikimedia.org/r/543489

I have discussed a plan with Alex; there is a preliminary but timid suggestion of steps in the design (more like diary) document.

For now I have left running on cumin1001:

transfer.py --no-compress --no-encrypt helium.eqiad.wmnet:/srv/baculasd2 backup1001.eqiad.wmnet:/srv/archive

https://grafana.wikimedia.org/d/000000274/prometheus-machine-stats?panelId=8&fullscreen&orgId=1&var-server=helium&var-datasource=eqiad%20prometheus%2Fops&from=1571221426124&to=1571270399999
https://grafana.wikimedia.org/d/000000274/prometheus-machine-stats?panelId=8&fullscreen&orgId=1&from=1571221426124&to=1571270399999&var-server=backup1001&var-datasource=eqiad%20prometheus%2Fops

That is a relatively safe way to copy, as the data is already encrypted, and transfer.py will verify the md5 sums after finishing.

The copy finished correctly and actually exposed a bug in transfer.py:

ERROR: Original checksum

c89dcd766fa3072718753b9ab0bdfb7d  baculasd2/archive0055
bac1b34fd88623746fb6f7230cd375fd  baculasd2/MegaSAS.log
04188e643713b33a2b8b724dfed5fe0a  baculasd2/archive0003

on helium.eqiad.wmnet is different than checksum 

bac1b34fd88623746fb6f7230cd375fd  baculasd2/MegaSAS.log
c89dcd766fa3072718753b9ab0bdfb7d  baculasd2/archive0055
04188e643713b33a2b8b724dfed5fe0a  baculasd2/archive0003

on backup1001.eqiad.wmnet
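The two lists above contain identical checksum entries in a different order, so the reported mismatch is a comparison bug rather than real corruption: the outputs were compared verbatim instead of order-independently. A minimal sketch of a robust comparison (the `checksums_match` helper is hypothetical, not the actual transfer.py code):

```python
def checksums_match(local_output, remote_output):
    """Compare two md5sum-style outputs ("<hash>  <path>" per line),
    ignoring line order and blank lines."""
    def normalize(text):
        # Split each line into (hash, path) tokens and sort the entries,
        # so the comparison does not depend on enumeration order.
        return sorted(line.split() for line in text.splitlines() if line.strip())
    return normalize(local_output) == normalize(remote_output)
```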

Change 543877 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] profile::backup::host: Add the ability to configure ferm rules

https://gerrit.wikimedia.org/r/543877

Change 543877 merged by Alexandros Kosiaris:
[operations/puppet@production] profile::backup::host: Add the ability to configure ferm rules

https://gerrit.wikimedia.org/r/543877

This may be interesting for our physical migration, in a worst-case scenario:

2.5 Maintaining a Valid Bootstrap File

By using a WriteBootstrap record in each of your Director's Job resources, you can constantly maintain a bootstrap file that will enable you to recover the state of your system as of the last backup without having the Bacula catalog. This permits you to more easily recover from a disaster that destroys your Bacula catalog. When a Job resource has a WriteBootstrap record, Bacula will maintain the designated file (normally on another system but mounted by NFS) with up to date information necessary to restore your system. For example, in my Director's configuration file, I have the following record:

Write Bootstrap = "/mnt/deuter/files/backup/client-name.bsr"

We may have to do the migration earlier than we thought due to T235838.

Change 544665 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backup: Migrate bacula director from helium to backup1001

https://gerrit.wikimedia.org/r/544665