Tracking task for the Q1+Q2 2019-2020 SRE goal
- Deploy new Bacula hardware
- Transfer ownership and knowledge of Bacula backup infrastructure
- Migrate general backup service from old to new host(s)
- Set up basic backup monitoring
Change 537130 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backup: Reinstall backup1001 and backup2001 as buster
Change 537130 merged by Jcrespo:
[operations/puppet@production] backup: Reinstall backup1001 and backup2001 as buster
Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts:
['backup2001.codfw.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201909161507_jynus_27549.log.
Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts:
['backup2001.codfw.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201909161514_jynus_29861.log.
Maybe, even if it is not strictly needed, it wouldn't hurt to get the latest firmware and BIOS updates before the host finally goes into production, as doing it later will be much more difficult.
On another note, I think the RAID controller now gets a random device id, so the boot installer failed. I am not sure we will be able to install it without a custom partman recipe, and it may show up with a different id on eqiad and codfw.
Sadly, I cannot set up the RAID remotely, because the server no longer boots and the mgmt interface says:
Unified Server Configurator does not support console redirection.
Change 537325 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] Add partman recipe equal to raid1-lvm-ext4-srv but with an additional hwraid
Change 537325 merged by Jcrespo:
[operations/puppet@production] Add partman recipe equal to raid1-lvm-ext4-srv but with an additional hwraid
Change 537336 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] install_server: Update comment on partman recipe
Change 537336 merged by Jcrespo:
[operations/puppet@production] install_server: Update partman recipe to set / on last disks
Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts:
['backup1001.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201909180842_jynus_136869.log.
backup1001 was also set up; however, there is still a missing disk: T232882#5502241. Separating enclosures into different logical drives is going to pay off earlier than anticipated, as it may require rebuilding the virtual disk.
@akosiaris I would like to prevent accidental reimaging of these servers (we suffered from that once when a board change reset the boot order, but it could also happen due to human error). We do this on databases with a special recipe, so that a puppet change is required to re-enable reimaging, but I am open to better solutions.
We may need some firmware updates, but the hardware is ready to go as soon as the background RAID initialization finishes on array2 of backup1001. Both hosts are installed with buster.
Next: Puppet.
Change 537928 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Apply no-srv-format recipe to backup hosts
Change 538042 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Make bacula db parameters configurable on hiera
Change 538042 merged by Jcrespo:
[operations/puppet@production] bacula: Make bacula db parameters configurable on hiera
Mentioned in SAL (#wikimedia-operations) [2019-09-20T08:52:35Z] <jynus> creating new database on m1 "bacula9" T229209
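For reference, this is roughly the shape a hiera-configurable database setup takes on the puppet side. It is only a sketch; the class and parameter names, template path and file mode are assumptions on my part, not the actual operations/puppet code:

# Sketch (hypothetical names): database parameters become class parameters,
# which puppet resolves from hiera via automatic parameter lookup, e.g.
#   bacula::director::dbname: 'bacula9'   (the database just created on m1)
class bacula::director (
  String       $dbpass,                  # no default: supplied from private hiera
  Stdlib::Host $dbhost = 'localhost',
  String       $dbname = 'bacula9',
  String       $dbuser = 'bacula',
) {
  file { '/etc/bacula/bacula-dir.conf':
    ensure  => file,
    owner   => 'bacula',
    group   => 'bacula',
    mode    => '0400',                                 # assumed mode
    content => template('bacula/bacula-dir.conf.erb'), # template reads the db* parameters
  }
}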
Change 538175 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Install bacula-sd instead of the sql variant on buster
Change 538175 merged by Jcrespo:
[operations/puppet@production] backups: Install bacula-sd instead of the sql variant on buster
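For context, the change boils down to a conditional package selection along these lines (just a sketch: the pre-buster package name below is an assumption on my part):

# Sketch: buster ships a single bacula-sd package, while older releases used
# catalog-specific variants (the exact old name here is assumed).
$sd_package = $facts['os']['distro']['codename'] ? {
  'buster' => 'bacula-sd',
  default  => 'bacula-sd-pgsql',
}

package { $sd_package:
  ensure => installed,
}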
Change 538236 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Fix wrong dependency on buster bacula storage servers
Change 538236 merged by Jcrespo:
[operations/puppet@production] backups: Fix wrong dependency on buster bacula storage servers
Change 537928 merged by Jcrespo:
[operations/puppet@production] backups: Apply no-srv-format recipe to backup hosts
Almost there, one error left:
Config error: Cannot open config file "/etc/bacula/bacula-sd.conf": Permission denied
Change 538239 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Change owner of /etc/bacula/bacula-sd.conf to bacula
Change 538239 merged by Jcrespo:
[operations/puppet@production] backups: Change file owner of bacula storage&director config
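The fix is essentially a matter of ownership on the storage daemon's config resource; a minimal sketch (mode, template path and service name are assumptions):

# Sketch: bacula-sd runs as the 'bacula' user, so it must be able to read its
# own configuration; a root-owned file with a restrictive mode produced the
# "Permission denied" error above.
file { '/etc/bacula/bacula-sd.conf':
  ensure  => file,
  owner   => 'bacula',
  group   => 'bacula',
  mode    => '0400',                                  # assumed
  content => template('bacula/bacula-sd.conf.erb'),   # assumed template path
  notify  => Service['bacula-sd'],                    # assumed service name
}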
Change 541205 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Add conditional storage device setup
Change 541205 merged by Jcrespo:
[operations/puppet@production] bacula: Add conditional storage device setup
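I have not gone through the merged patch line by line, but the general idea of a conditional device setup can be sketched like this (all class, define and parameter names below are hypothetical):

# Sketch: storage devices are declared from a per-host hash, so hosts with the
# new layout (e.g. a separate archive array) get additional Device definitions
# while older hosts keep their previous single-device setup.
class bacula::storage (
  Hash[String, Hash] $device_definitions = {},
) {
  $device_definitions.each |String $device, Hash $params| {
    bacula::storage::device { $device:  # hypothetical define rendering a Device {} block
      * => $params,
    }
  }
}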
Change 541209 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Remove old storage setup layout and increase concurrency
Change 541517 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Change pool/storage names for new bacula director
Change 541517 merged by Jcrespo:
[operations/puppet@production] bacula: Change pool/storage names for new bacula director
Change 541523 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Force install bacula-director, not a dependency on buster
I finally got the director running, but sadly it won't start without any devices or clients provisioned, so I created a duplicate of the ones puppet may create:
root@backup1001:/srv/local$ bconsole
Connecting to Director backup1001.eqiad.wmnet:9101
1000 OK: 103 backup1001.eqiad.wmnet Version: 9.4.2 (04 February 2019)
Enter a period to cancel a command.
*status
Status available for:
     1: Director
     2: Storage
     3: Client
     4: Scheduled
     5: Network
     6: All
Select daemon type for status (1-6): 1
backup1001.eqiad.wmnet Version: 9.4.2 (04 February 2019) x86_64-pc-linux-gnu debian buster/sid
Daemon started 08-Oct-19 11:06, conf reloaded 08-Oct-2019 11:06:21
 Jobs: run=0, running=0 mode=0,0
 Heap: heap=356,352 smbytes=151,818 max_bytes=152,174 bufs=772 max_bufs=779
 Res: njobs=3 nclients=1 nstores=4 npools=4 ncats=1 nfsets=53 nscheds=21

Scheduled Jobs:
Level          Type     Pri  Scheduled          Job Name           Volume
===================================================================================
No Scheduled Jobs.
====

Running Jobs:
Console connected using TLS at 08-Oct-19 11:07
No Jobs running.
====

No Terminated Jobs.
====
In order for it to work without a test config, directly with puppet, we need to point both the new storages and at least one client to the director. I will check if our management setup allows for more than one director in parallel.
@akosiaris We have reached an impasse. We should:
- Plan for the migration - which we may have different philosophies on how we want to achieve it. Let's sync on IRC on how to follow up.
Conditionals on puppet are getting harder and harder to manage, and we should do at least one of the 2 above soon.
This is also something you will want, as it will finally free you from this OKR! :-D
Reminder:
# TODO The IPv6 IP should be converted into a DNS AAAA resolve once we
# enabled the DNS record on the director
Also I wonder if naming the job file configuration:
cat vega.codfw.wmnet-rt-static-Monthly-1st-Sat-production.conf
but the name of the jobs:
Name = "vega.codfw.wmnet-Monthly-1st-Sat-production-rt-static"
was on purpose.
Sorry I missed that, thanks for pinging me on T234900.
Sure, sounds fine to me.
- Plan for the migration - which we may have different philosophies on how we want to achieve it. Let's sync on IRC on how to follow up.
Conditionals on puppet are getting harder and harder to manage, and we should do at least one of the 2 above soon.
Probably both, let's indeed sync up on IRC
Heh, quite an old TODO. Arguably that should be moved into hiera and augmented to have both directors. I can help with that.
Also I wonder if naming the job file configuration:
cat vega.codfw.wmnet-rt-static-Monthly-1st-Sat-production.conf
but the name of the jobs:
Name = "vega.codfw.wmnet-Monthly-1st-Sat-production-rt-static"
was on purpose.
No, I don't think so. From what I can see, it has been there since at least 2014 (a38742ee86b) and is the result of passing
"${name}-${real_jobdefaults}"
instead of
"${real_jobdefaults}-${name}"
Since bacula::client::job does not use that for anything other than the name of the file, we can swap the two above with zero issues for the system. A file will disappear, but another will appear with exactly the same contents, and since the names of the files are unimportant, it should be just fine.
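To make the swap concrete, here is a simplified sketch of the relevant part of the define (the path and template name are assumptions; only the file title changes):

# Sketch: the interpolated string is only used as the title of the managed
# file, so swapping the two components renames the file without touching the
# rendered job definition.
define bacula::client::job (
  String $real_jobdefaults,
) {
  file { "/etc/bacula/jobs.d/${real_jobdefaults}-${name}.conf":  # was "${name}-${real_jobdefaults}"
    ensure  => file,
    content => template('bacula/client/job.erb'),  # assumed template; unchanged by the swap
  }
}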
Mentioned in SAL (#wikimedia-operations) [2019-10-16T13:56:32Z] <jynus> reenabling puppet on helium T229209
Change 541523 merged by Jcrespo:
[operations/puppet@production] bacula: Force install bacula-director, not a dependency on buster
Change 543489 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Fix error on bacula director install for older hosts
Change 543489 merged by Jcrespo:
[operations/puppet@production] bacula: Fix error on bacula director install for older hosts
I have discussed a plan with Alex; there is a preliminary, but tentative, suggestion of steps in the design document (more like a diary).
For now I have left running on cumin1001:
transfer.py --no-compress --no-encrypt helium.eqiad.wmnet:/srv/baculasd2 backup1001.eqiad.wmnet:/srv/archive
https://grafana.wikimedia.org/d/000000274/prometheus-machine-stats?panelId=8&fullscreen&orgId=1&var-server=helium&var-datasource=eqiad%20prometheus%2Fops&from=1571221426124&to=1571270399999
https://grafana.wikimedia.org/d/000000274/prometheus-machine-stats?panelId=8&fullscreen&orgId=1&from=1571221426124&to=1571270399999&var-server=backup1001&var-datasource=eqiad%20prometheus%2Fops
That is a relatively safe way to copy, as the data is already encrypted, and the transfer will verify its md5 sums after finishing.
The copy finished correctly and actually surfaced a bug in transfer.py: the per-file checksums match, but they are listed in a different order on each host, so the comparison reports a spurious mismatch:
ERROR: Original checksum
c89dcd766fa3072718753b9ab0bdfb7d  baculasd2/archive0055
bac1b34fd88623746fb6f7230cd375fd  baculasd2/MegaSAS.log
04188e643713b33a2b8b724dfed5fe0a  baculasd2/archive0003
on helium.eqiad.wmnet is different than checksum
bac1b34fd88623746fb6f7230cd375fd  baculasd2/MegaSAS.log
c89dcd766fa3072718753b9ab0bdfb7d  baculasd2/archive0055
04188e643713b33a2b8b724dfed5fe0a  baculasd2/archive0003
on backup1001.eqiad.wmnet
Change 543877 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] profile::backup::host: Add the ability to configure ferm rules
Change 543877 merged by Alexandros Kosiaris:
[operations/puppet@production] profile::backup::host: Add the ability to configure ferm rules
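Conceptually, the rule this enables on a backup client looks something like the sketch below (the parameter name, default director list and port are assumptions); resolving both A and AAAA records this way would also cover the hardcoded-IPv6 TODO quoted earlier:

# Sketch: allow the director(s) to reach the file daemon on this host,
# resolving both A and AAAA records instead of hardcoding addresses.
class profile::backup::host (
  Array[Stdlib::Fqdn] $ferm_directors = ['helium.eqiad.wmnet', 'backup1001.eqiad.wmnet'],
) {
  $directors = join($ferm_directors, ' ')
  ferm::service { 'bacula-file-daemon':
    proto  => 'tcp',
    port   => '9102',  # standard bacula-fd port
    srange => "(@resolve((${directors})) @resolve((${directors}), AAAA))",
  }
}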
This may be interesting for our physical migration, in a worst-case scenario:
2.5 Maintaining a Valid Bootstrap File
By using a WriteBootstrap record in each of your Director's Job resources, you can constantly maintain a bootstrap file that will enable you to recover the state of your system as of the last backup without having the Bacula catalog. This permits you to more easily recover from a disaster that destroys your Bacula catalog. When a Job resource has a WriteBootstrap record, Bacula will maintain the designated file (normally on another system but mounted by NFS) with up to date information necessary to restore your system. For example, in my Director's configuration file, I have the following record:
Write Bootstrap = "/mnt/deuter/files/backup/client-name.bsr"
We may have to do the migration earlier than we thought due to T235838.
Change 544665 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backup: Migrate bacula director from helium to backup1001
Change 545567 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloud-vps: stub out the (unused-on-VMs) profile::backup::ferm_directors
Change 545567 merged by Andrew Bogott:
[operations/puppet@production] cloud-vps: stub out some unused-on-vms puppetmaster bits
Change 544665 merged by Jcrespo:
[operations/puppet@production] backup: Migrate bacula director from helium to backup1001
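For the record, on the client and storage side the cut-over presumably reduces to switching which director the daemons reference and letting puppet regenerate the configs; a minimal sketch (class and parameter names are assumptions):

# Sketch: everything references the director by fqdn, so the migration is
# essentially flipping this value and running puppet across the fleet.
$director = 'backup1001.eqiad.wmnet'  # previously 'helium.eqiad.wmnet'

class { 'bacula::storage':
  director => $director,
}

class { 'bacula::client':
  director => $director,
}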
Change 548585 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula-director: remove unused /a/sqldata, /mnt/a filesets
Change 548585 merged by Jcrespo:
[operations/puppet@production] bacula-director: remove unused /a/sqldata, /mnt/a filesets
Change #541209 abandoned by Jcrespo:
[operations/puppet@production] bacula: Remove old storage setup layout and increase concurrency
Reason:
I checked and this was handled somewhere else; it was split into multiple files.