This follows the T235425 service alert, in which data was lost (and, fortunately, has since been regenerated from a source). It served as a reminder that we have no backups of the webperf servers.
Description
Details
Project | Branch | Lines +/- | Subject |
---|---|---|---|
operations/puppet | production | +7 -0 | webperf: add backups for arclamp application data |
Related Objects
Event Timeline
Change 543005 had a related patch set uploaded (by Krinkle; owner: Dzahn):
[operations/puppet@production] webperf: add backups for arclamp application data
Change 543005 merged by Dzahn:
[operations/puppet@production] webperf: add backups for arclamp application data
webperf1002 and webperf2002 are now "backup::host"s.
That means Puppet has installed and configured the Bacula file daemon on them (/etc/bacula/bacula-fd.conf).
On the Bacula server (director) side:
- a new file set defined in /etc/bacula/conf.d/fileset-arclamp-application-data.conf that says "arclamp-application-data" means File = /srv/xenon/
- a new client config for webperf1002 /etc/bacula/clients.d/webperf1002.eqiad.wmnet.conf which contains passwords/certs to talk to the server
- a new Bacula job (/etc/bacula/jobs.d/webperf1002.eqiad.wmnet-arclamp-application-data-Monthly-1st-Wed-production.conf) that says the backup frequency is JobDefs = "Monthly-1st-Wed-production"
- equivalent for webperf2002 ...
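For readers unfamiliar with Bacula, here is a rough sketch of what the director-side resources described above might look like. The real files are generated by Puppet and are not shown in this task, so the exact layout is an assumption. Only the file set name, the path, the JobDefs name, the client name and the job name are taken from this task; the job name is the one that appears in the Grafana URL further down.

```
# Sketch only: the Puppet-generated files may differ in layout and options.
FileSet {
  Name = "arclamp-application-data"
  Include {
    File = /srv/xenon/
  }
}

Job {
  # Job name as seen in the Grafana dashboard URL (assumed to match the director config).
  Name = "webperf1002.eqiad.wmnet-Monthly-1st-Wed-production-arclamp-application-data"
  Client = "webperf1002.eqiad.wmnet-fd"
  FileSet = "arclamp-application-data"
  # Schedule, pool, storage and similar settings are inherited from the JobDefs.
  JobDefs = "Monthly-1st-Wed-production"
}
```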
The hosts should soon show up in bconsole, but I will confirm that later.
The size of /srv/xenon on webperf1002/2002 was 127GB as of today. The current Bacula server (helium) had 4.8T available and a new one (backup1001) is coming soon.
Confirmed in bconsole: the new hosts are shown when typing "restore" and then selecting "5: Select the most recent backup for a client":
158: webperf1002.eqiad.wmnet-fd
159: webperf2002.codfw.wmnet-fd
After selecting a client, for now we get the error No FileSet found for client "webperf1002.eqiad.wmnet-fd", but that should change soon after the first backups are done.
cc: @jcrespo Just FYI that I merged that change and added this. Puppet on helium was active.
helium is no longer the backup server. You can go to https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-job=webperf1002.eqiad.wmnet-Monthly-1st-Wed-production-arclamp-application-data&from=1577127671546&to=1578586171164 to check backup behavior. Icinga will also alert if backups are not fresh and correct beyond the configured period. I can see full backups for webperf taking 215GB. As part of upcoming goal work, we will soon require assistance from service owners to automate and test the recovery of backups.
Nice, thank you for confirming that @jcrespo.
To test it and create a howto in the form of the pastebin below, I have started the restore of a single random logfile ("2020-01-09.excimer.thumb.log").
On webperf1002, something should soon show up under /var/tmp/bacula-restores. If it does, we can call this resolved, I think.
Here we go, the example file has been restored.
[webperf1002:/var/tmp/bacula-restores/srv/xenon/logs] $ sudo file daily/2020-01-09.excimer.thumb.log
daily/2020-01-09.excimer.thumb.log: ASCII text, with very long lines
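For a slightly stronger check than `file`, the restored copy could be compared byte for byte against the live file, along these lines. This is only a sketch: `same_content` is a hypothetical helper, not something that exists on the host, and the live path is assumed from the file set root (/srv/xenon/) plus the restore path above. If the live log has been written to since the backup ran, a mismatch is expected and does not mean the restore failed.

```shell
# Hypothetical helper: succeeds (exit 0) only if both files are byte-identical.
# cmp -s prints nothing; it exits 1 if the files differ and 2 on errors
# such as a missing or unreadable file.
same_content() {
    cmp -s "$1" "$2"
}

# Example use on webperf1002 (paths are assumptions, may need sudo to read):
# same_content \
#     /srv/xenon/logs/daily/2020-01-09.excimer.thumb.log \
#     /var/tmp/bacula-restores/srv/xenon/logs/daily/2020-01-09.excimer.thumb.log \
#     && echo "restored file matches" || echo "restored file differs"
```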
@Krinkle ^ resolved?