
Backups for arclamp application data
Closed, ResolvedPublic

Description

This follows the T235425 service alert, in which data was lost (and, fortunately, has since been regenerated from a source). It served as a reminder that we have no backups of the webperf servers.

Event Timeline

Krinkle created this task. Oct 15 2019, 9:50 AM
Restricted Application added a subscriber: Aklapper. Oct 15 2019, 9:50 AM

Change 543005 had a related patch set uploaded (by Krinkle; owner: Dzahn):
[operations/puppet@production] webperf: add backups for arclamp application data

https://gerrit.wikimedia.org/r/543005

Change 543005 merged by Dzahn:
[operations/puppet@production] webperf: add backups for arclamp application data

https://gerrit.wikimedia.org/r/543005

Dzahn added a subscriber: Dzahn. Oct 23 2019, 9:30 PM

webperf1002 and webperf2002 are now "backup::host"s.

That means Puppet has installed and configured the Bacula file daemon on them (/etc/bacula/bacula-fd.conf).

On the Bacula server (director) side:

  • a new file set defined in /etc/bacula/conf.d/fileset-arclamp-application-data.conf that says "arclamp-application-data" means File = /srv/xenon/
  • a new client config for webperf1002 /etc/bacula/clients.d/webperf1002.eqiad.wmnet.conf which contains passwords/certs to talk to the server
  • a new Bacula job (/etc/bacula/jobs.d/webperf1002.eqiad.wmnet-arclamp-application-data-Monthly-1st-Wed-production.conf) that says the backup frequency is JobDefs = "Monthly-1st-Wed-production"
  • equivalent for webperf2002 ...
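The three director-side pieces listed above follow a visible naming pattern. As a rough illustration only, the pattern can be sketched in Python; the function name director_config_paths and its parameters are hypothetical, and the layout is inferred from the paths quoted in this comment rather than taken from the Puppet code.

```python
# Hypothetical helper: reproduces the director-side file names quoted above.
# The path layout is an assumption inferred from this comment, not from Puppet.
def director_config_paths(fqdn, fileset, jobdefs, base="/etc/bacula"):
    """Return the fileset, client and job config paths for one backed-up host."""
    return {
        "fileset": f"{base}/conf.d/fileset-{fileset}.conf",
        "client": f"{base}/clients.d/{fqdn}.conf",
        "job": f"{base}/jobs.d/{fqdn}-{fileset}-{jobdefs}.conf",
    }


PATHS = director_config_paths(
    "webperf1002.eqiad.wmnet",
    "arclamp-application-data",
    "Monthly-1st-Wed-production",
)

for kind, path in PATHS.items():
    print(f"{kind}: {path}")
```

The fileset is shared by both hosts, while the client and job files are per host.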

The hosts should soon show up in bconsole, but I will confirm that later.

Dzahn added a subscriber: jcrespo. Oct 23 2019, 11:37 PM

The size of /srv/xenon on webperf1002/2002 was 127GB as of today. The current Bacula server (helium) had 4.8T available and a new one (backup1001) is coming soon.

Confirmed in bconsole: the new hosts are listed after typing restore and then selecting option 5 ("Select the most recent backup for a client"):

158: webperf1002.eqiad.wmnet-fd
159: webperf2002.codfw.wmnet-fd

After selecting a client we currently get 'No FileSet found for client "webperf1002.eqiad.wmnet-fd"', but that should change soon after the first backups are done.

cc: @jcrespo Just FYI that I merged that change and added this. Puppet on helium was active.

Gilles assigned this task to dpifke. Jan 7 2020, 11:28 AM
jcrespo added a comment. Edited Jan 9 2020, 4:10 PM

helium is no longer the backup server. You can go to https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-job=webperf1002.eqiad.wmnet-Monthly-1st-Wed-production-arclamp-application-data&from=1577127671546&to=1578586171164 to check backup behavior. Icinga will also alert if backups are not fresh and correct beyond the configured period. I can see full backups for webperf taking 215GB. As part of upcoming goal work, we will soon require assistance from service owners to automate and test the recovery of backups.

Dzahn added a comment. Edited Jan 9 2020, 9:22 PM

Nice, thank you for confirming that @jcrespo.

To test it and create a howto in the form of the pastebin below, I have started the restore of a single random logfile ("2020-01-09.excimer.thumb.log").

On webperf1002 something should soon show up under /var/tmp/bacula-restores. If it does, we can call this resolved, I think.

ssh backup1001.eqiad.wmnet

backup1001 is a Backup server (backup)

[backup1001:~] $ sudo bconsole
Connecting to Director backup1001.eqiad.wmnet:9101
1000 OK: 103 backup1001.eqiad.wmnet Version: 9.4.2 (04 February 2019)
Enter a period to cancel a command.
*restore
Automatically selected Catalog: production
Using Catalog "production"

First you select one or more JobIds that contain files
to be restored. You will be presented several methods
of specifying the JobIds. Then you will be allowed to
select which files from those JobIds are to be restored.

To select the JobIds, you have the following choices:
     1: List last 20 Jobs run
     2: List Jobs where a given File is saved
     3: Enter list of comma separated JobIds to select
     4: Enter SQL list command
     5: Select the most recent backup for a client
     6: Select backup for a client before a specified time
     7: Enter a list of files to restore
     8: Enter a list of files to restore before a specified time
     9: Find the JobIds of the most recent backup for a client
    10: Find the JobIds for a backup for a client before a specified time
    11: Enter a list of directories to restore for found JobIds
    12: Select full restore to a specified Job date
    13: Cancel
Select item: (1-13): 5

Defined Clients:
     1: an-master1002.eqiad.wmnet-fd
     2: analytics1002.eqiad.wmnet-fd
     3: analytics1029.eqiad.wmnet-fd
...
   165: wasat.codfw.wmnet-fd
   166: webperf1002.eqiad.wmnet-fd
   167: webperf2002.codfw.wmnet-fd
   168: wezen.codfw.wmnet-fd
...

Select the Client (1-170): 166
Automatically selected FileSet: arclamp-application-data
+---------+-------+----------+-----------------+---------------------+----------------+
| JobId   | Level | JobFiles | JobBytes        | StartTime           | VolumeName     |
+---------+-------+----------+-----------------+---------------------+----------------+
| 192,344 | F     |   13,953 | 216,479,256,864 | 2020-01-01 02:37:09 | production0076 |
| 192,344 | F     |   13,953 | 216,479,256,864 | 2020-01-01 02:37:09 | production0077 |
| 192,434 | I     |      112 |     946,839,568 | 2020-01-01 04:45:39 | production0077 |
| 192,563 | I     |      781 |   7,530,853,376 | 2020-01-02 05:02:32 | production0078 |
| 192,690 | I     |      790 |   7,414,383,200 | 2020-01-03 04:34:17 | production0078 |
| 192,816 | I     |      763 |   8,069,507,120 | 2020-01-04 04:35:04 | production0078 |
| 192,940 | I     |      799 |   7,728,817,200 | 2020-01-05 04:55:43 | production0079 |
| 193,066 | I     |      764 |   8,527,912,800 | 2020-01-06 04:46:25 | production0079 |
| 194,152 | I     |      946 |   9,896,843,104 | 2020-01-07 09:23:41 | production0080 |
| 194,266 | I     |      669 |   8,017,165,936 | 2020-01-08 05:01:12 | production0080 |
| 194,379 | I     |      801 |   8,146,442,368 | 2020-01-09 04:41:07 | production0080 |
+---------+-------+----------+-----------------+---------------------+----------------+
You have selected the following JobIds: 192344,192434,192563,192690,192816,192940,193066,194152,194266,194379

Building directory tree for JobId(s) 192344,192434,192563,192690,192816,192940,193066,194152,194266,194379 ... ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
19,816 files inserted into the tree.

You are now entering file selection mode where you add (mark) and
remove (unmark) files to be restored. No files are initially added, unless
you used the "all" keyword on the command line.
Enter "done" to leave this mode.

cwd is: /
$ cd /srv/xenon/logs/daily
cwd is: /srv/xenon/logs/daily/
$ ls
...
$ m 2020-01-09.excimer.thumb.log
1 file marked.
...

$ done
Bootstrap records written to /var/lib/bacula/backup1001.eqiad.wmnet.restore.4205.bsr

The Job will require the following (*=>InChanger):
   Volume(s)        Storage(s)                         SD Device(s)
===========================================================================

   production0080   backup1001-FileStorageProduction   FileStorageProduction

Volumes marked with "*" are in the Autochanger.

1 file selected to be restored.

Using Catalog "production"
Run Restore job
JobName:         RestoreFiles
Bootstrap:       /var/lib/bacula/backup1001.eqiad.wmnet.restore.4205.bsr
Where:           /var/tmp/bacula-restores
Replace:         Always
FileSet:         root
Backup Client:   webperf1002.eqiad.wmnet-fd
Restore Client:  webperf1002.eqiad.wmnet-fd
Storage:         backup1001-FileStorageProduction
When:            2020-01-09 21:15:17
Catalog:         production
Priority:        1
Plugin Options:  *None*
OK to run? (yes/mod/no): yes
Job queued. JobId=194398
*
*exit
---

ssh webperf1002.eqiad.wmnet
cd /var/tmp/bacula-restores
ls
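The job table in the session above is the restore chain bconsole picked for "most recent backup": one full backup (level F) plus every incremental (level I) taken since. As a rough aid for reading it, the Python sketch below totals the rows per level. The rows are copied from the table; the function name summarize is hypothetical. Job 192,344 is listed twice because it spans two volumes, so the sketch counts each JobId only once.

```python
# Rows copied from the bconsole job table above:
# (JobId, Level, JobFiles, JobBytes, VolumeName)
ROWS = [
    (192344, "F", 13953, 216_479_256_864, "production0076"),
    (192344, "F", 13953, 216_479_256_864, "production0077"),
    (192434, "I", 112, 946_839_568, "production0077"),
    (192563, "I", 781, 7_530_853_376, "production0078"),
    (192690, "I", 790, 7_414_383_200, "production0078"),
    (192816, "I", 763, 8_069_507_120, "production0078"),
    (192940, "I", 799, 7_728_817_200, "production0079"),
    (193066, "I", 764, 8_527_912_800, "production0079"),
    (194152, "I", 946, 9_896_843_104, "production0080"),
    (194266, "I", 669, 8_017_165_936, "production0080"),
    (194379, "I", 801, 8_146_442_368, "production0080"),
]


def summarize(rows):
    """Return {level: (job_count, total_files, total_bytes)}, one entry per JobId."""
    jobs = {}
    for job_id, level, files, nbytes, _volume in rows:
        # A job that spans several volumes appears once per volume; keep one copy.
        jobs[job_id] = (level, files, nbytes)

    totals = {}
    for level, files, nbytes in jobs.values():
        count, total_files, total_bytes = totals.get(level, (0, 0, 0))
        totals[level] = (count + 1, total_files + files, total_bytes + nbytes)
    return totals


TOTALS = summarize(ROWS)
for level, (count, files, nbytes) in sorted(TOTALS.items()):
    print(f"{level}: {count} job(s), {files} files, {nbytes / 1e9:.1f} GB")
```

The per-job file counts add up to more than the 19,816 files in the directory tree, presumably because a file that changed on several days is backed up by several incrementals but appears only once in the tree.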

Dzahn added a comment. Jan 9 2020, 9:25 PM

Here we go, the example file has been restored.

[webperf1002:/var/tmp/bacula-restores/srv/xenon/logs] $ sudo file daily/2020-01-09.excimer.thumb.log 
daily/2020-01-09.excimer.thumb.log: ASCII text, with very long lines
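The restored copy keeps its full original path below the restore job's "Where" directory, which is why the file sits under /var/tmp/bacula-restores/srv/xenon/logs/daily. A minimal Python sketch of that mapping, assuming the default "Where" from the restore job; the helper name restored_path is hypothetical:

```python
from pathlib import PurePosixPath

# "Where" value shown in the restore job summary.
RESTORE_ROOT = "/var/tmp/bacula-restores"


def restored_path(original, where=RESTORE_ROOT):
    """Return the location of a restored file, given its original absolute path."""
    source = PurePosixPath(original)
    if not source.is_absolute():
        raise ValueError(f"expected an absolute path, got {original!r}")
    # The original path is re-rooted under the restore directory.
    return str(PurePosixPath(where) / source.relative_to("/"))


RESTORED = restored_path("/srv/xenon/logs/daily/2020-01-09.excimer.thumb.log")
print(RESTORED)
```

Restoring beside the live data rather than over it lets the restored copy be compared with the original before anything is replaced.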

@Krinkle ^ resolved?

Krinkle closed this task as Resolved. Jan 9 2020, 10:32 PM

Yes.