Review Bacula home backups set for stat100[56]
Closed, ResolvedPublic

Description

While reviewing the geowiki code review for Fran, I noticed these bits in the stat100[56] roles:

role::statistics::cruncher

    include ::profile::backup::host
    backup::set { 'home' : }

profile::statistics::private

    include ::profile::backup::host
    backup::set { 'home' : }

On helium I can see the following (sudo bconsole -> status -> 3 -> 74|75):

Connecting to Client stat1005.eqiad.wmnet-fd at stat1005.eqiad.wmnet:9102

stat1005.eqiad.wmnet-fd Version: 7.4.4 (20 September 2016)  x86_64-pc-linux-gnu debian 9.0
Daemon started 07-Mar-18 10:56. Jobs: run=157 running=0.
 Heap: heap=73,728 smbytes=19,608 max_bytes=363,767 bufs=83 max_bufs=116
 Sizes: boffset_t=8 size_t=8 debug=0 trace=0 mode=0 bwlimit=0kB/s
 Plugin: bpipe-fd.so

Running Jobs:
Director connected using TLS at: 03-Aug-18 11:22
No Jobs running.
====

Terminated Jobs:
 JobId  Level    Files      Bytes   Status   Finished        Name
===================================================================
101735  Incr          0         0   OK       24-Jul-18 04:29 stat1005.eqiad.wmnet-Monthly-1st-Sat-production-home
101855  Incr          0         0   OK       25-Jul-18 04:50 stat1005.eqiad.wmnet-Monthly-1st-Sat-production-home
101974  Incr          0         0   OK       26-Jul-18 05:53 stat1005.eqiad.wmnet-Monthly-1st-Sat-production-home
102093  Incr          0         0   OK       27-Jul-18 04:29 stat1005.eqiad.wmnet-Monthly-1st-Sat-production-home
102212  Incr          0         0   OK       28-Jul-18 04:45 stat1005.eqiad.wmnet-Monthly-1st-Sat-production-home
102331  Incr          0         0   OK       29-Jul-18 04:23 stat1005.eqiad.wmnet-Monthly-1st-Sat-production-home
102450  Incr          0         0   OK       30-Jul-18 04:26 stat1005.eqiad.wmnet-Monthly-1st-Sat-production-home
102570  Incr          0         0   OK       31-Jul-18 04:32 stat1005.eqiad.wmnet-Monthly-1st-Sat-production-home
102706  Incr          0         0   OK       01-Aug-18 05:38 stat1005.eqiad.wmnet-Monthly-1st-Sat-production-home
102834  Incr          0         0   OK       02-Aug-18 09:13 stat1005.eqiad.wmnet-Monthly-1st-Sat-production-home


Connecting to Client stat1006.eqiad.wmnet-fd at stat1006.eqiad.wmnet:9102

stat1006.eqiad.wmnet-fd Version: 7.4.4 (20 September 2016)  x86_64-pc-linux-gnu debian 9.0
Daemon started 07-Mar-18 10:56. Jobs: run=284 running=0.
 Heap: heap=73,728 smbytes=19,608 max_bytes=442,633 bufs=83 max_bufs=138
 Sizes: boffset_t=8 size_t=8 debug=0 trace=0 mode=0 bwlimit=0kB/s
 Plugin: bpipe-fd.so

Running Jobs:
Director connected using TLS at: 03-Aug-18 11:22
No Jobs running.
====

Terminated Jobs:
 JobId  Level    Files      Bytes   Status   Finished        Name
===================================================================
101736  Incr          0         0   OK       24-Jul-18 04:29 stat1006.eqiad.wmnet-Monthly-1st-Mon-production-home
101856  Incr          0         0   OK       25-Jul-18 04:50 stat1006.eqiad.wmnet-Monthly-1st-Mon-production-home
101975  Incr          0         0   OK       26-Jul-18 05:53 stat1006.eqiad.wmnet-Monthly-1st-Mon-production-home
102094  Incr          0         0   OK       27-Jul-18 04:29 stat1006.eqiad.wmnet-Monthly-1st-Mon-production-home
102213  Incr          0         0   OK       28-Jul-18 04:45 stat1006.eqiad.wmnet-Monthly-1st-Mon-production-home
102332  Incr          0         0   OK       29-Jul-18 04:23 stat1006.eqiad.wmnet-Monthly-1st-Mon-production-home
102451  Incr          0         0   OK       30-Jul-18 04:26 stat1006.eqiad.wmnet-Monthly-1st-Mon-production-home
102571  Incr          0         0   OK       31-Jul-18 04:32 stat1006.eqiad.wmnet-Monthly-1st-Mon-production-home
102707  Incr          0         0   OK       01-Aug-18 05:38 stat1006.eqiad.wmnet-Monthly-1st-Mon-production-home
102835  Incr          0         0   OK       02-Aug-18 09:13 stat1006.eqiad.wmnet-Monthly-1st-Mon-production-home
====

The above looks very weird: the jobs seem to be running, but Files/Bytes are zero (also, I wasn't aware of any backup set for the huge home dirs on the stat boxes). Is it a residue of an old puppet config, or am I missing something?
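The "OK but empty" pattern in the Terminated Jobs tables can also be checked mechanically. What follows is only a rough sketch: it assumes the plain eight-column row layout shown above (byte counts without a unit suffix), and the names `Job`, `parse_terminated_jobs` and `empty_ok_jobs` are hypothetical helpers, not part of Bacula or of any existing tooling.

```python
from typing import List, NamedTuple


class Job(NamedTuple):
    job_id: int
    level: str
    files: int
    size_bytes: int
    status: str
    finished: str
    name: str


def parse_terminated_jobs(text: str) -> List[Job]:
    """Parse rows of a 'Terminated Jobs' table as pasted in this task.

    Assumption: each row splits into exactly 8 whitespace-separated fields
    (JobId, Level, Files, Bytes, Status, date, time, Name). Headers,
    separators and rows in any other shape are skipped.
    """
    jobs = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 8 or not parts[0].isdigit():
            continue
        job_id, level, files, size, status, day, clock, name = parts
        try:
            jobs.append(Job(
                int(job_id),
                level,
                int(files.replace(",", "")),
                int(size.replace(",", "")),
                status,
                f"{day} {clock}",
                name,
            ))
        except ValueError:
            # Files/Bytes were not plain integers; outside this sketch's scope.
            continue
    return jobs


def empty_ok_jobs(jobs: List[Job]) -> List[Job]:
    """Return jobs that report success but transferred no files and no bytes."""
    return [j for j in jobs if j.status == "OK" and j.files == 0 and j.size_bytes == 0]


# Two rows copied from the stat1005 output above.
SAMPLE = """\
 JobId  Level    Files      Bytes   Status   Finished        Name
===================================================================
101735  Incr          0         0   OK       24-Jul-18 04:29 stat1005.eqiad.wmnet-Monthly-1st-Sat-production-home
102834  Incr          0         0   OK       02-Aug-18 09:13 stat1005.eqiad.wmnet-Monthly-1st-Sat-production-home
"""

if __name__ == "__main__":
    for job in empty_ok_jobs(parse_terminated_jobs(SAMPLE)):
        print(f"{job.job_id} {job.name}: OK but empty")
```

A job that keeps finishing OK with nothing transferred is a hint that the backup set's path no longer holds any data, which is worth checking before concluding that the backups work.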

Event Timeline

elukey triaged this task as Medium priority. Aug 3 2018, 11:24 AM
elukey created this task.
Milimetric raised the priority of this task from Medium to High. Aug 9 2018, 3:52 PM
Milimetric lowered the priority of this task from High to Medium.
Milimetric raised the priority of this task from Medium to High.
Milimetric lowered the priority of this task from High to Medium.
Milimetric moved this task from Incoming to Operational Excellence on the Analytics board.

Change 454480 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::statistics::cruncher|private: remove unused bacula settings

https://gerrit.wikimedia.org/r/454480

OO, I'm pretty sure we do want the home directories backed up. I betcha this is a mistake since we moved /home to /srv/home?

It could be, yes! I was convinced that, since the stat home dirs are huge, we wouldn't use bacula and would rely only on the RAID on those hosts, so I am probably missing something. Let's have a chat about this!

The size of the home directories might not be eligible for backup:

elukey@stat1004:/srv/home/elukey$ sudo du -hs /srv/home/
959G    /srv/home/

elukey@stat1005:/srv/home/elukey$ sudo du -hs /srv/home/
3.4T    /srv/home/

elukey@stat1006:/srv/home/elukey$ sudo du -hs /srv/home/
2.3T    /srv/home/
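For reference, the three `du -hs` figures above can be added up with a small sketch like the one below. It assumes GNU `du -h` semantics (suffixes are powers of 1024); `parse_du_size` is a hypothetical helper, and the host-to-size mapping is copied from the output above.

```python
# Exponent of 1024 for each suffix printed by `du -h` (assumed GNU behaviour).
UNITS = {"K": 1, "M": 2, "G": 3, "T": 4, "P": 5}


def parse_du_size(size: str) -> int:
    """Convert a human-readable du size such as '3.4T' to bytes (approximate,
    since du -h has already rounded the value)."""
    size = size.strip()
    suffix = size[-1].upper()
    if suffix in UNITS:
        return round(float(size[:-1]) * 1024 ** UNITS[suffix])
    return int(size)  # no suffix: treat the value as plain bytes


# Sizes of /srv/home as reported in this task.
HOME_SIZES = {
    "stat1004": "959G",
    "stat1005": "3.4T",
    "stat1006": "2.3T",
}

total_bytes = sum(parse_du_size(s) for s in HOME_SIZES.values())
total_tib = total_bytes / 1024 ** 4

if __name__ == "__main__":
    print(f"Total /srv/home across {len(HOME_SIZES)} hosts: {total_tib:.1f} TiB")
```

The total comes to a bit over six and a half TiB across the three hosts, or roughly 5.7 TiB for stat1005 and stat1006 alone.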

@akosiaris any suggestion? Is it feasible with our current backup infrastructure to be able to back up home dirs that big or not?

Hm, 6TB? Yes, although doing so currently might be a bit too much. We are, however, refreshing the infrastructure for both codfw/eqiad, and the new one definitely has the free space. Mind if I stall this on T196477 (which is itself stalled on T202255)?

Ya should be fine. I'd also be fine with just declaring that stat home dirs are not backed up, and if folks specifically wanted to save stuff with more redundancy, they can just put it into HDFS (wouldn't work for stat1006 users but oh welllll).

Change 454480 merged by Elukey:
[operations/puppet@production] profile::statistics::cruncher|private: remove unused bacula settings

https://gerrit.wikimedia.org/r/454480

@Ottomata let's decide what policy to adopt during our next ops sync! (try to get backups again or just close the task)

Just sent an email to people working on stat/notebook nodes to make sure that we have a shared understanding about home directories not being backed up. Also added a note in https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Analytics_clients. Let's wait some days for people to comment, and close the task if nobody suggests otherwise.

Talked to Product-Analytics at the sync-up about the best practices of keeping code backed up in gerrit and data that needs to be backed up in HDFS. I think this can be closed.

elukey claimed this task.