
Define and test phabricator backup system
Closed, Resolved. Public.

Description

Sean has allocated db1043 via m3-master.eqiad.wmnet for use. He has talked to me about having an 8-hour LVM snapshot backup, but I am unsure of the timeline on that. The standard backup mechanisms are mostly known to Alex; they rely on an in-house Bacula setup (http://blog.bacula.org/). We need to define the backup periods and test our backup strategy.

Details

Reference
fl411

Event Timeline

flimport raised the priority of this task to High. Sep 12 2014, 1:39 AM
flimport set Reference to fl411.

akosiaris wrote on 2014-06-13 11:44:56 (UTC)

Hello,

So indeed the backup mechanism is implemented via bacula. Right now the default schedule is that a backup is taken every day at 02:05 UTC. These default to being incrementals. On some "special" days (depending on the machine) fulls and differentials are taken. In case these terms mean nothing to you, don't worry, it should be sufficient to know that things are backed up every day.
There is quite some documentation on https://wikitech.wikimedia.org/wiki/Bacula but the database-specific part is still missing since we only recently put it into full production. Plus it has quite a few tunables and it needs some consideration every time before deploying it.

Depending on the installation there are a couple of questions that need to be answered, like how big the database is, how much we want to burden the machine while backing it up, etc. Looking at db1043, I see it is a clean install machine, so the database size right now is zero (0) and any way of backing it up will cause 0 overhead. So I suggest using the standard way of piping everything sequentially to bacula via the bpipe plugin. This has the benefit of being a 1-stage approach where everything happens in bacula, which allows easy backups/restores. It also scales up to 100+M row tables, so we should be safe with that for a long time.

The scheduling should probably be the default (once a day) and we can depend on replaying the binary logs for the rest.
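To make the "daily backup plus binary log replay" idea concrete, here is a minimal sketch of how one might work out which daily backup to restore and how much binary log would need replaying to reach a given point in time. It assumes the 02:05 UTC schedule mentioned above and treats each backup as an instantaneous, consistent snapshot at that time, which a real job is not. The function name `recovery_plan` is made up for illustration and is not part of bacula or MySQL.

```python
from datetime import datetime, time, timedelta, timezone

# Assumption: the default daily schedule from this thread (02:05 UTC).
BACKUP_TIME_UTC = time(2, 5)


def recovery_plan(target: datetime) -> tuple[datetime, timedelta]:
    """Return (backup to restore, span of binary logs to replay).

    `target` must be timezone-aware. The backup chosen is the most
    recent 02:05 UTC run at or before the target time.
    """
    target = target.astimezone(timezone.utc)
    backup = datetime.combine(target.date(), BACKUP_TIME_UTC, tzinfo=timezone.utc)
    if backup > target:
        # Today's run has not happened yet, fall back to yesterday's.
        backup -= timedelta(days=1)
    return backup, target - backup


if __name__ == "__main__":
    wanted = datetime(2014, 7, 8, 9, 44, 25, tzinfo=timezone.utc)
    backup, replay = recovery_plan(wanted)
    print(f"restore backup from {backup.isoformat()}, then replay {replay} of binlogs")
```

With a once-a-day schedule the replay span is always under 24 hours, which is the trade-off being accepted here.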

I can concoct a patch and submit it for db1043 as soon as you have something running.

Rush wrote on 2014-07-03 19:08:53 (UTC)

@akosiaris ...can we get this going? We have data!

akosiaris wrote on 2014-07-08 09:44:25 (UTC)

So we have a change ready, https://gerrit.wikimedia.org/r/#/c/144653/, so this is moving along. Backups will be daily at 02:05 UTC with an easy way to restore to that point in time. Documentation will need to be updated for that; I will do that after the change goes live.

akosiaris wrote on 2014-07-08 10:33:45 (UTC)

That change got abandoned after a discussion with Sean, who is in favor of the dbstore approach that is already in place. So m3 should just replicate to dbstore anyway. Sean will be handling this one.

Rush wrote on 2014-07-08 17:06:36 (UTC)

Hi Chase

(cc ops@ in case there is general interest)

Today I asked Alex not to configure m3 db bacula dumps directly from db1043 (m3-master). Instead we have replicated m3 phabricator% and phlegal% databases to dbstore1001 and dbstore1002 where they will be dumped using the existing bacula jobs that handle the wikis.

So for backup and recovery we have:

1. 8 hour lvm snapshots (x2; or 16 hours worth) on db1048 (m3-slave)
2. Normal (in-sync) replicated databases on dbstore1002
3. Delayed (24h behind) replicated databases on dbstore1001
4. Binary logs for up to a month on multiple boxes
5. Weekly full logical exports into bacula thanks to Alex
6. [soon] Replication to codfw
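As a rough illustration of how the time-bounded layers in the list above relate to each other, the sketch below takes the age of a mistake (say, an accidental delete) and returns the layers that could still hold state from before it. The windows are read off the list as upper bounds and are assumptions: 16 hours is the best case for the two 8-hour snapshots, "up to a month" of binary logs is taken as 30 days, and the retention of the weekly exports is not stated in this thread, so it is left unbounded. The in-sync replica is omitted because it replays a mistake immediately. Binary logs only help when replayed on top of an older base copy.

```python
from datetime import timedelta

# (layer, look-back window) taken from the list above; None = not stated.
LAYERS = [
    ("lvm snapshots on db1048", timedelta(hours=16)),        # 2 x 8h, best case
    ("delayed replica on dbstore1001", timedelta(hours=24)),
    ("binary logs", timedelta(days=30)),                     # assumed 30 days
    ("weekly logical export in bacula", None),               # retention unknown
]


def candidate_layers(age: timedelta) -> list[str]:
    """Return the layers whose window still covers a mistake of this age."""
    return [
        name
        for name, window in LAYERS
        if window is None or age <= window
    ]


if __name__ == "__main__":
    for hours in (10, 20, 24 * 45):
        print(hours, "hours ago:", candidate_layers(timedelta(hours=hours)))
```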

Fwiw the s[1-7] wiki shards are handled the same way, and some databases from m1 and m2 will soon be included too.

If you guys decide you want more frequent full logical backups for phabricator feel free to make them, but if possible please use a dbstore box as the source.

BR

qgil wrote on 2014-07-10 12:57:05 (UTC)

Has someone checked whether we have actual backups of https://legalpad.wikimedia.org/ being created in the past days?

springle wrote on 2014-07-22 01:31:37 (UTC)

Lvm snapshots are operating normally. Replication to dbstore boxes is operating normally. Full data dumps are running normally.

Qgil subscribed.

We need to test that backup works here.

I have asked at ops-requests.

Qgil claimed this task.
Qgil added a subscriber: chasemp.

And this is the reply: "According to our records, your request has been resolved."

I still would welcome a message from @chasemp confirming that he has seen the backups and they are operational, but this is already good enough to resolve this task (which was already resolved in fab.wmflabs time and I reopened here just in case).

In T248#16, @Qgil wrote:

I have asked at ops-requests.

That translates to: https://rt.wikimedia.org/Ticket/Display.html?id=8407

akosiaris replied on that ticket: "https://phabricator.wikimedia.org uses m3-master as database so what Sean states in https://phabricator.wikimedia.org/T248 still holds true. So backups are indeed in place."