
Document data backup and disaster recovery plan for https://space.wmflabs.org and https://discuss-space.wmflabs.org
Open, Low, Public

Description

The current prototype implementation of Space is hosted on instances in two Cloud VPS projects (T225081, T225082). The Cloud VPS environment does not provide any automated backup system for instances (virtual machines) or the data contained inside them. The best-effort data protection currently provided to all Cloud VPS projects includes the use of RAID in the storage layer, which provides basic protection against catastrophic data loss caused by a single hard drive failure, but no further data protection is in place. A failure of multiple drives on the underlying hypervisor (host computer) for any instance can result in data corruption or complete loss of the instance's virtual hard drive. A failure of the entire hypervisor due to hardware malfunction is also possible and could leave all instances hosted on the failed hypervisor unavailable for hours, days, or weeks, if they can be recovered at all. The cloud-services-team is currently working on a multi-quarter project to increase the redundancy of instance storage and detach that storage from a single hypervisor (T90364), but this project is months away from active use.

As usage of the Space prototype increases, it becomes more and more urgent to implement a well-documented and tested backup and recovery plan. Promotion at events like Wikimania and WikiArabia did not seem to include prominent notices that Space is currently only at the prototype/experiment phase of hosting, which creates the worrying assumption that a product which has stickers and posters can be relied on to preserve important communications.

The cloud-services-team would be happy to consult with @Qgil and the other maintainers of the Discourse and WordPress instances that power https://space.wmflabs.org and https://discuss-space.wmflabs.org on their backup and recovery plans.

Event Timeline

bd808 created this task. Oct 11 2019, 12:17 AM
Restricted Application added a subscriber: Aklapper. Oct 11 2019, 12:17 AM
Qgil triaged this task as High priority. Oct 11 2019, 7:13 AM
Qgil edited projects, added Space (Oct-Dec-2019); removed Space.
Qgil added a subscriber: CKoerner_WMF.
Qgil added a comment. Edited. Oct 11 2019, 7:32 AM

Thank you for the prompt, and for the quality standards.

I will address https://discuss-space.wmflabs.org (Discuss) and I will let @CKoerner_WMF handle https://space.wmflabs.org (WordPress).

Currently Discourse produces daily backups that are stored locally on the same server (I just realized that the default was weekly; I have changed it to daily). Additional backups can be created on the spot from the web admin interface. These backups contain all the data needed to restore an instance completely, and they can also be downloaded from the web admin interface. Whenever there is an upgrade or any other delicate operation on the command line, I create a backup and download it to my laptop.

Discourse officially supports the remote storage of backups in Amazon S3. The functionality is available out of the box, no plugins are required. We haven't used this option. It is probably a no-go.

There is also a community maintained tool/process/instructions to store backups remotely in Google Drive. Given that the Foundation is using Google Drive and we have a lot of sensitive data there already, maybe this could be considered an interim option.

Of course, we can set up other ways to transfer backups automatically to other locations (e.g. Transferring backups via rsync and cron).

While we sort out this problem, I will download the daily backups to my laptop at least during working days. It is a good way not to forget about this problem. :)
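
As a rough illustration of the rsync-and-cron idea mentioned above, the sketch below picks the newest backup archive in a directory and builds an rsync command that copies it to another host. The `.tar.gz` suffix, the function names, and any directory or remote target passed in are assumptions made for illustration; they are not details of the actual Space setup.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def newest_backup(backup_dir: Path, suffix: str = ".tar.gz") -> Optional[Path]:
    """Return the most recently modified backup archive, or None if there is none."""
    candidates = [
        p for p in backup_dir.iterdir()
        if p.is_file() and p.name.endswith(suffix)
    ]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def rsync_command(archive: Path, remote: str) -> List[str]:
    """Build an rsync invocation that copies one archive to a remote location."""
    # --archive preserves timestamps and permissions; --partial keeps a
    # partially transferred file so that a retry can reuse it.
    return ["rsync", "--archive", "--partial", str(archive), remote]


def push_newest_backup(backup_dir: Path, remote: str) -> int:
    """Copy the newest archive off the server and return rsync's exit code."""
    archive = newest_backup(backup_dir)
    if archive is None:
        return 1  # nothing to copy
    return subprocess.run(rsync_command(archive, remote), check=False).returncode
```

A daily cron entry on the Discourse server could run a small script that calls `push_newest_backup` with the local backup directory and a remote such as `user@host:/path/`, both hypothetical here.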

Qgil moved this task from Backlog to Started on the Space (Oct-Dec-2019) board. Oct 11 2019, 7:37 AM
bd808 added a comment. Oct 11 2019, 3:20 PM

> Discourse officially supports the remote storage of backups in Amazon S3. The functionality is available out of the box, no plugins are required. We haven't used this option. It is probably a no-go.
> There is also a community maintained tool/process/instructions to store backups remotely in Google Drive. Given that the Foundation is using Google Drive and we have a lot of sensitive data there already, maybe this could be considered an interim option.

I personally think either S3 or Google Drive would be a reasonable storage option. Both might be easier to explain if the backup files themselves are encrypted before being stored remotely. Hopefully that is something built in or easy to add to the default archive process. Even without encryption, though, I think these are viable options in the short term.
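
One way encryption could be added before upload is sketched below, assuming GnuPG is installed on the server and a passphrase is kept in a local file. This is not a built-in Discourse feature; the function names and the passphrase-file approach are assumptions made for illustration.

```python
import subprocess
from pathlib import Path
from typing import List


def encrypted_path(archive: Path) -> Path:
    """Return the path of the encrypted copy, written next to the original."""
    return archive.with_name(archive.name + ".gpg")


def gpg_encrypt_command(archive: Path, passphrase_file: Path) -> List[str]:
    """Build a gpg command for symmetric (passphrase-based) encryption."""
    return [
        "gpg", "--batch", "--yes",
        # Loopback pinentry is needed for non-interactive passphrases on GnuPG 2.1+.
        "--pinentry-mode", "loopback",
        "--passphrase-file", str(passphrase_file),
        "--symmetric", "--cipher-algo", "AES256",
        "--output", str(encrypted_path(archive)),
        str(archive),
    ]


def encrypt_backup(archive: Path, passphrase_file: Path) -> Path:
    """Encrypt the archive with gpg and return the path of the encrypted file."""
    subprocess.run(gpg_encrypt_command(archive, passphrase_file), check=True)
    return encrypted_path(archive)
```

Only the resulting `.gpg` file would then be uploaded; the passphrase has to be stored somewhere other than the remote storage, or the encryption adds little.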

> Of course, we can set up other ways to transfer backups automatically to other locations (e.g. Transferring backups via rsync and cron).

Another project on the Cloud Services roadmap that we would like to implement in FY19/20 (before July 2020) is an rsync-based backup service (T209530). Today I cannot volunteer that WMCS folks will drop other things to make a prototype of this service, but I can start thinking about how we might fit at least a proof of concept into our plans for Q3 (January - March) and keep y'all in mind as early adopters.

> While we sort out this problem, I will download the daily backups to my laptop at least during working days. It is a good way not to forget about this problem. :)

This seems like a good first step/stopgap until a better system can be designed. I would also encourage you to do a couple of tests of restoring the backups to a new instance to validate that the data you are saving is sufficient to perform a cold restore and also to help you develop a documented list of steps needed to actually use the backups. Figuring such things out under the pressure of an outage is much more stressful than doing it as an experiment. :)
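
As a small first check before a full restore test, a script could confirm that a downloaded archive opens and contains the files a restore would need. The sketch below assumes the backup is a `.tar.gz` archive; the member names passed in (for example a database dump such as `dump.sql.gz`) are assumptions about the archive layout and should be checked against a real backup.

```python
import tarfile
from pathlib import Path
from typing import List


def normalize(name: str) -> str:
    """Drop a leading './' so member names compare consistently."""
    return name[2:] if name.startswith("./") else name


def missing_members(archive: Path, required: List[str]) -> List[str]:
    """Return the required member names that are absent from a .tar.gz archive."""
    with tarfile.open(archive, mode="r:gz") as tar:
        names = {normalize(name) for name in tar.getnames()}
    return [name for name in required if name not in names]
```

An empty result only shows that the archive can be listed and has the expected members. It does not replace restoring the backup to a new instance and writing down the steps.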

> I will let @CKoerner_WMF handle https://space.wmflabs.org (WordPress).

WordPress is backing up the database contents to my email on a daily basis. I back up the customizations we've made (theme and plugins) any time we make changes to the site or update core/plugins. These are stored on my local computer.

Qgil lowered the priority of this task from High to Medium. Oct 28 2019, 5:57 PM

I am still downloading Discourse backups manually on a regular basis.

Tgr added a subscriber: Tgr. Jan 3 2020, 7:31 AM

@Qgil any thoughts about T180929: Make export dumps available for discourse.wmflabs.org (which is about the old Discourse instance but could be adopted)? Public dumps of all movement content and communication platforms seem like a reasonable expectation, although of course less urgent than backups.

@Tgr See m:Talk:Wikimedia_Space#Export_strategy. If I'm not mistaken (@Qgil please feel free to correct me), using dumps and open formats for the WMF's Space project has been specifically rejected.

Qgil moved this task from Backlog to Started on the Space (Jan-Mar-2020) board. Tue, Feb 18, 3:43 PM

An update after the Space announcement - https://discuss-space.wmflabs.org/t/next-steps-on-wikimedia-space/3184 - we are just keeping the status quo until the WordPress site is moved to a production server and Discourse is frozen by March 31.

Qgil lowered the priority of this task from Medium to Low. Tue, Feb 18, 3:44 PM