WMF media storage must be adequately backed up in a remote location
Open, High, Public

Description

There is a desire to have 100% backup coverage of all data hosted at the Wikimedia Foundation in a centralized solution. After wiki content database backups were finally set up (T79922), multimedia (specifically, data stored on Swift to serve non-text wiki content) was the highest priority in terms of impact if lost, overall size, and the desire of several WMF stakeholders to have it backed up.

Redundancy is in place for media, but high availability, while a must to protect against service loss, is not a substitute for proper backups: software bugs, operator mistakes, employee sabotage, hardware issues and malicious attacks are all vectors that online redundancy would not necessarily protect against effectively. Geographically remote offline copies are needed, in addition to service HA, to recover effectively in the event of data loss.

Event Timeline

jcrespo triaged this task as High priority. Sep 11 2020, 1:00 PM
jcrespo moved this task from Backlog to Acknowledged on the SRE board.
jcrespo moved this task from Triage to Meta/Epic on the DBA board.

Ping @Miriam, we might be able to piggyback on this project to get access to image data outside of Swift for your ML schemes :)

Hi @Ottomata, thanks for the ping! Getting a copy of Commons (thumbnails only would be fine) which is directly accessible via stat machines would be amazing! Adding @fkaelin as we also chatted about this during recent conversations about the pain points of image work.

> thumbnails only would be fine

Sadly the backups are focusing only on originals (for now).

@jcrespo originals would be great, too; I only thought of thumbnails because they generally require less space.

This is an example of an ongoing backup (testwiki):

db2102.codfw.wmnet[mediabackups]> select backup_status_name , count(*) from files JOIN backup_status ON backup_status.id = files.backup_status GROUP BY backup_status;
+--------------------+----------+
| backup_status_name | count(*) |
+--------------------+----------+
| pending            |     6599 |
| processing         |     1000 |
| backedup           |     3345 |
| duplicate          |      655 |
+--------------------+----------+
4 rows in set (0.009 sec)

In this case, 1000 files are currently being processed, 6599 are queued waiting to be processed, 3345 have already been backed up, and 655 were found to be duplicates of existing backups (they had been backed up before).
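As a rough way to read these counts, the sketch below computes overall progress from the numbers in the output above. Counting both `backedup` and `duplicate` as "covered" is an assumption made for illustration, and `backup_progress` is a hypothetical helper, not part of the actual backup tooling.

```python
# Status counts copied from the testwiki query output above.
counts = {"pending": 6599, "processing": 1000, "backedup": 3345, "duplicate": 655}


def backup_progress(counts):
    """Return (total, done, fraction_done) for a dict of status counts.

    Assumption: 'backedup' and 'duplicate' both count as done, because a
    duplicate already has a copy among the existing backups.
    """
    total = sum(counts.values())
    done = counts.get("backedup", 0) + counts.get("duplicate", 0)
    fraction = done / total if total else 0.0
    return total, done, fraction


total, done, fraction = backup_progress(counts)
print(f"{done} of {total} files covered ({fraction:.1%})")
# prints: 4000 of 11599 files covered (34.5%)
```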

Aside from metadata, the database provides the persistence needed in case of an error, so we can continue backups from the point where they were interrupted.

LSobanski mentioned this in Unknown Object (Task). Mar 10 2021, 2:37 PM
jcrespo added a subtask: Unknown Object (Task). Mar 10 2021, 4:24 PM
Papaul closed subtask Unknown Object (Task) as Resolved. Tue, Apr 6, 5:00 PM