
WMF media storage must be adequately backed up
Open, High, Public

Assigned To
None
Authored By
jcrespo
Sep 11 2020, 1:00 PM
Referenced Files
F34886587: recovery.png
Dec 16 2021, 3:40 PM
F34883925: Screenshot_20211214_192631.png
Dec 14 2021, 6:40 PM
F34883926: Screenshot_20211214_192706.png
Dec 14 2021, 6:40 PM
F34598420: Screenshot from 2021-08-17 18-11-11.png
Aug 17 2021, 4:12 PM
F34598421: Screenshot from 2021-08-17 18-10-06.png
Aug 17 2021, 4:12 PM
F34598387: Screenshot from 2021-08-17 17-32-21.png
Aug 17 2021, 3:34 PM
F34598388: Screenshot from 2021-08-17 17-32-06.png
Aug 17 2021, 3:34 PM
Tokens
"Burninate" token, awarded by Nemo_bis. "Yellow Medal" token, awarded by Ladsgroup. "Barnstar" token, awarded by Legoktm. "Love" token, awarded by Framawiki.

Description

There is a desire to have 100% backup coverage of all data hosted at the Wikimedia Foundation in a centralized solution. After wiki content database backups were finally set up (T79922), multimedia (specifically, data stored on Swift to serve wiki non-text content) was the highest priority in terms of impact (if lost), overall size, and desire by several WMF stakeholders to have it backed up.

While there is redundancy in place for media, high availability, although a must to protect against service loss, is not a substitute for proper backups: software bugs, operator mistakes, employee sabotage, hardware issues and malicious attacks are all vectors that online redundancy would not necessarily protect against effectively. Geographically remote offline copies are needed (in addition to service HA) to recover effectively in the event of data loss.

Details

Project | Branch | Lines +/-
operations/software/mediabackups | master | +154 -2
operations/puppet | production | +3 -3
operations/puppet | production | +3 -3
operations/puppet | production | +3 -3
operations/puppet | production | +3 -3
operations/puppet | production | +3 -3
operations/puppet | production | +3 -3
operations/puppet | production | +2 -2
operations/puppet | production | +1 -1
operations/puppet | production | +1 -1
operations/puppet | production | +6 -4
operations/puppet | production | +34 -4
operations/puppet | production | +4 -1
labs/private | master | +4 -0
operations/puppet | production | +11 -4
operations/puppet | production | +3 -3
operations/puppet | production | +3 -3
operations/puppet | production | +3 -3
operations/puppet | production | +1 -1
operations/puppet | production | +1 -1
operations/puppet | production | +1 -1
operations/puppet | production | +1 -1
operations/puppet | production | +1 -1
operations/puppet | production | +3 -3
operations/puppet | production | +6 -6
operations/puppet | production | +33 -5

Related Objects

Status | Assigned
Open | None
Resolved | jcrespo
Resolved | jcrespo
Resolved | jcrespo
Resolved | jcrespo
Open | None
Resolved | fgiunchedi
Open | jcrespo
Resolved | jcrespo
Resolved | jcrespo
Resolved | jcrespo
Resolved | Papaul
Resolved | jcrespo
Resolved | jcrespo
Resolved | RobH
Resolved | jcrespo
Resolved | jcrespo
Resolved | jcrespo

Event Timeline


Backup of commonswiki started, around 70K files backed up (slowly) so far:

root@db1176.eqiad.wmnet[mediabackups]> START TRANSACTION; select count(*), status_name, backup_status_name FROM files JOIN wikis ON wikis.id = files.wiki JOIN backup_status ON backup_status.id = files.backup_status JOIN file_status ON files.status = file_status.id WHERE wiki=392 GROUP BY status, backup_status; select count(*), TIMESTAMPDIFF(MINUTE, min(backup_time), max(backup_time)) from backups where wiki=392; COMMIT;
Query OK, 0 rows affected (0.001 sec)

+----------+-------------+--------------------+
| count(*) | status_name | backup_status_name |
+----------+-------------+--------------------+
| 76194185 | public      | pending            |
|     8000 | public      | processing         |
|    71989 | public      | backedup           |
|        9 | public      | error              |
|        2 | public      | duplicate          |
|  5910460 | archived    | pending            |
|  6551674 | deleted     | pending            |
+----------+-------------+--------------------+
7 rows in set (2 min 25.138 sec)

+----------+-----------------------------------------------------------+
| count(*) | TIMESTAMPDIFF(MINUTE, min(backup_time), max(backup_time)) |
+----------+-----------------------------------------------------------+
|    71989 |                                                        55 |
+----------+-----------------------------------------------------------+
1 row in set (1.027 sec)

Query OK, 0 rows affected (0.001 sec)

This is with low concurrency (8 threads), but it is backing up things at ~1300 files/minute, which will mean 1 day for a full backup?

I made a mistake by an order of magnitude: we have backed up approximately 2.5TB, or half a million files, in less than 6 hours. At the current low concurrency and the current speed of 25 files per second, this puts the end time at 45 days. Let's see if we can increase the pace once maintenance finishes on codfw and becomes passive again.
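For anyone who wants to redo the arithmetic behind these estimates, here is a minimal sketch. The function names are hypothetical, the counts are copied from the query output earlier in this thread, and it assumes a constant backup rate, which is unlikely to hold in practice.

```python
SECONDS_PER_DAY = 86_400


def files_per_second(files_done: int, elapsed_minutes: float) -> float:
    """Average throughput over a measured window."""
    return files_done / (elapsed_minutes * 60)


def eta_days(pending_files: int, rate_files_per_s: float) -> float:
    """Days needed to process pending_files at a constant rate."""
    return pending_files / rate_files_per_s / SECONDS_PER_DAY


# Pending commonswiki rows from the query output: public + archived + deleted.
pending = 76_194_185 + 5_910_460 + 6_551_674

# 71989 files were backed up during the first 55 minutes.
early_rate = files_per_second(71_989, 55)

print(f"early rate: {early_rate:.1f} files/s")
print(f"ETA at 25 files/s: {eta_days(pending, 25):.0f} days")
```

With these inputs the estimate comes out on the order of six weeks, the same ballpark as the 45-day figure, and well above the initial 1-day guess.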

I think we should crank concurrency up and see how much read throughput we can get. Maintenance/rebalance is ongoing but I'm not expecting it to affect throughput very much, and if it does we should find out IMHO

Sorry, when I said:

once maintenance finishes on codfw and becomes passive again

what I really meant is:

"once maintenance finishes on codfw and eqiad becomes passive again". I am running the backup on eqiad right now (not on codfw). Can I increase eqiad load?

Hey, @Ottomata I believe you organized or helped organize the watch party for "Turning the database inside-out". This may be offtopic here, but I wanted to give you my comments about it. I think this work is a good example of redesigning a workflow in this way.

As we discussed on this ticket before, aside from historical issues and technical debt, most of the complexity of generating file backups is that they are not stored in a simple append-only/event-based format. My work on file backups, among other things, implies "turning the database upside down" by migrating the non-trivial file storage model for backups to a simpler model. I hope to have the Data Engineering team's support in showing the engineering behind what was done for file storage metadata to, e.g., the Core and Product teams, to convince them of the advantages of this workflow (for backups and recoveries, analytics, dumps, database management, checking consistency errors, etc.) and possibly to work towards similar models in production, too. I don't think it will necessarily be easy to apply in all scopes, but it will probably be faster in more specific ones.

The context of files, for example, would be the architecture work at T28741.

TL;DR: I support 100% working towards that model, please let's ally to convince other people it is good and we will all profit from it! :-).

I haven't seen a higher-than-expected increase in latency so yeah IMHO good to bump concurrency a little

Thank you @godog, will do, slowly.

At the extreme, 4x-8x the current number of threads would move the bottleneck to minio first anyway (writes > reads in load, and we are way less sharded!).

Increased threads to 14 (7 on each worker). We now get a backup speed of over 44 files/s, which would imply a pending commons backup time of 21 days (although a constant backup speed is unlikely: it will vary depending on the production load, plus I expect to find more duplicates in the deleted and archived files).

I love to see it. Thanks for doing this. I was just dreaming of having such a backup off-cluster... Let me know when there is a half-PB of files that I and others can host. (Maybe get a physical copy to IArchive, they can stand up a second copy on ipfs+torrent, and a mirror network can start mirroring shards that way)

Hopefully the next time around only new + changed files will need to be backed up, in "commons-tarball-incremental-YYYY-NN"?

@Sj Sadly, we think we solved (private) backups, but we decided, for the scope of this task, to not solve dumps because it is a harder issue due to dynamicity and granularity of mediawiki permissions (files are deleted, undeleted and renamed all the time).

I discussed and even proposed to work together with @ArielGlenn on exports, but because of the specific needs of backups (mostly the ability to quickly recover and delete just a few files), that fell out of the scope of this more immediate work. An eventual dumps design, however, could benefit from this work and even be built on top of it.

The average backup speed is now around 50 files/s, with a 3% overhead over normal traffic. We have backed up almost 20 million Commons files, close to 85TB in size, which is around 21% of the total. At this speed the full backup should take around 16 more days.

jcrespo changed the task status from Open to In Progress.Sep 16 2021, 2:40 PM
jcrespo changed the status of subtask T160229: Back up of Commons files from Open to In Progress.

We reached peaks of 150 files/s at some points (with the DC depooled, during the night), but it got as low as 6 files/s for the 20TB of TIFF images from the Library of Congress (lots of files of over 100MB each). Progress is now at over 80% completion. It may finish by Tuesday.

First pass of Commons full originals completed after 19 days (eqiad), with 99.94% success.

Most misses are expected, mostly due to files moved after metadata acquisition but before the physical copy.

Stats (eqiad):

+-------------+----------+-----------------+
| wiki_name   | count(*) | sum(size)       |
+-------------+----------+-----------------+
| commonswiki | 88661107 | 363866620200562 |
| enwiki      |  7867722 |   2051528672500 |
| testwiki    |    11613 |     62868866309 |
+-------------+----------+-----------------+
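To make the byte totals in this table easier to read, here is a small sketch that converts them to decimal terabytes and binary tebibytes and derives an average file size. The helper names are hypothetical; the counts and sizes are copied from the table above.

```python
def to_terabytes(size_bytes: int) -> float:
    """Convert bytes to decimal terabytes (1 TB = 10**12 bytes)."""
    return size_bytes / 10**12


def to_tebibytes(size_bytes: int) -> float:
    """Convert bytes to binary tebibytes (1 TiB = 2**40 bytes)."""
    return size_bytes / 2**40


# (file count, total bytes) per wiki, as reported in the stats table.
stats = {
    "commonswiki": (88_661_107, 363_866_620_200_562),
    "enwiki": (7_867_722, 2_051_528_672_500),
    "testwiki": (11_613, 62_868_866_309),
}

for wiki, (count, size) in stats.items():
    avg_mb = size / count / 10**6  # average file size in decimal megabytes
    print(
        f"{wiki}: {to_terabytes(size):.2f} TB, "
        f"{to_tebibytes(size):.2f} TiB, avg {avg_mb:.2f} MB/file"
    )
```

Whether a given figure in this thread is meant as TB or TiB is not stated, so both are printed.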

Next steps: fix misses, finish all other smaller wikis, and run the same backup on codfw (waiting on swift maintenance).

Change 728341 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackups: Start backup of dewiki files on eqiad

https://gerrit.wikimedia.org/r/728341

Change 728341 merged by Jcrespo:

[operations/puppet@production] mediabackups: Start backup of dewiki files on eqiad

https://gerrit.wikimedia.org/r/728341

Change 730483 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackups: Backup enwikivoyage media on eqiad

https://gerrit.wikimedia.org/r/730483

Change 730483 merged by Jcrespo:

[operations/puppet@production] mediabackups: Backup enwikivoyage media on eqiad

https://gerrit.wikimedia.org/r/730483

Change 730533 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackups: Backup jvwikisource media on eqiad

https://gerrit.wikimedia.org/r/730533

Change 730533 merged by Jcrespo:

[operations/puppet@production] mediabackups: Backup jvwikisource media on eqiad

https://gerrit.wikimedia.org/r/730533

Change 730534 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackups: Backup mgwiktionary media on eqiad

https://gerrit.wikimedia.org/r/730534

Change 730534 merged by Jcrespo:

[operations/puppet@production] mediabackups: Backup mgwiktionary media on eqiad

https://gerrit.wikimedia.org/r/730534

Change 730550 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackups: Backup shwiki media on eqiad

https://gerrit.wikimedia.org/r/730550

Change 730550 merged by Jcrespo:

[operations/puppet@production] mediabackups: Backup shwiki media on eqiad

https://gerrit.wikimedia.org/r/730550

Change 730560 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackups: Backup srwiki media on eqiad

https://gerrit.wikimedia.org/r/730560

Change 730560 merged by Jcrespo:

[operations/puppet@production] mediabackups: Backup srwiki media on eqiad

https://gerrit.wikimedia.org/r/730560

Change 730723 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackups: Backup frwiki media on eqiad

https://gerrit.wikimedia.org/r/730723

Change 730723 merged by Jcrespo:

[operations/puppet@production] mediabackups: Backup frwiki media on eqiad

https://gerrit.wikimedia.org/r/730723

Change 740124 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackups: Backup s2 wikis, starting with bgwiki

https://gerrit.wikimedia.org/r/740124

Change 740124 merged by Jcrespo:

[operations/puppet@production] mediabackups: Backup s2 wikis, starting with bgwiki

https://gerrit.wikimedia.org/r/740124

Change 740862 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackups: Backup testcommonswiki

https://gerrit.wikimedia.org/r/740862

Change 740862 merged by Jcrespo:

[operations/puppet@production] mediabackups: Backup testcommonswiki

https://gerrit.wikimedia.org/r/740862

Change 747064 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackups: Make wiki optional, add another optional parameter: dblist

https://gerrit.wikimedia.org/r/747064

Change 747064 merged by Jcrespo:

[operations/puppet@production] mediabackups: Make wiki optional, add another optional parameter: dblist

https://gerrit.wikimedia.org/r/747064

Change 747113 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackup: Add an encryption key to store private file securely

https://gerrit.wikimedia.org/r/747113

Change 747160 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[labs/private@master] mediabackup: Add dummy age private key for mediabackups

https://gerrit.wikimedia.org/r/747160

Change 747160 merged by Jcrespo:

[labs/private@master] mediabackup: Add dummy age private key for mediabackups

https://gerrit.wikimedia.org/r/747160

Change 747170 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] puppetmaster: Install 'age' on puppetmaster frontends

https://gerrit.wikimedia.org/r/747170

This is the list of media backup errors (making it NDA-only, as I haven't yet checked that everything there is non-private):
{P18232}

The worst wikis are easily explainable, or easily solvable, or both:

[mediabackups]> WITH  errors AS ( select count(*) as count, wiki_name FROM files FORCE INDEX(backup_status) JOIN wikis ON files.wiki = wikis.id JOIN backup_status ON backup_status.id = files.backup_status WHERE backup_status in (4) GROUP BY backup_status, wiki), total AS ( select count(*) as count, wiki_name FROM files JOIN wikis ON files.wiki = wikis.id GROUP BY wiki) SELECT wiki_name, errors.count * 100.0 / total.count AS percentage_errors FROM errors JOIN total USING(wiki_name) ORDER by percentage_errors DESC;
+----------------+-------------------+
| wiki_name      | percentage_errors |
+----------------+-------------------+
| gewikimedia    |         100.00000 | <- only has 1 public file + a very small number of deleted ones (<10)
| nlwikivoyage   |          99.36709 | <- some wikivoyage wikis have a large number of missing pre-WMF import, pre-2012 file references, all deleted/unused
| eswikinews     |          20.00000 | <- no public files + a very small number of deleted ones (<10)
| enwikivoyage   |          17.77054 | <- wikivoyage 
| arywiki        |          16.00000 | <- only has 21 public files (all backed up) + a very small number of deleted ones (<10)
| gdwiktionary   |          12.50000 | <- only has 7 public files (all backed up) + a very small number of deleted ones (<10)
| frwikivoyage   |           5.12821 | <- wikivoyage
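As an illustration of what the query computes, here is a hedged Python sketch of the same per-wiki error percentage. The function name and the sample rows are made up for the example; only the error status id (4) comes from the query's WHERE clause, and status 3 merely stands in for any non-error status.

```python
from collections import Counter

ERROR_STATUS = 4  # backup_status id filtered on in the query's WHERE clause


def percentage_errors(files):
    """Return (wiki_name, percentage) pairs, highest percentage first.

    files is an iterable of (wiki_name, backup_status) pairs.
    """
    total = Counter()
    errors = Counter()
    for wiki_name, backup_status in files:
        total[wiki_name] += 1
        if backup_status == ERROR_STATUS:
            errors[wiki_name] += 1
    # Like the SQL inner join, only wikis with at least one error are listed.
    result = {wiki: errors[wiki] * 100.0 / total[wiki] for wiki in errors}
    return sorted(result.items(), key=lambda item: item[1], reverse=True)


# Made-up sample rows, not real data.
sample = (
    [("wiki_a", 4)] * 1 + [("wiki_a", 3)] * 7
    + [("wiki_b", 4)] * 4 + [("wiki_b", 3)] * 21
    + [("wiki_c", 3)] * 5
)

for wiki, pct in percentage_errors(sample):
    print(f"{wiki}: {pct:.5f}")
```

The sample shows why very small wikis dominate the top of the list: a single failed file out of eight is already 12.5%.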

Change 747170 merged by Jcrespo:

[operations/puppet@production] puppetmaster: Install 'age' on puppetmaster frontends

https://gerrit.wikimedia.org/r/747170

Change 747113 merged by Jcrespo:

[operations/puppet@production] mediabackup: Add an encryption key to store private files securely

https://gerrit.wikimedia.org/r/747113

A first pass on eqiad finished successfully: 101,970,844 files backed up successfully, with a total size of 373,335,321,603,376 bytes and an error rate (by size) of 0.035%.

Codfw is ongoing, with 59,510,150 backed up so far, and 245,128,583,098,537 bytes by size.

This is a prototype version of the (trivial/non-massive) recovery script, interactive version:

recovery.png (1×1 px, 249 KB)

I got inspired by Bacula's interactive text UI, but I am not good at designing UIs, so I will need feedback because I am not sure it is easy to understand :-/.

Codfw commonswiki backups are at 75% completion (68854627 files/301887395014767 bytes backed up), and will likely finish by next week.

Change 749561 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackups: Add minio port to ipv6 connections

https://gerrit.wikimedia.org/r/749561

Commonswiki codfw backup copy completed, with 91823709 files backed up and a 0.04% error rate.

The number of duplicates and errors stayed constant, which looks good: most likely we are only getting errors from the files that were changed while the snapshot was running.

To finish the codfw snapshot, only around 10 million files are pending from the other wikis; that will be done in early 2022.

Change 749561 abandoned by Jcrespo:

[operations/puppet@production] mediabackups: Add minio port to ipv6 connections

Reason:

Adding the missing/lost ipv6 dns records fixed the issue with current puppet code.

https://gerrit.wikimedia.org/r/749561

Change 752996 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackup: Backup testcommonswiki on codfw

https://gerrit.wikimedia.org/r/752996

Change 752996 merged by Jcrespo:

[operations/puppet@production] mediabackup: Backup testcommonswiki on codfw

https://gerrit.wikimedia.org/r/752996

Change 753095 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackup: Backup s1 (enwiki) media files on codfw

https://gerrit.wikimedia.org/r/753095

Change 753095 merged by Jcrespo:

[operations/puppet@production] mediabackup: Backup s1 (enwiki) media files on codfw

https://gerrit.wikimedia.org/r/753095

Change 753099 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackup: Update mediawiki replica for s1 backup on codfw

https://gerrit.wikimedia.org/r/753099

Change 753099 merged by Jcrespo:

[operations/puppet@production] mediabackup: Update mediawiki replica for s1 backup on codfw

https://gerrit.wikimedia.org/r/753099

Change 754013 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackups: Backup s2 media files at codfw

https://gerrit.wikimedia.org/r/754013

Change 754013 merged by Jcrespo:

[operations/puppet@production] mediabackups: Backup s2 media files at codfw

https://gerrit.wikimedia.org/r/754013

Change 754022 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackups: Backup s3 media files at codfw

https://gerrit.wikimedia.org/r/754022

Change 754023 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackups: Backup s5 media files at codfw

https://gerrit.wikimedia.org/r/754023

Change 754024 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackups: Backup s6 media files at codfw

https://gerrit.wikimedia.org/r/754024

Change 754025 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackups: Backup s7 media files at codfw

https://gerrit.wikimedia.org/r/754025

Change 754026 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackups: Backup s8 media files at codfw

https://gerrit.wikimedia.org/r/754026

Change 754022 merged by Jcrespo:

[operations/puppet@production] mediabackups: Backup s3 media files at codfw

https://gerrit.wikimedia.org/r/754022

Change 754023 merged by Jcrespo:

[operations/puppet@production] mediabackups: Backup s5 media files at codfw

https://gerrit.wikimedia.org/r/754023

Change 754024 merged by Jcrespo:

[operations/puppet@production] mediabackups: Backup s6 media files at codfw

https://gerrit.wikimedia.org/r/754024

Change 754025 merged by Jcrespo:

[operations/puppet@production] mediabackups: Backup s7 media files at codfw

https://gerrit.wikimedia.org/r/754025

Change 754026 merged by Jcrespo:

[operations/puppet@production] mediabackups: Backup s8 media files at codfw

https://gerrit.wikimedia.org/r/754026

Codfw first pass finished for all wikis, this is the percentage of errors:

{P18787}

The ones with a high number of errors are known issues:

[mediabackups]> WITH  errors AS ( select count(*) as count, wiki_name FROM files FORCE INDEX(backup_status) JOIN wikis ON files.wiki = wikis.id JOIN backup_status ON backup_status.id = files.backup_status WHERE backup_status in (4) GROUP BY backup_status, wiki), total AS ( select count(*) as count, wiki_name FROM files JOIN wikis ON files.wiki = wikis.id GROUP BY wiki) SELECT wiki_name, errors.count * 100.0 / total.count AS percentage_errors FROM errors JOIN total USING(wiki_name) ORDER by percentage_errors DESC;
+----------------+-------------------+
| wiki_name      | percentage_errors |
+----------------+-------------------+
| jvwikisource   |         100.00000 | <--- "new wiki", not yet properly configured to be backed up
| gewikimedia    |         100.00000 | <--- "new wiki", not yet properly configured to be backed up
| nlwikivoyage   |          99.36709 | <--- some wikivoyage wikis have a large number of missing pre-WMF import, pre-2012 file references, all deleted/unused
| eswikinews     |          20.00000 | <--- no public files + a very small number of deleted ones (<10)
| arywiki        |          20.00000 | <--- only has 23 public files (all backed up) + a very small number of deleted ones (<10)
| enwikivoyage   |          17.57966 | <--- wikivoyage
| gdwiktionary   |          12.50000 | <--- no public files
| frwikivoyage   |           5.09554 | <--- wikivoyage
jcrespo changed the task status from In Progress to Open.Jan 25 2022, 11:35 AM
jcrespo changed the status of subtask T300020: Develop, package, deploy and document a single file recovery utility from Open to In Progress.

Change 802501 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/software/mediabackups@master] mediabackups: Add test units for the Util helper unit

https://gerrit.wikimedia.org/r/802501

Change 802501 merged by Jcrespo:

[operations/software/mediabackups@master] mediabackups: Add test units for the Util helper unit

https://gerrit.wikimedia.org/r/802501