Content and database backup for WMDE Technical Wishes test wiki
Closed, Resolved · Public · 3 Estimated Story Points

Description

Our test wiki is precarious in several ways, but we plan to put significant effort into creating templates and other content. It should be automatically backed up to more persistent storage.

A typical strategy is to back up the entire hard drive once per week, and keep 4 weeks of data.

Keep:

  • Users (in CentralAuth wiki)
  • Articles from all wikis
  • Custom extension tables
  • Uploads

The exact state of the filesystem does not need to be preserved; we shouldn't be using any overrides that aren't checked into git.
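
The "once per week, keep 4 weeks" retention could be sketched roughly as follows. This is a minimal illustration, not the script used on the test wiki: the function name `prune_backups` is hypothetical, and the `backup-<epoch>` directory naming is assumed from the demo listing at the end of this task.

```shell
# Hypothetical helper: keep only the newest N backup-<epoch> directories.
# Assumes names like backup-1596844801, so a reverse lexical sort puts the
# newest first (true while all epochs have the same number of digits).
prune_backups() {
    dir=$1
    keep=${2:-4}
    # Everything after the first $keep entries is deleted.
    ls -1d "$dir"/backup-* 2>/dev/null | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r old; do
        rm -rf -- "$old"
    done
}

# Example (path is a placeholder):
#   prune_backups /path/to/backups 4
```

Only entries matching `backup-*` are considered, so a sibling directory such as `images` would be left alone.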

Event Timeline

Command to dump content of one wiki:

mwscript dumpBackup.php --wiki=wiki --full > content.xml

Command to restore content from a dump:

mwscript importDump.php --wiki=wiki /vagrant/dumps/content-backup-20200727.xml
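
For scheduled runs, the dated file name used in the restore command could be generated rather than typed by hand. The sketch below assumes the `content-backup-YYYYMMDD.xml` pattern and the `/vagrant/dumps` directory seen above; the helper names `dump_path` and `dump_wiki` are hypothetical.

```shell
# Hypothetical helper: build a dated dump path such as
# /vagrant/dumps/content-backup-20200727.xml.
dump_path() {
    dumps_dir=$1
    stamp=${2:-$(date +%Y%m%d)}
    printf '%s/content-backup-%s.xml\n' "$dumps_dir" "$stamp"
}

# Dump one wiki to today's file, using the dumpBackup.php call shown above.
dump_wiki() {
    wiki=$1
    out=$(dump_path "${2:-/vagrant/dumps}")
    mwscript dumpBackup.php --wiki="$wiki" --full > "$out"
}

# Example:
#   dump_wiki wiki
```

As written, the file name does not include the wiki name, so dumping several wikis on the same day would need a per-wiki directory or a different naming pattern.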
awight renamed this task from Content backup for WMDE Technical Wishes test wiki to Content and database backup for WMDE Technical Wishes test wiki.Jul 27 2020, 12:37 PM
awight updated the task description. (Show Details)

I've asked for an extra volume in WMCS, but not sure yet if this is a thing. If that works, we can mount the volume on the test instance, and write a cron script to run there.

@bd808 (cross-posted from IRC), is it possible to request an external NFS volume in Cloud VPS? This would be c. 80 GB, for scheduled backups from an m1.medium instance.

See https://wikitech.wikimedia.org/wiki/Help:Shared_storage#/data/project for one possibility. A project NFS share won't be backed up either, but may have higher reliability than local storage on an instance. We don't hand out new NFS shares often enough to have a well established request process, but I would suggest a phab task outlining the why and the expected storage need in GiB tagged with Cloud-VPS and cloud-services-team (Kanban) to get things rolling.

I would suggest a phab task

Thanks! Filed as T259254: Request 100GB NFS volume, for WMDE template test project.

A project NFS share won't be backed up either

Fine with me—but the page linked says "This data is backed up", surprisingly.

I will tweak the wording on the wiki. NFS servers are replicated, which provides redundancy, and there are also snapshots for disaster recovery, but calling either option a "backup" is giving folks false hope about recovering files. I feel that most end users equate "backup" with something more robust like an Apple Time Machine incremental backup system that also includes trivial recovery capabilities. Bacula snapshots of an NFS server are not that friendly. :)

Bacula snapshots of an NFS server are not that friendly. :)

Thanks for the clarification! In our case, we would be using your 2N or N+1 storage to create our own "friendly" backups, so from your description we will end up with at least N+2 redundancy, which is reassuring. It means that even during the moments where we've crashed vagrant and have to restore from our friendly dumps, we're still protected against small asteroid strikes :-D

The backup script is in good shape, now we're just waiting to attach the NFS volume.

awight removed awight as the assignee of this task.Aug 7 2020, 1:53 PM

The last exciting step is something about vagrant NFS... The partition is mounted to the guest, using the same settings as for /vagrant. We can write mysqldump files and create directories. cp and rsync can create files from images/, but they have mode 0 and we can't write content. I don't understand what's special about these commands, but the fix is probably to change something about mount options.

@bd808 @Bstorm I'm sure this is exactly the sort of thing that makes NFS a pain to administer, apologies, but... have you seen something like this before?

# Mounted like this from the host OS.
nfs-tools-project.svc.eqiad.wmnet:/srv/misc/shared/wmde-templates-alpha/project on /mnt/nfs/labstore-secondary-project type nfs4 (rw,noatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.16.2.11,local_lock=none,addr=10.64.37.18)

# Attached by vagrant with the following (AFAIK this will also respect the config.nfs.map_uid and config.nfs.map_gid set in Vagrantfile.rb).
+++ b/Vagrantfile-extra.rb
@@ -0,0 +1,8 @@
+Vagrant.configure('2') do |config|
+  config.vm.synced_folder '/mnt/nfs/labstore-secondary-project', '/srv/project',
+    id: 'backup',
+    type: :nfs,
+    mount_options: ['async'],
+    nfs_version: 4,
+    nfs_udp: false
+end

# Mounted like this from the guest OS.
vagrant@mediawiki-vagrant:/srv/project/backups$ mount | grep project
192.168.122.1:/mnt/nfs/labstore-secondary-project on /srv/project type nfs4 (rw,noatime,vers=4.0,rsize=16384,wsize=16384,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.122.81,local_lock=none,addr=192.168.122.1)

# cp fails every time, creating empty files with mode 0o000
vagrant@mediawiki-vagrant:/srv/project/backups$ cp /vagrant/mediawiki/images/README /srv/project/backups/
cp: cannot create regular file '/srv/project/backups/README': Permission denied
vagrant@mediawiki-vagrant:/srv/project/backups$ ls -l README 
---------- 1 vagrant_share vagrant_share 0 Oct 17  1972 README
vagrant@mediawiki-vagrant:/srv/project/backups$ rm -f README

# Creating the file gives it a normal mode.
vagrant@mediawiki-vagrant:/srv/project/backups$ touch README
vagrant@mediawiki-vagrant:/srv/project/backups$ ls -l README
-rw-r--r-- 1 vagrant_share vagrant_share 0 Aug 10 11:33 README

# Now cp works just fine.
vagrant@mediawiki-vagrant:/srv/project/backups$ cp /vagrant/mediawiki/images/README /srv/project/backups/
vagrant@mediawiki-vagrant:/srv/project/backups$ ls -l README 
-rw-r--r-- 1 vagrant_share vagrant_share 84 Aug 10 11:34 README

# umask is fine.
vagrant@mediawiki-vagrant:~$ umask
0022

I guess it's something subtle about file ownership or user mapping? The same failure happens with rsync, so I'm out of workaround ideas.

Also tried copying as the root user, with no success:

root@mediawiki-vagrant:~# cp /vagrant/mediawiki/images/README /srv/project/backups/README 
cp: cannot create regular file '/srv/project/backups/README': Permission denied
root@mediawiki-vagrant:~# ls -l !$
ls -l /srv/project/backups/README
---------- 1 vagrant_share vagrant_share 0 Oct 20  1972 /srv/project/backups/README

I'm able to cp files into the /vagrant NFS share from the guest OS, with seemingly identical mount options.

# Mounted (at the NFS level, at least) with the same options as the project partition above. 
192.168.122.1:/srv/mediawiki-vagrant on /vagrant type nfs4 (rw,noatime,vers=4.0,rsize=16384,wsize=16384,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.122.81,local_lock=none,addr=192.168.122.1)

vagrant@mediawiki-vagrant:~$ cp /vagrant/mediawiki/images/README /vagrant/foo
vagrant@mediawiki-vagrant:~$ ls -l /vagrant/foo
-rw-r--r-- 1 vagrant_share vagrant_share 84 Aug 10 12:03 /vagrant/foo

I haven't been able to dump a list of mounts from vagrant or lxc, to determine whether anything is different at that level.

I am able to cp files from the host OS.

awight@mediawiki1004:/srv/mediawiki-vagrant$ cp /srv/mediawiki-vagrant/mediawiki/images/README /mnt/nfs/labstore-secondary-project/backups/
awight@mediawiki1004:/srv/mediawiki-vagrant$ ls -l /mnt/nfs/labstore-secondary-project/backups/README 
-rw-r--r-- 1 awight wikidev 84 Aug 10 12:59 /mnt/nfs/labstore-secondary-project/backups/README

Note that our weekly database backups are already working as hoped, and these contain all of the labor invested so far in imported templates. We could close this task and leave the uploads backup for a follow-up.

I poked around a bit just to see if I could figure out what strange thing is happening here. I can reproduce the failure of cp inside the Vagrant managed LXC container when the target is under /srv/project. The /srv/project mount inside the LXC container is a mount of the instance's /mnt/nfs/labstore-secondary-project.

inside LXC
$ mount | grep /srv/project
192.168.122.1:/mnt/nfs/labstore-secondary-project on /srv/project type nfs4 (rw,noatime,vers=4.0,rsize=16384,wsize=16384,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.122.81,local_lock=none,addr=192.168.122.1)
on instance
$ mount | grep /mnt/nfs
nfs-tools-project.svc.eqiad.wmnet:/srv/misc/shared/wmde-templates-alpha/project on /mnt/nfs/labstore-secondary-project type nfs4 (rw,noatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.16.2.11,local_lock=none,addr=10.64.37.18)

The failure looks like this when running under strace:

stat("./Gemfile", 0x7ffdee3aeb80)       = -1 ENOENT (No such file or directory)
open("/vagrant/Gemfile", O_RDONLY)      = 3
fstat(3, {st_mode=S_IFREG|0664, st_size=777, ...}) = 0
open("./Gemfile", O_WRONLY|O_CREAT|O_EXCL, 0664) = -1 EACCES (Permission denied)

The instance logs this failure in /var/log/messages.log:

Aug 10 17:31:42 mediawiki1004 kernel: [374657.057838] ------------[ cut here ]------------
Aug 10 17:31:42 mediawiki1004 kernel: [374657.057841] nfsd4_process_open2 failed to open newly-created file! status=13
Aug 10 17:31:42 mediawiki1004 kernel: [374657.057899] WARNING: CPU: 1 PID: 575 at fs/nfsd/nfs4proc.c:456 nfsd4_open+0x4e0/0x6f0 [nfsd]
Aug 10 17:31:42 mediawiki1004 kernel: [374657.057900] Modules linked in: xt_conntrack ipt_REJECT nf_reject_ipv4 tun devlink veth nft_chain_route_ipv4 xt_CHECKSUM nft_chain_nat_ipv4 ipt_MASQUERADE nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache nft_counter xt_tcpudp nft_compat bridge stp llc nf_tables nfnetlink sch_ingress cls_u32 sch_htb qxl ttm hid_generic drm_kms_helper crct10dif_pclmul crc32_pclmul ghash_clmulni_intel drm evdev usbhid joydev pcspkr serio_raw virtio_balloon hid qemu_fw_cfg button dm_mod act_mirred nfsd ifb auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb
ata_generic virtio_net net_failover failover virtio_blk crc32c_intel uhci_hcd ehci_hcd ata_piix usbcore aesni_intel aes_x86_64 crypto_simd
Aug 10 17:31:42 mediawiki1004 kernel: [374657.057947]  psmouse cryptd glue_helper libata virtio_pci i2c_piix4 virtio_ring virtio scsi_mod usb_common
Aug 10 17:31:42 mediawiki1004 kernel: [374657.057954] CPU: 1 PID: 575 Comm: nfsd Tainted: G        W         4.19.0-10-amd64 #1 Debian 4.19.132-1
Aug 10 17:31:42 mediawiki1004 kernel: [374657.057955] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.10.2-1 04/01/2014
Aug 10 17:31:42 mediawiki1004 kernel: [374657.057962] RIP: 0010:nfsd4_open+0x4e0/0x6f0 [nfsd]
Aug 10 17:31:42 mediawiki1004 kernel: [374657.057964] Code: 80 88 a8 01 00 00 01 e9 52 fe ff ff 80 bb 15 01 00 00 00 0f 84 ef fe ff ff 44 89 fe 48 c7 c7 f0 10 79 c0 0f ce e8 da 2d d1 ee <0f> 0b e9 d7 fe ff ff 48 8b 83 18 01 00 00 8b 55 00 48 8d 75 04 89
Aug 10 17:31:42 mediawiki1004 kernel: [374657.057965] RSP: 0000:ffffb0b100c73da8 EFLAGS: 00010286
Aug 10 17:31:42 mediawiki1004 kernel: [374657.057966] RAX: 0000000000000000 RBX: ffff9f9f78dd6240 RCX: 0000000000000006
Aug 10 17:31:42 mediawiki1004 kernel: [374657.057966] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff9f9f7bb166b0
Aug 10 17:31:42 mediawiki1004 kernel: [374657.057967] RBP: ffff9f9f78dd3068 R08: 0000000000001020 R09: 0000000000000004
Aug 10 17:31:42 mediawiki1004 kernel: [374657.057968] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9f9edf3abc00
Aug 10 17:31:42 mediawiki1004 kernel: [374657.057968] R13: ffff9f9f759a4000 R14: ffff9f9f7a9eac40 R15: 000000000d000000
Aug 10 17:31:42 mediawiki1004 kernel: [374657.057970] FS:  0000000000000000(0000) GS:ffff9f9f7bb00000(0000) knlGS:0000000000000000
Aug 10 17:31:42 mediawiki1004 kernel: [374657.057970] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

This feels like a race between the 2 NFS servers that are being used and their caches. If I do a touch $TARGET before doing cp $SOURCE $TARGET then things seem to work fine as @awight reported. The strace in that case looks just a bit different:

stat("./Gemfile", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
open("/vagrant/Gemfile", O_RDONLY)      = 3
fstat(3, {st_mode=S_IFREG|0664, st_size=777, ...}) = 0
open("./Gemfile", O_WRONLY|O_TRUNC)     = 4
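
The touch-then-copy behaviour described here can be wrapped in a small function. This only illustrates the observed workaround and does nothing about the underlying NFS problem; the function name `cp_pretouch` is hypothetical.

```shell
# Hypothetical wrapper: pre-create the target so cp opens an existing file
# (O_WRONLY|O_TRUNC) instead of creating it (O_CREAT|O_EXCL), which is the
# call that failed with EACCES in the first strace.
cp_pretouch() {
    src=$1
    dst=$2
    # If dst is a directory, cp would write dst/<basename>; touch that path.
    if [ -d "$dst" ]; then
        dst=${dst%/}/$(basename "$src")
    fi
    touch -- "$dst" && cp -- "$src" "$dst"
}

# Example:
#   cp_pretouch /vagrant/mediawiki/images/README /srv/project/backups/
```

This handles single files only; a recursive copy would need the same pre-creation step for every file in the tree.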

@awight I don't have an answer about what is happening here, but I do have a suggestion that might let you work around this: can you set up your backup of the images to happen on the instance (mediawiki1004.wmde-templates-alpha.eqiad.wmflabs) rather than from inside the LXC container running on the instance? Does that work around whatever weird layered NFS issue is happening here?

The failure looks like this when running under strace:

I'm so sorry to cause you an strace session...

can you set up your backup of the images to happen on the instance (mediawiki1004.wmde-templates-alpha.eqiad.wmflabs) rather than from inside the LXC container running on the instance

Thanks, I'll go for some variation on this theme. It would be uncomfortable to run the database backups from the host OS since there's no direct connectivity, but I'm okay with running the images/ backup separately. In fact, it can use a slightly different strategy: since the uploads are already versioned by MediaWiki, we don't need to keep weekly, pseudo-incremental copies and can simply sync into a directory. I guess I'll have to write a role in operations/puppet to make it sustainable, but we'll be okay with a quick, user-level crontab entry for now.

I've added this personal crontab entry. We can close the task and puppetize in followup work.

0 1 * * 6 /usr/bin/rsync -av /srv/mediawiki-vagrant/mediawiki/images /mnt/nfs/labstore-secondary-project/backups/
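
The schedule `0 1 * * 6` runs at 01:00 every Saturday. For readers less familiar with rsync's path handling, the same call can be sketched as a function; the name `sync_images` is hypothetical and the defaults are simply the paths from the crontab entry.

```shell
# Hypothetical wrapper around the rsync call from the crontab entry above.
sync_images() {
    src=${1:-/srv/mediawiki-vagrant/mediawiki/images}
    dest=${2:-/mnt/nfs/labstore-secondary-project/backups/}
    # No trailing slash on the source: rsync copies the directory itself,
    # so files land in $dest/images/ rather than directly in $dest.
    rsync -a "${src%/}" "$dest"
}

# Example:
#   sync_images
```

Because `--delete` is not passed, files removed from the source stay in the backup copy.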

Demo:

ssh mediawiki1004.wmde-templates-alpha.eqiad.wmflabs
...
$ ls /mnt/nfs/labstore-secondary-project/backups/
backup-1596805602  backup-1596805642  backup-1596805958  backup-1596844801  images

$ ls /mnt/nfs/labstore-secondary-project/backups/backup-1596844801
centralauth.sql.gz          enwiki.sql.gz  frwiki.sql.gz            images             ruwiki.sql.gz        wikishared.sql.gz  zhwikivoyagewiki.sql.gz
centralauthtestwiki.sql.gz  eswiki.sql.gz  frwiktionarywiki.sql.gz  loginwiki.sql.gz   trwiki.sql.gz        wiki.sql.gz
dewiki.sql.gz               fawiki.sql.gz  hewiki.sql.gz            mobilewiki.sql.gz  wikidatawiki.sql.gz  zhwiki.sql.gz

$ zgrep -c Template /mnt/nfs/labstore-secondary-project/backups/backup-1596844801/enwiki.sql.gz                                                      
22

$ ls -a /mnt/nfs/labstore-secondary-project/backups/images/
.  ..  .htaccess  README
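
The manual `zgrep` spot check in the demo could be turned into a repeatable test along these lines. This is a sketch under assumptions: the function name `verify_dump` and the choice of `Template` as the default pattern are illustrative only, and a match count says nothing about whether the dump is complete.

```shell
# Hypothetical check: the archive must be a valid gzip file and mention
# the pattern on at least one line.
verify_dump() {
    file=$1
    pattern=${2:-Template}
    # gzip -t tests archive integrity without writing any output.
    gzip -t "$file" 2>/dev/null || return 1
    count=$(gzip -dc "$file" | grep -c -- "$pattern" || true)
    [ "$count" -gt 0 ]
}

# Example:
#   verify_dump /mnt/nfs/labstore-secondary-project/backups/backup-1596844801/enwiki.sql.gz
```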
Lena_WMDE claimed this task.
Lena_WMDE set the point value for this task to 5.
Lena_WMDE moved this task from Demo to Done on the WMDE-QWERTY-Sprint-2020-07-22 board.
awight changed the point value for this task from 5 to 3.Aug 11 2020, 9:56 AM