
cloud NFS: figure out backups for cinder volumes
Closed, Resolved · Public

Description

We plan to store cloud NFS data on cinder volumes. For that we need to figure out how to back up the volumes out of ceph.

Details

Repo               Branch      Lines +/-
operations/puppet  production  +1 -1
operations/puppet  production  +4 -3
operations/puppet  production  +2 -1
operations/puppet  production  +8 -1
operations/puppet  production  +132 -2
operations/puppet  production  +2 -10
operations/puppet  production  +18 -4
operations/puppet  production  +1 -9
operations/puppet  production  +6 -2
operations/puppet  production  +3 -0
labs/private       master      +2 -4
operations/puppet  production  +2 -2
operations/puppet  production  +21 -11
labs/private       master      +2 -2
labs/private       master      +1 -1
operations/puppet  production  +4 -2
operations/puppet  production  +2 -2
operations/puppet  production  +3 -1
operations/puppet  production  +5 -0
operations/puppet  production  +2 -0
operations/puppet  production  +3 -0
operations/puppet  production  +9 -9
operations/puppet  production  +5 -1
operations/puppet  production  +1 -1
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +165 -2
labs/private       master      +1 -0
labs/private       master      +1 -0
operations/puppet  production  +71 -71


Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 730769 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: cinder backups: use per-deployment rabbit pass

https://gerrit.wikimedia.org/r/730769

Change 730769 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: cinder backups: use per-deployment rabbit pass

https://gerrit.wikimedia.org/r/730769

Change 730771 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] hieradata: cloudbackup2002: fix typo in LVM volue group name

https://gerrit.wikimedia.org/r/730771

Change 730771 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] hieradata: cloudbackup2002: fix typo in LVM volue group name

https://gerrit.wikimedia.org/r/730771

Change 730776 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: cinder backups: create directory for mount

https://gerrit.wikimedia.org/r/730776

Change 730776 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: cinder backups: create directory for mount

https://gerrit.wikimedia.org/r/730776

Mentioned in SAL (#wikimedia-cloud) [2021-10-14T12:28:37Z] <arturo> [codfw1dev] add DB grants for cloudbackup2002.codfw.wmnet IP address to the cinder DB (T292546)

Change 730779 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: cinder: allow backup API actions

https://gerrit.wikimedia.org/r/730779

Change 730779 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: cinder: allow backup API actions

https://gerrit.wikimedia.org/r/730779

Change 730782 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: galera: allow DB access to cinder-backup nodes

https://gerrit.wikimedia.org/r/730782

Change 730782 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: galera: allow DB access to cinder-backup nodes

https://gerrit.wikimedia.org/r/730782

Change 730784 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: cinder.conf: specify lock path

https://gerrit.wikimedia.org/r/730784

Change 730784 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: cinder.conf: specify lock path

https://gerrit.wikimedia.org/r/730784

Change 730829 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: cinder backups: introduce ceph client config

https://gerrit.wikimedia.org/r/730829

Change 730829 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: cinder backups: introduce ceph client config

https://gerrit.wikimedia.org/r/730829

Change 731370 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] hieradata: openstack: cinder-backups: fix ceph keyring file name

https://gerrit.wikimedia.org/r/731370

Change 731370 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] hieradata: openstack: cinder-backups: fix ceph keyring file name

https://gerrit.wikimedia.org/r/731370

Change 731375 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] hieradata: openstack: cinder-backups: fix permissions of ceph keyring file

https://gerrit.wikimedia.org/r/731375

Change 731375 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] hieradata: openstack: cinder-backups: fix permissions of ceph keyring file

https://gerrit.wikimedia.org/r/731375

Current status:

A bit of a hiera mess in puppet prevents the cinder-backup service (running on cloudbackup2002.codfw.wmnet) from getting the right ceph credentials (as can be seen in /var/log/cinder/cinder-backup.log when triggering a backup action on cloudcontrol2001-dev.wikimedia.org).

So I had a hunch today. We haven't yet fully tested that cinder-backup can indeed fetch information from the ceph cluster (because we found T293752: cloud ceph: refactor rbd client puppet profiles and are blocked on it).
I decided to work around this today to verify whether cinder-backup does work with ceph as intended or not.

Surprise: it doesn't.

It shows this log line:

2021-10-21 12:32:36.677 22312 DEBUG os_brick.initiator.linuxrbd [req-46594b0e-9032-4497-836e-016d97a44a40 novaadmin admin - - -] opening connection to ceph cluster (timeout=-1). connect /usr/lib/python3/dist-packages/os_brick/initiator/linuxrbd.py:70

There is some traffic going on between cloudbackup2002 and the mons:

aborrero@cloudbackup2002:~ $ sudo tcpdump -i any tcp port 3300 or tcp port 6789
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
12:28:09.980792 IP cloudbackup2002.codfw.wmnet.34228 > cloudcephmon2004-dev.codfw.wmnet.6789: Flags [S], seq 994059222, win 42340, options [mss 1460,sackOK,TS val 2245731376 ecr 0,nop,wscale 9], length 0
12:28:09.980852 IP cloudbackup2002.codfw.wmnet.37236 > cloudcephmon2003-dev.codfw.wmnet.6789: Flags [S], seq 768127409, win 42340, options [mss 1460,sackOK,TS val 2666147774 ecr 0,nop,wscale 9], length 0
12:28:09.980863 IP cloudbackup2002.codfw.wmnet.50994 > cloudcephmon2002-dev.codfw.wmnet.6789: Flags [S], seq 4087198713, win 42340, options [mss 1460,sackOK,TS val 3124464738 ecr 0,nop,wscale 9], length 0
12:28:09.981021 IP cloudcephmon2004-dev.codfw.wmnet.6789 > cloudbackup2002.codfw.wmnet.34228: Flags [S.], seq 1706716108, ack 994059223, win 43440, options [mss 1460,sackOK,TS val 958699844 ecr 2245731376,nop,wscale 9], length 0
[..]

I checked the logs on the mons; there is no specific information about what's going on:

aborrero@cloudcephmon2002-dev:~ $ sudo tail /var/log/ceph/ceph.audit.log
2021-10-21T12:53:03.834986+0000 mon.cloudcephmon2003-dev (mon.1) 512964 : audit [DBG] from='client.? 208.80.153.75:0/3066437256' entity='client.codfw1dev-cinder' cmd=[{,",p,r,e,f,i,x,",:,",d,f,",,, ,",f,o,r,m,a,t,",:,",j,s,o,n,",}]: dispatch
2021-10-21T12:53:03.836307+0000 mon.cloudcephmon2003-dev (mon.1) 512965 : audit [DBG] from='client.? 208.80.153.75:0/3066437256' entity='client.codfw1dev-cinder' cmd=[{,",p,r,e,f,i,x,",:,",o,s,d, ,p,o,o,l, ,g,e,t,-,q,u,o,t,a,",,, ,",p,o,o,l,",:, ,",c,o,d,f,w,1,d,e,v,-,c,i,n,d,e,r,",,, ,",f,o,r,m,a,t,",:,",j,s,o,n,",}]: dispatch
2021-10-21T12:53:04.610579+0000 mon.cloudcephmon2004-dev (mon.2) 1222350 : audit [INF] from='mgr.23690785 10.192.20.7:0/763' entity='mgr.cloudcephmon2002-dev' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/cloudcephmon2002-dev/trash_purge_schedule"}]: dispatch
2021-10-21T12:53:04.611581+0000 mon.cloudcephmon2002-dev (mon.0) 708620 : audit [INF] from='mgr.23690785 ' entity='mgr.cloudcephmon2002-dev' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/cloudcephmon2002-dev/trash_purge_schedule"}]: dispatch

However, I see this weird line cmd=[{,",p,r,e,f,i,x,",:,",o,s,d, ,p,o,o,l, ,g,e,t,-,q,u,o,t,a,",,, ,",p,o,o,l,",:, ,",c,o,d,f,w,1,d,e,v,-,c,i,n,d,e,r,",,, ,",f,o,r,m,a,t,",:,",j,s,o,n,",}]: which looks like a bad serialization somewhere?

Additional action items:

  • have a sensible timeout for the rbd client connection. I'm not sure where that is set, though (apparently not in /etc/cinder/cinder.conf); see the sketch below
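
A minimal sketch of what an explicit client-side timeout could look like, assuming python3-rados and the client.codfw1dev-cinder keyring seen in the audit log above (the real os_brick/cinder code path may set this elsewhere):

# Probe the ceph cluster with an explicit connect timeout, instead of the
# timeout=-1 (wait forever) seen in the os_brick debug log. The client name
# and conf path are assumptions based on the codfw1dev setup above.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf',
                      name='client.codfw1dev-cinder')
# Also cap mon operations so a silently dropped connection fails fast.
cluster.conf_set('rados_mon_op_timeout', '10')
try:
    cluster.connect(timeout=10)
    print('connected, cluster fsid:', cluster.get_fsid())
finally:
    cluster.shutdown()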

> However, I see this weird line cmd=[{,",p,r,e,f,i,x,",:,",o,s,d, ,p,o,o,l, ,g,e,t,-,q,u,o,t,a,",,, ,",p,o,o,l,",:, ,",c,o,d,f,w,1,d,e,v,-,c,i,n,d,e,r,",,, ,",f,o,r,m,a,t,",:,",j,s,o,n,",}]: which looks like a bad serialization somewhere?

This seems to me like a config option that was expected to be an array of strings but was set as a string instead xd
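
A quick illustration of that hypothesis (illustrative only, not the actual ceph/cinder code path): the mon audit log appears to comma-join the elements of cmd, so a plain string passed where a list of strings is expected gets exploded into individual characters:

# Joining a list of command strings vs. accidentally joining a single string:
cmd_as_list = ['{"prefix":"df", "format":"json"}']
cmd_as_string = '{"prefix":"df", "format":"json"}'

print(','.join(cmd_as_list))    # {"prefix":"df", "format":"json"}
print(','.join(cmd_as_string))  # {,",p,r,e,f,i,x,",:,",d,f,",,, ,"... (the pattern seen in the audit log)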

Change 734690 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] ceph::osd: add cinder backup hosts to ferm

https://gerrit.wikimedia.org/r/734690

Change 734690 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] ceph::osd: add cinder backup hosts to ferm

https://gerrit.wikimedia.org/r/734690

> This seems to me like a config option that was expected to be an array of strings but was set as a string instead xd

For the record, it was a missing firewall rule on the osd side (really confusing error messages from ceph cli :S)

Update:

  • we were able to fix a hiera issue that was preventing us from testing the right ceph keydata for cinder-backups ahead of the ceph refactor (https://gerrit.wikimedia.org/r/c/operations/puppet/+/734937); thanks @jbond for the assistance
  • with that change in place, the cinder-backup service now works fine. I was able to back up several volumes and restore them
  • I think we've now finally validated the basic functionality of the cinder-backup API and can safely proceed with the next steps

Extra notes:

  • It was nice to discover that the chunking algorithm cinder-backup uses takes empty blocks into account; this means the storage on the backup side will be managed more efficiently (i.e., backing up a 20G volume of empty data takes very little storage space, not 20G). See the sketch after this list.
  • the cinder-backup service is also designed for horizontal scalability. We could just add more cloudbackup servers and cinder-backup will know how to split the load and storage across them.
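
As a rough illustration of the first point, this is the general idea behind skipping empty blocks in a chunked backup (a toy sketch only, not cinder's actual chunked-driver implementation):

# Yield only the chunks of a volume image that contain non-zero data; an
# all-zero (empty) chunk costs nothing on the backup store.
CHUNK_SIZE = 32 * 1024 * 1024  # hypothetical 32 MiB chunk

def chunks_to_store(volume_path):
    zero_chunk = b'\x00' * CHUNK_SIZE
    with open(volume_path, 'rb') as vol:
        offset = 0
        while True:
            data = vol.read(CHUNK_SIZE)
            if not data:
                break
            if data != zero_chunk[:len(data)]:
                yield offset, data
            offset += len(data)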

Example session with the new CLI:

root@cloudcontrol2001-dev:~# openstack volume list --all-projects
+--------------------------------------+--------------------------------------------+-----------+------+---------------------------------------------------------------+
| ID                                   | Name                                       | Status    | Size | Attached to                                                   |
+--------------------------------------+--------------------------------------------+-----------+------+---------------------------------------------------------------+
| 74bf4553-c92e-4fd5-88ef-33fb789ab07a | tlsvol                                     | available |    3 |                                                               |
| bcde703e-1ad9-40c5-badf-5f5eeae18508 | trove-58ec6fd7-0822-440b-beb5-2581e0edf98f | in-use    |    2 | Attached to 3e2b42b3-7b92-4805-9be2-00b2ab5d349b on /dev/vdb  |
| 468cf670-3f23-483b-9309-2f98d289c5dc | bleh                                       | available |    1 |                                                               |
| 4a4f04b1-7c27-4d30-9446-479390b29526 | ussurivol                                  | available |    3 |                                                               |
| 3c82177d-4272-4d63-bef0-edfa3f4a38a5 |                                            | available |   20 |                                                               |
| fbecb639-216c-4d92-a91f-ace4b87e2b0b | testvolume                                 | available |    8 |                                                               |
+--------------------------------------+--------------------------------------------+-----------+------+---------------------------------------------------------------+
root@cloudcontrol2001-dev:~# openstack volume backup create 468cf670-3f23-483b-9309-2f98d289c5dc --name "test backup"
+-------+--------------------------------------+
| Field | Value                                |
+-------+--------------------------------------+
| id    | dd42d300-e1c4-442c-9c6f-fae352e6df9c |
| name  | test backup                          |
+-------+--------------------------------------+
root@cloudcontrol2001-dev:~# openstack volume backup list
+--------------------------------------+-------------+-------------+----------+------+
| ID                                   | Name        | Description | Status   | Size |
+--------------------------------------+-------------+-------------+----------+------+
| dd42d300-e1c4-442c-9c6f-fae352e6df9c | test backup | None        | creating |    1 |
+--------------------------------------+-------------+-------------+----------+------+
root@cloudcontrol2001-dev:~# openstack volume backup list
+--------------------------------------+-------------+-------------+-----------+------+
| ID                                   | Name        | Description | Status    | Size |
+--------------------------------------+-------------+-------------+-----------+------+
| dd42d300-e1c4-442c-9c6f-fae352e6df9c | test backup | None        | available |    1 |
+--------------------------------------+-------------+-------------+-----------+------+
root@cloudcontrol2001-dev:~# openstack volume backup show dd42d300-e1c4-442c-9c6f-fae352e6df9c
+-----------------------+--------------------------------------------+
| Field                 | Value                                      |
+-----------------------+--------------------------------------------+
| availability_zone     | None                                       |
| container             | dd/42/dd42d300-e1c4-442c-9c6f-fae352e6df9c |
| created_at            | 2021-10-27T10:57:59.000000                 |
| data_timestamp        | 2021-10-27T10:57:59.000000                 |
| description           | None                                       |
| fail_reason           | None                                       |
| has_dependent_backups | False                                      |
| id                    | dd42d300-e1c4-442c-9c6f-fae352e6df9c       |
| is_incremental        | False                                      |
| name                  | test backup                                |
| object_count          | 1                                          |
| size                  | 1                                          |
| snapshot_id           | None                                       |
| status                | available                                  |
| updated_at            | 2021-10-27T10:58:21.000000                 |
| volume_id             | 468cf670-3f23-483b-9309-2f98d289c5dc       |
+-----------------------+--------------------------------------------+

Change 740551 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloud: cinder-backups: use main ceph cinder keyring

https://gerrit.wikimedia.org/r/740551

Change 740554 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cinder: fix config template and don't reuse 'ceph_pool' that much

https://gerrit.wikimedia.org/r/740554

Change 740562 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[labs/private@master] ceph: codfw: refresh entry name for codfw1dev-cinder-backups

https://gerrit.wikimedia.org/r/740562

Change 740562 merged by Arturo Borrero Gonzalez:

[labs/private@master] ceph: codfw: refresh entry name for codfw1dev-cinder-backups

https://gerrit.wikimedia.org/r/740562

Change 740564 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[labs/private@master] codfw1dev: backups: refresh entry for ceph keyring

https://gerrit.wikimedia.org/r/740564

Change 740564 merged by Arturo Borrero Gonzalez:

[labs/private@master] codfw1dev: backups: refresh entry for ceph keyring

https://gerrit.wikimedia.org/r/740564

Change 740554 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cinder: fix config template and don't reuse 'ceph_pool' that much

https://gerrit.wikimedia.org/r/740554

Change 740579 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: codfw1dev: deploy general cinder keyring in cinder-backups nodes

https://gerrit.wikimedia.org/r/740579

Change 740579 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: codfw1dev: deploy general cinder keyring in cinder-backups nodes

https://gerrit.wikimedia.org/r/740579

Change 740827 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloud: codfw1dev: fix keyring owner/group for cinder-backups

https://gerrit.wikimedia.org/r/740827

Change 740829 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[labs/private@master] hiera: cloud: refresh keyname for codfw1dev cinder backups

https://gerrit.wikimedia.org/r/740829

Change 740829 merged by Arturo Borrero Gonzalez:

[labs/private@master] hiera: cloud: refresh keyname for codfw1dev cinder backups

https://gerrit.wikimedia.org/r/740829

Change 740827 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloud: codfw1dev: fix keyring owner/group for cinder-backups

https://gerrit.wikimedia.org/r/740827

Change 742273 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cinder.conf: Tune settings for the backup agent.

https://gerrit.wikimedia.org/r/742273

Change 742273 merged by Andrew Bogott:

[operations/puppet@production] cinder.conf: Tune settings for the backup agent.

https://gerrit.wikimedia.org/r/742273

Change 740551 abandoned by Arturo Borrero Gonzalez:

[operations/puppet@production] cloud: cinder-backups: use main ceph cinder keyring

Reason:

a different patch was merged, see ed5658a51946148376ec19a6474d5e972bb34167

https://gerrit.wikimedia.org/r/740551

Logged upstream bug:

https://bugs.launchpad.net/cinder/+bug/1952804

I'm not sure if this is a deal-breaker or not; even with that bug fixed there will still be a race which causes a stuck job if a backup backend goes down unexpectedly.

Here is the other serious upstream bug I've been seeing:

https://bugs.launchpad.net/cinder/+bug/1952805

That means that we can use incremental backups or multiple backend nodes, but not both.

> Here is the other serious upstream bug I've been seeing:
>
> https://bugs.launchpad.net/cinder/+bug/1952805
>
> That means that we can use incremental backups or multiple backend nodes, but not both.

Thanks for identifying the problem and reporting it upstream.

My thought: I introduced puppet support for multiple cinder-backup nodes because that's the way to use all of our current storage dedicated to backups (remember, 2 cloudbackup servers in codfw with 200TB of storage each).

This is to say: I don't see any problem in using just one cinder-backup node until this upstream bug is fixed. This bug shouldn't be a blocker.
All of our short-term backup storage requirements for the NFS migration can be covered by a single 200TB cinder-backup node. Example: cloudbackup2002, using 10TB out of 214TB (204TB free).

> Logged upstream bug:
>
> https://bugs.launchpad.net/cinder/+bug/1952804
>
> I'm not sure if this is a deal-breaker or not; even with that bug fixed there will still be a race which causes a stuck job if a backup backend goes down unexpectedly.

Again, thanks for identifying the issue and reporting it upstream!

What you described in the upstream ticket seems like exactly what I've been experiencing. I've seen backups fail right after the cinder-backup agent started (after a config change or whatever). So perhaps it's just a matter of not being anxious and not scheduling backups until the cinder-backup service has been up for, let's say, 5 minutes (see the sketch below).
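
A minimal sketch of that wait-before-scheduling idea, assuming python3 and openstackclient with admin credentials in the environment; the 5-minute grace period is the value suggested above:

# Poll the cinder service list and only start creating backups once the
# cinder-backup binary has been reported 'up' for a full grace period.
import json
import subprocess
import time

GRACE_PERIOD = 5 * 60  # seconds the service must stay up before we trust it

def backup_service_up():
    out = subprocess.check_output(
        ['openstack', 'volume', 'service', 'list', '-f', 'json'])
    return any(svc['Binary'] == 'cinder-backup' and svc['State'] == 'up'
               for svc in json.loads(out))

def wait_for_backup_service(poll_interval=30):
    up_since = None
    while True:
        if backup_service_up():
            up_since = up_since or time.time()
            if time.time() - up_since >= GRACE_PERIOD:
                return
        else:
            up_since = None
        time.sleep(poll_interval)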

What concerns me more is that cinder seems to leave the backup in the 'creating' state forever. Ideally it would declare it a 'failed backup'. Did you ever see cinder declare it 'failed'?

> What concerns me more is that cinder seems to leave the backup in the 'creating' state forever. Ideally it would declare it a 'failed backup'. Did you ever see cinder declare it 'failed'?

Yes, usually! When the unavailable service comes up and gets oriented, the stuck backup usually changes to an error state. But not always :/

I think we should switch all of our testing to a single-backend model and see if it mostly stops breaking.
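
For the cases where it never flips on its own, an admin can usually reset the stuck record by hand; a hedged sketch, assuming the cinder CLI is available on a cloudcontrol host:

# Find backups left in 'creating' and reset them to 'error' so they can be
# deleted and retried. Note this is naive: it would also catch a backup that
# is legitimately still running, so check the 'Updated At' age in practice.
import json
import subprocess

backups = json.loads(subprocess.check_output(
    ['openstack', 'volume', 'backup', 'list', '-f', 'json']))
for backup in backups:
    if backup['Status'] == 'creating':
        # backup-reset-state is an admin-only cinder action
        subprocess.check_call(
            ['cinder', 'backup-reset-state', '--state', 'error', backup['ID']])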

>> What concerns me more is that cinder seems to leave the backup in the 'creating' state forever. Ideally it would declare it a 'failed backup'. Did you ever see cinder declare it 'failed'?
>
> Yes, usually! When the unavailable service comes up and gets oriented, the stuck backup usually changes to an error state. But not always :/
>
> I think we should switch all of our testing to a single-backend model and see if it mostly stops breaking.

ok, agreed.

Hey @Andrew, I just noticed that I haven't yet looked into the root of the problem I commented on here: https://phabricator.wikimedia.org/T292546#7447927

  • some weird rbd command serialization problem
  • connectivity issues cinder-backups <-> ceph

I suspect there could be problems related to different ceph client library versions, or rbd proto v1 vs proto v2, stuff like that. But honestly, I haven't had the chance to dig deeper, in case you want something to investigate.

I looked at this more today. A lot of the suddenly-failing jobs seem to be poorly-surfaced OOM issues (I'm testing on cloudbackup1001-dev, which only has 4GB of RAM). When I change the buffer size to be much smaller I get far fewer failures, although, unfortunately, I'm still seeing occasional jobs stuck in 'creating' forever.

backup_file_size = 3276800
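
The general shape of deriving backup_file_size from available RAM could look like the sketch below; the divisor and the rounding to the default sha block size are assumptions for illustration, not the formula used in the puppet change:

# Pick a backup chunk size proportional to the host's RAM, so a small host
# (e.g. 4GB on cloudbackup1001-dev) doesn't OOM while buffering chunks.
SHA_BLOCK = 32768  # keep the result a multiple of the default sha block size

def backup_file_size_for_host(divisor=64):
    mem_total_bytes = 0
    with open('/proc/meminfo') as meminfo:
        for line in meminfo:
            if line.startswith('MemTotal:'):
                mem_total_bytes = int(line.split()[1]) * 1024  # value is in kB
                break
    size = mem_total_bytes // divisor
    return max(SHA_BLOCK, (size // SHA_BLOCK) * SHA_BLOCK)

print(backup_file_size_for_host())  # ~64MB on a 4GB host with divisor=64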

Change 744821 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cinder-backup: generate backup_file_size relative to available RAM

https://gerrit.wikimedia.org/r/744821

Change 744821 merged by Andrew Bogott:

[operations/puppet@production] cinder-backup: generate backup_file_size relative to available RAM

https://gerrit.wikimedia.org/r/744821

Change 745765 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] ceph: auth: drop cinder-backup keyrings

https://gerrit.wikimedia.org/r/745765

Change 745765 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] ceph: auth: drop cinder-backup keyrings

https://gerrit.wikimedia.org/r/745765

Change 755057 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Add cinder-backup role/profile for eqiad1, use on cloudbackup2002

https://gerrit.wikimedia.org/r/755057

Change 755057 merged by Andrew Bogott:

[operations/puppet@production] Add cinder-backup role/profile for eqiad1, use on cloudbackup2002

https://gerrit.wikimedia.org/r/755057

Change 755753 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Define profile::openstack::eqiad1::cinder::backup::nodes

https://gerrit.wikimedia.org/r/755753

Change 755753 merged by Andrew Bogott:

[operations/puppet@production] Define profile::openstack::eqiad1::cinder::backup::nodes

https://gerrit.wikimedia.org/r/755753

Change 755759 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Provide cinder backup node list to rabbitmq in eqiad1

https://gerrit.wikimedia.org/r/755759

Change 755759 merged by Andrew Bogott:

[operations/puppet@production] Provide cinder backup node list to rabbitmq in eqiad1

https://gerrit.wikimedia.org/r/755759

Change 755788 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] ceph: list cloudbackup2002 as a cinder backup node

https://gerrit.wikimedia.org/r/755788

Change 755788 merged by Andrew Bogott:

[operations/puppet@production] ceph: list cloudbackup2002 as a cinder backup node

https://gerrit.wikimedia.org/r/755788

Change 763310 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] wmcs-cinder-backup-manager.py: increase total backup timeout

https://gerrit.wikimedia.org/r/763310

Change 763310 merged by Andrew Bogott:

[operations/puppet@production] wmcs-cinder-backup-manager.py: increase total backup timeout

https://gerrit.wikimedia.org/r/763310

This is mostly working now -- all modest-sized volumes are getting backed up fine.

I have one outlier: the 8TB 'maps' volume in the 'maps' project never seems to complete. I've increased the timeout to 18 hours with no success.

root@cloudbackup2002:/usr/lib/python3/dist-packages/cinder/backup# iperf -c cloudcephosd1024.eqiad.wmnet -p 7100
------------------------------------------------------------
Client connecting to cloudcephosd1024.eqiad.wmnet, TCP port 7100
TCP window size:  325 KByte (default)
------------------------------------------------------------
[  3] local 10.192.32.186 port 40620 connected with 10.64.20.20 port 7100
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  2.38 GBytes  2.04 Gbits/sec

That should be barely enough bandwidth... at 2 Gbit/s we should be able to transfer an 8TB volume in about 32,000 seconds, or around 9 hours. That sounds excessive, but if we're only doing occasional full backups this is all somewhat workable if we can optimize a bit more. On the other hand, that rate is awfully close to a round number, which has me wondering if there's a throttle someplace we could adjust.
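
For reference, that back-of-the-envelope estimate spelled out:

# 8 TB at the measured ~2.04 Gbit/s iperf rate:
volume_bytes = 8 * 10**12
rate_bits_per_s = 2.04 * 10**9
seconds = volume_bytes * 8 / rate_bits_per_s
print(round(seconds), 'seconds, ~%.1f hours' % (seconds / 3600))  # ~31373 s, ~8.7 h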

The tools NFS mount is also about 8TB, so if we get maps working we should be able to get tools working too, as long as they don't try to do their full backups on the same day.

This is actually working now. The maps volume is handled as an edge case (incremental backups don't really function for volumes that large), but we're getting periodic backups at least.
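
One possible shape for that kind of special-casing, as a sketch only (the size threshold and the fall-back-to-full-backups rule are assumptions, not necessarily what wmcs-cinder-backup-manager.py actually does):

# Request incremental backups by default, but fall back to full backups for
# very large volumes where incrementals don't really function.
import json
import subprocess

FULL_ONLY_THRESHOLD_GB = 4096  # hypothetical cut-off

def create_backup(volume_id):
    vol = json.loads(subprocess.check_output(
        ['openstack', 'volume', 'show', volume_id, '-f', 'json']))
    # --force allows backing up a volume that is currently attached/in use
    cmd = ['openstack', 'volume', 'backup', 'create', '--force']
    if vol['size'] <= FULL_ONLY_THRESHOLD_GB:
        cmd.append('--incremental')
    cmd.append(volume_id)
    subprocess.check_call(cmd)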

There's ongoing upstream work to tidy up this feature but our deployment is in OK shape now.