We plan to store cloud NFS data on cinder volumes. For that we need to figure out how to back up the volumes out of ceph.
Event Timeline
Change 730769 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: cinder backups: use per-deployment rabbit pass
Change 730769 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: cinder backups: use per-deployment rabbit pass
Change 730771 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] hieradata: cloudbackup2002: fix typo in LVM volue group name
Change 730771 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] hieradata: cloudbackup2002: fix typo in LVM volue group name
Change 730776 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: cinder backups: create directory for mount
Change 730776 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: cinder backups: create directory for mount
Mentioned in SAL (#wikimedia-cloud) [2021-10-14T12:28:37Z] <arturo> [codfw1dev] add DB grants for cloudbackup2002.codfw.wmnet IP address to the cinder DB (T292546)
Change 730779 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: cinder: allow backup API actions
Change 730779 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: cinder: allow backup API actions
Change 730782 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: galera: allow DB access to cinder-backup nodes
Change 730782 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: galera: allow DB access to cinder-backup nodes
Change 730784 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: cinder.conf: specify lock path
Change 730784 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: cinder.conf: specify lock path
Change 730829 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: cinder backups: introduce ceph client config
Change 730829 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: cinder backups: introduce ceph client config
Change 731370 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] hieradata: openstack: cinder-backups: fix ceph keyring file name
Change 731370 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] hieradata: openstack: cinder-backups: fix ceph keyring file name
Change 731375 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] hieradata: openstack: cinder-backups: fix permissions of ceph keyring file
Change 731375 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] hieradata: openstack: cinder-backups: fix permissions of ceph keyring file
Current status:
A bit of a hiera mess in puppet prevents the cinder-backup service (running on cloudbackup2002.codfw.wmnet) from getting the right ceph credentials, as can be seen in /var/log/cinder/cinder-backup.log when triggering a backup action on cloudcontrol2001-dev.wikimedia.org.
So I had a hunch today. We haven't yet fully tested that cinder-backup can actually fetch information from the ceph cluster (because we found T293752: cloud ceph: refactor rbd client puppet profiles and are blocked on it).
I decided to work around this today to verify whether cinder-backup works with ceph as intended.
Surprise: it doesn't.
It shows this log line:
2021-10-21 12:32:36.677 22312 DEBUG os_brick.initiator.linuxrbd [req-46594b0e-9032-4497-836e-016d97a44a40 novaadmin admin - - -] opening connection to ceph cluster (timeout=-1). connect /usr/lib/python3/dist-packages/os_brick/initiator/linuxrbd.py:70
There is some traffic going on between cloudbackup2002 and the mons:
aborrero@cloudbackup2002:~ $ sudo tcpdump -i any tcp port 3300 or tcp port 6789
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
12:28:09.980792 IP cloudbackup2002.codfw.wmnet.34228 > cloudcephmon2004-dev.codfw.wmnet.6789: Flags [S], seq 994059222, win 42340, options [mss 1460,sackOK,TS val 2245731376 ecr 0,nop,wscale 9], length 0
12:28:09.980852 IP cloudbackup2002.codfw.wmnet.37236 > cloudcephmon2003-dev.codfw.wmnet.6789: Flags [S], seq 768127409, win 42340, options [mss 1460,sackOK,TS val 2666147774 ecr 0,nop,wscale 9], length 0
12:28:09.980863 IP cloudbackup2002.codfw.wmnet.50994 > cloudcephmon2002-dev.codfw.wmnet.6789: Flags [S], seq 4087198713, win 42340, options [mss 1460,sackOK,TS val 3124464738 ecr 0,nop,wscale 9], length 0
12:28:09.981021 IP cloudcephmon2004-dev.codfw.wmnet.6789 > cloudbackup2002.codfw.wmnet.34228: Flags [S.], seq 1706716108, ack 994059223, win 43440, options [mss 1460,sackOK,TS val 958699844 ecr 2245731376,nop,wscale 9], length 0
[..]
I checked the logs on the mons; there is no specific information about what's going on:
aborrero@cloudcephmon2002-dev:~ $ sudo tail /var/log/ceph/ceph.audit.log
2021-10-21T12:53:03.834986+0000 mon.cloudcephmon2003-dev (mon.1) 512964 : audit [DBG] from='client.? 208.80.153.75:0/3066437256' entity='client.codfw1dev-cinder' cmd=[{,",p,r,e,f,i,x,",:,",d,f,",,, ,",f,o,r,m,a,t,",:,",j,s,o,n,",}]: dispatch
2021-10-21T12:53:03.836307+0000 mon.cloudcephmon2003-dev (mon.1) 512965 : audit [DBG] from='client.? 208.80.153.75:0/3066437256' entity='client.codfw1dev-cinder' cmd=[{,",p,r,e,f,i,x,",:,",o,s,d, ,p,o,o,l, ,g,e,t,-,q,u,o,t,a,",,, ,",p,o,o,l,",:, ,",c,o,d,f,w,1,d,e,v,-,c,i,n,d,e,r,",,, ,",f,o,r,m,a,t,",:,",j,s,o,n,",}]: dispatch
2021-10-21T12:53:04.610579+0000 mon.cloudcephmon2004-dev (mon.2) 1222350 : audit [INF] from='mgr.23690785 10.192.20.7:0/763' entity='mgr.cloudcephmon2002-dev' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/cloudcephmon2002-dev/trash_purge_schedule"}]: dispatch
2021-10-21T12:53:04.611581+0000 mon.cloudcephmon2002-dev (mon.0) 708620 : audit [INF] from='mgr.23690785 ' entity='mgr.cloudcephmon2002-dev' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/cloudcephmon2002-dev/trash_purge_schedule"}]: dispatch
However, I do see this weird line, which looks like a bad serialization somewhere:
cmd=[{,",p,r,e,f,i,x,",:,",o,s,d, ,p,o,o,l, ,g,e,t,-,q,u,o,t,a,",,, ,",p,o,o,l,",:, ,",c,o,d,f,w,1,d,e,v,-,c,i,n,d,e,r,",,, ,",f,o,r,m,a,t,",:,",j,s,o,n,",}]
Additional action items:
- set a sensible timeout for the rbd client connection; not sure where that is configured though (apparently not /etc/cinder/cinder.conf)
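For what it's worth, cinder's RBD volume driver documents a rados_connect_timeout option (default -1, i.e. no timeout, which matches the timeout=-1 in the os_brick log above). Whether the backup code path honors the same option is an assumption that needs verifying; a minimal sketch:

```ini
# cinder.conf sketch -- assumes rados_connect_timeout is also honored
# by the backup code path, which needs verification
[DEFAULT]
# seconds to wait when opening a connection to the ceph cluster;
# -1 (the default) means wait forever
rados_connect_timeout = 15
```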
> I see however this weird line cmd=[{,",p,r,e,f,i,x,",:,",o,s,d, ,p,o,o,l, ,g,e,t,-,q,u,o,t,a,",,, ,",p,o,o,l,",:, ,",c,o,d,f,w,1,d,e,v,-,c,i,n,d,e,r,",,, ,",f,o,r,m,a,t,",:,",j,s,o,n,",}]: which seems like a bad serialization somewhere?

This looks to me like a config option that was expected to be an array of strings but was set as a plain string.
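Those comma-separated characters are exactly what you get when a string lands where a list of strings was expected and each character gets treated as a separate element. A minimal Python sketch reproducing the pattern (the JSON payload is copied from the audit log; the join is illustrative, not ceph's actual code):

```python
# The command the client should have sent as a single list element:
cmd = '{"prefix":"df", "format":"json"}'

# Correct shape: a list containing one string
print([cmd])

# Buggy shape: the string itself iterated element-by-element,
# i.e. character-by-character, which is what the mon audit log shows
print("cmd=[" + ",".join(cmd) + "]")
# -> cmd=[{,",p,r,e,f,i,x,",:,",d,f,",,, ,",f,o,r,m,a,t,",:,",j,s,o,n,",}]
```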
Change 734690 had a related patch set uploaded (by David Caro; author: David Caro):
[operations/puppet@production] ceph::osd: add cinder backup hosts to ferm
Change 734690 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] ceph::osd: add cinder backup hosts to ferm
> This seems to me like a config option that was expected to be an array of strings, having set as a string xd

For the record, it was a missing firewall rule on the OSD side (really confusing error messages from the ceph CLI :S).
Update:
- we were able to fix a hiera issue that was preventing us from testing the right ceph keydata for cinder-backups ahead of the ceph refactor https://gerrit.wikimedia.org/r/c/operations/puppet/+/734937 thanks @jbond for the assistance
- with that change in place, the cinder-backup service now works fine. I was able to back up several volumes, and restore them
- I think now we've finally validated the basic functionality of the cinder-backups API and can safely proceed with the next steps, which are:
- the ceph refactor T293752: cloud ceph: refactor rbd client puppet profiles
- enable cinder-backup in the eqiad1 deployment
- figure out how to instrument / automate the backup logic
Extra notes:
- It was nice to discover that the chunking algorithm cinder-backup uses takes empty blocks into account, which means storage on the backup side is managed more efficiently (i.e., backing up a 20G volume of empty data takes very little storage space, not 20G)
- the cinder-backup service is also designed for horizontal scalability: we could just add more cloudbackup servers and cinder-backup will know how to split load/storage across them.
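The empty-block behaviour in the first note can be sketched like this; this is just an illustration of the idea, not cinder's actual chunking code (the chunk size and storage layout are made up):

```python
CHUNK_SIZE = 4  # illustrative only; the real chunk size is configurable and much larger

def backup_chunks(volume: bytes) -> dict:
    """Store only non-empty (non-zero) chunks, keyed by their offset."""
    stored = {}
    for offset in range(0, len(volume), CHUNK_SIZE):
        chunk = volume[offset:offset + CHUNK_SIZE]
        if chunk.strip(b"\x00"):  # all-zero chunks are skipped entirely
            stored[offset] = chunk
    return stored

# A "volume" that is mostly empty: only one chunk gets stored,
# so the backup takes far less space than the nominal volume size
volume = bytes(12) + b"data" + bytes(8)
print(backup_chunks(volume))  # -> {12: b'data'}
```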
Example session with the new CLI:
root@cloudcontrol2001-dev:~# openstack volume list --all-projects
+--------------------------------------+--------------------------------------------+-----------+------+--------------------------------------------------------------+
| ID                                   | Name                                       | Status    | Size | Attached to                                                  |
+--------------------------------------+--------------------------------------------+-----------+------+--------------------------------------------------------------+
| 74bf4553-c92e-4fd5-88ef-33fb789ab07a | tlsvol                                     | available |    3 |                                                              |
| bcde703e-1ad9-40c5-badf-5f5eeae18508 | trove-58ec6fd7-0822-440b-beb5-2581e0edf98f | in-use    |    2 | Attached to 3e2b42b3-7b92-4805-9be2-00b2ab5d349b on /dev/vdb |
| 468cf670-3f23-483b-9309-2f98d289c5dc | bleh                                       | available |    1 |                                                              |
| 4a4f04b1-7c27-4d30-9446-479390b29526 | ussurivol                                  | available |    3 |                                                              |
| 3c82177d-4272-4d63-bef0-edfa3f4a38a5 |                                            | available |   20 |                                                              |
| fbecb639-216c-4d92-a91f-ace4b87e2b0b | testvolume                                 | available |    8 |                                                              |
+--------------------------------------+--------------------------------------------+-----------+------+--------------------------------------------------------------+
root@cloudcontrol2001-dev:~# openstack volume backup create 468cf670-3f23-483b-9309-2f98d289c5dc --name "test backup"
+-------+--------------------------------------+
| Field | Value                                |
+-------+--------------------------------------+
| id    | dd42d300-e1c4-442c-9c6f-fae352e6df9c |
| name  | test backup                          |
+-------+--------------------------------------+
root@cloudcontrol2001-dev:~# openstack volume backup list
+--------------------------------------+-------------+-------------+----------+------+
| ID                                   | Name        | Description | Status   | Size |
+--------------------------------------+-------------+-------------+----------+------+
| dd42d300-e1c4-442c-9c6f-fae352e6df9c | test backup | None        | creating |    1 |
+--------------------------------------+-------------+-------------+----------+------+
root@cloudcontrol2001-dev:~# openstack volume backup list
+--------------------------------------+-------------+-------------+-----------+------+
| ID                                   | Name        | Description | Status    | Size |
+--------------------------------------+-------------+-------------+-----------+------+
| dd42d300-e1c4-442c-9c6f-fae352e6df9c | test backup | None        | available |    1 |
+--------------------------------------+-------------+-------------+-----------+------+
root@cloudcontrol2001-dev:~# openstack volume backup show dd42d300-e1c4-442c-9c6f-fae352e6df9c
+-----------------------+--------------------------------------------+
| Field                 | Value                                      |
+-----------------------+--------------------------------------------+
| availability_zone     | None                                       |
| container             | dd/42/dd42d300-e1c4-442c-9c6f-fae352e6df9c |
| created_at            | 2021-10-27T10:57:59.000000                 |
| data_timestamp        | 2021-10-27T10:57:59.000000                 |
| description           | None                                       |
| fail_reason           | None                                       |
| has_dependent_backups | False                                      |
| id                    | dd42d300-e1c4-442c-9c6f-fae352e6df9c       |
| is_incremental        | False                                      |
| name                  | test backup                                |
| object_count          | 1                                          |
| size                  | 1                                          |
| snapshot_id           | None                                       |
| status                | available                                  |
| updated_at            | 2021-10-27T10:58:21.000000                 |
| volume_id             | 468cf670-3f23-483b-9309-2f98d289c5dc       |
+-----------------------+--------------------------------------------+
Change 740551 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] cloud: cinder-backups: use main ceph cinder keyring
Change 740554 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] cinder: fix config template and don't reuse 'ceph_pool' that much
Change 740562 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[labs/private@master] ceph: codfw: refresh entry name for codfw1dev-cinder-backups
Change 740562 merged by Arturo Borrero Gonzalez:
[labs/private@master] ceph: codfw: refresh entry name for codfw1dev-cinder-backups
Change 740564 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[labs/private@master] codfw1dev: backups: refresh entry for ceph keyring
Change 740564 merged by Arturo Borrero Gonzalez:
[labs/private@master] codfw1dev: backups: refresh entry for ceph keyring
Change 740554 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cinder: fix config template and don't reuse 'ceph_pool' that much
Change 740579 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: codfw1dev: deploy general cinder keyring in cinder-backups nodes
Change 740579 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: codfw1dev: deploy general cinder keyring in cinder-backups nodes
Change 740827 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] cloud: codfw1dev: fix keyring owner/group for cinder-backups
Change 740829 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[labs/private@master] hiera: cloud: refresh keyname for codfw1dev cinder backups
Change 740829 merged by Arturo Borrero Gonzalez:
[labs/private@master] hiera: cloud: refresh keyname for codfw1dev cinder backups
Change 740827 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloud: codfw1dev: fix keyring owner/group for cinder-backups
Change 742273 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):
[operations/puppet@production] cinder.conf: Tune settings for the backup agent.
Change 742273 merged by Andrew Bogott:
[operations/puppet@production] cinder.conf: Tune settings for the backup agent.
Change 740551 abandoned by Arturo Borrero Gonzalez:
[operations/puppet@production] cloud: cinder-backups: use main ceph cinder keyring
Reason:
a different patch was merged, see ed5658a51946148376ec19a6474d5e972bb34167
Logged upstream bug:
https://bugs.launchpad.net/cinder/+bug/1952804
I'm not sure if this is a deal-breaker or not; even with that bug fixed there will still be a race which causes a stuck job if a backup backend goes down unexpectedly.
Here is the other serious upstream bug I've been seeing:
https://bugs.launchpad.net/cinder/+bug/1952805
That means that we can use incremental backups or multiple backend nodes, but not both.
Thanks for identifying the problem and reporting it upstream.
My thought: I introduced puppet support for multiple cinder-backup nodes because that's the way to use all the storage we currently dedicate to backups (remember: 2 cloudbackup servers in codfw with 200TB of storage each).
This is to say: I don't see any problem with using just 1 cinder-backup node until this upstream bug is fixed. The bug shouldn't be a blocker.
All of our short term backup storage requirements for the NFS migration can be covered with a single 200TB cinder-backup node. Example: cloudbackup2002, using 10TB out of 214TB (204TB free).
Again, thanks for identifying the issue and reporting it upstream!
What you describe in the upstream ticket seems like exactly what I've been experiencing: I've seen backups fail right after the cinder-backup agent started (after a config change or whatever). So perhaps it's just a matter of not being anxious, and of not scheduling backups until the cinder-backup service has been up for, let's say, 5 minutes.
What concerns me more is that cinder seems to leave the backup in the 'creating' state forever. Ideally it would declare it a 'failed backup'. Did you ever see cinder declaring one 'failed'?
> What concerns me more is that cinder seems to leave the backup in 'creating' state forever. Ideally it would declare it 'failed backup' Did you ever see cinder declaring it 'failed'?
Yes, usually! When the unavailable service comes back up and gets oriented, the stuck backup usually changes to an error state. But not always :/
I think we should switch all of our testing to a single-backend model and see if it mostly stops breaking.
hey @Andrew, I just noticed that I haven't yet looked into the root cause of the problems I commented on here: https://phabricator.wikimedia.org/T292546#7447927
- some weird rbd command serialization problem
- connectivity issues between cinder-backup and ceph
I suspect there could be problems related to different ceph client library versions, or rbd protocol v1 vs protocol v2, stuff like that. But honestly, I haven't had the chance to dig deeper, in case you want something to investigate.
I looked at this more today. A lot of the suddenly-failing jobs seem to be poorly-surfaced OOM issues (I'm testing on cloudbackup1001-dev, which only has 4GB of RAM). When I change the buffer size to be much smaller I get many fewer failures, although, unfortunately, I'm still seeing occasional jobs stuck in 'creating' forever.
backup_file_size = 3276800
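Sizing that buffer relative to available RAM can be sketched as follows. This is a hypothetical illustration, not the logic from the actual puppet change: the worker count and safety factor are made-up numbers, and the rounding relies on my understanding that backup_file_size should be a multiple of backup_sha_block_size (32768 by default), which is worth double-checking against the cinder docs.

```python
def pick_backup_file_size(available_ram_bytes: int,
                          workers: int = 5,
                          safety_factor: float = 0.1) -> int:
    """Keep the aggregate of in-flight backup buffers well under available RAM."""
    raw = int(available_ram_bytes * safety_factor / workers)
    # round down to a multiple of 32768 (backup_sha_block_size's default)
    return max(32768, raw - raw % 32768)

# e.g. on a 4 GiB host like cloudbackup1001-dev
print(pick_backup_file_size(4 * 1024**3))  # -> 85884928 (~82 MiB)
```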
Change 744821 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):
[operations/puppet@production] cinder-backup: generate backup_file_size relative to available RAM
Change 744821 merged by Andrew Bogott:
[operations/puppet@production] cinder-backup: generate backup_file_size relative to available RAM
Change 745765 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] ceph: auth: drop cinder-backup keyrings
Change 745765 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] ceph: auth: drop cinder-backup keyrings
Change 755057 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):
[operations/puppet@production] Add cinder-backup role/profile for eqiad1, use on cloudbackup2002
Change 755057 merged by Andrew Bogott:
[operations/puppet@production] Add cinder-backup role/profile for eqiad1, use on cloudbackup2002
Change 755753 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):
[operations/puppet@production] Define profile::openstack::eqiad1::cinder::backup::nodes
Change 755753 merged by Andrew Bogott:
[operations/puppet@production] Define profile::openstack::eqiad1::cinder::backup::nodes
Change 755759 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):
[operations/puppet@production] Provide cinder backup node list to rabbitmq in eqiad1
Change 755759 merged by Andrew Bogott:
[operations/puppet@production] Provide cinder backup node list to rabbitmq in eqiad1
Change 755788 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):
[operations/puppet@production] ceph: list cloudbackup2002 as a cinder backup node
Change 755788 merged by Andrew Bogott:
[operations/puppet@production] ceph: list cloudbackup2002 as a cinder backup node
Change 763310 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):
[operations/puppet@production] wmcs-cinder-backup-manager.py: increase total backup timeout
Change 763310 merged by Andrew Bogott:
[operations/puppet@production] wmcs-cinder-backup-manager.py: increase total backup timeout
This is mostly working now -- all modest-sized volumes are getting backed up fine.
I have one outlier: the 8TB 'maps' volume in the 'maps' project never seems to complete. I've increased the timeout to 18 hours with no success.
root@cloudbackup2002:/usr/lib/python3/dist-packages/cinder/backup# iperf -c cloudcephosd1024.eqiad.wmnet -p 7100
------------------------------------------------------------
Client connecting to cloudcephosd1024.eqiad.wmnet, TCP port 7100
TCP window size: 325 KByte (default)
------------------------------------------------------------
[  3] local 10.192.32.186 port 40620 connected with 10.64.20.20 port 7100
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  2.38 GBytes  2.04 Gbits/sec
That should be barely enough bandwidth: at 2 Gbit/s we should be able to transfer an 8TB file in 32,000 seconds, or around 9 hours. That sounds excessive, but if we're only doing occasional full backups it's all somewhat workable if we can optimize a bit more. On the other hand, I note that the rate is awfully close to a round number, which has me wondering whether there's a throttle someplace we could adjust.
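The back-of-the-envelope math above, as a quick sanity check (treating the 8TB volume as 8 * 10^12 bytes and using the measured iperf rate):

```python
volume_bytes = 8 * 10**12         # the 8TB 'maps' volume
rate_bits_per_sec = 2.04 * 10**9  # measured iperf throughput

seconds = volume_bytes * 8 / rate_bits_per_sec
print(f"{seconds:.0f} s, i.e. about {seconds / 3600:.1f} hours")
# -> 31373 s, i.e. about 8.7 hours
```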
The tools NFS mount is also about 8TB, so if we get maps working we should be able to get tools working too, as long as they don't both run their full backups on the same day.
This is actually working now. The maps volume is handled as an edge case (incremental backups don't really function for volumes that large) but we're getting periodic backups at least.
There's ongoing upstream work to tidy up this feature but our deployment is in OK shape now.