
Practice Galera disaster recovery
Closed, Resolved (Public)

Description

Let's confirm alerting and disaster recovery are working properly with the Galera setup.

Proposed steps:

  • Delete one of the databases (glance?) on cloudcontrol2001-dev
  • Stop galera on cloudcontrol2001, 2003, 2004
  • Confirm that some sensible alerts are showing up on icinga
  • Recover!

Rudimentary docs can be found at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Galera and at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting

Event Timeline

aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

We seem to have 2 related services: mysql and mariadb. It was hard to tell which one was the actual galera-enabled database. The mariadb service is full of errors and apparently won't start. I suggest we drop it, if we can, since it could produce systemd icinga alerts in the future.

Mentioned in SAL (#wikimedia-cloud) [2020-07-03T11:36:21Z] <arturo> [codfw1dev] dropped glance database in the galera cluster T256283

Mentioned in SAL (#wikimedia-cloud) [2020-07-03T11:39:12Z] <arturo> [codfw1dev] stopped mysql database in the galera cluster T256283

After dropping the database and stopping the mysql service, I don't see anything in icinga indicating that the openstack system is in a bad state.

But obviously the API is returning HTTP/500:

root@cloudcontrol2001-dev:~# openstack endpoint list
An unexpected error prevented the server from fulfilling your request. (HTTP 500) (Request-ID: req-073731fd-8af0-4032-8e5e-4dc7eec229ef)
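A check along these lines might have surfaced the problem in icinga. This is only a minimal sketch of a Nagios-style plugin, not an existing check: the function names `classify_galera_state` and `read_wsrep_var` and the expected cluster size of 3 are assumptions for illustration, and it assumes the `mysql` client can connect as root over the local socket. It relies on the standard Galera status variables `wsrep_cluster_status` and `wsrep_cluster_size`.

```shell
#!/bin/bash
# Hypothetical Icinga/Nagios-style check for a Galera node (a sketch only).
# Plugin convention for return codes: 0=OK, 1=WARNING, 2=CRITICAL.

EXPECTED_NODES=3   # assumption: three cloudcontrol hosts form the cluster

# Map wsrep status values to a plugin state.
# $1 = wsrep_cluster_status (e.g. "Primary"), $2 = wsrep_cluster_size
classify_galera_state() {
    local status="$1" size="${2:-0}"
    # An empty status means the server could not be queried at all.
    if [ "$status" != "Primary" ]; then
        echo "CRITICAL: cluster status is '${status:-unknown}'"
        return 2
    fi
    if [ "$size" -lt "$EXPECTED_NODES" ]; then
        echo "WARNING: only $size of $EXPECTED_NODES nodes in cluster"
        return 1
    fi
    echo "OK: Primary component with $size nodes"
    return 0
}

# Read one wsrep status variable from the local server.
# Prints nothing if the server is unreachable (e.g. the service is stopped).
read_wsrep_var() {
    mysql -u root --batch --skip-column-names \
        -e "SHOW GLOBAL STATUS LIKE '$1'" 2>/dev/null | awk '{print $2}'
}
```

Wired up as a plugin, the script would end by calling `classify_galera_state "$(read_wsrep_var wsrep_cluster_status)" "$(read_wsrep_var wsrep_cluster_size)"` and exiting with that return code. With the mysql service stopped, both reads come back empty and the check reports CRITICAL.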

Mentioned in SAL (#wikimedia-cloud) [2020-07-03T11:44:40Z] <arturo> [codfw1dev] restoring glance database backup from bacula into cloudcontrol2001-dev (T256283)

Trying to import the database from the backup produced an unexpected error:

root@cloudcontrol2001-dev:~# mysqlimport glance /var/tmp/bacula-restores/srv/backups/glance-202007030408.sql -u root
mysqlimport: Error: 2002 Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (111)

We should note in the docs which particular command to use to reimport the database backup in this galera setup.
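For the docs, something like the following could be a starting point. It is a rough sketch under a few assumptions: `restore_sql_dump` is a hypothetical helper name, root can connect over the local socket without a password, and the dump does not contain its own `CREATE DATABASE` statement. Note that `mysqlimport` loads delimited text files into tables, while a `.sql` dump needs to be replayed through the `mysql` client. Error 2002 means the server is not accepting connections on the socket, so the sketch checks that first with `mysqladmin ping`.

```shell
#!/bin/bash
# Hypothetical helper: replay a .sql dump into a database on the local server.
# Usage: restore_sql_dump DBNAME /path/to/dump.sql
restore_sql_dump() {
    local db="$1" dump="$2"

    if [ ! -r "$dump" ]; then
        echo "dump file not readable: $dump" >&2
        return 1
    fi

    # mysqladmin ping fails when the server is unreachable,
    # which is the situation behind error 2002 on the socket.
    if ! mysqladmin -u root ping >/dev/null 2>&1; then
        echo "server not reachable; start the galera-enabled service first" >&2
        return 2
    fi

    # The database was dropped, so recreate it before loading
    # (assumption: the dump itself does not create it).
    mysql -u root -e "CREATE DATABASE IF NOT EXISTS \`$db\`" || return 3

    # Replay the SQL statements from the dump.
    mysql -u root "$db" < "$dump"
}
```

Running this on a single node should be enough, since Galera replicates the resulting writes to the other members once they rejoin.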

OK, a couple of things:

root@cloudcontrol2001-dev:~# mysql -u root glance < /var/tmp/bacula-restores/srv/backups/glance.sql
  • after all this the data was loaded and openstack is happy again:
root@cloudcontrol2004-dev:~# openstack image list
+--------------------------------------+---------------------------------------------+--------+
| ID                                   | Name                                        | Status |
+--------------------------------------+---------------------------------------------+--------+
| d1b2ea32-10ca-40a5-a3fc-babc3956f049 | debian-10.0-buster                          | active |
| f7f9a861-c227-4ea5-927b-571f11538d86 | debian-10.2.0-raw-upstream                  | active |
| 21dd4e70-487d-4c2a-9813-5b6997fae03e | debian-10.3-buster-upstream                 | active |
| cb24cf99-b77b-432e-a861-b5ff5fef95a0 | debian-9.11-stretch                         | active |
| 94321599-0b42-4f6f-8a80-67a2ff561870 | debian-9.11-stretch (deprecated 2019-12-18) | active |
| 23d2421f-43ab-4307-8e99-aaaaabc67d02 | debian-9.8-stretch (deprecated 2019-12-18)  | active |
| 1ab9141c-c713-4265-bd00-fab3e59aab69 | debian-buster (deprecated 2019-12-18)       | active |
+--------------------------------------+---------------------------------------------+--------+

Mentioned in SAL (#wikimedia-cloud) [2020-07-03T12:51:57Z] <arturo> [codfw1dev] galera cluster should be up and running, openstack happy (T256283)

aborrero lowered the priority of this task from High to Low.

I'm actually closing the task. The practice itself has been completed successfully.