
Use postgres instead of sqlite for backy2
Closed, Resolved · Public

Description

The latest version of backy2 (1.13.8) has a schema that is incompatible with the version currently installed on our backup hosts (1.13.2). Because we're using the sqlite backend, the schema upgrade will need to happen manually: we'd need a script to export the backup metadata, so we can wipe out the old DB and then re-import into the upgraded schema.
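For the manual route, a minimal sketch of the export step using Python's stdlib sqlite3 module. The DB path is an assumption; check where backy2's sqlite file actually lives on the backup hosts.

```python
# Sketch of the manual export: dump all tables from the sqlite metadata DB to a
# SQL script, so the data can be re-imported after the old DB is wiped and
# backy2 has recreated the upgraded schema. The paths below are hypothetical.
import sqlite3

def export_metadata(db_path: str, out_path: str) -> int:
    """Write a full SQL dump of db_path to out_path; return statement count."""
    con = sqlite3.connect(db_path)
    try:
        # iterdump() yields CREATE TABLE / INSERT statements for every table
        statements = list(con.iterdump())
    finally:
        con.close()
    with open(out_path, "w") as f:
        f.write("\n".join(statements) + "\n")
    return len(statements)

# export_metadata("/srv/backy2/backy.sqlite", "/srv/backy2/metadata_dump.sql")
```

The dump could then be replayed into a fresh DB (e.g. `sqlite3 new.db < metadata_dump.sql`) once the schema differences are reconciled, which is exactly the fiddly part that makes the postgres route more attractive.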

Automatic upgrades are supported with a postgres backend. Postgres will solve some other problems as well, so let's move to that.
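Backy2 stores its metadata via SQLAlchemy, so the backend switch should reduce to changing the engine URL in its config. A hypothetical sketch of the change (section and key names are from memory and should be verified against the installed backy2's docs; the credentials are placeholders):

```ini
# Hypothetical backy2 config fragment -- verify section/key names before use.
[MetaBackend]
# before:
# engine: sqlite:////srv/backy2/backy.sqlite
# after:
engine: postgresql://backy2:PASSWORD@localhost:5432/backy2
```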

Event Timeline

Restricted Application added a subscriber: Aklapper.
Andrew triaged this task as Medium priority. Mar 21 2023, 8:27 PM
Andrew renamed this task from "backy2 schema upgrades" to "Use postgres instead of sqlite for backy2". May 23 2023, 6:16 PM
Andrew updated the task description.

Change 922589 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[labs/private@master] add fake backy2 postgres passwords

https://gerrit.wikimedia.org/r/922589

Change 922590 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] backy2: install postgres on backy2 hosts

https://gerrit.wikimedia.org/r/922590

Change 922589 merged by Andrew Bogott:

[labs/private@master] add fake backy2 postgres passwords

https://gerrit.wikimedia.org/r/922589

Change 922590 abandoned by Andrew Bogott:

[operations/puppet@production] backy2: install postgres on backy2 hosts

Reason:

https://gerrit.wikimedia.org/r/922590

Change 922591 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] backy2: switch to using postgres backend

https://gerrit.wikimedia.org/r/922591

This is running on cloudbackup100[34].

The upgrade process is something like...

  • or, easier but scarier --
fnegri changed the task status from Open to In Progress.May 25 2023, 4:12 PM
fnegri moved this task from Backlog to In progress on the cloud-services-team (FY2022/2023-Q4) board.

Change 923410 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] backy2: switch from sqlite to postgres backend

https://gerrit.wikimedia.org/r/923410

Change 922591 merged by Andrew Bogott:

[operations/puppet@production] backy2: Prepare for switch to postgres backend

https://gerrit.wikimedia.org/r/922591

Change 923412 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] backy2: include python3-psycopg2

https://gerrit.wikimedia.org/r/923412

Change 923412 merged by Andrew Bogott:

[operations/puppet@production] backy2: include python3-psycopg2

https://gerrit.wikimedia.org/r/923412

The backup nodes now have postgres installed, with a 'backy2' user, a 'backy2' database, and a backy2-generated schema. To actually switch backy2 over from sqlite to postgres, merge the following patch:

https://gerrit.wikimedia.org/r/c/operations/puppet/+/923410

The password for the new user and database can be found on puppetmaster1001.eqiad.wmnet in /srv/private/hieradata/eqiad/profile/wmcs/backy2.yaml
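For reference, the puppet-managed setup amounts to roughly the following (a sketch, not the actual manifest; the password is a placeholder, and the schema itself is created by backy2 on first run rather than by hand):

```sql
-- Rough equivalent of what puppet provisions on the backup nodes.
CREATE ROLE backy2 LOGIN PASSWORD 'REDACTED';
CREATE DATABASE backy2 OWNER backy2;
```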


I am testing a data migration from sqlite to postgres using pgloader and the following config file:

load database
    from sqlite:///srv/backy2/backy.sqlite.backup_2023-05-25
    into postgresql://backy2:PASSWORD@localhost:5432/backy2
with include no drop, truncate, create no tables;

This means we are retaining the Postgres table schemas created by Backy2 itself, rather than letting pgloader decide the type casting for SQLite columns.
This results in a few warnings, which are probably safe to ignore:

root@cloudbackup1003:/srv/backy2# pgloader db.load
2023-05-30T15:39:02.022000Z LOG pgloader version "3.6.1"
2023-05-30T15:39:02.073000Z LOG Migrating from #<SQLITE-CONNECTION sqlite:///srv/backy2/backy.sqlite.backup_2023-05-25 {1006873A03}>
2023-05-30T15:39:02.074000Z LOG Migrating into #<PGSQL-CONNECTION pgsql://backy2@localhost:5432/backy2 {1006874FA3}>
2023-05-30T15:39:02.215000Z WARNING Source column "public"."stats"."date" is casted to type "timestamptz" which is not the same as "timestamp without time zone", the type of current target database column "public"."stats"."date".
2023-05-30T15:39:02.215000Z WARNING Source column "public"."stats"."id" is casted to type "bigint" which is not the same as "integer", the type of current target database column "public"."stats"."id".
2023-05-30T15:39:02.215000Z WARNING Source column "public"."stats"."version_uid" is casted to type "text" which is not the same as "character varying", the type of current target database column "public"."stats"."version_uid".
2023-05-30T15:39:02.215000Z WARNING Source column "public"."stats"."version_name" is casted to type "text" which is not the same as "character varying", the type of current target database column "public"."stats"."version_name".
2023-05-30T15:39:02.215000Z WARNING Source column "public"."versions"."uid" is casted to type "text" which is not the same as "character varying", the type of current target database column "public"."versions"."uid".
2023-05-30T15:39:02.215000Z WARNING Source column "public"."versions"."date" is casted to type "timestamptz" which is not the same as "timestamp without time zone", the type of current target database column "public"."versions"."date".
2023-05-30T15:39:02.215000Z WARNING Source column "public"."versions"."expire" is casted to type "timestamptz" which is not the same as "timestamp without time zone", the type of current target database column "public"."versions"."expire".
2023-05-30T15:39:02.215000Z WARNING Source column "public"."versions"."name" is casted to type "text" which is not the same as "character varying", the type of current target database column "public"."versions"."name".
2023-05-30T15:39:02.215000Z WARNING Source column "public"."versions"."snapshot_name" is casted to type "text" which is not the same as "character varying", the type of current target database column "public"."versions"."snapshot_name".
2023-05-30T15:39:02.215000Z WARNING Source column "public"."versions"."valid" is casted to type "bigint" which is not the same as "integer", the type of current target database column "public"."versions"."valid".
2023-05-30T15:39:02.216000Z WARNING Source column "public"."versions"."protected" is casted to type "bigint" which is not the same as "integer", the type of current target database column "public"."versions"."protected".
2023-05-30T15:39:02.216000Z WARNING Source column "public"."deleted_blocks"."uid" is casted to type "text" which is not the same as "character varying", the type of current target database column "public"."deleted_blocks"."uid".
2023-05-30T15:39:02.216000Z WARNING Source column "public"."deleted_blocks"."delete_candidate" is casted to type "bigint" which is not the same as "integer", the type of current target database column "public"."deleted_blocks"."delete_candidate".
2023-05-30T15:39:02.216000Z WARNING Source column "public"."tags"."version_uid" is casted to type "text" which is not the same as "character varying", the type of current target database column "public"."tags"."version_uid".
2023-05-30T15:39:02.216000Z WARNING Source column "public"."tags"."name" is casted to type "text" which is not the same as "character varying", the type of current target database column "public"."tags"."name".
2023-05-30T15:39:02.216000Z WARNING Source column "public"."blocks"."uid" is casted to type "text" which is not the same as "character varying", the type of current target database column "public"."blocks"."uid".
2023-05-30T15:39:02.216000Z WARNING Source column "public"."blocks"."version_uid" is casted to type "text" which is not the same as "character varying", the type of current target database column "public"."blocks"."version_uid".
2023-05-30T15:39:02.216000Z WARNING Source column "public"."blocks"."id" is casted to type "bigint" which is not the same as "integer", the type of current target database column "public"."blocks"."id".
2023-05-30T15:39:02.216000Z WARNING Source column "public"."blocks"."date" is casted to type "timestamptz" which is not the same as "timestamp without time zone", the type of current target database column "public"."blocks"."date".
2023-05-30T15:39:02.216000Z WARNING Source column "public"."blocks"."checksum" is casted to type "text" which is not the same as "character varying", the type of current target database column "public"."blocks"."checksum".
2023-05-30T15:39:02.216000Z WARNING Source column "public"."blocks"."valid" is casted to type "bigint" which is not the same as "integer", the type of current target database column "public"."blocks"."valid".
2023-05-30T15:39:02.217000Z WARNING Source column "public"."blocks"."enc_version" is casted to type "bigint" which is not the same as "integer", the type of current target database column "public"."blocks"."enc_version".
2023-05-30T15:39:02.217000Z WARNING Source column "public"."blocks"."enc_envkey" is casted to type "text" which is not the same as "character varying", the type of current target database column "public"."blocks"."enc_envkey".
2023-05-30T15:39:02.217000Z WARNING Source column "public"."blocks"."enc_nonce" is casted to type "text" which is not the same as "character varying", the type of current target database column "public"."blocks"."enc_nonce".
2023-05-30T15:39:02.217000Z WARNING Source column "public"."alembic_version"."version_num" is casted to type "text" which is not the same as "character varying", the type of current target database column "public"."alembic_version"."version_num".
2023-05-30T15:39:02.236000Z WARNING PostgreSQL warning: constraint "tags_version_uid_fkey" of relation "tags" does not exist, skipping
2023-05-30T15:39:02.236000Z WARNING PostgreSQL warning: constraint "blocks_version_uid_fkey" of relation "blocks" does not exist, skipping

The test import completed in 50 minutes:

2023-05-30T16:29:31.826000Z LOG report summary reset
             table name     errors       rows      bytes      total time
-----------------------  ---------  ---------  ---------  --------------
                  fetch          0          0                     0.000s
        fetch meta data          0          6                     0.025s
      Drop Foreign Keys          0          4                     0.021s
               Truncate          0          6                     0.022s
-----------------------  ---------  ---------  ---------  --------------
                  stats          0      62734    10.9 MB         14.484s
               versions          0       1482   273.5 kB          0.961s
         deleted_blocks          0     497538    29.4 MB         37.162s
                   tags          0       1482    82.5 kB          0.160s
        alembic_version          0          1     0.0 kB          0.015s
                 blocks          0   11137503     2.1 GB      49m53.819s
-----------------------  ---------  ---------  ---------  --------------
COPY Threads Completion          0          4                  50m8.471s
        Reset Sequences          0          2                    10.245s
    Create Foreign Keys          0          2                    10.769s
       Install Comments          0          0                     0.000s
-----------------------  ---------  ---------  ---------  --------------
      Total import time          ✓   11700740     2.1 GB      50m29.485s

The disk cache on the host is very efficient: re-running the same operation (which truncates the tables and re-imports the same data) took only 18 minutes:

Total import time          ✓   11700740     2.1 GB      18m42.709s

I think the best plan for the actual migration of Backy2 to use Postgres is:

Change 923410 merged by FNegri:

[operations/puppet@production] backy2: switch from sqlite to postgres backend

https://gerrit.wikimedia.org/r/923410

I didn't need to disable puppet or the timers. I simply stopped any running backup process on both hosts (cloudbackup1003 and 1004) and verified the timers would not restart them for a few hours.

root@cloudbackup1003:/srv/backy2# systemctl list-timers --all |grep backup
Thu 2023-06-01 00:05:00 UTC  6h left       Wed 2023-05-31 00:05:01 UTC  17h ago      purge_vm_backup.timer                            purge_vm_backup.service
Thu 2023-06-01 05:00:00 UTC  11h left      Wed 2023-05-31 05:00:01 UTC  12h ago      backup_vms.timer                                 backup_vms.service
Thu 2023-06-01 13:00:00 UTC  19h left      Wed 2023-05-31 13:00:01 UTC  4h 33min ago backup_glance_images.timer                       backup_glance_images.service

root@cloudbackup1004:/srv/backy2# systemctl list-timers --all |grep backup
Thu 2023-06-01 00:05:00 UTC  6h left       n/a                          n/a          purge_vm_backup.timer                            purge_vm_backup.service
Thu 2023-06-01 17:00:00 UTC  23h left      n/a                          n/a          backup_vms.timer                                 backup_vms.service

At that point I could successfully run pgloader on both hosts to copy data from SQLite to Postgres, then merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/923410 to update the backy2 config.

@Andrew we can try waiting for the next scheduled run of the timers, or if you like you can start the service manually and check that it works as expected. If there's any error, you can revert to the previous config, as the SQLite database is still there.

The backup_vms timer ran on cloudbackup1003 at 05:00:01 UTC:

root@cloudbackup1003:/srv/backy2# systemctl list-timers --all |grep backup_vms
Fri 2023-06-02 05:00:00 UTC  18h left      Thu 2023-06-01 05:00:01 UTC  5h 27min ago backup_vms.timer                                 backup_vms.service

The sqlite database has not been touched, so it looks like backy2 is correctly using the new Postgres db. In the Postgres db, I also see the stats table contains many rows with today's date:

backy2=> select count(*) from stats where date > '2023-06-01';
 count
-------
   209

The first one coincides with the timer's scheduled time (05:00 UTC):

backy2=> select * from stats where date > '2023-06-01' order by date asc limit 1;
-[ RECORD 1 ]-------+------------------------------------------
date                | 2023-06-01 05:02:49.156803

Mentioned in SAL (#wikimedia-cloud) [2023-06-01T13:09:20Z] <dhinus> shutting down cloud-cumin-04 and restoring from backup, to test backups are working (T332734)

Mentioned in SAL (#wikimedia-cloud) [2023-06-01T13:27:24Z] <dhinus> cloud-cumin-04 successfully restored from backup (T332734)

I successfully restored a backup from last night, following the instructions at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Instance_backups#Restoring

Before resolving this ticket, I will wait for cloudbackup1004 to complete the next backup_vms run, which is scheduled for later today.

Backups are working correctly on both cloudbackup1003 and cloudbackup1004.