Page MenuHomePhabricator

cloudvirt1021 - CRITICAL: Status of the systemd unit backup_vms
Closed, ResolvedPublic

Description

from an alert:
https://alerts.wikimedia.org/?q=team%3Dwmcs

Check unit status of backup_vms
summary: CRITICAL: Status of the systemd unit backup_vms
2 hours ago
instance: cloudvirt1021
source: icinga
team: wmcs
@receiver: irc-spam

Event Timeline

dcaro triaged this task as High priority.Mar 22 2022, 10:18 AM
dcaro created this task.

It seems there's some mess with the backy parameter order or the database:

Mar 20 20:00:05 cloudvirt1021 systemd[1]: Started Backup vms assigned to this host.
Mar 20 20:02:02 cloudvirt1021 wmcs-backup[3528010]:    ERROR: [backy2.logging] Couldn't parse datetime string: 'b0094916-a7bf-11ec-95b0-dd68c0a50dd7'
Mar 20 20:02:02 cloudvirt1021 wmcs-backup[3528010]:    ERROR: [backy2.logging] Backy failed.
Mar 20 20:02:02 cloudvirt1021 wmcs-backup[3527381]: Traceback (most recent call last):
Mar 20 20:02:02 cloudvirt1021 wmcs-backup[3527381]:   File "/usr/local/sbin/wmcs-backup", line 2037, in <module>
Mar 20 20:02:02 cloudvirt1021 wmcs-backup[3527381]:     args.func()
Mar 20 20:02:02 cloudvirt1021 wmcs-backup[3527381]:   File "/usr/local/sbin/wmcs-backup", line 1870, in <lambda>
Mar 20 20:02:02 cloudvirt1021 wmcs-backup[3527381]:     func=lambda: get_current_instances_state(
Mar 20 20:02:02 cloudvirt1021 wmcs-backup[3527381]:   File "/usr/local/sbin/wmcs-backup", line 1600, in get_current_instances_state
Mar 20 20:02:02 cloudvirt1021 wmcs-backup[3527381]:     backup_entries = get_backups()
Mar 20 20:02:02 cloudvirt1021 wmcs-backup[3527381]:   File "/usr/lib/python3/dist-packages/rbd2backy2.py", line 600, in get_backups
Mar 20 20:02:02 cloudvirt1021 wmcs-backup[3527381]:     backup = subprocess.check_output([BACKY, "-ms", "ls"])
Mar 20 20:02:02 cloudvirt1021 wmcs-backup[3527381]:   File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
Mar 20 20:02:02 cloudvirt1021 wmcs-backup[3527381]:     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
Mar 20 20:02:02 cloudvirt1021 wmcs-backup[3527381]:   File "/usr/lib/python3.9/subprocess.py", line 528, in run
Mar 20 20:02:02 cloudvirt1021 wmcs-backup[3527381]:     raise CalledProcessError(retcode, process.args,
Mar 20 20:02:02 cloudvirt1021 wmcs-backup[3527381]: subprocess.CalledProcessError: Command '['/usr/bin/backy2', '-ms', 'ls']' returned non-zero exit status 100.

It's related to the database for the backups, backy2 by itself already fails when trying to do an ls:

root@cloudvirt1021:~# backy2 -d -v -ms ls
   ERROR: [backy2.logging] Unexpected exception
   ERROR: [backy2.logging] Couldn't parse datetime string: 'b0094916-a7bf-11ec-95b0-dd68c0a50dd7'
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/backy2/scripts/backy.py", line 769, in main
    func(**func_args)
  File "/usr/lib/python3/dist-packages/backy2/scripts/backy.py", line 139, in ls
    versions = backy.ls()
  File "/usr/lib/python3/dist-packages/backy2/backy.py", line 74, in ls
    versions = self.meta_backend.get_versions()
  File "/usr/lib/python3/dist-packages/backy2/meta_backends/sql.py", line 397, in get_versions
    return self.session.query(Version).order_by(Version.name, Version.date).all()
  File "/usr/lib/python3/dist-packages/sqlalchemy/orm/query.py", line 3373, in all
    return list(self)
  File "/usr/lib/python3/dist-packages/sqlalchemy/orm/loading.py", line 100, in instances
    cursor.close()
  File "/usr/lib/python3/dist-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
  File "/usr/lib/python3/dist-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
  File "/usr/lib/python3/dist-packages/sqlalchemy/orm/loading.py", line 80, in instances
    rows = [proc(row) for row in fetch]
  File "/usr/lib/python3/dist-packages/sqlalchemy/orm/loading.py", line 80, in <listcomp>
    rows = [proc(row) for row in fetch]
  File "/usr/lib/python3/dist-packages/sqlalchemy/orm/loading.py", line 579, in _instance
    _populate_full(
  File "/usr/lib/python3/dist-packages/sqlalchemy/orm/loading.py", line 725, in _populate_full
    dict_[key] = getter(row)
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/result.py", line 107, in __getitem__
    return processor(self._row[index])
  File "/usr/lib/python3/dist-packages/sqlalchemy/processors.py", line 44, in process
    raise ValueError(
ValueError: Couldn't parse datetime string: 'b0094916-a7bf-11ec-95b0-dd68c0a50dd7'
   ERROR: [backy2.logging] Backy failed.

I think that with the upgrade, we might have to do some sqlalchemy migrations that we have not done...
looking

#

So in order to migrate from version to version of backy if you use sqlite, you have to before upgrading export the backups, remove the database, and then after upgrading import them:

Tried opening the same backy db with an older version (buster):

dcaro@cloudbackup2002:~$ backy2 --version
2.13.2

But still got error:

dcaro@cloudbackup2002:~$ backy2 -c backy.cfg ls
    INFO: [backy2.logging] $ /usr/bin/backy2 -c backy.cfg ls
   ERROR: [backy2.logging] Couldn't parse datetime string: 'b0094916-a7bf-11ec-95b0-dd68c0a50dd7'
   ERROR: [backy2.logging] Backy failed.

So not sure anymore what version we had.

This is interesting though:

sqlite> select * from versions where date='b0094916-a7bf-11ec-95b0-dd68c0a50dd7';
|b0094916-a7bf-11ec-95b0-dd68c0a50dd7|4|2022-03-19 20:03:47.708378||4194304|1|0|
...

So it seems that the versions table has some corrupted data somehow.

Yep, just the DB was corrupted, to check:

root@cloudvirt1021:~# sqlite3 /srv/backy/backy.sqlite
SQLite version 3.34.1 2021-01-20 14:10:07
Enter ".help" for usage hints.
sqlite> pragma integrity_check;
*** in database main ***
On tree page 71625 cell 219: Rowid 861598 out of order
On tree page 4 cell 0: 2nd reference to page 71847
row 304958 missing from index ix_blocks_checksum
...

To fix did the following:

root@cloudvirt1021:~# sqlite3 /srv/backy/backy.sqlite
sqlite> .output backup.db
sqlite> .dump
sqlite> .quit

# the backy.fixed.sqlite file did not exist yet, sqlite3 creates it
root@cloudvirt1021:~# sqlite3 backy.fixed.sqlite
SQLite version 3.34.1 2021-01-20 14:10:07
Enter ".help" for usage hints.
sqlite> .read backup.db
sqlite> .quit


# restore the DB
root@cloudvirt1021:~# cp backy.fixed.sqlite /srv/backy2/backy.sqlite

That did it, will add to the runbook

Change 772839 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] wmcs.backy2: add link to the runbook for backup_vms

https://gerrit.wikimedia.org/r/772839

dcaro changed the task status from Open to In Progress.Mar 22 2022, 2:41 PM
dcaro moved this task from To refine to Doing on the User-dcaro board.
dcaro removed dcaro as the assignee of this task.Mar 22 2022, 4:58 PM
dcaro claimed this task.

Change 772839 merged by David Caro:

[operations/puppet@production] wmcs.backy2: add link to the runbook for backup_vms

https://gerrit.wikimedia.org/r/772839

Change 775961 had a related patch set uploaded (by Nskaggs; author: Nskaggs):

[operations/puppet@production] wmcs.backy2: fix typo in link to runbook for backup_vms

https://gerrit.wikimedia.org/r/775961

Change 775961 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] wmcs.backy2: fix typo in link to runbook for backup_vms

https://gerrit.wikimedia.org/r/775961