
cloudservices2005-dev backups are clogging all backups
Closed, Resolved (Public)

Description

Backups of cloudservices2005-dev are apparently failing. However, they do not fail fast: they have been "producing a backup" for 3 days already, which is clogging the general backup system.

The most likely cause is that the pre-script is failing and getting stuck. Because concurrency is limited to avoid overwhelming the storage system with too many writes, a stuck job holds its slot and causes a huge amount of queuing on general backups. Alternatively, the host is in a bad state network-wise, causing very long timeouts.
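To illustrate the "fail fast" behaviour that is missing here, below is a minimal sketch of running a pre-backup command with a hard time limit, so that a hung script is killed and reported as a failure instead of occupying a backup slot indefinitely. This is not how Bacula or the actual pre-script on this host is implemented; the function name `run_pre_script` and the timeout values are hypothetical and only serve the example.

```python
import subprocess
import sys


def run_pre_script(cmd, timeout_s):
    """Run a pre-backup command (hypothetical helper, not part of Bacula).

    Returns True only if the command exits with status 0 within
    timeout_s seconds; a hang or a non-zero exit both count as failure.
    """
    try:
        result = subprocess.run(cmd, timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child process before raising,
        # so the stuck command does not keep running in the background.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Stand-in for a pre-script that hangs: sleeps far longer than the limit.
    stuck = [sys.executable, "-c", "import time; time.sleep(30)"]
    if not run_pre_script(stuck, timeout_s=1):
        print("pre-script failed or timed out; aborting this backup job")
```

With a limit like this, a host in a locked-up or unreachable state would fail its own job after a bounded wait and leave the concurrency slots free for other hosts.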

I am going to kill the non-finishing backup process and then take the 2 configured backups out of rotation to prevent more Bacula issues and delays for other hosts, but the host should be checked (or the pre-script reviewed).

Event Timeline

I have cancelled the never-finishing jobs from the host, and other backups now flow correctly. I have disabled puppet on backup1001 and disabled the cloudservices2005-dev backups until we have a more stable status, but at least other backups (not cloudservices2005-dev's) can now happen at a regular pace.

taavi subscribed.

Thanks for the poke. This host seems to have accidentally gotten rebooted back to the broken kernel from T393366 (Regression in RAID10 software RAID with 6.1.135) and was in a locked-up state. I've rebooted it and upgraded it to a working kernel, which should resolve those backup issues.

Should I retry a backup now, to check this ticket is resolved?

Doing. Thanks for the prompt response! I will report if backups now finish correctly.

The backups have now completed successfully, thank you. As far as I am concerned, this is resolved.

Terminated Jobs:
 JobId  Level      Files    Bytes   Status   Finished        Name 
====================================================================
627142  Full          14    30.08 K  OK       21-May-25 07:58 cloudservices2005-dev.codfw.wmnet-Monthly-1st-Sun-productionEqiad-mysql-srv-backups-dumps-latest
627143  Full           3    8.391 M  OK       21-May-25 07:58 cloudservices2005-dev.codfw.wmnet-Monthly-1st-Sun-productionEqiad-openldap
[10:01:54] <icinga-wm> RECOVERY - Backup freshness on backup1001 is OK: Fresh: 145 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring

👏👏👏