
Investigation: Stability for CopyPatrol
Open, Needs Triage | Public | 3 Estimated Story Points

Description

Last night there was a db failure that lost some CopyPatrol data, and the bot isn't currently running.

https://phabricator.wikimedia.org/T179227#3726731
https://lists.wikimedia.org/pipermail/cloud-announce/2017-November/000007.html

What can we do to help make CopyPatrol more stable? Should somebody from CommTech get familiar with Eranbot? Is it helpful/possible to split the CopyPatrol-related piece out of the larger Eranbot? Should we create a backup server for CopyPatrol data?

See also T182712

Event Timeline

As for backups, I think this may be a more general question (not specific to CopyPatrol), and @bd808 may be better placed to answer it: was the data in c1.labsdb backed up? (Do we have a backup of DB s51306__copyright_p from host enwiki.labsdb besides the one I made manually a few days ago?) I couldn't tell from the docs where backups are recommended to be stored (they only say where NOT to store them).
Anyway, it seems likely that serious HW failures aren't so common, and even in this case we were notified a few days beforehand that we should back up.

kaldari set the point value for this task to 3. Nov 1 2017, 11:16 PM

Functionally, nothing in Toolforge or Cloud VPS is backed up. Most of the user-created databases on tools.db.svc.eqiad.wmflabs (tools-db) are replicated to a secondary server (there are a few dbs that are too large or active to replicate). However, this is not a backup in any traditional sense. A rogue delete will be replicated and wipe out data on the secondary host as well. There are snapshots of some NFS file systems, but again these are not true backups. There is no consistent retention duration, and the recovery procedures are very cumbersome.

There is currently NO location inside of Cloud VPS/Toolforge that is recommended as a storage location for true backups. We do not have enough hardware to provide redundant, durable storage. There are some vague ideas about how we might someday provide a backup system for selective amounts of volunteer-generated content, but implementing those ideas is certainly more than a year off and more likely multiple years. If a tool generates data that cannot be recreated in the event of loss, it's currently up to the tool maintainer to figure out some relatively safe location for the storage of backups.