The analytics1003 host runs a MariaDB database and various Hadoop daemons, such as Hive (server/metastore), Camus (executed periodically via cron), Oozie, etc. The database is also used by other hosts/clusters, such as:
- Druid analytics (druid100[1-3])
- Druid public (druid100[4-6])
- Thorium (Hue)
We currently back up the database via an LVM snapshot copied to analytics1002, without stopping MariaDB first, so there is a chance that a snapshot used in an emergency restore leads to a corrupted database.
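One possible way to get a consistent copy without stopping MariaDB is a logical dump taken inside a single transaction. The sketch below is only an illustration and not what is currently deployed: the function names and the output path are hypothetical, and `--single-transaction` only gives a consistent view for InnoDB tables.

```python
import subprocess
from pathlib import Path
from typing import List


def build_dump_command(out_file: Path) -> List[str]:
    """Build a mysqldump invocation for a consistent dump of all databases.

    Hypothetical helper: the output path is chosen by the caller.
    """
    return [
        "mysqldump",
        "--single-transaction",  # consistent snapshot for InnoDB tables, server keeps running
        "--all-databases",
        f"--result-file={out_file}",
    ]


def run_dump(out_file: Path) -> None:
    """Run the dump; raises CalledProcessError if mysqldump exits non-zero."""
    subprocess.run(build_dump_command(out_file), check=True)
```

With a ~13G database, a dump like this should be feasible to run periodically, but that would need to be verified against the actual load on the host.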
The database is ~13G in /var/lib/mysql and sees relatively low volume/traffic. The main issues, though, are:
- If analytics1003 goes down temporarily, Druid and Hue might also be momentarily impacted.
- If analytics1003 goes down permanently (hardware failure), all the Hadoop-related scheduled and recurrent jobs will stop too.
Ideally we should:
- Have a backup host somewhere, maybe in Ganeti.
- Have automatic MySQL failover in case the database on analytics1003 is not reachable.
- Use the new host to back up the database, maybe via LVM and periodically via Bacula.
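The failover idea above can be sketched as a simple reachability check that picks the first database host accepting connections. This is a minimal sketch under assumptions: the helper names are hypothetical, the backup host name in the usage note is a placeholder, and a real failover would also need replication between the two hosts, which is not shown here.

```python
import socket
from typing import Callable, Optional, Sequence

MYSQL_PORT = 3306  # default MySQL/MariaDB port


def is_reachable(host: str, port: int = MYSQL_PORT, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def pick_database_host(
    candidates: Sequence[str],
    probe: Callable[[str], bool] = is_reachable,
) -> Optional[str]:
    """Return the first candidate that the probe reports as reachable, else None."""
    for host in candidates:
        if probe(host):
            return host
    return None
```

For example, `pick_database_host(["analytics1003", "backup-db-host"])` would prefer analytics1003 and fall back to the placeholder `backup-db-host` only when the primary does not accept connections. A TCP check only shows that the port is open, not that the database is healthy.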
In case of complete failure of analytics1003, we could temporarily apply the analytics_cluster coordinator's role to the new host and keep the Hadoop jobs going until the original host is fixed.