Implement persistence of spicerack and cumin logs to survive host reimage/refresh/failure
Closed, Resolved (Public)

Description

Problem statement

Currently the logs for Cumin and Spicerack on the cumin hosts don't go to logstash because they might contain sensitive information; for that to happen, T213902 needs to be implemented first. Even then, the retention on logstash is usually short.

I think it's important not to lose all the information that we gather when running cookbooks and cumin commands when we reimage one of those hosts, refresh it, or if it has a fatal hardware failure.
Currently at each reimage/refresh we've either lost the data or copied it over manually (when we remembered to do that), which is error-prone and easy to forget.
These logs can be useful in various ways: later auditing or debugging of what happened earlier, or even gathering some statistics (although some basic ones could be gathered from SAL).
How long to keep them is a good question that still needs an answer. I think we need more than 3 months and potentially a few years, but maybe that's not needed.

Stats about the data
  • Paths:
    • /var/log/spicerack
    • /var/log/cumin
  • Hosts: 3. The hosts with cluster::management and cluster::unprivmanagement roles.
  • Sizes: the sizes are very different between hosts because people tend to use the eqiad host more. Given the recent reimage of cumin1001 I can't give very precise estimates.
    • /var/log/spicerack: cumin1001 -> 1.4GB/year ; cumin2002 -> ~200MB/year
    • /var/log/cumin: <20MB/year/host with current settings (I'm thinking of logging always at debug level in a separate file, that might increase the sizes, but we're talking about small things anyway)
  • Retention: ideally a few years, but at least longer than our standard 3 months.
  • Peculiarities: because those are log files, they are append only in nature, but also they get rotated by the application (not logrotate) and they are just rotated, never deleted. The rotation can make a backup approach inefficient because what's now foo.log.1 tomorrow will be foo.log.2.
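To illustrate the rotation issue described above, here is a minimal sketch (not how bacula or the merged puppet change works) of why comparing rotated files by name makes everything look new after a rotation, while comparing by content hash does not. The helper names `content_index` and `files_to_upload` and the `foo.log` base name are hypothetical, chosen only for this example.

```python
import hashlib
from pathlib import Path


def content_index(log_dir: Path, base_name: str) -> dict[str, str]:
    """Map the SHA-256 digest of each rotated file to its current name.

    Only rotated files (base_name.1, base_name.2, ...) are indexed; the
    active file changes constantly and would always need a fresh copy.
    """
    index = {}
    for path in sorted(log_dir.glob(f"{base_name}.*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            index[digest] = path.name
    return index


def files_to_upload(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return the names of files whose content was not in the previous run."""
    return sorted(name for digest, name in current.items() if digest not in previous)
```

With this approach, a file that was foo.log.1 yesterday and is foo.log.2 today has the same digest in both runs, so only the newly rotated foo.log.1 is reported as needing a copy.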
Possible approaches
  1. Wait for T213902 and ask for a longer retention there. Is there a known ETA?
  2. Backup the data in bacula, either as a temporary or permanent solution
  3. ... add other possible approaches

Event Timeline

Volans triaged this task as Medium priority. Mar 23 2022, 10:39 AM
Volans created this task.

Right away, I would suggest setting up regular backups for the 2 suggested paths to fix the most immediate issue:

Currently at each reimage/refresh we've either lost the data or it was manually copied over (when we remembered to do that), but is error prone and can be forgotten.

they are just rotated, never deleted

This would signify to me that even with a 3-month retention, we would be able to keep years of those files, as long as we recover them within that timeline in case of a loss?

More sophisticated solutions could be considered, such as T213902, especially for active needs such as gathering statistics and looking at historical data, but this is the method that other systems use (e.g. secrets backups) until a proper archive system is set up.

they are just rotated, never deleted

This would signify to me that even with a 3-month retention, we would be able to keep years of those files, as long as we recover them within that timeline in case of a loss?

Yes, if we remember to restore them after a reimage/refresh within 3 months. It will surely help and will most likely allow us to keep them long-term.

+1 on doing that

It could also go to swift. Just a thought, feel free to ignore. Especially if your use case is different from a typical "backup and recovery" use case.

Change 792125 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] cluster::management: backup auditing logs

https://gerrit.wikimedia.org/r/792125

Change 792125 merged by Volans:

[operations/puppet@production] cluster::management: backup auditing logs

https://gerrit.wikimedia.org/r/792125

Volans claimed this task.

I've merged the change and @jcrespo has run it manually. The backup is working fine. Resolving.