failure
Closed, Resolved · Public
Assigned To

Authored By

	Volans
	Mar 23 2022, 10:39 AM

Description

Problem statement

Currently the logs for Cumin and Spicerack on the cumin hosts don't go to logstash because they might contain sensitive information; for that to happen, T213902 needs to be implemented first. Even then, the retention on logstash is usually short.

I think it's important not to lose all the information that we gather when running cookbooks and cumin commands when we reimage one of those hosts, refresh it, or it has a fatal hardware failure.
Currently, at each reimage/refresh we've either lost the data or copied it over manually (when we remembered to do that), which is error prone and easy to forget.
The logs can be useful in various ways: later audits, debugging what happened earlier, or even gathering some statistics (although some basic ones could be gathered from SAL).
How long to keep them is a good question that still needs an answer. I think we need more than 3 months and potentially a few years, but maybe that's not needed.

Stats about the data

Paths:
- /var/log/spicerack
- /var/log/cumin
Hosts: 3. The hosts with cluster::management and cluster::unprivmanagement roles.
Sizes: the sizes are very different between hosts because people tend to use the eqiad host more. Given the recent reimage of cumin1001 I can't give very precise estimates.
- /var/log/spicerack: cumin1001 -> 1.4GB/year ; cumin2002 -> ~200MB/year
- /var/log/cumin: <20MB/year/host with current settings (I'm thinking of always logging at debug level to a separate file, which might increase the sizes, but we're talking about small amounts anyway)
Retention: ideally a few years, but at least longer than our standard 3 months.
Peculiarities: because those are log files, they are append-only in nature, but they also get rotated by the application (not logrotate), and they are only rotated, never deleted. The rotation can make a backup approach inefficient because what is foo.log.1 today will be foo.log.2 tomorrow.
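
To illustrate the rotation concern, here is a minimal Python sketch. It assumes a simplified rename-based rotation (not the actual handler used by Cumin or Spicerack) and a hypothetical backup comparison that identifies files by name. The names rotate, snapshot, changed_by_name, new_content and demo are made up for this example, and it says nothing about how Bacula itself decides what to copy.

```python
import hashlib
import os
import tempfile


def rotate(directory, base="foo.log"):
    """Simplified rename-based rotation: foo.log -> foo.log.1 -> foo.log.2, nothing deleted."""
    indexes = sorted(
        (int(name.rsplit(".", 1)[1]) for name in os.listdir(directory)
         if name.startswith(base + ".") and name.rsplit(".", 1)[1].isdigit()),
        reverse=True,  # shift the highest suffix first so nothing is overwritten
    )
    for i in indexes:
        os.rename(os.path.join(directory, f"{base}.{i}"),
                  os.path.join(directory, f"{base}.{i + 1}"))
    os.rename(os.path.join(directory, base), os.path.join(directory, f"{base}.1"))
    open(os.path.join(directory, base), "w").close()  # fresh, empty active log


def snapshot(directory):
    """Map each file name to the SHA-256 digest of its content."""
    result = {}
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name), "rb") as handle:
            result[name] = hashlib.sha256(handle.read()).hexdigest()
    return result


def changed_by_name(before, after):
    """Files a name-based comparison would consider new or modified."""
    return sorted(name for name, digest in after.items() if before.get(name) != digest)


def new_content(before, after):
    """Files whose content was not present anywhere in the previous snapshot."""
    known = set(before.values())
    return sorted(name for name, digest in after.items() if digest not in known)


def demo():
    with tempfile.TemporaryDirectory() as directory:
        with open(os.path.join(directory, "foo.log"), "w") as handle:
            handle.write("day1\n")
        with open(os.path.join(directory, "foo.log.1"), "w") as handle:
            handle.write("day0\n")
        before = snapshot(directory)
        rotate(directory)
        with open(os.path.join(directory, "foo.log"), "w") as handle:
            handle.write("day2\n")
        after = snapshot(directory)
        return changed_by_name(before, after), new_content(before, after)


if __name__ == "__main__":
    by_name, by_content = demo()
    print("changed by name:", by_name)
    print("actually new content:", by_content)
```

In this toy run, a comparison by name flags all three files after a single rotation, while only the active foo.log holds content that was not already in the previous snapshot.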

Possible approaches

- Wait for T213902 and ask for a longer retention there. Is there a known ETA?
- Back up the data in bacula, either as a temporary or a permanent solution
- ... add other possible approaches

Details

	Subject	Repo	Branch	Lines +/-
	cluster::management: backup auditing logs	operations/puppet	production	+8 -0

Related Objects

Mentioned Here: T213902: Implement sensitive logstash access control

Event Timeline

Volans triaged this task as Medium priority. Mar 23 2022, 10:39 AM

Volans created this task.

Restricted Application added a subscriber: Aklapper. Mar 23 2022, 10:39 AM

Relevant link: https://meta.wikimedia.org/wiki/Data_retention_guidelines

Right away, I would suggest setting up regular backups for the 2 suggested paths to fix the most immediate issue:

Currently at each reimage/refresh we've either lost the data or it was manually copied over (when we remembered to do that), but is error prone and can be forgotten.

they are just rotated, never deleted

This would signify to me that even with a 3-month retention, we would be able to keep years of those files, as long as we recover them within that timeline in case of a loss?

More sophisticated solutions could be considered, such as T213902, especially for active needs such as gathering statistics and looking at historical data, but this is the method that other systems use (e.g. secrets backups) until a proper archive system is set up.

In T304497#7799359, @jcrespo wrote:

they are just rotated, never deleted

This would signify to me that even with a 3-month retention, we would be able to keep years of those files, as long as we recover them within that timeline in case of a loss?

Yes, if we remember to restore them within 3 months after a reimage/refresh. It will surely help and will most likely allow us to keep them long-term.
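
The reasoning above can be sketched in a few lines of Python. This is only an illustration: it assumes that "3 months" is approximated as 90 days and that, because rotated files are never deleted, the last backup taken before the host is lost contains the full log history. The function name history_recoverable and the dates are hypothetical.

```python
from datetime import date, timedelta


def history_recoverable(last_backup: date, restore_on: date, retention_days: int = 90) -> bool:
    """Return True if the last backup is still within its retention window on restore_on."""
    expires = last_backup + timedelta(days=retention_days)
    return last_backup <= restore_on <= expires


if __name__ == "__main__":
    last_backup = date(2022, 3, 22)  # hypothetical: day before a reimage
    print(history_recoverable(last_backup, date(2022, 4, 30)))  # restore about 5 weeks later
    print(history_recoverable(last_backup, date(2022, 8, 1)))   # restore after the window
```

As long as the restore happens inside the window, the restored files become part of the next backups, so the full history carries forward past each reimage/refresh.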
