
Change retention values for Graphite metrics
Closed, Resolved (Public)

Description

We should store synthetic testing metrics longer than we do today. We store the CrUX metrics we collect for 2 years, but we keep our own metrics for only 33 days. That should be at least one quarter (so we can compare the beginning of the quarter with the end) or, even better, one full year.

We also have a minimum retention of 10 minutes, but we do not run tests that often. I think we should change that to 1 hour, and then we can set up specific rules if we need to run tests more often.

Event Timeline

I've cleaned up all old data and the disk is 8% full at the moment. I'll add some new rules and do the change tool by tool to see what the total size will be.

Change 732281 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/synthetic-monitoring-tests@master] Increase retention time to 1 hour.

https://gerrit.wikimedia.org/r/732281

I've made a test run on the server by copying one of the nodes of data into another structure and then running:

find . -type f -name "*.wsp" -exec whisper-resize.py --aggregate --nobackup {} 1h:1y \;

Then I moved it back to a temp folder and looked at the data in Grafana. The data looks good. One caveat is that old annotations that link to a result will not match the exact Graphite data point, but I think that is OK as long as the new ones are perfect.

It takes some time to run, so I'm going to do it in many small steps: stop whisper, change the data, change the config, turn everything on again, verify, and then repeat.
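The per-step conversion could be wrapped roughly like this. It is only a sketch: the `run` and `resize_tree` helpers, the `DRY_RUN` switch, the `carbon-cache` service name, and the example path are assumptions for illustration, not what was actually deployed. The `whisper-resize.py` flags are the same ones used in the test run above.

```shell
#!/usr/bin/env bash
# Sketch: resize every Whisper file under one directory tree.
# DRY_RUN defaults to 1, which only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# resize_tree DIR RETENTION  (for example: resize_tree ./desktop 1h:1y)
resize_tree() {
  local dir=$1 retention=$2
  # -print0 / read -d '' keeps file names with spaces intact.
  find "$dir" -type f -name '*.wsp' -print0 |
    while IFS= read -r -d '' file; do
      run whisper-resize.py --aggregate --nobackup "$file" "$retention"
    done
}

# Hypothetical usage for one step (service name and path are assumptions):
#   run systemctl stop carbon-cache
#   resize_tree /path/to/whisper/desktop 1h:1y
#   run systemctl start carbon-cache
```

Running it in dry-run mode first shows exactly which files a step would touch before any data is rewritten. Dropping `--nobackup` keeps a `.bak` copy of each original file, at the cost of extra disk space until the copies are cleaned up.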

Two more things: we only keep the result HTML data for a month, but maybe that is OK (at least for now). Also, when I tested I configured it to store data for 1 year, but it should probably be 13 months (or 14) so you can go back and compare January with January, etc.
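To get a feel for what the longer retention costs on disk, the size of one Whisper file can be estimated from its layout: a 16-byte header, 12 bytes of archive info per archive, and 12 bytes per data point. The sketch below assumes a single archive per file and fully preallocated files. The `whisper_file_size` helper is hypothetical, and 396 days is just a stand-in for "13 months".

```shell
#!/usr/bin/env bash
# Estimate the on-disk size (in bytes) of a single-archive Whisper file.
# Usage: whisper_file_size SECONDS_PER_POINT RETENTION_SECONDS
whisper_file_size() {
  local seconds_per_point=$1 retention_seconds=$2
  local points=$(( retention_seconds / seconds_per_point ))
  # 16-byte metadata header + 12-byte archive info + 12 bytes per point
  echo $(( 16 + 12 + points * 12 ))
}

whisper_file_size 600  $(( 33 * 86400 ))   # 10 minutes for 33 days -> 57052
whisper_file_size 3600 $(( 365 * 86400 ))  # 1 hour for 1 year      -> 105148
whisper_file_size 3600 $(( 396 * 86400 ))  # 1 hour for ~13 months  -> 114076
```

Under these assumptions, going from one year to roughly 13 months adds less than 10% per metric, so the extra months are cheap compared with the jump from 33 days to a year. Multiplying the per-file size by the number of `.wsp` files gives a rough total.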

New day, new plan. Let's start with all the desktop tests, deploy that change, and see how long it takes.

Change 732281 abandoned by Phedenskog:

[performance/synthetic-monitoring-tests@master] Increase retention time to 1 hour.

Reason:

Let's do this step by step instead.

https://gerrit.wikimedia.org/r/732281

Change 732907 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/synthetic-monitoring-tests@master] Run tests once an hour.

https://gerrit.wikimedia.org/r/732907

Change 732907 merged by jenkins-bot:

[performance/synthetic-monitoring-tests@master] Run tests once an hour.

https://gerrit.wikimedia.org/r/732907

Change 732929 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/synthetic-monitoring-tests@master] Set retention to 1 hour for all tests.

https://gerrit.wikimedia.org/r/732929

Change 732929 merged by jenkins-bot:

[performance/synthetic-monitoring-tests@master] Set retention to 1 hour for all tests.

https://gerrit.wikimedia.org/r/732929

This is done, but I need to update the documentation and clean up the backup files.

Documentation updated and backup files removed.