Page MenuHomePhabricator

Increase Xenon logs and flame graphs to 14 days
Closed, ResolvedPublic

Description

https://performance.wikimedia.org/xenon/svgs/hourly/
https://performance.wikimedia.org/xenon/logs/hourly/

Log files from the last 24 hours seem in order. For some hours, some of the entry points are missing, but at least api, load, index and the combined all always exist.

Last 24 hours
2017-05-30_21.thumb_handler.log
2017-05-30_21.RunJobs.log
2017-05-30_21.load.log
2017-05-30_21.index.log
2017-05-30_21.api.log
2017-05-30_21.all.log

2017-05-30_20.touch.log
2017-05-30_20.thumb_handler.log
2017-05-30_20.RunJobs.log
2017-05-30_20.robots.log
2017-05-30_20.load.log
2017-05-30_20.index.log
2017-05-30_20.favicon.log
2017-05-30_20.api.log
2017-05-30_20.all.log

2017-05-30_19.thumb_handler.log
2017-05-30_19.static.log
2017-05-30_19.RunJobs.log
2017-05-30_19.load.log
2017-05-30_19.index.log
2017-05-30_19.favicon.log
2017-05-30_19.api.log
2017-05-30_19.all.log

2017-05-30_18.thumb_handler.log
2017-05-30_18.thumb.log
2017-05-30_18.static.log
2017-05-30_18.RunJobs.log
2017-05-30_18.load.log
2017-05-30_18.index.log
2017-05-30_18.favicon.log
2017-05-30_18.api.log
2017-05-30_18.all.log

2017-05-30_17.touch.log
2017-05-30_17.thumb_handler.log
2017-05-30_17.RunJobs.log
2017-05-30_17.load.log
2017-05-30_17.index.log
2017-05-30_17.favicon.log
2017-05-30_17.api.log
2017-05-30_17.all.log

2017-05-30_16.touch.log
2017-05-30_16.thumb_handler.log
2017-05-30_16.RunJobs.log
2017-05-30_16.robots.log
2017-05-30_16.load.log
2017-05-30_16.index.log
2017-05-30_16.api.log
2017-05-30_16.all.log

2017-05-30_15.thumb_handler.log
2017-05-30_15.RunJobs.log
2017-05-30_15.load.log
2017-05-30_15.index.log
2017-05-30_15.favicon.log
2017-05-30_15.api.log
2017-05-30_15.all.log

2017-05-30_14.thumb_handler.log
2017-05-30_14.RunJobs.log
2017-05-30_14.load.log
2017-05-30_14.index.log
2017-05-30_14.favicon.log
2017-05-30_14.api.log
2017-05-30_14.all.log

2017-05-30_13.thumb_handler.log
2017-05-30_13.RunJobs.log
2017-05-30_13.load.log
2017-05-30_13.index.log
2017-05-30_13.favicon.log
2017-05-30_13.api.log
2017-05-30_13.all.log

2017-05-30_12.thumb_handler.log
2017-05-30_12.RunJobs.log
2017-05-30_12.load.log
2017-05-30_12.index.log
2017-05-30_12.api.log
2017-05-30_12.all.log

2017-05-30_11.touch.log
2017-05-30_11.thumb_handler.log
2017-05-30_11.RunJobs.log
2017-05-30_11.load.log
2017-05-30_11.index.log
2017-05-30_11.favicon.log
2017-05-30_11.api.log
2017-05-30_11.all.log

2017-05-30_10.thumb_handler.log
2017-05-30_10.thumb.log
2017-05-30_10.RunJobs.log
2017-05-30_10.load.log
2017-05-30_10.index.log
2017-05-30_10.api.log
2017-05-30_10.all.log

2017-05-30_09.thumb_handler.log
2017-05-30_09.RunJobs.log
2017-05-30_09.robots.log
2017-05-30_09.load.log
2017-05-30_09.index.log
2017-05-30_09.api.log
2017-05-30_09.all.log

2017-05-30_08.thumb_handler.log
2017-05-30_08.RunJobs.log
2017-05-30_08.robots.log
2017-05-30_08.load.log
2017-05-30_08.index.log
2017-05-30_08.favicon.log
2017-05-30_08.api.log
2017-05-30_08.all.log

2017-05-30_07.thumb_handler.log
2017-05-30_07.RunJobs.log
2017-05-30_07.opensearch_desc.log
2017-05-30_07.load.log
2017-05-30_07.index.log
2017-05-30_07.api.log
2017-05-30_07.all.log

2017-05-30_06.thumb_handler.log
2017-05-30_06.RunJobs.log
2017-05-30_06.load.log
2017-05-30_06.index.log
2017-05-30_06.favicon.log
2017-05-30_06.api.log
2017-05-30_06.all.log

2017-05-30_05.thumb_handler.log
2017-05-30_05.RunJobs.log
2017-05-30_05.load.log
2017-05-30_05.index.log
2017-05-30_05.favicon.log
2017-05-30_05.api.log
2017-05-30_05.all.log

2017-05-30_04.thumb_handler.log
2017-05-30_04.RunJobs.log
2017-05-30_04.load.log
2017-05-30_04.index.log
2017-05-30_04.favicon.log
2017-05-30_04.api.log
2017-05-30_04.all.log

2017-05-30_03.thumb_handler.log
2017-05-30_03.RunJobs.log
2017-05-30_03.load.log
2017-05-30_03.index.log
2017-05-30_03.favicon.log
2017-05-30_03.api.log
2017-05-30_03.all.log

2017-05-30_02.thumb_handler.log
2017-05-30_02.RunJobs.log
2017-05-30_02.load.log
2017-05-30_02.index.log
2017-05-30_02.favicon.log
2017-05-30_02.api.log
2017-05-30_02.all.log

2017-05-30_01.thumb_handler.log
2017-05-30_01.thumb.log
2017-05-30_01.RunJobs.log
2017-05-30_01.load.log
2017-05-30_01.index.log
2017-05-30_01.favicon.log
2017-05-30_01.api.log
2017-05-30_01.all.log

2017-05-30_00.thumb_handler.log
2017-05-30_00.thumb.log
2017-05-30_00.RunJobs.log
2017-05-30_00.load.log
2017-05-30_00.index.log
2017-05-30_00.favicon.log
2017-05-30_00.api.log
2017-05-30_00.all.log

2017-05-29_23.thumb_handler.log
2017-05-29_23.RunJobs.log
2017-05-29_23.load.log
2017-05-29_23.index.log
2017-05-29_23.favicon.log
2017-05-29_23.api.log
2017-05-29_23.all.log

2017-05-29_22.thumb_handler.log
2017-05-29_22.RunJobs.log
2017-05-29_22.load.log
2017-05-29_22.index.log
2017-05-29_22.favicon.log
2017-05-29_22.api.log
2017-05-29_22.all.log

2017-05-29_21.thumb_handler.log
2017-05-29_21.RunJobs.log
2017-05-29_21.load.log
2017-05-29_21.index.log
2017-05-29_21.api.log
2017-05-29_21.all.log

Going back further, files become increasingly more scarce. Given that the retention period for these files is supposed to be very simple ("only keep the last 24 latest ones" – based on alphanumerical sorting), it's surprising that so many still exist.

Older
2017-05-29_20.favicon.log
2017-05-29_19.favicon.log
2017-05-29_18.thumb.log
2017-05-29_18.favicon.log
2017-05-29_17.static.log
2017-05-29_16.favicon.log
2017-05-29_15.touch.log
2017-05-29_15.thumb.log
2017-05-29_15.favicon.log
2017-05-29_14.favicon.log
2017-05-29_13.thumb.log
2017-05-29_13.static.log
2017-05-29_12.thumb.log
2017-05-29_09.thumb.log
2017-05-29_08.thumb.log
2017-05-29_04.static.log
2017-05-29_03.robots.log
2017-05-29_02.opensearch_desc.log
2017-05-29_01.robots.log
2017-05-29_00.touch.log

2017-05-28_19.thumb.log
2017-05-28_19.static.log
2017-05-28_18.thumb.log
2017-05-28_18.static.log
2017-05-28_15.static.log
2017-05-28_14.thumb.log
2017-05-28_14.static.log
2017-05-28_13.thumb.log
2017-05-28_13.static.log
2017-05-28_12.thumb.log
2017-05-28_04.static.log

2017-05-27_18.opensearch_desc.log
2017-05-27_17.thumb.log
2017-05-27_16.static.log
2017-05-27_13.opensearch_desc.log
2017-05-27_11.thumb.log
2017-05-27_10.thumb.log
2017-05-27_10.robots.log
2017-05-27_06.opensearch_desc.log
2017-05-27_00.thumb.log

2017-05-26_23.static.log
2017-05-26_21.thumb.log
2017-05-26_19.static.log
2017-05-26_18.thumb.log
2017-05-26_17.static.log
2017-05-26_16.opensearch_desc.log
2017-05-26_14.thumb.log
2017-05-26_14.static.log
2017-05-26_13.opensearch_desc.log
2017-05-26_08.robots.log
2017-05-26_06.thumb.log
2017-05-26_00.static.log

2017-05-25_23.static.log
2017-05-25_21.touch.log
2017-05-25_21.thumb.log
2017-05-25_20.robots.log
2017-05-25_19.static.log
2017-05-25_15.static.log
2017-05-25_15.robots.log
2017-05-25_13.static.log
2017-05-25_09.static.log
2017-05-25_05.static.log

[..]

2017-05-18_18.touch.log
2017-05-18_15.touch.log
2017-05-18_01.touch.log

2017-05-17_19.robots.log
2017-05-17_16.robots.log
2017-05-17_15.touch.log
2017-05-17_10.robots.log

2017-05-16_16.touch.log

2017-05-15_17.touch.log
2017-05-15_10.robots.log
2017-05-15_09.robots.log

2017-05-14_21.touch.log
2017-05-14_15.touch.log

2017-05-13_14.touch.log
2017-05-13_03.touch.log

2017-05-12_23.opensearch_desc.log
2017-05-12_17.opensearch_desc.log

2017-05-11_23.touch.log

2017-04-27_17.opensearch_desc.log
2017-04-27_11.opensearch_desc.log

2017-04-25_22.opensearch_desc.log

2017-04-24_19.opensearch_desc.log
2017-04-24_14.opensearch_desc.log
2017-04-24_00.opensearch_desc.log

Especially note how there are stretches were consistently only one entry point of one hour is kept from an entire day, and the same entry point for several days in a row.

The code seems to operate consistently on the entire directory's content, so it doesn't make sense that hour after hour, it is still keeping these files somehow.

https://github.com/wikimedia/puppet/blob/f05eed605a09eb47afd04d5217e6ace34768733c/modules/xenon/files/xenon-log#L62-L79

In light of T166345, I was actually surprised we (supposedly) are only meant to keep 24 files. I was hoping we'd keep at least 7 days's worth (we keep 90 days' worth for the daily summaries, retaining the hourlies for at least a day would be nice) – I suspect this was originally configured based on us only having "combined" flame graphs per hour, not per entry point, so 24 is less now than it once was. There doesn't seem to be any logic about time-based retension, it's based on the number of files only.

Xenon-log creates and purges log files. The SVGs are created and purged by xenon-generate-svgs - which, after it (re)creates SVGs for any newer log files, also deletes SVGs for which no log files exist - basically keeping it in sync with the result of xenon-log, which seems to be working fine.

Tasks:

  • Figure out how retention currently works and (if needed) fix it.
  • Update configation (if needed) to ensure daily files are kept for 90 days. And increase hourly file retention from 24 hours to at least 7 days, perhaps 14 days would be acceptable too.

Event Timeline

Krinkle renamed this task from Xenon logs and flame graphs go missing (random?) to Xenon logs and flame graphs left behind (random?).May 30 2017, 9:29 PM
Krinkle updated the task description. (Show Details)

It's actually doing its job the way it's supposed to in the code. It keeps the 24 most recent hourly ones per tag, and the 90 most recent daily ones per tag:

gilles@mwlog1001:/srv/xenon/logs/hourly$ ls -al | grep touch | wc -l
25
gilles@mwlog1001:/srv/xenon/logs/hourly$ ls -al | grep index.log | wc -l
25
gilles@mwlog1001:/srv/xenon/logs/hourly$ ls -al | grep api.log | wc -l
25

gilles@mwlog1001:/srv/xenon/logs/daily$ ls -al | grep touch | wc -l
91
gilles@mwlog1001:/srv/xenon/logs/daily$ ls -al | grep index.log | wc -l
91
gilles@mwlog1001:/srv/xenon/logs/daily$ ls -al | grep api.log | wc -l
91

Ah! Nice catch. I misread the code to mean that tag equals the formatted date (Y-m-d_H), which indeed didn't make sense.

I suppose, then, that the reason some files persist from several months ago and even several years ago, is because our tags changed (some tags are no longer used, so purge no longer runs for them).

Okay, so now that we understand how it works and confirmed it is working well, let's just increase the hourly retention. I'm proposing 14 days (336), but if size become a concern we could start with 8 days (192 days).

Krinkle lowered the priority of this task from High to Low.Jun 7 2017, 6:53 PM

Change 362114 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] xenon: Increase hourly retention from 1 to 14 days

https://gerrit.wikimedia.org/r/362114

Krinkle renamed this task from Xenon logs and flame graphs left behind (random?) to Increase Xenon logs and flame graphs to 14 days.Jun 29 2017, 12:45 AM

Change 362114 merged by Dzahn:
[operations/puppet@production] xenon: Increase hourly retention from 1 to 14 days

https://gerrit.wikimedia.org/r/362114

18:12 < mutante> !log mwlog1001 - deleted /srv/xenon/logs from 2015 and 2016 as requested by Krinkle. Also merged https://gerrit.wikimedia.org/r/#/c/362114/ so now logs are retained for 14 days