Increase retention of ArcLamp SVGs to 2 years
Closed, Resolved (Public)

Description

Increase retention of flame graph SVGs from 90 days to 2 years. We don't have much use for the log files beyond the last month or two, so we can keep their retention as it is.

Work:

  • Change the file removal logic such that it takes separate configuration (arclamp.yaml) for retention of log files vs SVG files.
  • Estimate the required space for 2 years of tiny SVG files.
  • Update prof config to set SVGs to 2 years.
  • Clean up and remove the swift integrations.
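
A minimal sketch of what separate retention for log files and SVG files could look like. This is not the actual arclamp-log.py code: the `RETENTION` keys, the function name, and the age-based selection are assumptions for illustration, with the top-level directory (`logs/` or `svgs/`) deciding which limit applies.

```python
from datetime import datetime, timedelta
from pathlib import PurePosixPath

# Hypothetical settings: one retention window per top-level directory.
RETENTION = {
    "logs": timedelta(days=90),
    "svgs": timedelta(days=730),
}


def select_expired(files, now, retention=RETENTION):
    """Return the relative paths that are older than their retention window.

    `files` is an iterable of (relative_path, modified_at) pairs, for
    example ("svgs/daily/2023-02-15.excimer-wall.load.svgz", datetime(...)).
    Paths under a directory without a configured window are never selected.
    """
    expired = []
    for rel_path, modified_at in files:
        kind = PurePosixPath(rel_path).parts[0]
        max_age = retention.get(kind)
        if max_age is not None and now - modified_at > max_age:
            expired.append(rel_path)
    return expired
```

With these settings a 100-day-old trace log would be selected for removal, while an SVG of the same age would be kept.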
Original description

From T199853, [..] Xenon daily data is currently rotated after 90 days, and Xenon hourly data rotated after 14 days. [..] See also T166624.

I'd like to increase retention of profiles, and keep hourly data for 90 days, and daily data for 1-2 years. [..] that would require:

  • daily: 22 GB per 90d = 179 GB per 2 years.
  • hourly: 4 GB per 14d = 26 GB per 90 days.
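
The two figures above follow from scaling current usage linearly with the retention window. A minimal sketch of that arithmetic, assuming 2 years is taken as 730 days and the result is rounded up to whole GB (both are assumptions, and the function name is hypothetical):

```python
import math


def scale_storage_gb(size_gb, current_days, target_days):
    """Linearly extrapolate disk usage to a longer retention window."""
    return math.ceil(size_gb * target_days / current_days)


daily_two_years = scale_storage_gb(22, 90, 730)  # daily data, 90d -> 2 years
hourly_90_days = scale_storage_gb(4, 14, 90)     # hourly data, 14d -> 90d
print(daily_two_years, hourly_90_days)
```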

Due to limited storage on the webperf-2 VMs, this will be blocked on adding an upload mechanism to ArcLamp to support Swift.

Details at T199853#4437177.

Event Timeline

Krinkle renamed this task from Increase retention of ArcLamp to 2 years to Increase retention of ArcLamp to 2 years by moving to Swift. Jun 22 2020, 8:08 PM

Removing T227026 as a strict dependency. It probably makes sense to first deploy the non-Puppet ArcLamp that uses Swift to the existing webperf1002, and then move to k8s from there later as part of T227026.

The main reason is to unlock the 90-day retention sooner (T235455), and indeed the 2-year retention (this ticket).

Krinkle triaged this task as Medium priority. Jul 17 2020, 12:09 AM
Aklapper added a subscriber: dpifke.

Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See the email sent to the task assignee on February 06th 2022 (and T295729).

Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome.

If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".

Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how best to manage your individual work in Phabricator.

Krinkle renamed this task from Increase retention of ArcLamp to 2 years by moving to Swift to Increase retention of ArcLamp SVGs to 2 years. Oct 4 2022, 7:34 PM
Krinkle updated the task description.

The new bare metal host has 280GB of disk space on /srv. We currently use 134GB (52%) of this, broken down as follows:

  • 78G logs/daily (of which 73 GB is uncompressed logs from the last 3 days, and 5 GB is compressed logs from the 3 weeks prior)
  • 57G logs/hourly
  • 122M svgs/daily
  • 390M svgs/hourly

Our current retention configuration is set to 336 hourly (14 days), and 90 daily, as declared at /etc/arclamp-log-excimer.yaml, which is provisioned from Puppet Hiera:

profile::webperf::arclamp::compress_logs_days: 3
profile::webperf::arclamp::retain_hourly_logs_hours: 336
profile::webperf::arclamp::retain_daily_logs_days: 90

Having said that, it appears this isn't working correctly. For example, looking at one tag (load.php) from one channel (excimer-wall), the oldest SVG is 2023-02-15.excimer-wall.load.svgz, and the oldest trace log is 2023-02-15.excimer-wall.load.log.gz, as seen at https://performance.wikimedia.org/arclamp/logs/daily/ and https://performance.wikimedia.org/arclamp/svgs/daily/. This is confirmed by our monitoring (Grafana dashboard: Arc Lamp), which indeed says the oldest file pruned is generally around the 3 week mark.

Setting aside for a moment a likely bug in arclamp-log.py that prunes files too early, after only about a third of the configured retention (most likely it is counting unrelated files from other channels, so there is likely a 3x effect somewhere)...
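
To illustrate the suspected mechanism, here is a small sketch of how a count-based limit applied across mixed groups shortens the effective window. It is not taken from arclamp-log.py, whose actual logic may differ; the group names and all function names are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def daily_files(days, groups, end):
    """One (day, group) entry per group per day, going back from `end`."""
    return [(end - timedelta(days=i), g) for i in range(days) for g in groups]


def prune_global(files, keep):
    """Keep the newest `keep` files overall, ignoring which group they belong to."""
    return sorted(files, reverse=True)[:keep]


def prune_per_group(files, keep):
    """Keep the newest `keep` files within each group separately."""
    by_group = defaultdict(list)
    for day, group in files:
        by_group[group].append((day, group))
    kept = []
    for items in by_group.values():
        kept.extend(sorted(items, reverse=True)[:keep])
    return kept


def span_days(kept):
    """Number of calendar days covered by the kept files."""
    days = [day for day, _ in kept]
    return (max(days) - min(days)).days + 1


files = daily_files(120, ["group-a", "group-b", "group-c"], date(2023, 4, 1))
print(span_days(prune_global(files, 90)))     # window shrinks with group count
print(span_days(prune_per_group(files, 90)))  # full window per group
```

With three groups sharing one count of 90, the global variant covers only 30 days, while the per-group variant covers the full 90.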

My proposal to achieve this is to let go of log files far earlier than flame graphs. Keeping trace logs for 3.5 weeks seems more than enough, and we can cut it down further if needed. Let's extrapolate how long we could keep SVGs for in the available space. We currently keep SVGs for 3.5 weeks, which consumes 512MB. We have (rounding down) 140GB of space available, which could give us up to 9 months or 280 days of retention.
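
The 280 figure matches the ratio of available space to the current SVG footprint (122M + 390M = 512MB). A small sketch of that ratio, assuming 1GB = 1024MB; the function name is illustrative only:

```python
def capacity_multiple(free_gb, footprint_mb):
    """How many times the current footprint fits into the free space."""
    return free_gb * 1024 / footprint_mb


svg_footprint_mb = 122 + 390  # svgs/daily + svgs/hourly
multiple = capacity_multiple(140, svg_footprint_mb)
print(multiple)
```

Reading this multiple as a number of days assumes roughly 512MB of SVGs per day. If 512MB is instead the footprint of the whole current window, the same space would stretch further.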

Things we could do to extend this:

  • Compress raw trace logs after 2 days instead of 3 days, which frees up an extra 26GB, or about 52 days (roughly 2 months) of retention.
  • Shorten retention of hourly graphs to e.g. 1 week.
  • Remove support for cputime graphs. These currently take 50% of the space, and are why the original estimate of 2 years isn't realistic: the addition of walltime doubled the data. I personally focus on walltime only. We could start by shortening the cputime retention, or keep it at the current length.
  • Reduce sampling rate. The interval is 60s, local to each php-fpm worker on all app servers. This yields:
    • 4M excimer-wall.all samples daily (1M RunJobs, 1M api.php, 1M index.php, 0.5M rest.php, 0.1M load.php).
    • 3M excimer-cpu.all samples daily (0.5M RunJobs, 0.5M api.php, 0.5M index.php, 0.5M rest.php, 0.1M load.php).
    • About 150-200K excimer-wall.all samples per hour.
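
As a rough check on the sample counts above, and on what a longer sampling interval would do to them, here is a small sketch. It assumes the number of samples is inversely proportional to the interval; how much disk space that saves is a separate question, since file sizes need not scale linearly with sample count. The function names are hypothetical.

```python
def hourly_average(daily_samples):
    """Average samples per hour implied by a daily total."""
    return daily_samples / 24


def samples_at_interval(daily_samples, current_interval_s, new_interval_s):
    """Expected daily samples if the sampling interval is changed."""
    return daily_samples * current_interval_s / new_interval_s


wall_daily = 4_000_000  # excimer-wall.all samples per day at a 60s interval
print(round(hourly_average(wall_daily)))
print(samples_at_interval(wall_daily, 60, 120))
```

The hourly average of about 167K is in line with the 150-200K per hour quoted above, and doubling the interval to 120s would roughly halve the daily total.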

Change 908374 had a related patch set uploaded (by Krinkle; author: Krinkle):

[performance/docroot@master] php-profiling: Create new flame graph selectors

https://gerrit.wikimedia.org/r/908374

In Firefox this looks good but in Safari and Chrome it looks like this for me:

Screenshot 2023-04-13 at 06.34.13.png (1×2 px, 479 KB)

Change 908374 merged by jenkins-bot:

[performance/docroot@master] php-profiling: Create new flame graph selectors

https://gerrit.wikimedia.org/r/908374

Krinkle claimed this task.
Krinkle updated the task description.