
Update our Graphite metrics for current retention config
Closed, Resolved (Public)

Description

A few months ago I was wondering why all our Graphite metrics from before July 2016 are missing. Investigation got me nowhere.

Now I'm wondering why our Graphite metrics don't go back beyond November 2016.

Long boring investigation later: Because that's what our Graphite retention rules (used to) specify:

1m:7d,5m:14d,15m:30d,1h:1y

This means 1-minute resolution for the last 7 days, and so on, down to 1-hour resolution for the last year. And beyond that, nada! For some reason I had assumed that each rule only described one period, and that beyond the last one some kind of default would apply. It turns out there is no unlimited retention by default: anything older than the last rule is gone.
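To make the retention string concrete, here is a minimal sketch that converts each `precision:duration` pair into seconds. The function names are made up for illustration, and it assumes Whisper's unit convention in which a year is 365 days (which matches the `maxRetention: 31536000` that whisper-info reports for `1y`).

```shell
# Convert one time token such as 5m, 1h or 5y to seconds.
# Assumes a year is 365 days; tokens without a unit letter are rejected.
to_seconds() {
  num=${1%[smhdwy]}   # numeric part
  unit=${1#"$num"}    # trailing unit letter
  case $unit in
    s) mult=1 ;;
    m) mult=60 ;;
    h) mult=3600 ;;
    d) mult=86400 ;;
    w) mult=604800 ;;
    y) mult=31536000 ;;
    *) echo "unknown unit in '$1'" >&2; return 1 ;;
  esac
  echo $((num * mult))
}

# Print "secondsPerPoint retentionSeconds points" for each archive in a
# comma-separated retention string.
describe_retention() {
  echo "$1" | tr ',' '\n' | while IFS=: read -r precision duration; do
    p=$(to_seconds "$precision") || exit 1
    d=$(to_seconds "$duration") || exit 1
    echo "$p $d $((d / p))"
  done
}

# The largest retention is how far back any data can exist at all.
max_retention() {
  describe_retention "$1" | awk '$2 > max { max = $2 } END { print max }'
}
```

`max_retention 1m:7d,5m:14d,15m:30d,1h:1y` prints 31536000 (one year), and adding `,1d:5y` raises it to 157680000 (five years). Nothing older than that value can be stored in the file.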

Fortunately, over a year ago this was increased to 5 years in 01d26c2c16e9cbab7c6de1625b705d4ab7ec7c33.

1m:7d,5m:14d,15m:30d,1h:1y,1d:5y

Unfortunately, Graphite hardcodes the retention configuration in the Whisper file of each individual metric, and by default nothing updates the retention rules of existing metrics. So we need to run some whisper command on each of the metrics we care about to make sure we stop deleting data from last year. We're losing stuff every day now.
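A rough sketch of what "run some whisper command on each metric" could look like. It only prints the commands so they can be reviewed first. Assumptions: the resize tool is on the PATH as `whisper-resize` (upstream ships it as `whisper-resize.py`), the target values are the current retention rules plus the xFilesFactor that whisper-info shows for newer files, and `print_resize_commands` is a made-up helper name.

```shell
# Target archive list (current retention rules) and xFilesFactor.
RETENTION="1m:7d 5m:14d 15m:30d 1h:1y 1d:5y"
XFF="0.01"

# Print (do not run) one resize command per .wsp file under a directory.
print_resize_commands() {
  find "$1" -type f -name '*.wsp' | sort | while IFS= read -r f; do
    printf "whisper-resize '%s' %s --xFilesFactor=%s\n" "$f" "$RETENTION" "$XFF"
  done
}

# Example: print_resize_commands /var/lib/carbon/whisper/frontend
```

By default the resize tool keeps the original file next to the new one with a `.bak` suffix, so a mass resize temporarily needs roughly double the disk space, and the backups have to be cleaned up afterwards.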

Screen Shot 2017-11-02 at 15.05.31.png (639×2 px, 195 KB)

Checklist

Using the find sillypipe from T179622#4058076 to verify:

Path | Status regarding retention config | Comment
/var/lib/carbon/whisper/ResourceLoader | OK | Fixed.
/var/lib/carbon/whisper/frontend | OK | Fixed.
/var/lib/carbon/whisper/webpagetest | OK | Fixed, and deleted some stuff. – T179622#4058076
/var/lib/carbon/whisper/browsertime | OK | (Was already fine.)
/var/lib/carbon/whisper/mw/ | OK | Fixed and deleted some stuff – T179622#4077863
/var/lib/carbon/whisper/performance/ | OK | (Was already fine.)
/var/lib/carbon/whisper/ve/ | OK | Fixed.
/var/lib/carbon/whisper/VisualEditor/ | OK | Fixed.

Event Timeline

Krinkle triaged this task as High priority.

Mentioned in SAL (#wikimedia-operations) [2017-11-02T22:02:01Z] <Krinkle> Mass-resizing Graphite/Whisper files on graphite1001 and graphite2001 for T179622 (frontend.* namespace)

Krinkle renamed this task from Update all performance team Graphite metrics for current retention rules to Update our Graphite metrics for current retention rules. (Nov 2 2017, 10:02 PM)
Krinkle updated the task description.

Mentioned in SAL (#wikimedia-operations) [2017-11-03T18:52:13Z] <Krinkle> Starting whisper-mass-resize for frontend.navtiming on graphite2001 (T179622)

Mentioned in SAL (#wikimedia-operations) [2017-11-04T00:18:29Z] <Krinkle> Finished whisper-mass-resize for frontend.navtiming on graphite2001 (T179622)

Krinkle moved this task from Inbox, needs triage to Doing (old) on the Performance-Team board.

Mentioned in SAL (#wikimedia-operations) [2018-03-08T04:27:45Z] <Krinkle> Running whisper-mass-resize for ResourceLoader.* metrics on graphite1001 and graphite2001 (T179622)

Noticed just now that when running a "Last 1 year" query on ResourceLoader metrics, some of them are broken. Most probably the broken ones are the ones we've had the longest. Specifically, they started before June 2016, and thus have the old retention settings imprinted on their Whisper files. This means that 1) they only last a year, and 2) their data is discarded after a week unless at least half of the 1-minute points in each aggregation window have a value (xFilesFactor 0.5).
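To illustrate the second point, a small worked calculation. When Whisper rolls finer points up into a coarser archive, it only writes the aggregate if the fraction of known (non-null) points is at least xFilesFactor, and at least one point is known. The helper name below is hypothetical.

```shell
# Minimum number of known points, out of n, needed for an aggregate to be
# written: the known fraction must be >= xFilesFactor, with at least 1 point.
min_known_points() {
  awk -v xff="$1" -v n="$2" 'BEGIN {
    need = xff * n
    c = int(need)
    if (c < need) c++   # round up to a whole number of points
    if (c < 1) c = 1
    print c
  }'
}

# Rolling 1-minute points into one 5-minute point (n = 5):
#   min_known_points 0.5  5   # old files
#   min_known_points 0.01 5   # newer files
```

With xFilesFactor 0.5, three of the five 1-minute slots must hold a value, so a metric that reports less often than that never makes it past the 7-day archive. With 0.01, a single value is enough.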

I've confirmed this by running whisper-info on graphite1001 on one of the metrics and comparing it to the info output from a newer metric.

whisper-info ResourceLoader/responses/long_cache_control/200/rate.wsp
maxRetention: 31536000
xFilesFactor: 0.5
aggregationMethod: average
fileSize: 309088

Archive 0
retention: 604800
secondsPerPoint: 60 [..]

Archive 1
retention: 1209600
secondsPerPoint: 300 [..]

Archive 2
retention: 2592000
secondsPerPoint: 900 [..]

Archive 3
retention: 31536000
secondsPerPoint: 3600 [..]

The main problem is that there are only 4 archives (nothing beyond 1 year), and that xFilesFactor is 0.5.

Compared to a newer file:

whisper-info /var/lib/carbon/whisper/frontend/navtiming/loadEventEnd/overall/rate.wsp
maxRetention: 157680000
xFilesFactor: 0.00999999977648   
aggregationMethod: average
fileSize: 331000

Archive 0
retention: 604800
secondsPerPoint: 60 [..]

Archive 1
retention: 1209600
secondsPerPoint: 300 [..]

Archive 2
retention: 2592000
secondsPerPoint: 900 [..]

Archive 3
retention: 31536000
secondsPerPoint: 3600 [..]

Archive 4
retention: 157680000
secondsPerPoint: 86400 [..]

After the fix:

whisper-info /var/lib/carbon/whisper/ResourceLoader/responses/long_cache_control/200/rate.wsp
maxRetention: 157680000
xFilesFactor: 0.00999999977648
aggregationMethod: average
fileSize: 331000

Archive 0
retention: 604800
secondsPerPoint: 60 [..]

Archive 1
retention: 1209600
secondsPerPoint: 300 [..]

Archive 2
retention: 2592000
secondsPerPoint: 900 [..]

Archive 3
retention: 31536000
secondsPerPoint: 3600 [..]

Archive 4
retention: 157680000
secondsPerPoint: 86400 [..]
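As a cross-check, the fileSize values above can be reproduced from the archive lists. This sketch assumes the standard Whisper on-disk layout (a 16-byte header, 12 bytes of metadata per archive, 12 bytes per data point); `whisper_file_size` is a made-up helper name.

```shell
# Expected size in bytes of a Whisper file, given one
# "secondsPerPoint:retentionSeconds" argument per archive.
whisper_file_size() {
  archives=0
  points=0
  for pair in "$@"; do
    step=${pair%%:*}    # secondsPerPoint
    keep=${pair##*:}    # retention in seconds
    points=$((points + keep / step))
    archives=$((archives + 1))
  done
  echo $((16 + 12 * archives + 12 * points))
}

# Old layout (4 archives):
#   whisper_file_size 60:604800 300:1209600 900:2592000 3600:31536000
# New layout (5 archives): same arguments plus 86400:157680000
```

The first call gives 309088 and the second 331000, matching whisper-info before and after the fix. Five years of daily points costs about 21 KB extra per metric.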

Mentioned in SAL (#wikimedia-operations) [2018-03-12T18:53:01Z] <Krinkle> Clean up left-over .wsp.bak files under frontend.navtiming* on graphite1001 (following T179622)

I used the following find command over /var/lib/carbon/whisper.

$ find -name "rate.wsp" | xargs -I '%' bash -c 'echo "file: %" && whisper-info %' | grep -E 'file:|maxRetention|xFiles' | grep -v 'maxRetention: 157680000' | grep -F -B1 -A1 maxRetention
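For reference, the filtering half of that pipeline could also be written as a single awk step. This is an alternative sketch, not the command that was run. It reads the same stream (a `file: <path>` line followed by that file's whisper-info output) and prints only the paths whose maxRetention differs from the expected value. `outdated_files` is a hypothetical name.

```shell
# Print paths whose maxRetention is not the expected value
# (defaults to 157680000, i.e. 5 years).
outdated_files() {
  awk -v want="${1:-157680000}" '
    /^file: /         { path = substr($0, 7) }       # remember current file
    /^maxRetention: / { if ($2 != want) print path } # report mismatches
  '
}

# Possible usage, run from /var/lib/carbon/whisper:
#   find . -name 'rate.wsp' -exec sh -c 'echo "file: $1"; whisper-info "$1"' _ {} \; | outdated_files
```

Passing the file name to `sh -c` as an argument instead of substituting it into the command string avoids trouble with unusual characters in paths.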

I found a bunch of older WebPageTest metrics as well. However, before converting them, I think we should take this moment to get rid of old data that we aren't using anymore.

Deleted:

  • webpagetest.testwiki.* (unused since Dec 2016 – 239a8d07c594)
  • webpagetest.test2wiki.* (unused since Dec 2016 – 239a8d07c594)

Proposing to delete (@Peter Please check!):

Based on mtime being more than 600 days ago:

/var/lib/carbon/whisper/webpagetest/$ find . -name "*.wsp" -mtime +600 | {local} | node merge.js '/'
- enwiki-bc-mobile-2gslow
  - anonymous/Barack_Obama/us-east-1/Google_Chrome-emulateMobile/repeatView
- enwiki-bc-mobile-beta-2gslow
  - anonymous/Barack_Obama/us-east-1/Google_Chrome-emulateMobile/repeatView
- enwiki-beta-mobile
  - anonymous/Chamber_music/us-east-1/Google_Chrome-emulateMobile/repeatView
- enwiki-beta
  - anonymous/Chamber_music/us-east-1
    - Firefox/repeatView
    - Google_Chrome/repeatView
    - Internet_Explorer/repeatView
- enwiki-mobile-2gslow-netspeedb
  - anonymous/Barack_Obama/us-east-1/Google_Chrome-emulateMobile
    - firstView
    - repeatView
- enwiki-mobile-2gslow
  - anonymous
    - Barack_Obama/us-east-1/Google_Chrome-emulateMobile/repeatView
    - ja-BarackObama/us-east-1/Google_Chrome-emulateMobile/repeatView
    - ja-Japan/us-east-1/Google_Chrome-emulateMobile/repeatView
- enwiki-mobile-beta-2gslow
  - anonymous
    - Barack_Obama/us-east-1/Google_Chrome-emulateMobile/repeatView
    - ja-BarackObama/us-east-1/Google_Chrome-emulateMobile/repeatView
    - ja-Japan/us-east-1/Google_Chrome-emulateMobile/repeatView
- enwiki-mobile-real3g
  - anonymous
    - Abdulkalam/Bangalore/Google_Chrome/firstView
    - Facebook/SanFrancisco/Google_Chrome/firstView
- enwiki-mobile
  - BlankPage/us-east-1/anonymous/chrome-emulateMobile
    - firstView
    - repeatView
  - Facebook-second/us-east-1/anonymous/chrome-emulateMobile/firstView
  - Facebook
    - Dulles/anonymous/Dulles_MotoG_Motorola_G___Chrome
      - firstView
      - repeatView
    - us-east-1
      - anonymous/chrome-emulateMobile
        - firstView
        - repeatView
      - authenticated/chrome-emulateMobile/firstView
  - San_Francisco
    - Dulles/anonymous/Dulles_MotoG_Motorola_G___Chrome/firstView
    - us-east-1/anonymous/chrome-emulateMobile
      - firstView
      - repeatView
  - anonymous
    - BlankPage/us-east-1/Google_Chrome-emulateMobile/repeatView
    - Facebook/Dulles/Dulles_iPhone6_iPhone_6_iOS_9/repeatView
    - San_Francisco/us-east-1/Google_Chrome-emulateMobile
      - firstView
      - repeatView
- enwiki
  - BlankPage/us-east-1/anonymous
    - chrome
      - firstView
      - repeatView
    - firefox
      - firstView
      - repeatView
    - ie
      - firstView
      - repeatView
  - Facebook-second/us-east-1/anonymous/chrome/firstView
  - Facebook
    - anonymous/ie
      - firstView
      - repeatView
    - us-east-1
      - anonymous
        - chrome
          - firstView
          - repeatView
        - firefox
          - firstView
          - repeatView
        - ie
          - firstView
          - repeatView
      - authenticated/chrome/firstView
  - Main_Page/us-east-1/anonymous
    - chrome
      - firstView
      - repeatView
    - firefox
      - firstView
      - repeatView
    - ie
      - firstView
      - repeatView
  - anonymous
    - BlankPage
      - Dulles/Google_Chrome/repeatView
      - us-east-1
        - Firefox/repeatView
        - Google_Chrome/repeatView
        - Internet_Explorer/repeatView
    - Facebook
      - Dulles
        - Firefox/repeatView
        - Google_Chrome/repeatView
      - eu-west-1
        - Firefox/firstView
        - Google_Chrome
          - firstView
          - repeatView
      - us-east-1
        - Firefox/repeatView
        - Internet_Explorer/repeatView
    - Main_Page/us-east-1
      - Firefox
        - firstView
        - repeatView
      - Google_Chrome
        - firstView
        - repeatView
      - Internet_Explorer
        - firstView
        - repeatView
- portals-beta
  - anonymous/wikipedia_org/us-east-1
    - Firefox/repeatView
    - Google_Chrome/repeatView
    - Internet_Explorer/repeatView
- portals
  - anonymous/wikipedia_org/us-east-1
    - Firefox/repeatView
    - Google_Chrome/repeatView
    - Internet_Explorer/repeatView
- wikidatawiki-beta
  - anonymous/Italy/us-east-1
    - Firefox/repeatView
    - Google_Chrome/repeatView
    - Internet_Explorer/repeatView
- wikidatawiki
  - anonymous
    - Berlin/us-east-1
      - Firefox/repeatView
      - Google_Chrome/repeatView
      - Internet_Explorer/repeatView
    - Main_Page/us-east-1
      - Firefox/repeatView
      - Google_Chrome/repeatView
      - Internet_Explorer/repeatView

Looks good @Krinkle! Once we have moved to Linux and collected the "final" metrics for Windows, we can do a major cleanup again.

Mentioned in SAL (#wikimedia-operations) [2018-03-20T03:56:21Z] <Krinkle> Deleting stale webpagetest.* metrics on graphite1001 and graphite2001 (any wsp file last modified 600+ days ago) – T179622

Mentioned in SAL (#wikimedia-operations) [2018-03-20T23:41:02Z] <Krinkle> Mass no-op resizing of Whisper files on graphite2001 and graphite1001 for T179622 (webpagetest.* namespace)

Mentioned in SAL (#wikimedia-operations) [2018-03-24T00:39:35Z] <Krinkle> Correct retention rules for Whisper files on graphite2001 and graphite1001 per T179622 (/var/lib/carbon/whisper/mw/*)

Using the find sillypipe from T179622#4058076 found some metrics with old retention rules under /var/lib/carbon/whisper/mw. Added to the checklist in the task description.

Found outdated retention rules in:

  • mw/performance/save/: Fixed with resize.
  • mw/js/deprecate/: Deleted some old ResourceLoader stuff. Fixed the rest with resize.
  • mw/js/rlfeature2016/: Deleted (old ResourceLoader).
  • mw/errors/: Fixed with resize.

Krinkle renamed this task from Update our Graphite metrics for current retention rules to Update our Graphite metrics for current retention config. (Mar 24 2018, 1:13 AM)
Krinkle updated the task description.

Mentioned in SAL (#wikimedia-operations) [2018-03-24T01:27:03Z] <Krinkle> Correct retention rules for Whisper files on graphite2001 and graphite1001 per T179622 (/var/lib/carbon/whisper/VisualEditor/*)

Mentioned in SAL (#wikimedia-operations) [2018-03-27T02:57:48Z] <Krinkle> Fix retention rules for Whisper files on graphite2001 and graphite1001 per T179622 (/var/lib/carbon/whisper/ve/*)