
Shorten logstash retention temporarily
Closed, Resolved · Public

Description

With T200960: Logstash packet loss essentially fixed, the daily indices have grown significantly. I don't think we have enough space at the moment to store 30 days' worth of logs (note the jump on Aug 01 below). I propose we shorten retention to 20 days until more hardware is available to expand the elasticsearch cluster.

health status index                      uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   logstash-2018.07.15        8c9mHFHtSZSYpm4vfk6fIg   1   2   22178907            0     51.6gb         17.2gb
green  open   logstash-2018.07.16        -MU2KyT5RniDmNNu_kFYJg   1   2   21420726            0     56.8gb         18.9gb
green  open   logstash-2018.07.17        lQLJbkzSQLy1MU9ms7UIuA   1   2   32341235            0     81.1gb           27gb
green  open   logstash-2018.07.18        zXogj42hT-SOyIbtH1eQyw   1   2   41270961            0      120gb           40gb
green  open   logstash-2018.07.19        TWJQ3VVJShigNBqVG73GGA   1   2   27917982            0     96.8gb         32.2gb
green  open   logstash-2018.07.20        fkKwqF2jTuSVCCowxI7Ojg   1   2   29655604            0     77.2gb         25.7gb
green  open   logstash-2018.07.21        jZ_2VEwoRiK1jyWq_UT2Og   1   2   37298191            0     93.2gb           31gb
green  open   logstash-2018.07.22        1KK5nZabTEaf_TTKCrju1A   1   2   35022762            0     86.1gb         28.7gb
green  open   logstash-2018.07.23        jAj_pgvvQgylJbmNDHcCJA   1   2   29495638            0     73.3gb         24.4gb
green  open   logstash-2018.07.24        oRhOzOUzRMiqAVGdFslvKw   1   2   28934880            0     69.6gb         23.2gb
green  open   logstash-2018.07.25        uwz1Q2vNSxeU9esiOTlNeA   1   2   28811814            0       71gb         23.6gb
green  open   logstash-2018.07.26        Crs7ovg4SH6hk2HQBeH9OQ   1   2   33923331            0     79.6gb         26.5gb
green  open   logstash-2018.07.27        7fd6Zr1VQeKcl9mKcbPFng   1   2   37867909            0     87.4gb         29.1gb
green  open   logstash-2018.07.28        TmvU5E17Q6iUyX2-9kShCA   1   2   36741805            0     81.5gb         27.1gb
green  open   logstash-2018.07.29        _kgX0vPwSYOkuUIySLy05g   1   2   37386064            0     84.2gb         28.1gb
green  open   logstash-2018.07.30        nHBNpuCmTAuAZf0UC4VMsQ   1   2   35410644            0     89.6gb         29.8gb
green  open   logstash-2018.07.31        lbu-c5FyQbGF5GJCFsZEGg   1   2   22007146            0     58.3gb         19.4gb
green  open   logstash-2018.08.01        Jn8Y6yDLTIW7s0jWcPjLfQ   1   2   18820037            0     50.3gb         16.7gb
green  open   logstash-2018.08.02        nisbJJqVQ6iNfsq6xifjdw   1   2   60243114            0    176.2gb         58.7gb
green  open   logstash-2018.08.03        OC_TRPReTvWkUpCv36pvlw   1   2  118656863            0    316.3gb        105.4gb
green  open   logstash-2018.08.04        g1PXmAZ0S366Pj5xjSZfYQ   1   2  111150066            0    294.2gb         98.1gb
green  open   logstash-2018.08.05        wN5QsRQDSDyJRqhrgk7G3Q   1   2  106689339            0    282.4gb         94.1gb
green  open   logstash-2018.08.06        5_x2JVeGSjOnJJ6Gub4f4Q   1   2  103911287            0    277.8gb         92.7gb
green  open   logstash-2018.08.07        guKiDaenRMK1IZhI4sqWHA   1   2  119193259            0      321gb          107gb
green  open   logstash-2018.08.08        eoSdWMRpTaqZkvfe3FR9Eg   1   2  121122893            0    326.8gb        108.9gb
green  open   logstash-2018.08.09        MZgVZRY_SmuLZCsO7g3NPQ   1   2  117946182            0    333.2gb        111.1gb
green  open   logstash-2018.08.10        UoG_FVTXRASrkUy41Lo5Ag   1   2  108333043            0    280.7gb         93.6gb
green  open   logstash-2018.08.11        DvR_MOmjQEOvseB3darusg   1   2  104247074            0    268.9gb         89.7gb
green  open   logstash-2018.08.12        lHfTcTi1Qzut47ZMqSb8JQ   1   2  103329426            0    266.8gb         88.9gb
green  open   logstash-2018.08.13        Ums0SeWlRGul9DMZIIOa6g   1   2  106676870            0    277.9gb         92.7gb
green  open   logstash-2018.08.14        ikPxvshpSvOZM5bKjAPtKQ   1   2   94564331            0    253.8gb           84gb
green  open   logstash-2018.08.15        57JjETJCRhiuphm42oHDng   1   2          3            0    299.1kb         99.7kb
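
For illustration, shortening retention with elasticsearch-curator amounts to deleting the daily indices older than the cutoff. A rough manual equivalent is sketched below; the host, the logstash- index naming, and the 20-day cutoff are assumptions based on this proposal, not the exact curator action we'd deploy.

# Sketch only: curator's delete_indices action would normally handle this.
# Assumes daily indices named logstash-YYYY.MM.DD and a local client node.
cutoff=$(date -d '20 days ago' +%Y.%m.%d)
for idx in $(curl -s 'localhost:9200/_cat/indices/logstash-2*?h=index'); do
  # zero-padded dates make lexicographic comparison equivalent to date order
  if [[ "${idx#logstash-}" < "$cutoff" ]]; then
    curl -X DELETE "localhost:9200/${idx}"
  fi
done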

Event Timeline

fgiunchedi created this task.

Could we maybe dump by channel type? api-feature-usage is by far the majority of logstash events, but is much less likely to be useful to retain for the full 30 days relative to other log types (I think, anyway; I'm not a consumer of that log type, so I'm not really sure how exactly it's used).

We can't delete inside indices easily, no. Dropping old indices is cheap compared to actually looking inside them and deleting only specific data. I'll clarify in the task description that this is a temporary band-aid, though, until we get more logstash hardware.

fgiunchedi renamed this task from "Shorten logstash retention" to "Shorten logstash retention temporarily". Aug 14 2018, 11:38 PM
fgiunchedi updated the task description.

api-feature-usage is exposed via Special:ApiFeatureUsage, which queries the log entries from elasticsearch; I'm not sure whether that depends on them being in logstash, though.

[..] api-feature-usage is by far the majority of logstash events, [..]

This channel has always been among the bigger ones, but it's currently extra large due to the recent deprecation of an API still actively used by three WMF internal services.

Specifically, the api-feature pattern known as action=query&prop=revisions&!rvslots: over 50% of all Logstash entries combined (including those from syslog and other non-MediaWiki services) are due to this. They can be found with the following query:

(channel:api-feature-usage AND agent:(Parsoid OR WMF) AND feature:rvslots) OR (type:parsoid AND message:"Template Fetch")

It involves these five normalised messages:

type      | message
mediawiki | api-feature-usage INFO "action=query&prop=revisions&!rvslots" "Mobile-Content-Service/WMF"
mediawiki | api-feature-usage INFO "action=query&prop=revisions&!rvslots" "ChangePropagation/WMF"
mediawiki | api-feature-usage INFO "action=query&prop=revisions&!rvslots" "Parsoid/0.9"
parsoid   | WARNING Template Fetch Because "rvslots" was not specified, a legacy format has been used for the output. This format is deprecated, and in the future the new format will always be used.
parsoid   | WARNING Template Fetch Subscribe to the mediawiki-api-announce mailing list at https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce for notice of API deprecations and breaking changes. Use [[Special:ApiFeatureUsage]] to see usage of deprecated features by your application.
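
For reference, the overall share could presumably be checked directly against the cluster by running the same query string through the _count API; in the sketch below, localhost:9200 and the logstash-* index pattern are assumptions.

# Sketch: count entries matching the query above across all logstash indices
curl -s 'localhost:9200/logstash-*/_count' -H 'Content-Type: application/json' -d'
{
  "query": {
    "query_string": {
      "query": "(channel:api-feature-usage AND agent:(Parsoid OR WMF) AND feature:rvslots) OR (type:parsoid AND message:\"Template Fetch\")"
    }
  }
}'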

See the announcement for this deprecation. I've filed T201974, which recommends temporarily disabling this deprecation.

ApiFeatureUsage doesn't depend on it being in logstash, but it's convenient for looking at usage of deprecated features to see whether something seems safe to be completely removed.

Another option would be to exclude the specific feature value from logstash, like the code that was removed in rOPUPc9d82c3a365e: Logstash: Cleanup exclusion of API continuation logging.

With T201974 solved, the overall influx into the Logstash databases has dropped by over 50%:

[Screenshots attached: Screen Shot 2018-08-20 at 17.48.04.png, Screen Shot 2018-08-20 at 17.48.10.png]

Thanks for your help on investigating this everyone! Very helpful insights.

As it stands, I believe these are the options:

  1. unconditionally shorten retention for all indices via elasticsearch-curator
  2. since this is sort of a one-time "spam", delete unwanted records from existing indices as @Bawolff suggested
  3. @herron suggested dropping the number of replicas from 2 to 1 for "older" indices, e.g. older than 15 days

Option 1 seems undesirable, and thus we should do it only as a last resort; option 3 seems like a good tradeoff given the current capacity we have. Finally, we should do option 2 for now, but as a one-off, since manually cleaning up "spam" is obviously not sustainable.

(I'm using "spam" for lack of a better word here; the messages themselves have value.)

Adding the below to the /etc/curator/cleanup_logstash.yaml curator config temporarily, and running /usr/bin/curator --config /etc/curator/config-logstash.yaml /etc/curator/cleanup_logstash.yaml once, should do the trick for a one-off cleanup of the high-volume logs.

3:
  action: replicas
  description: >-
    after 15 days set number of replicas to 1
  options:
    count: 1
  filters:
  - filtertype: pattern
    kind: prefix
    value: logstash-
    exclude:
  - filtertype: age
    source: creation_date
    direction: older
    unit: days
    unit_count: 15

Tested in a lab instance:

before:

# curl 'localhost:9200/_cat/indices?v'
health status index                      uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   logstash-syslog-2018.08.21 X92LbL7hSJ6D7-_j_nenRA   1   2      85250            0      3.6mb          3.6mb
yellow open   logstash-syslog-2018.08.07 fOx001daQ4aBhfbzZv6W6g   1   2    2051499            0     96.3mb         96.3mb
green  open   .kibana                    6fPzHSABRf28GRrxccslBA   1   0          3            0     13.6kb         13.6kb
yellow open   logstash-syslog-2018.07.27 hcNE4qc2T4KJqWsLhkDImQ   1   2   26483623            0      1.2gb          1.2gb
yellow open   logstash-syslog-2018.08.06 4JT8pPiJRZKEt84MdsgWow   1   2     672760            0     31.4mb         31.4mb
yellow open   logstash-syslog-2018.07.26 450X1oKvR1eBF6UnZkaZYQ   1   2   12242619            0    614.8mb        614.8mb

after:

# curl 'localhost:9200/_cat/indices?v'
health status index                      uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   logstash-syslog-2018.08.21 X92LbL7hSJ6D7-_j_nenRA   1   2     150625            0      8.6mb          8.6mb
yellow open   logstash-syslog-2018.08.07 fOx001daQ4aBhfbzZv6W6g   1   2    2051499            0     96.3mb         96.3mb
green  open   .kibana                    6fPzHSABRf28GRrxccslBA   1   0          3            0     13.6kb         13.6kb
yellow open   logstash-syslog-2018.07.27 hcNE4qc2T4KJqWsLhkDImQ   1   1   26483623            0      1.2gb          1.2gb
yellow open   logstash-syslog-2018.08.06 4JT8pPiJRZKEt84MdsgWow   1   2     672760            0     31.4mb         31.4mb
yellow open   logstash-syslog-2018.07.26 450X1oKvR1eBF6UnZkaZYQ   1   1   12242619            0    614.8mb        614.8mb
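
For reference, the same per-index change can presumably also be made directly through the index settings API, without curator; in the sketch below, the host and the index name are placeholders.

# Sketch: lower the replica count on a single older index via the settings API
curl -X PUT 'localhost:9200/logstash-2018.07.28/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 1}}'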

Change 454354 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: reduce replica count on old logstash indices

https://gerrit.wikimedia.org/r/454354

Thanks for your help on investigating this everyone! Very helpful insights.

As it stands, I believe these are the options:

  1. unconditionally shorten retention for all indices via elasticsearch-curator
  2. since this is sort of a one-time "spam", delete unwanted records from existing indices as @Bawolff suggested

The call to issue for this (for a single index) would be something like this (followed by /_forcemerge to actually reclaim the disk space):

curl -X POST "localhost:9200/logstash-2018.08.03/_delete_by_query" -H 'Content-Type: application/json' -d'
{
  "query": {
    "query_string": {
      "query": "(channel:api-feature-usage AND agent:(Parsoid OR WMF) AND feature:rvslots) OR (type:parsoid AND message:\"Template Fetch\")"
    }
  }
}'
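
The follow-up force merge mentioned above would presumably be along these lines; only_expunge_deletes limits the merge to segments that contain deleted documents.

# Sketch: reclaim disk space from documents removed by _delete_by_query
curl -X POST 'localhost:9200/logstash-2018.08.03/_forcemerge?only_expunge_deletes=true'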

Mentioned in SAL (#wikimedia-operations) [2018-08-27T15:12:13Z] <godog> set transient low watermark to 80% for elasticsearch logstash cluster to allow shard replica allocation - T201971
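
That transient setting change would presumably look something like the following sketch; the 80% value comes from the SAL entry above, and the host is a placeholder.

# Sketch: set the transient low disk watermark so replica shards can still be
# allocated; transient settings are cleared on a full cluster restart
curl -X PUT 'localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"cluster.routing.allocation.disk.watermark.low": "80%"}}'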

decreased logstash elasticsearch index replica count to 1 on indices older than 1 day:

health status index                      uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   logstash-2018.07.28        TmvU5E17Q6iUyX2-9kShCA   1   1   36741805            0     54.3gb         27.1gb
green  open   logstash-2018.07.29        _kgX0vPwSYOkuUIySLy05g   1   1   37386064            0     56.1gb         28.1gb
green  open   logstash-2018.07.30        nHBNpuCmTAuAZf0UC4VMsQ   1   1   35410644            0     59.7gb         29.8gb
green  open   logstash-2018.07.31        lbu-c5FyQbGF5GJCFsZEGg   1   1   22007146            0     38.9gb         19.4gb
green  open   logstash-2018.08.01        Jn8Y6yDLTIW7s0jWcPjLfQ   1   1   18820037            0     33.5gb         16.7gb
green  open   logstash-2018.08.02        nisbJJqVQ6iNfsq6xifjdw   1   1   60243114            0    117.5gb         58.7gb
green  open   logstash-2018.08.03        OC_TRPReTvWkUpCv36pvlw   1   1  118656863            0    210.7gb        105.4gb
green  open   logstash-2018.08.04        g1PXmAZ0S366Pj5xjSZfYQ   1   1  111150066            0    196.2gb         98.1gb
green  open   logstash-2018.08.05        wN5QsRQDSDyJRqhrgk7G3Q   1   1  106689339            0    188.2gb         94.1gb
green  open   logstash-2018.08.06        5_x2JVeGSjOnJJ6Gub4f4Q   1   1  103911287            0    185.3gb         92.7gb
green  open   logstash-2018.08.07        guKiDaenRMK1IZhI4sqWHA   1   1  119193259            0      214gb          107gb
green  open   logstash-2018.08.08        eoSdWMRpTaqZkvfe3FR9Eg   1   1  121122893            0      218gb        108.9gb
green  open   logstash-2018.08.09        MZgVZRY_SmuLZCsO7g3NPQ   1   1  117946182            0    222.2gb        111.1gb
green  open   logstash-2018.08.10        UoG_FVTXRASrkUy41Lo5Ag   1   1  108333043            0      187gb         93.6gb
green  open   logstash-2018.08.11        DvR_MOmjQEOvseB3darusg   1   1  104247074            0    179.4gb         89.7gb
green  open   logstash-2018.08.12        lHfTcTi1Qzut47ZMqSb8JQ   1   1  103329426            0    177.8gb         88.9gb
green  open   logstash-2018.08.13        Ums0SeWlRGul9DMZIIOa6g   1   1  106676870            0    186.1gb         93.1gb
green  open   logstash-2018.08.14        ikPxvshpSvOZM5bKjAPtKQ   1   1   98317937            0    174.5gb         87.3gb
green  open   logstash-2018.08.15        57JjETJCRhiuphm42oHDng   1   1   99003697            0    170.6gb         85.4gb
green  open   logstash-2018.08.16        1sfqN8tESay3eBLDuupIeg   1   1   91030557            0    170.2gb         85.1gb
green  open   logstash-2018.08.17        TtDZytNcRVqQC9-5NrEvIw   1   1   64489531            0    125.7gb         62.8gb
green  open   logstash-2018.08.18        SwCGo_2FRTiDLQehC1DEXQ   1   1   59428640            0      109gb         54.5gb
green  open   logstash-2018.08.19        hCRD3TwTSg6f4XK8mLhE0w   1   1   58321112            0    107.3gb         53.7gb
green  open   logstash-2018.08.20        MJNl1L-MSGeyvqLVuM-iIw   1   1   53108141            0    105.2gb         52.6gb
green  open   logstash-2018.08.21        bVvXHEIsRO2xvtmy3XJZgg   1   1   52424064            0    105.7gb         52.8gb
green  open   logstash-2018.08.22        VqleAKqCTey3jKM3u9EbaQ   1   1   57247424            0    115.9gb         57.9gb
green  open   logstash-2018.08.23        M-ccGEs9RiilCBmj-q6csg   1   1   60396314            0    120.8gb         60.4gb
green  open   logstash-2018.08.24        hoGgMIEgS1K42AsMK7mzVw   1   1   56537595            0    113.3gb         56.6gb
green  open   logstash-2018.08.25        zBUQ0VjITM6Sm5nw3C3IZw   1   1   57506651            0    118.3gb         59.1gb
green  open   logstash-2018.08.26        fRdT6WeQQRiyLkHZEt1psA   1   1   59359543            0    120.1gb           60gb
yellow open   logstash-2018.08.27        GZk1dBunSheMB_rHgC8SFA   1   2   39329242            0     89.7gb         44.4gb
green  open   logstash-backup_dce2       y7tLA30ZSVujBu-7Bb3xFQ   1   1    1012975            0      2.2gb          1.1gb
green  open   logstash-syslog-2018.07.28 OwQMouwqRzSRBu7su0VN9A   1   1      38118            0     28.3mb         14.1mb
green  open   logstash-syslog-2018.07.29 UAg_YwQxSTSWZCWdyqqMBQ   1   1      41129            0       31mb         15.5mb
green  open   logstash-syslog-2018.07.30 kmUQLsGxS6uBmMaYOMzadw   1   1      42281            0     32.9mb         16.4mb
green  open   logstash-syslog-2018.07.31 ichFY8gOTXu7KMYBvBK9vA   1   1      18325            0     14.8mb          7.4mb
green  open   logstash-syslog-2018.08.01 4JSp8cp8S5m7invBPith3Q   1   1      20641            0     19.2mb          9.6mb
green  open   logstash-syslog-2018.08.02 Ki4kJ8CGSSarRLn0fQRYyw   1   1     418511            0    312.2mb        156.1mb
green  open   logstash-syslog-2018.08.03 yHYpsoOITaee5PSoTLpofQ   1   1     903308            0    615.7mb        307.9mb
green  open   logstash-syslog-2018.08.04 aX-y71coT5CrNulGDk5pJQ   1   1     468934            0    362.6mb        181.3mb
green  open   logstash-syslog-2018.08.05 EM1kUJYRTICdIOAbteIqeQ   1   1     405765            0    315.9mb        157.9mb
green  open   logstash-syslog-2018.08.06 -3CIm73bQMyEn96ak0CFgw   1   1     411881            0    323.5mb        161.7mb
green  open   logstash-syslog-2018.08.07 Ha59asAxSuylhXM9QTcGHw   1   1     610723            0    458.4mb        229.2mb
green  open   logstash-syslog-2018.08.08 jBstZtJJSeSabQZdxmgZDQ   1   1     476216            0    349.3mb        174.6mb
green  open   logstash-syslog-2018.08.09 LKMyUOVtRHqlDriVSVH44Q   1   1     497248            0    368.5mb        184.3mb
green  open   logstash-syslog-2018.08.10 Q_wCixXxTvCErZxMWB5a4A   1   1     521283            0    392.5mb        196.3mb
green  open   logstash-syslog-2018.08.11 HUAhGIPxTKagLhL2L2xfUg   1   1     407367            0    291.8mb        145.8mb
green  open   logstash-syslog-2018.08.12 6nkgBoMESmWaA4N1FtqQKA   1   1     396095            0      282mb          141mb
green  open   logstash-syslog-2018.08.13 DLKJvEnzSIqboWo8wpYh-Q   1   1     410454            0    295.5mb        147.8mb
green  open   logstash-syslog-2018.08.14 xLA3sgCWSEC2Z3As_bulGw   1   1     418426            0    320.1mb          160mb
green  open   logstash-syslog-2018.08.15 Gp8Bcx7YRVG_ytUyfiaABg   1   1     383677            0    278.5mb        139.2mb
green  open   logstash-syslog-2018.08.16 0sVXcwRrScmTM_Zj1DYi2Q   1   1     391714            0    288.5mb        144.2mb
green  open   logstash-syslog-2018.08.17 AUZF8skkSM6jSNBqzglDmg   1   1     392838            0    287.7mb        143.8mb
green  open   logstash-syslog-2018.08.18 LN187xAFRn6V4FL__XaKFw   1   1     401569            0    289.3mb        144.7mb
green  open   logstash-syslog-2018.08.19 suBq4GtnRcOL59MiYZjXMw   1   1     368504            0    271.3mb        135.7mb
green  open   logstash-syslog-2018.08.20 Cup28-lvR9ezIOx2XBewoQ   1   1     401094            0    299.1mb        149.6mb
green  open   logstash-syslog-2018.08.21 s7wYqR5_Qhe0yCwPt4NX_w   1   1     428993            0    316.7mb        158.4mb
green  open   logstash-syslog-2018.08.22 pq2f73izRFya4pHT1qUfHw   1   1     379862            0    284.3mb        142.1mb
green  open   logstash-syslog-2018.08.23 X-yx7p17SmCCurGnNzH3Dg   1   1     399633            0    297.4mb        148.7mb
green  open   logstash-syslog-2018.08.24 b80bvtvSSYKhiEekeqnY9Q   1   1     291523            0    216.7mb        108.3mb
green  open   logstash-syslog-2018.08.25 zh4NY2HwRYqL8aeot-UR2w   1   1     261209            0    195.3mb         97.5mb
green  open   logstash-syslog-2018.08.26 MYXJLS9dTu6gNkKiDcj8SA   1   1     271077            0    197.5mb         98.7mb
green  open   logstash-syslog-2018.08.27 4yW4bOIDS7OH62rCrwRAlA   1   2     197792            0    217.6mb         72.1mb

Change 454354 merged by Herron:
[operations/puppet@production] logstash: reduce replica count on old logstash indices

https://gerrit.wikimedia.org/r/454354

fgiunchedi claimed this task.

Resolving this since we're using fewer replicas for older indices now and no longer have filesystem usage pressure. We can reevaluate in a few weeks if needed.