
Fix conflict between monthly and weekly index buckets
Closed, ResolvedPublic

Description

Indexes created with the monthly pattern %{+YYYY.MM} get names of the same shape as indexes created with the weekly pattern %{+xxxx.ww}. This could lead to problems when curating weekly and monthly indexes if they match the same curator pattern.
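To make the collision concrete, here is a minimal Python sketch. It assumes %{+YYYY.MM} behaves like strftime's "%Y.%m" and that %{+xxxx.ww} is the ISO week-year and ISO week number; the helper names are made up for illustration and are not part of our logstash config.

```python
from datetime import date


def monthly_suffix(day: date) -> str:
    """Approximates %{+YYYY.MM}: calendar year and month."""
    return day.strftime("%Y.%m")


def weekly_suffix(day: date) -> str:
    """Approximates %{+xxxx.ww}: ISO week-year and ISO week number."""
    iso_year, iso_week, _weekday = day.isocalendar()
    return f"{iso_year:04d}.{iso_week:02d}"


# An event from late January (monthly bucket) and one from the first ISO week
# of the year (weekly bucket) end up with the same date suffix.
january_event = date(2022, 1, 20)
first_week_event = date(2022, 1, 5)
collision = monthly_suffix(january_event) == weekly_suffix(first_week_event)
```

Both suffixes come out as 2022.01, so a curator timestring alone cannot tell the two bucket sizes apart.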

One possible solution is to make weekly buckets underscore-separated and calendar year buckets dot-separated.

@herron @fgiunchedi what do you think about that convention?

Details

Related Changes in Gerrit:
Repo                     Branch      Lines +/-
operations/puppet        production  +1 -1
operations/puppet        production  +2 -2
operations/puppet        production  +15 -10
operations/puppet        production  +30 -31
operations/puppet        production  +1 -1
operations/puppet        production  +16 -10
operations/puppet        production  +1 -1
operations/puppet        production  +22 -2
operations/puppet        production  +1 -1
operations/puppet        production  +6 -6
operations/puppet        production  +4 -5
operations/software/ecs  master      +3 -3
operations/software/ecs  master      +1 -1
operations/puppet        production  +8 -24
operations/puppet        production  +14 -14
operations/puppet        production  +57 -0
operations/puppet        production  +14 -0
operations/puppet        production  +44 -33
operations/puppet        production  +46 -2
operations/puppet        production  +8 -4

Event Timeline

Sounds like that'd work, although I wonder if there are any alternatives that may be more obvious at a glance? Could we get away with doing something like including the unit as a keyword before the stamp? e.g.

ecs-1.7.0-5-alerts-bucket-daily-2022.01.01
ecs-1.7.0-5-alerts-bucket-weekly-2022.01
ecs-1.7.0-5-alerts-bucket-monthly-2022.01

Agreed, something explicit in the index name seems preferable to me too and less prone to errors.

Sounds like that'd work, although I wonder if there are any alternatives that may be more obvious at a glance? Could we get away with doing something like including the unit as a keyword before the stamp? e.g.

I want to explore this idea further. It would remove the need to key off of outputs if, by default, we included the bucket in the index pattern. This would look like:

  • logstash-daily-2022.04.04
  • logstash-deploy-daily-2022.04.04
  • logstash-mediawiki-daily-2022.04.04
  • logstash-syslog-daily-2022.04.04
  • ecs-default-1.7.0-5-weekly-2022.14
  • ecs-alerts-1.7.0-5-yearly-2022
  • ecs-test-1.7.0-5-weekly-2022.14
  • w3creportingapi-1.0.0-2-weekly-2022.14
  • dlq-1.0.0-1-daily-2022.04.04

This matches ^(?<output>[a-z0-9]+)-((?<partition>[a-z0-9]+)-)?((?<version>[0-9\.]+)-(?<revision>[0-9]+)-)?(?<bucket>daily|weekly|monthly|yearly)-(?<created>[0-9.]+)$.
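As a quick sanity check of that expression, here is a sketch in Python. The only change is the named-group syntax: Python's re module wants (?P<name>...) rather than (?<name>...). The parse_index helper is hypothetical.

```python
import re

# Same expression as above, split per component for readability.
INDEX_RE = re.compile(
    r"^(?P<output>[a-z0-9]+)-"
    r"((?P<partition>[a-z0-9]+)-)?"
    r"((?P<version>[0-9\.]+)-(?P<revision>[0-9]+)-)?"
    r"(?P<bucket>daily|weekly|monthly|yearly)-"
    r"(?P<created>[0-9.]+)$"
)


def parse_index(name):
    """Return the named components of an index name, or None if it does not match."""
    match = INDEX_RE.match(name)
    return match.groupdict() if match else None


EXAMPLES = [
    "logstash-daily-2022.04.04",
    "logstash-mediawiki-daily-2022.04.04",
    "ecs-default-1.7.0-5-weekly-2022.14",
    "ecs-alerts-1.7.0-5-yearly-2022",
    "w3creportingapi-1.0.0-2-weekly-2022.14",
    "dlq-1.0.0-1-daily-2022.04.04",
]
parsed = {name: parse_index(name) for name in EXAMPLES}
```

Optional components that are absent (partition, version, revision) come back as None, so logstash-daily-2022.04.04 parses with only output, bucket and created set.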

I do not think this will simplify our curator config much. We'll be trading output-defined actions for a set of implicit (bucket-defined) actions and adding exceptions. This would look like:

  • dlq: delete older than 2 days dlq-*-%Y.%m.%d
  • ecs-test: delete older than 2 weeks ecs-test-*-weekly-%Y.%W
  • ecs-alerts: delete older than N years ecs-alerts-*-yearly-%Y
  • daily indexes: delete older than 91 days *-daily-%Y.%m.%d
  • weekly indexes: delete older than 12 weeks *-weekly-%Y.%W
  • monthly indexes: delete older than 3 months *-monthly-%Y.%m
  • (can't set a default yearly filter)

This introduces a problem: if we add a new monthly index with a 9-month retention (e.g. ecs-foo-*-monthly-2022.04), the generic monthly action (delete older than 3 months) would have to exclude that pattern, and we would also have to set up the 9-month action.
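A rough illustration of that coupling, in Python rather than curator config. This is not how curator evaluates actions; it only models "generic rule plus exceptions" using glob matching, and the rule table and function name are invented for the example.

```python
from fnmatch import fnmatchcase

GENERIC_MONTHLY = "*-monthly-*"        # generic action: delete older than 3 months
EXCEPTIONS = {"ecs-foo-*-monthly-*": 9}  # hypothetical index with 9-month retention


def monthly_retention(index, default=3, exceptions=EXCEPTIONS):
    """Return the retention in months that would apply to a monthly index."""
    if not fnmatchcase(index, GENERIC_MONTHLY):
        return None  # not a monthly index; other rules apply
    for pattern, months in exceptions.items():
        if fnmatchcase(index, pattern):
            return months
    return default


with_exclusion = monthly_retention("ecs-foo-1.7.0-5-monthly-2022.04")
without_exclusion = monthly_retention("ecs-foo-1.7.0-5-monthly-2022.04", exceptions={})
```

If the exception is forgotten, the generic rule silently wins and the index is kept for 3 months instead of 9, which is the failure mode I am worried about.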

I see a link between the creation format and retention. If we were to extend the definition of "bucket" to combine format and retention ("policy", perhaps?), we could reuse curator actions more effectively. An example strawman approach:

^(?<output>[a-z0-9]+)-((?<partition>[a-z0-9]+)-)?((?<version>[0-9\.]+)-(?<revision>[0-9]+)-)?(?<policy>[a-z0-9]+)-(?<created>[0-9.]+)$

daily:
  format: %Y.%m.%d
  retention: 90 days
  example: logstash-mediawiki-daily-2022.04.04
weekly:
  format: %Y.%W
  retention: 14 weeks
  example: w3creportingapi-1.0.0-2-weekly-2022.14
monthly:
  format: %Y.%m
  retention: 3 months
  example: ecs-foo-1.7.0-5-monthly-2022.04
3years:
  format: %Y
  retention: 3 years
  example: ecs-alerts-1.7.0-5-3years-2022
2weeks:
  format: %Y.%W
  retention: 2 weeks
  example: ecs-test-1.7.0-5-2weeks-2022.14
2days:
  format: %Y.%m.%d
  retention: 2 days
  example: dlq-1.0.0-1-2days-2022.04.04
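For illustration, the strawman table can be treated as a lookup from policy name to date format and retention. The sketch below is Python, assumes the format strings are strftime-style (as curator timestrings are), and uses a made-up index_name helper.

```python
from datetime import date

# Policy name -> creation format and retention, copied from the strawman above.
POLICIES = {
    "daily":   {"format": "%Y.%m.%d", "retention": (90, "days")},
    "weekly":  {"format": "%Y.%W",    "retention": (14, "weeks")},
    "monthly": {"format": "%Y.%m",    "retention": (3, "months")},
    "3years":  {"format": "%Y",       "retention": (3, "years")},
    "2weeks":  {"format": "%Y.%W",    "retention": (2, "weeks")},
    "2days":   {"format": "%Y.%m.%d", "retention": (2, "days")},
}


def index_name(prefix, policy, day):
    """Build '<prefix>-<policy>-<created>' using the policy's date format."""
    return f"{prefix}-{policy}-{day.strftime(POLICIES[policy]['format'])}"


today = date(2022, 4, 4)
names = [
    index_name("logstash-mediawiki", "daily", today),
    index_name("w3creportingapi-1.0.0-2", "weekly", today),
    index_name("ecs-alerts-1.7.0-5", "3years", today),
]
```

The policy name alone then determines both how the index is stamped and which curator action applies to it.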

What do you think?

If I'm understanding correctly the idea is to have a set of generic curator rules that would automatically set retention based on patterns like "2weeks" or "2days" in the index name?

One challenge that comes to mind is transitioning retention time on a given index. Say increasing from 2 weeks to 4 weeks. IIUC we would be on the hook to handle renaming/reindexing of the affected existing indices from 2weeks to 4weeks or weekly shortly after the next index rollover to avoid losing them.

If I'm understanding correctly the idea is to have a set of generic curator rules that would automatically set retention based on patterns like "2weeks" or "2days" in the index name?

Yes, the pattern combines retention period and bucket. "2days", etc. are examples to illustrate the concept. The policies could just as easily follow a more kind-of-data-oriented approach (ecs-alerts-1.7.0-5-alerts-2022.14) or a policy revision approach (ecs-1.7.0-5-alerts-0-2022.14).

One challenge that comes to mind is transitioning retention time on a given index. Say increasing from 2 weeks to 4 weeks. IIUC we would be on the hook to handle renaming/reindexing of the affected existing indices from 2weeks to 4weeks or weekly shortly after the next index rollover to avoid losing them.

Good point. Naming policies with such specific terms makes them less flexible. Perhaps kind-of-data-oriented policies are preferable here. To evaluate the procedure for adjustment:

  1. Extending or reducing retention
    1. Reconfigure curator policy with new schedule
  2. Extending or reducing retention with resized buckets
    1. Configure curator with new policy
    2. Configure logstash to set new policy and creation pattern
    3. Reconfigure old curator policy with new schedule
      1. Add comment indicating when the last index matching this pattern would be deleted
    4. Remove old policy when no longer needed

It occurs to me we could also get there by leveraging the output-partition pair to achieve the same goal. For example:

ecs-1.7.0-5-alerts-2022.14:
  format: %Y.%W
  retention: 105 weeks
ecs-1.7.0-5-alerts2-2022:
  format: %Y
  retention: 2 years

It also occurs to me a combination of output-partition-policy revision would get us there. For example:

ecs-1.7.0-5-alerts-0-2022.14:
  format: %Y.%W
  retention: 105 weeks
ecs-1.7.0-5-alerts-1-2022:
  format: %Y
  retention: 2 years

The more we explore, the more I see the benefits of rolling out a modified index pattern to all indexes like so: ^(?<output>[a-z0-9]+)-(?<partition>[a-z0-9]+)-(?<policy_revision>[0-9]+)-(?<template_version>[0-9\.]+)-(?<template_revision>[0-9]+)-(?<created>[0-9.]+)$
This would:

  1. Move partition to after output definition
  2. Add a policy revision after partition

The indexes would then be formatted like so:

  • logstash-default-0-1.0.0-1-2022.04.04
  • logstash-deploy-0-1.0.0-1-2022.04.04
  • logstash-mediawiki-0-1.0.0-1-2022.04.04
  • logstash-syslog-0-1.0.0-1-2022.04.04
  • ecs-default-0-1.7.0-5-2022.14
  • --> ecs-alerts-0-1.7.0-5-2022.14
  • --> ecs-alerts-1-1.7.0-5-2022
  • ecs-test-0-1.7.0-5-2022.14
  • w3creportingapi-default-0-1.0.0-2-2022.14
  • dlq-default-0-1.0.0-1-2022.04.04

This could homogenize the curator filters into a predictable format and replace all regex filters with prefix filters. Unfortunately, we could not construct any generic "catch-all" rules, but we do not have those even now, and they are not a hard requirement AFAIK.
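A small sketch of what "prefix filters instead of regex filters" buys us, written in Python as a stand-in for the curator config. The helper names are hypothetical; the index names are the ones listed above.

```python
def curator_prefix(output, partition, policy_revision):
    """Prefix identifying one output-partition-policy revision combination.

    The trailing hyphen keeps revision 1 from also matching revision 10.
    """
    return f"{output}-{partition}-{policy_revision}-"


def select(indexes, prefix):
    """Plain prefix match; no regex needed."""
    return [name for name in indexes if name.startswith(prefix)]


INDEXES = [
    "logstash-default-0-1.0.0-1-2022.04.04",
    "logstash-mediawiki-0-1.0.0-1-2022.04.04",
    "ecs-default-0-1.7.0-5-2022.14",
    "ecs-alerts-0-1.7.0-5-2022.14",
    "ecs-alerts-1-1.7.0-5-2022",
    "dlq-default-0-1.0.0-1-2022.04.04",
]

weekly_alerts = select(INDEXES, curator_prefix("ecs", "alerts", 0))
yearly_alerts = select(INDEXES, curator_prefix("ecs", "alerts", 1))
```

Each policy revision gets its own prefix, so the old (weekly) and new (yearly) ecs-alerts indexes can carry different retention during a transition without any exclusion rules.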

Per discussion today in the Observability meeting, we agreed to evaluate the proposed index name refactor to:

  1. support simplified curator filters
  2. support rolling transitions of bucket size based on the output-partition-policy revision prefix combination
  3. have logstash enforce the index pattern, else send the event to the DLQ
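The enforcement step could look roughly like the following. This is a Python sketch of the logic only, not the logstash filter itself: it assumes the proposed expression is used as-is (with Python's (?P<name>...) group syntax), and the route helper and the "dlq" marker are invented for the example.

```python
import re

# Proposed pattern:
# output-partition-policy_revision-template_version-template_revision-created
NEW_INDEX_RE = re.compile(
    r"^(?P<output>[a-z0-9]+)-"
    r"(?P<partition>[a-z0-9]+)-"
    r"(?P<policy_revision>[0-9]+)-"
    r"(?P<template_version>[0-9\.]+)-"
    r"(?P<template_revision>[0-9]+)-"
    r"(?P<created>[0-9.]+)$"
)

DLQ = "dlq"  # hypothetical marker for "send to the dead letter queue"


def route(target_index):
    """Return the target index if it is well-formed, else the DLQ marker."""
    return target_index if NEW_INDEX_RE.match(target_index) else DLQ


accepted = route("ecs-alerts-1-1.7.0-5-2022")
rejected = route("logstash-2022.04.04")  # legacy name without the new components
```

Anything that does not carry all six components is diverted instead of silently creating an index that no curator prefix covers.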

Change 775375 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: populate target index format and add pipeline diagnostics

https://gerrit.wikimedia.org/r/775375

Change 777874 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: replace all instances of @metadata.partition

https://gerrit.wikimedia.org/r/777874

Change 777880 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: set partition on legacy indexes

https://gerrit.wikimedia.org/r/777880

Change 777882 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: transform human-friendly values to bucket date format

https://gerrit.wikimedia.org/r/777882

Change 777891 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: add target index validation step

https://gerrit.wikimedia.org/r/777891

Change 777874 merged by Cwhite:

[operations/puppet@production] logstash: replace all instances of @metadata.partition

https://gerrit.wikimedia.org/r/777874

Change 775375 merged by Cwhite:

[operations/puppet@production] logstash: populate target index format and add pipeline diagnostics

https://gerrit.wikimedia.org/r/775375

Change 777880 merged by Cwhite:

[operations/puppet@production] logstash: set partition on legacy indexes

https://gerrit.wikimedia.org/r/777880

Change 777882 merged by Cwhite:

[operations/puppet@production] logstash: transform rotation frequency values to datestamp format

https://gerrit.wikimedia.org/r/777882

Change 777891 merged by Cwhite:

[operations/puppet@production] logstash: add target index validation step

https://gerrit.wikimedia.org/r/777891

colewhite changed the task status from Open to In Progress. May 24 2022, 6:46 PM
colewhite claimed this task.
colewhite triaged this task as Medium priority.

Change 798974 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] beta-logs: enable pipeline-managed index patterns

https://gerrit.wikimedia.org/r/798974

Change 798974 merged by Cwhite:

[operations/puppet@production] beta-logs: enable pipeline-managed index patterns

https://gerrit.wikimedia.org/r/798974

Change 798982 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: curator support new and legacy index patterns

https://gerrit.wikimedia.org/r/798982

Change 799001 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: enable pipeline-managed index patterns

https://gerrit.wikimedia.org/r/799001

Change 798982 merged by Cwhite:

[operations/puppet@production] logstash: curator support new and legacy index patterns

https://gerrit.wikimedia.org/r/798982

Change 802873 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/software/ecs@master] add new index pattern format

https://gerrit.wikimedia.org/r/802873

Change 802873 merged by jenkins-bot:

[operations/software/ecs@master] add new index pattern format

https://gerrit.wikimedia.org/r/802873

Change 803345 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/software/ecs@master] templates: replace all version instances

https://gerrit.wikimedia.org/r/803345

Change 803345 merged by jenkins-bot:

[operations/software/ecs@master] templates: replace all version instances

https://gerrit.wikimedia.org/r/803345

Change 803350 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] add new index pattern to ecs templates

https://gerrit.wikimedia.org/r/803350

Change 803350 merged by Cwhite:

[operations/puppet@production] logstash: add new index pattern to ecs templates

https://gerrit.wikimedia.org/r/803350

Change 799001 merged by Cwhite:

[operations/puppet@production] logstash: enable pipeline-managed index patterns

https://gerrit.wikimedia.org/r/799001

Deployed! Will be watching it this week for any issues.

Last thing to do is migrate w3creportingapi-* and dlq-*.

Change 815799 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: add missing closing curly brace

https://gerrit.wikimedia.org/r/815799

Change 815799 merged by Cwhite:

[operations/puppet@production] logstash: add missing closing curly brace

https://gerrit.wikimedia.org/r/815799

Change 822444 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: replace legacy routing filters

https://gerrit.wikimedia.org/r/822444

Change 822450 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: use logstash routing for w3creportingapi stream

https://gerrit.wikimedia.org/r/822450

Change 822452 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: update production w3creportingapi guard condition

https://gerrit.wikimedia.org/r/822452

Change 822450 merged by Cwhite:

[operations/puppet@production] logstash: use logstash routing for w3creportingapi stream

https://gerrit.wikimedia.org/r/822450

Change 822452 merged by Cwhite:

[operations/puppet@production] logstash: update production w3creportingapi guard condition

https://gerrit.wikimedia.org/r/822452

Change 824751 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] beta-logs: dlq use logstash-managed index pattern

https://gerrit.wikimedia.org/r/824751

Change 824752 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: dlq use logstash-managed index pattern

https://gerrit.wikimedia.org/r/824752

Change 824753 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] beta-logs: w3creportingapi to use logstash-managed index pattern

https://gerrit.wikimedia.org/r/824753

Change 824754 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: w3creportingapi to use logstash-managed index pattern

https://gerrit.wikimedia.org/r/824754

Change 824751 merged by Cwhite:

[operations/puppet@production] beta-logs: dlq use logstash-managed index pattern

https://gerrit.wikimedia.org/r/824751

Change 824753 merged by Cwhite:

[operations/puppet@production] beta-logs: w3creportingapi to use logstash-managed index pattern

https://gerrit.wikimedia.org/r/824753

Change 824752 merged by Cwhite:

[operations/puppet@production] logstash: dlq use logstash-managed index pattern

https://gerrit.wikimedia.org/r/824752

Change 826372 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: add dlq revision to index pattern

https://gerrit.wikimedia.org/r/826372

Change 826372 merged by Cwhite:

[operations/puppet@production] logstash: use puppet dlq version and revision for index pattern

https://gerrit.wikimedia.org/r/826372

Change 824754 merged by Cwhite:

[operations/puppet@production] logstash: w3creportingapi to use logstash-managed index pattern

https://gerrit.wikimedia.org/r/824754

dlq and w3creportingapi indexes migrated. This is complete.