
Did retrying canary checks do anything?
Closed, Resolved (Public)

Description

22:03:44 Executing check 'Logstash canary error rate'
22:03:44 Check 'Logstash canary error rate' failed: ERROR: 33% OVER_THRESHOLD (Avg. errors per 10 seconds: Before: 0.08, After: 1.50, Threshold: 1.00)

22:03:44 The average error rate across canaries increased by 10.0x (rerun with --force to override this check, see https://logstash.wikimedia.org for details).
[r] Retry canary checks
[c] Continue with deployment
[e] Exit scap
What do you want to do? (default: [e]): r
22:04:02 Executing check 'Logstash canary error rate'
22:04:02 Check 'Logstash canary error rate' failed: ERROR: 33% OVER_THRESHOLD (Avg. errors per 10 seconds: Before: 0.08, After: 1.50, Threshold: 1.00)

22:04:02 The average error rate across canaries increased by 10.0x (rerun with --force to override this check, see https://logstash.wikimedia.org for details).
[r] Retry canary checks
[c] Continue with deployment
[e] Exit scap
What do you want to do? (default: [e]): r
22:04:10 Executing check 'Logstash canary error rate'
22:04:10 Check 'Logstash canary error rate' failed: ERROR: 33% OVER_THRESHOLD (Avg. errors per 10 seconds: Before: 0.08, After: 1.50, Threshold: 1.00)

22:04:10 The average error rate across canaries increased by 10.0x (rerun with --force to override this check, see https://logstash.wikimedia.org for details).
[r] Retry canary checks
[c] Continue with deployment
[e] Exit scap
What do you want to do? (default: [e]): r
22:04:21 Executing check 'Logstash canary error rate'
22:04:21 Check 'Logstash canary error rate' failed: ERROR: 33% OVER_THRESHOLD (Avg. errors per 10 seconds: Before: 0.08, After: 1.50, Threshold: 1.00)

22:04:21 The average error rate across canaries increased by 10.0x (rerun with --force to override this check, see https://logstash.wikimedia.org for details).
[r] Retry canary checks
[c] Continue with deployment
[e] Exit scap
What do you want to do? (default: [e]): r
22:04:29 Executing check 'Logstash canary error rate'
22:04:29 Check 'Logstash canary error rate' failed: ERROR: 33% OVER_THRESHOLD (Avg. errors per 10 seconds: Before: 0.08, After: 1.50, Threshold: 1.00)

22:04:29 The average error rate across canaries increased by 10.0x (rerun with --force to override this check, see https://logstash.wikimedia.org for details).
[r] Retry canary checks
[c] Continue with deployment
[e] Exit scap
What do you want to do? (default: [e]): r
22:04:39 Executing check 'Logstash canary error rate'
22:04:39 Check 'Logstash canary error rate' failed: ERROR: 33% OVER_THRESHOLD (Avg. errors per 10 seconds: Before: 0.08, After: 1.50, Threshold: 1.00)

22:04:39 The average error rate across canaries increased by 10.0x (rerun with --force to override this check, see https://logstash.wikimedia.org for details).
[r] Retry canary checks
[c] Continue with deployment
[e] Exit scap
What do you want to do? (default: [e]): r
22:05:09 Executing check 'Logstash canary error rate'
22:05:09 Check 'Logstash canary error rate' failed: ERROR: 33% OVER_THRESHOLD (Avg. errors per 10 seconds: Before: 0.08, After: 1.50, Threshold: 1.00)

22:05:09 The average error rate across canaries increased by 10.0x (rerun with --force to override this check, see https://logstash.wikimedia.org for details).
[r] Retry canary checks
[c] Continue with deployment
[e] Exit scap
What do you want to do? (default: [e]): r
22:06:13 Executing check 'Logstash canary error rate'
22:06:13 Check 'Logstash canary error rate' failed: ERROR: 33% OVER_THRESHOLD (Avg. errors per 10 seconds: Before: 0.08, After: 1.50, Threshold: 1.00)

22:06:13 The average error rate across canaries increased by 10.0x (rerun with --force to override this check, see https://logstash.wikimedia.org for details).
[r] Retry canary checks
[c] Continue with deployment
[e] Exit scap
What do you want to do? (default: [e]): r
22:06:28 Executing check 'Logstash canary error rate'
22:06:28 Check 'Logstash canary error rate' failed: ERROR: 33% OVER_THRESHOLD (Avg. errors per 10 seconds: Before: 0.08, After: 1.50, Threshold: 1.00)

22:06:28 The average error rate across canaries increased by 10.0x (rerun with --force to override this check, see https://logstash.wikimedia.org for details).
[r] Retry canary checks
[c] Continue with deployment
[e] Exit scap
What do you want to do? (default: [e]): r
22:07:04 Executing check 'Logstash canary error rate'
22:07:04 Check 'Logstash canary error rate' failed: ERROR: 33% OVER_THRESHOLD (Avg. errors per 10 seconds: Before: 0.08, After: 1.50, Threshold: 1.00)

22:07:04 The average error rate across canaries increased by 10.0x (rerun with --force to override this check, see https://logstash.wikimedia.org for details).
[r] Retry canary checks
[c] Continue with deployment
[e] Exit scap
What do you want to do? (default: [e]): c
22:07:26 Continuing with deployment
22:07:26 Finished sync-check-canaries (duration: 04m 22s)

It seemed that retrying didn't do anything; it kept reporting the same average error rate, even after waiting a couple of minutes.

Details

Title: Move logstash checker code into scap
Reference: repos/releng/scap!358
Author: dancy
Source Branch: master-I8ec33a8cdad453e35e3c67840f3e0ee843d28dad
Dest Branch: master

Event Timeline

dancy changed the task status from Open to In Progress. Fri, Jun 14, 9:57 PM
dancy claimed this task.
dancy triaged this task as Low priority.

@Reedy wrote:

It seemed that retrying didn't do anything; it kept reporting the same average error rate, even after waiting a couple of minutes.

Indeed. The old logic extended the time window over which error counts were collected each time you retried, so errors from right after the sync stayed in the count on every retry. Now (as of scap 4.89.0) it always looks at only the last 20 seconds (canary_wait_time), so the effects of transient problems should fade away and allow the checker to pass.
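
To illustrate the difference between the two windowing strategies, here is a toy model in Python. It is not scap's code: the function names, the event format, and the way the rate is averaged are all assumptions made for illustration, and it does not try to reproduce the exact numbers in the log above. It only shows why a window that keeps growing from the sync time holds on to an early burst of errors, while a fixed window covering the last 20 seconds forgets it.

```python
from typing import List, Tuple

# Hypothetical event format: (timestamp in seconds, number of errors logged).
Event = Tuple[float, int]


def rate_per_10s(events: List[Event], start: float, end: float) -> float:
    """Average errors per 10 seconds over the half-open window [start, end)."""
    duration = end - start
    if duration <= 0:
        return 0.0
    total = sum(count for ts, count in events if start <= ts < end)
    return total * 10.0 / duration


def growing_window_rate(events: List[Event], sync_time: float, now: float) -> float:
    """Assumed model of the old behaviour: the window starts at the sync
    and is extended to 'now' on every retry."""
    return rate_per_10s(events, sync_time, now)


def fixed_window_rate(events: List[Event], now: float, wait_time: float = 20.0) -> float:
    """Assumed model of the new behaviour: only the last wait_time seconds
    (20 here, standing in for canary_wait_time) are considered."""
    return rate_per_10s(events, now - wait_time, now)


if __name__ == "__main__":
    threshold = 1.0
    # A transient burst: 3 errors logged 5 seconds after the sync at t=0.
    events = [(5.0, 3)]
    for now in (20.0, 26.0, 40.0):
        grow = growing_window_rate(events, 0.0, now)
        fixed = fixed_window_rate(events, now)
        print(
            f"t={now:>4}: growing={grow:.2f} "
            f"({'FAIL' if grow > threshold else 'ok'}), "
            f"fixed={fixed:.2f} ({'FAIL' if fixed > threshold else 'ok'})"
        )
```

In this model, a retry at t=26 still fails with the growing window because the burst at t=5 is still counted, whereas the fixed window has already moved past it and passes.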