
Scap is checking canary servers in dormant instead of active-dc
Closed, Resolved (Public)

Description

Summary

The scap canary hosts are hardcoded in a hiera configuration file in operations/puppet.git.

  • The switchover process should have an action to update the list of canaries
  • The Puppet roles mediawiki::canary_appserver and mediawiki::appserver::canary_api are apparently legacy/useless.
  • From discussion with SRE: the scap dsh groups/list of canaries should move to conftool

From the last scap sync-file log:

01:14:16 Finished Canaries Synced (duration: 00m 03s)
01:14:16 Executing check 'Check endpoints for mw1279.eqiad.wmnet'
01:14:16 Executing check 'Check endpoints for mw1276.eqiad.wmnet'
01:14:16 Executing check 'Check endpoints for mw1261.eqiad.wmnet'
01:14:16 Executing check 'Check endpoints for mw1264.eqiad.wmnet'
01:14:16 Executing check 'Check endpoints for mwdebug1002.eqiad.wmnet'
01:14:16 Executing check 'Check endpoints for mwdebug1001.eqiad.wmnet'
01:14:16 Executing check 'Check endpoints for mw1263.eqiad.wmnet'
01:14:16 Executing check 'Check endpoints for mw1262.eqiad.wmnet'
01:14:16 Executing check 'Check endpoints for mw1278.eqiad.wmnet'
01:14:16 Executing check 'Check endpoints for mw1277.eqiad.wmnet'
01:14:16 Executing check 'Check endpoints for mw1265.eqiad.wmnet'
01:14:16 Check 'Check endpoints for mw1276.eqiad.wmnet' failed: /wiki/{title} (Main Page) is CRITICAL: Test Main Page returned the unexpected status 503 (expecting: 200); /wiki/{title} (Special Version) is CRITICAL: Test Special Version returned the unexpected status 503 (expecting: 200); /w/api.php (Main Page pageprops) is CRITICAL: Test Main Page pageprops returned the unexpected status 503 (expecting: 200)

01:14:18 Finished Canary Endpoint Check Complete (duration: 00m 02s)
01:14:18 Waiting for canary traffic...
01:14:36 Executing check 'Logstash Error rate for mw1279.eqiad.wmnet'
01:14:36 Executing check 'Logstash Error rate for mw1276.eqiad.wmnet'
01:14:36 Executing check 'Logstash Error rate for mw1261.eqiad.wmnet'
01:14:36 Executing check 'Logstash Error rate for mw1264.eqiad.wmnet'
01:14:36 Executing check 'Logstash Error rate for mwdebug1002.eqiad.wmnet'
01:14:36 Executing check 'Logstash Error rate for mwdebug1001.eqiad.wmnet'
01:14:36 Executing check 'Logstash Error rate for mw1263.eqiad.wmnet'
01:14:36 Executing check 'Logstash Error rate for mw1262.eqiad.wmnet'
01:14:36 Executing check 'Logstash Error rate for mw1278.eqiad.wmnet'
01:14:36 Executing check 'Logstash Error rate for mw1277.eqiad.wmnet'
01:14:36 Executing check 'Logstash Error rate for mw1265.eqiad.wmnet'
01:14:36 Finished sync-check-canaries (duration: 00m 23s)
01:14:36 Started sync-proxies

It should be checking canary servers in codfw instead because the eqiad ones are dormant / not useful.
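
To illustrate the behaviour asked for here, the following is a minimal sketch, not scap's actual code, of picking canaries by active datacenter instead of using a hardcoded list. It assumes the datacenter can be read from the second label of each hostname (as in the log above); the function names and the `ALL_CANARIES` sample list are hypothetical.

```python
from collections import defaultdict


def group_by_datacenter(hosts):
    """Group FQDNs such as 'mw2215.codfw.wmnet' by their second DNS label."""
    groups = defaultdict(list)
    for host in hosts:
        labels = host.split(".")
        if len(labels) < 3:
            raise ValueError(f"unexpected hostname: {host!r}")
        groups[labels[1]].append(host)
    return dict(groups)


def canaries_for(hosts, active_dc):
    """Return only the canaries located in the active datacenter."""
    return group_by_datacenter(hosts).get(active_dc, [])


# Hypothetical sample: a few canaries from each datacenter.
ALL_CANARIES = [
    "mw1261.eqiad.wmnet",
    "mwdebug1001.eqiad.wmnet",
    "mw2215.codfw.wmnet",
    "mwdebug2001.codfw.wmnet",
]

if __name__ == "__main__":
    # The active datacenter would come from wherever the switchover state lives.
    print(canaries_for(ALL_CANARIES, "codfw"))
```

With a single list covering both datacenters, a switchover would only need to change the `active_dc` value rather than edit the host list.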

Event Timeline

Change 461637 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] scap: use mediawiki canaries from codfw

https://gerrit.wikimedia.org/r/461637

Change 461637 merged by Alexandros Kosiaris:
[operations/puppet@production] scap: use mediawiki canaries from codfw

https://gerrit.wikimedia.org/r/461637

Mentioned in SAL (#wikimedia-operations) [2018-09-20T14:28:33Z] <hashar@deploy1001> Synchronized typos: Dummy sync to verify list of canaries for T204907 (duration: 00m 59s)

After discussion with Alexandros and Giuseppe, for now we have just updated the list of hosts in the dsh files. Now we have:

14:28:03 Finished Canaries Synced (duration: 00m 03s)
14:28:03 Executing check 'Check endpoints for mw2218.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2217.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2226.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2225.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mwdebug2002.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2216.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2215.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mwdebug2001.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2227.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2224.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2228.codfw.wmnet'
14:28:08 Finished Canary Endpoint Check Complete (duration: 00m 04s)
14:28:08 Waiting for canary traffic...
14:28:23 Executing check 'Logstash Error rate for mw2218.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2217.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2226.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2225.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mwdebug2002.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2216.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2215.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mwdebug2001.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2227.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2224.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2228.codfw.wmnet'
14:28:24 Finished sync-check-canaries (duration: 00m 24s)

A better fix still needs to be worked out so that those files are updated when the primary datacenter is switched. We can either keep this task open for that or file another one.

greg subscribed.

(I assume SRE will do the adding to conftool and the editing/extending of the relevant dc switchover cookbook.)

In addition to the servers Scap checks, there is also the URL it reports when it finds an issue.

16:34:42 Executing check 'Logstash Error rate for mw2224.codfw.wmnet'
16:34:42 Executing check 'Logstash Error rate for mw2228.codfw.wmnet'
16:34:42 Check 'Logstash Error rate for mw2224.codfw.wmnet' failed: ERROR: 50% OVER_THRESHOLD (Avg. Error rate: Before: 0.09, After: 2.00, Threshold: 1.00)

16:34:42 Canary error check failed for 1 canaries, less than threshold to halt deployment (2/11), see https://logstash.wikimedia.org/goto/2cc7028226a539553178454fc2f14459 for details. Continuing...
16:34:42 Finished sync-check-canaries (duration: 00m 24s)

It now checks codfw servers, but the URL https://logstash.wikimedia.org/goto/2cc7028226a539553178454fc2f14459 is still a shortcut for the eqiad canaries. We should probably either make this a regular dashboard that always shows both sets, so it never needs to vary, or use a query parameter so that it can be set dynamically to the current set of canaries.
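
As a rough sketch of the second option, the query could be generated from whatever canary list is currently in use. This assumes the dashboard accepts a Lucene-style query and that log events carry the short hostname in a field named `host`; both the field name and the `canary_query` helper are assumptions, not something scap or the dashboard is known to provide.

```python
def canary_query(hosts, field="host"):
    """Build a Lucene-style query matching any of the given canary hosts.

    The field name is an assumption; FQDNs are reduced to their short names.
    """
    short_names = sorted({host.split(".")[0] for host in hosts})
    if not short_names:
        raise ValueError("no canary hosts given")
    return f"{field}:(" + " OR ".join(short_names) + ")"


if __name__ == "__main__":
    print(canary_query(["mw2224.codfw.wmnet", "mw2215.codfw.wmnet"]))
```

The resulting string would still have to be URL-encoded and passed to the dashboard in whatever form it expects, which is not covered here.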

As for the logstash URL, I think the best approach is to just have the entire list in a dashboard. I've updated the "scap canary" dashboard. However, for sharing purposes this generated a new URL: https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040. I'll update scap.cfg with it.

Change 463453 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] scap: Update logstash URL for mediawiki canaries

https://gerrit.wikimedia.org/r/463453

Change 463453 merged by Alexandros Kosiaris:
[operations/puppet@production] scap: Update logstash URL for mediawiki canaries

https://gerrit.wikimedia.org/r/463453

Change 463469 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Use conftool to populate mw canaries in scap

https://gerrit.wikimedia.org/r/463469

Change 463708 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] scap: Replace an ugly hack with puppet 4 syntax

https://gerrit.wikimedia.org/r/463708

Change 463709 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] WIP: scap: Move prefix from confd to key creation

https://gerrit.wikimedia.org/r/463709

Change 463708 merged by Alexandros Kosiaris:
[operations/puppet@production] scap: Replace an ugly hack with puppet 4 syntax

https://gerrit.wikimedia.org/r/463708

Change 465661 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Revert "scap: use mediawiki canaries from codfw"

https://gerrit.wikimedia.org/r/465661

Change 465661 merged by Alexandros Kosiaris:
[operations/puppet@production] Revert "scap: use mediawiki canaries from codfw"

https://gerrit.wikimedia.org/r/465661

Change 463469 abandoned by Alexandros Kosiaris:
Use conftool to populate mw canaries in scap

Reason:
https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465411/ was merged, let's abandon this

https://gerrit.wikimedia.org/r/463469

This was done, resolving.