
Scap is checking canary servers in dormant instead of active-dc
Closed, Resolved, Public

Description

Summary

The scap canary hosts are hardcoded in a hiera configuration file in operations/puppet.git.

  • The switchover process should have an action to update the list of canaries.
  • The Puppet roles mediawiki::canary_appserver and mediawiki::appserver::canary_api are apparently legacy/useless.
  • From discussion with SRE: the scap dsh groups/list of canaries should move to conftool.
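
To illustrate the idea behind the points above, here is a minimal sketch of choosing canaries by active datacenter rather than from one hardcoded list. It is not how scap, hiera, or conftool actually store or look up this data: the `CANARIES_BY_DC` mapping and the `canaries_for` function are hypothetical names, and the host lists are shortened samples taken from the logs in this task.

```python
from typing import Dict, List

# Hypothetical mapping for illustration only. In production the lists
# live in hiera / dsh files; the hosts here are samples from the logs.
CANARIES_BY_DC: Dict[str, List[str]] = {
    "eqiad": ["mw1261.eqiad.wmnet", "mwdebug1001.eqiad.wmnet"],
    "codfw": ["mw2215.codfw.wmnet", "mwdebug2001.codfw.wmnet"],
}


def canaries_for(active_dc: str,
                 by_dc: Dict[str, List[str]] = CANARIES_BY_DC) -> List[str]:
    """Return the canary hosts for the given active datacenter."""
    try:
        # Return a copy so callers cannot mutate the shared mapping.
        return list(by_dc[active_dc])
    except KeyError:
        raise ValueError(f"unknown datacenter: {active_dc!r}") from None


if __name__ == "__main__":
    for host in canaries_for("codfw"):
        print(host)
```

With a lookup of this shape, a datacenter switchover would only need to change the value passed as `active_dc` instead of editing the host list by hand.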

From the last scap sync-file log:

01:14:16 Finished Canaries Synced (duration: 00m 03s)
01:14:16 Executing check 'Check endpoints for mw1279.eqiad.wmnet'
01:14:16 Executing check 'Check endpoints for mw1276.eqiad.wmnet'
01:14:16 Executing check 'Check endpoints for mw1261.eqiad.wmnet'
01:14:16 Executing check 'Check endpoints for mw1264.eqiad.wmnet'
01:14:16 Executing check 'Check endpoints for mwdebug1002.eqiad.wmnet'
01:14:16 Executing check 'Check endpoints for mwdebug1001.eqiad.wmnet'
01:14:16 Executing check 'Check endpoints for mw1263.eqiad.wmnet'
01:14:16 Executing check 'Check endpoints for mw1262.eqiad.wmnet'
01:14:16 Executing check 'Check endpoints for mw1278.eqiad.wmnet'
01:14:16 Executing check 'Check endpoints for mw1277.eqiad.wmnet'
01:14:16 Executing check 'Check endpoints for mw1265.eqiad.wmnet'
01:14:16 Check 'Check endpoints for mw1276.eqiad.wmnet' failed: /wiki/{title} (Main Page) is CRITICAL: Test Main Page returned the unexpected status 503 (expecting: 200); /wiki/{title} (Special Version) is CRITICAL: Test Special Version returned the unexpected status 503 (expecting: 200); /w/api.php (Main Page pageprops) is CRITICAL: Test Main Page pageprops returned the unexpected status 503 (expecting: 200)

01:14:18 Finished Canary Endpoint Check Complete (duration: 00m 02s)
01:14:18 Waiting for canary traffic...
01:14:36 Executing check 'Logstash Error rate for mw1279.eqiad.wmnet'
01:14:36 Executing check 'Logstash Error rate for mw1276.eqiad.wmnet'
01:14:36 Executing check 'Logstash Error rate for mw1261.eqiad.wmnet'
01:14:36 Executing check 'Logstash Error rate for mw1264.eqiad.wmnet'
01:14:36 Executing check 'Logstash Error rate for mwdebug1002.eqiad.wmnet'
01:14:36 Executing check 'Logstash Error rate for mwdebug1001.eqiad.wmnet'
01:14:36 Executing check 'Logstash Error rate for mw1263.eqiad.wmnet'
01:14:36 Executing check 'Logstash Error rate for mw1262.eqiad.wmnet'
01:14:36 Executing check 'Logstash Error rate for mw1278.eqiad.wmnet'
01:14:36 Executing check 'Logstash Error rate for mw1277.eqiad.wmnet'
01:14:36 Executing check 'Logstash Error rate for mw1265.eqiad.wmnet'
01:14:36 Finished sync-check-canaries (duration: 00m 23s)
01:14:36 Started sync-proxies

Scap should check the canary servers in codfw instead, because the eqiad ones are in the dormant datacenter and are not useful as canaries.

Event Timeline

Change 461637 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] scap: use mediawiki canaries from codfw

https://gerrit.wikimedia.org/r/461637

Change 461637 merged by Alexandros Kosiaris:
[operations/puppet@production] scap: use mediawiki canaries from codfw

https://gerrit.wikimedia.org/r/461637

Mentioned in SAL (#wikimedia-operations) [2018-09-20T14:28:33Z] <hashar@deploy1001> Synchronized typos: Dummy sync to verify list of canaries for T204907 (duration: 00m 59s)

After discussion with Alexandros and Giuseppe, for now we have just updated the list of hosts in the dsh files. Now we have:

14:28:03 Finished Canaries Synced (duration: 00m 03s)
14:28:03 Executing check 'Check endpoints for mw2218.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2217.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2226.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2225.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mwdebug2002.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2216.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2215.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mwdebug2001.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2227.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2224.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2228.codfw.wmnet'
14:28:08 Finished Canary Endpoint Check Complete (duration: 00m 04s)
14:28:08 Waiting for canary traffic...
14:28:23 Executing check 'Logstash Error rate for mw2218.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2217.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2226.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2225.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mwdebug2002.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2216.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2215.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mwdebug2001.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2227.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2224.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2228.codfw.wmnet'
14:28:24 Finished sync-check-canaries (duration: 00m 24s)

A better fix still needs to be worked out so that those files are updated whenever the primary datacenter is switched. We can either keep this task for that or file another one.

greg added a subscriber: greg.

(I assume SRE will do the adding to conftool and the editing/extending of the relevant dc switchover cookbook.)

In addition to the servers Scap checks, there is also the URL it reports when it finds an issue.

16:34:42 Executing check 'Logstash Error rate for mw2224.codfw.wmnet'
16:34:42 Executing check 'Logstash Error rate for mw2228.codfw.wmnet'
16:34:42 Check 'Logstash Error rate for mw2224.codfw.wmnet' failed: ERROR: 50% OVER_THRESHOLD (Avg. Error rate: Before: 0.09, After: 2.00, Threshold: 1.00)

16:34:42 Canary error check failed for 1 canaries, less than threshold to halt deployment (2/11), see https://logstash.wikimedia.org/goto/2cc7028226a539553178454fc2f14459 for details. Continuing...
16:34:42 Finished sync-check-canaries (duration: 00m 24s)
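
The "Continuing..." line in the log above reflects a threshold decision: one failed canary was below the number needed to halt the deployment (2 of 11). A minimal sketch of that decision follows. The function name `should_halt` and the "halt when failures reach the threshold" rule are assumptions inferred from the wording of the log message, not taken from scap's source.

```python
def should_halt(failed: int, halt_threshold: int) -> bool:
    """Decide whether a deployment should stop after canary checks.

    Assumption: the deployment halts once the number of failed
    canaries reaches halt_threshold, and continues otherwise.
    """
    if failed < 0:
        raise ValueError("failed must be non-negative")
    if halt_threshold < 1:
        raise ValueError("halt_threshold must be at least 1")
    return failed >= halt_threshold


if __name__ == "__main__":
    # Values from the log above: 1 failed canary, threshold of 2.
    print("halt" if should_halt(1, 2) else "continue")
```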

It now checks codfw servers, but the URL https://logstash.wikimedia.org/goto/2cc7028226a539553178454fc2f14459 is still a shortcut for the eqiad canaries. We should probably either make this a regular dashboard that always shows both sets, without needing to vary it, or use a query parameter so that it can be set dynamically to the current set of canaries.
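
As a rough sketch of the "dynamic" option, a Lucene-style query string could be generated from whatever canary list is current. The function name `canary_query`, the `host` field name, and the use of short hostnames are all assumptions made for illustration; the actual Logstash field names and the way scap would pass such a query are not specified in this task.

```python
from typing import Iterable


def canary_query(hosts: Iterable[str], field: str = "host") -> str:
    """Build a Lucene-style query matching any of the given hosts.

    Assumption: the index stores short hostnames (the part before
    the first dot) in a field called `field`.
    """
    # Deduplicate and sort so the same host set always yields the same query.
    short_names = sorted({h.split(".", 1)[0] for h in hosts})
    if not short_names:
        raise ValueError("no canary hosts given")
    quoted = " OR ".join(f'"{name}"' for name in short_names)
    return f"{field}:({quoted})"


if __name__ == "__main__":
    print(canary_query(["mw2224.codfw.wmnet", "mw2228.codfw.wmnet"]))
```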

As for the Logstash URL, I think the best approach is to just have the entire list in a dashboard. I've updated the "scap canary" dashboard. However, for sharing purposes this generated a new URL: https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040. I'll update scap.cfg with it.

Change 463453 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] scap: Update logstash URL for mediawiki canaries

https://gerrit.wikimedia.org/r/463453

Change 463453 merged by Alexandros Kosiaris:
[operations/puppet@production] scap: Update logstash URL for mediawiki canaries

https://gerrit.wikimedia.org/r/463453

Change 463469 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Use conftool to populate mw canaries in scap

https://gerrit.wikimedia.org/r/463469

Change 463708 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] scap: Replace an ugly hack with puppet 4 syntax

https://gerrit.wikimedia.org/r/463708

Change 463709 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] WIP: scap: Move prefix from confd to key creation

https://gerrit.wikimedia.org/r/463709

Change 463708 merged by Alexandros Kosiaris:
[operations/puppet@production] scap: Replace an ugly hack with puppet 4 syntax

https://gerrit.wikimedia.org/r/463708

Change 465661 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Revert "scap: use mediawiki canaries from codfw"

https://gerrit.wikimedia.org/r/465661

Change 465661 merged by Alexandros Kosiaris:
[operations/puppet@production] Revert "scap: use mediawiki canaries from codfw"

https://gerrit.wikimedia.org/r/465661

Change 463469 abandoned by Alexandros Kosiaris:
Use conftool to populate mw canaries in scap

Reason:
https://gerrit.wikimedia.org/r/#/c/operations/puppet/ /465411/ was merged, let's abandon this

https://gerrit.wikimedia.org/r/463469

This was done, resolving.