No mw canary servers in codfw
Closed, Resolved (Public)

Description

We currently don't have any servers with the mediawiki::appserver::canary_api and mediawiki::appserver::canary roles in codfw. I'm pretty certain we had those in the past, but maybe they got dropped during a hardware refresh?

Given that there's a DC switchover coming, we should fix that.


canary appservers codfw:

mwdebug2001 (row A, ganeti VM)
mwdebug2002 (row B, ganeti VM)
mw2163 (C3, physical)
mw2164 (C3, physical)
mw2271 (D3, physical)
mw2272 (D3, physical)

canary API appservers codfw:

mw2215 (A3, physical)
mw2216 (A3, physical)
mw2244 (A4, physical)
mw2245 (A4, physical)

Event Timeline

Change 564175 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] define 2 API appservers per row in codfw as canary API appservers

https://gerrit.wikimedia.org/r/564175

Yeah, we need at least a total of 4 api and 4 app canary servers in codfw. In eqiad, our canary app (5) and api (4) servers are actually all in the same rack; we can spread them out a bit when we install the new servers.

Agreed, I think for our uses of the canaries, rack redundancy is not a must, but would still be nice to have when re-adding canaries to codfw.

Change 564175 merged by Dzahn:
[operations/puppet@production] define 2 API appservers per row in codfw as canary API appservers

https://gerrit.wikimedia.org/r/564175

The following are now declared canary API appservers in site.pp:

mw2215, mw2216 (rack A3)

mw2244, mw2245 (rack A4)

Change 570405 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: define 2 codfw appservers as canary_appservers

https://gerrit.wikimedia.org/r/570405

Change 570405 merged by Dzahn:
[operations/puppet@production] site: define 2 codfw appservers as canary_appservers

https://gerrit.wikimedia.org/r/570405

Mentioned in SAL (#wikimedia-operations) [2020-02-06T22:13:40Z] <mutante> turning mw2271 and mw2163 into canary appservers for codfw, this adds mediawiki-testers shell users and removes scap sql scripts, rest stays as is (T242606)

mw2163 and mw2271 have now been turned into canary appservers. Unlike the canary API appservers, this involves actual puppet changes, which are:

  • mediawiki-testers shell access group gets added
  • scap sql scripts get removed
  • nginx, keepalive-requests value changes from 100 to 1000

Together with existing mwdebug2001 and mwdebug2002 this makes it 4 as well.

Is this resolved, or would you really like them reimaged as mwdebug2003 and mwdebug2004?

@jijiki What do you think? Is this good now? 4 of each type and in different rows/racks.

Given we have 5 canary appservers in eqiad + 2 debug servers, I would recommend we add another 2 in codfw

@jijiki Don't we have mwdebug2001 and mwdebug2002 in codfw too?

@Urbanecm they do not get user traffic, so they are good enough for testing, but not good enough for canary deploys. When we switch to codfw, we will need them.

Is that different from what eqiad debug servers do? I'm trying to understand why you said "Given we have 5 canary appservers in eqiad + 2 debug servers" (emphasis mine).

@Urbanecm yes, so that is a total of 7 canary app servers in eqiad, of which 5 get real user traffic. Since we will be switching to codfw, it makes sense to have a similar setup in codfw.

Change 571366 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: define 2 more canary appservers in codfw

https://gerrit.wikimedia.org/r/571366

Change 571366 merged by Dzahn:
[operations/puppet@production] site: define 2 more canary appservers in codfw

https://gerrit.wikimedia.org/r/571366

Dzahn updated the task description.

@jijiki @Urbanecm

I added 2 more canary appservers. Now we have the following canary appservers in codfw:

mwdebug2001 (row A, ganeti VM)
mwdebug2002 (row B, ganeti VM)
mw2163 (C3, physical)
mw2164 (C3, physical)
mw2271 (D3, physical)
mw2272 (D3, physical)

canary API appservers codfw:

mw2215 (A3, physical)
mw2216 (A3, physical)
mw2244 (A4, physical)
mw2245 (A4, physical)
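
Since rack and row redundancy came up earlier in this task, here is a minimal sketch of how the spread of the hosts listed above could be checked. The hostnames and locations are copied from this task; the dictionary layout and the `rows_by_role` helper are purely illustrative assumptions and are not part of puppet or site.pp.

```python
from collections import defaultdict

# Hosts and locations as listed in this task.
# The data layout is hypothetical, for illustration only.
CODFW_CANARIES = {
    "appserver": {
        "mwdebug2001": "A",   # ganeti VM, only the row is given
        "mwdebug2002": "B",   # ganeti VM, only the row is given
        "mw2163": "C3",
        "mw2164": "C3",
        "mw2271": "D3",
        "mw2272": "D3",
    },
    "api_appserver": {
        "mw2215": "A3",
        "mw2216": "A3",
        "mw2244": "A4",
        "mw2245": "A4",
    },
}


def rows_by_role(inventory):
    """Map each role to {row letter: sorted hostnames in that row}."""
    result = {}
    for role, hosts in inventory.items():
        rows = defaultdict(list)
        for host, location in hosts.items():
            # The first character of the location is the row letter.
            rows[location[0]].append(host)
        result[role] = {row: sorted(names) for row, names in sorted(rows.items())}
    return result


spread = rows_by_role(CODFW_CANARIES)
for role, rows in spread.items():
    print(role, {row: len(names) for row, names in rows.items()})
```

With the list above, the canary appservers cover rows A to D, while the four canary API appservers sit in two racks of row A.
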

jijiki claimed this task.

thank you daniel!

Change 574902 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] scap: add codfw canary appservers to dsh group

https://gerrit.wikimedia.org/r/574902

Change 574902 merged by Dzahn:
[operations/puppet@production] scap: add codfw canary appservers to dsh group

https://gerrit.wikimedia.org/r/574902

Dzahn reopened this task as Open. (Edited May 22 2020, 11:35 AM)
Dzahn claimed this task.

Reopening because I am decom'ing servers in T247018, and that included some canaries.

So we need to assign new ones, for both jobrunners and appservers.

Change 598710 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: define mw2187,mw2188 as new canary appservers

https://gerrit.wikimedia.org/r/598710

Change 598710 merged by Dzahn:
[operations/puppet@production] site: define mw2187,mw2188 as new canary appservers

https://gerrit.wikimedia.org/r/598710

Change 598729 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: define mw2249,mw2250 as jobrunner canaries in codfw

https://gerrit.wikimedia.org/r/598729

Change 598729 merged by Dzahn:
[operations/puppet@production] site: define mw2249,mw2250 as jobrunner canaries in codfw

https://gerrit.wikimedia.org/r/598729

mw2187, mw2188 are new canary appservers, replacing mw2271, mw2272

mw2249, mw2250 are new jobrunner canaries that we did not have in codfw.

Now we have 13 canaries in eqiad and 12 canaries in codfw.

5 x appserver, 4 x api, 2 x jobrunner, 2 x parsoid (eqiad)

vs

4 x appserver, 4 x api, 2 x jobrunner, 2 x parsoid (codfw)
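
The totals above can be double-checked with a small tally. This is only a sketch: the per-role counts are taken from this comment, while the `CANARIES` dictionary and the variable names are made up for illustration.

```python
# Per-role canary counts as stated in this task (layout is illustrative).
CANARIES = {
    "eqiad": {"appserver": 5, "api": 4, "jobrunner": 2, "parsoid": 2},
    "codfw": {"appserver": 4, "api": 4, "jobrunner": 2, "parsoid": 2},
}

# Total number of canaries per datacenter.
totals = {dc: sum(roles.values()) for dc, roles in CANARIES.items()}

# Roles where eqiad and codfw differ, with the difference (eqiad minus codfw).
differences = {
    role: CANARIES["eqiad"][role] - CANARIES["codfw"][role]
    for role in CANARIES["eqiad"]
    if CANARIES["eqiad"][role] != CANARIES["codfw"][role]
}

print(totals)
print(differences)
```

The only remaining difference between the two sites is one canary appserver.
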