
Audit "misc" cluster hosts
Open, Normal, Public

Description

While researching something else I ran into the current host list for the misc cluster, which contains hosts that should live in their own cluster (new or existing).

AFAIK, to add a host to an existing cluster, setting $cluster in puppet is enough.

To add a brand new cluster, AFAICS it is necessary to add it to hieradata/common/monitoring.yaml and hieradata/common.yaml, and possibly other places?

The paste below has been obtained with:
cumin --force -o json 'P:cumin::target%cluster = misc' 'hostname -f' | awk 'x==1 { print $0 } /_____FORMATTED_OUTPUT_____/ { x=1 }'

{
  "actinium.wikimedia.org": "actinium.wikimedia.org",
  "alcyone.wikimedia.org": "alcyone.wikimedia.org",
  "alsafi.wikimedia.org": "alsafi.wikimedia.org",
  "aluminium.wikimedia.org": "aluminium.wikimedia.org",
  "archiva1001.wikimedia.org": "archiva1001.wikimedia.org",
  "auth1001.eqiad.wmnet": "auth1001.eqiad.wmnet",
  "auth1002.eqiad.wmnet": "auth1002.eqiad.wmnet",
  "auth2001.codfw.wmnet": "auth2001.codfw.wmnet",
  "authdns1001.wikimedia.org": "authdns1001.wikimedia.org",
  "authdns2001.wikimedia.org": "authdns2001.wikimedia.org",
  "backup2001.codfw.wmnet": "backup2001.codfw.wmnet",
  "boron.eqiad.wmnet": "boron.eqiad.wmnet",
  "bromine.eqiad.wmnet": "bromine.eqiad.wmnet",
  "cloudstore1008.wikimedia.org": "cloudstore1008.wikimedia.org",
  "cloudstore1009.wikimedia.org": "cloudstore1009.wikimedia.org",
  "cobalt.wikimedia.org": "cobalt.wikimedia.org",
  "cp1099.eqiad.wmnet": "cp1099.eqiad.wmnet",
  "darmstadtium.eqiad.wmnet": "darmstadtium.eqiad.wmnet",
  "dbmonitor1001.wikimedia.org": "dbmonitor1001.wikimedia.org",
  "dbmonitor2001.wikimedia.org": "dbmonitor2001.wikimedia.org",
  "debmonitor1001.eqiad.wmnet": "debmonitor1001.eqiad.wmnet",
  "debmonitor2001.codfw.wmnet": "debmonitor2001.codfw.wmnet",
  "deploy1001.eqiad.wmnet": "deploy1001.eqiad.wmnet",
  "deploy2001.codfw.wmnet": "deploy2001.codfw.wmnet",
  "dubnium.wikimedia.org": "dubnium.wikimedia.org",
  "eeden.wikimedia.org": "eeden.wikimedia.org",
  "etherpad1001.eqiad.wmnet": "etherpad1001.eqiad.wmnet",
  "eventlog1002.eqiad.wmnet": "eventlog1002.eqiad.wmnet",
  "fermium.wikimedia.org": "fermium.wikimedia.org",
  "flerovium.eqiad.wmnet": "flerovium.eqiad.wmnet",
  "furud.codfw.wmnet": "furud.codfw.wmnet",
  "gerrit2001.wikimedia.org": "gerrit2001.wikimedia.org",
  "grafana1001.eqiad.wmnet": "grafana1001.eqiad.wmnet",
  "hassaleh.codfw.wmnet": "hassaleh.codfw.wmnet",
  "hassium.eqiad.wmnet": "hassium.eqiad.wmnet",
  "helium.eqiad.wmnet": "helium.eqiad.wmnet",
  "heze.codfw.wmnet": "heze.codfw.wmnet",
  "install1002.wikimedia.org": "install1002.wikimedia.org",
  "install2002.wikimedia.org": "install2002.wikimedia.org",
  "iron.wikimedia.org": "iron.wikimedia.org",
  "kafkamon1001.eqiad.wmnet": "kafkamon1001.eqiad.wmnet",
  "kafkamon2001.codfw.wmnet": "kafkamon2001.codfw.wmnet",
  "kraz.wikimedia.org": "kraz.wikimedia.org",
  "krypton.eqiad.wmnet": "krypton.eqiad.wmnet",
  "labpuppetmaster1002.wikimedia.org": "labpuppetmaster1002.wikimedia.org",
  "labweb1001.wikimedia.org": "labweb1001.wikimedia.org",
  "labweb1002.wikimedia.org": "labweb1002.wikimedia.org",
  "lvs2009.codfw.wmnet": "lvs2009.codfw.wmnet",
  "matomo1001.eqiad.wmnet": "matomo1001.eqiad.wmnet",
  "mendelevium.eqiad.wmnet": "mendelevium.eqiad.wmnet",
  "multatuli.wikimedia.org": "multatuli.wikimedia.org",
  "mwlog1001.eqiad.wmnet": "mwlog1001.eqiad.wmnet",
  "mwlog2001.codfw.wmnet": "mwlog2001.codfw.wmnet",
  "mwmaint1002.eqiad.wmnet": "mwmaint1002.eqiad.wmnet",
  "mwmaint2001.codfw.wmnet": "mwmaint2001.codfw.wmnet",
  "mx1001.wikimedia.org": "mx1001.wikimedia.org",
  "mx2001.wikimedia.org": "mx2001.wikimedia.org",
  "netmon1002.wikimedia.org": "netmon1002.wikimedia.org",
  "netmon1003.wikimedia.org": "netmon1003.wikimedia.org",
  "netmon2001.wikimedia.org": "netmon2001.wikimedia.org",
  "notebook1003.eqiad.wmnet": "notebook1003.eqiad.wmnet",
  "notebook1004.eqiad.wmnet": "notebook1004.eqiad.wmnet",
  "orespoolcounter1001.eqiad.wmnet": "orespoolcounter1001.eqiad.wmnet",
  "orespoolcounter1002.eqiad.wmnet": "orespoolcounter1002.eqiad.wmnet",
  "orespoolcounter2001.codfw.wmnet": "orespoolcounter2001.codfw.wmnet",
  "orespoolcounter2002.codfw.wmnet": "orespoolcounter2002.codfw.wmnet",
  "oresrdb1001.eqiad.wmnet": "oresrdb1001.eqiad.wmnet",
  "oresrdb1002.eqiad.wmnet": "oresrdb1002.eqiad.wmnet",
  "oresrdb2001.codfw.wmnet": "oresrdb2001.codfw.wmnet",
  "oresrdb2002.codfw.wmnet": "oresrdb2002.codfw.wmnet",
  "people1001.eqiad.wmnet": "people1001.eqiad.wmnet",
  "phab1001.eqiad.wmnet": "phab1001.eqiad.wmnet",
  "phab1002.eqiad.wmnet": "phab1002.eqiad.wmnet",
  "phab2001.codfw.wmnet": "phab2001.codfw.wmnet",
  "ping1001.eqiad.wmnet": "ping1001.eqiad.wmnet",
  "ping2001.codfw.wmnet": "ping2001.codfw.wmnet",
  "planet1001.eqiad.wmnet": "planet1001.eqiad.wmnet",
  "planet2001.codfw.wmnet": "planet2001.codfw.wmnet",
  "pollux.wikimedia.org": "pollux.wikimedia.org",
  "pybal-test2001.codfw.wmnet": "pybal-test2001.codfw.wmnet",
  "pybal-test2002.codfw.wmnet": "pybal-test2002.codfw.wmnet",
  "pybal-test2003.codfw.wmnet": "pybal-test2003.codfw.wmnet",
  "releases1001.eqiad.wmnet": "releases1001.eqiad.wmnet",
  "releases2001.codfw.wmnet": "releases2001.codfw.wmnet",
  "rhenium.wikimedia.org": "rhenium.wikimedia.org",
  "roentgenium.eqiad.wmnet": "roentgenium.eqiad.wmnet",
  "ruthenium.eqiad.wmnet": "ruthenium.eqiad.wmnet",
  "scandium.eqiad.wmnet": "scandium.eqiad.wmnet",
  "seaborgium.wikimedia.org": "seaborgium.wikimedia.org",
  "serpens.wikimedia.org": "serpens.wikimedia.org",
  "sessionstore2001.codfw.wmnet": "sessionstore2001.codfw.wmnet",
  "sessionstore2002.codfw.wmnet": "sessionstore2002.codfw.wmnet",
  "sessionstore2003.codfw.wmnet": "sessionstore2003.codfw.wmnet",
  "sodium.wikimedia.org": "sodium.wikimedia.org",
  "sulfur.wikimedia.org": "sulfur.wikimedia.org",
  "torrelay1001.wikimedia.org": "torrelay1001.wikimedia.org",
  "tungsten.eqiad.wmnet": "tungsten.eqiad.wmnet",
  "tureis.codfw.wmnet": "tureis.codfw.wmnet",
  "ununpentium.wikimedia.org": "ununpentium.wikimedia.org",
  "vega.codfw.wmnet": "vega.codfw.wmnet",
  "weblog1001.eqiad.wmnet": "weblog1001.eqiad.wmnet"
}
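For anyone who would rather post-process the output in a script than with awk, here is a minimal Python sketch of the same filtering step: it keeps only what follows the `_____FORMATTED_OUTPUT_____` marker line and parses it as JSON. The function name `extract_formatted_output` and the `sample` string (including its "progress noise" first line) are made up for illustration; only the marker and the host name come from the command and paste above.

```python
import json

# Marker line that separates cumin's progress output from the JSON result,
# as used in the awk filter above.
MARKER = "_____FORMATTED_OUTPUT_____"


def extract_formatted_output(raw: str) -> dict:
    """Parse the JSON document that follows the first marker line."""
    lines = raw.splitlines()
    for index, line in enumerate(lines):
        if MARKER in line:
            # Everything after the marker line is treated as JSON.
            return json.loads("\n".join(lines[index + 1:]))
    raise ValueError("marker line not found in output")


# Hypothetical captured output, shortened to a single host.
sample = (
    "progress noise\n"
    "_____FORMATTED_OUTPUT_____\n"
    '{"actinium.wikimedia.org": "actinium.wikimedia.org"}\n'
)
hosts = sorted(extract_formatted_output(sample))
```

The sorted keys of the returned dictionary give the plain host list, which is handy for diffing against a later run.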

TODO

  • Misplaced hosts moved to an existing (or new) cluster
  • Decide if missing $cluster in production should result in a puppet failure instead of defaulting to misc

Event Timeline

Restricted Application added a subscriber: Aklapper. · Nov 27 2018, 10:26 AM

In the Foundations meeting, we considered removing "misc" as the default cluster and have puppet fail if there is no cluster set.

Change 476393 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: add cluster definition to recursor role

https://gerrit.wikimedia.org/r/476393

Change 476396 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: add cluster definition to spare role

https://gerrit.wikimedia.org/r/476396

I think it will break things and otherwise be frustrating to fail on no cluster definition unless we could somehow limit it to production only. It could easily be renamed though: https://github.com/wikimedia/puppet/blob/production/manifests/site.pp#L6
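To make the "production only" idea concrete, here is a rough sketch of the decision logic in Python (not Puppet, and not the actual site.pp code). The function name `resolve_cluster`, the `realm` argument and the `"labs"` value are assumptions made for illustration; only the `misc` default and the fail-in-production proposal come from this task.

```python
# Fallback cluster name used today when a host sets none.
DEFAULT_CLUSTER = "misc"


def resolve_cluster(configured, realm):
    """Return the cluster for a host, failing only in production.

    `configured` is the host's explicit cluster (or None/empty if unset);
    `realm` is a hypothetical label for the environment.
    """
    if configured:
        return configured
    if realm == "production":
        # Proposed behaviour: refuse to fall back silently in production.
        raise ValueError("no cluster set for a production host")
    # Everywhere else keep the current default.
    return DEFAULT_CLUSTER


cluster_with_value = resolve_cluster("syslog", "production")
cluster_outside_prod = resolve_cluster(None, "labs")
```

Under this sketch, hosts outside production keep working without a cluster definition, while production hosts without one fail at compile time.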

jcrespo added subscribers: Marostegui, Banyek, jcrespo.

Adding DBA for the few db hosts that shouldn't be there, remove the tag when those are fixed:

  • New pc* hosts
  • New dbstore* hosts
  • dbmonitor (unsure of that one, that is most likely misc, as it is an apache)

I can fix regex.yaml to add the new parsercache hosts there, but the dbstore hosts appearing on that list do not exist: dbstore1003 and dbstore1005

Ah right, dbstore1003 and dbstore1005 are the new hosts that will replace dbstore1002 (T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5])

Change 476807 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] regex.yaml: Add new pc and new dbstores

https://gerrit.wikimedia.org/r/476807

Joe added a subscriber: Joe. · Nov 30 2018, 6:52 AM

I would suggest you re-do your query excluding servers with role::spare::system, such as most of the cp* boxes and the conf* ones. Actually, it might make sense to put them in a separate cluster.

Change 476807 merged by Marostegui:
[operations/puppet@production] regex.yaml: Add new pc and new dbstores

https://gerrit.wikimedia.org/r/476807

Marostegui edited projects, added User-Marostegui; removed DBA. · Nov 30 2018, 8:17 AM
Marostegui added subscribers: Volans, elukey.

> Adding DBA for the few db hosts that shouldn't be there, remove the tag when those are fixed:
>
>   • New pc* hosts
>   • New dbstore* hosts
>   • dbmonitor (unsure of that one, that is most likely misc, as it is an apache)

What I have done:

I will leave dbmonitor ones for @Volans to decide!
@elukey @Banyek we will need to remove dbstore1002 from there once it is decommissioned next year

> I will leave dbmonitor ones for @Volans to decide!

Why me? It's not deBmonitor 😜
I guess misc is ok, it seems pointless to me to create a group for each small service that has 1 host per DC. TBD if we want a group for "tooling" services.

Change 476393 merged by Cwhite:
[operations/puppet@production] hiera: add cluster definition to recursor role

https://gerrit.wikimedia.org/r/476393

jijiki triaged this task as Normal priority. · Dec 3 2018, 9:37 AM
jijiki added a subscriber: jijiki.

@colewhite @fgiunchedi should we add a checklist of actions that need to be done in order to consider this task "Resolved"?

fgiunchedi updated the task description. · Dec 3 2018, 12:08 PM

> @colewhite @fgiunchedi should we add a checklist of actions that need to be done in order to consider this task "Resolved"?

Sounds good! Just made it so

Change 476396 merged by Cwhite:
[operations/puppet@production] hiera: add cluster definition to spare role

https://gerrit.wikimedia.org/r/476396

Change 477366 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] Add prometheus cluster definition

https://gerrit.wikimedia.org/r/477366

Change 477367 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] Add graphite cluster definition

https://gerrit.wikimedia.org/r/477367

Change 477366 merged by Cwhite:
[operations/puppet@production] Add prometheus cluster definition

https://gerrit.wikimedia.org/r/477366

Change 477367 merged by Cwhite:
[operations/puppet@production] Add graphite cluster definition

https://gerrit.wikimedia.org/r/477367

Change 478372 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] add bastion cluster definition

https://gerrit.wikimedia.org/r/478372

Change 478372 merged by Cwhite:
[operations/puppet@production] add bastion cluster definition

https://gerrit.wikimedia.org/r/478372

Change 478774 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: add trafficserver cluster definition

https://gerrit.wikimedia.org/r/478774

Change 478774 merged by Cwhite:
[operations/puppet@production] hiera: add cache_ats cluster definition

https://gerrit.wikimedia.org/r/478774

Change 479772 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: add puppetboard and puppetdb to puppet cluster

https://gerrit.wikimedia.org/r/479772

Change 479843 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: add alerting_host cluster definition

https://gerrit.wikimedia.org/r/479843

Change 479845 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: add ci cluster definition

https://gerrit.wikimedia.org/r/479845

Change 479843 merged by Cwhite:
[operations/puppet@production] hiera: add alerting cluster definition

https://gerrit.wikimedia.org/r/479843

Change 479845 merged by Cwhite:
[operations/puppet@production] hiera: add ci cluster definition

https://gerrit.wikimedia.org/r/479845

Change 480664 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: add management cluster definition

https://gerrit.wikimedia.org/r/480664

Change 480666 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: add debmonitor cluster definition

https://gerrit.wikimedia.org/r/480666

I went through the list of current misc hosts and looked for "obvious" candidates to be in their own cluster. This was driven by (either of) two factors: would it provide value to have said hosts' resources displayed/grouped? And does this group of hosts perform a function critical enough to affect user-facing requests?

This is the host -> cluster map that I came up with:

"certcentral1001.eqiad.wmnet": "certcentral",
"certcentral2001.codfw.wmnet": "certcentral",
"cloudcontrol1003.wikimedia.org": "wmcs",
"cloudcontrol1004.wikimedia.org": "wmcs",
"cloudnet1003.eqiad.wmnet": "wmcs",
"cloudnet1004.eqiad.wmnet": "wmcs",
"cloudservices1003.wikimedia.org": "wmcs",
"cloudservices1004.wikimedia.org": "wmcs",
"cloudvirt1014.eqiad.wmnet": "wmcs",
"cloudvirt1015.eqiad.wmnet": "wmcs",
"cloudvirt1016.eqiad.wmnet": "wmcs",
"cloudvirt1017.eqiad.wmnet": "wmcs",
"cloudvirt1018.eqiad.wmnet": "wmcs",
"cloudvirt1019.eqiad.wmnet": "wmcs",
"cloudvirt1020.eqiad.wmnet": "wmcs",
"cloudvirt1021.eqiad.wmnet": "wmcs",
"cloudvirt1022.eqiad.wmnet": "wmcs",
"cloudvirt1023.eqiad.wmnet": "wmcs",
"cloudvirt1024.eqiad.wmnet": "wmcs",
"cloudvirtan1001.eqiad.wmnet": "wmcs",
"cloudvirtan1002.eqiad.wmnet": "wmcs",
"cloudvirtan1003.eqiad.wmnet": "wmcs",
"cloudvirtan1004.eqiad.wmnet": "wmcs",
"cloudvirtan1005.eqiad.wmnet": "wmcs",
"labpuppetmaster1001.wikimedia.org": "wmcs",
"labpuppetmaster1002.wikimedia.org": "wmcs",
"labstore1006.wikimedia.org": "wmcs",
"labstore1007.wikimedia.org": "wmcs",
"labweb1001.wikimedia.org": "wmcs",
"labweb1002.wikimedia.org": "wmcs",
"graphite1004.eqiad.wmnet": "graphite",
"graphite2003.codfw.wmnet": "graphite",
"lithium.eqiad.wmnet": "syslog",
"centrallog1001.eqiad.wmnet": "syslog",
"wezen.codfw.wmnet": "syslog",
"poolcounter1001.eqiad.wmnet": "poolcounter",
"poolcounter1003.eqiad.wmnet": "poolcounter",
"poolcounter2001.codfw.wmnet": "poolcounter",
"poolcounter2002.codfw.wmnet": "poolcounter",
"sessionstore2001.codfw.wmnet": "sessionstore",
"sessionstore2002.codfw.wmnet": "sessionstore",
"sessionstore2003.codfw.wmnet": "sessionstore",
"snapshot1005.eqiad.wmnet": "snapshot",
"snapshot1006.eqiad.wmnet": "snapshot",
"snapshot1007.eqiad.wmnet": "snapshot",
"snapshot1008.eqiad.wmnet": "snapshot",
"snapshot1009.eqiad.wmnet": "snapshot",
"webperf1001.eqiad.wmnet": "webperf",
"webperf1002.eqiad.wmnet": "webperf",
"webperf2001.codfw.wmnet": "webperf",
"webperf2002.codfw.wmnet": "webperf",

As for the remaining misc hosts, we can iterate again on the list and/or let the hosts' "service owners" decide.
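To review a map like the one above cluster by cluster rather than host by host, it can be inverted. The following is a small Python sketch of that grouping; the helper name `hosts_by_cluster` is made up for illustration, and `sample` is just the graphite and syslog entries copied from the map.

```python
from collections import defaultdict


def hosts_by_cluster(host_to_cluster):
    """Invert a host -> cluster map into cluster -> sorted list of hosts."""
    grouped = defaultdict(list)
    for host, cluster in host_to_cluster.items():
        grouped[cluster].append(host)
    # Sort clusters and hosts so the result is stable and easy to diff.
    return {cluster: sorted(hosts) for cluster, hosts in sorted(grouped.items())}


# Subset of the proposed map above.
sample = {
    "graphite1004.eqiad.wmnet": "graphite",
    "graphite2003.codfw.wmnet": "graphite",
    "lithium.eqiad.wmnet": "syslog",
    "centrallog1001.eqiad.wmnet": "syslog",
    "wezen.codfw.wmnet": "syslog",
}
grouped = hosts_by_cluster(sample)
```

Each resulting key then corresponds roughly to one "add cluster definition" patch, which may make it easier to track which clusters are still pending.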

fgiunchedi moved this task from Backlog to Radar on the User-fgiunchedi board. · Dec 20 2018, 10:20 AM

Change 479772 merged by Cwhite:
[operations/puppet@production] hiera: add puppetboard and puppetdb to puppet cluster

https://gerrit.wikimedia.org/r/479772

Change 480664 merged by Cwhite:
[operations/puppet@production] hiera: add management cluster definition

https://gerrit.wikimedia.org/r/480664

Change 482108 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: add certcentral cluster definition

https://gerrit.wikimedia.org/r/482108

Change 480666 abandoned by Cwhite:
hiera: add debmonitor cluster definition

Reason:
per conversation

https://gerrit.wikimedia.org/r/480666

Change 482149 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: add wmcs cluster definition

https://gerrit.wikimedia.org/r/482149

Change 482108 merged by Cwhite:
[operations/puppet@production] hiera: add certcentral cluster definition

https://gerrit.wikimedia.org/r/482108

Change 482149 merged by Cwhite:
[operations/puppet@production] hiera: add wmcs cluster definition

https://gerrit.wikimedia.org/r/482149

Change 482884 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: add cluster definition for graphite

https://gerrit.wikimedia.org/r/482884

Change 482894 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: add cluster definition to webperf servers

https://gerrit.wikimedia.org/r/482894

Change 483009 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: add cluster definition to poolcounter servers

https://gerrit.wikimedia.org/r/483009

Change 482884 merged by Cwhite:
[operations/puppet@production] hiera: add cluster definition for graphite

https://gerrit.wikimedia.org/r/482884

Change 482894 merged by Cwhite:
[operations/puppet@production] hiera: add cluster definition to webperf servers

https://gerrit.wikimedia.org/r/482894

Change 483602 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: add cluster definition to snapshot servers

https://gerrit.wikimedia.org/r/483602

Change 483612 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: add cluster definition to syslog servers

https://gerrit.wikimedia.org/r/483612

Change 483009 merged by Cwhite:
[operations/puppet@production] hiera: add cluster definition to poolcounter servers

https://gerrit.wikimedia.org/r/483009

Change 483602 merged by Cwhite:
[operations/puppet@production] hiera: add cluster definition to dumps servers

https://gerrit.wikimedia.org/r/483602

Change 483612 merged by Cwhite:
[operations/puppet@production] hiera: add cluster definition to syslog servers

https://gerrit.wikimedia.org/r/483612

fgiunchedi updated the task description. · Jan 15 2019, 1:53 PM
colewhite removed colewhite as the assignee of this task. · Jan 24 2019, 10:37 PM