Check the size of every cluster in codfw to see if it matches eqiad's capacity
Closed, Resolved (Public)

Description

We should check every cluster in codfw to ensure its number of cores/RAM/disk capacity matches what we have in eqiad.

I suspect we could create a Grafana dashboard containing this data so that we don't have to check manually. The data can probably be gathered from Prometheus.

Event Timeline

Joe created this task. Jan 23 2017, 4:51 PM

Yes, in fact we can already answer these questions with Prometheus. I've drafted a dashboard showing CPU/host/memory differences per-cluster in https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-audit

Clusters where eqiad has more hosts than codfw (the number is the difference in host count). Note that the list needs more auditing due to various factors, e.g. decommissioned hosts, hosts in misc serving a plethora of functions, etc.

{cluster="mysql"}	46
{cluster="misc"}	40
{cluster="lvs"}	6
{cluster="parsoid"}	4
{cluster="memcached"}	3
{cluster="labsnfs"}	2
{cluster="imagescaler"}	2
{cluster="videoscaler"}	2
{cluster="redis"}	2
{cluster="api_appserver"}	1
{cluster="puppet"}	1
{cluster="cache_upload"}	1
{cluster="jobrunner"}	1

Same number of hosts

{cluster="restbase_test"}	0
{cluster="cache_text"}	0
{cluster="restbase"}	0
{cluster="eventbus"}	0
{cluster="cache_maps"}	0
{cluster="sca"}	0
{cluster="scb"}	0
{cluster="cache_misc"}	0

Of the clusters with the same number of hosts, scb has 68GB more RAM in eqiad than in codfw, whereas the remaining clusters have the same amount of RAM in both sites.

Re: misc, I took a quick look at both lists of hosts. Excluding a few miscategorized hosts (perhaps provisioned but with no puppet roles applied: mc / mw / aqs), the rest is either misc hostname systems, eqiad-only things (druid, thumbor, oresdb, dataset, netmon, snapshot, notebook, etc.), or codfw-only things (labtest).

With T153488 two job runners in eqiad were reimaged as video scalers. We should probably do the same in codfw in case a similar upload surge occurs during the dc switchover window?

elukey added a comment. Edited Feb 8 2017, 11:21 AM

+1 (if we have spare capacity to repurpose)

elukey added a comment. Edited Feb 13 2017, 10:21 AM

Difference between eqiad and codfw counts from Prometheus dashboard:

Cluster          Hosts   CPUs    RAM
appservers       -23     -1000   -2TB
api_appservers   +1      -104    +86GB
imagescalers     +2      +80     +135GB
jobrunner        +1      -16     +67GB
videoscalers     +2      +16     +34GB

Hardware list:

  • appservers eqiad:

nproc => 53 * '32' + 15 * '40' + 2 * '1' (mwdebug)
RAM => 2 * 3.9GB (mwdebug) + 68 * 62GB

  • appservers codfw:

nproc => 1 * '24' + 29 * '32' + 62 * '40'
RAM => 1 * 10GB + 92 * 62GB
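As a cross-check, the hardware list can be summed per site. This is only a sketch: the group counts are copied from the list above, and `total` and `hosts` are illustrative helper names. It does not have to match the dashboard table exactly, since that table comes from live Prometheus data and its figures may be rounded.

```python
# Host groups as (host count, value per host), copied from the hardware list.
EQIAD_CORES = [(53, 32), (15, 40), (2, 1)]   # the two 1-core hosts are mwdebug
CODFW_CORES = [(1, 24), (29, 32), (62, 40)]
EQIAD_RAM_GB = [(2, 3.9), (68, 62)]          # the two 3.9GB hosts are mwdebug
CODFW_RAM_GB = [(1, 10), (92, 62)]


def total(groups):
    """Sum count * value over all host groups."""
    return sum(count * value for count, value in groups)


def hosts(groups):
    """Number of hosts described by the groups."""
    return sum(count for count, _ in groups)


# Differences are expressed as eqiad minus codfw.
host_diff = hosts(EQIAD_CORES) - hosts(CODFW_CORES)
core_diff = total(EQIAD_CORES) - total(CODFW_CORES)
ram_diff_gb = total(EQIAD_RAM_GB) - total(CODFW_RAM_GB)

print(f"hosts: {host_diff:+d}")
print(f"cores: {core_diff:+d}")
print(f"RAM:   {ram_diff_gb:+.1f} GB")
```

Note that the codfw RAM line describes 93 hosts while the nproc line describes 92, so the host difference depends on which line is used.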

If the above counts are consistent, I'd like to:

  1. reimage 3 appservers (40 cores) as api_appservers
  2. reimage 2 appservers (40 cores) as imagescalers
  3. reimage 1 appserver (40 cores) as jobrunner
  4. reimage 2 appservers (32 cores) as videoscalers

Seems sane to me to balance things a bit more in codfw

Joe added a comment. Feb 14 2017, 11:22 AM

+1

Joe added a comment. Feb 14 2017, 11:22 AM

Also note that while for videoscalers and jobrunners it is advisable to reimage, in the other cases a simple change of role in puppet is ok.

Change 337563 had a related patch set uploaded (by Elukey):
Change role to mw222[123] (appservers -> api_appservers)

https://gerrit.wikimedia.org/r/337563

Change 337563 merged by Elukey:
Change role to mw222[123] (appservers -> api_appservers)

https://gerrit.wikimedia.org/r/337563

Change 337567 had a related patch set uploaded (by Elukey):
Move mw222[123] from appservers to api_appservers (conftool)

https://gerrit.wikimedia.org/r/337567

Change 337567 merged by Elukey:
Move mw222[123] from appservers to api_appservers (conftool)

https://gerrit.wikimedia.org/r/337567

Change 337584 had a related patch set uploaded (by Elukey):
Move mw224[45] from appservers to imagescalers

https://gerrit.wikimedia.org/r/337584

Change 337584 merged by Elukey:
Move mw224[45] from appservers to imagescalers

https://gerrit.wikimedia.org/r/337584

I did a quick check to see how the mw2* hosts are spread across rows:

elukey@neodymium:~$ sudo -i salt -C 'G@cluster:imagescaler and G@site:codfw' cmd.run 'lldpcli show neighbors | grep SysName' --output=raw | egrep -o 'asw-\w-codfw' | sort | uniq -c
      2 asw-a-codfw
      4 asw-c-codfw
elukey@neodymium:~$ sudo -i salt -C 'G@cluster:jobrunner and G@site:codfw' cmd.run 'lldpcli show neighbors | grep SysName' --output=raw | egrep -o 'asw-\w-codfw' | sort | uniq -c
      4 asw-a-codfw
     10 asw-c-codfw
elukey@neodymium:~$ sudo -i salt -C 'G@cluster:appserver and G@site:codfw' cmd.run 'lldpcli show neighbors | grep SysName' --output=raw | egrep -o 'asw-\w-codfw' | sort | uniq -c
     21 asw-a-codfw
     29 asw-b-codfw
     37 asw-c-codfw
elukey@neodymium:~$ sudo -i salt -C 'G@cluster:api_appserver and G@site:codfw' cmd.run 'lldpcli show neighbors | grep SysName' --output=raw | egrep -o 'asw-\w-codfw' | sort | uniq -c
      9 asw-a-codfw
     28 asw-b-codfw
     15 asw-c-codfw
elukey@neodymium:~$ sudo -i salt -C 'G@cluster:videoscaler and G@site:codfw' cmd.run 'lldpcli show neighbors | grep SysName' --output=raw | egrep -o 'asw-\w-codfw' | sort | uniq -c
      1 asw-a-codfw
      1 asw-c-codfw
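For anyone who wants to post-process saved command output offline, the `egrep -o ... | sort | uniq -c` part of the pipeline above can be approximated in Python. This is a minimal sketch: `count_rows` is a hypothetical helper, and the sample text is made up to resemble grepped `SysName` lines rather than taken from a real host.

```python
import re
from collections import Counter


def count_rows(text, site):
    """Count access-switch names such as 'asw-a-codfw' found in text."""
    pattern = re.compile(r"asw-\w-" + re.escape(site))
    return Counter(pattern.findall(text))


# Hypothetical sample shaped like grepped 'SysName' lines; not real output.
sample = """
SysName:      asw-a-codfw
SysName:      asw-c-codfw
SysName:      asw-c-codfw
"""

# Print in the same "count name" layout that 'sort | uniq -c' produces.
for switch, n in sorted(count_rows(sample, "codfw").items()):
    print(f"{n:7d} {switch}")
```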

Change 338108 had a related patch set uploaded (by Elukey):
Move codfw appserver conftool-data to codfw.yaml

https://gerrit.wikimedia.org/r/338108

Change 338108 merged by Elukey:
Move codfw appserver conftool-data to codfw.yaml

https://gerrit.wikimedia.org/r/338108

Change 338962 had a related patch set uploaded (by Elukey):
Move three codfw MW appservers to jobrunner/videoscalers

https://gerrit.wikimedia.org/r/338962

Change 338962 merged by Elukey:
Move three codfw MW appservers to jobrunner/videoscalers

https://gerrit.wikimedia.org/r/338962

Change 339166 had a related patch set uploaded (by Elukey):
Change partman recipe for new MW codfw videoscalers

https://gerrit.wikimedia.org/r/339166

Change 339167 had a related patch set uploaded (by Elukey):
Replace mw2119 with mw2117 as scap proxy

https://gerrit.wikimedia.org/r/339167

Change 339167 merged by Elukey:
Replace mw2119 with mw2117 as scap proxy

https://gerrit.wikimedia.org/r/339167

Change 339166 merged by Elukey:
Change partman recipe for new MW codfw videoscalers

https://gerrit.wikimedia.org/r/339166

First step of the MW rebalancing done. This is how things look in codfw vs eqiad from the row-distribution point of view:

CODFW

elukey@neodymium:~$ sudo -i salt -C 'G@cluster:imagescaler and G@site:codfw' cmd.run 'lldpcli show neighbors | grep SysName' --output=raw | egrep -o 'asw-\w-codfw' | sort | uniq -c
      2 asw-a-codfw
      4 asw-c-codfw
elukey@neodymium:~$ sudo -i salt -C 'G@cluster:videoscaler and G@site:codfw' cmd.run 'lldpcli show neighbors | grep SysName' --output=raw | egrep -o 'asw-\w-codfw' | sort | uniq -c
      1 asw-a-codfw
      2 asw-b-codfw
      1 asw-c-codfw
elukey@neodymium:~$ sudo -i salt -C 'G@cluster:jobrunner and G@site:codfw' cmd.run 'lldpcli show neighbors | grep SysName' --output=raw | egrep -o 'asw-\w-codfw' | sort | uniq -c
      5 asw-a-codfw
     10 asw-c-codfw
elukey@neodymium:~$ sudo -i salt -C 'G@cluster:api_appserver and G@site:codfw' cmd.run 'lldpcli show neighbors | grep SysName' --output=raw | egrep -o 'asw-\w-codfw' | sort | uniq -c
      9 asw-a-codfw
     28 asw-b-codfw
     15 asw-c-codfw
elukey@neodymium:~$ sudo -i salt -C 'G@cluster:appserver and G@site:codfw' cmd.run 'lldpcli show neighbors | grep SysName' --output=raw | egrep -o 'asw-\w-codfw' | sort | uniq -c
     20 asw-a-codfw
     27 asw-b-codfw
     37 asw-c-codfw

EQIAD

elukey@neodymium:~$ sudo -i salt -C 'G@cluster:imagescaler and G@site:eqiad' cmd.run 'lldpcli show neighbors | grep SysName' --output=raw | egrep -o 'asw-\w-eqiad' | sort | uniq -c
      6 asw-b-eqiad
elukey@neodymium:~$ sudo -i salt -C 'G@cluster:videoscaler and G@site:eqiad' cmd.run 'lldpcli show neighbors | grep SysName' --output=raw | egrep -o 'asw-\w-eqiad' | sort | uniq -c
      1 asw-a-eqiad
      1 asw-b-eqiad
      2 asw-c-eqiad
elukey@neodymium:~$ sudo -i salt -C 'G@cluster:jobrunner and G@site:eqiad' cmd.run 'lldpcli show neighbors | grep SysName' --output=raw | egrep -o 'asw-\w-eqiad' | sort | uniq -c
      8 asw-b-eqiad
      7 asw-c-eqiad
elukey@neodymium:~$ sudo -i salt -C 'G@cluster:api_appserver and G@site:eqiad' cmd.run 'lldpcli show neighbors | grep SysName' --output=raw | egrep -o 'asw-\w-eqiad' | sort | uniq -c
     12 asw-a-eqiad
      7 asw-b-eqiad
     12 asw-c-eqiad
     19 asw-d-eqiad
elukey@neodymium:~$ sudo -i salt -C 'G@cluster:appserver and G@site:eqiad' cmd.run 'lldpcli show neighbors | grep SysName' --output=raw | egrep -o 'asw-\w-eqiad' | sort | uniq -c
     15 asw-a-eqiad
      9 asw-b-eqiad
     19 asw-c-eqiad
     25 asw-d-eqiad
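Summing the per-row counts gives a per-cluster host total for each site, which is the comparison this task is about. The sketch below copies the counts from the two listings above; `host_diff` is an illustrative name, and the totals only cover hosts that answered the salt command and reported a matching LLDP neighbor.

```python
# Per-row host counts copied from the CODFW and EQIAD listings above.
CODFW = {
    "imagescaler": {"a": 2, "c": 4},
    "videoscaler": {"a": 1, "b": 2, "c": 1},
    "jobrunner": {"a": 5, "c": 10},
    "api_appserver": {"a": 9, "b": 28, "c": 15},
    "appserver": {"a": 20, "b": 27, "c": 37},
}
EQIAD = {
    "imagescaler": {"b": 6},
    "videoscaler": {"a": 1, "b": 1, "c": 2},
    "jobrunner": {"b": 8, "c": 7},
    "api_appserver": {"a": 12, "b": 7, "c": 12, "d": 19},
    "appserver": {"a": 15, "b": 9, "c": 19, "d": 25},
}


def host_diff(eqiad, codfw):
    """Return eqiad minus codfw host totals for each cluster."""
    return {
        cluster: sum(eqiad[cluster].values()) - sum(codfw[cluster].values())
        for cluster in eqiad
    }


for cluster, diff in host_diff(EQIAD, CODFW).items():
    print(f"{cluster:15s} {diff:+d}")
```

A value of zero or below means codfw has at least as many hosts as eqiad in that cluster; host count is only a proxy for capacity, since core counts differ between hardware generations.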

It looks good in my opinion, but others might want to chime in and confirm?

Joe added a comment. Mar 23 2017, 1:29 PM

@elukey looking at the numbers, the only slightly worrying situation is for apis in codfw: if we lose ROW B we lose more than half of the capacity. We might want to add apis in row a or row c once we get new hardware.

@Joe definitely. I already added 3 new api-appservers in row-a (T155180), so now the numbers look a bit better:

elukey@neodymium:~$ sudo -i salt -C 'G@cluster:api_appserver and G@site:codfw' cmd.run 'lldpcli show neighbors | grep SysName' --output=raw | egrep -o 'asw-\w-codfw' | sort | uniq -c
     12 asw-a-codfw
     28 asw-b-codfw
     15 asw-c-codfw
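The row B concern can be put into numbers by computing each row's share of the codfw api_appserver hosts before and after the three row-a additions. This is a rough sketch that assumes every host contributes equal capacity, which the thread does not state; `row_shares` is an illustrative name.

```python
def row_shares(rows):
    """Return each row's fraction of the total host count."""
    total = sum(rows.values())
    return {row: count / total for row, count in rows.items()}


# codfw api_appserver counts per row, copied from the salt output in this thread.
before = {"a": 9, "b": 28, "c": 15}
after = {"a": 12, "b": 28, "c": 15}

for label, rows in (("before", before), ("after", after)):
    shares = row_shares(rows)
    summary = ", ".join(f"row {r}: {s:.1%}" for r, s in shares.items())
    print(f"{label}: {summary}")
```

Under that assumption, row B still holds slightly more than half of the hosts after the additions, which is consistent with the suggestion to place further api hosts in rows a or c.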

Maybe we could move some row-c appservers to apis?

faidon added a subscriber: faidon. Mar 29 2017, 5:31 PM

I think this task (matching eqiad's capacity) is done and what's left is to balance codfw between rows a little better, right? If so, can we make a separate task for that, as it likely won't be part of our effort for the switchover?

elukey added a comment. Apr 6 2017, 7:49 AM

Just decommed mw2090-96 after adding new appservers, last snapshot:

elukey@neodymium:~$ sudo -i salt -C 'G@cluster:imagescaler and G@site:codfw' cmd.run 'lldpcli show neighbors | grep SysName' --output=raw | egrep -o 'asw-\w-codfw' | sort | uniq -c
      2 asw-a-codfw
      4 asw-c-codfw
elukey@neodymium:~$ sudo -i salt -C 'G@cluster:videoscaler and G@site:codfw' cmd.run 'lldpcli show neighbors | grep SysName' --output=raw | egrep -o 'asw-\w-codfw' | sort | uniq -c
      1 asw-a-codfw
      2 asw-b-codfw
      1 asw-c-codfw
elukey@neodymium:~$ sudo -i salt -C 'G@cluster:jobrunner and G@site:codfw' cmd.run 'lldpcli show neighbors | grep SysName' --output=raw | egrep -o 'asw-\w-codfw' | sort | uniq -c
      5 asw-a-codfw
     10 asw-c-codfw
elukey@neodymium:~$ sudo -i salt -C 'G@cluster:api_appserver and G@site:codfw' cmd.run 'lldpcli show neighbors | grep SysName' --output=raw | egrep -o 'asw-\w-codfw' | sort | uniq -c
     12 asw-a-codfw
     28 asw-b-codfw
     15 asw-c-codfw
elukey@neodymium:~$ sudo -i salt -C 'G@cluster:appserver and G@site:codfw' cmd.run 'lldpcli show neighbors | grep SysName' --output=raw | egrep -o 'asw-\w-codfw' | sort | uniq -c
     20 asw-a-codfw
     28 asw-b-codfw
     37 asw-c-codfw

As @Joe was saying, we'll need to be wise in row allocation for new hardware, but this task should be done @paravoid.

elukey closed this task as Resolved. Apr 6 2017, 7:51 AM