Assess SCB@CODFW preparedness for the DC switchover
Closed, Resolved, Public

Description

With the switchover coming, we should assess how well prepared the SCB@CODFW cluster is to accept the traffic that SCB@EQIAD is currently handling. In the past (a year ago) we had no problem with that, but things have changed since then, so we need to reevaluate.

Looking at the numbers:

CPU

CPU-wise, the EQIAD cluster consists of:

  • 2 boxes with 2(CPUs)x8(cores)x2(HT) ranked at 5000 Bogomips[1]
  • 2 boxes with 2(CPUs)x6(cores)x2(HT) ranked at 4800 Bogomips

For a grand total of 112 logical CPUs (HT included) and 550K Bogomips

The CODFW cluster has:

  • 4 boxes with 2(CPUs)x4(cores)x2(HT) ranked at 6000 Bogomips

For a grand total of 64 logical CPUs (HT included) and 384K Bogomips

So, CPU-wise, the CODFW cluster is clearly underpowered by >30%. That is not necessarily a reason to rush more capacity into the cluster, however.
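The totals above can be reproduced with a few lines of Python. This is only a rough sketch: the `BoxSpec` tuple and the function names are illustrative, and the quoted Bogomips figure is treated as a per-logical-CPU value, which is what the totals in this task imply.

```python
from typing import NamedTuple


class BoxSpec(NamedTuple):
    count: int     # number of identical boxes
    sockets: int   # physical CPUs per box
    cores: int     # cores per physical CPU
    threads: int   # hardware threads per core (2 = HT enabled)
    bogomips: int  # Bogomips per logical CPU, as quoted in the task


def logical_cpus(spec):
    """Logical CPUs contributed by one group of identical boxes."""
    return spec.count * spec.sockets * spec.cores * spec.threads


def cluster_totals(specs):
    """Return (logical CPUs, total Bogomips) for a whole cluster."""
    cpus = sum(logical_cpus(s) for s in specs)
    bogomips = sum(logical_cpus(s) * s.bogomips for s in specs)
    return cpus, bogomips


EQIAD = [BoxSpec(2, 2, 8, 2, 5000), BoxSpec(2, 2, 6, 2, 4800)]
CODFW = [BoxSpec(4, 2, 4, 2, 6000)]

eqiad_cpus, eqiad_bogomips = cluster_totals(EQIAD)
codfw_cpus, codfw_bogomips = cluster_totals(CODFW)
deficit = 1 - codfw_bogomips / eqiad_bogomips

print(f"EQIAD: {eqiad_cpus} logical CPUs, {eqiad_bogomips} Bogomips")
print(f"CODFW: {codfw_cpus} logical CPUs, {codfw_bogomips} Bogomips")
print(f"CODFW deficit: {deficit:.1%}")
```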

Looking at the CPU usage of the EQIAD cluster [2], we see an average usage of ~50%, but a week ago it was down to 15% [3] and it has since doubled twice within 2 days. If we can figure out the reason for that CPU increase and, if possible, fix it, we are well within acceptable limits and the CODFW cluster will be able to survive the load with room to spare.
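To put numbers on this, the following sketch projects EQIAD's CPU utilization onto CODFW's capacity. It assumes load moves over unchanged and that Bogomips is a usable linear proxy for capacity, neither of which is exact; the function name is illustrative.

```python
EQIAD_BOGOMIPS = 550_000  # cluster total quoted in the task
CODFW_BOGOMIPS = 384_000  # cluster total quoted in the task


def projected_utilization(source_util, source_capacity, target_capacity):
    """Utilization of the target cluster if it absorbed the source's load.

    Assumes the load is fixed and capacity scales linearly with Bogomips.
    """
    return source_util * source_capacity / target_capacity


for label, util in [("current (~50%)", 0.50), ("a week ago (~15%)", 0.15)]:
    projected = projected_utilization(util, EQIAD_BOGOMIPS, CODFW_BOGOMIPS)
    print(f"EQIAD {label}: {projected:.0%} of CODFW CPU")
```

Under these assumptions the current load would land at roughly 72% of CODFW's CPU, while the load from a week ago would land at roughly 21%.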

Memory

Memory-wise, the EQIAD cluster has a grand total of 188GB [2] while the CODFW cluster has 140GB [4]. Memory usage at EQIAD is around 100GB (with spikes up to 115GB), which fits within the 140GB available in CODFW. Furthermore, due to the way our services are set up, the number of workers spawned is closely tied to the number of CPUs available, so we expect to use less memory in CODFW.
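A quick sketch of the memory headroom follows. The first figure assumes EQIAD's usage moves to CODFW unchanged; the second assumes memory scales linearly with the logical CPU count (112 in EQIAD, 64 in CODFW), which is a simplification since services also have fixed memory costs. The function names are illustrative.

```python
EQIAD_TYPICAL_GB = 100    # typical usage quoted in the task
EQIAD_PEAK_GB = 115       # observed spikes
CODFW_TOTAL_GB = 140
EQIAD_LOGICAL_CPUS = 112
CODFW_LOGICAL_CPUS = 64


def fraction_used(used_gb, total_gb):
    """Share of the target cluster's memory a given usage would occupy."""
    return used_gb / total_gb


def scaled_by_cpus(used_gb):
    # Assumption: memory use is proportional to worker count, and worker
    # count is proportional to logical CPUs. Fixed costs are ignored.
    return used_gb * CODFW_LOGICAL_CPUS / EQIAD_LOGICAL_CPUS


for label, used in [("typical", EQIAD_TYPICAL_GB), ("peak", EQIAD_PEAK_GB)]:
    print(
        f"{label}: {fraction_used(used, CODFW_TOTAL_GB):.0%} of CODFW memory "
        f"if unchanged, ~{scaled_by_cpus(used):.0f}GB if scaled by CPU count"
    )
```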

Network

Total usage in EQIAD is around 12Mbps [2] and all boxes are equipped with 1Gbps cards. We have plenty of capacity to spare there.

Disk

Disks in EQIAD are generally idle [5] and /srv usage is low [2]. That is expected given that none of the services are IOPS-heavy. No action is needed here.

References

[1] Bogomips is not a very accurate measure of a CPU's power, but all we care about here is a ballpark idea of the cluster's CPU power, not accuracy.
[2] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=scb&cluster=scb
[3] https://grafana.wikimedia.org/dashboard/file/server-board.json?panelId=7&fullscreen&var-server=scb1001&var-network=eth0&from=now-30d&to=now
[4] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?from=now-3h&to=now&var-datasource=codfw%20prometheus%2Fops&var-cluster=scb&cluster=scb&var-instance=All
[5] https://grafana.wikimedia.org/dashboard/file/server-board.json?panelId=18&fullscreen&var-server=scb1003&var-network=eth0
[6] https://grafana.wikimedia.org/dashboard/file/server-board.json?panelId=17&fullscreen&var-server=scb1003&var-network=eth0

Event Timeline

akosiaris triaged this task as Medium priority. Jan 26 2017, 12:38 PM

It turns out the CPU increase mentioned above is not the result of a bug or some other malfunction/change in our infrastructure, but rather of legitimate traffic coming from a single user of the API who crawls historical enwiki revisions, requesting ORES scores from the damaging model for them. While the fact that a single user is capable of increasing CPU usage threefold is disconcerting, the traffic appears largely legitimate. I say largely because the API etiquette of having a descriptive User-Agent header (https://www.mediawiki.org/wiki/API:Etiquette#User-Agent_header) is violated by sending Python-urllib/1.17, which is clearly a default. The rest of the guidelines (e.g. req/s) seem well within acceptable limits.
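For reference, a client using Python's standard library can replace the default header along these lines. This is only a sketch: the tool name, contact details and URL are hypothetical placeholders, and no request is actually sent.

```python
import urllib.request

# Hypothetical values: replace with the real tool name, version and contact.
USER_AGENT = (
    "RevisionScoreCrawler/0.1 "
    "(https://example.org/crawler; crawler-owner@example.org)"
)
API_URL = "https://example.org/api"  # placeholder, not a real endpoint


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that identifies the client instead of the library."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_request(API_URL)
# urllib stores header names capitalized, so the key is "User-agent".
print(request.get_header("User-agent"))
```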

The end result of the above investigation is that during the migration to CODFW, the SCB cluster will probably be pushed to its limits to keep up. A way forward would be to look into expanding it by 25%-50% of its current power before the migration.
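As a rough illustration of what such an expansion would buy, the sketch below recomputes the projected CODFW CPU utilization for each option. It assumes EQIAD's ~50% average usage moves over unchanged and that Bogomips is a linear proxy for capacity; the function name is illustrative.

```python
EQIAD_BOGOMIPS = 550_000   # cluster total quoted in the task
CODFW_BOGOMIPS = 384_000   # cluster total quoted in the task
EQIAD_UTILIZATION = 0.50   # average CPU usage reported for EQIAD


def utilization_after_expansion(growth):
    """Projected CODFW utilization after growing its capacity by `growth`."""
    expanded = CODFW_BOGOMIPS * (1 + growth)
    return EQIAD_UTILIZATION * EQIAD_BOGOMIPS / expanded


for growth in (0.0, 0.25, 0.50):
    print(f"+{growth:.0%} capacity: {utilization_after_expansion(growth):.0%} CPU")
```

Under these assumptions, a 25% expansion still leaves CODFW with less raw capacity than EQIAD (480K vs 550K Bogomips), while a 50% expansion slightly exceeds it (576K).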

With T156631 and T159486 done, we now have the required capacity to serve all of EQIAD's traffic, so I am resolving this.

akosiaris claimed this task.