With the switchover coming we should assess how well prepared the SCB@CODFW cluster is to accept the traffic that SCB@EQIAD currently handles. A year ago we had no problem with that, but things have changed since then, so we need to reevaluate.
Looking at the numbers:
CPU
CPU wise the EQIAD cluster comprises:
- 2 boxes with 2(CPUs)x8(cores)x2(HT) ranked at 5000 Bogomips[1]
- 2 boxes with 2(CPUs)x6(cores)x2(HT) ranked at 4800 Bogomips
For a grand total of 112 logical CPUs (HT included) and 550K Bogomips
The CODFW cluster has:
- 4 boxes with 2(CPUs)x4(cores)x2(HT) ranked at 6000 Bogomips
For a grand total of 64 logical CPUs (HT included) and 384K Bogomips
So CPU wise the CODFW cluster is clearly underpowered, by a bit over 30% in Bogomips (and by about 43% in logical CPUs). However, that is not necessarily a reason to rush more capacity into the cluster.
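The totals above can be reproduced with a few lines of arithmetic. This is only a rough sketch: it assumes the quoted Bogomips figure is per logical CPU (which is what makes the 550K and 384K totals add up), and the names `EQIAD`, `CODFW` and `cluster_totals` are made up for illustration.

```python
# Box specs copied from the lists above:
# (boxes, sockets, cores per socket, HT threads per core, Bogomips per logical CPU)
# Assumption: the Bogomips rating applies to each logical CPU.
EQIAD = [(2, 2, 8, 2, 5000), (2, 2, 6, 2, 4800)]
CODFW = [(4, 2, 4, 2, 6000)]


def cluster_totals(specs):
    """Return (logical_cpus, total_bogomips) for a list of box specs."""
    cpus = 0
    bogomips = 0
    for boxes, sockets, cores, threads, per_cpu in specs:
        logical = boxes * sockets * cores * threads
        cpus += logical
        bogomips += logical * per_cpu
    return cpus, bogomips


eqiad_cpus, eqiad_bogo = cluster_totals(EQIAD)
codfw_cpus, codfw_bogo = cluster_totals(CODFW)
deficit = 1 - codfw_bogo / eqiad_bogo

print(eqiad_cpus, eqiad_bogo)   # 112 550400
print(codfw_cpus, codfw_bogo)   # 64 384000
print(f"{deficit:.1%}")         # 30.2%
```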
Looking at the CPU usage of the EQIAD cluster [2], we see an average usage of ~50%. However, a week ago it was down to 15% [3], and it has since doubled twice within 2 days. If we figure out the reason for that CPU increase and, if possible, fix it, we will be well within acceptable limits and the CODFW cluster will be able to handle the load with room to spare.
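To put rough numbers on that, the EQIAD usage can be projected onto the CODFW capacity. This is a back-of-the-envelope sketch only: it assumes load scales linearly with Bogomips and that all EQIAD traffic moves to CODFW unchanged. The function name `projected_codfw_usage` is hypothetical.

```python
# Cluster totals from the CPU section (Bogomips, HT included).
EQIAD_BOGOMIPS = 550_400
CODFW_BOGOMIPS = 384_000


def projected_codfw_usage(eqiad_usage):
    """Project an EQIAD CPU usage fraction onto CODFW.

    Assumption: the same absolute amount of work lands on CODFW and
    CPU cost scales linearly with Bogomips.
    """
    return eqiad_usage * EQIAD_BOGOMIPS / CODFW_BOGOMIPS


for usage in (0.15, 0.50):
    print(f"EQIAD at {usage:.0%} -> CODFW at ~{projected_codfw_usage(usage):.1%}")
# EQIAD at 15% -> CODFW at ~21.5%
# EQIAD at 50% -> CODFW at ~71.7%
```

At the old 15% level CODFW would sit at roughly a fifth of its capacity, while at the current ~50% it would be above 70%, which is why tracking down the increase matters.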
Memory
Memory wise the EQIAD cluster has a grand total of 188GB [2] while the CODFW cluster has 140GB [4]. Memory usage at EQIAD is around 100GB (with spikes up to 115GB), which fits within the memory available at CODFW. Furthermore, due to the way our services are set up, the number of workers spawned is closely tied to the number of CPUs available, so we expect to use less memory in CODFW.
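A quick sketch of the memory headroom follows. The first figure takes the EQIAD usage as-is. The second is a rough estimate that assumes memory scales linearly with worker count and worker count with logical CPUs, which is a simplification and not something measured. The helper names are hypothetical.

```python
CODFW_TOTAL_GB = 140
EQIAD_CPUS, CODFW_CPUS = 112, 64          # logical CPUs, HT included
USAGE_GB = {"typical": 100, "spike": 115}  # observed at EQIAD


def codfw_fill(used_gb):
    """Fraction of CODFW memory used if EQIAD usage moved over unchanged."""
    return used_gb / CODFW_TOTAL_GB


def scaled_usage(used_gb):
    """Estimated usage in GB if memory scaled purely with logical CPU count.

    Assumption: workers (and their memory) are proportional to CPUs.
    """
    return used_gb * CODFW_CPUS / EQIAD_CPUS


for label, used in USAGE_GB.items():
    print(f"{label}: {codfw_fill(used):.0%} of CODFW memory as-is, "
          f"~{scaled_usage(used):.0f}GB if usage scales with CPU count")
```

Even in the pessimistic as-is case, the spike stays below the 140GB available.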
Network
Total usage in EQIAD is around 12Mbps [2] and all boxes are equipped with 1Gbps cards. We have plenty of capacity to spare there.
Disk
Disks in EQIAD are generally idle [5] and /srv usage is low [2]. That is expected given that none of the services are IOPS heavy. No action is needed here.
References
[1] Bogomips is not a very accurate measure of a CPU's power, but all we need here is a ballpark idea of the cluster's CPU power, not accuracy.
[2] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=scb&cluster=scb
[3] https://grafana.wikimedia.org/dashboard/file/server-board.json?panelId=7&fullscreen&var-server=scb1001&var-network=eth0&from=now-30d&to=now
[4] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?from=now-3h&to=now&var-datasource=codfw%20prometheus%2Fops&var-cluster=scb&cluster=scb&var-instance=All
[5] https://grafana.wikimedia.org/dashboard/file/server-board.json?panelId=18&fullscreen&var-server=scb1003&var-network=eth0
[6] https://grafana.wikimedia.org/dashboard/file/server-board.json?panelId=17&fullscreen&var-server=scb1003&var-network=eth0