We should run a stress test on the new hardware as we prepare to bring it online. I want a clear measure of the throughput we can expect and the point where we start to overload. I feel like we've been relying on assumptions for a bit too long, and this is a good opportunity to re-calibrate. If we move ORES fully to the ores* nodes, we can safely run such a test before we move traffic over. If we go with the hybrid setup instead, I'd like to switch from active-active to serving from a single datacenter so we can run the test against the other one.
This task is done when we have completed a series of test runs at different score request frequencies and determined the capacity of the new cluster.
- Write a script to produce scoring requests at variable frequencies (use a param to skip the cache -- like "?features")
- Run the script against a test installation of ORES on ores* nodes
- Document the results.
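A minimal sketch of the request-generating script from the first step, under stated assumptions: the `/v3/scores/{context}/{rev_id}/{model}` path layout, the function names, and the `max_workers` default are illustrative guesses, not something specified in this task. Only the "?features" cache-skipping param comes from the list above.

```python
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def build_url(base: str, context: str, model: str, rev_id: int) -> str:
    """Build a scoring URL. The path layout is an assumption; "?features"
    is appended so the request skips the score cache."""
    return f"{base}/v3/scores/{context}/{rev_id}/{model}?features"


def fetch_status(url: str) -> int:
    """Send one GET request and return the HTTP status (0 on network failure)."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 503 once the service starts to overload
    except OSError:
        return 0  # connection refused, timeout, DNS failure, ...


def run_at_rate(
    urls: Sequence[str],
    requests_per_minute: float,
    send: Callable[[str], int] = fetch_status,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
    max_workers: int = 64,  # hypothetical default; tune for the target rate
) -> List[int]:
    """Issue one request per URL at a fixed frequency and return the statuses."""
    if requests_per_minute <= 0:
        raise ValueError("requests_per_minute must be positive")
    interval = 60.0 / requests_per_minute
    start = clock()
    futures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for i, url in enumerate(urls):
            # Schedule against the start time so slow responses do not
            # accumulate drift in the request frequency.
            delay = start + i * interval - clock()
            if delay > 0:
                sleep(delay)
            futures.append(pool.submit(send, url))
        return [f.result() for f in futures]
```

Something like `run_at_rate([build_url(base, "enwiki", "damaging", r) for r in rev_ids], 1000)` would then make one test run, where `base` and `rev_ids` point at the test installation and a sample of revisions. Requests are submitted to a thread pool so that slow responses do not hold back the send schedule; if all workers are busy, the effective rate will fall below the requested one, which is worth recording alongside the status codes.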
Our highest traffic load over the past year was 4k requests/minute (see https://grafana-admin.wikimedia.org/dashboard/db/ores?panelId=3&fullscreen&orgId=1&from=1486192159051&to=1486517049462 ). Our mean load over the year was 522 requests/minute.
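To pick request frequencies for the test runs, these two figures can be converted to requests/second and used as a baseline. The sketch below does that arithmetic; the multipliers in `ramp` are hypothetical placeholders, not an agreed test plan.

```python
PEAK_PER_MINUTE = 4000  # highest load over the past year (from this task)
MEAN_PER_MINUTE = 522   # mean load over the past year (from this task)


def per_second(per_minute: float) -> float:
    """Convert a requests/minute figure to requests/second."""
    return per_minute / 60.0


def ramp(peak_per_minute: int, multipliers=(0.25, 0.5, 1, 2, 4)) -> list:
    """Candidate test frequencies (requests/minute) as multiples of the peak.
    The multipliers are an assumption chosen for illustration."""
    return [int(peak_per_minute * m) for m in multipliers]


print(round(per_second(PEAK_PER_MINUTE), 1))  # 66.7 requests/second at peak
print(round(per_second(MEAN_PER_MINUTE), 1))  # 8.7 requests/second on average
print(ramp(PEAK_PER_MINUTE))                  # [1000, 2000, 4000, 8000, 16000]
```

Starting below the observed peak and stepping past it should show both the headroom over current traffic and the frequency at which the cluster begins to overload.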