We noticed that we have recently been overshooting the WDQS deployment window, both because of testing and because scap deploys to the WDQS servers one at a time rather than in parallel. We don't want to restart more than one server at a time in a single cluster, so that enough capacity remains to serve all the traffic. But we can restart one server from each cluster at the same time (public / internal & eqiad / codfw).
I think we can parallelize, but we should do it in a smart way, so that no more than one of the 3 servers in each cluster is restarted at the same time. But we can restart one in eqiad and one in codfw at the same time, and the same goes for internal and public. So we could do 4 servers at once instead of one.
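To make the idea concrete, here is a rough sketch of the scheduling only: clusters run in parallel, while servers inside a cluster are restarted strictly one after another. This is not how scap or any cookbook does it today. The function names and the `restart` callback are hypothetical stand-ins for whatever actually depools, restarts, and repools a host.

```python
from concurrent.futures import ThreadPoolExecutor


def rolling_restart(hosts, restart):
    """Restart the hosts of ONE cluster sequentially.

    `restart` is a hypothetical callback that blocks until the host
    is back in service, so at most one host per cluster is down.
    """
    done = []
    for host in hosts:
        restart(host)
        done.append(host)
    return done


def restart_all(clusters, restart):
    """Run one rolling restart per cluster, all clusters in parallel.

    `clusters` maps a cluster name (e.g. eqiad public) to its host list.
    With 4 clusters this restarts up to 4 hosts at once, but never two
    hosts of the same cluster.
    """
    workers = max(1, len(clusters))  # ThreadPoolExecutor rejects 0 workers
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            name: pool.submit(rolling_restart, hosts, restart)
            for name, hosts in clusters.items()
        }
        # result() re-raises any exception from a failed cluster restart
        return {name: future.result() for name, future in futures.items()}
```

One thread per cluster keeps the per-cluster ordering trivial: the constraint "one server per cluster at a time" falls out of the sequential loop instead of needing locks or semaphores.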
This needs to be discussed with the rel-eng team before re-estimating and starting implementation.
Note that if adding support in Scap is too complex, it might make sense to implement the deployment as cookbooks instead.
I'll talk to rel-eng to see what scap changes are needed to parallelize between groups (wdqs eqiad public vs wdqs eqiad internal, etc.).
There's a chance it might be worth relying on a cookbook for the rolling restart. Basically we'd use scap to get the new code in place and a cookbook to do the actual rolling restarts so that the changes take effect. But for now I'd assume we'll just be changing it in scap-land and not introducing a cookbook.