@Aaron, @ori, thank you for your work on the emergency parsercache key implementation. I want to track here the pending tasks related to it, starting with some discussion:
[x] Should we slowly change the keys to something more reasonable, e.g. the names of the shards (pc1, pc2, pc3), changing one key per server pair at a time until all old keys have expired? Will changing one key at a time affect the sharding function for the others, too?
[] Should we implement a more deterministic sharding function? As far as I know, the server chosen currently depends on the key and the number of servers, which means that during maintenance (it is very typical to depool one server at a time) keys get redistributed randomly among the servers. It could be arranged so that only the keys going to the depooled server are sent elsewhere, while the ones already placed by the previous function keep going to the right servers, for example by maintaining the keys but pointing to a NULL server. Maybe the rule should be to fail over cross-datacenter? Should we buy 2 servers per "shard" and datacenter so that we always keep the same servers?
[] As a longer-term question, how should parsercache be handled for active-active? Is that something the parsercache architecture should know about, or should we resolve it at the MediaWiki "routing" layer?
[] Could we have a hot-swap mechanism (we now have a spare host), or something else completely separate, to allow for automatic failure detection and recovery that works for the parsercache model? (This also depends on how the above question is handled.)
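
To make the sharding question above more concrete, here is a minimal sketch in Python (not MediaWiki code). It compares a plain "hash modulo number of servers" function with rendezvous (highest-random-weight) hashing keyed on the shard tag. I am assuming the current behaviour is roughly modulo-like, based only on the description above. The function names and the `example-key-N` cache keys are hypothetical. Only the pc1/pc2/pc3 tags come from this task.

```python
import hashlib


def stable_hash(text: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")


def modulo_shard(key: str, tags: list[str]) -> str:
    """Assumed approximation of the current behaviour: depends on the server count."""
    return tags[stable_hash(key) % len(tags)]


def rendezvous_shard(key: str, tags: list[str]) -> str:
    """Rendezvous hashing: score every (tag, key) pair, pick the highest score.

    The result depends only on the set of tags, not on their count or order.
    """
    return max(tags, key=lambda tag: stable_hash(f"{tag}:{key}"))


def moved_keys(shard_fn, keys, tags_before, tags_after):
    """Keys whose assigned server changes between two tag lists."""
    return [k for k in keys if shard_fn(k, tags_before) != shard_fn(k, tags_after)]


if __name__ == "__main__":
    keys = [f"example-key-{i}" for i in range(10_000)]  # hypothetical cache keys
    pooled = ["pc1", "pc2", "pc3"]
    depooled = ["pc1", "pc3"]  # pc2 taken out for maintenance

    for name, fn in (("modulo", modulo_shard), ("rendezvous", rendezvous_shard)):
        moved = moved_keys(fn, keys, pooled, depooled)
        print(f"{name}: {len(moved)} of {len(keys)} keys change server")
```

With the rendezvous variant, depooling pc2 can only move keys that were on pc2. Likewise, renaming one tag can only move keys to or from the renamed server, never between two untouched ones. That would be relevant to the "one key at a time" question in the first item, if the real function behaved this way.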