Kartotherian is currently being blubberized, and soon we should be able to start creating a new chart/deployment for it on Wikikube. Kartotherian will run on nodejs-20 and Bookworm; see T327396
The current pain points that I see are related to load balancing, more specifically:
- We have two LVS services, kartotherian (plaintext, port 6533) and kartotherian-ssl (TLS, port 443), but only one LVS/discovery endpoint, kartotherian.discovery.wmnet.
- The maps.wikimedia.org domain points directly to port 443, using TLS.
- On mapsXXXX we have nginx serving traffic on port 443, and kartotherian (nodejs) serving port 6533. As far as I can see, the nginx config is mostly about TLS termination and performance, and it just proxies to port 6533.
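For context, the current bare metal setup is roughly the following. This is a simplified, hypothetical sketch (the real config is managed by Puppet and has more TLS/performance tuning; certificate paths are placeholders):

```nginx
server {
    # TLS terminates here, on the bare metal maps hosts.
    listen 443 ssl;
    server_name maps.wikimedia.org;

    # Placeholder paths; the real ones come from Puppet.
    ssl_certificate     /etc/ssl/localcerts/maps.chained.crt;
    ssl_certificate_key /etc/ssl/private/maps.key;

    location / {
        # Plaintext proxy to the local kartotherian (nodejs) process.
        proxy_pass http://127.0.0.1:6533;
    }
}
```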
The first question that I have is why we need both, since ideally all clients should just use TLS. The second is more about what to do for the bare metal -> k8s migration, since we'll not be able to use port 443 on k8s. This is what we did with Thumbor when moving it from bare metal to Wikikube:
- Deploy Thumbor on Wikikube, making it listen on the same port as its bare metal cousin.
- Add Wikikube workers behind the Thumbor LVS endpoint (initially depooled, to sit side-by-side with the bare metal nodes).
- Slowly enable some Wikikube workers to serve Thumbor prod traffic from K8s, and measure issues/performance/etc.
- Eventually leave only Wikikube workers pooled, and remove all bare metal hosts.
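For reference, pooling/depooling individual hosts behind an LVS service is done with conftool. Something like this (hostnames are placeholders):

```console
# Pool a single Wikikube worker behind the Thumbor LVS service.
$ sudo confctl select 'name=kubernetes1001.eqiad.wmnet,service=thumbor' set/pooled=yes

# Later, depool a bare metal host the same way.
$ sudo confctl select 'name=thumbor1001.eqiad.wmnet,service=thumbor' set/pooled=no
```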
Due to the 443 port we cannot easily do the same, so this is my idea:
- We add another listen directive on port 6543 to Kartotherian's nginx config, so that the bare metal hosts will also serve TLS traffic from that port. It should be easy enough to do, but I need to verify that it works as expected.
- We create a new Puppet load balanced service called kartotherian-k8s-ssl. We use the same IP addresses as the other kartotherian LVS services, just with a new port, 6543 (port number picked at random; not yet used in Puppet's service.yaml). In theory it shouldn't require any pybal config change/restart, just updated settings in Puppet for monitoring etc.
- Since we haven't created a new LVS IP, the bare metal hosts should already be pooled and ready to go.
- When we are comfortable, we move the ATS config (CDN) of maps.wikimedia.org to the new port.
- Then we deploy Kartotherian to K8s, with nodePorts 6543 and 6533. When we are done, we should have happy pods running on Wikikube serving TLS traffic via the Mesh (so nginx is not needed at this point) on 6543 and plaintext traffic on 6533 (assuming that we'll still need it).
- At this point, we should be able to pool Wikikube workers in the Kartotherian LVS service. We should be able to add just a few of them, not the entire fleet, since kube-proxy should route the traffic for nodePorts 6543/6533 correctly to the Wikikube workers running the kartotherian pods.
- Once ready, we pool the first Wikikube worker and then we observe how well the k8s pods behave.
- Slowly, over time, we pool all Wikikube workers and gradually depool the bare metal nodes.
- Once the bare metal nodes are no longer used for prod traffic, undeploy kartotherian from them.
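The extra listen from the first step of the plan above should be a small change, something like this (a sketch, assuming the existing nginx server block for maps):

```nginx
server {
    listen 443 ssl;
    # New: serve the same TLS vhost on 6543, the port that the
    # kartotherian-k8s-ssl LVS service (and later the k8s nodePort)
    # will use.
    listen 6543 ssl;

    # (rest of the existing config unchanged)
}
```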
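And the k8s side could be sketched as a NodePort Service along these lines (hypothetical names/labels; the real chart will define this). One caveat: nodePorts 6543/6533 are outside the default Kubernetes nodePort range (30000-32767), so the apiserver's --service-node-port-range would need to cover them:

```yaml
# Hypothetical sketch of the kartotherian Service on Wikikube,
# exposing fixed nodePorts so LVS can target any worker.
apiVersion: v1
kind: Service
metadata:
  name: kartotherian
spec:
  type: NodePort
  selector:
    app: kartotherian
  ports:
    - name: tls          # TLS terminated by the mesh sidecar, not nginx
      port: 6543
      nodePort: 6543
      targetPort: 6543
    - name: plaintext    # keep only if plaintext traffic is still needed
      port: 6533
      nodePort: 6533
      targetPort: 6533
```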
Not sure if I have missed anything important, please lemme know your thoughts!