This action item from T158837 is now unblocked.
Roles on hafnium:
- webperf
Services in webperf:
- navtiming
- statsv
For this task:
- Verify everything on hafnium is puppetized.
- Figure out how we want to do the migration.
Thoughts
When thinking about migration, we should also think about a future switch-over. Ideally we'd use the same mechanism.
The status quo is that if things fail, we'd presumably remove the role from hafnium, make sure the process is dead, apply the role to the new server, and run puppet right away. To happen quickly, however, this would require 2 code changes and 2 manual commands on a server.
Ideally we'd figure out a way to also ensure the two instances logically cannot run at the same time (which would cause duplicate reporting in statsd). The first thing that comes to mind for de-duplication is a Kafka consumer group. That way Kafka is responsible for only ever sending a given message to one of our webperf servers. The only downside is that if we do that by default, we will also by default consume Kafka across DCs (and send data back to statsd across DCs), which isn't good for latency. That's fine for edge cases during a switchover, but doesn't seem useful as the default state.
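To make the de-duplication idea concrete, here is a toy model of consumer-group semantics in plain Python. It is only a sketch: it does not talk to Kafka, the round-robin split is a stand-in for whatever assignor the real client uses, and the host names `webperf-a` and `webperf-b` are hypothetical. The point it illustrates is that within one group, each partition has exactly one owner, so no message is handled twice.

```python
# Toy model of Kafka consumer-group behaviour (no real Kafka involved).
# Assumption: a simple round-robin split stands in for the real assignor.

def assign_partitions(partitions, members):
    """Give each partition to exactly one member of the group."""
    if not members:
        return {}
    ordered = sorted(members)
    assignment = {member: [] for member in ordered}
    for index, partition in enumerate(sorted(partitions)):
        owner = ordered[index % len(ordered)]
        assignment[owner].append(partition)
    return assignment


def owners_of(assignment, partition):
    """Return every member that would consume the given partition."""
    return [m for m, parts in assignment.items() if partition in parts]


partitions = [0, 1, 2, 3]

# Both hosts (hypothetical names) joined to the same group:
# the partitions are split, never shared.
both = assign_partitions(partitions, ["webperf-a", "webperf-b"])

# One host leaves: the remaining host takes over every partition.
solo = assign_partitions(partitions, ["webperf-a"])
```

With both hosts in the group, each partition has a single owner; with one host, that host owns everything. This is the property that would prevent duplicate reporting, at the cost of cross-DC consumption whenever the remote host holds partitions.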
Alternatively, we could have both machines enable the role by default, but have some kind of switch in the code that controls only whether the service is running or not. We'd keep only the primary one running, and then make one code change to switch which one is running. During the switch, one of them may start before the other stops, which Kafka would mediate in a sensible way (given a shared consumer group).
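The switch described above could look roughly like the following, sketched in Python for illustration only. In practice this would presumably live in puppet configuration rather than Python, and the setting name `active_host` and the second host name `webperf-new` are hypothetical. Both hosts carry the role; a single value decides which one actually runs the services.

```python
# Sketch of a single-setting switch. All names except "hafnium" are
# hypothetical; the real mechanism would likely be a puppet setting.

HOSTS = ["hafnium", "webperf-new"]


def desired_state(hostname, active_host):
    """The service runs only on the host named by the single setting."""
    return "running" if hostname == active_host else "stopped"


def plan(active_host):
    """Compute the desired service state for every host with the role."""
    if active_host not in HOSTS:
        raise ValueError(f"unknown host: {active_host}")
    return {host: desired_state(host, active_host) for host in HOSTS}


before = plan("hafnium")      # hafnium is primary
after = plan("webperf-new")   # one value changed, roles swapped
```

Changing the one value flips both hosts at once, which is the "one code change" property. It does not by itself guarantee that the old instance has stopped before the new one starts, which is where the consumer group would cover the overlap.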