Current state
Each primary DC has an independent instance of the ArcLamp Python service, which subscribes to the sampling profiler stream exposed by app servers, writes the samples to local log files on disk, and converts them to flame graphs. A local Apache server exposes this directory via a proxy at https://performance.wikimedia.org/arclamp/.
The service is active-active (1 VM in Eqiad, and an identical provision in Codfw):
There is only one redundant instance. This is both more than needed and not enough. It is more than needed because everything is done twice. It is not enough because either or both instances could fail with no automatic recovery (e.g. depool, re-provision, etc.).
Storage is extremely limited (50G local space, T199853). This forces us to continuously delete profiles more than 14 days old, which gives only a very narrow window for perf analysis.
There are no backups (as far as I know).
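To make the storage constraint concrete, here is a minimal sketch of a 14-day pruning pass, assuming profiles are plain files under a single log directory and that file modification time is a good enough proxy for profile age. This is not the actual Arc Lamp pruning code; the function name, its parameters, and the directory layout are hypothetical.

```python
import time
from pathlib import Path

RETENTION_DAYS = 14  # retention window stated in the text above


def prune_old_profiles(log_dir, now=None, retention_days=RETENTION_DAYS):
    """Delete files under log_dir older than the retention window.

    Hypothetical helper: age is judged by mtime, which is an assumption.
    Returns the names of the files that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400  # seconds per day
    removed = []
    for path in sorted(Path(log_dir).rglob("*")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed
```

With only 50G of local space, any data older than the cutoff is gone for good under this kind of scheme, which is what limits the window for perf analysis.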
Outcome
- No longer require managing two VMs.
- Able to replace the deployed service or hardware without losing data or requiring manual migration.
- Able to retain daily data summary graphs for 2 years (requires an estimated 179G per T199853).
- Backups (either for us specifically, or as ensured by the storage layer).
Proposal
- Migrate Arc Lamp storage from writing to a local disk to writing to something external (currently assuming Swift). This is expected to take care of backups and local-DC redundancy automatically for the DC-local performance.wikimedia.org proxy, or at least to make redundancy and backups easier to implement.
- Change the Apache logic for https://performance.wikimedia.org/arclamp to read from virtual Swift directories (instead of local disk).
- Deploy through Kubernetes instead of Scap (e.g. as a Linux container image).
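The first two proposal items could be sketched roughly as follows. This is a sketch under stated assumptions, not a design: `conn` is assumed to be an object shaped like python-swiftclient's `swiftclient.client.Connection`, of which only `put_object` and `get_container` are used, and the container name and file layout are hypothetical. Sub-directories on disk become "/"-separated prefixes in object names, which is what lets a reader treat them as virtual directories.

```python
from pathlib import Path


def object_name(log_dir, path):
    """Map a local file to an object name, keeping sub-directories as prefixes."""
    return Path(path).relative_to(log_dir).as_posix()


def upload_tree(conn, container, log_dir):
    """Upload every regular file under log_dir; return the object names written.

    `conn` is assumed to offer put_object(container, obj, contents=...),
    as swiftclient.client.Connection does.
    """
    names = []
    for path in sorted(Path(log_dir).rglob("*")):
        if path.is_file():
            name = object_name(log_dir, path)
            with path.open("rb") as fh:
                conn.put_object(container, name, contents=fh)
            names.append(name)
    return names


def list_virtual_dir(conn, container, prefix=""):
    """List one level of a virtual directory via a prefix/delimiter listing.

    Entries are dicts: objects carry "name", pseudo-directories carry "subdir".
    """
    _headers, entries = conn.get_container(container, prefix=prefix, delimiter="/")
    return [entry.get("subdir") or entry["name"] for entry in entries]
```

For example, `upload_tree(conn, "arclamp", "/path/to/logs")` followed by `list_virtual_dir(conn, "arclamp", "daily/")` would return the objects stored under the `daily/` prefix, which is the kind of lookup the Apache-facing side would need in place of a local directory listing. Both the container name and the path here are placeholders.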