
Deploy ArcLamp process as stateless/scalable service (Kubernetes)
Closed, Declined · Public

Description

Current state

Each primary DC has an independent instance of the ArcLamp Python service, which subscribes to the sampling profiler stream exposed by app servers, writes the samples to local log files on disk, and converts these into flame graphs. A local Apache server exposes this directory via a proxy at https://performance.wikimedia.org/arclamp/.
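For illustration, a minimal sketch of that ingest loop, assuming the profiler stream is exposed as a Redis pub/sub channel; the channel name, host, and log path below are placeholders, not the actual production configuration:

```python
# Hypothetical sketch of the ArcLamp ingest loop; channel name, host,
# and paths are placeholders, not the production configuration.
import datetime
import redis

LOG_DIR = "/srv/arclamp/logs"   # placeholder local log directory
CHANNEL = "excimer"             # placeholder profiler stream channel

r = redis.Redis(host="localhost", port=6379)
pubsub = r.pubsub()
pubsub.subscribe(CHANNEL)

for message in pubsub.listen():
    if message["type"] != "message":
        continue
    # Each message is one collapsed stack-trace sample from an app server.
    sample = message["data"].decode("utf-8").rstrip("\n")
    # Append to an hourly log file; a separate step renders these into flame graphs.
    hour = datetime.datetime.utcnow().strftime("%Y-%m-%d_%H")
    with open(f"{LOG_DIR}/{CHANNEL}.{hour}.log", "a") as f:
        f.write(sample + "\n")
```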

The service is active-active (one VM in Eqiad, and an identical provision in Codfw).

There is only one level of redundancy, which is both more than needed and not enough. It is more than needed because everything is done twice, and not enough because either or both instances could fail with no automatic recovery (e.g. depool, re-provision).

Storage is extremely limited (50G local space, T199853). This forces us to continuously delete profiles more than 14 days old, which gives only a very narrow window for perf analysis.

There are no backups (as far as I know).

Outcome

  • No longer require managing two VMs.
  • Able to replace the deployed service or hardware without losing data or requiring manual migration.
  • Able to retain daily data summary graphs for 2 years (requires an estimated 179G per T199853).
  • Backups (either for us specifically, or as ensured by the storage layer).

Proposal

  • Migrate Arc Lamp storage from writing to local disk to something external - currently assuming Swift. This is expected to take care of backups and local-DC redundancy for the DC-local performance.wikimedia.org proxy automatically, or at least make it easier to implement redundancy/backups (see the sketch after this list).
  • Change the Apache logic for https://performance.wikimedia.org/arclamp to read from virtual Swift directories (instead of local disk).
  • Deploy through Kubernetes instead of Scap (i.e. as a Linux container image).
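A minimal sketch of what the upload step could look like with python-swiftclient, assuming the service is granted Swift credentials and a dedicated container; the auth endpoint, credentials, container, and object names below are illustrative assumptions, not decided configuration:

```python
# Hypothetical sketch: upload a rendered flame graph to Swift instead of local disk.
# Auth URL, credentials, and container name are placeholders, not real settings.
from swiftclient.client import Connection

conn = Connection(
    authurl="https://swift.example.org/auth/v1.0",  # placeholder auth endpoint
    user="arclamp:svc",                             # placeholder account
    key="REDACTED",                                 # placeholder secret
)

def upload_flamegraph(svg_path: str, object_name: str) -> None:
    """Store one rendered flame graph SVG under the given object name."""
    with open(svg_path, "rb") as f:
        conn.put_object(
            "arclamp",        # placeholder container name
            object_name,      # e.g. "daily/2020-01-01/all.svg"
            contents=f,
            content_type="image/svg+xml",
        )
```

With storage externalised like this, the performance.wikimedia.org Apache proxy would then read from the Swift API rather than from the local filesystem, as described in the second bullet above.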

See also

Event Timeline

Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See the email sent to the task assignee on February 06th 2022 (and T295729).

Please assign this task to yourself again if you still realistically plan to work on this task - it would be welcome.

If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".

Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator.

Given the revised scope of T200108, we won't be needing Swift and thus won't be stateless. It seems a lot simpler to maintain this way than to manage a complex set of local buffering and file syncing back and forth with Swift just to fit it into a stateless k8s pipeline.