Deploy ArcLamp process as stateless/scalable service (Kubernetes)
Closed, Declined · Public

Assigned To

None

Authored By

	Krinkle
	Jul 1 2019, 7:31 PM

Description

Current state

Each primary DC has an independent instance of the ArcLamp Python service, which subscribes to the sampling profiler stream exposed by app servers, writes the samples to local log files on disk, and converts them to flame graphs. A local Apache server exposes this directory via a proxy at https://performance.wikimedia.org/arclamp/.

The service is active-active (1 VM in Eqiad, and an identical provision in Codfw):

There is only one level of redundancy. This is both more than needed and not enough. It is more than needed because everything is done twice. It is not enough because either or both instances could fail with no automatic recovery (e.g. depool, re-provision etc).

Storage is extremely limited (50G local space, T199853). This forces us to continuously delete profiles more than 14 days old, which gives only a very narrow window for perf analysis.
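For illustration only, a minimal sketch of the kind of age-based pruning the 14-day limit implies. This is not the actual ArcLamp cleanup code; the function name `prune_old_profiles` and the use of file modification time as the age signal are assumptions.

```python
import os
import time

RETENTION_DAYS = 14  # retention window mentioned in the task description


def prune_old_profiles(log_dir, now=None, retention_days=RETENTION_DAYS):
    """Delete regular files in log_dir older than the retention window.

    Hypothetical helper: age is judged by mtime, which is an assumption.
    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400
    removed = []
    # Materialise the listing first so deletion does not disturb iteration.
    for entry in list(os.scandir(log_dir)):
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```

With only 50G of local space, a job like this has to run continuously, which is why anything older than two weeks is unavailable for analysis.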

There are no backups (as far as I know).

Outcome

No longer require managing two VMs.
Able to replace the deployed service or hardware without losing data or requiring manual migration.
Able to retain daily data summary graphs for 2 years (requires an estimated 179G per T199853).
Backups (either for us specifically, or as ensured by the storage layer).

Proposal

Migrate Arc Lamp storage from local disk to something external, currently assumed to be Swift. This is expected to take care of backups and local-DC redundancy automatically for the DC-local performance.wikimedia.org proxy, or at least to make it possible to implement redundancy/backups.
Change the Apache logic for https://performance.wikimedia.org/arclamp to read from virtual Swift directories (instead of local disk).
Deploy through Kubernetes instead of Scap (e.g. as a Linux container image).
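As a rough sketch of what the first step of the proposal could look like in code, the writer can be made independent of where the files live by putting a small storage interface in front of it. Everything named here (`ProfileStore`, `LocalDiskStore`, `InMemoryObjectStore`, `publish_flame_graph`, the `daily/` prefix) is hypothetical. The in-memory class only stands in for an external object store such as Swift and does not use the Swift API.

```python
import os
from typing import Dict, List, Protocol


class ProfileStore(Protocol):
    """Hypothetical interface the writer would depend on."""

    def put(self, name: str, data: bytes) -> None: ...
    def get(self, name: str) -> bytes: ...
    def list_names(self, prefix: str = "") -> List[str]: ...


class LocalDiskStore:
    """Current behaviour: objects are files under a local root directory."""

    def __init__(self, root: str) -> None:
        self.root = root

    def put(self, name: str, data: bytes) -> None:
        path = os.path.join(self.root, name)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)

    def get(self, name: str) -> bytes:
        with open(os.path.join(self.root, name), "rb") as f:
            return f.read()

    def list_names(self, prefix: str = "") -> List[str]:
        found = []
        for dirpath, _dirs, files in os.walk(self.root):
            for filename in files:
                rel = os.path.relpath(os.path.join(dirpath, filename), self.root)
                rel = rel.replace(os.sep, "/")
                if rel.startswith(prefix):
                    found.append(rel)
        return sorted(found)


class InMemoryObjectStore:
    """Stand-in for an external object store; NOT the Swift client API."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, name: str, data: bytes) -> None:
        self._objects[name] = data

    def get(self, name: str) -> bytes:
        return self._objects[name]

    def list_names(self, prefix: str = "") -> List[str]:
        return sorted(n for n in self._objects if n.startswith(prefix))


def publish_flame_graph(store: ProfileStore, day: str, svg: bytes) -> str:
    """Write one daily SVG through the interface; the naming is an assumption."""
    name = f"daily/{day}.svg"
    store.put(name, svg)
    return name
```

With a split like this, the process keeps no state of its own once the external backend is in place, which is the property a stateless Kubernetes deployment needs.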

Related Objects

Status	Assigned	Task
Resolved	aaron	T247717 Reduce flamegraph.pl threshold from minwidth=2 to minwidth=1
Declined	None	T227026 Deploy ArcLamp process as stateless/scalable service (Kubernetes)
Resolved	• dpifke	T244776 Swift container for performance flame graphs (ArcLamp)
Resolved	• dpifke	T200109 Maintain and deploy Arc Lamp code from its own repository (outside Puppet)

Event Timeline

Krinkle created this task. · Jul 1 2019, 7:31 PM

Restricted Application added a subscriber: Aklapper. · Jul 1 2019, 7:31 PM

• kchapman moved this task from Inbox, needs triage to To-do: Goals, prioritized next 4 Quarters on the Performance-Team board. · Jul 1 2019, 7:49 PM

Krinkle added a project: Arc-Lamp. · Aug 5 2019, 5:52 PM

• Gilles assigned this task to • dpifke. · Jan 7 2020, 11:38 AM

• dpifke mentioned this in T243916: arclamp-generate-svgs overlaps itself. · Feb 3 2020, 5:49 PM

• dpifke merged a task: T243916: arclamp-generate-svgs overlaps itself.

• dpifke subscribed.

• dpifke mentioned this in T244776: Swift container for performance flame graphs (ArcLamp). · Feb 10 2020, 7:19 PM

• dpifke moved this task from To-do: Goals, prioritized next 4 Quarters to Doing (old) on the Performance-Team board. · Feb 10 2020, 9:31 PM

• dpifke mentioned this in T235456: Let Arc-Lamp store its trace "log" files in compressed format. · Feb 18 2020, 9:38 PM

Krinkle mentioned this in T247717: Reduce flamegraph.pl threshold from minwidth=2 to minwidth=1. · Mar 15 2020, 11:36 PM

Krinkle added a subtask: T244776: Swift container for performance flame graphs (ArcLamp).

Krinkle added a parent task: T247717: Reduce flamegraph.pl threshold from minwidth=2 to minwidth=1.

Krinkle added a parent task: T200108: Increase retention of ArcLamp SVGs to 2 years.

Krinkle added a subtask: T200109: Maintain and deploy Arc Lamp code from its own repository (outside Puppet).

Krinkle mentioned this in T200109: Maintain and deploy Arc Lamp code from its own repository (outside Puppet). · Mar 15 2020, 11:42 PM

Krinkle closed subtask T200109: Maintain and deploy Arc Lamp code from its own repository (outside Puppet) as Resolved. · Mar 16 2020, 9:34 PM

Krinkle added a parent task: T235455: Resolve arclamp disk exhaustion problem (Oct 2019). · Jun 22 2020, 8:07 PM

Krinkle removed a parent task: T235455: Resolve arclamp disk exhaustion problem (Oct 2019).

Krinkle removed a parent task: T200108: Increase retention of ArcLamp SVGs to 2 years.

Krinkle mentioned this in T200108: Increase retention of ArcLamp SVGs to 2 years. · Jun 22 2020, 8:10 PM

• dpifke reopened subtask T200109: Maintain and deploy Arc Lamp code from its own repository (outside Puppet) as Open. · Jun 23 2020, 9:51 PM

• dpifke closed subtask T200109: Maintain and deploy Arc Lamp code from its own repository (outside Puppet) as Resolved. · Jul 1 2020, 10:11 PM

• dpifke moved this task from Doing (old) to To-do: Goals, prioritized next 4 Quarters on the Performance-Team board. · Jul 6 2020, 8:26 PM

• Gilles moved this task from To-do: Goals, prioritized next 4 Quarters to Abstract Wikipedia matrixing on the Performance-Team board. · Mar 18 2021, 1:49 PM

• Gilles moved this task from Abstract Wikipedia matrixing to Doing: Goals on the Performance-Team board. · May 20 2021, 2:07 PM

• dpifke moved this task from Doing: Goals to To-do: Goals, prioritized next 4 Quarters on the Performance-Team board. · Oct 18 2021, 6:45 PM

Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See the email sent to the task assignee on February 06th 2022 (and T295729).

Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome.

If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".

Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator.

Krinkle closed subtask T244776: Swift container for performance flame graphs (ArcLamp) as Resolved. · Oct 4 2022, 7:35 PM

Given the revised scope of T200108, we won't be needing Swift, and thus the service won't be stateless. It seems a lot simpler to maintain it this way than to manage a complex set of local buffering and file syncing back and forth with Swift just to fit it into a stateless k8s pipeline.
