Expand RAM on arclamp hosts and move them to baremetal
Closed, Resolved (Public)

Description

  • Location:
    • Eqiad webperf1004 VM -> arclamp1001,
    • Codfw webperf2004 VM -> arclamp2001
  • Memory: 16GB -> 32GB

The number of entry points and flamegraph variants (excimer-wall, excimer-cpu, k8s, etc.) has increased, and so has the number of daily samples, now that the MW REST API and Parsoid PHP exist and are essentially starting to take over traffic from what was, or would have been, RESTBase.

It seems we were already hugging the limits before, but we're now over the limit more often than not during each run, which causes some flamegraphs not to be generated until a lucky attempt later on.

Details at T315056: arclamp_generate_svgs OOMs.

We'll try to find upstream solutions that make it more memory-efficient overall, but for now, given this isn't runaway or uncontrolled growth, a doubling of memory should easily gain us a few years.

Prior history:

[..]
18:28 < mutante> i will claim it's resolved and if it turns out you need more than 16 then let's reopen it and that discussion from the start
18:29 < mutante> because then they would become special cases among the other VMs

Event Timeline

Let's move these to a baremetal host instead? We're hitting some limits of what makes sense with Ganeti for these. One other issue is the high rate of memory changes, which currently means these can only be migrated to a different host with hacks like temporarily stopping the arclamp processes. We always buy a few spare hosts for unforeseen machine needs, so I guess the hardware is available.

That would also be a fine opportunity to move away from the confusing naming scheme, given that webperf1003 and 1004 are totally different services, something like xenon* or arclamp* instead.

+1 to not using the same names for the different webperf roles. I thought the same before; the names should match the puppet role more closely.

And yeah, like the history says, the discussion was to start from scratch once we get over the 16GB RAM limit. Hardware sounds like the right way indeed.

Looking at CPU and disk usage (currently 150ish since some data is now on Swift) and the desired RAM, servers with "config A" would do just fine.

Dzahn mentioned this in Unknown Object (Task).Sep 1 2022, 8:26 PM
Dzahn mentioned this in Unknown Object (Task).Sep 1 2022, 8:30 PM
Dzahn added subtasks: Unknown Object (Task), Unknown Object (Task).

@MoritzMuehlenhoff @Krinkle I made procurement subtasks. There are 2 because the template says it needs to be limited to a specific DC. Please take a look and let me know if you have any comments, for example on the sections about disk space and RAID. I am just saying "as long as it has 32GB RAM and matches the VM (or more, obviously)" to keep it as flexible as possible for dcops.

@Dzahn LGTM. Thanks for filing those. I would suggest for the hostnames to go with arclamp#001.

jbond triaged this task as Medium priority.Sep 6 2022, 2:41 PM

Retitling the task and dropping vm-requests

MoritzMuehlenhoff renamed this task from Resize webperf1004/2004 VM for arc-lamp to Expand RAM on arclamp hosts and move them to baremetal.Sep 6 2022, 2:43 PM

Is there a more specific tag we can use for this instead of SRE? perhaps serviceops?

Jclark-ctr closed subtask Unknown Object (Task) as Resolved.Nov 4 2022, 2:23 PM

Yeah, this one is in a bit of a grey area regarding which SRE team is the best one to deal with it, but serviceops is the most adjacent one given MediaWiki. We'll take over this one.

@akosiaris Thank you! So what this is is:

  1. Hardware has been procured (I reviewed/approved in T316906 (eqiad) and T316907 (codfw)). Done, I think, though the dcops-codfw task looks open ->
  2. dcops racks the hardware in T319433 (eqiad) and T319428 (codfw). Pending ->
  3. T319434 (eqiad) and T319429 (codfw) are supposed to be assigned to someone by dcops, and that would basically be this ticket.

Papaul closed subtask Unknown Object (Task) as Resolved.Nov 14 2022, 11:49 PM

Per 28f86674054b7 observability has taken over arclamp. It aligns more closely with their area of expertise and focus than serviceops. I thank them for this.

fgiunchedi claimed this task.
fgiunchedi subscribed.

All done! arclamp now lives on baremetal hosts with plenty of memory to spare.

Change 887771 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] arclamp: Remove rsync::quickdatacopy

https://gerrit.wikimedia.org/r/887771

Change 887771 merged by Muehlenhoff:

[operations/puppet@production] arclamp: Remove rsync::quickdatacopy

https://gerrit.wikimedia.org/r/887771