
More RAM needed for webperf1002 and webperf2002
Closed, Resolved · Public

Description

As documented in T259167, the input data for the ArcLamp pipeline has been steadily increasing over time, and the memory required to process the full-day files now exceeds the available RAM on webperf1002.eqiad and webperf2002.codfw. (ArcLamp is a sampling profiler, which produces the flamegraphs seen at https://performance.wikimedia.org/arclamp from PHP stack traces taken at regular intervals in production; it is an essential tool for identifying performance regressions.)

In I247b774d and Ic697ad6b, I reduced RAM usage by ensuring only one graph is produced at a time; however, we continue to see OOMs.

These hosts are Ganeti instances with (currently) 8GB of RAM. We would like to increase the RAM for these instances to at least 16GB, or 32GB if possible.

Event Timeline

dpifke created this task. Aug 11 2020, 7:11 PM
Restricted Application added a project: Operations. Aug 11 2020, 7:11 PM
Restricted Application added subscribers: Gilles, Aklapper.
ThesenatorO5-2 triaged this task as Medium priority. Aug 12 2020, 12:39 AM
Krinkle raised the priority of this task from Medium to High. Aug 12 2020, 12:50 AM
Krinkle added a subscriber: Krinkle.

@ThesenatorO5-2 Please explain why you triaged the priority of this task.

ThesenatorO5-2 raised the priority of this task from High to Unbreak Now!. Aug 12 2020, 12:54 AM
This comment was removed by ThesenatorO5-2.

I did not find the two mentioned files.

Krinkle lowered the priority of this task from Unbreak Now! to High. Aug 12 2020, 1:02 AM

Please stop changing the priority of random tasks. I will ask for your account to be blocked if you continue.

dpifke moved this task from Inbox to Radar on the Performance-Team board. Aug 13 2020, 11:47 PM
dpifke edited projects, added Performance-Team (Radar); removed Performance-Team.

Mentioned in SAL (#wikimedia-operations) [2020-08-19T18:15:42Z] <mutante> rebooting webperf2002 VM on ganeti level (outside OS) to upgrade from 8 to 16GB RAM (T260192)

Mentioned in SAL (#wikimedia-operations) [2020-08-19T18:25:10Z] <mutante> rebooting webperf1002 VM on ganeti level (outside OS) to upgrade from 8 to 16GB RAM (T260192)

Dzahn closed this task as Resolved. Aug 19 2020, 6:34 PM
Dzahn claimed this task.
Dzahn added a subscriber: Dzahn.
17:54 < mutante> hi all, I could use some input on determining where we want to draw the line between ganeti VM and hardware. So .. the existing ganeti VMs usually have RAM between 1G and 8G, only 2 special cases with 16GB, puppetdb and deneb. none have more than 16. Now we have a request to upgrade the webperf VMs from 8G to 16G or 32G if possible. I looked at "gnt-node list" to try to determine how much is left
17:54 < mutante> that can be allocated and the MFree column tells me it is anything from 12 up to 62 depending on the node. So I am not sure how to calculate it but either way it seems to mean we can't do 2 x 32GB. So is this the line where we say webperf* should move to metal ?
17:55 < mutante> ..or does it just mean the ganeti cluster needs more resources

17:57 < mutante> dpifke: just a cc: for you ^
18:00 < dpifke> If it wasn't clear from the request, only webperfX002 (one VM in each cluster) needs more RAM.  webperfX001 can stay with 8GB.
18:02 < dpifke> Given the X001 hosts have different function than X002 hosts, perhaps we use this as an excuse to rename them.
18:02 < mutante> ah yes, one per cluster
18:02 < mutante> dpifke: yea, actually.. let's do the renaming
18:02 < dpifke> I *think* arclamp is the only thing running on X002, let me double check.
18:04 < mutante> i guess i can do "8 GB more" and first try it in codfw.  so the docs say it can be done without downtime
...
18:16 < mutante> i had to reboot on ganeti level, not OS level. but that is very quick
18:17 < mutante> [webperf2002:~] $ cat /proc/meminfo 
18:17 < mutante> MemTotal:       16436708 kB
18:17 < mutante> dpifke: there you go, so the downtime was very short
18:17 < dpifke> Looks good on this end, thanks!
......
18:28 < mutante> i will claim it's resolved and if it turns out you need more than 16 then let's reopen it and that discussion from the start
18:29 < mutante> because then they would become special cases among the other VMs
18:30 < dpifke> OK.  The long-term plan is to have this run as a stateless service of some sort rather than in a VM.  So hopefully 16GB will hold us over until then.
18:31 < mutante> alright. sounds good
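
For anyone repeating the check in the log above, here is a minimal Python sketch of reading `MemTotal` from `/proc/meminfo`-style text and comparing it against the nominal allocation. The function names and the 5% tolerance are assumptions made for illustration and are not part of any tooling used on these hosts. The tolerance exists because the kernel reports slightly less than the nominal size, as the 16436708 kB figure above shows for a 16GB instance.

```python
import re

KIB_PER_GIB = 1024 * 1024  # the "kB" unit in /proc/meminfo is 1024 bytes


def parse_memtotal_kib(meminfo_text):
    """Return the MemTotal value, in kB as reported, from /proc/meminfo-style text."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB$", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("no MemTotal line found")
    return int(match.group(1))


def kib_to_gib(kib):
    """Convert a kB (KiB) count to GiB."""
    return kib / KIB_PER_GIB


def looks_like_allocation(kib, nominal_gib, tolerance=0.05):
    """Hypothetical check: is the reported total within `tolerance` below the nominal size?

    The 5% default is an assumed margin, not a documented value.
    """
    nominal_kib = nominal_gib * KIB_PER_GIB
    return nominal_kib * (1 - tolerance) <= kib <= nominal_kib


# Sample line copied from the log above; on a live host, read /proc/meminfo instead.
sample = "MemTotal:       16436708 kB\n"
total = parse_memtotal_kib(sample)
print(f"{kib_to_gib(total):.2f} GiB")
print("matches 16GB allocation:", looks_like_allocation(total, 16))
```

On a live host the same functions could be fed the contents of `/proc/meminfo` directly, for example `parse_memtotal_kib(open("/proc/meminfo").read())`.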