More RAM needed for webperf1002 and webperf2002
Closed, Resolved, Public

Assigned To

Authored By

	• dpifke
	Aug 11 2020, 7:11 PM

Description

As documented in T259167, the input data for the ArcLamp pipeline has been steadily increasing over time, and the memory required to process the full-day files now exceeds the available RAM on webperf1002.eqiad and webperf2002.codfw. (ArcLamp is a sampling profiler, which produces the flamegraphs seen at https://performance.wikimedia.org/arclamp from PHP stack traces taken at regular intervals in production; it is an essential tool for identifying performance regressions.)

In I247b774d and Ic697ad6b, I reduced RAM usage by ensuring that only one graph is produced at a time; however, we continue to see OOMs.
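
To illustrate why serializing graph generation helps but may not be enough, here is a minimal sketch. The job sizes and the `peak_memory_gb` helper are hypothetical, not taken from the ArcLamp code; the model simply assumes the worst case, where the largest jobs overlap.

```python
def peak_memory_gb(job_sizes_gb, max_concurrent):
    """Worst-case peak memory if the largest jobs happen to overlap.

    job_sizes_gb: per-job memory needs (hypothetical figures).
    max_concurrent: how many jobs may run at the same time.
    """
    largest = sorted(job_sizes_gb, reverse=True)[:max_concurrent]
    return sum(largest)


# Hypothetical per-graph memory needs, in GB (assumption, not measured data).
jobs = [9, 3, 2]
host_ram_gb = 8  # current RAM of the Ganeti instances, per this task

for concurrency in (3, 1):
    peak = peak_memory_gb(jobs, concurrency)
    verdict = "fits" if peak <= host_ram_gb else "exceeds RAM"
    print(f"concurrency={concurrency}: peak {peak} GB ({verdict})")
```

Under these assumed numbers, running one job at a time lowers the peak from the sum of all jobs to the size of the largest one, but a single job that needs more than the host's RAM still runs out of memory.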

These hosts are Ganeti instances with (currently) 8GB of RAM. We would like to increase the RAM for these instances to at least 16GB, or 32GB if possible.

Related Objects

		Status	Subtype	Assigned	Task
		Duplicate		None	T259167 Truncated ArcLamp output files
		Resolved		Dzahn	T260192 More RAM needed for webperf1002 and webperf2002

Event Timeline

• dpifke created this task. Aug 11 2020, 7:11 PM

Restricted Application added a project: SRE. Aug 11 2020, 7:11 PM

Restricted Application added subscribers: • Gilles, Aklapper.

• dpifke added a parent task: T259167: Truncated ArcLamp output files. Aug 11 2020, 7:13 PM

• ThesenatorO5-2 triaged this task as Medium priority. Aug 12 2020, 12:39 AM

• ThesenatorO5-2 subscribed.

@ThesenatorO5-2 Please explain why you triaged the priority of this task.

• ThesenatorO5-2 raised the priority of this task from High to Unbreak Now!. Aug 12 2020, 12:54 AM

This comment was removed by • ThesenatorO5-2.

I did not find the two mentioned file(s)

Please stop changing the priority of random tasks. I will ask for your account to be blocked if you continue.

@ThesenatorO5-2: Please read https://www.mediawiki.org/wiki/Bug_management/Phabricator_etiquette, https://www.mediawiki.org/wiki/How_to_report_a_bug, and https://www.mediawiki.org/wiki/Phabricator/Project_management#Setting_task_priorities and do not set priority or assign random people. Thank you.

• dpifke moved this task from Inbox, needs triage to Radar on the Performance-Team board. Aug 13 2020, 11:47 PM

• dpifke edited projects, added Performance-Team (Radar); removed Performance-Team.

Mentioned in SAL (#wikimedia-operations) [2020-08-19T18:15:42Z] <mutante> rebooting webperf2002 VM on ganeti level (outside OS) to upgrade from 8 to 16GB RAM (T260192)

Mentioned in SAL (#wikimedia-operations) [2020-08-19T18:25:10Z] <mutante> rebooting webperf1002 VM on ganeti level (outside OS) to upgrade from 8 to 16GB RAM (T260192)

17:54 < mutante> hi all, I could use some input on determining where we want to draw the line between ganeti VM and hardware. So .. the existing ganeti VMs usually have RAM between 1G and 8G, only 2 special cases with 16GB, puppetdb and deneb. none have more than 16.  Now we have a request to upgrade the webperf VMs from 8G to 16G or 32G if possible. I  looked at "gnt-node list" to try to determine how much is left 
17:54 < mutante> that can be allocated and the MFree column tells me it is anything from 12 up to 62 depending on the node. So I am not sure how to calculate it but either way it seems to mean we can't do 2 x 32GB. So is this the line where we say webperf* should move to metal ?
17:55 < mutante> ..or does it just mean the ganeti cluster needs more resources

17:57 < mutante> dpifke: just a cc: for you ^
18:00 < dpifke> If it wasn't clear from the request, only webperfX002 (one VM in each cluster) needs more RAM.  webperfX001 can stay with 8GB.
18:02 < dpifke> Given the X001 hosts have different function than X002 hosts, perhaps we use this as an excuse to rename them.
18:02 < mutante> ah yes, one per cluster
18:02 < mutante> dpifke: yea, actually.. let's do the renaming
18:02 < dpifke> I *think* arclamp is the only thing running on X002, let me double check.
18:04 < mutante> i guess i can do "8 GB more" and first try it in codfw.  so the docs say it can be done without downtime
...
18:16 < mutante> i had to reboot on ganeti level, not OS level. but that is very quick
18:17 < mutante> [webperf2002:~] $ cat /proc/meminfo 
18:17 < mutante> MemTotal:       16436708 kB
18:17 < mutante> dpifke: there you go, so the downtime was very short
18:17 < dpifke> Looks good on this end, thanks!
......
18:28 < mutante> i will claim it's resolved and if it turns out you need more than 16 then let's reopen it and that discussion from the start
18:29 < mutante> because then they would become special cases among the other VMs
18:30 < dpifke> OK.  The long-term plan is to have this run as a stateless service of some sort rather than in a VM.  So hopefully 16GB will hold us over until then.
18:31 < mutante> alright. sounds good
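
The two checks discussed in the log above can be sketched as follows. This is only an illustration: the node names and the `nodes_with_room` helper are hypothetical, the MFree figures reuse the "12 up to 62" range mentioned in the chat, and the check ignores Ganeti specifics such as secondary nodes or reserved memory.

```python
from typing import Dict, List


def mem_total_gib(meminfo_text: str) -> float:
    """Return MemTotal from /proc/meminfo-style text, converted to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])  # /proc/meminfo reports this value in kB
            return kib / (1024 * 1024)
    raise ValueError("MemTotal not found")


def nodes_with_room(mfree_gb: Dict[str, int], extra_gb: int) -> List[str]:
    """List nodes whose free memory covers the additional RAM requested."""
    return sorted(name for name, free in mfree_gb.items() if free >= extra_gb)


# Value pasted in the chat after the resize of webperf2002.
sample = "MemTotal:       16436708 kB\n"
print(round(mem_total_gib(sample), 2))  # 15.68

# Hypothetical nodes using the lowest and highest MFree values from the chat.
mfree = {"node-a": 12, "node-b": 62}
print(nodes_with_room(mfree, 16 - 8))   # growing 8 GB -> 16 GB needs 8 GB more
print(nodes_with_room(mfree, 32 - 8))   # growing 8 GB -> 32 GB needs 24 GB more
```

With these assumed values, an 8 GB increase fits on either node, while a 24 GB increase fits only on the node with 62 GB free, which is consistent with the decision to stop at 16 GB.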

Krinkle mentioned this in T315056: arclamp_generate_svgs OOMs. Aug 12 2022, 8:29 AM

Krinkle mentioned this in T316223: Expand RAM on arclamp hosts and move them to baremetal. Aug 25 2022, 1:43 PM
