Investigate memory increase for Prometheus hosts in codfw/eqiad
Closed, Resolved, Public

Description

As per the parent task, Prometheus k8s in codfw/eqiad gets OOM-killed from time to time.
As one of the short-term mitigations, I think we should try scaling the hardware up vertically, namely by increasing memory.
The current R440 hosts have 4x32GB RAM each. @wiki_willy, would you mind helping check whether we have memory available on site to install? Ideally >= 64GB per host (4 hosts in total: 2 in eqiad and 2 in codfw).
And if it's not immediately available, could we order the memory?

thank you!

Event Timeline

@Papaul / @Jhancock.wm and @Jclark-ctr / @VRiley-WMF - can you see if you have any spare memory onsite for Filippo? I think it's for prometheus100[5,6] and prometheus200[5,6]. (cc @RobH in case we have to order them)

https://netbox.wikimedia.org/search/?q=prometheus&obj_type=

Thanks,
Willy

at codfw we have
two 32GB 2Rx4 PC4 2666V
and
one 32GB 2Rx4 PC4 2400V

we also have eight 16GB 2Rx4 PC3L, but I am not sure if they are compatible

Awesome, thanks @Jhancock.wm. Here's the codfw upgrade ticket for you to coordinate with @fgiunchedi on the downtime - T354685. Thanks, Willy

At eqiad we have plenty of 32GB 2Rx4 PC4 2666V (4+) available.

Thanks @VRiley-WMF. I have T354684 assigned over to you, so you can work with @fgiunchedi on coordinating downtime for the upgrades. Thanks, Willy

fgiunchedi claimed this task.

Resolving this since codfw is done and eqiad is tracked in T354684