Increase memory available for an-launcher1001
Closed, Resolved, Public

Description

The VM 'an-launcher1001' is very important for Analytics, since it schedules most of the regular jobs that run daily on the Hadoop cluster/infra. In theory its memory usage should be low, but in practice some processes eat more memory than usual and end up hitting OOM.

The VM currently has 8g of RAM; I'd like to bump it up to 12g if possible.

Event Timeline

elukey triaged this task as High priority. Jun 1 2020, 6:22 AM
elukey created this task.

This is the current status for eqiad:

elukey@ganeti1003:~$ sudo gnt-node list
Node                   DTotal  DFree MTotal MNode MFree Pinst Sinst
ganeti1001.eqiad.wmnet 707.4G   9.2G  62.9G 36.1G 26.4G    11    11
ganeti1002.eqiad.wmnet 707.4G  69.2G  62.9G 40.1G 22.2G    10    11
ganeti1003.eqiad.wmnet 707.4G 244.6G  62.9G 39.7G 20.9G    11    11
ganeti1004.eqiad.wmnet 707.4G 285.1G  62.9G 56.1G  6.4G    10     8
ganeti1005.eqiad.wmnet   2.1T 225.8G  62.8G 17.2G 45.5G     6    18
ganeti1006.eqiad.wmnet   2.1T 886.8G  62.8G 53.5G  9.0G    12     5
ganeti1007.eqiad.wmnet   2.1T   1.1T  62.8G 45.5G 15.2G    12    12
ganeti1008.eqiad.wmnet   2.1T 914.5G  62.8G 48.3G 10.8G    13     6

The VM seems to be in row C, on ganeti1001:

Nodes:
  - primary: ganeti1001.eqiad.wmnet
    group: row_C 
  - secondaries: ganeti1002.eqiad.wmnet (group row_C)

So in theory MFree is enough, assuming that I only need to check the primary/secondary nodes for enough resources. @MoritzMuehlenhoff @akosiaris any thoughts?

That's more or less correct. There is a chance the automatic allocator might complain in some future rebalancing, but there's enough memory in all hardware nodes to not be afraid in this case.
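
A minimal sketch of the check discussed above: parse the `gnt-node list` output and confirm that the primary and secondary nodes each have at least 4G of MFree for the 8g to 12g bump. The function names (`to_gib`, `free_memory_gib`, `has_headroom`) are hypothetical helpers, the sample rows are copied from the listing in this task, and treating the G/T suffixes as binary units is an assumption. It does not model what the automatic allocator does during rebalancing.

```python
from __future__ import annotations

# Rows copied from the `gnt-node list` output in this task (row C nodes only).
GNT_NODE_LIST = """\
Node                   DTotal  DFree MTotal MNode MFree Pinst Sinst
ganeti1001.eqiad.wmnet 707.4G   9.2G  62.9G 36.1G 26.4G    11    11
ganeti1002.eqiad.wmnet 707.4G  69.2G  62.9G 40.1G 22.2G    10    11
ganeti1003.eqiad.wmnet 707.4G 244.6G  62.9G 39.7G 20.9G    11    11
ganeti1004.eqiad.wmnet 707.4G 285.1G  62.9G 56.1G  6.4G    10     8
"""

# Assumption: suffixes are binary units (M, G, T).
UNITS_IN_GIB = {"M": 1 / 1024, "G": 1.0, "T": 1024.0}


def to_gib(size: str) -> float:
    """Convert a size such as '26.4G' or '2.1T' to GiB."""
    return float(size[:-1]) * UNITS_IN_GIB[size[-1]]


def free_memory_gib(listing: str) -> dict[str, float]:
    """Map each node name to its MFree value in GiB."""
    lines = listing.strip().splitlines()
    mfree_col = lines[0].split().index("MFree")
    free = {}
    for row in lines[1:]:
        fields = row.split()
        free[fields[0]] = to_gib(fields[mfree_col])
    return free


def has_headroom(listing: str, nodes: list[str], extra_gib: float) -> dict[str, bool]:
    """Report, per node, whether MFree covers the extra memory requested."""
    free = free_memory_gib(listing)
    return {node: free[node] >= extra_gib for node in nodes}


# Primary and secondary of an-launcher1001, as reported above.
result = has_headroom(
    GNT_NODE_LIST,
    ["ganeti1001.eqiad.wmnet", "ganeti1002.eqiad.wmnet"],
    extra_gib=4,  # 12g requested minus 8g currently allocated
)
print(result)
```

With the figures in this task, both nodes report enough free memory (26.4G and 22.2G against the 4G needed).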

elukey@ganeti1003:~$ sudo gnt-instance modify -B memory=12g an-launcher1001.eqiad.wmnet
Modified instance an-launcher1001.eqiad.wmnet
 - be/memory -> 12288
Please don't forget that most parameters take effect only at the next (re)start of the instance initiated by ganeti; restarting from within the instance will not be enough.
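
For reference, the `be/memory -> 12288` line above is the requested 12g expressed in MiB. A small sketch of that arithmetic follows; `to_mib` is a hypothetical helper, and treating the suffixes as binary units and bare numbers as MiB are assumptions consistent with the output shown above.

```python
# Multipliers to MiB. Binary units are an assumption that matches the
# "memory=12g" -> "be/memory -> 12288" line in the command output.
TO_MIB = {"m": 1, "g": 1024, "t": 1024 * 1024}


def to_mib(size: str) -> int:
    """Convert a size such as '12g' to MiB."""
    size = size.strip().lower()
    if size[-1] in TO_MIB:
        return int(float(size[:-1]) * TO_MIB[size[-1]])
    return int(size)  # assumption: a bare number is already in MiB


old_mib = to_mib("8g")
new_mib = to_mib("12g")
print(new_mib, new_mib - old_mib)  # 12288 4096
```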

Mentioned in SAL (#wikimedia-operations) [2020-06-01T14:53:57Z] <elukey> ganeti: increase memory available for an-launcher1001 from 8g to 12g - T254125

There are still some important jobs running so I cannot reboot the instance, will do it hopefully tomorrow :)

Mentioned in SAL (#wikimedia-operations) [2020-06-03T17:08:33Z] <elukey> ganeti: gnd-instance reboot an-launcher1001 to get new memory settings - T254125

Milimetric subscribed.

Can we re-enable reportupdater on the machine now?

Already done a couple of days ago :)