
CloudVPS: cloudvirtan1002 puppet failures due to memory allocation issues?
Closed, Resolved · Public

Description

I saw this on IRC:

12:29 <+icinga-wm> PROBLEM - puppet last run on cloudvirtan1002 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 6 minutes ago with 6 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Service[rsyslog],Exec[x509-bundle labvirt-star.eqiad.wmnet-chained],Exec[x509-bundle labvirt-star.eqiad.wmnet-chain]

And checked:

aborrero@cloudvirtan1002:~ $ sudo puppet agent -t -v
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for cloudvirtan1002.eqiad.wmnet
Notice: /Stage[main]/Base::Environment/Tidy[/var/tmp/core]: Tidying 0 files
Info: Applying configuration version '1550749226'
Error: /Stage[main]/Main/Node[__node_regexp__cloudvirtan1001-5.eqiad.wmnet]/Interface::Add_ip6_mapped[main]/Exec[eth0_v6_token]: Could not evaluate: Cannot allocate memory - fork(2)
Error: /Stage[main]/Rsyslog/Service[rsyslog]: Could not evaluate: Cannot allocate memory - fork(2)
Error: /Stage[main]/Openstack::Nova::Compute::Service/Sslcert::Certificate[labvirt-star.eqiad.wmnet]/Sslcert::Chainedcert[labvirt-star.eqiad.wmnet]/Exec[x509-bundle labvirt-star.eqiad.wmnet-chained]: Could not evaluate: Cannot allocate memory - fork(2)
Error: /Stage[main]/Openstack::Nova::Compute::Service/Sslcert::Certificate[labvirt-star.eqiad.wmnet]/Sslcert::Chainedcert[labvirt-star.eqiad.wmnet]/Exec[x509-bundle labvirt-star.eqiad.wmnet-chain]: Could not evaluate: Cannot allocate memory - fork(2)
[...]
Error: /Stage[main]/Nrpe/Base::Service_unit[nagios-nrpe-server]/Service[nagios-nrpe-server]: Could not evaluate: Cannot allocate memory - fork(2)
Error: /Stage[main]/Main/Node[__node_regexp__cloudvirtan1001-5.eqiad.wmnet]/Interface::Add_ip6_mapped[main]/Interface::Ip[main]/Exec[ip addr add 2620:0:861:118:10:64:20:45/64 dev eth0]: Could not evaluate: Cannot allocate memory - fork(2)
Error: /Stage[main]/Admin/Admin::Groupmembers[absent]/Exec[absent_ensure_members]: Could not evaluate: Cannot allocate memory - fork(2)
Error: /Stage[main]/Admin/Admin::Groupmembers[ops]/Exec[ops_ensure_members]: Could not evaluate: Cannot allocate memory - fork(2)
Error: /Stage[main]/Admin/Admin::Groupmembers[wikidev]/Exec[wikidev_ensure_members]: Could not evaluate: Cannot allocate memory - fork(2)
Error: /Stage[main]/Admin/Admin::Groupmembers[ops-adm-group]/Exec[adm_ensure_members]: Could not evaluate: Cannot allocate memory - fork(2)
Error: /Stage[main]/Admin/Admin::Groupmembers[wmcs-roots]/Exec[wmcs-roots_ensure_members]: Could not evaluate: Cannot allocate memory - fork(2)
[...]
Notice: Applied catalog in 13.42 seconds

However, the server still shows almost 1GB of RAM free:

aborrero@cloudvirtan1002:~$ free -m
             total       used       free     shared    buffers     cached
Mem:        128847     128085        762       1825         17       1974
-/+ buffers/cache:     126092       2754
Swap:            0          0          0

I didn't see anything relevant in dmesg or syslog.
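(Editor's note, not part of the original report: on Linux, fork(2) can fail with ENOMEM even when free shows some memory available, because fork has to commit a copy-on-write duplicate of the parent's address space and this host has no swap to back it. A few standard commands that can help narrow this down:

# Top memory consumers by resident set size
ps aux --sort=-rss | head -n 10

# Commit accounting: fork can start failing once Committed_AS approaches CommitLimit
grep -iE 'commitlimit|committed_as' /proc/meminfo

# Overcommit policy (0 = heuristic, 1 = always allow, 2 = strict)
sysctl vm.overcommit_memory vm.overcommit_ratio
)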

If they are running out of memory, we could consider enabling KSM (https://en.wikipedia.org/wiki/Kernel_same-page_merging) once the security implications have been reviewed.
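(Editor's note: a minimal sketch of what enabling KSM usually involves, using the standard sysfs interface; this was not done on this host. QEMU marks guest memory as mergeable by default, so the main steps are starting the scanner and checking whether it actually merges anything:

# Start the KSM scanner (0 = stop, 1 = run)
echo 1 | sudo tee /sys/kernel/mm/ksm/run

# See whether any pages are actually being merged
cat /sys/kernel/mm/ksm/pages_shared /sys/kernel/mm/ksm/pages_sharing
)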

Event Timeline


As far as I know, this host is still not actively used; we hit a dead end after deploying HDFS/Presto on the above virts. Is there a way to figure out what is using all that memory?

GTirloni added a comment (edited). · Feb 21 2019, 12:26 PM

This server has 128GB of RAM. There are two VMs currently running on it:

  • ca-worker-2 - 122GB
  • canary-an1002-01 - 2GB (VM used by monitoring)

That leaves roughly 4GB for system processes, which is clearly not enough. We should keep at least 10% of RAM available at all times for maintenance tasks.

My suggestion is to change the analytics-hadoop-worker flavor from 122GB to 112GB. Would that be acceptable for the Hadoop workloads?
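(Editor's note on the arithmetic: keeping ~10% of the 128GB host free leaves about 115GB for guests, and subtracting the 2GB canary gives roughly 112-113GB for the worker. For illustration only, a sketch of how such a change is typically made with the OpenStack CLI; Nova flavors are immutable, so a new flavor is created and the instance resized onto it. The new flavor name and the vcpus/disk values below are placeholders, not taken from this task:

# Hypothetical replacement flavor with 112GB (114688 MB) of RAM
openstack flavor create --ram 114688 --vcpus 16 --disk 280 analytics-hadoop-worker-112g

# Resize the worker onto it; the resize then has to be confirmed once the VM is back up
openstack server resize --flavor analytics-hadoop-worker-112g ca-worker-2
)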

It shouldn't be a big deal; the puppet code for the workers can adjust its own settings based on the memory available. Going to ask @Ottomata since he is following this project more closely, thanks!

Ya we can reduce the VM RAM size for sure. However, most likely all of this hardware is going to be moved out of Cloud VPS and back into prod networks (as soon as we have time to work on that ahhhhhh!). Perhaps we should just delete these VMs now. @JAllemandou do you have any more plans to experiment with data in the Cloud Presto cluster near term?


Is there some task with the rationale/reasons?

https://phabricator.wikimedia.org/T207321#4882266 and below.

tl;dr Running a reliable 'production' service in Cloud VPS today is very difficult. We'd have to build our own tooling for monitoring and alerting, run our own puppetmasters (for puppetdb & exported resources), etc.

No near-term plans for Presto on my side. Tests worked fine; we can move them :) Thanks!

Milimetric triaged this task as High priority. · Feb 21 2019, 5:40 PM
Milimetric assigned this task to Ottomata.
Milimetric added a project: Analytics-Kanban.
Milimetric moved this task from Incoming to Operational Excellence on the Analytics board.

Ok, deleting all instances for now...

Nuria closed this task as Resolved. · Feb 25 2019, 10:39 PM