Mysterious OOM issues on VMs
Open, Needs Triage · Public

Description

Earlier today I restarted two VMs because puppet was reporting OOM: tools-sgeweblight-10-24, and tools-sgeweblight-10-30. Both seem healthy after the reboot.

Now I'm seeing the same issue on another VM: tools-sgeweblight-10-16

root@tools-sgeweblight-10-16:~# run-puppet-agent 
[FATAL] failed to allocate memory
Traceback (most recent call last):
	27: from /usr/bin/puppet:4:in `<main>'
	26: from /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:59:in `require'
	25: from /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:59:in `require'
	24: from /usr/lib/ruby/vendor_ruby/puppet/util/command_line.rb:12:in `<top (required)>'
	23: from /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:59:in `require'
	22: from /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:59:in `require'
	21: from /usr/lib/ruby/vendor_ruby/puppet.rb:302:in `<top (required)>'
	20: from /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:59:in `require'
	19: from /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:59:in `require'
	18: from /usr/lib/ruby/vendor_ruby/puppet/parser.rb:6:in `<top (required)>'
	17: from /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:59:in `require'
	16: from /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:59:in `require'
	15: from /usr/lib/ruby/vendor_ruby/puppet/parser/compiler.rb:8:in `<top (required)>'
	14: from /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:59:in `require'
	13: from /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:59:in `require'
	12: from /usr/lib/ruby/vendor_ruby/puppet/pops.rb:1:in `<top (required)>'
	11: from /usr/lib/ruby/vendor_ruby/puppet/pops.rb:12:in `<module:Puppet>'
	10: from /usr/lib/ruby/vendor_ruby/puppet/pops.rb:47:in `<module:Pops>'
	 9: from /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:59:in `require'
	 8: from /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:59:in `require'
	 7: from /usr/lib/ruby/vendor_ruby/puppet/pops/lookup.rb:96:in `<top (required)>'
	 6: from /usr/lib/ruby/vendor_ruby/puppet/pops/lookup.rb:96:in `require_relative'
	 5: from /usr/lib/ruby/vendor_ruby/puppet/pops/lookup/lookup_adapter.rb:518:in `<top (required)>'
	 4: from /usr/lib/ruby/vendor_ruby/puppet/pops/lookup/lookup_adapter.rb:518:in `require_relative'
	 3: from /usr/lib/ruby/vendor_ruby/puppet/pops/lookup/global_data_provider.rb:2:in `<top (required)>'
	 2: from /usr/lib/ruby/vendor_ruby/puppet/pops/lookup/global_data_provider.rb:2:in `require_relative'
	 1: from /usr/lib/ruby/vendor_ruby/puppet/pops/lookup/configured_data_provider.rb:1:in `<top (required)>'
/usr/lib/ruby/vendor_ruby/puppet/pops/lookup/configured_data_provider.rb:1:in `require_relative': failed to allocate memory (NoMemoryError)

It doesn't /look/ OOM though!

root@tools-sgeweblight-10-16:/var/log# free -h
              total        used        free      shared  buff/cache   available
Mem:          7.8Gi       2.5Gi       198Mi       261Mi       5.1Gi       4.8Gi
Swap:         8.0Gi       1.0Mi       8.0Gi

So, could this be the hypervisor running out of memory? This VM runs on cloudvirt1040:

root@cloudvirt1040:~# free -h
               total        used        free      shared  buff/cache   available
Mem:           503Gi       260Gi       235Gi       7.0Mi       6.8Gi       239Gi
Swap:          975Mi          0B       975Mi

So... what's the problem?

I'm rebooting this VM to resolve the problem, but this ticket is here for documentation next time the same thing happens.
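
For next time: before rebooting, it may be worth capturing the memory-accounting numbers that `free -h` leaves out. The following is only a sketch, not a diagnosis. It assumes a Linux guest with /proc mounted, and the function names (`parse_meminfo`, `commit_headroom_kb`) are made up for illustration. It prints MemAvailable, CommitLimit and Committed_AS from /proc/meminfo, the vm.overcommit_memory setting, and the address-space rlimit of the process running the script. Whether any of these relates to the puppet failure is an open question.

```python
#!/usr/bin/env python3
"""Snapshot memory-accounting numbers that `free -h` does not show.

Illustrative sketch only: the helper names are hypothetical and nothing
here is known to explain the NoMemoryError seen in this ticket.
"""
import resource


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: int}.

    Values are in kB for fields where the kernel prints a unit;
    unitless counters (e.g. HugePages_Total) are kept as plain ints.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def commit_headroom_kb(fields):
    """Return CommitLimit - Committed_AS in kB, or None if a field is missing."""
    try:
        return fields["CommitLimit"] - fields["Committed_AS"]
    except KeyError:
        return None


def read_text(path):
    """Read a file, returning an empty string if it cannot be opened."""
    try:
        with open(path) as fh:
            return fh.read()
    except OSError:
        return ""


def main():
    fields = parse_meminfo(read_text("/proc/meminfo"))
    for key in ("MemAvailable", "CommitLimit", "Committed_AS"):
        print(f"{key}: {fields.get(key)} kB")
    print("commit headroom:", commit_headroom_kb(fields), "kB")

    mode = read_text("/proc/sys/vm/overcommit_memory").strip() or "unavailable"
    print("vm.overcommit_memory:", mode)

    # Limits of *this* process (inherited from the invoking shell);
    # resource.RLIM_INFINITY means no limit is set.
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    print("RLIMIT_AS soft/hard:", soft, hard)
    print("RLIM_INFINITY is:", resource.RLIM_INFINITY)


if __name__ == "__main__":
    main()
```

Run it as root from the same shell that runs `run-puppet-agent`, so the rlimit it reports is the one the puppet run would inherit. Note that the kernel only enforces CommitLimit when vm.overcommit_memory is 2, so a negative headroom matters only in that mode.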

Event Timeline

Restricted Application added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-cloud) [2023-05-31T02:38:24Z] <andrewbogott> rebooted tools-sgeweblight-10-16, T337806

This is happening again on tools-sgeweblight-10-14.tools.eqiad1.wikimedia.cloud

Mentioned in SAL (#wikimedia-cloud) [2023-06-09T19:38:48Z] <andrewbogott> rebooting tools-sgeweblight-10-28 for T337806