
Tool pbbot hitting Java resource limits (OOM errors)
Closed, Resolved · Public

Description

Since 2019-04-10, my Java apps have been reporting out-of-memory errors:

Exception in thread "main" java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached

or (not so often):

Exception in thread "main" java.lang.OutOfMemoryError: Metaspace.

Such errors occur after the application has started successfully. They are sometimes preceded by a warning line: [6.290s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached. Logs are available at /data/project/pbbot/output/*.log.
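To narrow down which resource is running out, it may help to log thread counts and Metaspace usage from inside the application at intervals. The following is a minimal sketch using the standard java.lang.management API; the class name JvmResourceSnapshot and the output format are illustrative assumptions, not part of pbbot, and the pool name "Metaspace" is what HotSpot JVMs typically report.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: summarizes live/peak thread counts and Metaspace usage.
final class JvmResourceSnapshot {

    /** Builds a one-line summary suitable for appending to a log file. */
    static String snapshot() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        sb.append("threads.live=").append(threads.getThreadCount());
        sb.append(" threads.peak=").append(threads.getPeakThreadCount());

        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // Assumption: the JVM exposes a pool whose name contains "Metaspace".
            if (pool.getName().contains("Metaspace")) {
                MemoryUsage usage = pool.getUsage();
                if (usage != null) {
                    sb.append(" metaspace.used=").append(usage.getUsed());
                    // getMax() returns -1 when no maximum is defined.
                    sb.append(" metaspace.max=").append(usage.getMax());
                }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(snapshot());
    }
}
```

A steadily rising live thread count before the failure would point at thread creation, while growing Metaspace usage would point at class loading.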

Event Timeline

Restricted Application added a subscriber: Aklapper. · Apr 11 2019, 11:08 PM

@PeterBowman What -mem X setting are you using for your job? Have you tried increasing that limit?

https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#Allocating_additional_memory

bd808 renamed this task from Native thread limit hit by Java applications to Tool pbbot hitting Java resource limits (OOM errors). · Apr 13 2019, 3:47 PM
bd808 added a project: Tools.
PeterBowman added a comment. · Edited · Apr 13 2019, 4:21 PM

@bd808 Thanks, increasing the -mem value from 5g to 8g made my jobs respond again; I'll try to adjust it to something in between. I was hoping to avoid this: "unable to create native thread" seemed unrelated to the amount of memory (IMO, despite the OOM error class), and 5g was already high compared with the default vmem size specified in that link. I never had to increase this limit in over 3 years, and something changed just a few days ago, so I wanted to leave my feedback here. Please feel free to close this task if it's OK to raise -mem that high.

Edit: currently using -mem 6g.

"something changed just a few days ago"

We have had several Java users notice that the version of Java on the Stretch job grid needs more -mem allocated than the one on the Trusty job grid did. I cannot think of any software-related change in Toolforge in the past days or weeks that would have made this worse. Depending on what your bot does, it may simply be processing more data per run because of a change in the rate of content in the wikis you interact with. That can be tricky to determine for some bots.
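One possible reason a thread-creation failure responds to a higher -mem value: if -mem caps virtual memory (the "vmem" mentioned above), every native thread's stack counts against that cap in addition to the heap and Metaspace. The sketch below is rough arithmetic under that assumption, not a measurement; the class VmemEstimate and the sample figures in main are hypothetical, and only the 1024k stack size comes from the warning line quoted in the description. A real JVM also reserves space for the code cache, GC structures and native libraries, so the result is a lower bound.

```java
// Hypothetical estimator: lower bound on the virtual memory a JVM may reserve.
final class VmemEstimate {

    /** Returns heap + Metaspace + (threads x stack size), in bytes. */
    static long estimateBytes(long maxHeapBytes, long metaspaceBytes,
                              int threadCount, long stackSizeKb) {
        long stackBytes = threadCount * stackSizeKb * 1024L;
        return maxHeapBytes + metaspaceBytes + stackBytes;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: 4 GiB heap, 256 MiB Metaspace, 500 threads.
        long heap = 4L * 1024 * 1024 * 1024;
        long metaspace = 256L * 1024 * 1024;
        long total = estimateBytes(heap, metaspace, 500, 1024);
        System.out.println("estimated reservation (MiB): " + total / (1024 * 1024));
    }
}
```

If the estimate for a given job sits close to the configured -mem value, either raising -mem or lowering the heap size, thread count or per-thread stack size (for example with the standard -Xmx and -Xss JVM options) may restore headroom.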

bd808 closed this task as Resolved. · Apr 18 2019, 4:38 AM
bd808 assigned this task to PeterBowman.