Exim panics when spamd reaches maxchildren
Open, Medium, Public

Description

This has happened several times in the last two or three days: exim sends a lot of messages to spamd for spam checking, spamd reaches its --max-children limit, and connections to spamd time out:

2017-05-24 20:07:10 XXX spam acl condition: warning - spamd connection to 127.0.0.1, port 783 failed: Connection timed out
2017-05-24 20:07:10 XXX spam acl condition: all spamd servers failed
May 24 20:07:06 mx1001 spamd[10173]: prefork: child states: BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
May 24 20:07:06 mx1001 spamd[10173]: prefork: server reached --max-children setting, consider raising it

The system recovers by itself once the influx of messages (in this case from civi1001) has been processed, but we should be able to protect against this, maybe by limiting concurrent deliveries?

Event Timeline

I know I'm being lazy and still haven't deciphered from the configs how you handle spamd, but here are a few shots in the dark:

  • you can use defer_ok to let messages through in case of spamd failure
spam = everybody/defer_ok
  • creating spamd instances is usually pretty cheap if you're using a fast shared bayes store (like redis) and a fast shared whitelist (like postgres). I usually have four spamd containers around: two for everyday load (40% each) and two for emergencies (10% each), which only get used when the main ones saturate. (I gave them plenty of memory and allow 50+ connections per instance.)
  • I observed no real difference between prefork and dynamic fork configs (unless your fork is expensive), so it's pointless to fiddle with that
  • exim's maximum number of parallel deliveries strongly correlates with the number of parallel spamd scans to expect. If you let exim handle lots of connections, you need a spamd setup that can handle them as well.
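
The defer_ok and concurrency points above could be combined roughly as follows. This is only a sketch, not the configuration actually deployed on mx1001: the numeric limits are made-up values for illustration, and it assumes messages are scanned at SMTP time in the ACL named by acl_smtp_data, so that the number of simultaneous incoming SMTP connections bounds the number of parallel spamd scans.

```
# Main configuration section.
# Address and port taken from the log lines in the description.
spamd_address = 127.0.0.1 783

# Hypothetical limits: keep the number of simultaneous incoming SMTP
# connections at or below spamd's --max-children, and stop a single
# busy sender from using all of them.
smtp_accept_max = 30
smtp_accept_max_per_host = 10

# In the ACL named by acl_smtp_data.
# With /defer_ok, a spamd failure or timeout makes the condition false,
# so the message passes unscanned instead of being deferred.
warn  spam = everybody/defer_ok
      add_header = X-Spam-Score: $spam_score
```

The tradeoff with defer_ok is that mail arriving while spamd is saturated is accepted without a spam score, so it is probably best paired with a connection limit rather than used on its own.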