So yesterday we got a report that Special:Translations is not working on Meta. I looked at a few Ganglia stats but could not find anything specific.
Today, with Max's help, we have a theory:
Many concurrent updates slow Solr to a crawl (one update every 2 seconds, each taking almost 20 s on average to finish). Search requests also take longer than 10 seconds and time out.
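A quick sanity check on those numbers, assuming "one every 2 seconds" is the rate at which updates arrive and the load is roughly steady: Little's law (average number in the system = arrival rate × average time in the system) estimates how many updates Solr is working on at once. The function name below is only for illustration.

```python
def average_in_flight(arrivals_per_second: float, seconds_per_update: float) -> float:
    """Little's law: mean number in the system = arrival rate x mean time in the system."""
    return arrivals_per_second * seconds_per_update


# Reading "one every 2 seconds" as the arrival rate and ~20 s as the mean duration.
print(average_in_flight(1 / 2, 20))  # 10.0
```

About ten updates in flight at any moment fits the picture of many concurrent updates.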
The JobQueue actually seems to work against us here, presumably because it runs many of these update jobs at the same time. The lack of accessible statistics on JobQueue behavior makes this hard to verify.
Max proposed that we store the list of changes and use a cron job to push them to Solr in batches from a single thread.
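A minimal sketch of that idea, assuming edits only record which pages changed and a cron job later sends them to Solr. `PendingChanges`, `flush`, `batch_size` and the `push_batch` callback are hypothetical names, the in-memory store stands in for whatever table would really hold the list, and collapsing repeated edits of the same page into one entry is an assumption (that only the latest state needs indexing). Because `flush` sends one batch and waits before sending the next, at most one update request is in flight at a time.

```python
from typing import Callable, Dict, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


class PendingChanges:
    """Hypothetical store of changed page ids (in-memory stand-in for a DB table)."""

    def __init__(self) -> None:
        # A dict used as an ordered set: recording the same page twice keeps one entry.
        self._ids: Dict[str, None] = {}

    def record(self, page_id: str) -> None:
        """Called on each edit instead of queueing an immediate Solr update."""
        self._ids[page_id] = None

    def drain(self) -> List[str]:
        """Return all pending ids in first-recorded order and clear the store."""
        ids = list(self._ids)
        self._ids.clear()
        return ids


def flush(store: PendingChanges,
          push_batch: Callable[[List[str]], None],
          batch_size: int = 100) -> int:
    """Cron entry point: push pending changes one batch at a time, never in parallel.

    If a push fails, the ids not yet sent are put back for the next cron run.
    """
    pending = store.drain()
    sent = 0
    try:
        for batch in batched(pending, batch_size):
            push_batch(batch)  # blocks until this batch is done
            sent += len(batch)
    except Exception:
        for page_id in pending[sent:]:
            store.record(page_id)
        raise
    return sent
```

For example, recording pages `a`, `b`, `a`, `c`, `d` and calling `flush` with `batch_size=2` calls `push_batch` twice, with `["a", "b"]` and then `["c", "d"]`. The batch size trades request count against the size of each update; the right value would have to be measured.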
No other solutions have been proposed.
RT-5085 is probably related to this.