
tilerator failed on maps eqiad servers (maps100[1234])
Closed, Duplicate · Public

Description

Icinga alerted at roughly the same time (2018-08-12 12:52) that all tilerator instances on the eqiad cluster had stopped responding. A manual restart of tilerator at 12:57 fixed the issue.

The logs are short on actual information (see the extract below). It is probably worth a timeboxed investigation, if only to add some logging so that we might understand what's happening next time. But given that the restarts seem to have been done by service-runner after some workers stopped sending heartbeats (a sketch of that mechanism follows the extract), I'm not even sure we'll know what to instrument to get more information.

[2018-08-14T12:20:04.200Z] ERROR: tilerator/4 on maps1003: worker stopped sending heartbeats, killing. (message="worker stopped sending heartbeats, killing.", worker_pid=134, levelPath=error/service-runner/master)
[2018-08-14T12:20:04.215Z] ERROR: tilerator/4 on maps1003: worker stopped sending heartbeats, killing. (message="worker stopped sending heartbeats, killing.", worker_pid=158, levelPath=error/service-runner/master)
[2018-08-14T12:20:07.967Z] ERROR: tilerator/4 on maps1003: worker stopped sending heartbeats, killing. (message="worker stopped sending heartbeats, killing.", worker_pid=98, levelPath=error/service-runner/master)
[2018-08-14T12:21:04.281Z] ERROR: tilerator/4 on maps1003: worker died, restarting (message="worker died, restarting", worker_pid=158, exit_code=null, levelPath=error/service-runner/master)
[2018-08-14T12:21:04.324Z] ERROR: tilerator/4 on maps1003: worker died, restarting (message="worker died, restarting", worker_pid=134, exit_code=null, levelPath=error/service-runner/master)
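
For context, service-runner's master process monitors worker heartbeats and kills workers that stop sending them, which is what the log lines above record. Below is a minimal illustrative sketch of that kind of heartbeat watchdog, not service-runner's actual code; the message shape, intervals, and log wording are assumptions.

// Illustrative sketch only: approximates a heartbeat watchdog like the one
// service-runner's master runs. HEARTBEAT_* values and the message shape
// are hypothetical, not service-runner's actual configuration.
import cluster from 'node:cluster';

const HEARTBEAT_INTERVAL_MS = 5000;   // hypothetical: how often workers ping
const HEARTBEAT_TIMEOUT_MS = 15000;   // hypothetical: kill after ~3 missed pings

if (cluster.isPrimary) {
  const lastBeat = new Map<number, number>();

  cluster.on('fork', (worker) => {
    lastBeat.set(worker.id, Date.now());
    worker.on('message', (msg: { type?: string }) => {
      if (msg?.type === 'heartbeat') lastBeat.set(worker.id, Date.now());
    });
  });

  // Restart any worker that exits, mirroring "worker died, restarting".
  cluster.on('exit', (worker) => {
    lastBeat.delete(worker.id);
    console.error(`worker died, restarting (worker_pid=${worker.process.pid})`);
    cluster.fork();
  });

  setInterval(() => {
    for (const [id, ts] of lastBeat) {
      const worker = cluster.workers?.[id];
      if (worker && Date.now() - ts > HEARTBEAT_TIMEOUT_MS) {
        // This is about the only hook the mechanism offers: extra context
        // (event-loop lag, memory usage, current job) could be logged here,
        // just before the kill.
        console.error(`worker stopped sending heartbeats, killing. (worker_pid=${worker.process.pid})`);
        worker.process.kill('SIGKILL');
      }
    }
  }, HEARTBEAT_INTERVAL_MS);

  cluster.fork();
} else {
  // Worker side: heartbeats stop as soon as the event loop is blocked,
  // which is what makes a wedged worker detectable from the master.
  setInterval(() => process.send?.({ type: 'heartbeat' }), HEARTBEAT_INTERVAL_MS);
}

Note that a SIGKILL'd worker exits without a normal exit code, consistent with the exit_code=null in the log lines above.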

Event Timeline

Increased memory pressure from Cassandra suddenly having to serve many tiles, after some (all?) tiles were invalidated in the Varnish cache?

Hmm, but if I have my chronology correct, this happened before some (all?) tiles were banned en masse (T201772#4497154).
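
If the memory-pressure hypothesis is worth testing, one hypothetical instrumentation would be to piggyback memory figures on each worker heartbeat, so the master has something concrete to log when a worker goes quiet. A sketch of the worker side (the message shape is invented, not service-runner's):

// Hypothetical: attach heap/RSS figures to the heartbeat so the master
// can log them the next time it kills an unresponsive worker.
setInterval(() => {
  const { heapUsed, rss } = process.memoryUsage();
  process.send?.({
    type: 'heartbeat',
    heapUsedMb: Math.round(heapUsed / 1e6),
    rssMb: Math.round(rss / 1e6),
  });
}, 5000);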