
tilerator failed on maps eqiad servers (maps100[1234])
Closed, Duplicate · Public

Description

Icinga alerted at roughly the same time (2018-08-12 12:52) that all tilerator instances on the eqiad cluster had stopped responding. A manual restart of tilerator at 12:57 fixed the issue.

The logs are short on actual information (see the extract below). It is probably worth a timeboxed investigation, if only to add some logging so that we might understand what's happening next time. But given that the restarts seem to have been done by service-runner after some workers stopped sending heartbeats (a sketch of that mechanism follows the extract), I'm not even sure we'll know what to instrument to get more information.

[2018-08-14T12:20:04.200Z] ERROR: tilerator/4 on maps1003: worker stopped sending heartbeats, killing. (message="worker stopped sending heartbeats, killing.", worker_pid=134, levelPath=error/service-runner/master)
[2018-08-14T12:20:04.215Z] ERROR: tilerator/4 on maps1003: worker stopped sending heartbeats, killing. (message="worker stopped sending heartbeats, killing.", worker_pid=158, levelPath=error/service-runner/master)
[2018-08-14T12:20:07.967Z] ERROR: tilerator/4 on maps1003: worker stopped sending heartbeats, killing. (message="worker stopped sending heartbeats, killing.", worker_pid=98, levelPath=error/service-runner/master)
[2018-08-14T12:21:04.281Z] ERROR: tilerator/4 on maps1003: worker died, restarting (message="worker died, restarting", worker_pid=158, exit_code=null, levelPath=error/service-runner/master)
[2018-08-14T12:21:04.324Z] ERROR: tilerator/4 on maps1003: worker died, restarting (message="worker died, restarting", worker_pid=134, exit_code=null, levelPath=error/service-runner/master)
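
For context, service-runner's master process monitors worker heartbeats and kills workers that stop sending them, which is what the log lines above record. Below is a minimal illustrative sketch of that kind of heartbeat watchdog, not service-runner's actual code; the message shape, intervals, and log wording are assumptions.

// Illustrative sketch only: approximates a heartbeat watchdog like the one
// service-runner's master runs. HEARTBEAT_* values and the message shape
// are hypothetical, not service-runner's actual configuration.
import cluster from 'node:cluster';

const HEARTBEAT_INTERVAL_MS = 5000;   // hypothetical: how often workers ping
const HEARTBEAT_TIMEOUT_MS = 15000;   // hypothetical: kill after ~3 missed pings

if (cluster.isPrimary) {
  const lastBeat = new Map<number, number>();

  cluster.on('fork', (worker) => {
    lastBeat.set(worker.id, Date.now());
    worker.on('message', (msg: { type?: string }) => {
      if (msg?.type === 'heartbeat') lastBeat.set(worker.id, Date.now());
    });
  });

  // Restart any worker that exits, mirroring "worker died, restarting".
  cluster.on('exit', (worker) => {
    lastBeat.delete(worker.id);
    console.error(`worker died, restarting (worker_pid=${worker.process.pid})`);
    cluster.fork();
  });

  setInterval(() => {
    for (const [id, ts] of lastBeat) {
      const worker = cluster.workers?.[id];
      if (worker && Date.now() - ts > HEARTBEAT_TIMEOUT_MS) {
        // This is about the only hook the mechanism offers: extra context
        // (event-loop lag, memory usage, current job) could be logged here,
        // just before the kill.
        console.error(`worker stopped sending heartbeats, killing. (worker_pid=${worker.process.pid})`);
        worker.process.kill('SIGKILL');
      }
    }
  }, HEARTBEAT_INTERVAL_MS);

  cluster.fork();
} else {
  // Worker side: heartbeats stop as soon as the event loop is blocked,
  // which is what makes a wedged worker detectable from the master.
  setInterval(() => process.send?.({ type: 'heartbeat' }), HEARTBEAT_INTERVAL_MS);
}

Note that a SIGKILL'd worker exits without a normal exit code, consistent with the exit_code=null in the log lines above.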

Event Timeline

Increased memory pressure from Cassandra suddenly having to serve many tiles, after some (all?) tiles were invalidated in the Varnish cache?

Hmm, but if I have my chronology correct, this happened before some (all?) tiles were banned en masse (T201772#4497154).
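
If the memory-pressure hypothesis is worth testing, one hypothetical instrumentation would be to piggyback memory figures on each worker heartbeat, so the master has something concrete to log when a worker goes quiet. A sketch of the worker side (the message shape is invented, not service-runner's):

// Hypothetical: attach heap/RSS figures to the heartbeat so the master
// can log them the next time it kills an unresponsive worker.
setInterval(() => {
  const { heapUsed, rss } = process.memoryUsage();
  process.send?.({
    type: 'heartbeat',
    heapUsedMb: Math.round(heapUsed / 1e6),
    rssMb: Math.round(rss / 1e6),
  });
}, 5000);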