
Tilerator crashed on maps200[1-3].codfw.wmnet
Closed, Resolved (Public)

Description

Tilerator crashed on maps200[1-3].codfw.wmnet with the following error:

{"name":"tilerator","hostname":"maps2001","pid":4,"level":50,"message":"worker stopped sending heartbeats, killing.","worker_pid":290,"levelPath":"error/service-runner/master","msg":"worker stopped sending heartbeats, killing.","time":"2019-04-02T00:17:15.131Z","v":0}
{"name":"tilerator","hostname":"maps2001","pid":4,"level":50,"message":"worker stopped sending heartbeats, killing.","worker_pid":296,"levelPath":"error/service-runner/master","msg":"worker stopped sending heartbeats, killing.","time":"2019-04-02T00:17:15.132Z","v":0}
{"name":"tilerator","hostname":"maps2001","pid":4,"level":50,"message":"worker stopped sending heartbeats, killing.","worker_pid":302,"levelPath":"error/service-runner/master","msg":"worker stopped sending heartbeats, killing.","time":"2019-04-02T00:17:15.132Z","v":0}
{"name":"tilerator","hostname":"maps2001","pid":4,"level":50,"message":"worker stopped sending heartbeats, killing.","worker_pid":326,"levelPath":"error/service-runner/master","msg":"worker stopped sending heartbeats, killing.","time":"2019-04-02T00:17:22.640Z","v":0}

These errors started around 2019-04-02T00:17:22.
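For triage, entries like the ones above can be grouped by host and message to see how many workers were killed and over what time window. The following is a minimal sketch, assuming the log is one JSON object per line in the format shown (where level 50 corresponds to the "error" in levelPath); the `summarize` helper and the `SAMPLE` constant are hypothetical names, and the two sample entries are copied from the excerpt above.

```python
import json
from collections import defaultdict

# Two entries copied verbatim from the excerpt above (maps2001).
SAMPLE = """\
{"name":"tilerator","hostname":"maps2001","pid":4,"level":50,"message":"worker stopped sending heartbeats, killing.","worker_pid":290,"levelPath":"error/service-runner/master","msg":"worker stopped sending heartbeats, killing.","time":"2019-04-02T00:17:15.131Z","v":0}
{"name":"tilerator","hostname":"maps2001","pid":4,"level":50,"message":"worker stopped sending heartbeats, killing.","worker_pid":326,"levelPath":"error/service-runner/master","msg":"worker stopped sending heartbeats, killing.","time":"2019-04-02T00:17:22.640Z","v":0}
"""


def summarize(lines, min_level=50):
    """Group entries at or above min_level by (hostname, msg)."""
    groups = defaultdict(
        lambda: {"count": 0, "worker_pids": [], "first": None, "last": None}
    )
    for raw in lines:
        raw = raw.strip()
        if not raw.startswith("{"):
            continue  # skip anything that is not a JSON log entry
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip truncated or otherwise malformed lines
        if entry.get("level", 0) < min_level:
            continue
        group = groups[(entry.get("hostname"), entry.get("msg"))]
        group["count"] += 1
        if "worker_pid" in entry:
            group["worker_pids"].append(entry["worker_pid"])
        ts = entry.get("time")
        if ts is not None:
            # Timestamps share one ISO 8601 UTC format, so string order is time order.
            group["first"] = ts if group["first"] is None else min(group["first"], ts)
            group["last"] = ts if group["last"] is None else max(group["last"], ts)
    return dict(groups)


summary = summarize(SAMPLE.splitlines())
for (host, msg), info in summary.items():
    print(host, msg, info["count"], info["worker_pids"], info["first"], info["last"])
```

To run it against a real log, pass an open file object instead of `SAMPLE.splitlines()`; the path to that file is deployment-specific and not assumed here.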

Event Timeline

@Mathew.onipe this is solved and will be fixed when the stretch migration finishes. It's a known issue with the populate_admin script.

MSantos triaged this task as High priority. Apr 2 2019, 2:29 PM
MSantos moved this task from All map-related tasks to Tilerator on the Maps board.
MSantos edited projects, added Maps (Tilerator); removed Maps.

@MSantos ack

However, this same error occurred again with maps2003.

{"name":"tilerator","hostname":"maps2003","pid":4,"level":50,"message":"worker died, restarting","worker_pid":134,"exit_code":null,"levelPath":"error/service-runner/master","msg":"worker died, restarting","time":"2019-04-09T00:32:08.178Z","v":0}
{"name":"tilerator","hostname":"maps2003","pid":4,"level":50,"message":"worker died, restarting","worker_pid":146,"exit_code":null,"levelPath":"error/service-runner/master","msg":"worker died, restarting","time":"2019-04-09T00:32:11.936Z","v":0}
{"name":"tilerator","hostname":"maps2003","pid":4,"level":50,"message":"worker stopped sending heartbeats, killing.","worker_pid":278,"levelPath":"error/service-runner/master","msg":"worker stopped sending heartbeats, killing.","time":"2019-04-09T00:32:19.449Z","v":0}
{"name":"tilerator","hostname":"maps2003","pid":4,"level":50,"message":"worker stopped sending heartbeats, killing.","worker_pid":290,"levelPath":"error/service-runner/master","msg":"worker stopped sending heartbeats, killing.","time":"2019-04-09T00:32:23.202Z","v":0}

Mentioned in SAL (#wikimedia-operations) [2019-04-09T02:47:26Z] <onimisionipe> restarting tilerator on maps2003 - T219849

This took another turn today. Restarting tilerator did not work.
syslog showed a permission error:

onimisionipe@maps2003:/srv/log/tilerator$ tail syslog.log 
Apr  9 03:13:01 maps2003 tilerator[1525]: ** Note: you can use --noprofile to disable default.profile **
Apr  9 03:13:01 maps2003 tilerator[1525]: #033]0;firejail /usr/bin/nodejs src/server.js -c /etc/tilerator/config.yaml #007Error while reading config file: Error: EACCES: permission denied, open '/etc/tilerator/config.yaml'
Apr  9 03:13:01 maps2003 tilerator[1525]: ** Note: you can use --noprofile to disable default.profile **
Apr  9 03:13:01 maps2003 tilerator[1525]: Parent pid 1525, child pid 1526
Apr  9 03:13:01 maps2003 tilerator[1525]: Parent is shutting down, bye...
Apr  9 03:13:04 maps2003 tilerator[1622]: Reading profile /etc/firejail/default.profile
Apr  9 03:13:04 maps2003 tilerator[1622]: Reading profile /etc/firejail/disable-common.inc
Apr  9 03:13:04 maps2003 tilerator[1622]: Reading profile /etc/firejail/disable-programs.inc
Apr  9 03:13:04 maps2003 tilerator[1622]: Reading profile /etc/firejail/disable-passwdmgr.inc
Apr  9 03:13:04 maps2003 tilerator[1622]: ** Note: you can use --noprofile to disable default.profile **

I'm going to depool maps2003 until this is fixed.
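A quick way to narrow down an EACCES like the one in the syslog excerpt is to look at the owner and mode of the file and whether the current user can read it. The sketch below is only a generic check under stated assumptions: `describe_access` is a hypothetical helper, it reports on whichever user runs it (so it is only meaningful when run as the service's user), and it says nothing about restrictions applied inside the firejail sandbox. The log does not show whether the denial came from the file's permissions or from the sandbox.

```python
import os
import pwd
import stat


def describe_access(path):
    """Report owner, mode and readability of path for the current user."""
    try:
        st = os.stat(path)
    except OSError as exc:
        # Covers a missing file as well as an unreadable parent directory.
        return {"path": path, "error": exc.strerror}
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        owner = str(st.st_uid)  # uid with no passwd entry
    return {
        "path": path,
        "owner": owner,
        "mode": stat.filemode(st.st_mode),
        "readable": os.access(path, os.R_OK),
    }


if __name__ == "__main__":
    # Path taken from the syslog excerpt above.
    print(describe_access("/etc/tilerator/config.yaml"))
```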

Gehel claimed this task.

Stretch migration is completed. This should be fixed now; we'll reopen if it happens again.