
OSM replication stuck on tilerator notification
Closed, ResolvedPublic

Description

As can be seen on Grafana, OSM replication for the maps cluster gets stuck fairly often. I have killed and restarted it with some success. ps indicates that replication is stuck on the tilerator notification step (/usr/bin/nodejs /srv/deployment/tilerator/deploy/src/scripts/tileshell.js --config /etc/tileratorui/config.yaml -j.fromZoom 10 -j.beforeZoom 16 -j.generatorId gen -j.storageId v3 -j.deleteEmpty -j.expdirpath /srv/osm_expire/ -j.expmask expire\.list\.* -j.statefile /var/run/tileratorui/expire.state). The logs at /var/log/osm2pgsql* don't show much, except an osm2pgsql error that seems mostly unrelated ("Osm2pgsql failed due to ERROR: XML parsing error at line 1, column 0: no element found"; the script exits before the tilerator notification in that case) and a line that might be related to T145534 (Cleanup failed for /tmp/tilerator-sorted-1428ZEuOZwSvvUOM: Error: ENOENT: no such file or directory, unlink '/tmp/tilerator-sorted-1428ZEuOZwSvvUOM').
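
A quick way to confirm the stall (a hypothetical check, not part of the original report; <PID> is a placeholder):

# Find the tileshell.js process spawned by the replication cron job
pgrep -af tileshell.js

# For that PID, check elapsed time, state and accumulated CPU; a days-old
# ETIME with essentially no CPU time suggests it is idle rather than still
# working through the expire lists
ps -o pid,etime,stat,wchan,pcpu,time -p <PID>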

Event Timeline

Gehel triaged this task as High priority.Jun 19 2017, 9:33 AM
Gehel added subscribers: debt, Pnorman.

The full process stack leading to the blocked tileshell.js:

root       796  0.0  0.0  27504  2780 ?        Ss   Apr10   0:48 /usr/sbin/cron -f
root     27549  0.0  0.0  42240  2632 ?        S    Jun16   0:00  \_ /usr/sbin/CRON -f
osmupda+ 27553  0.0  0.0   4336   768 ?        Ss   Jun16   0:00      \_ /bin/sh -c /usr/local/bin/replicate-osm >> /var/log/osmosis//osm2pgsql.log 2>&1
osmupda+ 27556  0.0  0.0  13236  3064 ?        S    Jun16   0:00          \_ /bin/bash /usr/local/bin/replicate-osm
root     19817  0.0  0.0  40540  3408 ?        S    Jun16   0:00              \_ sudo -u tileratorui /usr/local/bin/notify-tilerator
tilerat+ 19818  0.0  0.0  13236  2828 ?        S    Jun16   0:00                  \_ /bin/bash /usr/local/bin/notify-tilerator
tilerat+ 19819  0.0  0.0   5800   688 ?        S    Jun16   0:00                      \_ /usr/bin/flock -xn /srv/osmosis/replicate-osm.lck /usr/bin/nodejs /srv/deployment/tilerator/deploy/src/scripts/tileshell.js --config /etc/tileratorui/config.yaml -j.fromZoom 10 -j.beforeZoom 16 -j.generatorId gen -j.storageId v3 -j.deleteEmpty -j.expdirpath /srv/osm_expire/ -j.expmask expire\.list\.* -j.statefile /var/run/tileratorui/expire.state
tilerat+ 19820  0.0  0.0 1560252 74148 ?       Sl   Jun16   2:18                          \_ /usr/bin/nodejs /srv/deployment/tilerator/deploy/src/scripts/tileshell.js --config /etc/tileratorui/config.yaml -j.fromZoom 10 -j.beforeZoom 16 -j.generatorId gen -j.storageId v3 -j.deleteEmpty -j.expdirpath /srv/osm_expire/ -j.expmask expire\.list\.* -j.statefile /var/run/tileratorui/expire.state
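
Note that the notification step runs under an exclusive, non-blocking lock on /srv/osmosis/replicate-osm.lck (the flock -xn above), so any later run that tries to take the same lock will simply exit while this tileshell.js hangs. A quick, hypothetical way to see who currently holds the lock:

# Show which processes have the lock file open
sudo fuser -v /srv/osmosis/replicate-osm.lck
# or equivalently
sudo lsof /srv/osmosis/replicate-osm.lck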

Hi @Pnorman - can you take a look at this? Thanks!

The stuck process on maps1001 has been running for four days now. I had a look at /srv/osm_expire and there are three files with these line counts:

$ wc -l *
  9396832 expire.list.201706120127
  9887311 expire.list.201706151440
  5066504 expire.list.201706160127
 24350647 total
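
To put those counts in perspective, a hypothetical check, assuming the usual osm2pgsql expire-list format of one zoom/x/y tile per line:

cd /srv/osm_expire
# Count distinct tiles across the three lists; heavy overlap would mean the
# real backlog is smaller than the raw line counts suggest
sort -u expire.list.* | wc -l
# Peek at the entry format / zoom level
head -n 3 expire.list.201706160127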

Looking with top, I see CPU time in kartotherian and cassandra, but not tilerator.

A few thoughts

  • I'm surprised to see multiple expire lists
  • the lists seem big

This could be normal.

What could tileshell.js be doing? https://phabricator.wikimedia.org/P5600 is the list of files it has open. This shows that the log file in use is the one for the run from the 16th, which contains:

info Loading variables from the config file
info Loading sources from /srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/src/sources.prod2.yaml
trace Updating layer landuse
trace Updating layer waterway
trace Updating layer water
trace Updating layer aeroway
trace Updating layer building
trace Updating layer road
trace Updating layer admin
trace Updating layer country_label
trace Updating layer place_label
trace Updating layer poi_label
trace Updating layer road_label
info Parsing /srv/osm_expire/expire.list.201706160127=>/tmp/tilerator-expire.list.201706160127-198202yrsPmlJSGHG
info Sorting: sort -u -n -o /tmp/tilerator-sorted-19820XoeaOTP4bWpz /tmp/tilerator-expire.list.201706160127-198202yrsPmlJSGHG
info Adding jobs from /tmp/tilerator-sorted-19820XoeaOTP4bWpz
info Finished job creation... no more unemployment... at least for a bit
warn Cleanup failed for /tmp/tilerator-sorted-19820XoeaOTP4bWpz: Error: ENOENT: no such file or directory, unlink '/tmp/tilerator-sorted-19820XoeaOTP4bWpz'
done
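
If it gets stuck again, a couple of hypothetical follow-ups on the hung PID could narrow down what it is waiting on:

# File descriptors it still holds (same information as the P5600 paste)
sudo ls -l /proc/<PID>/fd

# Whether it is parked in a syscall; an idle node process normally sits in
# epoll_wait with no further activity
sudo strace -f -p <PID>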

Posting what I have for now since I have a meeting.

Going into tileratorui I see 221 failed jobs, but they're all from about 6000 hours ago. There are 0 queued, active, complete, or delayed jobs.

We discussed this today and need more debugging information. Because tilerator is using no measurable CPU or disk, there's a good chance something is being left open. Right now the only server showing the problem is the production one serving tiles, while the other datacenter's production cluster and both test clusters are working okay, so we can't debug this right now.

One potential tool for seeing why node is running is https://github.com/mafintosh/why-is-node-running, but we'd have to change the call from nodejs tileshell.js to why-is-node-running and wait for it to hang. Another method is to attach the nodejs debugger to the running process (i.e. [node debug -p pid](https://stackoverflow.com/questions/30181274/how-to-attach-a-running-node-js-process-and-debug-without-a-gui)) and see what's happening.
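
A rough sketch of both options, assuming why-is-node-running's standalone CLI behaves as documented (the exact wiring into notify-tilerator is hypothetical):

# Option 1: run tileshell.js under the wrapper and wait for it to hang
npm install -g why-is-node-running
# in notify-tilerator, replace "nodejs .../tileshell.js ..." with:
#   why-is-node-running /srv/deployment/tilerator/deploy/src/scripts/tileshell.js \
#       --config /etc/tileratorui/config.yaml ... (same -j.* arguments as before)
# then, once it hangs, dump the handles keeping the event loop alive
kill -SIGUSR1 <PID>

# Option 2: attach the node debugger to the already-running process
node debug -p <PID>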

Pnorman changed the task status from Open to Stalled.Jun 23 2017, 5:27 AM

Changing to stalled, since we're waiting for it to reoccur on a different server.

Moving to stalled - we're waiting on it to happen again.

Looking at logs / graphs, the problem has not appeared again. We should probably close this and reopen it when we see the issue again.

debt claimed this task.