Description

As can be seen in Grafana, OSM replication for the maps cluster gets stuck fairly often. I did kill and restart it with some success. ps indicates that replication is stuck on the tilerator notification step (/usr/bin/nodejs /srv/deployment/tilerator/deploy/src/scripts/tileshell.js --config /etc/tileratorui/config.yaml -j.fromZoom 10 -j.beforeZoom 16 -j.generatorId gen -j.storageId v3 -j.deleteEmpty -j.expdirpath /srv/osm_expire/ -j.expmask expire\.list\.* -j.statefile /var/run/tileratorui/expire.state). Logs at /var/log/osm2pgsql* don't show much, except an error in osm2pgsql that seems mostly unrelated (Osm2pgsql failed due to ERROR: XML parsing error at line 1, column 0: no element found; the script exits before the tilerator notification in that case) and a line that might be related to T145534 (Cleanup failed for /tmp/tilerator-sorted-1428ZEuOZwSvvUOM: Error: ENOENT: no such file or directory, unlink '/tmp/tilerator-sorted-1428ZEuOZwSvvUOM').
Status | Subtype | Assigned | Task
--- | --- | --- | ---
Resolved | | Gehel | T167549 Create Icinga alert when OSM replication lags on maps
Resolved | | debt | T168241 OSM replication stuck on tilerator notification
Event Timeline
The full process stack leading to the blocked tileshell.js:
root       796 0.0 0.0   27504  2780 ? Ss Apr10 0:48 /usr/sbin/cron -f
root     27549 0.0 0.0   42240  2632 ? S  Jun16 0:00  \_ /usr/sbin/CRON -f
osmupda+ 27553 0.0 0.0    4336   768 ? Ss Jun16 0:00      \_ /bin/sh -c /usr/local/bin/replicate-osm >> /var/log/osmosis//osm2pgsql.log 2>&1
osmupda+ 27556 0.0 0.0   13236  3064 ? S  Jun16 0:00          \_ /bin/bash /usr/local/bin/replicate-osm
root     19817 0.0 0.0   40540  3408 ? S  Jun16 0:00              \_ sudo -u tileratorui /usr/local/bin/notify-tilerator
tilerat+ 19818 0.0 0.0   13236  2828 ? S  Jun16 0:00                  \_ /bin/bash /usr/local/bin/notify-tilerator
tilerat+ 19819 0.0 0.0    5800   688 ? S  Jun16 0:00                      \_ /usr/bin/flock -xn /srv/osmosis/replicate-osm.lck /usr/bin/nodejs /srv/deployment/tilerator/deploy/src/scripts/tileshell.js --config /etc/tileratorui/config.yaml -j.fromZoom 10 -j.beforeZoom 16 -j.generatorId gen -j.storageId v3 -j.deleteEmpty -j.expdirpath /srv/osm_expire/ -j.expmask expire\.list\.* -j.statefile /var/run/tileratorui/expire.state
tilerat+ 19820 0.0 0.0 1560252 74148 ? Sl Jun16 2:18                          \_ /usr/bin/nodejs /srv/deployment/tilerator/deploy/src/scripts/tileshell.js --config /etc/tileratorui/config.yaml -j.fromZoom 10 -j.beforeZoom 16 -j.generatorId gen -j.storageId v3 -j.deleteEmpty -j.expdirpath /srv/osm_expire/ -j.expmask expire\.list\.* -j.statefile /var/run/tileratorui/expire.state
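The flock -xn in that stack is presumably what couples replication to the notification: as long as the hung tileshell.js run holds /srv/osmosis/replicate-osm.lck, any later non-blocking attempt on the same lock fails immediately instead of queueing. A minimal sketch of that flock behavior, using a throwaway lock file rather than the real one:

```shell
# Demonstrate flock -xn: while one process holds the exclusive lock,
# a second non-blocking attempt on the same file fails at once.
LOCK=$(mktemp)   # stand-in for /srv/osmosis/replicate-osm.lck
flock -xn "$LOCK" -c "flock -xn '$LOCK' true && echo 'lock free' || echo 'lock held'"
```

Run as-is this prints `lock held`: the inner flock sees the outer one's exclusive lock and gives up immediately, which is exactly what each new replication run does while the stuck tileshell.js keeps the lock alive.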
The process on maps1001 has been running for four days now. I had a look at /srv/osm_expire and there are three files with these line counts:
$ wc -l *
  9396832 expire.list.201706120127
  9887311 expire.list.201706151440
  5066504 expire.list.201706160127
 24350647 total
Looking with top, I see CPU time in kartotherian and cassandra, but not tilerator.
A few thoughts:
- I'm surprised to see multiple expire lists
- the lists seem big
This could be normal.
What could tileshell.js be doing? https://phabricator.wikimedia.org/P5600 is the list of its open files. It shows that the log file currently in use is the one for the June 16 expire list, which contains:
info Loading variables from the config file
info Loading sources from /srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/src/sources.prod2.yaml
trace Updating layer landuse
trace Updating layer waterway
trace Updating layer water
trace Updating layer aeroway
trace Updating layer building
trace Updating layer road
trace Updating layer admin
trace Updating layer country_label
trace Updating layer place_label
trace Updating layer poi_label
trace Updating layer road_label
info Parsing /srv/osm_expire/expire.list.201706160127=>/tmp/tilerator-expire.list.201706160127-198202yrsPmlJSGHG
info Sorting: sort -u -n -o /tmp/tilerator-sorted-19820XoeaOTP4bWpz /tmp/tilerator-expire.list.201706160127-198202yrsPmlJSGHG
info Adding jobs from /tmp/tilerator-sorted-19820XoeaOTP4bWpz
info Finished job creation... no more unemployment... at least for a bit
warn Cleanup failed for /tmp/tilerator-sorted-19820XoeaOTP4bWpz: Error: ENOENT: no such file or directory, unlink '/tmp/tilerator-sorted-19820XoeaOTP4bWpz'
done
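That warn line matches the pattern from T145534: by the time cleanup runs, the sorted temp file is already gone, so the unlink fails with ENOENT. Both the sort step and the failure mode are easy to reproduce; the file names and tile IDs below are made up, not taken from the host:

```shell
# Reproduce the "Sorting" step on a tiny fake expire list
# (one tile per line; IDs here are illustrative only).
list=$(mktemp)
sorted=$(mktemp -u)   # path only; sort creates the file
printf '15/17408/11344\n15/17408/11344\n13/4352/2836\n' > "$list"
sort -u -n -o "$sorted" "$list"   # same flags tileshell logs
cat "$sorted"                     # duplicate line removed

# The cleanup warning: unlinking a file that is already gone
rm -f "$sorted" "$list"
rm "$sorted" 2>/dev/null || echo "ENOENT, as in the warn line"
```

So the warning by itself only means a double delete of the temp file; it doesn't explain why the process then fails to exit after printing `done`.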
Posting what I have for now since I have a meeting.
Going into tileratorui I see 221 failed jobs, but they're all from about 6000 hours (roughly 250 days) ago. There are 0 queued, active, complete or delayed jobs.
We discussed this today and need more debugging information. Because tilerator uses no measurable CPU or disk while stuck, there's a good chance something (a handle, socket or timer) is being left open. Right now the only server affected is the production one serving tiles, while the other datacenter's production cluster and both test clusters are working fine, which means we can't debug right now.
One potential tool for seeing why node keeps running is https://github.com/mafintosh/why-is-node-running, but we'd have to change the invocation from nodejs tileshell.js to why-is-node-running and wait for it to hang again. Another option is to attach the Node.js debugger to the running process (e.g. [node debug -p pid](https://stackoverflow.com/questions/30181274/how-to-attach-a-running-node-js-process-and-debug-without-a-gui)) and see what's happening.
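If it hangs again before either tool is in place, some state can be gathered non-intrusively from /proc. The sketch below inspects the current shell ($$) as a stand-in; on maps1001 the PID would be the tileshell.js one from the ps output (19820) instead:

```shell
# Non-intrusive inspection of a process via /proc.
# PID=$$ is a stand-in so this runs anywhere; substitute the
# stuck tileshell.js PID on the affected host.
PID=$$
ls -l /proc/$PID/fd                        # open files and sockets, cf. P5600
cat /proc/$PID/wchan; echo                 # kernel function it is sleeping in
grep -E '^(State|Threads)' /proc/$PID/status
# Node-specific: SIGUSR1 makes the process open its debug port,
# after which the legacy debugger can attach:
#   kill -USR1 $PID && node debug -p $PID
```

An open fd on a socket or pipe with no corresponding activity would support the "something left open" theory without having to restart anything.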
Looking at logs / graphs, the problem has not appeared again. We should probably close this and reopen when we see the issue again.