maps - tilerator notification seems stuck on sorting files
Closed, ResolvedPublic

Description

Since a few days, OSM replication on maps1001 seems stuck in tilerator notification. The postgresql import seems to terminate without error, but notification seems to never terminate.

root       686  0.0  0.0  27504  2840 ?        Ss   Jun10   0:15 /usr/sbin/cron -f
root     19785  0.0  0.0  42240  2648 ?        S    Sep03   0:00  \_ /usr/sbin/CRON -f
osmupda+ 19787  0.0  0.0   4336   724 ?        Ss   Sep03   0:00  |   \_ /bin/sh -c /usr/local/bin/replicate-osm > /tmp/osm2pgsql.log 2>&1
osmupda+ 19788  0.0  0.0  13236  2872 ?        S    Sep03   0:00  |       \_ /bin/bash /usr/local/bin/replicate-osm
root     24287  0.0  0.0  40540  3548 ?        S    Sep03   0:00  |           \_ sudo -u tileratorui /usr/local/bin/notify-tilerator
tilerat+ 24288  0.0  0.0  13236  2744 ?        S    Sep03   0:00  |               \_ /bin/bash /usr/local/bin/notify-tilerator
tilerat+ 24289  0.0  0.0 1345164 69288 ?       Sl   Sep03   5:34  |                   \_ /usr/bin/nodejs /srv/deployment/tilerator/deploy/src/scripts/tileshell.js --config /etc/tileratorui/config.yaml -j.fromZoom
tilerat+ 24324  0.0  0.0   4336   796 ?        S    Sep03   0:00  |                       \_ /bin/sh -c sort -u -n -o /tmp/tilerator-sorted-24289fQMMVWrPWQyP 
tilerat+ 24327  0.0  0.0  24564  1844 ?        S    Sep03   0:00  |                           \_ sort -u -n -o /tmp/tilerator-sorted-24289fQMMVWrPWQyP
root     13487  0.0  0.0  42240  2648 ?        S    Sep04   0:00  \_ /usr/sbin/CRON -f
osmupda+ 13491  0.0  0.0   4336   756 ?        Ss   Sep04   0:00  |   \_ /bin/sh -c /usr/local/bin/replicate-osm > /tmp/osm2pgsql.log 2>&1
osmupda+ 13493  0.0  0.0  13236  2876 ?        S    Sep04   0:00  |       \_ /bin/bash /usr/local/bin/replicate-osm
root     17509  0.0  0.0  40540  3520 ?        S    Sep04   0:00  |           \_ sudo -u tileratorui /usr/local/bin/notify-tilerator
tilerat+ 17510  0.0  0.0  13236  2744 ?        S    Sep04   0:00  |               \_ /bin/bash /usr/local/bin/notify-tilerator
tilerat+ 17511  0.0  0.0 1343572 67748 ?       Sl   Sep04   5:04  |                   \_ /usr/bin/nodejs /srv/deployment/tilerator/deploy/src/scripts/tileshell.js --config /etc/tileratorui/config.yaml -j.fromZoom
tilerat+ 17537  0.0  0.0   4336   756 ?        S    Sep04   0:00  |                       \_ /bin/sh -c sort -u -n -o /tmp/tilerator-sorted-175110STWXIdKWVkP 
tilerat+ 17538  0.0  0.0  24564  1968 ?        S    Sep04   0:00  |                           \_ sort -u -n -o /tmp/tilerator-sorted-175110STWXIdKWVkP
root      3890  0.0  0.0  42240  2648 ?        S    Sep05   0:00  \_ /usr/sbin/CRON -f
osmupda+  3892  0.0  0.0   4336   716 ?        Ss   Sep05   0:00  |   \_ /bin/sh -c /usr/local/bin/replicate-osm > /tmp/osm2pgsql.log 2>&1
osmupda+  3893  0.0  0.0  13236  3008 ?        S    Sep05   0:00  |       \_ /bin/bash /usr/local/bin/replicate-osm
root     10556  0.0  0.0  40540  3448 ?        S    Sep05   0:00  |           \_ sudo -u tileratorui /usr/local/bin/notify-tilerator
tilerat+ 10557  0.0  0.0  13236  2680 ?        S    Sep05   0:00  |               \_ /bin/bash /usr/local/bin/notify-tilerator
tilerat+ 10558  0.0  0.0 1346156 71348 ?       Sl   Sep05   4:45  |                   \_ /usr/bin/nodejs /srv/deployment/tilerator/deploy/src/scripts/tileshell.js --config /etc/tileratorui/config.yaml -j.fromZoom
tilerat+ 10578  0.0  0.0   4336   804 ?        S    Sep05   0:00  |                       \_ /bin/sh -c sort -u -n -o /tmp/tilerator-sorted-10558H8qTLdguqfif 
tilerat+ 10579  0.0  0.0  24564  1912 ?        S    Sep05   0:00  |                           \_ sort -u -n -o /tmp/tilerator-sorted-10558H8qTLdguqfif
root     29476  0.0  0.0  42240  2648 ?        S    Sep06   0:00  \_ /usr/sbin/CRON -f
osmupda+ 29478  0.0  0.0   4336   708 ?        Ss   Sep06   0:00  |   \_ /bin/sh -c /usr/local/bin/replicate-osm > /tmp/osm2pgsql.log 2>&1
osmupda+ 29480  0.0  0.0  13236  3044 ?        S    Sep06   0:00  |       \_ /bin/bash /usr/local/bin/replicate-osm
root      3793  0.0  0.0  40540  3536 ?        S    Sep06   0:00  |           \_ sudo -u tileratorui /usr/local/bin/notify-tilerator
tilerat+  3794  0.0  0.0  13236  2672 ?        S    Sep06   0:00  |               \_ /bin/bash /usr/local/bin/notify-tilerator
tilerat+  3795  0.0  0.0 1343836 67784 ?       Sl   Sep06   4:15  |                   \_ /usr/bin/nodejs /srv/deployment/tilerator/deploy/src/scripts/tileshell.js --config /etc/tileratorui/config.yaml -j.fromZoom
tilerat+  3815  0.0  0.0   4336   768 ?        S    Sep06   0:00  |                       \_ /bin/sh -c sort -u -n -o /tmp/tilerator-sorted-37950EzzltWnMUP5 
tilerat+  3816  0.0  0.0  24564  1920 ?        S    Sep06   0:00  |                           \_ sort -u -n -o /tmp/tilerator-sorted-37950EzzltWnMUP5
root     21928  0.0  0.0  42240  2648 ?        S    Sep07   0:00  \_ /usr/sbin/CRON -f
osmupda+ 21930  0.0  0.0   4336   712 ?        Ss   Sep07   0:00  |   \_ /bin/sh -c /usr/local/bin/replicate-osm > /tmp/osm2pgsql.log 2>&1
osmupda+ 21933  0.0  0.0  13236  2944 ?        S    Sep07   0:00  |       \_ /bin/bash /usr/local/bin/replicate-osm
root     28863  0.0  0.0  40540  3540 ?        S    Sep07   0:00  |           \_ sudo -u tileratorui /usr/local/bin/notify-tilerator
tilerat+ 28864  0.0  0.0  13236  2612 ?        S    Sep07   0:00  |               \_ /bin/bash /usr/local/bin/notify-tilerator
tilerat+ 28865  0.0  0.0 1348876 72212 ?       Sl   Sep07   3:46  |                   \_ /usr/bin/nodejs /srv/deployment/tilerator/deploy/src/scripts/tileshell.js --config /etc/tileratorui/config.yaml -j.fromZoom
tilerat+ 28892  0.0  0.0   4336   752 ?        S    Sep07   0:00  |                       \_ /bin/sh -c sort -u -n -o /tmp/tilerator-sorted-28865sRJZOAKQ8DYj 
tilerat+ 28893  0.0  0.0  24564  1900 ?        S    Sep07   0:00  |                           \_ sort -u -n -o /tmp/tilerator-sorted-28865sRJZOAKQ8DYj
root     15698  0.0  0.0  42240  2648 ?        S    Sep08   0:00  \_ /usr/sbin/CRON -f
osmupda+ 15701  0.0  0.0   4336   716 ?        Ss   Sep08   0:00  |   \_ /bin/sh -c /usr/local/bin/replicate-osm > /tmp/osm2pgsql.log 2>&1
osmupda+ 15703  0.0  0.0  13236  2876 ?        S    Sep08   0:00  |       \_ /bin/bash /usr/local/bin/replicate-osm
root     22568  0.0  0.0  40540  3440 ?        S    Sep08   0:00  |           \_ sudo -u tileratorui /usr/local/bin/notify-tilerator
tilerat+ 22569  0.0  0.0  13236  2680 ?        S    Sep08   0:00  |               \_ /bin/bash /usr/local/bin/notify-tilerator
tilerat+ 22570  0.0  0.0 1343852 68416 ?       Sl   Sep08   3:11  |                   \_ /usr/bin/nodejs /srv/deployment/tilerator/deploy/src/scripts/tileshell.js --config /etc/tileratorui/config.yaml -j.fromZoom
tilerat+ 22591  0.0  0.0   4336   728 ?        S    Sep08   0:00  |                       \_ /bin/sh -c sort -u -n -o /tmp/tilerator-sorted-22570x5ZMDg4Lo8E0 
tilerat+ 22592  0.0  0.0  24564  1952 ?        S    Sep08   0:00  |                           \_ sort -u -n -o /tmp/tilerator-sorted-22570x5ZMDg4Lo8E0
root      7470  0.0  0.0  42240  2648 ?        S    Sep09   0:00  \_ /usr/sbin/CRON -f
osmupda+  7472  0.0  0.0   4336   752 ?        Ss   Sep09   0:00  |   \_ /bin/sh -c /usr/local/bin/replicate-osm > /tmp/osm2pgsql.log 2>&1
osmupda+  7474  0.0  0.0  13236  3004 ?        S    Sep09   0:00  |       \_ /bin/bash /usr/local/bin/replicate-osm
root     14281  0.0  0.0  40540  3476 ?        S    Sep09   0:00  |           \_ sudo -u tileratorui /usr/local/bin/notify-tilerator
tilerat+ 14282  0.0  0.0  13236  2608 ?        S    Sep09   0:00  |               \_ /bin/bash /usr/local/bin/notify-tilerator
tilerat+ 14283  0.0  0.0 1341224 77700 ?       Sl   Sep09   2:40  |                   \_ /usr/bin/nodejs /srv/deployment/tilerator/deploy/src/scripts/tileshell.js --config /etc/tileratorui/config.yaml -j.fromZoom
tilerat+ 14316  0.0  0.0   4336   764 ?        S    Sep09   0:00  |                       \_ /bin/sh -c sort -u -n -o /tmp/tilerator-sorted-14283p93QHDpHKK0s 
tilerat+ 14317  0.0  0.0  24564  2016 ?        S    Sep09   0:00  |                           \_ sort -u -n -o /tmp/tilerator-sorted-14283p93QHDpHKK0s
root       876  0.0  0.0  42240  2648 ?        S    Sep10   0:00  \_ /usr/sbin/CRON -f
osmupda+   880  0.0  0.0   4336   812 ?        Ss   Sep10   0:00  |   \_ /bin/sh -c /usr/local/bin/replicate-osm > /tmp/osm2pgsql.log 2>&1
osmupda+   882  0.0  0.0  13236  2880 ?        S    Sep10   0:00  |       \_ /bin/bash /usr/local/bin/replicate-osm
root      7391  0.0  0.0  40540  3460 ?        S    Sep10   0:00  |           \_ sudo -u tileratorui /usr/local/bin/notify-tilerator
tilerat+  7392  0.0  0.0  13236  2680 ?        S    Sep10   0:00  |               \_ /bin/bash /usr/local/bin/notify-tilerator
tilerat+  7393  0.0  0.0 1346744 83128 ?       Sl   Sep10   2:08  |                   \_ /usr/bin/nodejs /srv/deployment/tilerator/deploy/src/scripts/tileshell.js --config /etc/tileratorui/config.yaml -j.fromZoom
tilerat+  7413  0.0  0.0   4336   800 ?        S    Sep10   0:00  |                       \_ /bin/sh -c sort -u -n -o /tmp/tilerator-sorted-7393crjv8Q2YCzcA 
tilerat+  7414  0.0  0.0  24564  1944 ?        S    Sep10   0:00  |                           \_ sort -u -n -o /tmp/tilerator-sorted-7393crjv8Q2YCzcA
root     27497  0.0  0.0  42240  2648 ?        S    Sep11   0:00  \_ /usr/sbin/CRON -f
osmupda+ 27499  0.0  0.0   4336   796 ?        Ss   Sep11   0:00  |   \_ /bin/sh -c /usr/local/bin/replicate-osm > /tmp/osm2pgsql.log 2>&1
osmupda+ 27501  0.0  0.0  13236  2876 ?        S    Sep11   0:00  |       \_ /bin/bash /usr/local/bin/replicate-osm
root     31652  0.0  0.0  40540  3452 ?        S    Sep11   0:00  |           \_ sudo -u tileratorui /usr/local/bin/notify-tilerator
tilerat+ 31653  0.0  0.0  13236  2740 ?        S    Sep11   0:00  |               \_ /bin/bash /usr/local/bin/notify-tilerator
tilerat+ 31654  0.0  0.0 1340604 76924 ?       Sl   Sep11   1:38  |                   \_ /usr/bin/nodejs /srv/deployment/tilerator/deploy/src/scripts/tileshell.js --config /etc/tileratorui/config.yaml -j.fromZoom
tilerat+ 31678  0.0  0.0   4336   752 ?        S    Sep11   0:00  |                       \_ /bin/sh -c sort -u -n -o /tmp/tilerator-sorted-31654msADbvT5C0LR 
tilerat+ 31679  0.0  0.0  24564  1988 ?        S    Sep11   0:00  |                           \_ sort -u -n -o /tmp/tilerator-sorted-31654msADbvT5C0LR
root     21459  0.0  0.0  42240  2648 ?        S    Sep12   0:00  \_ /usr/sbin/CRON -f
osmupda+ 21461  0.0  0.0   4336   720 ?        Ss   Sep12   0:00  |   \_ /bin/sh -c /usr/local/bin/replicate-osm > /tmp/osm2pgsql.log 2>&1
osmupda+ 21464  0.0  0.0  13236  2876 ?        S    Sep12   0:00  |       \_ /bin/bash /usr/local/bin/replicate-osm
root     25947  0.0  0.0  40540  3456 ?        S    Sep12   0:00  |           \_ sudo -u tileratorui /usr/local/bin/notify-tilerator
tilerat+ 25948  0.0  0.0  13236  2616 ?        S    Sep12   0:00  |               \_ /bin/bash /usr/local/bin/notify-tilerator
tilerat+ 25949  0.0  0.0 1346720 88216 ?       Sl   Sep12   1:10  |                   \_ /usr/bin/nodejs /srv/deployment/tilerator/deploy/src/scripts/tileshell.js --config /etc/tileratorui/config.yaml -j.fromZoom
tilerat+ 26005  0.0  0.0   4336   804 ?        S    Sep12   0:00  |                       \_ /bin/sh -c sort -u -n -o /tmp/tilerator-sorted-25949CfZ2l1GTbzks 
tilerat+ 26006  0.0  0.0  24564  2016 ?        S    Sep12   0:00  |                           \_ sort -u -n -o /tmp/tilerator-sorted-25949CfZ2l1GTbzks
root     14253  0.0  0.0  42240  2648 ?        S    01:27   0:00  \_ /usr/sbin/CRON -f
osmupda+ 14256  0.0  0.0   4336   804 ?        Ss   01:27   0:00      \_ /bin/sh -c /usr/local/bin/replicate-osm > /tmp/osm2pgsql.log 2>&1
osmupda+ 14259  0.0  0.0  13236  2940 ?        S    01:27   0:00          \_ /bin/bash /usr/local/bin/replicate-osm
root     21037  0.0  0.0  40540  3384 ?        S    02:06   0:00              \_ sudo -u tileratorui /usr/local/bin/notify-tilerator
tilerat+ 21038  0.0  0.0  13236  2684 ?        S    02:06   0:00                  \_ /bin/bash /usr/local/bin/notify-tilerator
tilerat+ 21039  0.0  0.0 1343592 84464 ?       Sl   02:06   0:39                      \_ /usr/bin/nodejs /srv/deployment/tilerator/deploy/src/scripts/tileshell.js --config /etc/tileratorui/config.yaml -j.fromZoom
tilerat+ 21059  0.0  0.0   4336   756 ?        S    02:06   0:00                          \_ /bin/sh -c sort -u -n -o /tmp/tilerator-sorted-21039IvY6iKUJEVdx 
tilerat+ 21060  0.0  0.0  24564  2048 ?        S    02:06   0:00                              \_ sort -u -n -o /tmp/tilerator-sorted-21039IvY6iKUJEVdx
Gehel created this task.Sep 13 2016, 4:51 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 13 2016, 4:51 PM
Restricted Application added a project: Discovery. · View Herald TranscriptSep 13 2016, 4:52 PM
Gehel added a comment.Sep 13 2016, 5:29 PM

An error in the notify-tilerator script meant that the -j.expdirpath parameter was not passed to tileshell.js. This might explain the issue.

Even with the corrected script, tileshell is not terminating. The command is:

/usr/bin/nodejs /srv/deployment/tilerator/deploy/src/scripts/tileshell.js --config /etc/tileratorui/config.yaml -j.fromZoom 10 -j.beforeZoom 16 -j.generatorId gen -j.storageId v3 -j.deleteEmpty -j.expdirpath /srv/osm_expire/ -j.expmask expire\.list\.* -j.statefile /var/run/tileratorui/expire.state

which seems to be correct. I'm wondering why tileshell needs to exec sort. Wouldn't it be easier to do sorting in node? (probably a stupid question, but here it is).

tileshell seems to create temporary files in /tmp. All those files are empty and it looks like cleanup is not happening.

@Yurik / @MaxSem could you have a look into this?

Yurik added a comment.Sep 14 2016, 5:37 PM

@Gehel the sorting is needed because we need to prevent duplicates, and to minimize the size of the job. Tilerator script converts the zoom/x/y tile coordinates into a single number (combined index), outputs it one number per line from all the source files, sorts with removing duplicates, and creates processing jobs with the resulting file.

Yurik moved this task from Backlog to Prioritized on the Maps-Sprint board.Sep 14 2016, 8:31 PM
Yurik triaged this task as High priority.
Yurik claimed this task.
MaxSem moved this task from All map-related tasks to Tilerator on the Maps board.Sep 23 2016, 7:12 AM

Mentioned in SAL (#wikimedia-operations) [2016-09-23T18:41:08Z] <gehel> killing stuck tilerator notification processes on maps1001 - T145534

Yurik closed this task as Invalid.Oct 12 2016, 6:20 PM

magically self-resolved

Gehel reopened this task as Open.Oct 13 2016, 12:31 PM

Same issue just happened again on maps-test2001. It seems that the expire file being processed is empty.

@Yurik could you check if notification handles the case where notification file exists but is empty correctly?

Yurik moved this task from Tilerator to Tilerator on the Maps board.Dec 6 2016, 3:04 PM
Yurik edited projects, added Maps (Tilerator); removed Maps.

Mentioned in SAL (#wikimedia-operations) [2016-12-14T08:48:35Z] <gehel> kill stuck notification script on maps-test2001 - T145534

MaxSem claimed this task.Jan 24 2017, 11:03 PM

@Gehel, any info on when this last happened? Looking at the stacktraces, there might be some info in /tmp/osm2pgsql.log but it's not present neither on maps2001 nor 1001 from which I can only conclude that there was nothing written there since the last reboots 89 days ago.

Gehel added a comment.Jan 25 2017, 9:49 AM

@MaxSem This seems to happened more often on maps-test2001, but has not happened for some time. My guess (but this is just a guess) is that the error is related to empty notifications, in the case where the update script runs, but no new data is available since last update. I can't find anything in the logs at the moment, but as far as I remember, there wasn't much in the logs last time I saw the issue, just a sort process stuck and not completing.

Updates are hourly, right? With that I'd be very surprised to see empty diffs

Gehel added a comment.Jan 25 2017, 2:03 PM

I'm probably completely wrong in my assumptions. I did see empty notification files, but that might well be a bug in the chain on our side...

MaxSem moved this task from Prioritized to In progress on the Maps-Sprint board.Jan 25 2017, 7:33 PM

@Gehel Can you catch up with @MaxSem about this on Monday? He's blocked on getting some more information on this. The two of you syncing up on IRC would be great. :-)

Gehel added a comment.Jan 28 2017, 8:17 AM

@Deskana will do! The proposal from @Pnorman was to increase replication frequency to help reproduce the issue, which seems to make sense. I'll implement that on Monday.

Change 335022 had a related patch set uploaded (by Gehel):
maps - increase replication frequency to each minute on maps-test cluster

https://gerrit.wikimedia.org/r/335022

Change 335022 merged by Gehel:
maps - increase replication frequency to each minute on maps-test cluster

https://gerrit.wikimedia.org/r/335022

Gehel added a comment.Jan 30 2017, 2:07 PM

@MaxSem frequency of OSM synchronisation is now per minute on maps-test cluster. I'll keep an eye on it to see if the bug is reproduced or not.

debt awarded a token.Jan 30 2017, 2:23 PM
Gehel added a comment.Jan 31 2017, 8:19 PM

Problem is reproduced with increased sync frequency.

Extract of ps axfu on maps-test2001:

root       679  0.0  0.0  27504  2832 ?        Ss    2016   0:12 /usr/sbin/cron -f
root     21264  0.0  0.0  42240  2828 ?        S    Jan30   0:00  \_ /usr/sbin/CRON -f
osmupda+ 21266  0.0  0.0   4336   760 ?        Ss   Jan30   0:00      \_ /bin/sh -c /usr/local/bin/replicate-osm >> /var/log/osmosis//osm2pgsql.log 2>&1
osmupda+ 21268  0.0  0.0  13236  3028 ?        S    Jan30   0:00          \_ /bin/bash /usr/local/bin/replicate-osm
root     21349  0.0  0.0  40540  3460 ?        S    Jan30   0:00              \_ sudo -u tileratorui /usr/local/bin/notify-tilerator
tilerat+ 21350  0.0  0.0  13236  2760 ?        S    Jan30   0:00                  \_ /bin/bash /usr/local/bin/notify-tilerator
tilerat+ 21351  0.0  0.0   5800   632 ?        S    Jan30   0:00                      \_ /usr/bin/flock -xn /srv/osmosis/replicate-osm.lck /usr/bin/nodejs /srv/deployment/tilerator/deploy/src/scripts/tileshell.js
tilerat+ 21352  0.0  0.1 1351120 92740 ?       Sl   Jan30   0:33                          \_ /usr/bin/nodejs /srv/deployment/tilerator/deploy/src/scripts/tileshell.js --config /etc/tileratorui/config.yaml -j.from
tilerat+ 21408  0.0  0.0   4336   768 ?        S    Jan30   0:00                              \_ /bin/sh -c sort -u -n -o /tmp/tilerator-sorted-21352Rlmkw4M6hIOT 
tilerat+ 21409  0.0  0.0  24564  2044 ?        S    Jan30   0:00                                  \_ sort -u -n -o /tmp/tilerator-sorted-21352Rlmkw4M6hIOT
saper added a subscriber: saper.Jan 31 2017, 9:10 PM

If node is hanging on I/O, external process execution etc. this might indicate problems with asynchronous I/O and exhaustion of the thread pool. A workaround could be to increase node thread pool size with "env UV_THREADPOOL_SIZE=128 node" when starting node (it's an env variable).

Can you get "info thread" and "bt all" with gdb from the stuck node process? Are any binary node modules used?

Gehel added a comment.Jan 31 2017, 9:11 PM

gdb info thread:

(gdb) info threads
  Id   Target Id         Frame 
  10   Thread 0x7f347462f700 (LWP 21353) "nodejs" 0x00007f3473bbc050 in sem_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
  9    Thread 0x7f34704f4700 (LWP 21354) "V8 WorkerThread" 0x00007f3473bbc050 in sem_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
  8    Thread 0x7f346fcf3700 (LWP 21355) "V8 WorkerThread" 0x00007f3473bbc050 in sem_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
  7    Thread 0x7f346f4f2700 (LWP 21356) "V8 WorkerThread" 0x00007f3473bbc050 in sem_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
  6    Thread 0x7f346ecf1700 (LWP 21357) "V8 WorkerThread" 0x00007f3473bbc050 in sem_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
  5    Thread 0x7f345ea99700 (LWP 21376) "nodejs" 0x00007f3473bba08f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  4    Thread 0x7f345e298700 (LWP 21377) "nodejs" 0x00007f3473bba08f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  3    Thread 0x7f345da97700 (LWP 21378) "nodejs" 0x00007f3473bba08f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  2    Thread 0x7f345d296700 (LWP 21379) "nodejs" 0x00007f3473bba08f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
* 1    Thread 0x7f347461e740 (LWP 21352) "nodejs" 0x00007f34705d95b9 in syscall () from /lib/x86_64-linux-gnu/libc.so.6
Gehel added a comment.EditedJan 31 2017, 9:26 PM

(gdb) thread apply all bt

Thread 10 (Thread 0x7f347462f700 (LWP 21353)):
#0  0x00007f3473bbc050 in sem_wait ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f3473fe8822 in uv_sem_wait ()
   from /usr/lib/x86_64-linux-gnu/libuv.so.1
#2  0x0000000000b162c2 in node::DebugSignalThreadMain (unused=<optimized out>)
    at ../src/node.cc:3612
#3  0x00007f3473bb60a4 in start_thread ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#4  0x00007f34705dd62d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 9 (Thread 0x7f34704f4700 (LWP 21354)):
#0  0x00007f3473bbc050 in sem_wait ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x0000000000b921f8 in v8::base::Semaphore::Wait (this=this@entry=0x200ef48)
    at ../deps/v8/src/base/platform/semaphore.cc:100
#2  0x0000000000b8a659 in v8::platform::TaskQueue::GetNext (this=0x200ef20)
    at ../deps/v8/src/libplatform/task-queue.cc:44
#3  0x0000000000b8a9bc in v8::platform::WorkerThread::Run (this=0x200e010)
    at ../deps/v8/src/libplatform/worker-thread.cc:25
#4  0x0000000000b93220 in NotifyStartedAndRun (this=0x200e010)
    at ../deps/v8/src/base/platform/platform.h:458
#5  v8::base::ThreadEntry (arg=0x200e010)
    at ../deps/v8/src/base/platform/platform-posix.cc:600
#6  0x00007f3473bb60a4 in start_thread ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007f34705dd62d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 8 (Thread 0x7f346fcf3700 (LWP 21355)):
#0  0x00007f3473bbc050 in sem_wait ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x0000000000b921f8 in v8::base::Semaphore::Wait (this=this@entry=0x200ef48)
    at ../deps/v8/src/base/platform/semaphore.cc:100
#2  0x0000000000b8a659 in v8::platform::TaskQueue::GetNext (this=0x200ef20)
    at ../deps/v8/src/libplatform/task-queue.cc:44
#3  0x0000000000b8a9bc in v8::platform::WorkerThread::Run (this=0x200ea10)
    at ../deps/v8/src/libplatform/worker-thread.cc:25
#4  0x0000000000b93220 in NotifyStartedAndRun (this=0x200ea10)
    at ../deps/v8/src/base/platform/platform.h:458
#5  v8::base::ThreadEntry (arg=0x200ea10)
    at ../deps/v8/src/base/platform/platform-posix.cc:600
#6  0x00007f3473bb60a4 in start_thread ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007f34705dd62d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 7 (Thread 0x7f346f4f2700 (LWP 21356)):
#0  0x00007f3473bbc050 in sem_wait ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x0000000000b921f8 in v8::base::Semaphore::Wait (this=this@entry=0x200ef48)
    at ../deps/v8/src/base/platform/semaphore.cc:100
#2  0x0000000000b8a659 in v8::platform::TaskQueue::GetNext (this=0x200ef20)
    at ../deps/v8/src/libplatform/task-queue.cc:44
#3  0x0000000000b8a9bc in v8::platform::WorkerThread::Run (this=0x200ea90)
    at ../deps/v8/src/libplatform/worker-thread.cc:25
#4  0x0000000000b93220 in NotifyStartedAndRun (this=0x200ea90)
    at ../deps/v8/src/base/platform/platform.h:458
#5  v8::base::ThreadEntry (arg=0x200ea90)
    at ../deps/v8/src/base/platform/platform-posix.cc:600
#6  0x00007f3473bb60a4 in start_thread ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007f34705dd62d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 6 (Thread 0x7f346ecf1700 (LWP 21357)):
#0  0x00007f3473bbc050 in sem_wait ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x0000000000b921f8 in v8::base::Semaphore::Wait (this=this@entry=0x200ef48)
    at ../deps/v8/src/base/platform/semaphore.cc:100
#2  0x0000000000b8a659 in v8::platform::TaskQueue::GetNext (this=0x200ef20)
    at ../deps/v8/src/libplatform/task-queue.cc:44
#3  0x0000000000b8a9bc in v8::platform::WorkerThread::Run (this=0x200f600)
    at ../deps/v8/src/libplatform/worker-thread.cc:25
#4  0x0000000000b93220 in NotifyStartedAndRun (this=0x200f600)
    at ../deps/v8/src/base/platform/platform.h:458
#5  v8::base::ThreadEntry (arg=0x200f600)
    at ../deps/v8/src/base/platform/platform-posix.cc:600
#6  0x00007f3473bb60a4 in start_thread ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007f34705dd62d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 5 (Thread 0x7f345ea99700 (LWP 21376)):
#0  0x00007f3473bba08f in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f3473fe89a9 in uv_cond_wait ()
   from /usr/lib/x86_64-linux-gnu/libuv.so.1
#2  0x00007f3473fdab5c in ?? () from /usr/lib/x86_64-linux-gnu/libuv.so.1
#3  0x00007f3473fe84b7 in ?? () from /usr/lib/x86_64-linux-gnu/libuv.so.1
#4  0x00007f3473bb60a4 in start_thread ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#5  0x00007f34705dd62d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 4 (Thread 0x7f345e298700 (LWP 21377)):
#0  0x00007f3473bba08f in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f3473fe89a9 in uv_cond_wait ()
   from /usr/lib/x86_64-linux-gnu/libuv.so.1
#2  0x00007f3473fdab5c in ?? () from /usr/lib/x86_64-linux-gnu/libuv.so.1
#3  0x00007f3473fe84b7 in ?? () from /usr/lib/x86_64-linux-gnu/libuv.so.1
#4  0x00007f3473bb60a4 in start_thread ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#5  0x00007f34705dd62d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 3 (Thread 0x7f345da97700 (LWP 21378)):
#0  0x00007f3473bba08f in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f3473fe89a9 in uv_cond_wait ()
   from /usr/lib/x86_64-linux-gnu/libuv.so.1
#2  0x00007f3473fdab5c in ?? () from /usr/lib/x86_64-linux-gnu/libuv.so.1
#3  0x00007f3473fe84b7 in ?? () from /usr/lib/x86_64-linux-gnu/libuv.so.1
#4  0x00007f3473bb60a4 in start_thread ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#5  0x00007f34705dd62d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 2 (Thread 0x7f345d296700 (LWP 21379)):
#0  0x00007f3473bba08f in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f3473fe89a9 in uv_cond_wait ()
   from /usr/lib/x86_64-linux-gnu/libuv.so.1
#2  0x00007f3473fdab5c in ?? () from /usr/lib/x86_64-linux-gnu/libuv.so.1
#3  0x00007f3473fe84b7 in ?? () from /usr/lib/x86_64-linux-gnu/libuv.so.1
#4  0x00007f3473bb60a4 in start_thread ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#5  0x00007f34705dd62d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 1 (Thread 0x7f347461e740 (LWP 21352)):
#0  0x00007f34705d95b9 in syscall () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f3473fed30a in ?? () from /usr/lib/x86_64-linux-gnu/libuv.so.1
#2  0x00007f3473feb7d5 in ?? () from /usr/lib/x86_64-linux-gnu/libuv.so.1
#3  0x00007f3473fdd7f8 in uv_run () from /usr/lib/x86_64-linux-gnu/libuv.so.1
#4  0x0000000000b23400 in StartNodeInstance (arg=<synthetic pointer>)
    at ../src/node.cc:4223
#5  node::Start (argc=19, argv=<optimized out>) at ../src/node.cc:4299
#6  0x00007f3470516b45 in __libc_start_main ()
   from /lib/x86_64-linux-gnu/libc.so.6
#7  0x000000000064b5c3 in _start ()
saper added a comment.Jan 31 2017, 9:38 PM

Looks like node is waiting for the external process running and the threadpool is not used at the moment.

With @saper's amazing help, we've narrowed it to tilerator building an empty file list for sort, causing it to hang forever waiting for stdin.

So, as it turns out, sort is sitting there waiting for data from stdin which never come.

srcFiles is probably empty https://github.com/kartotherian/tilerator-jobprocessor/blob/master/lib/fileParser.js#L116

var command = 'sort -u -n -o ' + escapeShellArg(dstFile) + ' ' + srcFiles;

So the problem is probably somewhere around https://github.com/kartotherian/tilerator-jobprocessor/blob/master/lib/fileParser.js#L260-L271

MaxSem moved this task from In progress to Needs review on the Maps-Sprint board.Feb 3 2017, 2:24 AM

Some review was done by @Yurik and @MaxSem addressed his comments. @MaxSem will poke again for further review.

Fixed in code and needs to be deployed.

This was deployed! @Gehel will double-check on the maps-test cluster to make sure that the deployment definitely fixed this.

Gehel added a comment.Mar 6 2017, 9:08 AM

It looks like the issue is still there (see log below). Tilerator notification seems to be stuck since March 3 19:13 UTC, which is later than the latest tilerator deployment (March 2).

Looking at maps-test2001:/srv/deployment/tilerator/deploy/node_modules/tilerator-jobprocessor/lib/fileParser.js, I cannot see the commit fixing this issue. So the fix is probably ok, but we have a deployment issue.

@MaxSem can we have a look together when you arrive?

Note: I am going to wait for some investigation before restarting tilerator on the rest of the cluster.

root     13237  0.0  0.0  42240  2828 ?        S    Mar03   0:00  \_ /usr/sbin/CRON -f
osmupda+ 13239  0.0  0.0   4336   724 ?        Ss   Mar03   0:00      \_ /bin/sh -c /usr/local/bin/replicate-osm >> /var/log/osmosis//osm2pgsql.log 2>&1
osmupda+ 13240  0.0  0.0  13236  2960 ?        S    Mar03   0:00          \_ /bin/bash /usr/local/bin/replicate-osm
root     13766  0.0  0.0  40540  3476 ?        S    Mar03   0:00              \_ sudo -u tileratorui /usr/local/bin/notify-tilerator
tilerat+ 13767  0.0  0.0  13236  2752 ?        S    Mar03   0:00                  \_ /bin/bash /usr/local/bin/notify-tilerator
tilerat+ 13768  0.0  0.0   5800   696 ?        S    Mar03   0:00                      \_ /usr/bin/flock -xn /srv/osmosis/replicate-osm.lck /usr/bin/nodejs /srv/deployment/tilerator/deploy/src/scripts/tileshell.js --config /etc/tileratorui/config.yaml -j.fromZoom 10 -j.beforeZoom 16 -j.generatorId gen -j.storageId v5 -j.deleteEmpty -j.expdirpath /srv/osm_expire/ -j.expmask expire\.list\.* -j.statefile /var/run/tileratorui/expire.state
tilerat+ 13769  0.0  0.1 1348912 74488 ?       Sl   Mar03   1:42                          \_ /usr/bin/nodejs /srv/deployment/tilerator/deploy/src/scripts/tileshell.js --config /etc/tileratorui/config.yaml -j.fromZoom 10 -j.beforeZoom 16 -j.generatorId gen -j.storageId v5 -j.deleteEmpty -j.expdirpath /srv/osm_expire/ -j.expmask expire\.list\.* -j.statefile /var/run/tileratorui/expire.state
tilerat+ 13842  0.0  0.0   4336   756 ?        S    Mar03   0:00                              \_ /bin/sh -c sort -u -n -o /tmp/tilerator-sorted-13769a2mHyB5Qrh1f 
tilerat+ 13843  0.0  0.0  24564  1920 ?        S    Mar03   0:00                                  \_ sort -u -n -o /tmp/tilerator-sorted-13769a2mHyB5Qrh1f

Mentioned in SAL (#wikimedia-operations) [2017-03-06T09:09:12Z] <gehel> killing stuck tilerator notification on maps-test2001 - T145534

Okay, I figured out what went wrong, however trying to deploy now is a minefield due to shrinkwrapping, and we can only ditch it after Node 6 switch - I propose we do that very thing tomorrow morning on test hosts.

The fix is now live everywhere for realz.

Gehel moved this task from Stalled/Waiting to Done on the Maps-Sprint board.Mar 22 2017, 1:45 PM

Error has not been seen again, let's consider this solved and re-open if needed.

Deskana closed this task as Resolved.Mar 22 2017, 1:48 PM

Resolving per the above.