
tileshell does not exit
Closed, Resolved · Public

Description

After fixing T175123, we find that tileshell does not exit after being notified of updates. Some investigation has been done in T175123#3652557 and T175123#3653195. gdb does not give us much info: P6068.

tileshell is run as:

/usr/bin/nodejs /srv/deployment/tilerator/deploy/src/scripts/tileshell.js \
    --config /etc/tileratorui/config.yaml \
    -j.fromZoom 10 \
    -j.beforeZoom 16 \
    -j.generatorId gen \
    -j.storageId v3 \
    -j.deleteEmpty \
    -j.expdirpath /srv/osm_expire/ \
    -j.expmask 'expire\.list\.*' \
    -j.statefile /var/run/tileratorui/expire.state

We expect tileshell to read the list of expired tiles in /srv/osm_expire/, enqueue jobs in redis to regenerate those tiles, update /var/run/tileratorui/expire.state and exit. While it seems to do its actual job, it does not exit... which blocks the OSM replication.
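For illustration only, here is a minimal sketch of that expected flow, assuming a plain callback-style node_redis (v3) client rather than the actual tilerator job-queue code; the queue key, job payload and helper names are made up:

'use strict';

const fs = require('fs');
const path = require('path');
const redis = require('redis'); // callback-style node_redis assumed, not the real tilerator client

const expDir = '/srv/osm_expire/';
const stateFile = '/var/run/tileratorui/expire.state';
const client = redis.createClient();

// Read every expire.list.* file and enqueue one regeneration job per file
// (the queue key and job payload below are hypothetical).
const files = fs.readdirSync(expDir).filter(f => /^expire\.list\./.test(f));
let pending = files.length;

function finishUp() {
    fs.writeFileSync(stateFile, new Date().toISOString());
    // Closing the connection is what lets the node event loop drain and the
    // process exit; this is exactly the step that seems to go wrong here.
    client.quit();
}

if (pending === 0) {
    client.quit();
} else {
    files.forEach(file => {
        const job = JSON.stringify({
            tiles: path.join(expDir, file),
            fromZoom: 10,
            beforeZoom: 16,
            generatorId: 'gen',
            storageId: 'v3'
        });
        client.rpush('tilerator:jobs', job, err => {
            if (err) { console.error(err); }
            if (--pending === 0) { finishUp(); }
        });
    });
}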

Event Timeline

@zeljkofilipin hopefully the description above is a bit cleaner now. As I don't really know how tileshell works and don't know anything about nodejs, the current analysis is fairly succinct... sorry for that.

@Gehel unfortunately, I am also not familiar with it. This is probably something to mention at the next scrum of scrums, if you are stuck.

Yeah, the problem is that pretty much no one is familiar with that :)

@zeljkofilipin @Gehel is this something that could be rewritten in a form that people we already have can support?

We probably have enough people who know nodejs, just not in our team :) We could probably rewrite it in a number of languages, but the issue here is that no one knows the code, and the people on the maps team don't have much experience with node...

@debt I have no clue what is going on here :) this might be a good question for one of the mailing lists (wikitech-l, engineering) or scrum of scrums.

debt added subscribers: Eevans, mobrovac, kaldari.

Thanks, @zeljkofilipin :)

Hi @mobrovac and @Eevans - would you be able to spare some time and help us with this node.js issue? :)

While it seems to do its actual job, it does not exit... which blocks the OSM replication.

This is a very confusing statement. If it doesn't exit, how can the script actually do something? What is the actual error, and how/when/where does it manifest itself?

This is a very confusing statement. If it doesn't exit, how can the script actually do something?

We think it is successfully pushing its data to redis and then failing to exit. A different process reads the data from redis.

What is the actual error

The script stays running for days without consuming any CPU, doing any I/O, or doing anything else.

and how/when/where does it manifest itself?

Randomly. We haven't found a way of reproducing it 100% of the time, but it happens both in production and in test. It happens more often in test, but there the script is called every minute instead of every day.
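For reference, the behaviour described above (the script finishes its work, then sits idle forever with no CPU or I/O) is the classic symptom of a node process whose event loop still has an open handle; node only exits on its own once the event loop has nothing left to wait on. A minimal repro of that class of bug, again assuming a callback-style node_redis client:

'use strict';

const redis = require('redis'); // callback-style node_redis assumed

const client = redis.createClient();

client.rpush('some:queue', JSON.stringify({ job: 'regen-tiles' }), err => {
    if (err) { throw err; }
    console.log('done');
    // Without client.quit() (or client.end(true)) here, the redis socket stays
    // registered with the event loop, so the process never exits, even though
    // it uses no CPU and does no further I/O.
});

A single leaked socket, timer or file watcher is enough to keep the process alive for days.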

I took a look at the code, but there are no obvious causes of hanging. One thing that I would suggest is to try to isolate the changes that started causing this: try to identify any recent commits or node module dependency changes that were made before the problem started manifesting itself.

We've already had similar issues with this script before (T145534, T168241), and we still don't have a scenario where we can reproduce it in 100% of cases. This does not help finding the change that introduced the issue...

This does not help finding the change that introduced the issue...

Ya, I don't think we're going to be able to identify it that way.

I took a look at the code, but there are no obvious causes of hanging

Are there diagnostics we could run or add to see what it's doing?

Since the behaviour is not deterministic, I suspect this might be connected to the type of job(s) it is processing. Adding some debug logging throughout the code execution path might give us more clues/hints as to the exact point it hangs and for which jobs that happens.
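One cheap diagnostic along those lines (not currently in tileshell) would be to dump the event loop's active handles right after the script thinks it is finished; process._getActiveHandles() is undocumented but has been available in node for a long time. A hypothetical helper:

// Hypothetical helper: log whatever is still keeping the event loop alive.
function dumpActiveHandles(label) {
    const handles = process._getActiveHandles();
    console.error(`[${label}] ${handles.length} active handle(s):`);
    handles.forEach(h => {
        console.error('  -', h.constructor ? h.constructor.name : typeof h);
    });
    console.error(`[${label}] active requests:`, process._getActiveRequests().length);
}

// e.g. call it right after the final "done" log line:
dumpActiveHandles('after done');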

While poking at this, I discovered 2 hanging replicate-osm processes whose tileshell was killed:

maxsem@maps-test2001:~$ ps aux|grep osm2pgsql
osmupda+  9617  0.0  0.0   6384   728 ?        Ss   Oct03   0:00 /bin/sh -c /usr/local/bin/replicate-osm >> /var/log/osmosis//osm2pgsql.log 2>&1
osmupda+  9629  0.0  0.5 6376424 344436 ?      Sl   Oct03   0:27 /usr/bin/osm2pgsql --hstore --slim --input-reader xml --cache 5368 --number-processes 4 --expire-tiles 15 --expire-output /srv/osm_expire/expire.list.201710030754 --username osmupdater --database gis --append --flat-nodes /srv/osmosis/nodes.bin -
maxsem   20932  0.0  0.0  14780  2124 pts/0    S+   18:46   0:00 grep osm2pgsql
osmupda+ 21598  0.0  0.0   6384   724 ?        Ss   Oct10   0:00 /bin/sh -c /usr/local/bin/replicate-osm >> /var/log/osmosis//osm2pgsql.log 2>&1
osmupda+ 21608  1.4  0.6 6154116 447624 ?      Sl   Oct10  30:03 /usr/bin/osm2pgsql --hstore --slim --input-reader xml --cache 5368 --number-processes 4 --expire-tiles 15 --expire-output /srv/osm_expire/expire.list.201710100827 --username osmupdater --database gis --append --flat-nodes /srv/osmosis/nodes.bin -
osmupda+ 32169  0.0  0.0   6384   820 ?        Ss   Oct03   0:00 /bin/sh -c /usr/local/bin/replicate-osm >> /var/log/osmosis//osm2pgsql.log 2>&1
osmupda+ 32180  0.0  0.6 6420308 442528 ?      Sl   Oct03   8:24 /usr/bin/osm2pgsql --hstore --slim --input-reader xml --cache 5368 --number-processes 4 --expire-tiles 15 --expire-output /srv/osm_expire/expire.list.201710030722 --username osmupdater --database gis --append --flat-nodes /srv/osmosis/nodes.bin -

After these were nuked, replication resumed. Waiting for another tileshell hangup to investigate further.

Tileshell has been stuck again since Oct 12 00:18. I'm not killing it, so that @MaxSem can investigate (on maps-test2001).

That particular hangup is 100% reproducible as long as the replication state file doesn't change. It just hangs forever after outputting "done". Simple pull request: https://github.com/kartotherian/tilerator/pull/13
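For context (the actual change is in the pull request above), the general shape of such a fix is to make sure nothing can keep the event loop alive once the work has been flushed, either by releasing every open resource or by terminating explicitly. A hypothetical sketch:

// Hypothetical wrap-up handler, not the code from the pull request.
function finish(err) {
    if (err) {
        console.error(err);
        process.exit(1);
    }
    console.log('done');
    // In a real fix one would first release every open resource (job-queue
    // client, redis connections, timers, file watchers); this sketch simply
    // terminates explicitly once the work has been reported as done.
    process.exit(0);
}

Such a handler would be called from the final callback of the enqueue/state-file step.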

Change 384522 had a related patch set uploaded (by Gehel; owner: Gehel):
[maps/tilerator/deploy@master] Update tilerator to 8ab07d4

https://gerrit.wikimedia.org/r/384522

Change 384522 merged by Gehel:
[maps/tilerator/deploy@master] Update tilerator to 8ab07d4

https://gerrit.wikimedia.org/r/384522

Mentioned in SAL (#wikimedia-operations) [2017-10-16T15:50:45Z] <gehel@tin> Started deploy [tilerator/deploy@f3b26f3]: testing tileshell fix - T177389

Mentioned in SAL (#wikimedia-operations) [2017-10-16T15:51:05Z] <gehel@tin> Finished deploy [tilerator/deploy@f3b26f3]: testing tileshell fix - T177389 (duration: 00m 20s)

Mentioned in SAL (#wikimedia-operations) [2017-10-16T15:51:56Z] <gehel> killing stuck tileshell process on maps-test2001 - T177389

I have not found any instance of a stuck tileshell yet, so this looks good and ready to be pushed to all servers.

Woohoo, thanks @Gehel and @MaxSem -- let's get this into production!

Mentioned in SAL (#wikimedia-operations) [2017-10-17T16:32:25Z] <gehel@tin> Started deploy [tilerator/deploy@f3b26f3]: fix tileshell not exiting - T177389

Mentioned in SAL (#wikimedia-operations) [2017-10-17T16:34:07Z] <gehel@tin> Finished deploy [tilerator/deploy@f3b26f3]: fix tileshell not exiting - T177389 (duration: 01m 42s)

Mentioned in SAL (#wikimedia-operations) [2017-10-17T16:35:06Z] <gehel> restart tilerator / tileratorui on all maps cluster after deployment of latest fix - T177389

Deployed, services restarted, we can close this until we see an issue again!

I checked the logs (/var/log/osmosis/osm2pgsql.log-20171018) on maps[12]001 and tileshell seems to have been executed without issue. Tile generation and system load spiked around 1:30am UTC, which matches what was expected.

All looks good!