
tileshell does not honor redis configuration in /etc/tilerator/config.yaml
Closed, ResolvedPublic

Description

While tilerator itself honors the redis configuration, tileshell uses default values (localhost, on the default redis port). This is an issue because authentication is configured in the tilerator config file. The fix is probably simple once we know where to apply it (that's the hard part).
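
A quick way to confirm the symptom while a run is in progress (a rough sketch; it only compares the open connections against whatever redis block is in the config, without assuming its exact layout):

# which redis endpoint does the running tileshell actually have open?
pgrep -f tileshell.js | xargs -r -I{} lsof -a -iTCP -p {}
# compare with the redis host/credentials tilerator is configured with
grep -i -A3 redis /etc/tilerator/config.yaml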

@MaxSem might be able to give us some help on this.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Contrary to what I thought at first, this bug also affects the update notification on the production maps cluster. This also raises the question of why it took so long to discover this error (the answer, as always, is insufficient monitoring).

Hi @MaxSem - would you be able to take a look at this?

I might have time to look at this next week.

@MaxSem did you find time to have a look?

MaxSem moved this task from To-do to In progress on the Maps-Sprint board.

The above pull request has been merged; now tilerator needs a deploy. Looks like I'm still the only person at WMF who has ever deployed the maps services - I'd rather have someone else try it now, while both the docs I wrote and I are available to help, than wait for me to get hit by a bus.

Hey @MaxSem, thanks for helping out! @Gehel will be happy to help deploy the map service with you on Monday. :)

Actually, I'm off on Monday, but I can deploy on Tuesday.

I'm trying to build a new tilerator version, following the instructions. The main tilerator repo is now on GitHub, so it looks like we need to sync our gerrit copy to use it as a submodule of the deployment repo. @MaxSem is that correct?

Change 379231 had a related patch set uploaded (by Gehel; owner: Gehel):
[maps/tilerator/deploy@master] update npm-shrinkwrap for latest tilerator version

https://gerrit.wikimedia.org/r/379231

I've updated tilerator/deploy according to my understanding. I also pushed the new npm dependencies, which is not in the doc (https://wikitech.wikimedia.org/wiki/Maps#Building_Kartotherian_and_Tilerator) but seems to be the right thing to do, judging by the history of the project.
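
For reference, a sketch of the kind of update this involves (the paths, remote and exact commands here are assumptions, not the verified procedure from the doc):

cd tilerator/deploy                         # local clone of maps/tilerator/deploy
git -C src fetch origin                     # src is the tilerator source submodule
git -C src checkout <new tilerator commit>  # pin the submodule to the version being deployed
npm install                                 # refresh the dependency tree for the new version
git add src npm-shrinkwrap.json
git commit -m "update npm-shrinkwrap for latest tilerator version"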

@MaxSem could you check if that looks correct to you?

It looks like we have commits on our kartotherian repo which are not upstream. We either want to merge upstream into our repo, or rebase our changes on top of upstream and push them upstream. I'd prefer to rebase and push upstream, so we can ensure we run exactly the same version, which is much simpler to follow.

@MaxSem: any input?

Change 379231 abandoned by Gehel:
update npm-shrinkwrap for latest tilerator version (4ec0a81)

Reason:
shrinkwrap is not needed anymore; I was following the wrong instructions. Let's start again from scratch.

https://gerrit.wikimedia.org/r/379231

Change 380771 had a related patch set uploaded (by Gehel; owner: Gehel):
[maps/tilerator/deploy@master] Update tilerator to 883ee1f

https://gerrit.wikimedia.org/r/380771

I have a build of tilerator ready to be deployed on maps-test (https://gerrit.wikimedia.org/r/380771) and I think it is mostly correct. It includes quite a lot of dependency updates, so some testing will be needed. @MaxSem I'd really like your review on it, just to ensure I did not do something entirely stupid.

I'm trying to do the same with kartotherian, but we have diverged from upstream by 2 commits. I'm tempted to (rough git commands are sketched after this list):

  • cherry-pick 29ae877376ff4a04c2cfae8179dd0bd1e1fa70bd (Add caching headers)
  • drop 86354b89eda66e37e37f29769e9de3585d6f3b05 (Switch to node 6), which seems to be superseded by a bunch of version upgrades upstream
  • reset our gerrit repo to upstream: we don't really have any reason to diverge, and merging with upstream will just make things harder to track in the long term. I think it is better to do a history rewrite now.
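
In git terms, that plan would look roughly like this (remote names are placeholders; this is a sketch of the intent, not the exact commands to run):

git fetch upstream                                        # upstream = the GitHub kartotherian repo
git checkout -B master upstream/master                    # make our branch identical to upstream
git cherry-pick 29ae877376ff4a04c2cfae8179dd0bd1e1fa70bd  # keep "Add caching headers" (and send it upstream too)
# 86354b89eda66e37e37f29769e9de3585d6f3b05 ("Switch to node 6") is dropped on purpose
git push --force gerrit master                            # the history rewrite on our gerrit copy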

@MaxSem, @Pnorman: your opinion is welcome!

Change 380771 merged by Gehel:
[maps/tilerator/deploy@master] Update tilerator to 883ee1f

https://gerrit.wikimedia.org/r/380771

Mentioned in SAL (#wikimedia-operations) [2017-10-02T08:47:14Z] <gehel> deploying latest tilerator on maps-test only, fix for T175123

Looking at grafana, we can confirm that tile generation has started again. Metrics are not collected per cluster (another thing that should be fixed). Checking the /jobs endpoint on tileratorui shows that the jobs are from the maps-test cluster, as expected.
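
(For the record, the check is just a request against the tileratorui admin port, along the lines of the sketch below; host and port here are placeholders, not the real address.)

curl -s "http://$TILERATORUI_HOST:$TILERATORUI_PORT/jobs"   # placeholders; lists the queued and active jobs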

We are getting errors in the osm2pgsql logs on maps-test2001:/var/log/osmosis/osm2pgsql.log:

Reading in file: -
Using XML parser.
Maximum node in persistent node cache: 5142734847
node cache: stored: 0(-nan%), storage efficiency: -nan% (dense blocks: 0, sparse nodes: 0), hit rate: -nan%
Osm2pgsql failed due to ERROR: XML parsing error at line 1, column 0: no element found
Error while replicating OSM data

tileshell also looks like it is stuck, but I can't find any log...

Odds are that the file being passed in is empty. What's above it in the log? Didn't we see this once before?

Yep, looks like T168241: OSM replication stuck on tilerator notification

The following command started at 9:30, about 12h ago:

/usr/bin/nodejs /srv/deployment/tilerator/deploy/src/scripts/tileshell.js --config /etc/tileratorui/config.yaml -j.fromZoom 10 -j.beforeZoom 16 -j.generatorId gen -j.storageId v3 -j.deleteEmpty -j.expdirpath /srv/osm_expire/ -j.expmask expire\.list\.* -j.statefile /var/run/tileratorui/expire.state

This gives us a *lot* more information because we've caught it right after it happened.

  • The 201710020930 expire list in question is no bigger than normal, although the previous one is large:
        1433 /srv/osm_expire/expire.list.201710020847
      384162 /srv/osm_expire/expire.list.201710020929
        2774 /srv/osm_expire/expire.list.201710020930
  • io is idle
  • the process list is:
root       647  0.0  0.0  29552  2484 ?        Ss   Jul04   1:54 /usr/sbin/cron -f
root     21682  0.0  0.0  44288  2824 ?        S    09:30   0:00  \_ /usr/sbin/CRON -f
osmupda+ 21686  0.0  0.0   6384   768 ?        Ss   09:30   0:00      \_ /bin/sh -c /usr/local/bin/replicate-osm >> /var/log/osmosis//osm2pgsql.log 2>&1
osmupda+ 21689  0.0  0.0  15284  2960 ?        S    09:30   0:00          \_ /bin/bash /usr/local/bin/replicate-osm
root     21798  0.0  0.0  42588  3544 ?        S    09:30   0:00              \_ sudo -u tileratorui /usr/local/bin/notify-tilerator
tilerat+ 21799  0.0  0.0  15284  2688 ?        S    09:30   0:00                  \_ /bin/bash /usr/local/bin/notify-tilerator
tilerat+ 21800  0.0  0.0   7848   656 ?        S    09:30   0:00                      \_ /usr/bin/flock -xn /srv/osmosis/replicate-osm.lck /usr/bin/nodejs /srv/deployment/tilerator/deploy/src/scripts/tileshell.js --config /etc/tileratorui/config.yaml -j.fromZoom 10 -j.beforeZoom 16 -j.generatorId gen -j.storageId v3 -j.deleteEmpty -j.expdirpath /srv/osm_expire/ -j.expmask expire\.list\.* -j.statefile /var/run/tileratorui/expire.state
tilerat+ 21801  0.2  0.1 1578668 119724 ?      Sl   09:30   1:35                          \_ /usr/bin/nodejs /srv/deployment/tilerator/deploy/src/scripts/tileshell.js --config /etc/tileratorui/config.yaml -j.fromZoom 10 -j.beforeZoom 16 -j.generatorId gen -j.storageId v3 -j.deleteEmpty -j.expdirpath /srv/osm_expire/ -j.expmask expire\.list\.* -j.statefile /var/run/tileratorui/expire.state

The interesting part of the lsof output is:

COMMAND   PID        USER   FD   TYPE    DEVICE SIZE/OFF      NODE NAME
nodejs  21801 tileratorui  cwd    DIR     254,0     4096    654804 /var/spool/cron
nodejs  21801 tileratorui  rtd    DIR     254,0     4096         2 /
nodejs  21801 tileratorui  txt    REG     254,0 15565432   1316208 /usr/bin/nodejs
[ various globally installed libraries ]
nodejs  21801 tileratorui  mem    REG     254,1  2613056  14943883 /srv/deployment/tilerator/deploy-cache/revs/b02bd9af937ad53a865192ae2fccf95ba1c8d455/node_modules/mapnik/lib/binding/node-v48-linux-x64/mapnik.node
[ more libraries ] 
nodejs  21801 tileratorui    0r  FIFO      0,10      0t0 164026841 pipe
nodejs  21801 tileratorui    1w   REG     254,0  5120704    657573 /var/log/osmosis/osm2pgsql.log
nodejs  21801 tileratorui    2w   REG     254,0  5120704    657573 /var/log/osmosis/osm2pgsql.log
nodejs  21801 tileratorui    3r   REG     254,1        0  69206018 /srv/osmosis/replicate-osm.lck
nodejs  21801 tileratorui    4r  FIFO      0,10      0t0 164025807 pipe
nodejs  21801 tileratorui    5w  FIFO      0,10      0t0 164025807 pipe
nodejs  21801 tileratorui    6u  0000      0,11        0      9537 anon_inode
nodejs  21801 tileratorui    7r  FIFO      0,10      0t0 164025808 pipe
nodejs  21801 tileratorui    8w  FIFO      0,10      0t0 164025808 pipe
nodejs  21801 tileratorui    9u  0000      0,11        0      9537 anon_inode
nodejs  21801 tileratorui   10r   CHR       1,9      0t0      1033 /dev/urandom
nodejs  21801 tileratorui   11r   CHR       1,3      0t0      1028 /dev/null
nodejs  21801 tileratorui   12u  IPv4 164022199      0t0       TCP maps-test2001.codfw.wmnet:55534->maps-test2001.codfw.wmnet:9042 (ESTABLISHED)
nodejs  21801 tileratorui   13u  IPv4 164025823      0t0       TCP maps-test2001.codfw.wmnet:37152->maps-test2002.codfw.wmnet:9042 (ESTABLISHED)
nodejs  21801 tileratorui   14u  IPv4 164027947      0t0       TCP maps-test2001.codfw.wmnet:48146->maps-test2003.codfw.wmnet:9042 (ESTABLISHED)
nodejs  21801 tileratorui   15u  IPv4 164026977      0t0       TCP maps-test2001.codfw.wmnet:57420->maps-test2001.codfw.wmnet:6379 (ESTABLISHED)
nodejs  21801 tileratorui   16u  IPv4 164026978      0t0       TCP maps-test2001.codfw.wmnet:57422->maps-test2001.codfw.wmnet:6379 (ESTABLISHED)
nodejs  21801 tileratorui   18u  IPv4 164026022      0t0       TCP maps-test2001.codfw.wmnet:57540->maps-test2001.codfw.wmnet:6379 (ESTABLISHED)
nodejs  21801 tileratorui   44w   REG     254,0        0   1569904 /tmp/tilerator-expire.list.201709010943-21801dOEnMykd2d1Q (deleted)
nodejs  21801 tileratorui   46w   REG     254,0        0   1569906 /tmp/tilerator-expire.list.201709010944-2180117sw2gqGEaj8 (deleted)
[about 10k similar lines]
nodejs  21801 tileratorui *534w   REG     254,0        0   1589750 /tmp/tilerator-expire.list.201710020847-21801FyLbuz7Z6c8K (deleted)
nodejs  21801 tileratorui *536w   REG     254,0        0   1589751 /tmp/tilerator-expire.list.201710020930-21801UJrollAkZi1f (deleted)

Last line repeats about 12.5k times.

Looking at the deleted files, they're all empty.

Looking at /srv/osm_expire/ I see there are 12.5k expire lists. Are we failing to clean these up?

I had a go at attaching to the running process and sent it SIGUSR1 but couldn't get node-inspector installed.
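
For whoever picks this up, the attach sequence would roughly be the following (this assumes node 6's legacy debugger and its default port, which I have not verified on this host):

kill -USR1 21801              # ask the stuck nodejs process to start its debug listener (port 5858 by default)
nodejs debug localhost:5858   # attach the built-in CLI debugger; node-inspector would offer a web UI instead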

I was wondering if this might be some condition caused by too many lines in /srv/osm_expire/. There are 124m lines there. But I can't see why it would have gone from working to not working with the addition of a few more. I'm still tempted to clean up /srv/osm_expire/ just to rule it out.
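
Something like the following would do for the cleanup, keeping the most recent lists in case we need to replay them (the retention window is arbitrary; pick whatever seems safe):

ls /srv/osm_expire/ | wc -l                                    # how many expire lists we have
cat /srv/osm_expire/expire.list.* | wc -l                      # total number of expired tiles listed
find /srv/osm_expire/ -name 'expire.list.*' -mtime +7 -delete  # drop lists older than a week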

I think we need a nodejs developer for some guidance on what to do next.

Mentioned in SAL (#wikimedia-operations) [2017-10-03T07:20:43Z] <gehel> killing stuck tileshell and cleaning up /srv/osm_expire on maps-test2001 - T175123

A few notes after some investigation with @Pnorman:

  • the empty input to osm2pgsql which we see in the logs seems to come from flock not being used correctly; it is noisy, but should not be a direct issue (see the sketch after this list)
  • tileshell being stuck is the real issue
  • I did see tileshell generating jobs in redis after its first restart
  • we are not sure if tileshell is getting stuck before sending jobs to redis, during, or after (and just not exiting cleanly)
  • we should try killing and restarting tileshell to see if the jobs generated match the expire files entirely
  • the current shell scripts wrapping the replication process are a mess and should be replaced by the ones @Pnorman has been working on (this will not solve our actual issue, but will reduce the noise)
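
On the flock point, the fix is probably to hold the lock around the whole replication pipeline and bail out early, instead of piping an empty stream into osm2pgsql. A minimal sketch of the intent (not the actual replicate-osm script):

exec 9>/srv/osmosis/replicate-osm.lck
if ! flock -xn 9; then
    echo "previous replication still running, skipping this run" >&2
    exit 0
fi
# ... osmosis / osm2pgsql and the tilerator notification run here, while fd 9 holds the lock ...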

We've done most of what we can; at this point we need help from a nodejs developer.

Hi @zeljkofilipin - would you have some time to help out with this issue, or let us know who could?

@MaxSem you might have some idea on this as well... if you have some time...

@debt I am reading the ticket description and comments, but I did not understand what the actual problem is. :| Hence, I do not think I could help. :) Could you please update the ticket description to make it explicit what you need help with?

@zeljkofilipin yeah, it might make sense to break our current problem out into a subtask, doing that right now...

This was resolved with the fix to T177389.