
reimage maps-test servers
Closed, Resolved · Public

Description

Lot of experiments have been done on maps-test servers. Re-imaging them before testing migration to vector tiles would help make sure we are testing the right thing.

Event Timeline

Restricted Application added a subscriber: Aklapper.
debt triaged this task as Medium priority. Jun 27 2017, 7:15 PM

Mentioned in SAL (#wikimedia-operations) [2017-06-28T13:44:29Z] <gehel> start reimage of the maps-test cluster - T169011

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['maps-test2001.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201706281357_gehel_15053.log.

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['maps-test2001.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201706281515_gehel_10259.log.

Completed auto-reimage of hosts:

['maps-test2001.codfw.wmnet']

Of which those FAILED:

set(['maps-test2001.codfw.wmnet'])

Change 362155 had a related patch set uploaded (by Gehel; owner: Guillaume Lederrey):
[operations/puppet@production] scap3 - deployment of package requires configuration to already exist

https://gerrit.wikimedia.org/r/362155

Reimaging maps-test2001 shows some issues with the execution order of puppet resources, which has probably changed with the recent refactoring. The patch above should solve the issue, but it touches the service module, which is used by quite a few services.

Change 362155 merged by Gehel:
[operations/puppet@production] scap3 - deployment of package requires configuration to already exist

https://gerrit.wikimedia.org/r/362155

Change 363144 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] scap3 - deployment of package requires configuration to already exist

https://gerrit.wikimedia.org/r/363144

Change 363144 merged by Gehel:
[operations/puppet@production] scap3 - deployment of package requires configuration to already exist

https://gerrit.wikimedia.org/r/363144

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['maps-test2001.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201707041214_gehel_23411.log.

Completed auto-reimage of hosts:

['maps-test2001.codfw.wmnet']

and were ALL successful.

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['maps-test2002.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201707041251_gehel_26059.log.

Completed auto-reimage of hosts:

['maps-test2002.codfw.wmnet']

and were ALL successful.

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['maps-test2003.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201707041332_gehel_1055.log.

Completed auto-reimage of hosts:

['maps-test2003.codfw.wmnet']

and were ALL successful.

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['maps-test2004.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201707041411_gehel_5933.log.

Completed auto-reimage of hosts:

['maps-test2004.codfw.wmnet']

and were ALL successful.

@Gehel I've ACK'ed the Icinga alert for kartotherian endpoints health on the maps-test* hosts. Given that they have been in alarm for 4 days, my guess is that they were downtimed, but today it seems that Icinga lost the downtime again. Just FYI.

We got notifications on IRC again, so Icinga might need another poke. Gehel is on vacation, so if someone else can do what's required, it'd be good.

@Pnorman when was that?
I'm asking because the Icinga issue is T164206, and as of today around 12:10 UTC it should be fixed; see T164206#3435220 for the full explanation.
The alarm I mentioned above was already re-acked after the fix, and I can see the ack is still there.

06:43 PDT time, which is 01:43 UTC on Friday.

@Pnorman the issue on Icinga was solved around 12:10 UTC, but, as you can see in the detailed explanation in T164206#3435220, there was a loss of downtimes and acknowledgments. The cleanup of this was done a bit later after the fix; from the logs I can see that at 13:43 UTC the maps-test alarms were acknowledged again.

I was not in the discovery channel at that time so I cannot check, but if the bot is reporting the whole message, it should have the ACKNOWLEDGEMENT (CRITICAL) text in it.

The alarms themselves fired earlier, between 11:20 and 12:03 UTC, according to the Icinga logs.

Change 367415 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] maps - initial data import needs to be done with 900913 geometry

https://gerrit.wikimedia.org/r/367415

Change 367415 merged by Gehel:
[operations/puppet@production] maps - initial data import needs to be done with 900913 geometry

https://gerrit.wikimedia.org/r/367415

Change 367661 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] maps - ensure cleanup also deletes related objects

https://gerrit.wikimedia.org/r/367661

Change 367661 merged by Gehel:
[operations/puppet@production] maps - ensure cleanup also deletes related objects

https://gerrit.wikimedia.org/r/367661

osm2pgsql does not like the -E 900913 option:

sudo -E -u osmupdater osm2pgsql \
  --create --slim --flat-nodes nodes.bin -C 40000 --number-processes 8 \
  --hstore --host ${database_host} -U osmimporter -d gis -E 900913 \
  ${download_dir}/${filename} 2>&1 | tee ${log_file}
osm2pgsql version 0.92.0 (64 bit id space)

Osm2pgsql failed due to ERROR: creation of CRS failed: no options found in 'init' file
osm2pgsql failed to complete initial import

@Pnorman do you have an obvious answer as always?

I have the non-obvious answer.

osm2pgsql takes the input to -E and creates the projection +init=epsg:%1%, which causes proj.4 to look in proj's directory (/usr/share/proj) for the file epsg and the entry 900913 in it, which it fails to find. Current proj includes 900913 in others.extra, so it would need +init=others.extra:900913. There are four ways to fix this:

  1. Add 900913 to the epsg file with echo '<900913> +proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +wktext +no_defs <>' >> /usr/share/proj/epsg (or a properly maintainable equivalent)
  2. Patch osm2pgsql to special-case 900913 like it does 3857
  3. Extend osm2pgsql to take full init strings, or full proj.4 strings. There are some conceptual difficulties here, since PostGIS and proj don't always exactly match. These mismatches are currently ignored, but that wouldn't be possible with such a change.
  4. Don't use 900913.

My preferred option is 4.
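
For reference, a minimal sketch of what option 4 looks like in the import invocation quoted earlier in this task; only the -E value changes (and since 3857 is osm2pgsql's default output projection, the flag could even be dropped entirely):

sudo -E -u osmupdater osm2pgsql \
  --create --slim --flat-nodes nodes.bin -C 40000 --number-processes 8 \
  --hstore --host ${database_host} -U osmimporter -d gis -E 3857 \
  ${download_dir}/${filename} 2>&1 | tee ${log_file}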

Change 367713 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] maps - switch to using the default 3857 projection

https://gerrit.wikimedia.org/r/367713

Change 367713 merged by Gehel:
[operations/puppet@production] maps - switch to using the default 3857 projection

https://gerrit.wikimedia.org/r/367713

Damn, osm2pgsql failed again (see log below), or at least it did not return a 0 exit code (I see no documentation on exit codes in the man page, nor by asking Google nicely). The output from osm2pgsql itself:

Committing transaction for planet_osm_line
Committing transaction for planet_osm_polygon
Committing transaction for planet_osm_roads
Committing transaction for planet_osm_point
Committing transaction for planet_osm_line
Committing transaction for planet_osm_polygon
Committing transaction for planet_osm_roads
Committing transaction for planet_osm_point
Committing transaction for planet_osm_line
Committing transaction for planet_osm_polygon
Committing transaction for planet_osm_roads

Going over pending relations...
        0 relations are pending

Using 8 helper-processes
Finished processing 0 relations in 0 s

Committing transaction for planet_osm_point
WARNING:  there is no transaction in progress
Committing transaction for planet_osm_line
[...]
Committing transaction for planet_osm_line
WARNING:  there is no transaction in progress
Committing transaction for planet_osm_polygon
WARNING:  there is no transaction in progress
Committing transaction for planet_osm_roads
WARNING:  there is no transaction in progress
Committing transaction for planet_osm_point
WARNING:  there is no transaction in progress
Committing transaction for planet_osm_line
WARNING:  there is no transaction in progress
Committing transaction for planet_osm_polygon
WARNING:  there is no transaction in progress
Committing transaction for planet_osm_roads
WARNING:  there is no transaction in progress
Sorting data and creating indexes for planet_osm_line
Sorting data and creating indexes for planet_osm_point
Sorting data and creating indexes for planet_osm_roads
Sorting data and creating indexes for planet_osm_polygon
Copying planet_osm_roads to cluster by geometry finished
Creating geometry index on planet_osm_roads
Creating osm_id index on planet_osm_roads
Creating indexes on planet_osm_roads finished
All indexes on planet_osm_roads created in 2096s
Completed planet_osm_roads
Copying planet_osm_point to cluster by geometry finished
Creating geometry index on planet_osm_point
Copying planet_osm_line to cluster by geometry finished
Creating geometry index on planet_osm_line
Creating osm_id index on planet_osm_point
osm2pgsql failed to complete initial import

The last line is not from osm2pgsql.

What was the osm2pgsql exit code?

The osm2pgsql exit code is not stored by the script, which does

function initial_osm_import() {
  echo "starting initial OSM import"
  cd ${osmosis_dir}
  PGPASSWORD="$(< ${password_file})" flock -xn ${replicate_lock} \
    sudo -E -u osmupdater osm2pgsql \
      --create --slim --flat-nodes nodes.bin -C 40000 --number-processes 8 \
      --hstore --host ${database_host} -U osmimporter -d gis -E 3857 \
      ${download_dir}/${filename} 2>&1 | tee ${log_file}

  if [ ${PIPESTATUS[0]} -ne 0 ]; then
    echo "osm2pgsql failed to complete initial import"
    exit -1
  fi
  echo "initial OSM import completed"
}
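
If we want the actual exit code in the logs next time, a small tweak to the status handling would record it; a sketch of just that part (the pipeline stays as above, PIPESTATUS has to be read on the line immediately following it, and the variable name osm2pgsql_rc is only for illustration):

  # immediately after the osm2pgsql ... | tee ${log_file} pipeline:
  osm2pgsql_rc=${PIPESTATUS[0]}
  if [ ${osm2pgsql_rc} -ne 0 ]; then
    echo "osm2pgsql failed to complete initial import (exit code ${osm2pgsql_rc})"
    exit 1
  fi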

Looking through the postgres log, I see

2017-07-26 12:47:01 GMT [21708]: [92-1] user=osmimporter,db=gis,app=[unknown],client=::1 LOG:  duration: 6953794.555 ms  statement: CREATE INDEX planet_osm_line_index ON planet_osm_line USING GIST (way)
2017-07-26 12:47:01 GMT [21708]: [93-1] user=osmimporter,db=gis,app=[unknown],client=::1 LOG:  could not send data to client: Broken pipe
2017-07-26 12:47:01 GMT [21708]: [94-1] user=osmimporter,db=gis,app=[unknown],client=::1 FATAL:  connection to client lost
2017-07-26 12:47:01 GMT [21708]: [95-1] user=osmimporter,db=gis,app=[unknown],client=::1 LOG:  disconnection: session time: 17:32:01.074 user=osmimporter database=gis host=::1 port=59788

This matches the time Grafana shows for activity ending.

image.png (169×613 px, 13 KB)

This shows osm2pgsql was gone when postgres tried to send it information about a statement that had completed, but that is not when osm2pgsql itself exited. With a failure like this, osm2pgsql exits first, then each postgres connection finds out when it tries to report a statement that completed. Because the statements running at this time are long-running index builds, there's a delay.

The first broken pipe error is at 2017-07-26 11:21:08 GMT, on the planet_osm_point_pkey index creation. Because this is a quick index, it pins down the exit time of osm2pgsql to within a few minutes.

@Gehel can you check the system logs between 2017-07-26 11:12 and 2017-07-26 11:21:08?

@Pnorman you were right, there is a log entry from the OOM killer:

Jul 26 11:13:47 maps-test2001 kernel: [1895245.194686] Out of memory: Kill process 21706 (osm2pgsql) score 643 or sacrifice child

The obvious thing I would tune to reduce osm2pgsql memory usage is -C, but we have it set to 4000, which is 4GB if I understand correctly. Looking at the trace, it looks like osm2pgsql was using ~10GB at the time it was killed. Postgres is the other large consumer, followed by Cassandra. I'm not sure that reducing the osm2pgsql cache will do enough good...

See P5814 for the full log.
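
For reference, a quick way to pull such entries out of the kernel log on a host (log path and retention are assumptions, adjust to whatever these servers actually keep):

sudo dmesg -T | grep -i 'out of memory'
# or, if the kernel ring buffer has already rotated past the event:
sudo grep -i 'out of memory\|oom-killer' /var/log/syslog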

The obvious thing I would tune to reduce osm2pgsql memory usage is -C, but we have it set to 4000, which is 4GB if I understand correctly

You're missing a zero, it's 40GB. That could come down to 35GB, but I wouldn't expect that to matter.

Generally the problem is overcommit related. For many scientific and large data purposes you want vm.overcommit_memory = 1, but sysctl shows this is already the case on the test servers.
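
For completeness, the check mentioned above is just:

sysctl vm.overcommit_memory
# prints "vm.overcommit_memory = 1" on the test servers, per the comment above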

The top rss'es are 40GB for osm2pgsql, 8.8GB for java, then 6.5-8GB for 30 postgres processes. Because some of that is shared, you can't just add up all of the postgres processes.

Normally on a 64GB machine I run with 35000 cache, .5GB (est) other osm2pgsql, estimate the base system stuff as 1GB, and have no Cassandra. This gives 28GB for postgres. Here we have 40.4GB for osm2pgsql (cache + other), 8.8GB for java, and call it a bit over 1GB of base system stuff, giving 13.8GB for Postgres.

We can account for how this memory is used, and what's free for index creation. I run with 4GB shared_buffers, and the test cluster is set to have 7.5GB. This leaves me 24GB for index building, and the test cluster 6.3GB.

In this phase of the import, osm2pgsql is running CREATE TABLE .. AS SELECT ... FROM ... ORDER BY ...; CREATE INDEX ... ON ... (osm_id) ...; CREATE INDEX ... ON ... gist (way); ANALYZE ..., statements on four rendering tables, and building indexes on way and relation slim tables. The relevant memory settings are 192MB work_mem, 4GB maintenance_work_mem, and 1GB autovacuum_work_mem.
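
A quick, read-only way to confirm the memory settings quoted above on a given host (assuming the gis database and local access as a suitable user):

psql -d gis -c "SELECT name, setting, unit FROM pg_settings
                WHERE name IN ('shared_buffers', 'work_mem',
                               'maintenance_work_mem', 'autovacuum_work_mem');"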

The maximum memory usage here would be 6*maintenance_work_mem, or 24GB. In practice, not all of the indexes are that big, and not all of the others hit peak memory usage at the same time. In particular, the points and roads tables are completely finished by the time the other rendering tables get into their index builds, giving a peak of probably 12GB, more than the estimated available 6GB.

Given these numbers, I can conclude that the OOM killer got called because there wasn't enough memory for everything. This conclusion is not shocking ;)

We should free up about 8-10GB. In order of ease + importance I would recommend

  1. Reducing shared_buffers to 4GB. This should be plenty for a 64GB system with multiple processes on it. (frees 3.5GB)
  2. Turning off Cassandra. Even if we decide we need to be able to import with Cassandra running, this will help performance, since any extra freed memory will be used for caching. (frees 8.8GB; rough sketches of this and point 5 follow this list)
  3. Reducing maintenance_work_mem to 3GB. This will slow down this phase of the import. ("frees" 3GB)
  4. Reducing osm2pgsql cache to 30000. This will slow the import, forcing it to read node positions from disk. Reducing it to 35000 won't free any memory, just reduce the difference between vmstat and resident set. (frees ~4GB)
  5. Add a swapfile. If osm2pgsql or postgres goes to swap, it would be unusably slow, but I think Cassandra could go to swap without any real problems.
  6. Add more RAM to any new servers we get. With DDR4, larger DIMMs, etc., 128GB is the new 64GB. OSM data sizes are increasing, but more slowly than Moore's law and RAM sizes. (adds 64GB)
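
Rough sketches of points 2 and 5 (the service unit name, swapfile size and location are assumptions, not checked against the actual puppetization of these hosts):

# 2. stop Cassandra for the duration of the import
sudo systemctl stop cassandra

# 5. add an 8GB swapfile
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile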

Change 369660 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] maps - tune postgresql for maps-test cluster

https://gerrit.wikimedia.org/r/369660

Postgresql configuration updated:

  • reduce shared buffers to 4GB (save 3.5GB)
  • reduce maintenance work mem to 3GB (save 1GB)

Documentation updated to suggest shutting down cassandra during data import (save ~8GB).

So overall we should have 12GB more memory available, let's hope this is enough.

To see the new config: https://puppet-compiler.wmflabs.org/compiler02/7268/

@Pnorman: waiting for your review to merge and try again...

Approved in gerrit. Your math here is slightly off - reducing the work mem to 3GB should save 2-4 GB because there's multiple index builds in parallel.

Change 369660 merged by Gehel:
[operations/puppet@production] maps - tune postgresql for maps-test cluster

https://gerrit.wikimedia.org/r/369660

Initial osm2pgsql import seems to have completed, but now I have errors on import_waterlines (/usr/local/bin/import_waterlines). The error message in the postgresql logs is quite cryptic to me (see below).

The geometry used seems correct to me (shp2pgsql ${SHP2PGSQL_OPERATION} -s 3857 -g way water-polygons-split-3857/water_polygons.shp water_polygons | sed "${TRANSFORMATION}" | psql gis)

2017-08-07 13:09:17 GMT [17598]: [25534-1] user=osmupdater,db=gis,app=psql,client=[local] STATEMENT:  INSERT INTO "water_polygons" ("fid",way) VALUES ('12762','0106000020110F000001000000010300000015000000DB0100000C2AE6B2BB3260C18A4F248132EA6141
[...]
17650671AE9F961410228193E083060C1B0662EC3E0F96141E5B10708003060C183084E15E1F96141');
2017-08-07 13:09:17 GMT [17598]: [25535-1] user=osmupdater,db=gis,app=psql,client=[local] ERROR:  current transaction is aborted, commands ignored until end of transaction block

ERROR: current transaction is aborted, commands ignored until end of transaction block

There should be an error earlier in the logs where a statement failed.
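
One way to find that earlier failure, assuming the lines quoted above come from a postgres log file we can grep (the path is not checked here, hence the placeholder):

# <postgres log file> = wherever the statements above were logged
grep 'ERROR:' <postgres log file> | grep -v 'current transaction is aborted' | head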

water_polygons was already created, I removed it and restarted the script from that point.

OSM import is complete, time to re-generate the tiles. @Pnorman should do this (with as much help from myself as needed) to ensure that at least two of us understand this process. Documentation is on the wiki (starting at "Initial creation of cassandra keyspace"). The rough outline is:

  1. create cassandra key space
  2. export tiles list from an existing server
  3. generate those tiles on the maps-test cluster

@Pnorman ping me to schedule some time to do this.

I've created the keyspace and updated the docs slightly, but exporting the tile list is failing:

$ node /srv/deployment/tilerator/deploy/src/scripts/tileshell.js --config /etc/tileratorui/config.yaml -j.generatorId v5 -j.zoom 14 --dumptiles /home/pnorman/all-tiles-14.txt --dumprawidx --dumpoverride
info Loading variables from the config file
info Loading sources from /srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/src/sources.prod2.yaml
trace Updating layer landuse
trace Updating layer waterway
trace Updating layer water
trace Updating layer aeroway
trace Updating layer building
trace Updating layer road
trace Updating layer admin
trace Updating layer country_label
trace Updating layer place_label
trace Updating layer poi_label
trace Updating layer road_label
{ Err: Source ID v5 is not defined
    at setSources (/srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/src/lib/common.js:90:19)
    at Object.paramsToJob (/srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/src/lib/common.js:63:5)
    at /srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/src/scripts/tileshell.js:141:26
    at tryCatcher (/srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/node_modules/bluebird/js/release/util.js:16:23)
    at Promise._settlePromiseFromHandler (/srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/node_modules/bluebird/js/release/promise.js:512:31)
    at Promise._settlePromise (/srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/node_modules/bluebird/js/release/promise.js:569:18)
    at Promise._settlePromise0 (/srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/node_modules/bluebird/js/release/promise.js:614:10)
    at Promise._settlePromises (/srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/node_modules/bluebird/js/release/promise.js:693:18)
    at Async._drainQueue (/srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/node_modules/bluebird/js/release/async.js:133:16)
    at Async._drainQueues (/srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/node_modules/bluebird/js/release/async.js:143:10)
    at Immediate.Async.drainQueues (/srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/node_modules/bluebird/js/release/async.js:17:14)
    at runCallback (timers.js:637:20)
    at tryOnImmediate (timers.js:610:5)
    at processImmediate [as _immediateCallback] (timers.js:582:5) name: 'Err', message: 'Source ID v5 is not defined' }
Err: Source ID v5 is not defined
    at setSources (/srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/src/lib/common.js:90:19)
    at Object.paramsToJob (/srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/src/lib/common.js:63:5)
    at /srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/src/scripts/tileshell.js:141:26
    at tryCatcher (/srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/node_modules/bluebird/js/release/util.js:16:23)
    at Promise._settlePromiseFromHandler (/srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/node_modules/bluebird/js/release/promise.js:512:31)
    at Promise._settlePromise (/srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/node_modules/bluebird/js/release/promise.js:569:18)
    at Promise._settlePromise0 (/srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/node_modules/bluebird/js/release/promise.js:614:10)
    at Promise._settlePromises (/srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/node_modules/bluebird/js/release/promise.js:693:18)
    at Async._drainQueue (/srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/node_modules/bluebird/js/release/async.js:133:16)
    at Async._drainQueues (/srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/node_modules/bluebird/js/release/async.js:143:10)
    at Immediate.Async.drainQueues (/srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/node_modules/bluebird/js/release/async.js:17:14)
    at runCallback (timers.js:637:20)
    at tryOnImmediate (timers.js:610:5)
    at processImmediate [as _immediateCallback] (timers.js:582:5)

Looking at /srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/src/sources.prod2.yaml I can see there is no reference to v5, just to v3.

We have a mismatch between prod and test in cassandra namespaces. Because test is currently empty, @Gehel is going to update it to be v3 like production

Configuration is already aligned, but the procedure to create Cassandra keyspace is not precise enough. I'll update the documentation.

Keyspace v5 has been dropped; we can restart the tile generation procedure.
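
For the record, dropping a stray keyspace is a one-liner; a sketch assuming the keyspace is literally named v5, as mentioned above, with host and authentication options omitted:

cqlsh -e "DROP KEYSPACE IF EXISTS v5;"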

Updates from today's progress

  • Cassandra keyspace creation worked, and the tile list was exported from prod.
  • Running tileshell to create tiles fails with:
$ node /srv/deployment/tilerator/deploy/src/scripts/tileshell.js --config /etc/tileratorui/config.yaml -j.filepath /home/pnorman/all-tiles-14.txt  -j.fileZoomOverride 14 -j.fromZoom 10 -j.beforeZoom 16 -j.generatorId gen -j.storageId v3 -j.deleteEmpty
info Loading variables from the config file
info Loading sources from /srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/src/sources.prod2.yaml
trace Updating layer landuse
trace Updating layer waterway
trace Updating layer water
trace Updating layer aeroway
trace Updating layer building
trace Updating layer road
trace Updating layer admin
trace Updating layer country_label
trace Updating layer place_label
trace Updating layer poi_label
trace Updating layer road_label
info Adding jobs from /home/pnorman/all-tiles-14.txt

events.js:160
      throw er; // Unhandled 'error' event
      ^
ReplyError: Ready check failed: NOAUTH Authentication required.
    at parseError (/srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/node_modules/redis-parser/lib/parser.js:193:12)
    at parseType (/srv/deployment/tilerator/deploy-cache/revs/001811e1a3eb21cf9246c5425f27f001b91efd27/node_modules/redis-parser/lib/parser.js:303:14)

I checked that redis is running and that connecting with redis-cli to the redis:// URL from /etc/tileratorui/config.yaml works and accepts the correct password.
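
For reference, the kind of check described above (host and password are placeholders; in practice they come from /etc/tileratorui/config.yaml):

redis-cli -h <redis host> -p 6379 -a '<password from config.yaml>' ping
# expected reply: PONG; without -a it fails with the same NOAUTH error as in the trace above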

Additional investigation steps:

  • Connecting to the "kue" console (http://<host>:6535/raw/jobs), which shows the state of the redis queues, does not raise any error and does not log anything strange. So it looks like the redis connection is working from tileratorui, but not from tileshell. This tends to indicate that we probably have a different code path in terms of redis connection.
  • It looks like sources are not loaded due to some postgresql authentication error in genall (not investigated yet, probably entirely unrelated).

By adding print statements left and right, it looks like tileshell is not honoring the redis parameters configured in the tilerator config file, but is connecting to the default localhost:6379. Now we just need to find where this config is missing.

Another option would be to disable redis authentication for the initial import, since in standard operation this does not seem to be an issue.

We'll go ahead and disable the redis auth at this time, to get this done for now.
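
A minimal sketch of what temporarily disabling redis auth can look like on the running instance (this does not touch the puppet-managed config, and assumes the CONFIG command has not been renamed or disabled):

redis-cli -h <redis host> -a '<current password>' config set requirepass ""
# and to re-enable it afterwards:
redis-cli -h <redis host> config set requirepass '<password>'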

started generation on Tue Sep 5 21:17:36 UTC 2017

Started generation of all tiles with zoom < 13