investigate tilerator crash on maps eqiad
Open, HighPublic

Description

Sept 11, 12:22 UTC, icinga reported tilerator not responding on all maps eqiad servers. Problem was solved by restarting tilerator on all nodes, this was not sufficient. A few requests seemed to be stuck on postgresql (I did not copy the list, I should have). Restarting postgresql on maps1001 (master) finally solved the issue.

The postgresql logs show processes waiting for lock (see below for extract).

2018-09-11 12:29:03 GMT [17900]: [3-1] user=tilerator,db=gis,app=[unknown],client=::1 LOG:  process 17900 still waiting for AccessShareLock on relation 5134061 of database 16385 after 1000.061 ms at character 352
2018-09-11 12:29:03 GMT [17900]: [4-1] user=tilerator,db=gis,app=[unknown],client=::1 DETAIL:  Process holding the lock: 9576. Wait queue: 29765, 26796, 29953, 22070, 13992, 13998, 14015, 14085, 14423, 14424, 14451, 14518, 17900.
2018-09-11 12:29:03 GMT [17900]: [5-1] user=tilerator,db=gis,app=[unknown],client=::1 STATEMENT:  SELECT * FROM (
        SELECT
            osm_id, way,
            admin_level::SMALLINT,
            (tags->'maritime')::INT AS maritime,
            CASE
              WHEN
                tags->'disputed' = 'yes'
                OR tags->'dispute' = 'yes'
                OR (tags->'disputed_by') IS NOT NULL
                OR tags->'status' = 'partially_recognized_state'
              THEN 1
              ELSE 0
            END AS disputed
          FROM admin
          WHERE
            COALESCE(tags->'maritime', '') <> '1'
            AND (
              ( admin_level = '2' AND z(3.40282e+38) >= 2 )
              OR ( admin_level = '4' AND z(3.40282e+38) >= 3 )
            )
            AND COALESCE(tags->'left:country', '') <> 'Demarcation Zone'
            AND COALESCE(tags->'right:country', '') <> 'Demarcation Zone'
            AND way && ST_SetSRID('BOX3D(-3.402823466385289e+38 -3.402823466385289e+38,3.402823466385289e+38 3.402823466385289e+38)'::box3d, 900913)
        ) admin_ LIMIT 0
Gehel created this task.Sep 11 2018, 12:51 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 11 2018, 12:51 PM
Gehel triaged this task as High priority.Sep 11 2018, 12:57 PM
MSantos added a comment.EditedSep 11 2018, 7:58 PM

Here is what I learned so far:

  1. Tilerator stop responds every Tuesday at noon. The worker dies and tries to restart automatically.
  2. There is a cron task that runs the populate_admin() function at this frequency.
  3. On Sep 11, populate_admin.log shows:
FATAL:  terminating connection due to administrator command
CONTEXT:  PL/pgSQL function water_under(geometry) line 7 at RETURN
PL/pgSQL function populate_admin() line 15 at assignment
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
connection to server was lost

To be continued.

EDIT: item 3 is unrelated and happened because PostgreSQL was forced to restarted

Gehel added a comment.Sep 12 2018, 8:44 AM

Oh, I completely forgot about populate_admin()! This might have been generating lock contention. Not exactly sure what we can improve on that side. If there isn't anything we can improve in populate_admin(), we might need to ensure that tilerator is more robust to contention (also not sure how to do that). It is an async process, so we don't really care if it lags a bit at some point, we just don't want it to crash and alert.

To be more specific, when the populate_admin() script runs, tilerator throws the following logs:

message: worker stopped sending heartbeats, killing.
message: worker died, restarting
message: worker died during startup, continue startup

It looks like it goes with these messages until populate_admin() finishes and then get back to normal.

I tried to reproduce it locally but I think that because my OSM dataset is small the issue doesn't happen.

Tilerator is receiving an AccessShareLock error from postgresql while populate_admin script is running.

2018-09-25 12:05:02 GMT [32508]: [3-1] user=tilerator,db=gis,app=[unknown],client=::1 LOG:  process 32508 still waiting for AccessShareLock on relation 5134061 of database 16385 after 1000.072 ms at character 422

The relation that postgresql locks is:

relation_name | schema_name
│---------------+-------------
│ admin         | public
│(1 row)

And will happen on every attempt to generate tile when trying to generate the admin layer.
The script takes less than 1 hour to run and I wrote a patch run it 1 hour before planet_sync (which notify tilerator and start generating tiles).

Change 462712 had a related patch set uploaded (by MSantos; owner: MSantos):
[operations/puppet@production] Chaging populate admin cron

https://gerrit.wikimedia.org/r/462712

Change 462712 merged by Gehel:
[operations/puppet@production] Chaging populate admin cron

https://gerrit.wikimedia.org/r/462712

The temporary solution seems not enough, probably because of T205735: Investigate jobs taking too long to complete in maps1001.eqiad, when the populate_admin cron starts there are tiles still being regenerate.

There is a PR that changes the populate_admin function to avoid resource locking.

Tilerator should be resilient to attempting to access a locked DB resource. I'd rather see us handle this there than to add a workaround in populate_admin(). Are you able to get a stack trace from the Tilerator crash?

Tilerator should be resilient to attempting to access a locked DB resource. I'd rather see us handle this there than to add a workaround in populate_admin(). Are you able to get a stack trace from the Tilerator crash?

Is not straightforward to have a stack trace because there is no "code failure". But, what I could track is that the service-runner jobs timeout (which is a puppet configuration) waiting for the locked resource. The main problem is that the library that gets stuck is mapnik (through tilelive-tmsource), and then service-runner kills the job (and that is the crash). The steps are:

  1. Tilerator creates a promise that instantiates tilelive and registers the sources from the configuration.
  2. Each job created needs to specify a source that will tell tilelive which module it needs to call, in this case, tilelive-tmsource.
  3. tilelive-tmsource calls mapnik with the layer query defined by osm-bright.tm2
  4. and then mapnik connects to Postgres to create vector-tiles.
  5. If the promise is unresponsive for a certain period of time, service-runner kills the job.

I couldn't find much reading about lock contention and mapnik that is why I followed the patterns of dealing with locks on the Database layer (I also think is better for performance).

If we're comfortable letting tile generation fail during the period populate_admin() runs (I'm not sure whether this is the case), we could solve this by adding a timeout to the promise for Postgres data that's reasonably long for a DB read but less than the timeout enforced by service-runner.

http://bluebirdjs.com/docs/api/timeout.html

IIRC the job processor retries a certain number of times before failing a tile generation request for good, so maybe we could play with that.

If we're comfortable letting tile generation fail during the period populate_admin() runs (I'm not sure whether this is the case), we could solve this by adding a timeout to the promise for Postgres data that's reasonably long for a DB read but less than the timeout enforced by service-runner.

http://bluebirdjs.com/docs/api/timeout.html

IIRC the job processor retries a certain number of times before failing a tile generation request for good, so maybe we could play with that.

That's a more robust way to deal with the problem and could solve for potential problems not related to populate_admin. I will take a look into that, thanks.

Mentioned in SAL (#wikimedia-operations) [2018-12-04T08:35:59Z] <gehel> restarting stuck tilerator on maps* - T204047