
Maps tile pregeneration is throwing errors
Closed, Resolved, Public

Description

From logstash: https://logstash.wikimedia.org/goto/d9f419369ea7764474513ec0f07e7b5a

After a bit of debugging here are my current findings:

  • Maps sync is enabled on maps1009
  • Kartotherian on eqiad is receiving traffic
  • There is a known issue with geoshapes and connection pooling; the fix is here: https://gerrit.wikimedia.org/r/c/mediawiki/services/kartotherian/+/761880 (not yet deployed)
  • Geoshapes queries are causing transaction failures on maps master
  • The transaction rollbacks cause issues on maps replicas making pregeneration queries fail

Event Timeline

hnowlan subscribed.

maps1009 was repooled by accident during a roll-restart of all maps services for security updates: the restart pooled maps1009 once the updates completed, without regard to its previous pooled status. The mitigation is to write a cookbook that handles these roll-restarts properly.
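A minimal sketch of the behaviour such a cookbook would need: record each host's pooled state before the restart and restore exactly that state afterwards. The names `roll_restart`, `get_pooled`, `set_pooled` and `restart` are hypothetical placeholders for illustration, not the actual cookbook or conftool API.

```python
from typing import Callable, Iterable


def roll_restart(
    hosts: Iterable[str],
    get_pooled: Callable[[str], bool],
    set_pooled: Callable[[str, bool], None],
    restart: Callable[[str], None],
) -> None:
    """Restart hosts one at a time, preserving each host's pooled state.

    The three callables are hypothetical hooks standing in for whatever
    the real tooling provides.
    """
    for host in hosts:
        was_pooled = get_pooled(host)  # remember the state before touching the host
        if was_pooled:
            set_pooled(host, False)  # drain traffic before restarting
        # If the restart raises, the exception propagates and the host
        # is left depooled for an operator to inspect.
        restart(host)
        if was_pooled:
            set_pooled(host, True)  # repool only hosts that were pooled before
```

With this shape, a host that was deliberately depooled (as maps1009 was) stays depooled after the restart.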

After taking a look at the maps1009 postgres logs, I think the first occurrence of the error is:

2022-02-01 11:31:09 GMT [9148]: [3-1] user=kartotherian,db=gis,app=[unknown],client=127.0.0.1 ERROR:  lwgeom_unaryunion_prec: GEOS Error: InterruptedException: Interrupted
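For reference, a rough sketch of how the earliest such error could be located in a log file. The regular expression is inferred from the single log line quoted above and is an assumption; a different `log_line_prefix` setting would need a different pattern. The function name `first_error` is made up for this example.

```python
import re
from datetime import datetime
from typing import Iterable, Optional, Tuple

# Pattern inferred from the one log line quoted above (assumption).
LOG_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) GMT "
    r"\[(?P<pid>\d+)\]: \[[^\]]*\] "
    r"user=(?P<user>[^,]*),db=(?P<db>[^,]*),\S* "
    r"ERROR:\s+(?P<msg>.*)$"
)


def first_error(
    lines: Iterable[str], needle: str = "GEOS Error"
) -> Optional[Tuple[datetime, str]]:
    """Return (timestamp, message) of the earliest ERROR line containing needle."""
    earliest: Optional[Tuple[datetime, str]] = None
    for line in lines:
        match = LOG_RE.match(line.strip())
        if match is None or needle not in match.group("msg"):
            continue  # not an ERROR line in the expected format, or unrelated
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
        if earliest is None or ts < earliest[0]:
            earliest = (ts, match.group("msg"))
    return earliest
```

Lines that do not match the assumed prefix are skipped silently, so the result is only as complete as the pattern.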

After checking the rollback metrics for maps1009, I can verify that from 2022-02-01 until 2022-02-25 we didn't have any rollbacks, which correlates with the days we had kartotherian (and thus geoshapes) depooled.

Are geoshapes queries causing failures of write transactions on the masters, or failures in the replication stream from master to slave?

Have we considered splitting read/write requests with some sort of pooling, so that these expensive geoshape queries, which apparently crash postgres at times, only connect to slaves? Geoshapes doesn't do any writes, if I'm not mistaken.
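To make the suggestion concrete, here is a minimal sketch of read/write splitting at the application level. It is only an illustration of the idea, not how kartotherian is set up; the `Router` class and its parameters are hypothetical, and the caller declares whether a request is read-only rather than the router inspecting SQL.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Iterator, Sequence


@dataclass
class Router:
    """Pick a database host based on whether the request is read-only."""

    primary: str
    replicas: Sequence[str]

    def __post_init__(self) -> None:
        # Round-robin over replicas; empty iterator when there are none.
        self._next_replica: Iterator[str] = (
            cycle(self.replicas) if self.replicas else iter(())
        )

    def target(self, read_only: bool) -> str:
        """Return a replica for read-only work, otherwise the primary."""
        if read_only and self.replicas:
            return next(self._next_replica)
        return self.primary
```

Under this scheme, read-only geoshapes queries would call `target(read_only=True)` and never reach the primary unless no replica is configured.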

The reason we had kartotherian depooled on maps masters was exactly that: we didn't want expensive queries to interfere with OSM import/syncing. The problem is that we have a lot of moving parts, and the priority is to finish the tile pregeneration so that the high-level issue gets resolved. That's why we relied on maps1009 kartotherian staying depooled.
In the future we are planning to:

  • Reduce the connection pool size for geoshapes (fix already merged on kartotherian codebase, needs deployment)
  • Disable anything non OSM import related on masters
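As an illustration of why a smaller pool limits the damage, here is a minimal sketch of a bounded connection pool. It is not kartotherian's implementation (which is a Node.js service); `BoundedPool` and `connect` are hypothetical names, and the connection objects are opaque placeholders.

```python
import queue
from contextlib import contextmanager
from typing import Callable, Generic, Iterator, TypeVar

T = TypeVar("T")


class BoundedPool(Generic[T]):
    """Hand out at most `size` connections at any one time."""

    def __init__(self, connect: Callable[[], T], size: int = 1) -> None:
        if size < 1:
            raise ValueError("size must be at least 1")
        self._idle: "queue.Queue[T]" = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())  # open all connections up front

    @contextmanager
    def connection(self, timeout: float = 5.0) -> Iterator[T]:
        # Blocks until a connection is free; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always return the connection to the pool
```

With `size=1`, a second expensive query in the same process has to wait for the first to finish instead of opening another connection to PostgreSQL.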

Regarding connecting geoshapes on a master to a read replica: this is also an option, but the kartotherian server setup is generally very per-node oriented (in the past every node ran one instance of each piece of software we use), so this would introduce a more complicated setup.

After syncing with the team here are the next steps:

  • Continue with tile pregeneration (z14-15 left)
  • Assume that OSM data integrity on PostGIS was not heavily affected (we had the same issue affecting our data for years, so a few weeks' worth of transactions is probably not that big of a deal)
  • Ensure that kartotherian is depooled on masters (or even completely removed)
  • After tiles are generated
    • Deploy last version of kartotherian
    • Failover codfw tegola connections to eqiad

Then we can run a proper OSM planet import from scratch on codfw after all the learnings/fixes on eqiad

Change 764353 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] maps: disable kartotherian on maps masters

https://gerrit.wikimedia.org/r/764353

Change 764353 merged by Hnowlan:

[operations/puppet@production] maps: disable kartotherian on maps masters

https://gerrit.wikimedia.org/r/764353

Change 767066 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[maps/kartotherian/deploy@master] Fix pool size configuration

https://gerrit.wikimedia.org/r/767066

Change 767066 merged by jenkins-bot:

[maps/kartotherian/deploy@master] Fix pool size configuration

https://gerrit.wikimedia.org/r/767066

Jgiannelos claimed this task.

I am gonna resolve this one since:

  • Geoshapes is limited to 1 PostgreSQL connection per Node process
  • Maps masters don't receive traffic
  • Pregeneration for z14-15 has been triggered