Tasmania is covered with water at z10+
Closed, Resolved (Public)

Description

https://maps.wikimedia.org/#9/-43.2012/146.5082 -- try zooming in one more level

Event Timeline

My initial guess is something coastline related. I can't see any current problems with the coastline in Tasmania, but because the islands in the SE are rendering fine within the same tile as the missing land in the main body of Tasmania, it looks like a coastline problem.

If it's not a problem in the coastline shapefiles, then my next guess would be something related to how we're processing coastline updates. I don't know where that code is.

I don't see anything odd in the openstreetmapdata.com split land polygons so it's looking more like something specific to us

MaxSem raised the priority of this task from High to Unbreak Now!.Mar 6 2017, 10:16 AM
MaxSem added a project: Maps-Sprint.

Mentioned in SAL (#wikimedia-operations) [2017-03-06T20:34:45Z] <gehel> reimport waterlines data on maps1001.eqiad.wmnet - T159631

MaxSem lowered the priority of this task from Unbreak Now! to High.Mar 6 2017, 10:50 PM

Okay, so the problem is limited to Tasmania and was caused by a user mistake. Someone tagged the whole of Tasmania with natural=bay. Paul fixed it upstream with https://www.openstreetmap.org/changeset/46635137 and I updated the prod DB with UPDATE planet_osm_polygon SET "natural" = NULL WHERE osm_id = -40976591.

Tiles in the affected area are being regenerated right now, HTTP caches will expire in 24 hours. I'm gonna keep this ticket open until updates are clearly visible in production.

MaxSem moved this task from In progress to Needs review on the Maps-Sprint board.
MaxSem added a project: Wikimedia-Incident.

Summary

The map of Tasmania was showing water instead of land, with some minor islands near it still displaying. The issue was fixed in OpenStreetMap with http://www.openstreetmap.org/changeset/46635137, with a temporary fix manually applied to the database. Newly rendered tiles will be correct, but old tiles will persist in the Varnish caches for up to 24h.

Cause

A natural=bay tag had ended up on the Tasmania relation in OSM, matching the osm-bright.tm2source rules for water. This caused Tasmania to be rendered as water.

Fix

Changes were made to the OpenStreetMap data with http://www.openstreetmap.org/changeset/46635137 to fix the problem. Normally it would take at least a day for the change to reach our servers, so the database was manually updated in PostgreSQL.

Post-mortem

The initial diagnosis was that the problem was related to coastlines and water polygons, based on an island appearing as water while nearby islands rendered normally. Although this turned out not to be the cause, it is still what I would conclude to be the most likely cause given the initial evidence.

Checking the latest coastline data from openstreetmapdata.com, the data appeared normal, and no recent suspect changes to OSM data were found. Additionally, no other OSM-based maps on other sites were showing problems. This led to a suspicion about how we were processing coastline data.

Stepping through the cron jobs and scripts for coastline updates, a few problems were found:

  • The cron job was not running successfully, due to a permissions error
  • There was no logging, making it impossible to see when it had run in the past
  • There was no reporting of errors
  • There was no monitoring of the difference in age between data on openstreetmapdata.com and what was loaded into the DB

This complicated the investigation, and it was decided not to update the data until the problem had been fixed. A query and QGIS were used to independently verify that the coastline data loaded into the database was correct.

An additional query was used to check for water in the osm2pgsql tables, looking for natural=water or waterway=riverbank, on the suspicion that there was a multipolygon problem which had turned Tasmania into a water polygon.

Time was spent figuring out how to view tiles which were rendered directly from the database, bypassing the vector tile store in Cassandra. It took some time to be confident we were looking at freshly rendered data, still showing the problem.

Various attempts were made to view the vector data, either from the DB or from Cassandra, but this wasn't easy. A JSON-serialized version of the PBF vector tiles was available, but this is not a common format. PBF, GeoJSON, or TopoJSON tiles are more common.

Suspicion turned to the overzoom feature of Kartotherian, which starts at z10. This feature has special handling for tiles which are "solid" and filled with a single square of land or water covering the tile.

During investigation of this, it was noticed that the water had an osm_id of -46635137 in the vector tiles, corresponding to relation 4097659. A quick inspection of the OSM data showed this had an inappropriate natural=bay tag. This matches the osm-bright.tm2source criteria of

"natural" IN ('water', 'bay')
OR waterway IS NOT NULL
OR landuse = 'reservoir'
OR landuse = 'pond'

for water. Most stylesheets use `"natural" = 'water' OR waterway = 'riverbank' OR landuse = 'reservoir'` which is why other styles hadn't observed a problem, no one in the OSM community had noticed the problem, and the query run earlier had not shown any water.
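
The difference between the two filters can be illustrated with a small sketch. The Python predicates below are a hand translation of the two SQL conditions quoted above, applied to a hypothetical tag dictionary. They are not code from either stylesheet, and the example tags are assumed for illustration.

```python
def is_water_osm_bright(tags):
    """Mirror of the osm-bright.tm2source condition quoted above."""
    return (
        tags.get("natural") in ("water", "bay")
        or tags.get("waterway") is not None
        or tags.get("landuse") in ("reservoir", "pond")
    )


def is_water_common(tags):
    """Mirror of the narrower condition used by most stylesheets."""
    return (
        tags.get("natural") == "water"
        or tags.get("waterway") == "riverbank"
        or tags.get("landuse") == "reservoir"
    )


# Hypothetical tags resembling the mistagged relation.
mistagged = {"type": "multipolygon", "place": "island", "natural": "bay"}

print(is_water_osm_bright(mistagged))  # True
print(is_water_common(mistagged))      # False
```

Only the broader filter treats the natural=bay polygon as water, which is consistent with the problem being visible only in this style.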

After identifying the problem it took a couple of minutes to fix it in OSM, but the WMF servers only update daily, and use the daily diffs. To fix it sooner the data was fixed in the database with an UPDATE statement after first testing on the test cluster.

After updating, the vector tiles needed regeneration. This took additional time because the tilerator UI had to be figured out first.

Unknowns

  • Why this was reported 3 weeks after the change was made to the MP and not sooner

Ideal diagnosis workflow

An ideal diagnosis workflow would be

  • The problem is observed
  • A publicly usable x-ray view is used to see what data is in the tiles in an otherwise empty area of Tasmania, and the osm_id of the offending feature is noted. Checking this ID right away reveals the unusual tagging, which is fixed, or at least points to a suspected cause
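
The step from an osm_id to the OSM object could be sketched as below. This assumes the osm2pgsql convention seen in this incident, where a polygon built from a relation carries the negated relation ID and a polygon built from a way keeps the way ID. The function name and the example IDs are illustrative only.

```python
def osm_url_for_polygon(osm_id):
    """Map an osm_id from planet_osm_polygon to an openstreetmap.org URL.

    Assumes polygons built from relations get a negated relation ID and
    polygons built from ways keep the (positive) way ID.
    """
    kind = "relation" if osm_id < 0 else "way"
    return f"https://www.openstreetmap.org/{kind}/{abs(osm_id)}"


# Illustrative IDs only, not the ones from this incident.
print(osm_url_for_polygon(-12345))  # https://www.openstreetmap.org/relation/12345
print(osm_url_for_polygon(67890))   # https://www.openstreetmap.org/way/67890
```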

Corrective actions

Documentation

  • How to view tiles with no caching needs to be documented better, or made obvious through an administrative UI. It's not clear how to use the tilerator UI. Viewing without caching is essential in debugging most rendering problems, as otherwise you get misled by old tiles. T160013: Document how to view maps without any caching

Code

  • Hourly OSM diffs should be consumed, even if the data is only updated once a day. T137939: Increase frequency of OSM replication. This
    • Allows an update to be manually run if there is an urgent need, like in this case
    • Means the data will be a maximum of 1 hour old after running the update. Right now the update could run 23h after the last daily diff was created, resulting in 47h old data
    • Should not impact resource usage, because osmosis will merge the hourly diffs with --simc, and even without --simc duplication in rerendering is still avoided.
  • Vector tile PBFs should be publicly accessible. This is something we already produce, but isn't visible to the public
  • Vector tile GeoJSONs should be publicly accessible. This format is (mostly) human readable.
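
The staleness figures in the replication item above can be reproduced with a short calculation. This is one reading of the arithmetic, under the assumption that an edit made just after a diff is cut waits a full diff interval before it is published, plus the delay until the update runs. The function name is hypothetical.

```python
def worst_case_age_hours(diff_interval_h, delay_after_diff_h):
    """Worst-case age of an edit at the moment the update applies it.

    An edit made right after a diff is cut waits a full interval for the
    next diff, plus however long the update runs after that diff appears.
    """
    return diff_interval_h + delay_after_diff_h


# Daily diffs, update running 23h after the diff was created.
print(worst_case_age_hours(24, 23))  # 47
# Hourly diffs, update run immediately after the latest diff.
print(worst_case_age_hours(1, 0))    # 1
```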

New vector tile schema

Some of the lessons from this can be applied to the new vector tile work

  • The tradeoffs of merging polygons vs keeping osm_ids around should be looked at

I just want to say loud thanks to @MaxSem and @Pnorman for the thorough investigation, the fast identification and resolution of the problem (despite HTTP caches), and kudos to @Pnorman for writing such a detailed incident report afterwards, and for listing so many corrective actions. I'm proud of working with such talented, engaged people.

MaxSem renamed this task from Broken tiles at z10+ to Tasmania is covered with water at z10+.Mar 7 2017, 2:32 AM
  • There should be a debugging "x-ray" view that visualizes the vector tiles, similar to Kosmtik.

You're supposed to be able to do this by loading the style in Mapbox Studio Classic. It by default points at our vectors:

name: OSM Bright 2
source: "https://maps.wikimedia.org/osm-pbf/info.json"

However, it doesn't work, apparently because this TileJSON is missing the field descriptions, which look like this for the gen datasource in the Tilerator UI:

"vector_layers": [
    {
        "id": "landuse",
        "description": "",
        "fields": {
            "osm_id": "Number",
            "class": "String",
            "z_order": "Number",
            "way_area": "Number"
        }
    },
  . . .
]
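
A quick way to check whether a TileJSON document carries this metadata might look like the sketch below. It works on an already parsed dictionary, so fetching info.json is left out. The function name and the sample inputs are assumptions for illustration, not part of Tilerator or Kartotherian.

```python
import json


def missing_vector_layer_fields(tilejson):
    """Return a list of problems with the vector_layers metadata.

    An empty list means every layer has an id and a fields mapping.
    """
    layers = tilejson.get("vector_layers")
    if not layers:
        return ["vector_layers is missing or empty"]
    problems = []
    for i, layer in enumerate(layers):
        if "id" not in layer:
            problems.append(f"layer {i} has no id")
        if not isinstance(layer.get("fields"), dict):
            problems.append(f"layer {layer.get('id', i)} has no fields mapping")
    return problems


# Sample with the metadata present, modelled on the fragment above.
good = json.loads(
    '{"vector_layers": [{"id": "landuse", "description": "",'
    ' "fields": {"osm_id": "Number", "class": "String"}}]}'
)
# Hypothetical TileJSON without any vector_layers entry.
bad = {"name": "osm-pbf"}

print(missing_vector_layer_fields(good))  # []
print(missing_vector_layer_fields(bad))   # ['vector_layers is missing or empty']
```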
  • Vector tile PBFs should be publicly accessible. This is something we already produce, but isn't visible to the public

We already have these, however as of now we haven't spent even a second on making sure these are ready for public consumption.

  • Vector tile GeoJSONs should be publicly accessible. This format is (mostly) human readable.

Our current JSON format (a dump of PBF content) is fine for these types of diagnostics. That's what I used to figure out which relation is breaking stuff this time.

@MaxSem that extra blob of json can be added to the sources.prod.yaml file - it supports metadata injection.

Change 341566 merged by Gehel:
[operations/puppet] osm - waterline import script fix and adding logging

https://gerrit.wikimedia.org/r/341566

Change 341770 merged by Gehel:
[operations/puppet] maps - fix logrotate template

https://gerrit.wikimedia.org/r/341770

Mentioned in SAL (#wikimedia-operations) [2017-03-08T09:51:30Z] <gehel> re-enabled waterline import on maps[12]001 - T159631

Mentioned in SAL (#wikimedia-operations) [2017-03-09T13:43:23Z] <gehel> invalidating Tasmania zoom level 10 tiles in varnish - T159631

We still have bad tiles in cache. For example: this one. It has an "Age: 19788" header (~5h ago). All tiles should have already been rendering fine 5 hours ago, so this puzzles me a bit. Adding a cache busting parameter to that tile shows that the tile returned by kartotherian is good. All the checks I have done on kartotherian (or via tilerator ui) show that the data / rendering is fine on the nodes. My understanding is that we have a TTL cap at 1 day, but I might be misunderstanding what it means. The fact that we don't expose meaningful caching headers (T108435) does not help in understanding the issue.

The bad tiles lingering in cache are most probably due to the wrong Last-Modified header that is sent by Kartotherian. With this header, the conditional GET sent by varnish returns a 304 (Not Modified) and the tiles are not updated. I added fixing the caching headers (T108435) to the stabilisation (T155601).
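
The effect described above can be sketched as a simplified model of conditional GET revalidation. This is not Varnish or Kartotherian code. The function name and timestamps are made up for illustration, and only the If-Modified-Since comparison is modelled.

```python
from email.utils import parsedate_to_datetime


def revalidate(if_modified_since, origin_last_modified):
    """Return the status an origin would give to a conditional GET.

    304 means the cache keeps serving its stored copy; 200 means the
    cache replaces it with the fresh response.
    """
    cached = parsedate_to_datetime(if_modified_since)
    current = parsedate_to_datetime(origin_last_modified)
    return 304 if current <= cached else 200


stale_header = "Mon, 06 Mar 2017 00:00:00 GMT"

# Origin keeps reporting the old timestamp even though the tile changed.
print(revalidate(stale_header, stale_header))  # 304
# With a timestamp that reflects regeneration, the cache refreshes.
print(revalidate(stale_header, "Tue, 07 Mar 2017 00:00:00 GMT"))  # 200
```

If Last-Modified does not advance when a tile is regenerated, every revalidation ends in the 304 branch and the stale copy stays in cache.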