[Bug] OSM DB degradation during sync as a result of missing features
Open, High, Public

Description

Background

Since this task was created, the OSM database has been degrading over time.

At first, osm2pgsql sometimes failed to complete the daily OSM sync because of corrupted input data, but we couldn't correlate the problems because of the lack of information in the logs.

We now use imposm3 as the ETL engine behind the OSM DB, but the problem persists and we still don't know the cause.

What we know:

  • it's unrelated to specific OSM edits: some missing polygons hadn't been edited for months and still disappeared
  • it's unrelated to the tool that imports the database
  • it's unrelated to initial imports, and we have written some unit tests to verify that

How to mitigate that

While we don't have a solution for the root cause, we have two maintenance procedures to restore the DB state.

These procedures helped fix some issues raised by the community, but keeping up with the degradation rate, which has increased since we introduced Tegola, has become a burden.

The current hypothesis to explore

There have been some PostgreSQL errors caused by connection slot starvation recently. The hypothesis is that, if we can correlate the OSM sync failures with the PG errors, we can find the root cause and act promptly to fix it.

Therefore, we will monitor the database after each OSM change file is applied and look for the causes of the DB degradation (upcoming patch).
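
As a rough illustration of the correlation idea above, here is a minimal sketch that pairs each sync failure with the PostgreSQL errors logged close to it in time. The function name, the five-minute window, and the timestamps are all hypothetical; real input would have to be parsed from the sync and PostgreSQL logs, which this task does not describe.

```python
from datetime import datetime, timedelta


def correlate(sync_failures, pg_errors, window=timedelta(minutes=5)):
    """Map each sync failure time to the PG error times within +/- window."""
    matches = {}
    for failure in sync_failures:
        # Keep only the errors that happened close to this failure.
        matches[failure] = [e for e in pg_errors if abs(e - failure) <= window]
    return matches


# Hypothetical timestamps for illustration only, not taken from real logs.
sync_failures = [datetime(2022, 1, 10, 3, 0), datetime(2022, 1, 11, 3, 0)]
pg_errors = [datetime(2022, 1, 10, 2, 58), datetime(2022, 1, 10, 14, 30)]

result = correlate(sync_failures, pg_errors)
for failure, errors in result.items():
    print(failure.isoformat(), "->", [e.isoformat() for e in errors])
```

A failure with an empty list would count against the hypothesis for that run, while repeated matches would be a reason to look closer at connection slot usage during syncs.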

What to do meanwhile?

For now, we will keep restoring the DB with the maintenance scripts. To make it easier for us to keep up with the bugs, we will propose a Phabricator task template for reporting these issues with the information needed to apply the fix for specific geometries.

A maintenance script is scheduled to re-import the planet next week in order to fix the currently open tasks.

Old task description, kept for historical context

Recently the community has spotted some odd behavior in the geoshapes service: some Wikidata items are available through the service and some are not. See the example below:

It seemed that either something was wrong with the OSM replication script or the initial import script had failed at some point.

Initial import script and Steps to reproduce

The log doesn't show any errors or warnings; it is available at /home/mbsantos/osm-initial-import.log

OSM replication script

The logs don't show any errors or warnings; they are available at /var/log/osmosis

Further investigation and steps to reproduce the actual results

With no leads from the logs, I started investigating the data stored in the PostgreSQL DB, beginning with the relation Lake Garda (8569) from the example above. The relation itself is not the problem, because it is stored in the table planet_osm_rels:

gis=> SELECT id FROM planet_osm_rels WHERE id = 8569;
  id  
------
 8569
(1 row)

But a query similar to the one geoshapes performs (https://github.com/kartotherian/geoshapes/blob/master/geoshapes.js#L64) returns nothing, because the OSM relation didn't become a polygon in planet_osm_polygon:

gis=> SELECT tags->'wikidata' AS id, osm_id
gis-> FROM planet_osm_polygon
gis-> WHERE tags ? 'wikidata' AND tags->'wikidata' IN ('Q6414');                                                             
 id | osm_id
----+--------
(0 rows)
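
The check above can be generalised to a list of relations. The sketch below is a minimal, database-free illustration: it takes relation IDs (as found in planet_osm_rels) and osm_id values (as found in planet_osm_polygon) and reports the relations that have no polygon row. It assumes the osm2pgsql convention of storing relation-derived polygons under the negated relation ID, which matches the negative osm_id returned for the working example in this description. The function name and the sample lists are illustrative only.

```python
def relations_missing_polygon(relation_ids, polygon_osm_ids):
    """Return relation IDs that have no matching row in the polygon table.

    Assumes relation-derived polygons are stored with osm_id = -relation_id.
    """
    polygon_ids = set(polygon_osm_ids)
    return sorted(r for r in relation_ids if -r not in polygon_ids)


# Illustrative sample mirroring the two cases in this description.
relation_ids = [8569, 311766]
polygon_osm_ids = [-311766]

missing = relations_missing_polygon(relation_ids, polygon_osm_ids)
print(missing)  # prints [8569]
```

In practice the two input lists would come from queries against planet_osm_rels and planet_osm_polygon, restricted to relations that carry a wikidata tag.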

Expected Results

Requesting an existing Wikidata-OSM link should return the GeoJSON through the geoshapes service. See the same SQL queries for the Hôtel de Blossac (311766):

gis=> SELECT id FROM planet_osm_rels WHERE id = 3117766;                                                                     
   id
---------
 3117766
(1 row)
gis=> SELECT tags->'wikidata' AS id, osm_id                                                                                  
FROM planet_osm_polygon
WHERE tags ? 'wikidata' AND tags->'wikidata' IN ('Q3145754');                                                                
    id    | osm_id  
----------+---------
 Q3145754 | -311766
(1 row)

Environments Observed

  • All master machines:
    • map1004.eqiad.wmnet
    • map2004.codfw.wmnet

Testing Environment for QA

  • A Beta Cluster environment will be set up to explore the issue

Additional notes

Some references report a similar issue, but their outcome doesn't apply to our infrastructure:
https://help.openstreetmap.org/questions/28563/missing-polygons-or-rendering-issue
https://help.openstreetmap.org/questions/68009/osm2pgsql-how-import-relations-into-polygon-table

Related Objects

Status     Subtype     Assigned  Task
Stalled                None
Open                   None
Resolved               hnowlan
Open       BUG REPORT  None
Open                   None
Resolved               hnowlan
Open                   MSantos
Resolved               MSantos
Resolved               MSantos
Resolved               MSantos
Resolved               MSantos
Resolved               MSantos
Resolved               MSantos
Resolved   BUG REPORT  MSantos
Resolved               MSantos
Duplicate  BUG REPORT  Jgiannelos
Resolved               Jgiannelos
Resolved               Jgiannelos
Resolved               MSantos
Resolved   BUG REPORT  MSantos

Event Timeline

This is still (or is again?) happening with geolines, for some recently tagged relations. Reported on-wiki here and here.

Samples:

Wikidata item | OSM relation | maps.wikimedia.org/geoline URL | Notes
Q7926569 | 10384840 | ?getgeojson=1&ids=Q7926569 | Returns empty array. Relation created & tagged on 8 December 2019.
Q2503631 | 1696029 | ?getgeojson=1&ids=Q1736261 | Returns empty array. Relation tagged on 20 December 2019.
Older ones still work, e.g. Q1736261 | 9240546 | ?getgeojson=1&ids=Q1736261 | Works fine. Relation created/tagged on 21 January 2019.
And this one works: Q4818259 | 1628553 | ?getgeojson=1&ids=Q4818259 | Works fine. OSM history times out, but probably tagged 19 December 2019 per this wikipedia diff
This one I tagged today: Q2503626 | 114503 | ?getgeojson=1&ids=Q2503626 | Returns empty array. Relation tagged on 25 December 2019, but I'm expecting it probably still won't work 2 days from now.
Evad37 renamed this task from [Bug] Some OSM relations didn't become polygons and are not been served through geoshapes service to [Bug] Some OSM relations didn't become polygons and are not been served through geoshapes service or geoline service. Dec 25 2019, 1:24 AM

New objects in particular are missing as apparently OSM replication has been failing entirely since the beginning of last month. This is being worked on and replication lag is gradually reducing now (see T239728).

Lake Huron now seems to be present, but Lake Michigan goes white at various zoom levels. England and Wales National Parks have lost their green colour except for random individual tiles at very high zooms. People will rapidly lose confidence in the maps if major features such as lakes and islands are absent.

OSM replication was disabled around January 24 (T243609). So now all objects tagged after that are missing.

I'm glad to see this marked as high priority. I just waded around a bunch trying to figure out why I wasn't able to get the Maplink template working before finding this.

The geoshape service has begun returning the expected geometry for this particular relation.

Hi,
I've found a link that returns no data, https://maps.wikimedia.org/geoline?getgeojson=1&ids=Q34600, even though https://www.openstreetmap.org/relation/357794 was tagged more than 3 years ago. I don't understand why.

This relation was missing the type=boundary and boundary=administrative tags from two months ago until about an hour ago. The service should return the correct geometry once it updates to the latest change to that feature.

Hi all, looks like there is a related issue for Q23103 (diag page) and Q21694759 (diag page). Both are named Shropshire but have different borders - the first is the ceremonial boundary while the last is the local government area.

I have done the suggested checks, such as ensuring that the Q number and the tags type=boundary and boundary=administrative are on OSM, although the administrative tag is missing on other UK counties without issue, such as Q23124 (diag page). I also saw a suggestion that the name tag centre missing on OSM may cause an issue, but the working example I've given does not have one either.

I've twiddled a couple of fields on OSM (and tidied up some Wikidata duplicate entries) hoping to force a reload to Wiki but could a Good Samaritan double check to ensure I haven't overlooked anything while waiting for the DB update? It would be appreciated as it's the only UK county that doesn't show. Thanks everyone.

Also reporting in that I've had to switch the Portland, Oregon and Columbus, Ohio enwiki articles' mapframe maps from shape to line to "fix" their maps, though then the shapes cannot be shaded as I would prefer. Will be eager to have a fix.

Change 752748 had a related patch set uploaded (by MSantos; author: MSantos):

[operations/puppet@production] maps: introduce imposm-geometry-import

https://gerrit.wikimedia.org/r/752748

MSantos renamed this task from [Bug] Some OSM relations didn't become polygons and are not been served through geoshapes service or geoline service to [Bug] OSM DB degradation during sync as a result of missing features. Jan 11 2022, 2:12 PM
MSantos updated the task description.

Of note is that what is missing are some pretty sizeable elements. It makes you wonder if the size of the body (or even MTU or transaction time) is causing a problem. Is it a coincidence that we lost the lakes twice? In two different systems? That’s suspicious.

That's something to reason about: the only things that haven't changed between the two setups are the hardware and the PostgreSQL configuration.

There are still some odd goings-on with white tiles in the middle of some of the Great Lakes, and also (and I have not seen this before) some strange green parts over what should be lake. E.g. en:Bruce Peninsula has an OSM location map on which I am seeing both green and white oddities. If you click on the interactive link, you can also see different white tiles and green 'shore extensions' at different zoom levels in maplink. Maybe this is just a transient bug; thanks to those who have worked on this.

Screenshot 2022-03-16 at 16.57.06.png (818×1 px, 158 KB)

The white ones are just old cached tiles that will take a while to expire and re-render. The green I'm not entirely sure about. It might be old and new tiles being adjacent, which throws off the interpretation of the shapes for colouring. That seems most likely. If so, it should fix itself once all the old tiles have expired.

Hmm, the broken tiles are still there. I would have expected them to have cleared by now...
It's kind of strange: even if I request a tile in a completely different language or format, which is much less likely to be cached, I still don't get a correct tile. Maybe something fishier is going on after all?

All broken:
https://maps.wikimedia.org/osm-tegola/9/139/183.png?lang=en
https://maps.wikimedia.org/osm-tegola/9/139/183.png?lang=da
https://maps.wikimedia.org/osm/9/139/183.png?lang=da

From what I understand of the new infra, this is only possible if the vector tiles in Cassandra have not yet been refreshed for these tiles?
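
To help locate the broken tiles listed above, here is a minimal sketch that converts a z/x/y tile address into the latitude and longitude of the tile's north-west corner. It assumes the URLs follow the standard OSM slippy-map z/x/y numbering in Web Mercator, which the task does not state explicitly; the function name is illustrative.

```python
import math


def tile_nw_corner(z, x, y):
    """Return (lat, lon) in degrees of the north-west corner of tile z/x/y."""
    n = 2 ** z  # number of tiles per axis at this zoom level
    lon = x / n * 360.0 - 180.0
    lat = math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * y / n))))
    return lat, lon


# Tile address taken from the URLs above: /9/139/183.png
lat, lon = tile_nw_corner(9, 139, 183)
print(f"NW corner: lat={lat:.3f}, lon={lon:.3f}")
```

Under that assumption the corner comes out at roughly 45.6°N, 82.3°W, which appears to fall in the Lake Huron area and is consistent with the Great Lakes reports in this thread.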

After a non-exhaustive search around various bodies of water around the world, the only examples of the tile problem I came across were in the Great Lakes. At various zoom levels on Huron and Superior quite a few blocks show up, and each block appears to be consistently there at its 'problem' zoom levels. On Michigan I only found a few errant tiles, e.g. at Pentwater (but that doesn't mean there aren't more at zooms/locations I didn't check), and I saw no problems in Lakes Ontario or Erie.

Can someone write an overview of what is going on?

It's really hard to keep track from a distance, and I say that as someone who at least slightly understands what's going on. OSM shape data is out of date, OSM tiles are out of date, and data/features are missing. It seems that people are working on things with the goal of fixing it all, but it's really difficult to gauge what is going on: which servers are working (and how old their data is) and which are not, what the timelines are, which issue touches on which ticket, etc. I'm sure someone is discussing these things in a standup meeting or something, but please organise the tickets and connect them.

I know people are trying, but from an external perspective it just looks like a wall of silence towards the community. Can someone please take charge of informing the community? Ping @Quiddity @Whatamidoing-WMF: can you please push internally on this?

@TheDJ Thanks for pushing this discussion forward, and sorry for the silence on it. It wasn't intentional, but rather the result of multiple things happening on our end that pushed maps down the priority list.

This issue happened again due to a DC switchover that brought old and broken data in the codfw cluster. We're now assessing the situation and coordinating with SRE on switching over to the most up-to-date cluster and restoring health and parity between the clusters as soon as possible.

As soon as I have a timeline I'll post it here. Again, sorry for the disturbance.

I think it's just really difficult for people to even understand what "the issue" is, and which effects come from the issue itself versus from the work attempting to fix it. Some clarification on that would be useful too.

.... a DC switchover that brought old and broken data in the codfw cluster

This is a different issue than the one outlined in the description of this task, right? There are several recent tasks on old and broken data, none of which mention the cause.

I'd also like you (your team) to at least create public tasks for any such issue, so that interested parties can find out that it is a known issue and that it's on someone's plate.