
tiles.wmflabs.org OSM is outdated
Open, Medium, Public, BUG REPORT

Description

I was doing some maintenance on tiles.wmflabs and I noticed that the osmdb for labs currently doesn't seem to be in sync?

For instance, this tile: https://tiles.wmflabs.org/osm/18/136090/86311.png is freshly generated, yet it shows structures that no longer exist and have been removed from OSM; those structures no longer show in the production version of the tile: https://maps.wikimedia.org/osm-intl/18/136090/86311@2x.png

I know a lot of maps work has been happening, but I'm not entirely sure what is going on exactly, and whether this is part of previous problems, due to the current work, or completely unexpected...

Event Timeline

ping @MSantos, as you are the only one I know who might have some insight into this.

For days I also get a lot of missing tiles (404) especially in higher zoom levels.

If it is no longer maintained, is there a WMF alternative?
https://maps.wikimedia.org/osm-intl has a more conservative visual appearance, and tile servers from OpenStreetMap itself should be avoided for policy reasons...

For days I also get a lot of missing tiles (404) especially in higher zoom levels.

That is T285145. I deleted lots of stuff. The problem is that the current server can't generate tiles quick enough when you go to an area that it doesn't have the tiles for yet, without throwing 404s. Before it would just show you 3 year old tiles.

Majavah added a subscriber: Majavah.

I'm not sure how those are maintained, but as far as I'm aware this is not a Toolforge service.

Majavah renamed this task from Outdated tool forge maps to tiles.wmflabs.org OSM is outdated.Aug 7 2021, 8:26 AM
Majavah removed a subscriber: Majavah.

@Majavah I didn't know we kept separate tags for the DB replicas of Toolforge. Do you know what the project tag for that is?

@Majavah I didn't know we kept separate tags for the DB replicas of Toolforge. Do you know what the project tag for that is?

I am unfortunately not sure. This service seems to live in the maps Cloud VPS project, which is completely separate from Toolforge (tools Cloud VPS project).

No, that's just a specific rendering. This is about the Postgres server with the OSM database, used by both VPS and Toolforge projects. https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#Connecting_to_OSM_via_the_official_CLI_PostgreSQL

I mean, it's infra that no one wants to be responsible for, I get it, but we gotta put some sort of tag on it...

No, that's just a specific rendering. This is about the Postgres server with the OSM database, used by both VPS and Toolforge projects. https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#Connecting_to_OSM_via_the_official_CLI_PostgreSQL

taavi@tools-sgebastion-07:~ $ host osmdb.eqiad.wmnet
osmdb.eqiad.wmnet is an alias for osm.db.svc.eqiad.wmflabs.
osm.db.svc.eqiad.wmflabs has address 172.16.6.105
taavi@tools-sgebastion-07:~ $ host 172.16.6.105
105.6.16.172.in-addr.arpa domain name pointer clouddb1003.clouddb-services.eqiad1.wikimedia.cloud.

That server is in clouddb-services. That's technically cloud-services-team territory, maybe they have an idea what's going on here?

nskaggs triaged this task as Medium priority.Aug 27 2021, 1:27 PM
nskaggs moved this task from Inbox to Needs discussion on the cloud-services-team (Kanban) board.

The OSM database on clouddb1003.clouddb-services.eqiad1.wikimedia.cloud constantly runs out of available connections, which would likely stop automated things from happening to it. I was curious whether it was actually in any kind of functional use, since I'd been seeing issues for ages due to the way people connect to it (and just restarted the database here and there to clear it). The problem easily returns: any misconfigured tool that connects will keep this happening, since the database has no auth for read access inside cloud (which should not be true of any database). If that is part of the system here, I can start by increasing the number of connections from the default. That would be a new requirement, which makes me wonder what is broken.
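For anyone following along, checking connection pressure on a PostgreSQL server like this is straightforward. A minimal dry-run sketch (the `run` wrapper only prints the commands, so nothing here touches a real database; the config-line parser is just a toy helper):

```shell
# Dry-run: print commands instead of executing them.
run() { echo "+ $*"; }

# On the real host one would compare the limit against current usage:
run sudo -u postgres psql -c 'SHOW max_connections;'
run sudo -u postgres psql -c 'SELECT count(*) FROM pg_stat_activity;'

# Toy helper: pull the number out of a postgresql.conf-style line.
parse_max_connections() {
    printf '%s\n' "$1" | sed -n 's/^max_connections[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}
```

If `count(*)` sits at or near `max_connections` whenever the updater fails, the limit (or a connection-leaking client) is the culprit.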

Ok, that's not what's up here. The database is working, but replication isn't, because it doesn't have a state file. I've downloaded that and am manually trying a replication job. I'm going to need to disable the cron, though.

Mentioned in SAL (#wikimedia-cloud) [2021-08-30T21:53:52Z] <bstorm> disable puppet and osm updater script on clouddb1003 T285668

Ok, so far, the script didn't explode like it clearly did in the logs, so that's progress.

Never mind. It just did.
Osm2pgsql failed due to ERROR: Connection to database failed: FATAL: remaining connection slots are reserved for non-replication superuser connections

That's what I've been seeing.

Mentioned in SAL (#wikimedia-cloud) [2021-08-30T22:07:58Z] <bstorm> restarting osmdb on clouddb1003 to try to capture enough connections T285668

The puppetization is not flexible in this area, so I'm trying brute force first and restarting the DB.

Nope, it instantly runs out of available connections.

That's not because of external connections to the db, that's for sure....

root@clouddb1003:~# netstat -npt | grep 5432 | awk '{print $5}' | grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}' | cut -d: -f1 | sort | uniq -c | sort -nr | head

12 172.16.5.154
 1 172.16.6.106

Ok, it works if I set the number of processes lower :) I'll set something in the puppetization.

That said, I find the lack of actual work it did concerning. It's using an internal web proxy that may not be valid for inside Cloud VPS...
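The parallelism knob in question is presumably osm2pgsql's worker-process flag; lowering it keeps the number of simultaneous DB connections under the server's limit. A dry-run sketch (the `run` wrapper only prints the command; the flag value, database name, and change file are illustrative assumptions):

```shell
# Dry-run: print commands instead of executing them.
run() { echo "+ $*"; }

# Fewer worker processes means fewer simultaneous DB connections
# competing for the server's connection slots.
run osm2pgsql --append --slim --number-processes 2 -d gis changes.osc.gz
```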

Before I go and try to commit this change to puppet, @TheDJ have I fixed what I needed to on OSMDB? I'll work up a patch now that the script can at least run without obvious errors. I suspect it needs more work to be doing the right thing.

I'm thinking it's the state file I downloaded. Digging in a bit.

I have a feeling that when this was moved to VMs and set up, at some point replication broke from OSM and now it needs to be brought up to date in a more heavy handed way. Possibly akin to T254014. It could just be a matter of picking an older state file, but at this point, this isn't stuff I've kicked much. This database is weird in that it has a couple of custom databases in it, so whatever we do, we shouldn't drop everything on the server...just the osm bits. That should not matter to the actual OSM database since they are separate.
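For context, the replication state file in question is a small key/value text file. The format below is the real osmosis-style layout (colons in the timestamp are backslash-escaped), though the values are invented; picking an "older state file" means picking one whose timestamp predates the break. A tiny parser for that timestamp:

```shell
# A replication state.txt looks roughly like:
#   sequenceNumber=3549000
#   timestamp=2019-02-22T00\:00\:00Z
# (sequence number and date here are made up for illustration)

# Extract the timestamp, dropping the backslash escapes.
parse_state_timestamp() {
    sed -n 's/^timestamp=//p' "$1" | tr -d '\\'
}
```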

@MSantos any ideas or assistance would be appreciated. I don't think WMCS has ever had much idea how to operate this thing since it wasn't well documented when I tried to save it from certain death.

I'll kick the puppet setup so that I can persist the change to the number of threads, since that clears up *one* error.

Change 715623 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloud osmdb: set num_threads in the sync job

https://gerrit.wikimedia.org/r/715623

Change 715624 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloud osmdb: don't use proxy for cloud

https://gerrit.wikimedia.org/r/715624

Yes, likely we will at the very least need to set a state file from before we got out of sync... Determining when that was exactly is probably going to be the hard part, however... I think it was at least about a year or so ago....

I'm thinking about whether there are ways I can identify when.....
And thank you Brooke, for helping out.

Change 715623 merged by Bstorm:

[operations/puppet@production] cloud osmdb: set num_threads in the sync job

https://gerrit.wikimedia.org/r/715623

Change 715624 merged by Bstorm:

[operations/puppet@production] cloud osmdb: don't use proxy for cloud

https://gerrit.wikimedia.org/r/715624

@TheDJ Well, I created the instance on Feb. 22, 2019. If we presume it never was replicating in this setup (which seems a safe guess), I think that's a pretty good day to start from.

Mentioned in SAL (#wikimedia-cloud) [2021-08-31T20:19:14Z] <bstorm> attempting to resync OSMDB back to Feb 21st 2019 T285668

Looking better now: Processing: Node(240k 0.6k/s) Way(0k 0.00k/s) Relation(0 0.00/s)

It's gonna be a while. I've disabled the cron and puppet. It's running in a screen session.
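For reference, the manual steps described here (disable puppet and the cron, run the resync in a screen session) look roughly like this dry-run sketch. The updater script path is an assumption, and the `run` wrapper only prints the commands:

```shell
# Dry-run: print commands instead of executing them.
run() { echo "+ $*"; }

# Stop puppet from re-enabling the cron mid-resync:
run puppet agent --disable 'manual OSM resync, T285668'
# Comment out the osm updater entry so it doesn't race the manual run:
run crontab -l
# Run the long resync detached, so it survives the SSH session:
run screen -dmS osm-resync /usr/local/bin/osm-updater.sh  # path assumed
```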

That did an awful lot. The output is not easily capturable at this point because of how much it dumped to the screen.

The weirdest thing I see is:
Writing dirty tile list (4954K)
node cache: stored: 2743934(100.00%), storage efficiency: 81.38% (dense blocks: 291, sparse nodes: 493857), hit rate: 3.52%

The state file is now claiming things are up to date. I'll re-enable the cron and puppet. @TheDJ do you have a way to check if things are working as expected now? There are old parts of this puppet code that clearly do not mesh with reality still, so it is worth checking.

Since there was a discussion of project tags up there in the back scroll: the correct tag for the OSMDB is Data-Services, moved to the column "maps". Data-Services is also the correct tag for the wikireplicas themselves used in Toolforge, though most tickets are about usage, not the servers themselves, so it's probably ok that nobody really uses that tag except me and a few other people. It's not well documented.

I wonder if the database servers for this could be moved into the maps project so people would have access to run these things who have access to administer the maps system. They are currently in clouddb-services where maps admins cannot access them, so you have to wait for me to fumble around with them :)

Was not able to confirm this yet, will have to look tomorrow.

I've not been able to get it to draw a tile that looks like a current tile. I'm not sure why. I don't have the experience to deal with problems like these, unfortunately. Maybe we need to ask wikitech-l?

I've not been able to get it to draw a tile that looks like a current tile. I'm not sure why. I don't have the experience to deal with problems like these, unfortunately. Maybe we need to ask wikitech-l?

That seems fair. The cloud database server is basically using the production WMF puppet code, but I have very little idea how it all works or about openstreetmaps in general. I'm happy I got something that was definitely broken to stop being broken, but that only buys so much :)

Also, I can grant access to the servers for any WMF maps people, or I could also start working on migrating this to the maps project instead where anyone with maps project access can get in and kick around the database themselves. I don't think WMCS provides much benefit in administering these databases separately since we don't have maps people and had no idea the database was so badly out of sync in the first place. I don't actually know who's doing maps work these days for the foundation, but I would very much like to get help from them.

Mentioned in SAL (#wikimedia-cloud) [2021-09-02T18:52:14Z] <bstorm> removed strange old duplicate cron for osmupdater T285668

@akosiaris I remember asking you about this setup in the past. Do you have any thoughts on what might be wrong here?

I also just found that we are actually missing the coastlines table and the land_polygons table from the original setup, it seems. I can try to recreate those by downloading the dumps mentioned in the puppet setup.

Change 716543 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloud osmdb: update the filenames in case we re-import the shapefiles

https://gerrit.wikimedia.org/r/716543

@akosiaris I remember asking you about this setup in the past. Do you have any thoughts on what might be wrong here?

Hi! I am quite a bit rusty on this, plus tools have changed since I last messed with this. In my experience the sanest (albeit not fastest) way out of an out-of-sync database was to a) back up the couple of extra user databases, b) drop the database, c) re-import it from scratch using osm2pgsql (assuming that is still the tool being used), which would usually take a day or two, and d) restore the extra user databases dumped in step a). For the duration of all of this, puppet was best left disabled. From that point on, I have no real experience with the maps WMCS infrastructure. I think (!) tiles are generated by apache's mod_tile module, but by now I am grasping at straws.
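The a)-d) procedure above, transcribed as a dry-run shell sketch. The database and file names are assumptions, not the real ones on clouddb1003, and the `run` wrapper only prints each command:

```shell
# Dry-run: print commands instead of executing them.
run() { echo "+ $*"; }

# a) back up the extra user databases
run sudo -u postgres pg_dump -Fc user_db -f user_db.dump
# b) drop the out-of-sync OSM database
run sudo -u postgres dropdb gis
# c) re-import from scratch with osm2pgsql (usually a day or two)
run sudo -u postgres createdb gis
run sudo -u postgres osm2pgsql --create --slim -d gis planet-latest.osm.pbf
# d) restore the user databases backed up in a)
run sudo -u postgres pg_restore --create -d postgres user_db.dump
```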

I also just found that we are actually missing the coastlines table and the land_polygons table from the original setup, it seems. I can try to recreate those by downloading the dumps mentioned in the puppet setup.

That would be the way those are populated. If the files are around, puppet would try and create those tables.

I don't actually know who's doing maps work these days for the foundation, but I would very much like to get help from them.

Maps work for the foundation is being done by the Web and Infrastructure Product department's team. However, I don't think they ever touched the setup in WMCS; as far as I know they are now using different tools, and thus I would not dare to presume they will be able to help much.

In fact, I don't think the foundation ever had a member who had delved beyond the osm syncing step for the maps WMCS project. I don't even know of a frequent channel of communication between the foundation's present and past maps teams and the volunteers running the maps WMCS project.

I was wrong about coastlines and land_polygons, btw. The tables are there. I don't know if they need updating, but they are there. The sync process seems to be working now, but I don't really know how to check. I could do a resync from scratch (from what I can tell, all the tooling is exactly the same as it always was), though I am not sure whether that will help the situation or not.

Does anyone on the task know how to tell if the database is the problem (at this point now that I've theoretically maybe caught it up since it stopped syncing) or if some other bit is?

I was wrong about coastlines and land_polygons, btw. The tables are there. I don't know if they need updating, but they are there.

Coastlines change very slowly so they very rarely need updating. I am not sure about land polygons though. I would expect them to also need updating on the order of years, so definitely not often.

The sync process seems to be working now, but I don't really know how to check.

There's modules/osm/files/osm_sync_lag.sh that should spit out the lag of the server. If that doesn't error out or spit out some nonsensical value, then the DB should be properly synced. It quite possibly is already set up via puppet, and prometheus is scraping it regularly.
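I haven't checked what osm_sync_lag.sh actually does, but conceptually "lag" here is just the distance between the replication state timestamp and now. A toy version of that calculation (the second argument stands in for "now" so it can be exercised offline; requires GNU date):

```shell
# Seconds of replication lag between a state timestamp and a reference
# time, both in ISO 8601 / state.txt format.
state_lag_seconds() {
    echo $(( $(date -u -d "$2" +%s) - $(date -u -d "$1" +%s) ))
}
```

A healthy minutely-replication setup would keep this in the tens or hundreds of seconds, not years.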

Does anyone on the task know how to tell if the database is the problem (at this point now that I've theoretically maybe caught it up since it stopped syncing) or if some other bit is?

Maybe @Kolossos or @dschwen could help with that.

I am late to the party, but tiles.wmflabs.org is not part of the official Maps infrastructure. I'm not sure how its infrastructure works and I have limited time to help you set it up currently (maybe next quarter), but @Bstorm, if you need to sync and want to know what we have available in terms of maps OSM tooling, please ping me for a chat.

We now use imposm3 to import data and it would be interesting to see this project do more fancy stuff, like using OpenMapTiles.

I am late to the party, but tiles.wmflabs.org is not part of the official Maps infrastructure. I'm not sure how its infrastructure works and I have limited time to help you set it up currently (maybe next quarter), but @Bstorm, if you need to sync and want to know what we have available in terms of maps OSM tooling, please ping me for a chat.

We now use imposm3 to import data and it would be interesting to see this project do more fancy stuff, like using OpenMapTiles.

There are two pieces to all this. We've got a maps project in cloud (which tiles.wmflabs.org is hosted on) that is run by volunteers and whoever knows how to make it work, and a Postgres OSM database that is run by my team. That database consumes the same puppet stack as the production WMF ones, as far as I know. Unfortunately, I don't know a thing about maps, but I have plenty of root, so I'm trying to use that for good :)

My questions have partly been about how to make sure the Postgres database this connects to is in good working order, so that I know I'm not the blocker. From there, I'm sure other people here who work on the maps project, like @TheDJ, could use any insight on what to kick to make things start working right. I usually think it is safe to assume that the problem is probably on my end since I haven't paid attention to the database much for a while, which is why I'm trying to sync up the Postgres database. That's just the organizational context. I'll ping you soon!

I am late to the party, but tiles.wmflabs.org is not part of the official Maps infrastructure. I'm not sure how its infrastructure works and I have limited time to help you set it up currently (maybe next quarter), but @Bstorm, if you need to sync and want to know what we have available in terms of maps OSM tooling, please ping me for a chat.

We now use imposm3 to import data and it would be interesting to see this project do more fancy stuff, like using OpenMapTiles.

There are two pieces to all this. We've got a maps project in cloud (which tiles.wmflabs.org is hosted on) that is run by volunteers and whoever knows how to make it work, and a Postgres OSM database that is run by my team. That database consumes the same puppet stack as the production WMF ones, as far as I know.

I don't think that's true anymore (although it's a recent development). While puppet abstracts it a bit so that the same defines and classes are used, maps in production now uses imposm3, while the one in WMCS still uses osm2pgsql to populate the OSM database.

I am late to the party, but tiles.wmflabs.org is not part of the official Maps infrastructure. I'm not sure how its infrastructure works and I have limited time to help you set it up currently (maybe next quarter), but @Bstorm, if you need to sync and want to know what we have available in terms of maps OSM tooling, please ping me for a chat.

We now use imposm3 to import data and it would be interesting to see this project do more fancy stuff, like using OpenMapTiles.

There are two pieces to all this. We've got a maps project in cloud (which tiles.wmflabs.org is hosted on) that is run by volunteers and whoever knows how to make it work, and a Postgres OSM database that is run by my team. That database consumes the same puppet stack as the production WMF ones, as far as I know.

I don't think that's true anymore (although it's a recent development). While puppet abstracts it a bit so that the same defines and classes are used, maps in production now uses imposm3, while the one in WMCS still uses osm2pgsql to populate the OSM database.

@akosiaris is right: the puppet config is backwards compatible, and to use imposm3 you need to opt into it explicitly, so it doesn't change the tooling for tiles.wmflabs.org.

If you need some reference about updating the OSM database using osm2pgsql, look into this task (T254014) for a checklist of procedures; some of them don't apply because they are specific to the PG replicas in the production database. Never mind, you already got it; maybe there's more relevant info at https://wikitech.wikimedia.org/wiki/Maps/OSM_Database

My biggest question is how to tell if replication is working right. I set it to do a full replication from the time it had stopped and now the state file thinks it is current. I can re-import the coastlines table and the polygon one, but how do you know when it is working :)

My biggest question is how to tell if replication is working right. I set it to do a full replication from the time it had stopped and now the state file thinks it is current.

That was always good enough in the past.

I can re-import the coastlines table and the polygon one, but how do you know when it is working :)

What was good enough in the past was having the tables there and knowing the file one downloaded was relatively recent (not that it matters much; coastlines don't change that often, and this had not been syncing since circa 2019, so coastlines and land polygons are the least of the problems). The "app" would pick them up. The one thing that does come to mind is that the app needed to know which tiles had expired, which was accomplished by allowing the tile servers to download, via rsync, a file osm2pgsql exported. With so much time in between, I don't know whether it makes sense in this specific case for the tile servers to use that file, or to just regenerate most tiles from scratch.

The one thing that does come to mind is that the app needed to know which tiles had expired, which was accomplished by allowing the tile servers to download, via rsync, a file osm2pgsql exported. With so much time in between, I don't know whether it makes sense in this specific case for the tile servers to use that file, or to just regenerate most tiles from scratch.

Ahhh! Ok, I did see that the sync generates some kind of set of expired tiles. This also has an rsync server on it. I wonder if the maps project setup changed the server to rsync from back when labsdb1006/7 was shut down...or if the rsync server is still configured as expected? That suggests some places to look. Thanks!
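If the rsync path is still wired up, the tile-server side would look something like this dry-run sketch. `render_expired` is a real mod_tile utility, but the rsync host, module name, file names, and zoom value here are guesses, and the `run` wrapper only prints the commands:

```shell
# Dry-run: print commands instead of executing them.
run() { echo "+ $*"; }

# Pull the expired-tiles list exported by osm2pgsql on the DB host:
run rsync -av rsync://osmdb.eqiad.wmnet/expired_tiles/ /srv/expired/
# Feed it to mod_tile so those tiles get re-rendered:
run sh -c 'render_expired --map=osm --min-zoom=10 < /srv/expired/latest.list'
```

If the maps project still points its rsync at a long-gone labsdb host, the tile servers would simply never learn which tiles went stale, which fits the symptoms.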

OK, finally had some time to take a look again. I looked at planet_osm_nodes. If we do a SELECT * FROM planet_osm_nodes ORDER BY id DESC LIMIT 1; we find the 'newest' node, I'd guess. This returns node 9049471600.

I've learned there is an easy way to find a node (or way, or relation) on the current OSM map: https://www.openstreetmap.org/node/9049471600
This shows "last edited" of Mon, 30 Aug 2021 23:59:34 +0000 (I guess that also means the sync isn't running ;)
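One way to cross-check a single node against the live map is the public OSM API. A dry-run sketch (the `run` wrapper only prints the commands, and `extract_osm_timestamp` is a toy extractor for the `timestamp` field of the API's JSON, not a real OSM tool; the database name `gis` is an assumption):

```shell
# Dry-run: print commands instead of executing them.
run() { echo "+ $*"; }

# Live version of the node, from the public read-only API:
run curl -s https://www.openstreetmap.org/api/0.6/node/9049471600.json
# Does the local database have it at all?
run psql -d gis -c 'SELECT 1 FROM planet_osm_nodes WHERE id = 9049471600;'

# Toy helper: pull the first "timestamp" value out of the API JSON.
extract_osm_timestamp() {
    sed -n 's/.*"timestamp":"\([^"]*\)".*/\1/p'
}
```

Comparing that timestamp against the replication state file gives a quick sanity check on how far behind the database is.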

So I've done some digging on ids, since we don't have timestamps in the tables.
This query gets lists of ids pretty close to the highest id we have in the db and calculates gaps in the ids. They show gaps of about 100 ids max.

SELECT id + 1 AS gap_start,
       next_id - 1 AS gap_end,
       next_id - id AS gap_size
FROM (
    SELECT id, lat, lon, lead(id) OVER (ORDER BY id) AS next_id
    FROM planet_osm_nodes WHERE id > 9048471600 ORDER BY id LIMIT 100000
) nr
WHERE nr.id + 1 <> nr.next_id;

Now take it back further in time, say starting from id 9028100000 (last edited Mon, 23 Aug 2021 08:30:02 +0000):
I'm seeing lots of huge gaps of 10,000 and even 200,000 nodes at times... that seems very high for pretty recent data. This can also be compared to much older data (just decrease the first digit of the id), and you can see that very old data tends to show gaps more in the order of hundreds than hundred thousands.

With some casual binary searching, I'm seeing somewhere in the region of id 6382000000 as a ballpark start of lots of very large gaps, which would be somewhere after 03 Apr 2019 21:20:07 +0000, indicating something went wrong with the sync. It's pretty hard to figure out what exactly, but it's almost as if days got skipped, perhaps?

The data point I'm using to verify the state of things is one in my city that I know was added over the last couple of years: https://www.openstreetmap.org/node/7005849394 (Sun, 24 Nov 2019 10:38:15 +0000).
It's just not in our tables.

I'm not sure what the best approach is. Maybe we just had a state that was too young/old and then had cascading query failures on data when syncing? Maybe we should go back a bit further? Or maybe re-import from scratch?

Ahhh! Ok, I did see that the sync generates some kind of set of expired tiles. This also has an rsync server on it. I wonder if the maps project setup changed the server to rsync from back when labsdb1006/7 was shut down...or if the rsync server is still configured as expected? That suggests some places to look. Thanks!

Pretty sure that was never done, because I didn't do it, and I think I was the last person to touch the tiles server. Maybe @dschwen has it running on the wma server?

Ok, so from what you just said, it sounds to me like the OSMDB needs to be rebuilt, after dumping the appropriate databases, to make sure we don't have gaps, since it is on VMs. That also suggests it is a good time to consider building the service inside the maps project instead of in the special "admin only" space of clouddb-services. I don't know the implications of syncing up the design of this sync with the production one, but that might be worth considering as well.