Improve tile storage for maps.wikimedia.org
Open, High, Public

Description

Problem statement

The tegola-vector-tiles service stores rendered vector map tiles as pbf on thanos-swift (swift storage). Currently we store every tile and never delete any, so our swift containers (currently tegola-swift-eqiad-v002 and tegola-swift-codfw-v002) grow constantly. Large containers on swift may cause problems in the swift infrastructure outside the scope of maps, because of the way storage is allocated and the lack of system-level sharding.

We need a way to expire tiles from swift storage based on when they were last accessed.
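As a rough illustration of access-based expiry, here is a minimal sketch. It assumes a last-access timestamp per tile key is available from somewhere (for example, request logs); nothing in this task says such data exists yet. The function name and the 30-day threshold in the usage note are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def expired_tiles(last_access, max_age, now=None):
    """Return the tile keys whose last access is older than max_age.

    last_access: dict mapping a tile key such as "15/1/2" to an aware
    datetime of its most recent request (assumed to be collected elsewhere).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - max_age
    # Sorted only to make the result deterministic.
    return sorted(key for key, seen in last_access.items() if seen < cutoff)
```

Calling expired_tiles(last_access, timedelta(days=30)) would list the keys that could be deleted from the container; the deletion itself is out of scope for this sketch.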

Actionables (Hackathon 2023)

As per T333318#8867628

  • Add a requestctl rule to block empty referrer requests (and/or maps.wikimedia.org)
  • Create new swift containers on both datacenters
  • Depool codfw, point codfw tegola to the new container
  • Enable kartotherian's request mirroring to start warming up codfw's new container
  • After a reasonable number of objects has been created, we can rinse and repeat with eqiad

General Ideas

Tile expiration

Cache sharding

  • Allow tegola to use multiple cache backends
    • Extend tegola to use multiple cache backends with sharding
    • Write a proxy to do the sharding between tegola and multiple swift containers
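
To make the sharding idea concrete, here is a minimal sketch of mapping a tile key to one of several containers by hashing the key. It is not taken from tegola or from any existing patch; the function name and the container names in the usage note are hypothetical.

```python
import hashlib


def shard_for(tile_key, containers):
    """Pick a container for a tile key by hashing the key.

    The same key always maps to the same container as long as the
    list of containers (and its order) does not change.
    """
    digest = hashlib.sha256(tile_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(containers)
    return containers[index]
```

For example, shard_for("15/17000/11000", ["tegola-shard-0", "tegola-shard-1"]) returns one of the two names. With plain modulo hashing, changing the number of containers remaps most keys, so a consistent-hashing scheme may be preferable if shards are expected to be added later.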

Event Timeline

During the hackathon, we came up with some findings regarding maps traffic:

image.png (1×1 px, 136 KB)

This means that applications/websites are still using our maps infrastructure by making requests with an empty referer.

Goal

Have as few objects as possible on swift, mainly objects that we use for our sites. Right now, our swift containers are "polluted" with objects generated from non-Wikimedia traffic.

Immediate solutions we can do

Create and warm up new swift containers with only objects that have been requested from our sites

  • Add a requestctl rule to block empty referrer requests (and/or maps.wikimedia.org)
  • Create new swift containers on both datacenters
  • Depool codfw, point codfw tegola to the new container
  • Enable kartotherian's request mirroring to start warming up codfw's new container
  • After a reasonable number of objects has been created, we can rinse and repeat with eqiad

From the current storage, the number of tiles per zoom level (low zoom levels are irrelevant):

root@maps2009:~# cat tiles-1684658556.txt | grep "^15/" | wc -l
26529207
root@maps2009:~# cat tiles-1684658556.txt | grep "^14/" | wc -l
12937282
root@maps2009:~# cat tiles-1684658556.txt | grep "^13/" | wc -l
6561060
root@maps2009:~# cat tiles-1684658556.txt | grep "^12/" | wc -l
2997025
root@maps2009:~# cat tiles-1684658556.txt | grep "^11/" | wc -l
1282211
root@maps2009:~# cat tiles-1684658556.txt | grep "^10/" | wc -l
694325
root@maps2009:~# cat tiles-1684658556.txt | grep "^9/" | wc -l
224453
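
The same per-zoom counts can be gathered in a single pass over the listing instead of one grep per level. This is a minimal sketch that assumes, as the grep patterns above imply, that each line of the listing starts with the zoom level followed by a slash; the function name is hypothetical.

```python
from collections import Counter


def tiles_per_zoom(lines):
    """Count tiles per zoom level from lines that look like 'z/...'."""
    counts = Counter()
    for line in lines:
        zoom, sep, _rest = line.strip().partition("/")
        # Skip blank lines and anything that does not start with 'z/'.
        if sep and zoom.isdigit():
            counts[int(zoom)] += 1
    return counts
```

It can be run against the listing with something like: with open("tiles-1684658556.txt") as f: print(sorted(tiles_per_zoom(f).items())).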

Here is the WIP tegola patch with tests passing for allowing sharding storage based on tile key:
https://github.com/johngian/tegola/tree/caches-sharding

Immediate solutions we can do

Create and warm up new swift containers with only objects that have been requested from our sites

  • Add a requestctl rule to block empty referrer requests (and/or maps.wikimedia.org)
  • Create new swift containers on both datacenters

Logging requests towards maps.wikimedia.org with an empty referrer, we see that the traffic is not negligible:

image.png (596×2 px, 137 KB)

We see around 500-1000+ requests per second (this graph is using a 1/128 sample). We are moving forward with enabling the filter.

Change 924112 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] tegola: Switch swift container to tegola-swift-codfw-v003

https://gerrit.wikimedia.org/r/924112

Change 924114 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[maps/kartotherian/deploy@master] Enable mirroring from eqiad to codfw

https://gerrit.wikimedia.org/r/924114

Change 924114 merged by jenkins-bot:

[maps/kartotherian/deploy@master] Enable mirroring from eqiad to codfw

https://gerrit.wikimedia.org/r/924114

Hey y'all, I was pointed to this task by @stjn. These changes unexpectedly broke my toolforge project at bullseye.toolforge.org - it was serving wikimedia maps but in the past week it started serving gray tiles and I was getting 403s in the browser console from the tile queries. I hacked around it by adding <meta name="referrer" content="strict-origin-when-cross-origin" /> to the source of my tool (h/t @AntiCompositeNumber for figuring that out), but this was an unexpected breakage and it would have been nice if it had been communicated more widely.

More importantly... it seems that the webserver is setting a referrer-policy: same-origin HTTP header. Is that a default setting of nginx on toolserver? Are other tools broken?

Looking at https://k8s-status.toolforge.org/namespaces/tool-bullseye/pods/bullseye-6bfb678845-6qh58/, it looks like bullseye runs off the toolforge-python39-sssd-web container. My tools, which also run python39, don't set the header, so I don't think it's a Toolforge issue. My tools run Flask, but bullseye is Django.

I found https://docs.djangoproject.com/en/4.2/ref/settings/#std-setting-SECURE_REFERRER_POLICY which confirms that the default Referrer-Policy header set by Django is same-origin. The better way to set the Referrer-Policy header would be to change that setting.
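
For reference, changing that setting could look like the fragment below. This is a minimal sketch of a Django settings file, not the actual configuration of bullseye; the policy value mirrors the one used in the meta tag workaround above, and the header is only emitted when Django's SecurityMiddleware is enabled.

```python
# settings.py (fragment, illustrative only)

MIDDLEWARE = [
    # SecurityMiddleware is what adds the Referrer-Policy response header.
    "django.middleware.security.SecurityMiddleware",
    # ... the project's other middleware entries go here ...
]

# Django's default is "same-origin", which sends no Referer header on
# cross-origin requests such as tile requests to maps.wikimedia.org.
SECURE_REFERRER_POLICY = "strict-origin-when-cross-origin"
```

With this in place the meta tag in the page source should no longer be needed.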

Is there a specific reason the tile viewer at https://maps.wikimedia.org/ was broken?

Misconfiguration. This specific one is fixed, we still have a few pending items though.

FYI @GeneralNotability , it seems that you have this element in the body of the webpages now, but it should be inside the <head>. I'm kinda surprised that the browser is this tolerant with it.

This comment was removed by jijiki.

This is on serviceops and we are terribly sorry for that. We have been trying to come up with ways to prevent clients from using maps for commercial purposes. We will keep trying; in the meantime, please keep using this thread for every issue you notice.

Please post some notice about it in maps-l (at the very least). If I wasn’t lurking through Hackathon tasks, I might’ve never known what caused this, tbh.

We were watching for new tasks, since this is a trial and error process.

is there a specific reason the tile viewer at https://maps.wikimedia.org/ was broken?

Misconfiguration. This specific one is fixed, we still have a few pending items though.

It seems the examples in https://www.mediawiki.org/wiki/Wikimedia_Maps/API are still broken. E.g. https://maps.wikimedia.org/geoline?getgeojson=1&ids=Q2087925 returns an error. It worked for a while after loading the respective object in a Wikipedia article, but now it returns an error again. These API links have been useful here in Phabricator tasks so far.

TheDJ triaged this task as High priority. Jun 10 2023, 4:17 PM

Can we please unbreak maps.wikimedia.org so that it only applies this referrer check to the actual TILES and not to /geoline, /geoshape and /img/?

This is kinda annoying when trying to investigate issues with maps right now.
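
The scoping asked for here can be sketched as a small predicate. This is only an illustration of the logic, not requestctl syntax and not the rule that was deployed; the tile path pattern (style/z/x/y with an image or pbf extension) and the function name are assumptions.

```python
import re

# Assumed shape of a tile path, e.g. /osm-intl/6/33/22.png or
# /osm-intl/6/33/22@2x.png. Paths such as /geoline, /geoshape and
# /img/... do not match it.
TILE_PATH = re.compile(r"^/[^/]+/\d+/\d+/\d+(@[\d.]+x)?\.(png|jpe?g|pbf)$")


def should_block(path, referer):
    """Block only tile requests that arrive without a Referer header."""
    if not TILE_PATH.match(path):
        return False  # never apply the referrer check to non-tile endpoints
    return not (referer or "").strip()
```

Under this sketch, a request for /geoline with no referrer passes, while a tile request with no referrer is blocked. The path is expected without its query string.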