
maps.wikimedia.org is showing old vandalized version of OSM
Closed, Resolved, Public

Description

See https://en.wikipedia.org/wiki/Special:Permalink/854500898#Help_-_Sneaky_Vandalism%21 for context

Compare:
https://maps.wikimedia.org/#12/40.7582/-73.9921 (Jewtropolis)

Versus:
https://www.openstreetmap.org/#map=12/40.7582/-73.9921 (New York)

Other vandalism seen all over the globe. OSM has fixed it but we're still showing it. Is there a way to clear the cache (assuming it's a cache issue)?

Event Timeline

MusikAnimal triaged this task as Unbreak Now! priority. (Aug 11 2018, 9:13 PM)

T137939: Increase frequency of OSM replication (and then displaying the changes promptly) probably needs to be a higher priority... otherwise this issue will just repeat itself next time OSM gets vandalised.

See also T159631#3078163, specifically with regard to a bad edit on OSM that was fixed on OSM but required a manual database update, because otherwise it would have taken at least a day to update here.

Logs indicate that the previous update ran without issue. As seen in Grafana, tiles have been regenerated. A direct check on maps2001 indicates that the tiles are correct, so we probably have a cache invalidation issue with Varnish. A quick look at the caching headers does not show anything wrong with them (but that was only a quick look).

It looks like the cache is starting to invalidate. I don't have a precise timeline of when this issue happened, when we synced the problematic data and when we synced the corrected data, but tiles expiring now is consistent with what I would expect, provided we synced the problematic data on Saturday Aug 11 and the corrected data on Sunday Aug 12 (both syncs starting around 1:30 am and taking ~6 hours to generate most tiles).
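
The reasoning above can be turned into a rough upper bound on how long a vandalized tile may stay in a shared cache. This is only a sketch: the 1:30 start and ~6 hour generation time come from the comment above, the 24 hour TTL from the kartotherian headers, and treating the sync start as UTC is an assumption, as is the helper name `stale_window`.

```python
from datetime import datetime, timedelta, timezone

TTL = timedelta(seconds=86400)         # max-age / s-maxage sent by kartotherian
SYNC_START = (1, 30)                   # "around 1:30 am"; UTC is an assumption
GENERATION_TIME = timedelta(hours=6)   # "~6 hours to generate most tiles"


def stale_window(bad_sync_day, good_sync_day):
    """Return (earliest, latest) times a bad tile could be served from cache.

    Hypothetical helper: a bad tile can first appear when the bad sync starts,
    and can still be fetched into the cache until the corrected sync has
    finished regenerating it. From that point it lives for at most one TTL.
    """
    bad_start = datetime(*bad_sync_day, *SYNC_START, tzinfo=timezone.utc)
    good_start = datetime(*good_sync_day, *SYNC_START, tzinfo=timezone.utc)
    good_done = good_start + GENERATION_TIME
    return bad_start, good_done + TTL


earliest, latest = stale_window((2018, 8, 11), (2018, 8, 12))
print(earliest.isoformat())  # 2018-08-11T01:30:00+00:00
print(latest.isoformat())    # 2018-08-13T07:30:00+00:00
```

Under these assumptions, cached copies of the bad tiles would expire gradually over Aug 12 and the last ones by the morning of Aug 13, which matches tiles starting to expire now.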

For documentation purposes, below are the headers generated by kartotherian. They look correct to me: they set a TTL of 24 hours, which is what I expect.

While T137939 is supposed to increase the frequency of refresh from OSM, there is no plan to significantly reduce the TTL on the Varnish side. Reducing this TTL could have a non-trivial performance impact. We might be able to reduce it, but not without some research into the impact.

gehel@maps2001:~$ curl -sv localhost:6533/osm-intl/12/1203/1538.png > /dev/null
* Hostname was NOT found in DNS cache
*   Trying ::1...
* connect to ::1 port 6533 failed: Connection refused
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 6533 (#0)
> GET /osm-intl/12/1203/1538.png HTTP/1.1
> User-Agent: curl/7.38.0
> Host: localhost:6533
> Accept: */*
> 
< HTTP/1.1 200 OK
< access-control-allow-origin: *
< access-control-allow-headers: accept, x-requested-with, content-type
< access-control-expose-headers: etag
< x-xss-protection: 1; mode=block
< x-content-type-options: nosniff
< x-frame-options: SAMEORIGIN
< content-security-policy: default-src 'self'; object-src 'none'; media-src 'none'; style-src 'self'; script-src 'self'; frame-ancestors 'self'
< x-content-security-policy: default-src 'self'; object-src 'none'; media-src 'none'; style-src 'self'; script-src 'self'; frame-ancestors 'self'
< x-webkit-csp: default-src 'self'; object-src 'none'; media-src 'none'; style-src 'self'; script-src 'self'; frame-ancestors 'self'
< x-powered-by: kartotherian: 0.0.38 (c49f37c39515675d95d3dd7da09ca535ec0d448b)
< Cache-Control: public, max-age=86400, s-maxage=86400
< Content-Type: image/png
< ETag: "da5d7d24b729e78b042240f6d189e938"
< Last-Modified: Sun, 12 Aug 2018 05:25:54 GMT
< x-vector-backend-object: default
< Content-Length: 39425
< Date: Sun, 12 Aug 2018 15:32:31 GMT
< Connection: keep-alive
< 
{ [data not shown]
* Connection #0 to host localhost left intact
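
To make the header check above repeatable, here is a minimal Python sketch that parses the `Cache-Control`, `Last-Modified` and `Date` values copied from the curl output. The helper `parse_cache_control` is hypothetical, and the expiry estimate assumes a shared cache fetched the tile at the `Date` shown and honours `s-maxage` as sent.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime


def parse_cache_control(value):
    """Parse a Cache-Control header value into a dict of directives."""
    directives = {}
    for part in value.split(","):
        name, _, arg = part.strip().partition("=")
        if name:
            # Numeric arguments become ints; bare directives map to True.
            directives[name.lower()] = int(arg) if arg.isdigit() else (arg or True)
    return directives


# Values copied from the kartotherian response above.
headers = {
    "Cache-Control": "public, max-age=86400, s-maxage=86400",
    "Last-Modified": "Sun, 12 Aug 2018 05:25:54 GMT",
    "Date": "Sun, 12 Aug 2018 15:32:31 GMT",
}

cc = parse_cache_control(headers["Cache-Control"])
# Shared caches prefer s-maxage over max-age when both are present.
shared_ttl = timedelta(seconds=cc.get("s-maxage", cc.get("max-age", 0)))
served_at = parsedate_to_datetime(headers["Date"])
generated_at = parsedate_to_datetime(headers["Last-Modified"])

print(cc)                        # {'public': True, 'max-age': 86400, 's-maxage': 86400}
print(shared_ttl)                # 1 day, 0:00:00
print(served_at - generated_at)  # 10:06:37
print(served_at + shared_ttl)    # 2018-08-13 15:32:31+00:00
```

The tile was regenerated about ten hours before the request, and a cache that stored this response would keep it for a further 24 hours.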

Mentioned in SAL (#wikimedia-operations) [2018-08-12T15:43:36Z] <gehel> full cache invalidation of maps tiles - T201772

Invalidating the Varnish cache (see P7451) seems to work. Browser caches might need refreshing, but there is not much we can do about that.

Full tile invalidation did generate high load on the maps servers' CPU, but only a fairly small increase in our 95th-percentile response time.

Why do we cache for 24 hours? That seems like a lot for clients to cache. 1 hour would seem more than sufficient, wouldn't it? Varnish could even use stale-while-revalidate to keep its responsiveness.

I remember @Krinkle telling me once that the benefit of long cache times wasn't nearly as large as we presumed, and they managed to dial it back to a 5-minute expiration for the primary JavaScript and CSS files, with almost the same backend resource usage as before.

Once people realize what an effective vandalism vector this is, I think we can expect to see a lot more of this.

Why do we cache for 24 hours? That seems like a lot for clients to cache. 1 hour would seem more than sufficient, wouldn't it? Varnish could even use stale-while-revalidate to keep its responsiveness.

Honestly, "historical reasons". We could, and probably should, reduce it. I'm just saying that we should do some investigation first.

As far as I know, varnish already does stale-while-revalidate.
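
For readers unfamiliar with the term, stale-while-revalidate can be illustrated with a toy in-memory cache. This is a sketch of the general idea only, not Varnish's implementation: the class `SwrCache`, its `ttl` and `grace` parameters, and the 1 hour / 60 second values are all hypothetical, and the refresh is done inline where a real cache would do it in the background.

```python
class SwrCache:
    """Toy cache: entries are fresh for `ttl` seconds, then may be served
    stale for up to `grace` more seconds while being refreshed."""

    def __init__(self, fetch, ttl, grace):
        self.fetch = fetch    # callable(key) -> value; stands in for the tile backend
        self.ttl = ttl
        self.grace = grace
        self.store = {}       # key -> (value, stored_at)

    def get(self, key, now):
        entry = self.store.get(key)
        if entry is not None:
            value, stored_at = entry
            age = now - stored_at
            if age <= self.ttl:
                return value, "fresh"
            if age <= self.ttl + self.grace:
                # Answer immediately with the stale copy and refresh the entry.
                # (Done inline here; a real cache would refresh asynchronously.)
                self.store[key] = (self.fetch(key), now)
                return value, "stale"
        # Nothing usable in the cache: fetch synchronously.
        value = self.fetch(key)
        self.store[key] = (value, now)
        return value, "miss"


# The backend returns a vandalized tile first, then corrected ones.
versions = iter(["vandalized", "fixed", "fixed-2"])
cache = SwrCache(fetch=lambda key: next(versions), ttl=3600, grace=60)
results = [cache.get("12/1203/1538", t) for t in (0, 1800, 3630, 3640)]
print(results)
# [('vandalized', 'miss'), ('vandalized', 'fresh'),
#  ('vandalized', 'stale'), ('fixed', 'fresh')]
```

With a 1 hour TTL in this toy model, the bad tile is gone shortly after the hour is up, and no client waits on the backend except for the very first request.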

Gehel claimed this task.

Resolving as this specific issue is now OK.