
Questions about map tile cache performance
Closed, ResolvedPublic

Description

Requesting tiles seems a bit slow, considering that we now have 16 varnishes and almost no load.

https://maps.wikimedia.org/osm-intl/11/604/767.png

The wikipedia resources seem to have 3 varnish servers, but maps show 4, with the first one consistently a miss(0):

x-cache:cp2015 miss(0), cp1046 hit(1), cp3006 hit(1), cp3004 frontend hit(1)
x-cache:cp2015 miss(0), cp1046 hit(1), cp3006 hit(1), cp3004 frontend hit(2)
x-cache:cp2015 miss(0), cp1046 hit(1), cp3006 hit(1), cp3004 frontend hit(3)

Some of the x-varnish headers (not sure what they are):

x-varnish:54309690, 47529231 52262450, 76489992 71663454, 5197 4054078
x-varnish:54309690, 47529231 52262450, 76489992 71663454, 1079612 4054078

Pings:

                                      Packets               Pings
Host                                Loss%   Snt   Last   Avg  Best  Wrst StDev
1. fe80::1                           0.0%   251    1.9   3.7   0.9 113.0  12.4
2. 2a02:2698:6c00::502               0.0%   250   17.9  11.6   2.2 129.2  15.5
3. 2a02:2698:6c00::1e0e              0.0%   250    4.3   6.3   1.6  87.0  11.4
4. GW-ERTelecom.retn.net             0.0%   250    2.8   6.4   1.7 134.9  14.8
5. ae13-110.RT.SL.SPB.RU.retn.net    0.0%   250   11.9  15.9  11.4 140.2  14.2
6. RT.TC2.AMS.NL.retn.net            3.2%   250  126.4 129.4  96.3 2965. 184.1
7. ae2.cr1-esams.wikimedia.org       3.2%   250  126.4 113.1  96.6 278.4  24.3
8. maps-lb.esams.wikimedia.org       1.6%   250  126.3 116.8  95.6 279.4  25.2

Event Timeline

Restricted Application added a subscriber: Aklapper.

An active measurement over 800 requests gives me a 99th percentile of 180 ms. Not amazing, but not incredibly slow either.
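
A measurement of this kind could be sketched as follows. This is only an illustration, not the script actually used for the 800 requests: `time_requests`, `percentile`, and the injected `fetch` callable are hypothetical names, and the nearest-rank percentile definition is an assumption (other tools interpolate and may report slightly different values).

```python
import math
import time
from typing import Callable, List


def time_requests(fetch: Callable[[], object], count: int) -> List[float]:
    """Call fetch() `count` times and return each call's duration in milliseconds."""
    durations_ms = []
    for _ in range(count):
        start = time.perf_counter()
        fetch()  # hypothetical: e.g. one HTTPS GET of a single tile
        durations_ms.append((time.perf_counter() - start) * 1000.0)
    return durations_ms


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample that is >= pct% of all samples."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100.0)
    return ordered[rank - 1]
```

With 800 samples, `percentile(samples, 99)` returns the 792nd smallest timing. The result is only meaningful alongside the client location, the DC being hit, and whether the responses were cache hits.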

If I understand our metrics correctly (https://graphite.wikimedia.org/S/BV), we have a much lower mean response time from codfw than from the other DCs, which would indicate that we don't have a good cache hit ratio in the local DC. I need to dig more to see if that's correct.

There are a number of misunderstandings in this ticket, so let me step through them a bit, and then we can get back to the basics and ask whatever fundamental questions we're trying to ask:

  1. "Requesting tiles seems a bit slow" - this needs better quantification. I think @Gehel tried to quantify it a bit with the 180ms figure, but he doesn't indicate which DC he was hitting from where, or what the underlying latency is from the requesting client to the endpoint (or whether these 180ms requests are cache hits - I'd assume they are if he's fetching the same tile over and over).
  2. The interpretation of X-Cache is completely incorrect. That's not your fault, as interpreting it is not intuitive, but the point is that there's no problem in those headers. They were all frontend cache hits (the fastest possible scenario) in esams (the right-most server, cp3004). The remaining entries to the left come from the cached object itself, and indicate the hit/miss/etc. status at the time the object was originally cached into cp3004. It's often the case that one or more of those entries is an initial cache miss (later requests to that middle-tier cache would be a hit, but now that cp3004 has an object to hit in its own cache, it's not going to talk to that server again, so it keeps showing the stats from the first request).
  3. Continuing on X-Cache, the 3 vs 4 servers difference you're observing between wikipedia and maps is an internal detail you really shouldn't worry about. On a cache miss, we have to traverse multiple layers. Wikipedia's application layer happens to be active in eqiad, and maps' primary happens to be active in codfw, so the varnish routing paths are different. On local cache hits (either the rightmost is a hit, or the rightmost is a miss and the next one to the left is a hit, still in cp3xxx) this is irrelevant.
  4. Your ping data says you're in Russia, and you have 126ms latency from your client to esams. Are you observing results inconsistent with that?
  5. The graphite link from @Gehel does not show latency; it shows request counts. What it's showing is the relative number of requests coming into each datacenter for maps.
BBlack renamed this task from Verify maps caching to Questions about map tile cache performance. Apr 29 2016, 5:17 PM
BBlack triaged this task as Medium priority.
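
The X-Cache reading described in the numbered points above (rightmost entry is the cache closest to the client, entries further left are recorded from when the object was first cached) could be sketched like this. It is a minimal illustration that assumes the entry format seen in this ticket, `host [frontend] status(count)`; the names `CacheEntry`, `parse_x_cache`, and `answering_cache` are hypothetical and are not part of any Varnish or Wikimedia tooling.

```python
import re
from typing import List, NamedTuple, Optional


class CacheEntry(NamedTuple):
    host: str       # e.g. "cp3004"
    frontend: bool  # True if the entry is marked "frontend"
    status: str     # e.g. "hit" or "miss"
    count: int      # the number in parentheses


# Assumed entry shape, taken from the headers quoted in this ticket.
_ENTRY_RE = re.compile(r"^(\S+)\s+(frontend\s+)?([a-z]+)\((\d+)\)$")


def parse_x_cache(value: str) -> List[CacheEntry]:
    """Split an X-Cache value into entries, preserving left-to-right order."""
    entries = []
    for part in value.split(","):
        match = _ENTRY_RE.match(part.strip())
        if match is None:
            raise ValueError(f"unrecognised X-Cache entry: {part!r}")
        host, frontend, status, count = match.groups()
        entries.append(CacheEntry(host, frontend is not None, status, int(count)))
    return entries


def answering_cache(value: str) -> Optional[CacheEntry]:
    """Scan right to left and return the first hit, i.e. the cache nearest the client that had the object."""
    for entry in reversed(parse_x_cache(value)):
        if entry.status == "hit":
            return entry
    return None
```

For the header `cp2015 miss(0), cp1046 hit(1), cp3006 hit(1), cp3004 frontend hit(3)`, `answering_cache` returns the cp3004 frontend entry; the `miss(0)` on cp2015 is history from the original fill, not part of this request.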

@BBlack, awesome explanation, thank you, this clarifies so much!

I am observing a very slow load time on the landline connection, while I see much better performance over a tethered cell network. What is more surprising is that the browser tile loading pattern looks very different:

Over the mobile network: 2.4 seconds total:

pasted_file (853×819 px, 157 KB)

Over the landline (which seems to work ok otherwise): 20+ seconds. Note that the actual downloads do not start right away, and they seem to be sequential. A connection speed test shows an incredible 50 Mbit down and 60(!) Mbit up:

pasted_file (859×837 px, 156 KB)

Mobile network pings:

                                    Packets               Pings
 Host                             Loss%   Snt   Last   Avg  Best  Wrst StDev
 8. ip-83-149-1-137.nwgsm.ru       0.0%    53   47.6  44.8  18.7 174.0  33.1
 9. 78.25.80.89                    0.0%    53   42.0  48.6  17.8 249.9  46.5
10. 78.25.80.88                    0.0%    53   58.1  51.5  17.8 183.4  35.4
11. 10.222.78.97                   0.0%    53   87.6  86.6  52.8 313.7  49.6
12. 83.169.204.78                  0.0%    53   51.8  51.8  28.7 237.6  32.5
13. 83.169.204.89                  0.0%    53   87.6  73.1  49.7 253.9  34.1
14. ae2.cr1-esams.wikimedia.org    0.0%    53   79.5  78.1  50.9 263.2  41.2
15. maps-lb.esams.wikimedia.org    0.0%    53   74.1  77.6  48.5 261.3  41.9

There's still a lot of missing detail here. What browser/os/version is this? How do I reproduce the same page load? What else is at the top of those waterfalls? I'm assuming it's the exact same client in both cases, just switching networks? Is SPDY being used in both cases?

In general, it looks like your mobile network is pretty healthy (there's a lot of latency deviation, but I'd kind of expect that on mobile). Your landline connection averages 50% higher latency than your mobile one to the same destination, which is kind of odd and the reverse of usual expectations. Also, the blank (empty box) part of the waterfall on the landline graph is "Queuing", which Google describes in https://developers.google.com/web/tools/chrome-devtools/profile/network-performance/understanding-resource-timing as:

If a request is queued it indicated that:
The request was postponed by the rendering engine because it's considered lower priority than critical resources (such as scripts/styles). This often happens with images.
The request was put on hold to wait for an unavailable TCP socket that's about to free up.
The request was put on hold because the browser only allows six TCP connections per origin on HTTP 1.
Time spent making disk cache entries (typically very quick.)

I still don't really see anything to indicate there's anything wrong with our cache infrastructure, just odd question marks about your client and connectivity to esams. Perhaps your landline provider, in addition to having poor ping latency to us, also shapes HTTPS traffic differently than ICMP? Or some other device on your landline network is clogging the upstream side of the connection with ACKs from larger downloads, if the link is very asymmetric?

Also, note this line in your original landline ping results:

6. RT.TC2.AMS.NL.retn.net            3.2%   250  126.4 129.4  96.3 2965. 184.1

The worst-case ping at that hop was nearly 3 full seconds, and it has 3.2% packet loss. That can't be a very healthy network. This makes me doubt the claim that it "seems to work ok otherwise", at least when looking in as much detail as you are in this case.
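
The same check can be applied to every hop of a report like the ones pasted in this task. The sketch below assumes the nine-column layout shown above (hop number, host, Loss%, Snt, Last, Avg, Best, Wrst, StDev); `Hop`, `parse_hop`, `suspicious_hops`, and the two thresholds are hypothetical choices for illustration, not established cut-offs.

```python
from typing import Iterable, List, NamedTuple


class Hop(NamedTuple):
    index: int
    host: str
    loss_pct: float
    sent: int
    last_ms: float
    avg_ms: float
    best_ms: float
    worst_ms: float
    stdev_ms: float


def parse_hop(line: str) -> Hop:
    """Parse one report row such as '6. RT.TC2.AMS.NL.retn.net 3.2% 250 126.4 ...'."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}: {line!r}")
    index, host, loss, sent, last, avg, best, worst, stdev = fields
    return Hop(
        int(index.rstrip(".")),
        host,
        float(loss.rstrip("%")),
        int(sent),
        float(last),
        float(avg),
        float(best),
        float(worst),  # a cut-off value like "2965." still parses, as 2965.0
        float(stdev),
    )


def suspicious_hops(
    hops: Iterable[Hop],
    max_loss_pct: float = 1.0,     # assumed threshold, tune to taste
    max_worst_ms: float = 1000.0,  # assumed threshold, tune to taste
) -> List[Hop]:
    """Return hops whose packet loss or worst-case latency exceeds the thresholds."""
    return [h for h in hops if h.loss_pct > max_loss_pct or h.worst_ms > max_worst_ms]
```

Loss reported at a hop usually carries through to the hops behind it, so the first flagged hop is the one worth looking at.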

Chrome 50.0.2661.86 (Official Build) (64-bit) on Ubuntu 16.04. I used https://maps.wikimedia.org/#9/50.7060/-100.3725 for both tests, on the same machine. The connection shows spdy/3.1 for both. My worry is only that I might have misconfigured headers or that Varnishes cause some strange behavior for some of our users. For comparison, google maps loads in 2.7s. Of course Google might have a few more datacenters or links :)

Top of the waterfall:

pasted_file (861×855 px, 154 KB)
Also, it seems that if I refresh after a long pause, I get a slightly different (although still similarly slow) graph:
pasted_file (896×874 px, 159 KB)
I wonder if this is caused by some Chrome/Ubuntu configuration (but since the timing is similar, it might be ok)

Refreshing after a long pause has to re-establish a connection. If you're comparing to Google, then trace your pings to whatever edge IP Google gives you as well. It probably doesn't have a link with high latency spikes and loss. Also, does your connection to gmaps use IPv6, as it does in this case? Have you tried disabling IPv6 to see how that might affect things?

Note, the bad link is not Wikimedia's. It's internal to retn.net and is the link where the traffic hops from Russia to Amsterdam in their network. retn.net is the carrier your ISP (ER-Telecom Holding Saint-Petersburg Branch) uses to jump out of Russia towards us.

I'm still failing to see any evidence there's a problem with the maps Varnish caches, for all of this looking and typing...

@BBlack, thanks for looking into this. Google's servers are also IPv6, and their ping response is around 10.5 ms, which explains it. Regardless, this gives very good insight into how our network is set up (varnish config, etc.), and I think we should close it as non-actionable at this point.

BBlack changed the task status from Invalid to Resolved. Apr 29 2016, 7:49 PM

Yes, at 10ms that probably means the gmaps endpoint you're hitting is inside of Russia, which is completely different...