
Investigate if pre-rendering images is having an impact on performance
Closed, ResolvedPublic

Description

  1. Investigate a way to look at the performance data for files uploaded since October 22
  2. Compare those image load times with other low-traffic images that may not have been pre-rendered.
  1. https://gerrit.wikimedia.org/r/174933
  2. https://gerrit.wikimedia.org/r/174928

"Still needs tests, but can be reviewed already. I'm not happy about the promise being passed around that much, but I don't really see a way around it without major refactoring."

Event Timeline

MarkTraceur assigned this task to Gilles.
MarkTraceur raised the priority of this task from to Medium.
MarkTraceur updated the task description.
MarkTraceur added a project: Multimedia.
MarkTraceur moved this task to Next up on the Multimedia board.
MarkTraceur changed Security from none to None.
MarkTraceur subscribed.

Change 174933 had a related patch set uploaded (by Gilles):
Track the most recent upload time for performance events

https://gerrit.wikimedia.org/r/174933

Patch-For-Review

Change 174933 merged by jenkins-bot:
Track the most recent upload time for performance events

https://gerrit.wikimedia.org/r/174933

Tgr raised the priority of this task from Medium to High. Dec 11 2014, 12:47 AM

I'm going to backport this to get the results a couple of days earlier. It should be a good source of information about chaining as well (checking if it's worth doing).

Change 179918 had a related patch set uploaded (by Gilles):
Track the most recent upload time for performance events

https://gerrit.wikimedia.org/r/179918

Patch-For-Review

Change 179918 merged by jenkins-bot:
Track the most recent upload time for performance events

https://gerrit.wikimedia.org/r/179918

Change 179921 had a related patch set uploaded (by Gilles):
Backport Media Viewer performance tracking

https://gerrit.wikimedia.org/r/179921

Patch-For-Review

Change 179921 merged by jenkins-bot:
Backport Media Viewer performance tracking

https://gerrit.wikimedia.org/r/179921

Initial poking at the data suggests that prerendering actually worsened performance. I'll work on creating graphs around the period of when prerendering and chaining were introduced, in order to verify if the correlation is real, or if something else is responsible.

Change 180136 had a related patch set uploaded (by Gilles):
Query image performance by upload time

https://gerrit.wikimedia.org/r/180136

Patch-For-Review

Actually I've just thought of checking that data by splitting it into varnish hits (shouldn't be affected by prerendering) and varnish misses (should be where the prerendering kicks in).

It turns out that performance seems to worsen for varnish hits (the majority of requests) for images uploaded recently, which should be independent of prerendering, chaining, etc. As for varnish misses, the data seems insufficient right now to have reliable means, let alone percentiles, but it doesn't display the same trend as varnish hits.

@fgiunchedi is it possible that images recently added to Varnish perform worse than images that were added to Varnish a while ago? I.e. if they've been added recently they would be in fewer frontends, and more likely to have to be pulled from a varnish backend.

If that's the case, I'll need to update the changeset to only look at varnish misses as a measure of whether or not prerendering is beneficial.

Good question Gilles. It shouldn't happen AFAIK, but it is possible. Were those also direct fetches from swift? What data/logs are you looking at?

I'm looking at our own data, gathered from sampled clients, which records whether the image was a varnish hit or not. It's parsed from the HTTP headers, but we might not retain all the information. You can see the data we have in the MultimediaViewerNetworkPerformance_10596581 table of the "log" db on analytics-store.eqiad.wmnet. The event_varnish* columns are what I use to check whether it's a hit or a miss. The event_uploadTimestamp column tells you on what date the file was uploaded. To look only at images, make sure that event_type is set to "image".

Here's the query that shows the performance statistics for image loads per upload month, only for varnish hits: P159

The sample sizes it currently yields are probably enough for means but not for percentiles. It has only been tracking this since yesterday, and more data is continuously being added for all upload time periods.
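For readers without access to P159, here is a minimal Python sketch of the kind of aggregation described above: keep only image events that were varnish hits, bucket them by upload month, and report the sample count next to the mean so that thin buckets are easy to spot. This is not the actual query. The record keys (`type`, `varnish_hit`, `uploaded`, `total_ms`) are hypothetical stand-ins for the real columns, and the sample values are invented purely to exercise the function.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def mean_load_time_by_upload_month(events):
    """Return {upload month: (sample count, mean load time in ms)}.

    Only image events that were varnish hits are counted. The dictionary
    keys used here are hypothetical, not the real table columns.
    """
    buckets = defaultdict(list)
    for event in events:
        if event["type"] != "image" or not event["varnish_hit"]:
            continue
        month = event["uploaded"].strftime("%Y-%m")
        buckets[month].append(event["total_ms"])
    return {month: (len(times), mean(times)) for month, times in sorted(buckets.items())}


# Invented sample records, for illustration only (not real measurements).
sample_events = [
    {"type": "image", "varnish_hit": True, "uploaded": datetime(2014, 11, 3), "total_ms": 400},
    {"type": "image", "varnish_hit": True, "uploaded": datetime(2014, 11, 20), "total_ms": 600},
    {"type": "image", "varnish_hit": True, "uploaded": datetime(2013, 5, 1), "total_ms": 300},
    {"type": "image", "varnish_hit": False, "uploaded": datetime(2014, 11, 9), "total_ms": 900},
    {"type": "other", "varnish_hit": True, "uploaded": datetime(2014, 11, 9), "total_ms": 100},
]

stats = mean_load_time_by_upload_month(sample_events)
for month, (count, average) in stats.items():
    print(month, count, average)
```

Returning the count with each mean is deliberate: with small buckets a mean can still be indicative, while percentiles computed from a handful of samples would not be.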

Average performance experienced by users hitting the cp4xxx "varnish2" servers is slower: P160. Is this expected?

Note that this assumes an even distribution of file sizes and client bandwidth across the data. I thought it might be that the cp4xxx servers hold bigger files, and it seems like that might be the case, although it doesn't explain everything: P161

It's possible that these servers are serving more traffic to some countries that have lower bandwidth, though. I don't know if there's some geographical load balancing for varnish servers.

For "varnish1" and "varnish3" servers, the results seem to be even, although the sampling is insufficient to draw conclusions on some of the "varnish3" servers. By varnish1, 2, 3 I mean the servers as they are listed in order in the header response for the requests. Presumably it represents the various layers of servers in the varnish stack.

Yep, clients are geo-located to the closest datacenter via DNS, so different cp machines get very different clients in terms of bandwidth and latency, for example. Disabling prerendering and running the measurements again sounds easier to test this theory, I think that's scheduled to happen already?

"Disabling prerendering and running the measurements again sounds easier to test this theory, I think that's scheduled to happen already?"

Yes, I'll take care of that when I come back from vacation in early January.

Which regions do the cp4xxx servers cover, out of curiosity?

Never mind, I answered my own question by looking at the data: P164. It seems to be predominantly Asia, with Taiwan taking the lion's share. It's consistent with our geographical performance data: http://multimedia-metrics.wmflabs.org/dashboards/mmv#geographical_network_performance-graphs-tab (second map for images)

Now that I have more data I can confirm that only varnish hits are affected by the phenomenon where images uploaded recently (past 2 months) seem to perform worse than images that were uploaded longer ago. So far I had been looking at the file upload time, so I decided to look at the varnish age of the cached thumbnail (i.e. how long the hit has been cached in varnish for) to check if the performance pattern also applied that way: P172. It clearly does. So I'm filing a new Ops task for that: T84980

As for prerendering, to get back to the topic of this task, it seems like only 0.66% of varnish misses trigger a thumbnail generation! Unfortunately this is information we didn't have when we decided to venture into prerendering. Had we known that, it probably wouldn't have been such a priority.

The rest of the time, varnish misses are merely a swift pull, because that thumbnail has already been generated at some point in the past and stored in swift. That explains why the effect of prerendering wasn't noticeable on the global performance stats. Varnish hits make up most of the requests, so overall prerendering only has an effect on 0.13% of the image requests as a whole.
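As a quick sanity check on how those two percentages relate, the sketch below assumes that the 0.13% figure is simply the 0.66% figure scaled by the share of image requests that are varnish misses. Under that assumption the implied miss share is roughly a fifth of requests. The variable names are illustrative, and the miss share itself is derived here, not a number quoted in this task.

```python
# Figures quoted in the comments above.
generation_share_of_misses = 0.0066  # 0.66% of varnish misses trigger thumbnail generation
generation_share_of_all = 0.0013     # 0.13% of all image requests are affected

# Assumption: share_of_all = miss_share * share_of_misses,
# so the miss share can be recovered by division.
implied_miss_share = generation_share_of_all / generation_share_of_misses

print(f"Implied varnish miss share: {implied_miss_share:.1%}")
```

That is consistent with the statement that varnish hits make up most of the requests.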

The percentage of varnish misses that trigger thumbnail generation should be higher for recently uploaded files, so the effect of prerendering might be more important for those. By turning prerendering off, we'll be able to put a figure on that, which will help us decide how much of a priority we want to make rectangular-based prerendering.

Change 183885 had a related patch set uploaded (by Gilles):
Disable thumbnail prerendering in production

https://gerrit.wikimedia.org/r/183885

Patch-For-Review

Change 183885 merged by jenkins-bot:
Disable thumbnail prerendering in production

https://gerrit.wikimedia.org/r/183885

Change 180136 merged by jenkins-bot:
Query image performance by upload time

https://gerrit.wikimedia.org/r/180136

See the multimedia mailing list for continued discussion on this topic. The answer to the question this task asked is that pre-rendering only matters for 0.085% of images served by Media Viewer. I'll keep an eye on the performance by upload time in a few months as final confirmation that prerendering had no effect.