How about "synthetic" instead of webpagetest? Since we're about to introduce a new kind of synthetic test
It's possible that to do this right we'll need multiple machines. The bottleneck is the time it takes to complete the slowed-down requests. Maybe we could try an instance cheaper than c4.large and see if we get the same stability? We haven't tried that yet, i.e. finding the cheapest kind of AWS instance that can support this level of stability.
Thu, Nov 16
Fair enough, I didn't have the case of privilege escalation in mind.
I don't see the big gap on synthetic testing anymore. In RUM, we don't have enough FF57 data yet, but what's already there suggests that FF57 is on par with Chrome62 for the median. Can't compare percentiles with so little data.
What little gatekeeping there would be in thumbor is that thumbor would only access private containers if the request comes from the thumb.php endpoint (which is different from varnish's). No key required; the different URL handler is enough.
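A minimal sketch of what that handler-level gatekeeping could look like (everything here is hypothetical, not actual thumbor code):

```python
# Hypothetical sketch: only a dedicated handler, reachable solely through
# the thumb.php-style endpoint, may read private containers. All names
# (PRIVATE_CONTAINERS, load_from_swift, the handler classes) are
# illustrative, not real thumbor identifiers.
import tornado.web

PRIVATE_CONTAINERS = {'wikitech-local-private'}

def load_from_swift(container, path):
    return b''  # stub: fetch the object bytes from swift

class PublicThumbHandler(tornado.web.RequestHandler):
    def get(self, container, path):
        if container in PRIVATE_CONTAINERS:
            raise tornado.web.HTTPError(404)  # pretend it doesn't exist
        self.write(load_from_swift(container, path))

class PrivateThumbHandler(tornado.web.RequestHandler):
    # Routed only from the thumb.php endpoint; varnish never sends
    # public traffic here, so no shared key is needed.
    def get(self, container, path):
        self.write(load_from_swift(container, path))
```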
Hmmm, the more I think about the implementation, the more the shared key looks like make-believe security.
@fgiunchedi the shared secret key sounds like the simplest thing to do. Do you think we should have a separate thumbor swift user for private wikis? Or just grant it r/w access to those containers if it doesn't have it already?
With FF57 it seems like Firefox beats Chrome for StartRender on the Sweden article (10k DOM elements), is slightly slower for the Facebook article (12k DOM elements), and noticeably slower for the Barack Obama article (16k DOM elements).
Purged that file, the issue is still the same.
Wed, Nov 15
Is it possible to avoid libcurl adding that header automatically? Presumably if that header wasn't set in the update call, it wouldn't be touched by Swift.
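If the update goes through libcurl, it can be told not to inject an internally-generated header by passing the bare header name followed by a colon. A pycurl sketch; the URL and header name are placeholders, since I'm not sure which header is involved:

```python
# Sketch: libcurl drops one of its automatically-added headers when you
# pass the bare header name with a colon and no value via HTTPHEADER
# (documented libcurl behavior). 'X-Some-Header' and the URL are
# placeholders.
import pycurl

c = pycurl.Curl()
c.setopt(pycurl.URL, 'https://swift.example/v1/AUTH_acct/container/object')
c.setopt(pycurl.CUSTOMREQUEST, 'POST')
c.setopt(pycurl.HTTPHEADER, ['X-Some-Header:'])  # bare colon: don't send it
c.perform()
c.close()
```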
Tue, Nov 14
Indeed, I'm seeing the same thing with a lot of images being stuck before being served, just for Firefox. Chrome is fine with mitmproxy. This would suggest the bad behavior comes from Firefox or its Selenium driver. Or maybe that's just how Firefox behaves when throttled this way?
I see that I'm getting a couple, but not the same ones:
Tested all the instructions and fixed a little thing, it's all good.
It's not in Docker form, you can go ahead and make that, it should be very simple. Make sure that you use the latest mitmproxy, what comes with Debian and Ubuntu is very outdated (no HTTP/2 support, etc.).
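Something like this for the Dockerfile, assuming installing from PyPI is acceptable (untested sketch):

```dockerfile
FROM python:3.6-slim
# Install mitmproxy from PyPI: the Debian/Ubuntu packages are very
# outdated (no HTTP/2 support, etc.).
RUN pip install --no-cache-dir mitmproxy
EXPOSE 8080
ENTRYPOINT ["mitmdump"]
```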
Running mitmproxy + chrome and firefox every 20 minutes on wmcs, added to https://grafana.wikimedia.org/dashboard/db/webpagereplay?refresh=15m&orgId=1
It works! First working replayed runs with Firefox:
Mon, Nov 13
Yeah I couldn't figure out the format to pass a map from the command line
Adding an option to browsertime to point to a firefox profile would help, I think. That'd probably be the most future-proof thing to do here.
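For reference, such an option would mostly be plumbing; selenium already supports launching Firefox from an existing profile directory. Shown here with selenium's Python bindings, profile path and URL as placeholders:

```python
# What a browsertime "point at a firefox profile" option would boil down
# to under the hood: selenium launching Firefox from an existing profile
# directory. The path is a placeholder.
from selenium import webdriver

profile = webdriver.FirefoxProfile('/path/to/prepared/profile')
driver = webdriver.Firefox(firefox_profile=profile)
driver.get('https://en.wikipedia.org/wiki/Sweden')
driver.quit()
```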
They use the autoconf feature, which is an enterprise thing. I tried setting that up at the OS level by putting the files in the right places, but that doesn't seem to work. I can't see any examples of people using that with selenium.
Yeah this works, except for the root CA. I see that setting the selenium capability acceptInsecureCerts to true might help. The syntax for the option to set custom capabilities in browsertime is deliberately undocumented. What's the proper format to use it?
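In case it helps narrow down the format, this is what that capability looks like through selenium's Python bindings (browsertime presumably forwards the equivalent to geckodriver):

```python
# Setting acceptInsecureCerts via selenium's Python bindings, so Firefox
# trusts mitmproxy's self-signed CA.
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

caps = DesiredCapabilities.FIREFOX.copy()
caps['acceptInsecureCerts'] = True  # trust the proxy's certificates
driver = webdriver.Firefox(capabilities=caps)
driver.quit()
```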
Mitmproxy actually gives slightly better stability than webpagereplay did on that WMCS machine (SpeedIndex min-max range of 218 vs 278 over a 24-hour period). Even just for Chrome, it's a contender to webpagereplay that we ought to test in the same docker AWS environment and on lawrencium.
That rollout is surprisingly slow
Fri, Nov 10
I've co-opted the "webpagereplay" machine on WMCS and started running browsertime + mitmproxy on it, still with the same settings as everything else, sending metrics to https://grafana.wikimedia.org/dashboard/db/webpagereplay
I can't get mitmproxy to work with firefox run by browsertime. I'm not sure what the culprit is; it's as if the firefox.preference options are ignored and firefox doesn't proxy the traffic at all. Is --firefox.preference well tested in browsertime?
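For comparison, these are the prefs I'd expect --firefox.preference to end up setting, shown via selenium's Python bindings to rule out the prefs themselves being wrong (assumes mitmproxy's default listener on 8080):

```python
# The proxy prefs that should make Firefox send traffic through mitmproxy.
from selenium import webdriver

profile = webdriver.FirefoxProfile()
profile.set_preference('network.proxy.type', 1)  # 1 = manual proxy config
profile.set_preference('network.proxy.http', '127.0.0.1')
profile.set_preference('network.proxy.http_port', 8080)
profile.set_preference('network.proxy.ssl', '127.0.0.1')
profile.set_preference('network.proxy.ssl_port', 8080)
driver = webdriver.Firefox(firefox_profile=profile)
```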
AWS still wins in terms of stability, by a wide margin. @Peter did you see a stability difference when running the code directly vs inside a docker container?
Thu, Nov 9
I've added a desktop test for the Barack Obama article running on lawrencium every 15 minutes with the same settings as your docker image, sending metrics to https://grafana.wikimedia.org/dashboard/db/webpagereplay
The bare metal server with SSDs is now available, it's lawrencium.eqiad.wmnet
It does work, thank you.
lawrencium.eqiad.wmnet prompts me for a password, I imagine I don't have SSH access to it? The perf-team shell group would be the correct one to use here.
Actually, based on @Tbayer's remark in the other thread, adding a field to EventLogging would give us more accurate data. With graphite as the storage mechanism by way of statsd, the aggregation would get in the way of generating correct histograms over a long period of time. I.e. we could end up combining per-minute percentiles instead of computing actual percentiles over the longer period, which is only an approximation.
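A quick synthetic illustration of the problem:

```python
# Why recombined per-minute percentiles are only an approximation:
# synthetic latency samples (ms) for two one-minute buckets.
import numpy as np

minute_a = np.array([100] * 99 + [1000])      # one outlier
minute_b = np.array([100] * 50 + [900] * 50)  # half the requests slow

per_minute_p75 = [np.percentile(m, 75) for m in (minute_a, minute_b)]
print(np.mean(per_minute_p75))  # 500.0, what rolled-up percentiles give
print(np.percentile(np.concatenate((minute_a, minute_b)), 75))  # 900.0, the actual p75
```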
It would be helpful for the affected user to provide the response headers for a failing thumbnail, from the browser developer tools. The screenshot alone doesn't tell us enough about what kind of error was encountered.
It seems like wikitech isn't configured like the other production wikis and doesn't go through the usual thumbnailing environment. It doesn't use Thumbor; its thumbnails are generated by MediaWiki.
Wed, Nov 8
Worth considering if we're really stuck with that problem. These requests aren't cached and with a new service like what you've described, I expect that the bottleneck (if any) will still be kafka. I get that piggy-backing on the shared memory buffer allowed us to avoid writing a new service, but I don't expect said service to be that different from the existing daemons, which are already services with uptime of their own.
Speaking of the quota, I think we're artificially limiting ourselves by using EventLogging, which relies on the URL alone to transmit data. sendBeacon supports a payload in addition to the URL. From what I can gather, Chrome is the only browser that limits its size, at 64 KB, and that quota seems to be shared by the different sendBeacon calls for a given pageload. That's still plenty of headroom compared to the current limit.
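Roughly what that would look like on the client (TypeScript; the endpoint and schema are hypothetical):

```typescript
// Sketch: sendBeacon takes a body in addition to the URL; in Chrome the
// 64 KB quota applies to in-flight beacon payloads. Endpoint and schema
// are hypothetical.
const payload = JSON.stringify({ schema: 'NavigationTiming', event: { /* ... */ } });
const queued = navigator.sendBeacon('https://intake.example.org/beacon', payload);
if (!queued) {
  // false when the payload couldn't be queued (e.g. quota exceeded)
}
```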
Tue, Nov 7
@RobH is digging up a spare server with an SSD we can test this on, to verify the theory that SSDs (as opposed to spinning disks) are what's giving AWS its edge at the moment.
The coarse granularity is really unfortunate (but I get that it's a limitation of EventLogging). The results are very interesting as-is, but the answer to whether or not we can afford some extra early loading might be different with ms precision.
Assuming that the figures you have were measured at 60fps, you seem to have reached the maximum granularity possible (2 frames, i.e. 33.33ms), which is incredible. In terms of stability it can't get better than that, short of increasing the recording's fps. As a result, I don't think it's worth comparing alternatives again in the hope of better stability, unless we have other reasons to use them. Firefox support would definitely be a valid one.
It's true that there's no convenient way to tie an error returned by thumbor to its corresponding entry in logstash. I'll file a task about that.
Mon, Nov 6
If what @hoo needs is only a subset of what we have access to, why not create a new group for that?
Right, I think that the default orientation isn't the de facto standard seen elsewhere. Fixing that first should be a priority. Is there a task for it?
Yes, it was to avoid "random" network requests happening. Our test is over, you can resume taking osmium behind the barn and decommissioning it.
This kind of processing is better done async because it's quite time-consuming. Ideally you would issue a non-smart thumb on first request and serve the smart-cropped one once it's ready. But this is something our thumbnail caching infrastructure doesn't currently support. All our thumbnails are rendered on the fly "as fast as possible". In that limited paradigm, something like smart cropping would slow down initial thumbnail requests considerably, meaning that clients might end up waiting seconds for a smart-cropped thumbnail where the non-smart one would have taken milliseconds. I don't think this is a smart tradeoff.
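To make the missing piece concrete, here's the flow our infrastructure would need to support (purely hypothetical sketch; the dict and list stand in for swift and a job queue, and all names are illustrative):

```python
# Hypothetical "plain thumb now, smart crop later" flow.
CACHE = {}
JOBS = []

def render_plain_thumb(path, width):
    return b'plain-thumb-bytes'  # milliseconds: a cheap center crop/resize

def get_thumbnail(path, width):
    key = (path, width, 'smart')
    if key in CACHE:
        return CACHE[key]        # smart crop already rendered: serve it
    JOBS.append(key)             # enqueue the expensive smart crop async
    return render_plain_thumb(path, width)  # never block the request on it
```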
Fri, Oct 27
Right, under the hood it's based on rsvg-convert (the command-line utility for librsvg). It uses the language defined by the LANG environment variable (the only method I know of to make rsvg-convert handle multilingual SVGs):
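For instance, an equivalent standalone invocation (sketch; the locale and file names are placeholders):

```python
# rsvg-convert picks the SVG <switch>/systemLanguage variant matching the
# LANG it finds in its environment.
import os
import subprocess

env = dict(os.environ, LANG='ru_RU.UTF-8')
subprocess.run(
    ['rsvg-convert', '-w', '512', '-o', 'thumb.png', 'multilingual.svg'],
    env=env,
    check=True,
)
```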
Thu, Oct 26
No, they're spinning disks.
Wed, Oct 25
Didn't fix it...
osmium is now set up with webpagereplay like the WMCS machine and sending data to https://grafana-labs-admin.wikimedia.org/dashboard/db/webperf by way of production graphite
Tue, Oct 24
Quite possible, @fgiunchedi can check. He already noticed today that some objects on codfw hadn't been cleaned up when their eqiad counterparts had. It's possible that the cleanup job failed silently for some objects.
Stability got considerably worse on all metrics, except LastVisualChange which got better. But then LastVisualChange got a spike for exactly an hour (banner?). Anyway, it's kind of surprising that the throttling is making things less stable. Tomorrow I'll try to set up webpagereplay on osmium.
Yeah the shape smoothing is introducing significant artifacts on the phone example and I don't think it's a good idea for encyclopedic purposes.
The impossibility of sudoing inside the cron worked to my advantage, as browsertime can't unset the throttle there. But what I've described might be worth looking into: set the throttling with the standalone throttle command, run browsertime without connectivity settings, and unset the tc throttle after the runs.
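I.e. something like this as the cron entry point (the throttle flags are from memory; double-check against throttle --help):

```python
# Proposed flow: throttle once outside browsertime, run browsertime with
# no connectivity options so it never touches tc, then always unset the
# throttle afterwards.
import subprocess

subprocess.run(['throttle', '--up', '1000', '--down', '5000', '--rtt', '100'],
               check=True)
try:
    subprocess.run(['browsertime', '-n', '5',
                    'https://en.wikipedia.org/wiki/Barack_Obama'],
                   check=True)
finally:
    subprocess.run(['throttle', '--stop'], check=True)  # tear down tc rules
```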
I don't know what's happening, but it seems like the global throttle doesn't stick. Is it possible that browsertime resets the connectivity unconditionally, even if the connectivity parameter wasn't used?
Meh, can't get it to work, I'll just set the throttling globally.
Ah, it's because it's running inside a cron. I'll try to enable sudo for the cron job.
[2017-10-24 07:50:02] Changing network interfaces needs sudo rights.
@Krinkle assigned to you for the performance mark work
upload.beta.wmflabs.org refuses SSL connections right now; I see that it's not on that list.