I think that if we figured out the cause, we might be able to make things even more stable than our best current setup. But I agree that we've already spent a lot of resources on this issue, and we have a working setup with no practical issues, since there's no PII involved. I don't want to keep investigating this beyond the current quarter.
Mon, Dec 11
Here's the catch: all the work being done is local to the machine. We record HTTP requests against the live internet, but then we replay them entirely locally when the measurements are made. A lot of software is involved (browser, ffmpeg, web servers), and you'd expect a bare-metal machine with bigger specs than a c4.large to be more consistent than AWS, but no. I really have no idea what might be different. Linux kernel options? A more recent generation of CPU? With so many processes involved it's really a black box; we don't know what makes runs so much more consistent on AWS.
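To illustrate the record/replay split, here's a minimal sketch, with mitmdump standing in for the replay proxy (the file name is made up):

```python
import subprocess

# Record phase: drive the browser through the proxy against the live
# site, saving all HTTP traffic to a local file.
record = subprocess.Popen(["mitmdump", "-w", "recorded.flows"])
# ... run the browser against the pages under test, then stop it ...
record.terminate()

# Replay phase: serve the saved responses entirely from disk, so no
# request leaves the machine while measurements are taken.
replay = subprocess.Popen(["mitmdump", "--server-replay", "recorded.flows"])
```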
We ran this on spare bare-metal servers Ops sourced for us (one without SSDs, one with an SSD) and it still didn't give us test stability anywhere near what we get on AWS.
I don't expect any other work, no. This task really is only about adding SSL termination to Thumbor, so that MediaWiki can talk to it directly and securely, since in the private wiki/thumb.php scenario we're not going through the Swift proxy.
@Peter running the following on a new VM just now:
Fri, Dec 8
All 3 ways of running Docker containers on Google Cloud are now set up and reporting to: https://grafana.wikimedia.org/dashboard/db/webpagereplay?refresh=15m&orgId=1&from=now-7d&to=now
I've added a Google Compute VM of similar cost to a c4.large, running the tests on BarackObama with 7 runs, for desktop and mobile: https://grafana.wikimedia.org/dashboard/db/webpagereplay?refresh=15m&orgId=1&from=now-7d&to=now
Wed, Dec 6
@fgiunchedi I'd like to add private wiki support for Thumbor as a Q3 goal. Will you have the bandwidth to handle this task next quarter?
Right, in case that wasn't clear: this happens easily anyway; slowing down the connection just makes it happen consistently.
Mon, Dec 4
Also, we could try Google's cloud, I believe we have free credit there or something.
@Peter did you always run the AWS tests inside Docker? That's the one thing we haven't tried on Cloud VPS and bare metal. If you did, and stability was the same before/after Docker, then I don't think it's worth looking into, though.
Sure, I need to write a more concrete proof-of-concept before we can consider it
Fri, Dec 1
We can lower the threshold of the slow log at some point, and then you won't need to hit such extreme cases for them to show up. But yes, for now only requests taking 60-120 seconds will show up in there, with some information about what happened.
Thu, Nov 30
OK, we have the explanation as to why the 2+ minute request didn't show up in the Varnish slow log. By default the VSL transaction timeout in Varnish is 120 seconds, and the varnishncsa command the slow log is currently based on doesn't allow overriding that value. This means that if a Varnish transaction takes more than 120 seconds, it won't be recorded in the slow log at all. In essence, the Varnish slow log only records requests taking longer than 60 seconds but less than 120.
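For reference, this is roughly the shape of the invocation involved (a sketch; the exact flags and threshold used in production may differ):

```python
import subprocess

# Log transactions whose total time spent in Varnish exceeds 60s.
# In Timestamp:Resp, field 2 is seconds since the start of the
# transaction (fields are 1-indexed; field 1 is absolute time).
# Transactions exceeding the 120s VSL transaction timeout are
# dropped before the query ever sees them, hence the 60-120s window.
subprocess.run(["varnishncsa", "-q", "Timestamp:Resp[2] > 60.0"])
```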
Wed, Nov 29
I think the bottleneck here is the writes to Swift; the DB reads would be negligible DB load, IMHO.
It usually takes days, yes. Would leaving a message in the SAL be enough warning?
IMHO manual access is necessary in case it doesn't work as expected, etc. It's always convenient to be able to eval things as prod MediaWiki and so on when working on this sort of thing. And yes, the maintenance servers group is fine.
We only want this to run once. I've never felt the need to automate this sort of job, starting a screen and running the command is very simple.
I think it'd be nice to do the data collection before the deployment freeze, which then gives us time to analyze the results and discuss next steps for the next quarter?
AWS just released 2 products on opposite sides of the spectrum that would be worth trying out for this:
With @Cparle working on the multimedia file backend, the need to run maintenance scripts is kind of inevitable for that sort of task. His work in that area increases the currently low human redundancy we have in people who know how to take care of these issues in production. He has shown himself capable of fully understanding the underlying code and stack, and I trust him to perform data recovery in production.
Indeed, the namespace is necessary. I don't think we should be more permissive than browsers.
They might have rolled it out geographically, starting with some high latency locations.
Tue, Nov 28
Going back to the context of this task: the request that took more than a minute in the HAR should have been caught by the Varnish slow log, but wasn't. This suggests that the problem was somewhere between Varnish and the user. Nginx terminates SSL in front of Varnish, so request queueing in Nginx is also a possibility.
Slight correction: the fields start at 1, so Timestamp:Resp is correct for total time spent in Varnish. We could log more, though.
I've found the explanation here.
@ABorbaWMF anyone can upload files to Beta. You can just download the original of that file from Commons and upload it via https://commons.wikimedia.beta.wmflabs.org/wiki/Special:UploadWizard
Remember to copy/paste the licensing information found on Commons (license and list of authors).
Mon, Nov 27
I've looked for that particular request from the HAR file in the varnish logstash data you've linked to @ema and couldn't find it.
I think that's a non-issue, clearing the cache every time isn't "normal behavior" either. Saying that this particular feature needs to be on for things to be realistic is just cherry-picking.
It is indeed startling. Does this happen to you in all browsers? You mention that you can trigger this easily; it would be interesting if you could send us more HAR files, so we can see whether there's any pattern to the requests that take a long time.
Is the task description up to date with regard to that bugfix? I.e., are the regressions you initially saw on FF57 still present?
See my reply above
@Samat sorry I gave you the wrong address, the correct one is: email@example.com
Have you reported this upstream?
@Samat can you email the HAR file to firstname.lastname@example.org ?
It looks like the times are consistent now. The date is still localized to French in those contexts, but I can live with that.
Is it possible that in your tests, the rumSpeedIndex is computed before the banner has appeared?
All potentially multipage documents on all wikis were processed: PDFs, DJVUs, TIFFs. The migration was run one file type at a time, using the media_type, major_mime and minor_mime parameters in refreshFileHeaders for efficient underlying SQL queries.
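Roughly the shape of the per-type invocations (a sketch only: the mwscript wrapper, wiki name and exact MIME values here are illustrative, not the exact commands used, and the real run iterated over every wiki):

```python
import subprocess

# One maintenance run per file type, narrowing the underlying SQL
# query by MIME type. Names and values below are illustrative.
FILE_TYPES = [
    ("application", "pdf"),   # PDFs
    ("image", "vnd.djvu"),    # DjVu files
    ("image", "tiff"),        # TIFFs
]

for major, minor in FILE_TYPES:
    subprocess.run([
        "mwscript", "refreshFileHeaders.php", "--wiki=commonswiki",
        f"--major_mime={major}", f"--minor_mime={minor}",
    ], check=True)
```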
Tue, Nov 21
Mon, Nov 20
I've tried adding one for when WPT upgraded to FF57, and I already found a bug/limitation in the way we set up some of our dashboards: https://github.com/grafana/grafana/issues/9822
prepared-edit indicates that this is part of edit stashing, which is preemptive processing of an edit a user is making before they actually hit save. If it fails, the worst that happens is that the edit processing has to be redone from scratch when the user hits "save".
Fri, Nov 17
How about "synthetic" instead of webpagetest? Since we're about to introduce a new kind of synthetic test
It's possible that to do this right we'll need multiple machines. The bottleneck is the time it takes to complete the slowed-down requests. Maybe we could try an instance cheaper than c4.large and see if we get the same stability? We haven't tried that, i.e. finding the cheapest kind of AWS instance that can still support this level of stability.
Thu, Nov 16
Fair enough, I didn't have the case of privilege escalation in mind.
I don't see the big gap on synthetic testing anymore. In RUM, we don't have enough FF57 data yet, but what's already there suggests that FF57 is on par with Chrome62 for the median. Can't compare percentiles with so little data.
What little gatekeeping there would be in Thumbor is that it would only access private containers if the request comes through the thumb.php endpoint (which is different from Varnish's). No key required; the different URL handler is enough.
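As a sketch of what I mean, written as a generic tornado-style handler (Thumbor is tornado-based; the names and paths are made up):

```python
from tornado import web

PRIVATE_CONTAINERS = {"private-wiki-originals"}  # made-up name

class ThumbHandler(web.RequestHandler):
    # Illustrative only: private containers are served solely via the
    # thumb.php-facing endpoint, which Varnish traffic never hits.
    def get(self, container: str, path: str):
        from_thumb_php = self.request.path.startswith("/thumb-php/")
        if container in PRIVATE_CONTAINERS and not from_thumb_php:
            raise web.HTTPError(403)  # no shared key; routing is the gate
        # ... fetch the original from Swift and generate the thumbnail ...
```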
Hmmm, the more I think about the implementation, the more the shared key seems like make-believe security.
@fgiunchedi the shared secret key sounds like the simplest thing to do. Do you think we should have a separate thumbor swift user for private wikis? Or just grant it r/w access to those containers if it doesn't have it already?
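If we do create a separate user, granting it access would just be a container ACL update. A sketch with python-swiftclient (the auth URL, account/user and container names are all made up):

```python
from swiftclient.client import Connection

# Made-up credentials and names; the real values live in config.
conn = Connection(authurl="http://swift.example:8080/auth/v1.0",
                  user="mw:thumbor-private", key="SECRET")

# Grant the thumbor user read/write on a private wiki's container
# via Swift container ACLs.
conn.post_container("wikipedia-xx-private-local-public", headers={
    "X-Container-Read": "mw:thumbor-private",
    "X-Container-Write": "mw:thumbor-private",
})
```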
On FF57 it seems like Firefox beats Chrome for StartRender on the Sweden article (10k DOM elements), is slightly slower for the Facebook article (12k DOM elements) and noticeably slower for the Barack Obama article (16k DOM elements).
Purged that file; the issue is still the same.
Wed, Nov 15
Is it possible to avoid libcurl adding that header automatically? Presumably if that header wasn't set in the update call, it wouldn't be touched by Swift.
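For what it's worth, libcurl omits one of its own default headers if you pass that header with an empty value. A pycurl sketch; since the header in question isn't named above, Expect is only a stand-in, and the URL is made up:

```python
import pycurl

c = pycurl.Curl()
c.setopt(pycurl.URL, "http://swift.example:8080/v1/AUTH_test/cont/obj")
c.setopt(pycurl.CUSTOMREQUEST, "POST")
# Passing "Name:" with nothing after the colon tells libcurl to drop
# a header it would otherwise add by itself. "Expect" here is a
# stand-in for whichever header is being added automatically.
c.setopt(pycurl.HTTPHEADER, ["Expect:"])
c.perform()
c.close()
```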
Tue, Nov 14
Indeed, I'm seeing the same thing, with a lot of images getting stuck before being served, but only in Firefox. Chrome is fine with mitmproxy. This suggests the bad behavior comes from Firefox or its Selenium driver. Or maybe that's just how Firefox behaves when throttled this way?
I see that I'm getting a couple, but not the same ones:
Tested all the instructions and fixed a little thing, it's all good.
It's not in Docker form; you can go ahead and make that, it should be very simple. Make sure that you use the latest mitmproxy: what comes with Debian and Ubuntu is very outdated (no HTTP/2 support, etc.).
Running mitmproxy + chrome and firefox every 20 minutes on wmcs, added to https://grafana.wikimedia.org/dashboard/db/webpagereplay?refresh=15m&orgId=1
It works! First working replayed runs with Firefox: