One should bear in mind that passing a request to the next server is only possible if nothing has been sent to a client yet. That is, if an error or timeout occurs in the middle of transferring a response, fixing this is impossible.
Ah, there is a default for that directive:
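Per the nginx docs, the documented default is:

    proxy_next_upstream error timeout;

i.e. retry on connection errors and timeouts, but not on HTTP error statuses like 504 unless those are listed explicitly.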
So, in theory our nginx config retries on the next upstream:
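Roughly something like this, if I remember right (a sketch; the server list and ports here are made up, not our actual config):

    upstream thumbor {
        server 127.0.0.1:8800;
        server 127.0.0.1:8801;
        # ... more Thumbor instances
    }

    server {
        location / {
            # with the default proxy_next_upstream error timeout;
            # a timed-out request should be retried on the next instance
            proxy_pass http://thumbor;
        }
    }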
To verify my theory, I would have to be able to log requests to Thumbor when they come in. Since Thumbor is single-threaded, I doubt that it's capable of such logging itself. By the time Thumbor gets to pick up the request, it's probably been waiting for some time in Tornado. I wonder if nginx can add a header to its request before passing it to the upstream? That would be a foolproof way to take into account any queueing delay experienced at the Thumbor level.
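It looks like proxy_set_header combined with nginx's built-in $msec variable (unix time with millisecond resolution) would do it; the header name here is just a hypothetical:

    # stamp the time at which nginx handed the request to the upstream
    proxy_set_header X-Request-Start "t=${msec}";

Thumbor could then compare that timestamp to the time it actually starts processing, which would give us the time spent queued in Tornado.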
Looking at the nginx logs on thumbor1001, I notice that some of the timeouts are for files that don't exist. Example: https://upload.wikimedia.org/wikipedia/en/thumb/5/5a/Premier_League.svg/125px-Premier_League.svg.png
Things don't seem to have improved :(
Closing this as I think we're done with these; only the alert emailing remains, which is tracked on the Nagios task.
Looking at https://grafana.wikimedia.org/dashboard/db/webpagetest-alerts?refresh=5m&panelId=49&fullscreen&orgId=1&from=now-7d&to=now we can indeed see that the dashboard was alerting at the time the puppet change to forward alerts was merged and that it recovered exactly when the recovery message made it to IRC. So, it works! Left a comment on the changeset about the mailing list not working: https://gerrit.wikimedia.org/r/#/c/342431/
This is set up now. The IRC component works on #wikimedia-perf-bots:
And adding it to the config before the custom formatter with its fallback is in place causes exceptions.
Yeah, that's added through config.
If you abandon a task long enough... ;)
Fri, Mar 24
Accidental UI popups are a major point of frustration for some.
Thu, Mar 23
Actually, nginx was already at 90s, but I think we can double that and see the effect, because 90s might not be enough as-is to accommodate the worst-case scenarios.
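Concretely, that would mean something like this in the proxy config (which exact timeout directives we set there is an assumption on my part):

    # double the previous 90s value
    proxy_read_timeout 180s;
    proxy_send_timeout 180s;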
It seems like the rate of MediaWiki responding with a 200 while Thumbor 504s has about halved:
Ran the script above for 10 minutes and compared it to a baseline of organic traffic.
We figured out what's filtered and what could be; I think that "going the extra mile", if it's possible, should be a separate task.
Chrome 57 was released earlier this month: https://developers.google.com/web/updates/2017/03/background_tabs
Tue, Mar 21
In the tech-mgmt meeting you mentioned this was underway; is there another phab task for it?
Wrote this dumb little thing: P5097 which seems to do the job.
Debian package ready at https://github.com/gi11es/thumbor-debian/tree/master/python-logstash @fgiunchedi please review and add to our production apt repos when you can.
Looking at the API response time graph, I would advise splitting the p95 out into a separate graph. Putting the median and the p95 on the same scale means that you have to mouse over to get a sense of whether the median is moving, because it's so small that it's hidden by the p95. Similarly, the median seems to have high variance. I understand that you're going to increase the sampling rate; I think you should also use a moving average, which makes for a more readable graph when there's quite a bit of variance in the measure. Also, something we do in places is graph the number of samples a given data point is based on.
I don't think there's ever a case where it's useful for us to serve 4:4:4 JPEGs. Even if Guetzli makes them "less expensive", at equal quality in my small test above the result is still 9% bigger, with absolutely no objectively higher quality than its 4:2:0 equivalent. And that's at quality 95 in Guetzli on the reference image they provide, which is their advertised target. So even for what it intends to do, it falls short compared to 4:2:0 images on objectively measured visual differences (which include color).
Mon, Mar 20
Good point. So you're talking about taking those style statements that are in the head solely because of our no-JS support, and that wouldn't need to be in the head if it weren't for that, and bringing them inline inside the page?
We've talked about doing a later, more expensive pass to optimize images further for other file formats (PNG, I believe), but it's a project of its own. And I think we have yet to stumble onto gains big enough, applicable to a large enough chunk of our content, for that work to get picked up as a project.
Possibly an upstream bug, then, that the auth level is so high for that feature?
Also, from their GitHub page:
I appear to be able to delete one I've just created. @Peter which way are you trying to delete the alert and failing to do so? Can you take screenshots?
I'm initially skeptical of these claims. MozJPEG had similar ones that turned out to be false when put to the test with our compression parameters.
I agree with @Krinkle's position. You took my recommendation, which was about TemplateStyles, out of context. We never said style embedding should be a solution for everything. User-generated styles and extension styles are different: I expect extension styles to be able to modify anything on the page, including the chrome, while template styles should only affect the template contents. This is the main reason why the technical compromise made for TemplateStyles can't be generalized to both situations; their scope is fundamentally different.

I can't speak for each extension listed because I don't know them all, but you're trying to solve something that isn't broken. Extensions should already include via addModuleStyles only things that *have to* be in the head to avoid FOUC; otherwise they should use addModule. There's no need to change that, and the benefit of cross-page caching from using RL is high for extensions, which tend to apply (and need) their styles on whole namespaces or even all pages, whereas TemplateStyles will create a much more scattered usage pattern of unknown distribution.
Fri, Mar 17
I can't find where that's defined in Puppet.
You can create alerts but not delete them?
Yeah, that sounds good to me
Thu, Mar 16
Aggregate counts aren't problematic, but the data we store is. If it's recorded, it can be compromised. I know we have retention policies, etc., but they're of no use if someone gets access to our data. 60 days is a lot.
Added that snippet to ~/.vagrant.d/Vagrantfile and it worked, thanks!
Fresh VM without roles, then:
This seems to be back. Repro steps using the Varnish role still "work".