Mon, Aug 21
No qualms on the cache end of things!
Heh yeah I guess you're right. Still, I added it to the current page, and we seem to have picked up some new translations over the weekend. I can pull the link back out of there on the next update if that makes more sense.
I see we have a few new translations up today, I'll incorporate them shortly! :)
You can see a view of cache_upload's 2xx responses (and everything else) here: https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?panelId=2&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=2&from=now-30d&to=now .
Sat, Aug 19
Both IE7 and IE8 for XP are what's being cut off in this transition, with IE8 being the newest IE that's even available for XP, AFAIK. However, we're not doing this with the express intent of deprecating older browser tech; it's just the natural fallout of raising our minimum security level for network connections. There are no further firm plans yet on Operations' end of things regarding deprecating specific browser versions, but we will most likely run through similar cipher/protocol deprecations in the future, which may take out older browsers along the way like this one did.
Thu, Aug 17
After a couple of other minor nit fixes, I'm going to push the above as it stands. We can iterate further as necessary; at least it's an improvement on the original!
The patch above contains the same changes as a real changeset (it's just hard to review them that way; it's simpler to review manually on https://pinkunicorn.wikimedia.org/test-sec-warning ).
Thanks! Updated for all the above as best I can (I'm not 100% sure on the language-name text prefix for Arabic and Chinese, but took a good stab from http://mediaglyphs.org/mg/?p=langnames ), I guess someone that knows better can recommend a further fixup?
Update: noticed I had en-US firefox links in all of the translations. Updated them all now.
Testing updated HTML with some translations and a translate link (and other minor cleanups) at https://pinkunicorn.wikimedia.org/test-sec-warning . Will push something like this to the real one at https://en.wikipedia.org/test-sec-warning before upping percentage. Thoughts? Further tweaks? Mistakes? :)
Hopefully in the former case, they'll complain to their IT department and they'll fix it, and hopefully in the latter they'll blindly trust our Firefox links and find their way out of this mess from there :)
It's a very valid question :)
Update: Today is the start date for going to 5%. Before we pull that trigger sometime later (perhaps much later) today, I'm working on a few other things:
14:32 bblack@neodymium: conftool action : set/pooled=yes; selector: name=cp3036.*
Mon, Aug 14
I think there's still some work here to do, if nothing else to audit the situation as it stands. There's basically two things to sort out for all of the varnish-logging bits and pieces:
- Have we killed the hard dependency on Varnish being online? (Can we start the logger first and have it connect/reconnect as Varnish goes up and down?)
- Have we re-ordered the systemd level dependencies to ensure we're not losing log events? (Can we make Varnish services dependent on the loggers being ready to receive events?)
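For the second point, a systemd drop-in along these lines could express the ordering (this is a hypothetical sketch; the unit names are illustrative, not the actual service names in our puppet config):

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/varnish.service.d/logging.conf:
# make varnish start only after the logger is ready to receive events,
# so early log events aren't lost across restarts.
[Unit]
After=varnish-logger.service
Wants=varnish-logger.service
```

This only helps once the first point is also solved, since the logger has to be able to survive (and reconnect across) varnish going up and down.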
Re-evaluating alternatives here, hold on actual implementation for now.
Tue, Aug 8
Mon, Aug 7
It's just per-IP. So yes, that sounds fine: if you're peaking at 80/s total, then let's put an upper sanity bound at 100/s misses for now. Any preference on a deploy time, so that whoever needs to check for fallout can be around?
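For illustration, a minimal per-IP token-bucket sketch in Python of the kind of cap being discussed (the real limiter lives in the varnish layer; the class and the 100/s figure here are just this thread's example, not our actual config):

```python
import time
from collections import defaultdict


class PerIPRateLimiter:
    """Illustrative per-IP token bucket: each IP gets its own bucket
    that refills at `rate` tokens/sec up to `burst` capacity."""

    def __init__(self, rate=100.0, burst=100.0):
        self.rate = rate    # tokens refilled per second
        self.burst = burst  # bucket capacity
        # bucket state per IP: (tokens remaining, last refill timestamp)
        self.buckets = defaultdict(lambda: (burst, time.monotonic()))

    def allow(self, ip):
        tokens, last = self.buckets[ip]
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at burst
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[ip] = (tokens - 1.0, now)
            return True
        self.buckets[ip] = (tokens, now)
        return False
```

Since it's keyed per-IP, one abusive client hitting the cap doesn't affect anyone else's budget.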
Thu, Aug 3
Even while FF 52 is still supported by Mozilla, it's unlikely that Mozilla's security efforts can actually prevent all possible exploits against the underlying WinXP.
Works for me. If you can paste back the text form of whatever you want here, I can get the page updated.
The current message text (which needs massaging and updating anyways) is visible explicitly at: https://en.wikipedia.org/test-sec-warning
And for those wanting to follow the changes in 3DES percentage of requests as we go: https://grafana.wikimedia.org/dashboard/db/tls-ciphers?panelId=11&fullscreen&orgId=1
Cross-ticket updates: There's a separate sub-ticket for the Communications side of this change at T163251, and a timeline has been laid out there in T163251#3478043 . The TL;DR of the timeline is we'll ramp from 5% to ~29% blocked over the period of Aug 17 -> Oct 12, then 100% blocked on Oct 17, then protocol-disabled on Nov 17.
Tue, Aug 1
Excellent news! I'll try to squeeze in replacing one of the clusters ASAP, which will decom another 6x of the old cp to let us move further.
Mon, Jul 31
It seems reasonable to relax the regex in question a bit (to allow additional parameters).
It's also an interesting thought to consider progressively scaling the weight. For example, you could make the strategy configurable such that the first failure sets weight=configured_weight*0.5, the next weight=0, and the next deletes. However, the way that weighting is handled in sh for the public services is not ideal (excess churn due to lack of true chashing), so it's probably best to avoid staging through smaller shifts of weight until some future time when we've got a proper chashing ipvs scheduler.
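A tiny Python sketch of the staged strategy described above, purely to make the idea concrete (the function name and return convention are hypothetical, not any real pybal/conftool API):

```python
def next_action(failure_count, configured_weight):
    """Hypothetical staged depool policy: 1st failure halves the
    weight, 2nd sets it to zero, further failures delete the entry."""
    if failure_count == 1:
        return ("set_weight", configured_weight * 0.5)
    elif failure_count == 2:
        return ("set_weight", 0)
    else:
        return ("delete", None)
```

As noted above, though, intermediate weight shifts cause excess churn under the current sh scheduler, so in practice we'd skip straight to the weight=0/delete stages until a proper chashing ipvs scheduler exists.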
Hmm I wrote that backwards above. The OCSP file-freshness checks look at age-of-mtime, not the timestamp within. In any case, we can still move them to crit=~3d and warn=~2d.
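As a concrete sketch of the check as described (age-of-mtime against the proposed ~2d/~3d thresholds; this is illustrative Python, not the actual icinga plugin):

```python
import os
import time

WARN_AGE = 2 * 86400  # ~2d, proposed warn threshold
CRIT_AGE = 3 * 86400  # ~3d, proposed crit threshold


def ocsp_freshness_state(path, now=None):
    """Return an icinga-style state based on the age of the OCSP
    staple file's mtime (not any timestamp inside the file)."""
    now = time.time() if now is None else now
    age = now - os.path.getmtime(path)
    if age >= CRIT_AGE:
        return "CRITICAL"
    if age >= WARN_AGE:
        return "WARNING"
    return "OK"
```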
Ran it again and it's ok now.
+1. There are a number of tricky things here to get to these simple goals, though, and since the sysctls affect all services, we have to have the TCP cases in mind as well:
Thu, Jul 27
Yeah, that's not a bad idea. Perhaps we should morph this into a stretch-for-LVS ticket, and start with the always-almost-ready-to-use lvs1007-12? :)
@Johan Yeah I've been OoO and catching up slowly too. We also have Wikimania coming up on the horizon of course. Want to shoot for a start date the Thursday after Wikimania, Aug 17th, 3 weeks out from today?
Wed, Jul 26
So, things fell over again with a ton of puppetfail spam. As a stopgap, I've done the following:
So to recap a small part of IRC discussion today in the wake of issues with rebooting hydrogen, I think our short-term improvement plan looks like this:
Tue, Jul 25
It was decommed a long time ago, and then I revived it as a quasi-production testing machine for "temporary" use for a little while, and probably poorly documented that, and now "temporary" has stretched on a really, really long time. cp1008 is the correct machine.
As @Aklapper said, there's a broad range of issues embedded in this. In the example given in the title, this is what actually happens and which actors are responsible for the redirects:
Mon, Jul 24
Add T171318 to the list too. There's doubtless a long tail of issues we'll never fully realize that would be helped by work here. Part of the reason this ticket's still idling so long is that it doesn't offer any simple path forward, just problems and problematic solutions. So let's step through things here:
Jul 21 2017
Jul 14 2017
Digging a little deeper, Shopify open-sources a lot of their infrastructure code. It seems likely that they already support the appropriate attributes at least in the lower levels of their stack (who knows in the user interface), as the specific options exist in their modified clone of Rails: https://github.com/Shopify/rails-mirror/blob/master/actionpack/lib/action_dispatch/middleware/ssl.rb#L20
To be clear then: this ticket is about our (WMF's) hosting of wikipedia.cz not having a valid SSL cert, and maybe touches on broader issues of ownership and delegation rules.
It seems like Shopify has been making some improvements on this front since we last checked.
@Johan - Do you have any kind of estimate on Community's inputs here and time needed before a start date? Can we set a tentative one and begin editing the various copy?
Jul 13 2017
We could perhaps go after this in the most-general sense. In our common VCL (across all clusters), if we get an error from a backend (4xx/5xx) which has no body content, we can always turn that into a synth() response, and probably add missing reason text in most cases as well (varnish does that by default when you re-set the status code). This way any app that sends errors with empty bodies gets converted to the standardized error templates that Varnish already has.
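The real implementation would be VCL, but the logic sketched above looks roughly like this in Python (function name and page template are hypothetical, just to show the flow):

```python
from http import HTTPStatus


def normalize_error(status, body, reason=None):
    """If a backend error response arrives with an empty body, replace
    it with a synthesized standard error page, filling in a missing
    reason phrase from the status code (analogous to what varnish does
    when you re-set the status code)."""
    if status >= 400 and not body:
        reason = reason or HTTPStatus(status).phrase
        body = "<html><body><h1>%d %s</h1></body></html>" % (status, reason)
    return status, reason, body
```

Apps that already send their own error bodies pass through untouched; only the empty-body case gets the standardized template.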
So, error code 429 is Too Many Requests, generally used by ratelimiters. In this case, it seems that thumbor (our internal service that renders thumbnails of images) issues a 429 because the SVG is failing to render (it might be invalid SVG in some sense, or at least making our SVG parsing tools explode). Making this more user-friendly is discussed in T169683.
I think the main point here is we'd rather have a reproducible method for optimizing these images which works on our Linux and open-source based infrastructure. Having a third party optimize one of our many PNGs once manually is interesting, but this doesn't scale to the many other PNGs which may be spread around many other repos and sources, and more importantly the work will be lost the next time someone uploads updated PNG content (e.g. visual re-designs or tweaks for new display types).
Jul 11 2017
So with these changes and cleanups in the past few weeks, we're basically down to two outstanding issues here from the original context:
Resolving this and moving the last remaining ticket up the tree as a direct child of the tracker. There's no point having a sub-category for one thing.
Ah, I missed the part above where it stated that it expires in a week or two. In that case, there's little point in revoking this particular certificate.
I think in this case we should revoke unless the expiry is already very close (it might be!). This is a private key that is out of our control, and I honestly don't even understand all the machinations of the change of vendors involved here. It was one thing to trust them to represent one of our hostnames in a TLS public key when they were an active vendor with a contractual relationship; it's another to trust that that key is still secure in the aftermath of that relationship ending.
This hole was removed today in https://gerrit.wikimedia.org/r/#/c/364252 , so this is resolved assuming we don't revert (unlikely!). \o/
Jul 10 2017
benefactors - It wasn't part of the original task here; we've just been questioning whether it's also being removed at the same time, since it seems related.
+1 on using a similar rate to the APIs on text. I wonder what the peak (ab?)users' rates on upload.wikimedia.org look like as well, and whether one shared ratelimit for both might make sense.
Sorry, we've been wanting to make forward progress on this for several months now, but it keeps falling to the bottom of our priority queue. I'll pick up here from the related email thread as well and try to cover the basics:
All of these hostnames are still in DNS AFAICS: