IIRC, there's a bunch of crazy config supporting it on the same Rackspace server that hosts wikitech-static. Some hacky stuff I threw together to proxy from WatchMouse because they didn't support HTTPS. So, we should probably ssh over there and kill that stuff, too, after the patches above are done.
I guess me!
Thanks! Since that's also the IP they use for policy.wikimedia.org, we can at least have some confidence in the basics of the TLS config, from our auditing on that other hostname.
I should mention a few other technical issues that have been crossing my mind as we try to wade through this:
Tue, Jul 17
Human status update, amidst the flurry of gerritbot notes:
So, trying to sum up a bit of the process on the above as I see it:
I don't think redirects are really an acceptable solution for the policy pages, as we'd still be sending users through a third party under different policies in order to read the policies for the site they're using... The links need to actually be changed on the production wikis (e.g. to foundation.wikimedia.org). This should happen in time for cached copies of the links to expire out (a week would be ideal), which implies the foundation.wikimedia.org site needs to be working and canonical for itself by the end of this week.
The key thing we're missing here on our end, by the actual transition date, is an IP address from Automattic to put in our DNS for this domain.
It's pretty unclear to me (perhaps I'm failing at reading!) exactly what needs to happen here at various levels of detail. What's clear is "move the foundation wiki from wikimediafoundation.org to foundation.wikimedia.org", but what's unclear even in the abstract is what date that should take effect (ASAP?) and whether there's intended to be any transition period (redirects?).
Mon, Jul 16
Certs: yes, they should use LetsEncrypt, which we'll authorize via CAA records in our DNS.
HSTS: yes, with a 1-year lifetime and preloading enabled. This and other HTTPS policy details are covered (at least to basic minimums) here, if you want to point Automattic at it: https://wikitech.wikimedia.org/wiki/HTTPS#For_all_public-facing_HTTP[S]_sites_and_services_under_Wikimedia_control
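Both of those are mechanically checkable once things are live; a minimal combined sketch using dnspython and requests (the hostnames and expected values are illustrative placeholders, not necessarily our actual records):

import dns.resolver
import requests

HOST = "foundation.wikimedia.org"  # placeholder

# 1) CAA: only the intended CA should be authorized. (Hedged: assumes the
#    records are queryable at the zone apex we ask about; dnspython doesn't
#    walk up the DNS tree for us.)
caa = dns.resolver.resolve("wikimedia.org", "CAA")
assert any(r.tag == b"issue" and b"letsencrypt.org" in r.value for r in caa)

# 2) HSTS: >= 1-year max-age plus preload (and includeSubDomains, which the
#    preload list requires anyway).
hsts = requests.head("https://" + HOST + "/").headers.get(
    "Strict-Transport-Security", "").lower()
max_age = int(hsts.split("max-age=")[1].split(";")[0]) if "max-age=" in hsts else 0
assert max_age >= 31536000 and "preload" in hsts and "includesubdomains" in hsts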
Turning priority to "high" for this and the 5001 ticket ( T199675 ), as between the two of them they leave the upload@eqsin at its design limit of 4 reliable nodes.
Turning priority to "high" for this and the 5006 ticket, as between the two of them they leave the upload@eqsin at its design limit of 4 reliable nodes.
Wed, Jul 11
It's a complicated topic I think, on our end. There are ways to make it work today, but when I try to write down generic steps any internal service could take to talk to any other (esp MW or RB), it bogs down in complications that are probably less than ideal in various language/platform contexts.
Tue, Jul 10
Was the "temporary" JS redirect a 301 perhaps?
wgCacheEpoch is probably about the parser cache, which is separate from Traffic's Varnish caching. Either one could be an issue here, or both.
+1, I don't think there's anything too sensitive in here, although I might have chosen different wording in places if this were a highly visible blog post or something :)
Mon, Jul 9
This raises some questions that are probably unrelated to the problem at hand, but might affect things indirectly:
- Why is an internal service (wdqs) querying a public endpoint? It should probably use private internal endpoints like appservers.svc or api.svc, but there may be arguments about desirability of [Varnish] caching. This is something we're grappling with in general in the longer-term (trying to understand and/or eliminate private internal service<->service traffic routing through the public edge unnecessarily).
- Why is it using webproxy to access it? It should be able to reach www.wikidata.org without any kind of proxy.
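On that second bullet, the usual pattern for internal callers is to hit the private endpoint directly and select the wiki via the Host header; a rough sketch (the internal service hostname here is an assumption for illustration, not a statement of the exact endpoint wdqs should use):

import requests

# Hedged sketch: query the internal API endpoint directly, no webproxy
# involved; the public wiki name goes in the Host header so MediaWiki
# routes the request correctly. Internal hostname is illustrative.
resp = requests.get(
    "http://api.svc.eqiad.wmnet/w/api.php",
    headers={"Host": "www.wikidata.org"},
    params={"action": "query", "meta": "siteinfo", "format": "json"},
    timeout=5,
)
print(resp.status_code)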
Thu, Jun 28
Going a bit beyond the explicit scope of this ticket, there are really a few different legacy-support risks we'd like to get rid of in our TLS configuration ASAP (but ASAP might be a year or more out in some cases!), so it helps to think about them all in concert. Roughly in expected order of removal, based on feasibility:
Jun 14 2018
We've got some overlapping timelines on these long-form posts :)
Jun 12 2018
Jun 11 2018
Jun 6 2018
This plan looks good, thanks! We should be able to do the decoms for cable re-use you're recommending as well. We might need to leave lvs2007 for last, after the others are brought online, to make it easier by first switching one of lvs200[1-3]'s work over to one of the new LVSes.
Jun 5 2018
We faced this issue last time and went with "Wikipedia" as well, because:
Yes, but on a shorter timescale. The 3DES one, in retrospect, dragged on a bit longer than it had to, and in this case (a) we're starting at a lower percentage of real end-user UAs affected, and (b) there's a timing correlation with the PCI-DSS cutoff disabling most of these same devices for a lot of commercial sites as well, for slightly-different reasons. We're looking at ~6 weeks of total process: 3 weeks ramping up the message percentage, 3 weeks of total blockage with the message, then disabling at the protocol level.
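For what it's worth, the percentage ramp is stateless on our side; roughly this kind of deterministic bucketing (a sketch only; the hashed field and bucket count are illustrative, not our actual edge logic):

import hashlib

def in_message_bucket(client_ip: str, rollout_pct: float) -> bool:
    # Hash a stable per-client field into 10000 buckets so the same client
    # consistently sees (or doesn't see) the deprecation message as the
    # percentage ramps from 0 to 100 over the three weeks.
    bucket = int(hashlib.sha256(client_ip.encode()).hexdigest(), 16) % 10000
    return bucket < rollout_pct * 100

# e.g. early in the ramp at 10%:
print(in_message_bucket("198.51.100.7", 10.0))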
Jun 4 2018
As for the rest, especially with the one-offs using LetsEncrypt scripting today, we definitely don't have this kind of resiliency, or any kind of deployment lead time built in. We should at least fix the deployment lead-time clock skew problem in the new solution somehow (delay deployment by N days after issuance, unless it's a brand-new cert with only a self-signed/nothing preceding it for replacement).
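The skew-delay check itself is cheap; a minimal sketch with the Python cryptography package (the N=3 threshold and the brand-new-cert escape hatch are placeholders for whatever we settle on):

from datetime import datetime, timedelta
from cryptography import x509

LEAD_TIME = timedelta(days=3)  # placeholder N

def ok_to_deploy(new_cert_pem: bytes, has_prior_cert: bool) -> bool:
    # Brand-new service (self-signed/nothing preceding it): deploy immediately.
    if not has_prior_cert:
        return True
    # Otherwise hold the renewed cert back until N days past its issue date,
    # so badly-skewed client clocks see it as valid by the time it's live.
    cert = x509.load_pem_x509_certificate(new_cert_pem)
    return datetime.utcnow() - cert.not_valid_before >= LEAD_TIME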
Speaking for the big unified certs we get from commercial vendors: we generally do wait ~24h (usually longer?) between the issue date of new major certs and their deployment, but it's more an unspoken general best practice than a documented policy.
Well, a potential lesser goal that involves fewer moving parts would just be to loadbalance non-sessioned readonly requests (basically cache misses for GET/HEADs with no session token) across the sides, where we basically don't care if content is stale in the replication sense, and leave POST-like methods and all sessioned traffic master-only.
English grammar nits: it would be forward secret ciphers (meaning "ciphers which have the property of forward secrecy"). But these terms "forward secret" and "cipher" belong to fairly small-scope technical fields, so they probably don't translate easily (if at all, in some cases). With English being the normal language of technical matters on the Internet (for better or worse!), it may make sense to split this up into a simpler translatable message, followed by a more-accurate English-only explanation placed at the bottom and intended for more-technical readers (in some cases, that's what will get copied to some forum or tech support person to ask what it's about).
Correct, neither the domain itself (registration, DNS) nor the hosting of any resources for it belongs to the WMF. http://www.domain.hu/domain/domainsearch/ can be used to look up the owners' details.
May 31 2018
Last update was just missing console for atlas, as a non-blocker. I think this has probably been in a closeable state for a while?
May 25 2018
May 24 2018
May 23 2018
I don't have complete thoughts, but keep in mind that, in general, it's complicated to go changing our actual host interface MTUs to anything larger than 1500 ("jumbo frames"), for a few reasons:
May 19 2018
Right, I forgot, that was discussed as an optimization (vs having ChronologyProtector just time out -> go stale invisibly in a case where we may have some operational issue causing persistent, large replication lag).
Right. Just to re-state for clarity, the sort of logic we should be implementing in VCL (in the cache layers) will look like this pseudo-code:
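# (illustrative pseudo-VCL only, re-sketched from the surrounding discussion;
#  the backend names and the cookie regex here are placeholder assumptions)
sub backend_routing {
    if (req.method != "GET" && req.method != "HEAD") {
        # POST-like methods always route to the primary/master DC
        set req.backend_hint = primary_dc;
    } elsif (req.http.Cookie ~ "(session|cpPosIndex)") {
        # sessioned traffic (incl. ChronologyProtector positions) stays master-only
        set req.backend_hint = primary_dc;
    } else {
        # non-sessioned readonly requests can be served from either side
        set req.backend_hint = local_dc;
    }
}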
At this point, we're pushing off commit-free DC switching to post-ATS (sometime during the latter part of next FY, probably).
Also, naming/bikeshedding stuff: we should name/implement this as a generic ACME tool rather than LE-specific, and just make LE the default issuing CA.
@Joe - So we're looking at doing something just for the LetsEncrypt (ACME) use-case over in T194962. The idea is this will manage puppetized issue/renewal/distribution of public/private keypairs from LetsEncrypt's ACME CA and support all the LE use-cases we have (e.g. same cert on N endpoint hosts, etc). I think it will work well and basically solve all our issues for public-facing certs well enough. We can/should add revocation support as well (using the saved privkeys to revoke->reissue chosen certs via ACME, under admin command, if we think a privkey was compromised), but that might come later.
- We should assume by default we want all certificates to be dual-issued as ECDSA+RSA variants and served to clients in both forms (I think this basically requires doing the same thing twice over with different private key types; sketched below).
- We should look at what attributes we can/should set in the request to get any optional goodies by default, like embedding SCTs for transparency.
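A rough sketch of the dual-issuance mechanics with the Python cryptography package (hostname and key parameters are illustrative; the real tool would presumably also handle renewal/distribution):

from cryptography import x509
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec, rsa
from cryptography.x509.oid import NameOID

# One SAN set, two keypairs, two CSRs: "the same thing twice over with
# different private key types". Hostname and key sizes are placeholders.
san = x509.SubjectAlternativeName([x509.DNSName("example.wikimedia.org")])
subject = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "example.wikimedia.org")])
keys = {
    "rsa": rsa.generate_private_key(public_exponent=65537, key_size=2048),
    "ecdsa": ec.generate_private_key(ec.SECP256R1()),
}
csrs = {
    kind: x509.CertificateSigningRequestBuilder()
    .subject_name(subject)
    .add_extension(san, critical=False)
    .sign(key, hashes.SHA256())
    for kind, key in keys.items()
}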
Some after-thoughts on design issues and such (I haven't looked at any code!):
May 18 2018
Note added parent above, T194962. We're going to try to develop the generic solution for LE first, and then make the secredir service a client of the central LE service.
May 15 2018
Closing this as well, we're through the basic turn-up process. Trailing work on network engineering and Zero is tracked elsewhere.
Closing this (a bit late), as service has been online for a while now. Trailing remaining tasks re: Zero and/or further network engineering aren't really a blocker for saying the site is online :)
Works now, thanks!
Sorry, I missed the above scan link earlier. The "weak DH" issue isn't mentioned explicitly on our policy page, but is definitely an issue. I think maybe the policy page wording is defective, but basically if ssllabs.com isn't showing an A+, it's a fail. I'll amend it a bit to be more explicit about that.
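If it helps to automate that check: the public SSL Labs assessment API can be polled directly; a hedged sketch (placeholder hostname, and written against what I believe the v3 API looks like):

import time
import requests

# Hedged sketch: poll the SSL Labs API until the scan completes, then treat
# anything that isn't an A+ as a policy failure, per the above.
host = "example.wikimedia.org"  # placeholder
while True:
    r = requests.get("https://api.ssllabs.com/api/v3/analyze",
                     params={"host": host, "all": "done"}).json()
    if r.get("status") in ("READY", "ERROR"):
        break
    time.sleep(30)
grades = [ep.get("grade") for ep in r.get("endpoints", [])]
assert grades and all(g == "A+" for g in grades), grades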
May 14 2018
I think this ended up being an Analytics Q4 goal? It's not on our goals list, but we agreed to allot some time to it this Q to do our part (which is most of it, confusingly!), and have discussed it internally.
It's more likely that DotNetWikiBot just needs to be built against a newer .NET version, or needs .NET configuration tweaks, to support better encryption (or perhaps it's already capable and some bots are behind on releases of it or its dependencies). I'm not familiar enough with DotNetWikiBot or .NET in general to know, though.
May 8 2018
Singapore itself, for nonsensical reasons related to the wild world of network peering, doesn't tend to be our best comparison point anyways, even though it's the first one we turned on. Probably our best bet would be to look at the aggregate data for all of the involved countries marked "Done" in T189252.
May 5 2018
Yeah, lvs200x are all HPs as well, so they're "different" in many respects for better or worse, and getting replaced this quarter with something more-like lvs1016.
I don't think the firmware versions you see there in ethtool and/or dmesg are the whole story anyways. The ones visible from bios-level setup have like 6 different inter-related version numbers for different things, and even the "main" firmware version number was very different between the two cards. We should probably have @Cmjohnson flash them with latest (or at least, consistent) firmwares.
May 4 2018
The key to the ethtool difference is this in the lspci stuff:
Capabilities: [a0] MSI-X: Enable+ Count=17 Masked-
Capabilities: [a0] MSI-X: Enable+ Count=32 Masked-
May 3 2018
[I still bet if you undo the already-done cable swap, and then switch the two cards' cables (leaving port1/2 ordering the same), this will all magically come out right]
I don't think it was a flip of the two ports on the same card that was needed, but instead switching all the cables between the two cards (order of cards, not order of ports in each card).
May 2 2018
@Jgreen yeah if you don't have any special purpose for them, then they're basically the same as the text-lb ones (we like having those in DNS so we don't have to remember IPs when doing certain kinds of manual debugging against specific sites, etc, but they're not functionally-used in any public way).
FWIW on my end, the following hostnames are definitely non-functional:
text-lb text-lb.codfw text-lb.eqiad text-lb.eqsin text-lb.esams text-lb.ulsfo
May 1 2018
Apr 26 2018
Note this will involve a planned ulsfo site outage, with its traffic falling back to codfw. If things go well the outage should be brief, the 5h estimate above is worst-case with complications. We should avoid taking other significant risks while this is ongoing (esp anything re: codfw-vs-eqiad redundancy, or risks to eqsin).
Apr 23 2018
+1, probably the conservative approach for now would be to have puppet disable the systemd unit and remove the cron.daily file (on jessie as well? seems a waste if we expect it's not used).
Are we good for handoff to Traffic for OS-level install/config now on lvs1016?
Apr 19 2018
Crash recovery appears to have completed at about 23:10:33 and things came back online. We'll see if it remains stable. Leaving the downtimes in place to avoid more spamming of IRC.
[also, I've downtimed all the esams-specific prometheus-based alerts in icinga for 24h now (varnish child-counting checks and pybal bgp checks)]
It seems to be having problems coming up cleanly too, so more spam. First chunk of startup logs:
And that was followed by this, by the time it finally stopped itself ~5 minutes later:
Apr 19 22:55:47 bast3002 prometheus@ops: time="2018-04-19T22:55:47Z" level=info msg="Done checkpointing in-memory metrics and chunks in 3
Apr 19 22:55:47 bast3002 prometheus@ops: time="2018-04-19T22:55:47Z" level=info msg="Checkpointing fingerprint mappings..." source="persi
Apr 19 22:55:48 bast3002 prometheus@ops: time="2018-04-19T22:55:48Z" level=info msg="Done checkpointing fingerprint mappings in 690.40272
Apr 19 22:55:49 bast3002 prometheus@ops: time="2018-04-19T22:55:49Z" level=error msg="Error unarchiving fingerprint 05d932c8af845af9 (met
Apr 19 22:55:49 bast3002 prometheus@ops: time="2018-04-19T22:55:49Z" level=error msg="Error unarchiving fingerprint 92fa66fb4cf5b4ac (met
Apr 19 22:55:49 bast3002 prometheus@ops: time="2018-04-19T22:55:49Z" level=error msg="Error unarchiving fingerprint b3bc5c659c5701f2 (met
Apr 19 22:55:49 bast3002 prometheus@ops: time="2018-04-19T22:55:49Z" level=error msg="Error unarchiving fingerprint 67cc3bbc9996f80a (met
Apr 19 22:55:49 bast3002 prometheus@ops: time="2018-04-19T22:55:49Z" level=error msg="Error unarchiving fingerprint 6c00f4c49f4d7f44 (met
Apr 19 22:55:49 bast3002 prometheus@ops: time="2018-04-19T22:55:49Z" level=error msg="The storage is now inconsistent. Restart Prometheus
Apr 19 22:55:49 bast3002 prometheus@ops: time="2018-04-19T22:55:49Z" level=error msg="Error unarchiving fingerprint 4b2269a7a2bc0903 (met
Apr 19 22:55:49 bast3002 prometheus@ops: time="2018-04-19T22:55:49Z" level=error msg="Error unarchiving fingerprint 77b10b7d10ad866b (met
Apr 19 22:55:49 bast3002 prometheus@ops: time="2018-04-19T22:55:49Z" level=error msg="Error unarchiving fingerprint 3f98cf3aa324de5b (met
Apr 19 22:55:49 bast3002 prometheus@ops: time="2018-04-19T22:55:49Z" level=error msg="Error unarchiving fingerprint 31f74eca61e77fc7 (met
Apr 19 22:55:49 bast3002 prometheus@ops: time="2018-04-19T22:55:49Z" level=info msg="Local storage stopped." source="storage.go:484"
Apr 19 22:55:50 bast3002 systemd: Stopped prometheus server (instance ops).
-- Subject: Unit prometheus@ops.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit prometheus@ops.service has finished shutting down.
It kept spamming the whole time I was writing the above. Attempting to stop it now, but the basic daemon "stop" operation via systemctl is taking quite a long time (over 3 minutes and counting, even though parts of it have logged that they're gone):
My random $0.02 as a bystander:
When I look at our LVS hosts (which are mixed jessie+stretch currently), the jessie ones show atop processes like:
root 26337 1 0 00:00 ? 00:00:04 /usr/bin/atop -a -w /var/log/atop/atop_20180419 600
Apr 13 2018
Yeah, ema and I discussed this after the meeting the other day. I'm not sure whether or how we can look into the history on the Zero side, but I don't see Zero at this point having a need or desire to restore that data through the previous infrastructure or add any new data, so we're planning to just stop pulling that empty data from them, and replace it with a private file that's puppet-managed/deployed by the SRE team and includes the newly-acquired OperaMini data.
Apr 10 2018
all green in icinga now and repooled, closing!