status.wikimedia.org has no (valid) HTTPS
Closed, Resolved (Public)

Description

Currently, status.wikimedia.org has no HTTPS at all. I suspect this was the "workaround" for it having an incorrect certificate in the past.

Previous description: status.wikimedia.org is using a security certificate issued for *.io.watchmouse.com, which gives a warning in IE and Chrome.

Is it possible to install a wikimedia certificate on that domain? Thanks.

Details

Reference
bz32796

Related Objects

Status      Assigned
Resolved    BBlack
Resolved    BBlack
Resolved    BBlack
Resolved    None
Declined    Krinkle
Resolved    Jgreen
Resolved    Chmarkine
Resolved    BBlack
Resolved    BBlack
Resolved    Dzahn
Resolved    ezachte
Resolved    BBlack
Resolved    BBlack
Resolved    BBlack
Resolved    BBlack
Resolved    BBlack
Resolved    BBlack
Resolved    BBlack
Resolved    Dzahn
Resolved    BBlack
Duplicate   None
Resolved    BBlack

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 12:00 AM
bzimport added projects: HTTPS, acl*sre-team.
bzimport set Reference to bz32796.
bzimport added a subscriber: Unknown Object (MLST).

Because it's offsite.

It'll need its own specific cert bought and assigned.

I doubt this is feasible - it is hosted on Amazon AWS, so they'd have to fire up a separate watchmouse AWS LB instance just to serve Wikimedia status? ;-)

If we can't get it on a correct cert, we might want to redirect to its canonical domain instead so it at least loads properly. Main obvious downside is if our redirector or iframe wrapper goes down, you don't see it on our pretty domain anymore. ;)

Removing dependency to bug 27946 which is secure.wikimedia.org.

Bug 44760 has been marked as a duplicate of this bug.

So this certificate is served by Nimsoft, and we have no control over it. I'll paste the reasoning from RT:

it is just a CNAME for status.watchmouse.com

status.wikimedia.org is an alias for status.watchmouse.com.
status.watchmouse.com is an alias for dualstack.lb-1710199131.us-east-1.elb.amazonaws.com.

In the watchmouse UI, in "Public folders" setup you can change the CNAME but nothing about SSL or certificates.

And the failure is on their side already anyway, because status.watchmouse.com itself does not present a matching cert:

status.watchmouse.com uses an invalid security certificate.

The certificate is only valid for *.io.watchmouse.com
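
The mismatch follows from how wildcard certificate names are matched: a `*` covers exactly one left-most label, so a cert for *.io.watchmouse.com cannot cover status.watchmouse.com or status.wikimedia.org. A minimal Python sketch of that rule (simplified from RFC 6125; the function name is made up for illustration, and real TLS libraries handle more cases, such as partial-label wildcards and IDNs):

```python
def wildcard_matches(cert_name: str, hostname: str) -> bool:
    """Return True if hostname is covered by cert_name.

    Simplified rule: a leading "*" label matches exactly one label;
    all remaining labels must be equal (case-insensitive).
    """
    cert_labels = cert_name.lower().rstrip(".").split(".")
    host_labels = hostname.lower().rstrip(".").split(".")
    # A wildcard never spans multiple labels, so label counts must agree.
    if len(cert_labels) != len(host_labels):
        return False
    if cert_labels[0] == "*":
        return cert_labels[1:] == host_labels[1:]
    return cert_labels == host_labels


if __name__ == "__main__":
    cert = "*.io.watchmouse.com"
    for host in ("status.io.watchmouse.com",
                 "status.watchmouse.com",
                 "status.wikimedia.org"):
        print(host, wildcard_matches(cert, host))
```

Only the first of the three hostnames in the demo is covered by the wildcard, which is why both the vendor name and our CNAME trigger browser warnings.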

Folks then ask whether we can redirect; the answer is no. It is a status page for when the cluster is down, so redirecting via the cluster is non-ideal.

So this is a wontfix, because we cantfix.

Re-opening this for further consideration. A fair bit has changed since 2013, including a strong push for HTTPS/TLS/SSL support across both Wikimedia and the rest of the Internet.

https://status.wikimedia.org/ failing isn't really okay. We should figure out some way to make this work or we should kill the service entirely, in my opinion.

MZMcBride set Security to None.
MZMcBride added subscribers: Dzahn, BBlack.

Agreed, let's consider buying that. Adding "traffic" for opinions.

Over 2 years later, and we still have pages like status.watchmouse.com giving

This server could not prove that it is status.watchmouse.com; its security certificate is from status.io.watchmouse.com.

Do we really care about having status.wikimedia.org served over TLS? I am not sure it is worth the effort (and the price of a host cert), so I would rather disable HTTPS and just use HTTP.

Do we really care about having status.wikimedia.org served over TLS?

yes

but that doesn't mean I have a solution for fixing it, since it's on watchmouse's servers

Dzahn removed Dzahn as the assignee of this task. Oct 21 2015, 12:23 AM
BBlack renamed this task from "status.wikimedia.org is using SSL cert from other domain" to "status.wikimedia.org has no (valid) HTTPS". Apr 14 2016, 1:30 PM
BBlack updated the task description. (Show Details)

Considering that watchmouse's own status pages, e.g. http://status.cloudmonitor.ca.com/ and http://stations.status.cloudmonitor.ca.com/ don't offer HTTPS at all (connection refused on 443), I doubt we'll get far with asking for it for our status site.

The current importance of this is that it's likely to be the very very last thing (there's one other pending, but it can be solved relatively easily) preventing us from doing a blanket STS-preload for all of wikimedia.org, which is a big deal.
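
For context on why a single HTTP-only hostname blocks a blanket preload: a preload entry applies to the whole domain, and the header served on the base domain has to carry `includeSubDomains` and `preload` with a `max-age` of at least one year (the requirements published at hstspreload.org, as I understand them). A rough Python sketch of that header check; the function names are illustrative and not part of any tooling mentioned in this task:

```python
def parse_hsts(header_value: str) -> dict:
    """Parse a Strict-Transport-Security header value into a dict.

    Directive names are lower-cased; valueless directives map to "".
    """
    directives = {}
    for part in header_value.split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        directives[name.strip().lower()] = value.strip().strip('"')
    return directives


def preload_eligible(header_value: str, min_age: int = 31536000) -> bool:
    """Check the header-level preload requirements (assumed: one-year minimum)."""
    d = parse_hsts(header_value)
    try:
        max_age = int(d.get("max-age", ""))
    except ValueError:
        return False
    return max_age >= min_age and "includesubdomains" in d and "preload" in d


if __name__ == "__main__":
    for value in ("max-age=31536000",
                  "max-age=31536000; includeSubDomains; preload"):
        print(repr(value), preload_eligible(value))
```

Because `includeSubDomains` forces HTTPS on every name under wikimedia.org, any subdomain that cannot serve valid HTTPS would simply break for preloaded browsers.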

Basic options that come to mind in a few minutes:

  1. Talk to watchmouse, see if there's any way they can HTTPS this with a legit cert, like we've done with other 3rd party vendors (where we purchase the cert and hand them the key securely). Seems unlikely, but worth a shot!
  2. Cancel/Replace this service? I don't know that we have an equivalent replacement anywhere, but this option was mentioned once before! If replacing it means spending a long time looking for a new replacement service and setting that up first, that sucks timeline-wise.
  3. Move it to another domain. We could move the CNAME to some other domain we own that we're not trying to STS-preload, like say status.wmftest.net or something. We could support the old name (transitionally, but not as the advertised name, because it might not work when our infra is down!) by having status.wikimedia.org map to one of the prod varnish clusters securely and generate a 301 -> the new hostname.

They might also have the option to simply use a hostname within their domains rather than bothering with another name of ours. e.g. configuring it as wikimedia.status.asm.ca.com and calling that the official name. After all this is supposed to be independent of our infrastructure. Ideally that would include our authdns, too.

The actual blocker for 2. was that Catchpoint was able to replace almost all features of Watchmouse, _except_ that it doesn't have that kind of status page. So maybe another option is to keep poking them about that, arguing that we pay them quite a bit already.

In the settings, we can see that http://status.wikimedia.org/ is also available at http://status.asm.ca.com/8777 . There don't appear to be any TLS-related settings :(

We could do a few things trivially, which have un-ideal tradeoffs, but might be acceptable:

  1. We could set up a revproxy for it internally, on perhaps a ganeti misc web node? Then it would be TLS'd as part of the misc cluster, but it won't be available if any of several parts of our infra are in trouble, which is probably when we want it the most.
  2. We could set up a static page for it on the misc cluster (via varnish synthesis), which simply links to (or does a slow HTML refresh to?) the http://status.asm.ca.com/8777 . At least then people could bookmark the real thing independent of our infra, and it would rely on less of our infra (just authdns + LVS + cache_misc, and be DC-independent).
  3. We could host a revproxy as in (1) above, but externally like we do for https://wikitech-static.wikimedia.org/ (perhaps even on the same host?).

Option (3) sounds like the easiest way forward to me and an acceptable option. My only concern would be whether it could handle a surge of traffic (the kind of traffic it'd see when we're down at some point). I don't think we advertise the status page much, so I wouldn't expect a big surge. I don't think we have or could get any access statistics for it right now, but having it be fronted by infrastructure we control could allow us to, which is another plus. If we do this move, let's at least set up some logging and/or monitoring for it and check it during the next outage :)

Longer-term I think we should overhaul that whole status page. This is currently backed by Watchmouse which isn't very accurate (or pretty). We could either use some external status page service (statuspage.io etc.) or (my preference) build something ourselves using e.g. wikitech-static or some other externally-hosted infrastructure. Something like Cachet could be reused for this to save us from all the frontend trouble.

But all of that can wait; for the purposes of this task (HTTPS support), option (3) is a good compromise, IMHO.

Yeah I tend to agree too. I think if we're concerned at all about status.wm.o perf during outages, we could probably also tack on a secondary task to extend the apache config there to use mod_cache and cache the status with a 1 or 5 minute TTL and cut down on the revproxying load.
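
The caching idea amounts to a short-TTL cache in front of the upstream fetch: during a traffic surge, at most one upstream request goes out per TTL window. In Apache this would be done with mod_cache configuration; the Python below is only a language-neutral sketch of the behaviour, and `TTLCache`, `fetch`, and the injectable clock are hypothetical names for illustration:

```python
import time
from typing import Callable, Optional, Tuple


class TTLCache:
    """Cache a single fetched value and refetch once it is older than the TTL.

    Single-threaded sketch; a real proxy cache also handles concurrency,
    per-URL keys, and upstream cache headers.
    """

    def __init__(self, fetch: Callable[[], str], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entry: Optional[Tuple[float, str]] = None  # (stored_at, value)

    def get(self) -> str:
        now = self._clock()
        # Refetch on first use or when the stored copy has reached the TTL.
        if self._entry is None or now - self._entry[0] >= self._ttl:
            self._entry = (now, self._fetch())
        return self._entry[1]


if __name__ == "__main__":
    upstream_hits = []

    def fetch_status() -> str:
        # Stand-in for the proxied request to the vendor's status page.
        upstream_hits.append(1)
        return "status page"

    cache = TTLCache(fetch_status, ttl_seconds=60)
    for _ in range(1000):
        cache.get()
    print("upstream requests:", len(upstream_hits))
```

With a 1 or 5 minute TTL the status data is at most that stale, which seems an acceptable trade for keeping the revproxy host alive during an outage.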

There's an upside, too, in that our revproxy will help anonymize clients of status.wm.o against watchmouse/CA privacy invasion :)

Also note: while in there, should convert wikitech-static to cron'd letsencrypt (using our prod script!), and then use that for the status.wm.o cert as well.

Yes, we should. Unfortunately wikitech-static might be a pain since it does not use puppet (and for obvious reasons cannot reach the production puppetmaster). :/

it's ok, we can just copy down the acme-setup script as it exists today (well, and acme-tiny). for a 1-2 cert setup like this, it's not hard to use it puppet-free from cron I think.

Change 292482 had a related patch set uploaded (by BBlack):
status -> wikitech-static hosting T34796

https://gerrit.wikimedia.org/r/292482

Change 292482 merged by BBlack:
status -> wikitech-static hosting T34796

https://gerrit.wikimedia.org/r/292482

BBlack closed this task as Resolved. Edited Jun 2 2016, 10:48 PM
BBlack claimed this task.

I've moved the status.wm.o DNS to wikitech-static, and set up an apache reverse proxy there with a LetsEncrypt cert that auto-renews. It seems to work now, after much experimenting and mucking around!

For the record, since we have no puppet, in case we have to muck with this again, the basic things I did were:

  1. Created a local acme user and group that can't log in
  2. Copied acme-setup, acme_tiny.py, and x509-bundle from our puppet repo to /usr/local/sbin/
  3. Commented out the self-verification portion of acme_tiny.py (this always seems to fail on challenge over redirect to self-signed for me).
  4. Installed the letsencrypt X3 and X4 intermediates in /usr/local/share/ca-certificates and ran update-ca-certificates.
  5. Enabled the following new apache2 modules: proxy, proxy_http, proxy_html
  6. Set up the following as the sites-available/enabled file for status.wikimedia.org.conf (note especially the crazy HTML translation hacks for re-mapping link URLs, especially the mongocache one, which is for ajax data loaded from a separate HTTP-only URL belonging to CA...):
# vim: filetype=apache

<VirtualHost *:80>
	ServerAdmin noc@wikimedia.org
	ServerName status.wikimedia.org

	SSLEngine off
	
	RewriteEngine on
	RewriteCond %{SERVER_PORT} !^443$
	RewriteRule ^/(.*)$ https://status.wikimedia.org/$1 [L,R=301]

	ErrorLog /var/log/apache2/error.log

	# Possible values include: debug, info, notice, warn, error, crit,
	# alert, emerg.
	LogLevel warn

	CustomLog /var/log/apache2/access.log combined
	ServerSignature Off

</VirtualHost>
<VirtualHost *:443>
	ServerAdmin noc@wikimedia.org
	ServerName status.wikimedia.org

	SSLEngine on
	SSLCertificateFile /etc/acme/cert/status.chained.crt
	SSLCertificateKeyFile /etc/acme/key/status.key
	SSLProtocol all -SSLv2 -SSLv3
	SSLCipherSuite -ALL:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-DES-CBC3-SHA:ECDHE-RSA-DES-CBC3-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:DES-CBC3-SHA
	SSLHonorCipherOrder On
	Header always set Strict-Transport-Security "max-age=31536000"

	<Location />
		ProxyPass "http://status.asm.ca.com/8777/"
		ProxyPassReverse "http://status.asm.ca.com/8777/"
		RequestHeader unset Accept-Encoding
		Header always set Content-Security-Policy upgrade-insecure-requests
		ProxyHTMLEnable On
		ProxyHTMLExtended On
		ProxyHTMLLinks	a		href
		ProxyHTMLLinks	area		href
		ProxyHTMLLinks	link		href
		ProxyHTMLLinks	img		src longdesc usemap
		ProxyHTMLLinks	object		classid codebase data usemap
		ProxyHTMLLinks	q		cite
		ProxyHTMLLinks	blockquote	cite
		ProxyHTMLLinks	ins		cite
		ProxyHTMLLinks	del		cite
		ProxyHTMLLinks	form		action
		ProxyHTMLLinks	input		src usemap
		ProxyHTMLLinks	head		profile
		ProxyHTMLLinks	base		href
		ProxyHTMLLinks	script		src for
		ProxyHTMLEvents	onclick ondblclick onmousedown onmouseup onmouseover onmousemove onmouseout onkeypress onkeydown onkeyup onfocus onblur onload onunload onsubmit onreset onselect onchange
		ProxyHTMLURLMap //status\.asm\.ca\.com/8777(/|$) //status.wikimedia.org/ [Ri]
		ProxyHTMLURLMap //mongocache.asm.ca.com/ //status.wikimedia.org/.mongocache/
		ProxyHTMLURLMap http:// https:// [i]
		SetOutputFilter proxy-html
	</Location>
	<Location /.mongocache>
		ProxyPass "http://mongocache.asm.ca.com/"
		ProxyPassReverse "http://mongocache.asm.ca.com/"
	</Location>

	<Location /.well-known/acme-challenge>
		ProxyPass "!"
	</Location>

	Alias "/.well-known/acme-challenge" "/var/acme/challenge"
	<IfVersion >= 2.4>
		<Directory "/var/acme/challenge">
			Require all granted
		</Directory>
	</IfVersion>

	ErrorLog /var/log/apache2/error.log

	# Possible values include: debug, info, notice, warn, error, crit,
	# alert, emerg.
	LogLevel debug

	CustomLog /var/log/apache2/access.log combined
	ServerSignature Off

</VirtualHost>
  7. Ran the initial acme-setup to generate a self-signed cert:
/usr/local/sbin/acme-setup -i status -s status.wikimedia.org -m self -u acme
  8. Reloaded apache2
  9. Re-ran acme-setup to get a real cert:
/usr/local/sbin/acme-setup -i status -s status.wikimedia.org -u acme -m acme -w apache2
  10. Created a cronjob running exactly the above once a day at 17:17, which will auto-renew when necessary.
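
Running the same command daily works because the script only needs to act when the cert is close to expiry. The internals of acme-setup are not shown in this task, so the following is only a generic Python sketch of a renew-when-near-expiry check; the 30-day threshold and the function name are assumptions, not necessarily what acme-setup does:

```python
from datetime import datetime, timedelta, timezone


def needs_renewal(not_after: datetime, now: datetime,
                  threshold: timedelta = timedelta(days=30)) -> bool:
    """Return True when the certificate expires within `threshold` of `now`.

    The 30-day default is an assumed value chosen for illustration.
    """
    return not_after - now <= threshold


if __name__ == "__main__":
    # Let's Encrypt certs were valid for 90 days at the time of this task.
    issued = datetime(2016, 6, 2, tzinfo=timezone.utc)
    not_after = issued + timedelta(days=90)
    for day in (1, 59, 60, 89):
        today = issued + timedelta(days=day)
        print("day", day, "renew:", needs_renewal(not_after, today))
```

A daily job built around a check like this is a no-op on most days and leaves plenty of retries before expiry if a renewal attempt fails.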

(note, above has been edited a few times to correct missing stuff, will keep doing that so this task serves as a good reference)