Swift uses http in deployment-prep, https in production
Open, Needs Triage, Public

Description

This environment difference broke production today when rolling out T244776, despite the patch having been fully tested and working in beta.

I don't immediately know if this is just a configuration issue (T277680), or if there are deeper issues leading to the divergence. (My money's on the latter.)

Event Timeline

In https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/profile/swift.yaml, we have a mix of http and https, pointing to ms-fe.svc.%{::site}.wmnet:

mw_thumbor:
    access:       ''
    account_name: 'AUTH_mw'
    auth:         'http://ms-fe.svc.%{::site}.wmnet'
    user:         'mw:thumbor'
    stats_enabled: 'no'
# ...
performance_arclamp:
    access:       '.admin'
    account_name: 'AUTH_performance'
    auth:         'https://ms-fe.svc.%{::site}.wmnet'
    user:         'performance:arclamp'

We override that in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/cloud/eqiad1/deployment-prep/common.yaml:

profile::swift::accounts:
  performance_arclamp:
    access: .admin
    account_name: AUTH_performance
    auth: http://deployment-ms-fe03.deployment-prep.eqiad.wmflabs
    user: performance:arclamp

There's an additional override in the project Hiera config in Horizon:

thumbor::swift::swift_host: http://deployment-ms-fe03.deployment-prep.eqiad.wmflabs

I think the root issue is that we don't have Envoy in deployment-prep (T254917), thus there's no ms-fe.svc.labs.wmnet. We can (and should) be consistent with http vs. https when overriding the Swift parameters, but ideally we wouldn't have to override them at all.
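To make the mismatch concrete, here is a minimal sketch that compares the URL scheme of each deployment-prep override against the production default. The `auth` values are copied from the excerpts quoted above; the dict layout and the `scheme_mismatches` helper are hypothetical and not part of any existing Wikimedia tooling.

```python
from urllib.parse import urlsplit

# Values copied from the Hiera excerpts quoted in this task. The dict layout
# and the helper below are illustrative assumptions, not existing tooling.
PRODUCTION_ACCOUNTS = {
    "mw_thumbor": {"auth": "http://ms-fe.svc.%{::site}.wmnet"},
    "performance_arclamp": {"auth": "https://ms-fe.svc.%{::site}.wmnet"},
}
DEPLOYMENT_PREP_OVERRIDES = {
    "performance_arclamp": {
        "auth": "http://deployment-ms-fe03.deployment-prep.eqiad.wmflabs"
    },
}


def scheme_mismatches(base, overrides):
    """Return {account: (base_scheme, override_scheme)} where the schemes differ."""
    mismatches = {}
    for name, override in overrides.items():
        if name not in base:
            # Nothing to compare against for accounts that only exist in the override.
            continue
        base_scheme = urlsplit(base[name]["auth"]).scheme
        override_scheme = urlsplit(override["auth"]).scheme
        if base_scheme != override_scheme:
            mismatches[name] = (base_scheme, override_scheme)
    return mismatches


if __name__ == "__main__":
    found = scheme_mismatches(PRODUCTION_ACCOUNTS, DEPLOYMENT_PREP_OVERRIDES)
    for name, (prod, beta) in found.items():
        print(f"{name}: production uses {prod}, deployment-prep uses {beta}")
```

With the values above, this reports `performance_arclamp` as the account where production uses https and deployment-prep uses http. A check like this only covers the account definitions; the separate `thumbor::swift::swift_host` override in Horizon would need to be compared by hand.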

We don't have Envoy, and we don't have cergen, which (as I understand it) is used to create the certificates that encrypt internal https traffic. I believe @jbond has been working on a replacement for cergen that could be used here, but I'm not sure whether it's in a usable state yet or whether it is suitable here.

@Majavah I have configured the deployment-prep project so that it should be able to work with the test pki service, so you should be able to follow the instructions on the pki client wiki page to add managed certificates. The system should be stable enough to use; however, please keep in mind:

  • currently there is no production equivalent of the pki service (expected in ~2 weeks)
  • this is a dev server so the security guarantees should be considered as such
  • if you were to make use of the service you would be the first one (other than me), so there could be teething issues

Disclaimer out of the way, I'd be happy to help either here or via IRC.

Now that we have cfssl, I created a CNAME for a svc domain name and configured an Envoy proxy in front of Swift. This means that Swift is now available at https://ms-fe.svc.deployment-prep.eqiad1.wikimedia.cloud. I haven't moved any clients yet; I'll do that in the future unless someone beats me to it.

Change 684010 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] beta: Use https for swift

https://gerrit.wikimedia.org/r/684010

Change 684012 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/mediawiki-config@master] beta: Use https for swift

https://gerrit.wikimedia.org/r/684012

Change 683837 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] P::envoy: allow users to run tlsproxy without service proxy

https://gerrit.wikimedia.org/r/683837

Change 688312 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] P:services_proxy::envoy: drop the ensure parameter

https://gerrit.wikimedia.org/r/688312

Change 688315 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] profile::services_proxy::envoy: noop when fed an empty list of listeners

https://gerrit.wikimedia.org/r/688315

Change 688312 abandoned by Jbond:

[operations/puppet@production] P:services_proxy::envoy: drop the ensure parameter

Reason:

https://gerrit.wikimedia.org/r/c/operations/puppet/+/688315

https://gerrit.wikimedia.org/r/688312

Change 683837 abandoned by Jbond:

[operations/puppet@production] P::envoy: allow users to run tlsproxy without service proxy

Reason:

https://gerrit.wikimedia.org/r/c/operations/puppet/+/688315

https://gerrit.wikimedia.org/r/683837