
Upgrade cloudcontrol1003/1004 to stretch/mitaka
Closed, Resolved · Public

Description

The failover process for this will be a bit messy but it should be possible to do this without a lot of downtime.

Event Timeline

My proposal is to use the opportunity to try some things in the same window (a quick check of the new endpoint is sketched after this list):

  • let's move the API endpoints to the external proxy. Use the new cloud domain for them. Example FQDN: eqiad1.api.wmfcloud.org (I don't remember the actual domain name)
  • point this FQDN to the active server, which is cloudcontrol1003 currently
  • update every script to use this new endpoint. At least, the important ones.
  • upgrade cloudcontrol1004 to stretch. Probably reimage it.
  • try failing over from cloudcontrol1003 to cloudcontrol1004, with the new API endpoint repointed to the new server. This will be very disruptive even if only the control plane is affected. See what can be improved here.
  • upgrade cloudcontrol1003. Probably reimage it.
  • Cleanup
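
A quick way to sanity-check the first two bullets once the record exists (sketch only; the FQDN is the half-remembered example above, and the port/scheme depend on how the external proxy ends up configured):

$ dig +short eqiad1.api.wmfcloud.org                             # should resolve to the active server (cloudcontrol1003 today)
$ curl -si http://eqiad1.api.wmfcloud.org:5000/v3/ | head -n1    # keystone answering on the service name
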
  • let's move the API endpoints to the external proxy.

This sounded strange to me at first, but then I looked at the current /etc/novaobserver.yaml config and realized that in the eqiad1-r region we are using http://cloudcontrol1003.wikimedia.org:5000/v3 as the auth URL. So as I understand it now this change would be primarily putting a service name in place (which seems like a really really good idea).
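
For context, the host pinning described above can be seen directly (the file path is from this comment; the wmcs-openstack wrapper appears later in this task; run on a cloudcontrol host):

$ grep -n 'cloudcontrol1003' /etc/novaobserver.yaml    # the auth URL pinned to a specific host
$ sudo wmcs-openstack catalog list                     # the same pinning, as seen in the keystone service catalog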

Example FQDN: eqiad1.api.wmfcloud.org (I don't remember the actual domain name)

wmcloud.org is the domain that I eventually expect to replace wmflabs.org.

There is a lot of potential for bikeshedding conversations about the FQDN to use for this service name. I am assuming that we will be rolling out the use of *.wmcloud.org to Cloud VPS projects via Designate (for example login.tools.wmflabs.org transitioning to login.tools.wmcloud.org) and domainproxy (for example wdqs-test.wmflabs.org transitioning to wdqs-test.wmcloud.org). This isn't on the surface a problem, but it does mean that whatever sub-domain we pick for holding public service names will also be blocking that sub-domain from use as a domainproxy name and as a project managed sub-domain.

We are currently using *.svc.eqiad.wmflabs for internal service names. I'm wondering if it would make things easier to remember in the long term if we took *.svc.wmcloud.org as the base domain for public service names? And then maybe use something like eqiad1.openstack.svc.wmcloud.org for the eqiad1-r region OpenStack API endpoints? I think this gives a lot more semantic meaning to the URLs with svc being a pretty clear signal that this is a service name that can float from host to host, and openstack namespacing things off from any other general class of services we may end up publishing.

  • update every script to use this new endpoint. At least, the important ones.

A grep for 'cloudcontrol1003.wikimedia.org:5000' in ops/puppet.git brings back a pretty short list of files, so we should probably be able to change all of them. Anything that we find while doing that which is not reading the value from a config file (or at least the profile::openstack::base::keystone_host hiera variable) should probably be noted with a phab task to come back and clean things up later.

$ git grep -l 'cloudcontrol1003.wikimedia.org:5000'
hieradata/role/eqiad/wmcs/openstack/eqiad1/labweb.yaml
modules/graphite/files/archive-instances
modules/labstore/files/nfs-exportd.py
modules/openstack/files/mitaka/admin_scripts/wmcs-prod-example.sh
modules/profile/files/toolforge/clush/tools-clush-generator
modules/profile/manifests/wmcs/prometheus.pp
modules/role/files/toollabs/clush/tools-clush-generator
modules/sonofgridengine/files/grid_configurator/grid_configurator.py

We are currently using *.svc.eqiad.wmflabs for internal service names. I'm wondering if it would make things easier to remember in the long term if we took *.svc.wmcloud.org as the base domain for public service names? And then maybe use something like eqiad1.openstack.svc.wmcloud.org for the eqiad1-r region OpenStack API endpoints? I think this gives a lot more semantic meaning to the URLs with svc being a pretty clear signal that this is a service name that can float from host to host, and openstack namespacing things off from any other general class of services we may end up publishing.

I agree with everything you said. My only suggestion is perhaps s/openstack/cloudvps/g, i.e., eqiad1.cloudvps.svc.wmcloud.org. But I don't have a strong opinion.

A related question is where the authoritative DNS server for this subdomain ((openstack|cloudvps).svc.wmcloud.org) is going to live. For proper error handling (when Cloud VPS is down), it should probably live outside designate.
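
Once the zone exists, checking where it is actually served from is a one-liner (sketch; svc.wmcloud.org is still only a proposal at this point):

$ dig +short NS wmcloud.org        # who is authoritative for the parent zone
$ dig +short NS svc.wmcloud.org    # whether the svc subzone is delegated separately (e.g. to designate or kept outside it)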

  • update every script to use this new endpoint. At least, the important ones.

A grep for 'cloudcontrol1003.wikimedia.org:5000' in ops/puppet.git brings back a pretty short list of files, so we should probably be able to change all of them. Anything that we find while doing that which is not reading the value from a config file (or at least the profile::openstack::base::keystone_host hiera variable) should probably be noted with a phab task to come back and clean things up later.

It's more than that: it's not only the keystone FQDN/endpoint. I would consider changing all of the public openstack endpoints and decoupling them from a concrete cloudcontrol server:

aborrero@cloudcontrol1004:~ 3s $ sudo wmcs-openstack endpoint list | grep public
| 0a1eb902933c4652ad41e6450fe436ee | eqiad1-r | neutron      | network      | True    | public    | http://cloudcontrol1003.wikimedia.org:9696                             |
| 4578c49346db479ab6d5b7961af8f60a | eqiad1-r | nova         | compute      | True    | public    | http://cloudcontrol1003.wikimedia.org:8774/v2.1                        |
| 465f331e03de4bfcbef1f8a7dbb4de15 | eqiad1-r | designate    | dns          | True    | public    | http://cloudservices1003.wikimedia.org:9001                            |
| be7a84a1af114f94bd1d6cc48b374413 | eqiad1-r | keystone     | identity     | True    | public    | http://cloudcontrol1003.wikimedia.org:5000/v3                          |
| ebf523fce03a4ff4b859277f0a3d2477 | eqiad1-r | proxy        | proxy        | True    | public    | http://proxy-eqiad1.wmflabs.org:5668/dynamicproxy-api/v1/$(tenant_id)s |
| f922afd9417448028ce02734e0420a0b | eqiad1-r | glance       | image        | True    | public    | http://cloudcontrol1003.wikimedia.org:9292                             |

I know this is a really big change. We should take some time to think about the consequences this has for how openstack itself works.

But right now the situation is that we likely require raw database updates if we want to switch cloudcontrol servers (either a controlled failover or disaster recovery).
It would be great to decouple this and get closer to a scenario in which we can fail over the control plane just by flipping a hiera key (to change the proxy backend and start the daemons).

It's more or less obvious that we can follow this approach for the openstack public endpoints. I'm not so sure about the private/admin endpoints.
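
As a sketch of what decoupling could look like, the public catalog entries above can be repointed with the CLI rather than raw database edits (assuming the keystone v3 catalog, which the /v3 auth URL suggests; the service FQDN below is just a placeholder from the naming discussion, and whether the identity endpoint itself needs extra care is part of what would have to be thought through):

$ sudo wmcs-openstack endpoint set \
      --url http://eqiad1.openstack.svc.wmcloud.org:9696 \
      0a1eb902933c4652ad41e6450fe436ee               # neutron's public endpoint ID from the listing above
$ sudo wmcs-openstack endpoint list | grep public    # confirm nothing is still pinned to a specific cloudcontrol host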

Progress on HA for openstack APIs is stalled pending discussion about naming. I'm going to decouple this task from that one for now.

My current plan for upgrading is as follows (a quick sanity check for steps 2-3 is sketched after the list):

  1. re-image the secondaries (cloudcontrol1004, cloudservices1004, labnet1004) with stretch
  2. update keystone endpoint catalog to point to secondary servers
  3. apply a puppet patch pointing all openstack users to cloudcontrol1004
  4. re-image primaries
  5. revert the patch in step 3
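
A quick sanity check between steps 3 and 4, before re-imaging the primary (reusing the wrapper and repo paths seen earlier in this task):

$ sudo wmcs-openstack endpoint list | grep cloudcontrol1003    # should come back empty once step 2 is done
$ git grep -l 'cloudcontrol1003.wikimedia.org'                 # in an ops/puppet.git checkout; anything left is still pinned to the old primary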

Change 512492 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudcontrol1004: move to Stretch

https://gerrit.wikimedia.org/r/512492

Change 512492 merged by Andrew Bogott:
[operations/puppet@production] cloudcontrol1004: move to Stretch

https://gerrit.wikimedia.org/r/512492

I have left 1004 in a somewhat strange state pending a fix for T224345, but the cluster is still working and most services are still active/active between 1003 and 1004.

Volans triaged this task as High priority. May 27 2019, 8:24 AM
Volans subscribed.

cloudcontrol1003 is flapping its systemd degraded alert since 2019-05-25 21:46. The unit that fails is:

● designate_floating_ip_ptr_records_updater.service               loaded failed failed    Designate Floating IP PTR records updater

I downtimed this check in icinga for 1 day.
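
For reference, the standard triage for a failed unit like this one (generic systemd commands; only the unit name is specific to this alert):

$ sudo systemctl status designate_floating_ip_ptr_records_updater.service
$ sudo journalctl -u designate_floating_ip_ptr_records_updater.service --since '2019-05-25 21:00' | tail -n 50
$ sudo systemctl reset-failed designate_floating_ip_ptr_records_updater.service    # clear the degraded state once the underlying issue is fixed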

The rabbitmq cluster that I set up last week doesn't work across the Stretch/Jessie divide. So for now, I'm making cloudcontrol1004 the only active rabbit provider, pending a rebuild of cloudcontrol1003.

In the meantime, rabbitmq will be stopped on cloudcontrol1003, and puppet disabled there.
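
Roughly the manual steps that implies, as a sketch (standard puppet and rabbitmq commands; the disable message is illustrative):

$ sudo puppet agent --disable 'T221770: rabbitmq stopped pending cloudcontrol1003 rebuild'    # on cloudcontrol1003
$ sudo systemctl stop rabbitmq-server                                                         # on cloudcontrol1003
$ sudo rabbitmqctl cluster_status                      # on cloudcontrol1004, to confirm it is the only running node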

Change 512954 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Make cloudcontrol1004 the primary keystone host

https://gerrit.wikimedia.org/r/512954

Change 512954 merged by Andrew Bogott:
[operations/puppet@production] Make cloudcontrol1004 the primary keystone host

https://gerrit.wikimedia.org/r/512954

Icinga downtime for 8:00:00 set by aborrero@cumin1001 on 1 host(s) and their services with reason: rebuilding server

cloudcontrol1003.wikimedia.org

Change 513097 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nfs-exportd: use cloudcontrol1004 endpoint for now

https://gerrit.wikimedia.org/r/513097

Change 513097 merged by Andrew Bogott:
[operations/puppet@production] nfs-exportd: use cloudcontrol1004 endpoint for now

https://gerrit.wikimedia.org/r/513097

Mentioned in SAL (#wikimedia-operations) [2019-05-29T12:33:36Z] <Zppix> [11:58:16] <arturo> T221770 icinga downtime cloudcontrol1003.wikimedia.org for upcoming rebuild as stretch

Change 513117 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nova: more swapping of cloudcontrol1003/1004

https://gerrit.wikimedia.org/r/513117

Change 513117 merged by Andrew Bogott:
[operations/puppet@production] nova: more swapping of cloudcontrol1003/1004

https://gerrit.wikimedia.org/r/513117

Change 513134 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudcontrol1003: install with Debian Stretch

https://gerrit.wikimedia.org/r/513134

Change 513134 merged by Andrew Bogott:
[operations/puppet@production] cloudcontrol1003: install with Debian Stretch

https://gerrit.wikimedia.org/r/513134

Mentioned in SAL (#wikimedia-operations) [2019-05-29T14:45:32Z] <andrewbogott> reimaging cloudcontrol1003 T221770

Andrew claimed this task.

Other than one incorrect icinga warning (addressed by https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/513276/), this is done and things seem to be working fine.