
Upgrade cloudcontrol1003/1004 to stretch/mitaka
Closed, Resolved · Public

Description

The failover process for this will be a bit messy but it should be possible to do this without a lot of downtime.

Event Timeline

My proposal is to use the opportunity to try some things in the same window (a quick check of the new endpoint is sketched after this list):

  • let's move the API endpoints to the external proxy. Use the new cloud domain for them. Example FQDN: eqiad1.api.wmfcloud.org (I don't remember the actual domain name)
  • point this FQDN to the active server, which is cloudcontrol1003 currently
  • update every script to use this new endpoint. At least, the important ones.
  • upgrade cloudcontrol1004 to stretch. Probably reimage it.
  • try failing over from cloudcontrol1003 to cloudcontrol1004, with the new API endpoint repointed to the new server. This will be very disruptive even if only the control plane is affected. See what can be improved here.
  • upgrade cloudcontrol1003. Probably reimage it.
  • Cleanup
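
A quick way to sanity-check the first two bullets once the record exists (sketch only; the FQDN is the half-remembered example above, and the port/scheme depend on how the external proxy ends up configured):

$ dig +short eqiad1.api.wmfcloud.org                             # should resolve to the active server (cloudcontrol1003 today)
$ curl -si http://eqiad1.api.wmfcloud.org:5000/v3/ | head -n1    # keystone answering on the service name
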
  • let's move the API endpoints to the external proxy.

This sounded strange to me at first, but then I looked at the current /etc/novaobserver.yaml config and realized that in the eqiad1-r region we are using http://cloudcontrol1003.wikimedia.org:5000/v3 as the auth URL. So as I understand it now this change would be primarily putting a service name in place (which seems like a really really good idea).
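
For context, the host pinning described above can be seen directly (the file path is from this comment; the wmcs-openstack wrapper appears later in this task; run on a cloudcontrol host):

$ grep -n 'cloudcontrol1003' /etc/novaobserver.yaml    # the auth URL pinned to a specific host
$ sudo wmcs-openstack catalog list                     # the same pinning, as seen in the keystone service catalog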

Example FQDN: eqiad1.api.wmfcloud.org (I don't remember the actual domain name)

wmcloud.org is the domain that I eventually expect to replace wmflabs.org.

There is a lot of potential for bikeshedding conversations about the FQDN to use for this service name. I am assuming that we will be rolling out the use of *.wmcloud.org to Cloud VPS projects via Designate (for example login.tools.wmflabs.org transitioning to login.tools.wmcloud.org) and domainproxy (for example wdqs-test.wmflabs.org transitioning to wdqs-test.wmcloud.org). This isn't on the surface a problem, but it does mean that whatever sub-domain we pick for holding public service names will also be blocking that sub-domain from use as a domainproxy name and as a project managed sub-domain.

We are currently using *.svc.eqiad.wmflabs for internal service names. I'm wondering if it would make things easier to remember in the long term if we took *.svc.wmcloud.org as the base domain for public service names? And then maybe use something like eqiad1.openstack.svc.wmcloud.org for the eqiad1-r region OpenStack API endpoints? I think this gives a lot more semantic meaning to the URLs with svc being a pretty clear signal that this is a service name that can float from host to host, and openstack namespacing things off from any other general class of services we may end up publishing.

  • update every script to use this new endpoint. At least, the important ones.

A grep for 'cloudcontrol1003.wikimedia.org:5000' in ops/puppet.git brings back a pretty short list of files, so we should probably be able to change all of them. Anything that we find while doing that which is not reading the value from a config file (or at least the profile::openstack::base::keystone_host hiera variable) should probably be noted with a phab task to come back and clean things up later.

$ git grep -l 'cloudcontrol1003.wikimedia.org:5000'
hieradata/role/eqiad/wmcs/openstack/eqiad1/labweb.yaml
modules/graphite/files/archive-instances
modules/labstore/files/nfs-exportd.py
modules/openstack/files/mitaka/admin_scripts/wmcs-prod-example.sh
modules/profile/files/toolforge/clush/tools-clush-generator
modules/profile/manifests/wmcs/prometheus.pp
modules/role/files/toollabs/clush/tools-clush-generator
modules/sonofgridengine/files/grid_configurator/grid_configurator.py

We are currently using *.svc.eqiad.wmflabs for internal service names. I'm wondering if it would make things easier to remember in the long term if we took *.svc.wmcloud.org as the base domain for public service names? And then maybe use something like eqiad1.openstack.svc.wmcloud.org for the eqiad1-r region OpenStack API endpoints? I think this gives a lot more semantic meaning to the URLs with svc being a pretty clear signal that this is a service name that can float from host to host, and openstack namespacing things off from any other general class of services we may end up publishing.

I agree with everything you said. My only suggestion is perhaps s/openstack/cloudvps/g, i.e., eqiad1.cloudvps.svc.wmcloud.org. But I don't have a strong opinion.

A related question is where the authoritative DNS server for this subdomain ((openstack|cloudvps).svc.wmcloud.org) is going to live. For proper error handling (when Cloud VPS is down), it should probably live outside designate.
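
Once the zone exists, checking where it is actually served from is a one-liner (sketch; svc.wmcloud.org is still only a proposal at this point):

$ dig +short NS wmcloud.org        # who is authoritative for the parent zone
$ dig +short NS svc.wmcloud.org    # whether the svc subzone is delegated separately (e.g. to designate or kept outside it)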

  • update every script to use this new endpoint. At least, the important ones.

A grep for 'cloudcontrol1003.wikimedia.org:5000' in ops/puppet.git brings back a pretty short list of files, so we should probably be able to change all of them. Anything that we find while doing that which is not reading the value from a config file (or at least the profile::openstack::base::keystone_host hiera variable) should probably be noted with a phab task to come back and clean things up later.

It's more than that: it's not only the keystone FQDN/endpoint. I would consider changing all of the public openstack endpoints and decoupling them from a concrete cloudcontrol server:

aborrero@cloudcontrol1004:~ 3s $ sudo wmcs-openstack endpoint list | grep public
| 0a1eb902933c4652ad41e6450fe436ee | eqiad1-r | neutron      | network      | True    | public    | http://cloudcontrol1003.wikimedia.org:9696                             |
| 4578c49346db479ab6d5b7961af8f60a | eqiad1-r | nova         | compute      | True    | public    | http://cloudcontrol1003.wikimedia.org:8774/v2.1                        |
| 465f331e03de4bfcbef1f8a7dbb4de15 | eqiad1-r | designate    | dns          | True    | public    | http://cloudservices1003.wikimedia.org:9001                            |
| be7a84a1af114f94bd1d6cc48b374413 | eqiad1-r | keystone     | identity     | True    | public    | http://cloudcontrol1003.wikimedia.org:5000/v3                          |
| ebf523fce03a4ff4b859277f0a3d2477 | eqiad1-r | proxy        | proxy        | True    | public    | http://proxy-eqiad1.wmflabs.org:5668/dynamicproxy-api/v1/$(tenant_id)s |
| f922afd9417448028ce02734e0420a0b | eqiad1-r | glance       | image        | True    | public    | http://cloudcontrol1003.wikimedia.org:9292                             |

I know this is a really big change. We should take some time to think about the consequences this has for how openstack itself works.

But right now the situation is that we likely require raw database updates if we want to switch cloudcontrol servers (either a controlled failover or disaster recovery).
It would be great to decouple this and get closer to a scenario in which we can fail over the control plane just by flipping a hiera key (to change the proxy backend and start the daemons).

It's more or less obvious that we can follow this approach for the openstack public endpoints. I'm not so sure about the private/admin endpoints.
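
As a sketch of what decoupling could look like, the public catalog entries above can be repointed with the CLI rather than raw database edits (assuming the keystone v3 catalog, which the /v3 auth URL suggests; the service FQDN below is just a placeholder from the naming discussion, and whether the identity endpoint itself needs extra care is part of what would have to be thought through):

$ sudo wmcs-openstack endpoint set \
      --url http://eqiad1.openstack.svc.wmcloud.org:9696 \
      0a1eb902933c4652ad41e6450fe436ee               # neutron's public endpoint ID from the listing above
$ sudo wmcs-openstack endpoint list | grep public    # confirm nothing is still pinned to a specific cloudcontrol host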

Progress on HA for openstack APIs is stalled pending discussion about naming. I'm going to decouple this task from that one for now.

My current plan for upgrading is as follows (a quick sanity check for steps 2-3 is sketched after the list):

  1. re-image the secondaries (cloudcontrol1004, cloudservices1004, labnet1004) with stretch
  2. update keystone endpoint catalog to point to secondary servers
  3. apply a puppet patch pointing all openstack users to cloudcontrol1004
  4. re-image primaries
  5. revert the patch in step 3
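
A quick sanity check between steps 3 and 4, before re-imaging the primary (reusing the wrapper and repo paths seen earlier in this task):

$ sudo wmcs-openstack endpoint list | grep cloudcontrol1003    # should come back empty once step 2 is done
$ git grep -l 'cloudcontrol1003.wikimedia.org'                 # in an ops/puppet.git checkout; anything left is still pinned to the old primary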

Change 512492 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudcontrol1004: move to Stretch

https://gerrit.wikimedia.org/r/512492

Change 512492 merged by Andrew Bogott:
[operations/puppet@production] cloudcontrol1004: move to Stretch

https://gerrit.wikimedia.org/r/512492

I have left 1004 in a somewhat strange state pending a fix for T224345, but the cluster is still working and most services are still active/active between 1003 and 1004.

Volans triaged this task as High priority. May 27 2019, 8:24 AM
Volans subscribed.

cloudcontrol1003 is flapping its systemd degraded alert since 2019-05-25 21:46. The unit that fails is:

● designate_floating_ip_ptr_records_updater.service               loaded failed failed    Designate Floating IP PTR records updater

I downtimed this check in icinga for 1 day.
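
For reference, the standard triage for a failed unit like this one (generic systemd commands; only the unit name is specific to this alert):

$ sudo systemctl status designate_floating_ip_ptr_records_updater.service
$ sudo journalctl -u designate_floating_ip_ptr_records_updater.service --since '2019-05-25 21:00' | tail -n 50
$ sudo systemctl reset-failed designate_floating_ip_ptr_records_updater.service    # clear the degraded state once the underlying issue is fixed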

The rabbitmq cluster that I set up last week doesn't work across the Stretch/Jessie divide. So for now, I'm making cloudcontrol1004 the only active rabbit provider, pending a rebuild of cloudcontrol1003.

In the meantime, rabbitmq will be stopped on cloudcontrol1003, and puppet disabled there.
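
Roughly the manual steps that implies, as a sketch (standard puppet and rabbitmq commands; the disable message is illustrative):

$ sudo puppet agent --disable 'T221770: rabbitmq stopped pending cloudcontrol1003 rebuild'    # on cloudcontrol1003
$ sudo systemctl stop rabbitmq-server                                                         # on cloudcontrol1003
$ sudo rabbitmqctl cluster_status                      # on cloudcontrol1004, to confirm it is the only running node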

Change 512954 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Make cloudcontrol1004 the primary keystone host

https://gerrit.wikimedia.org/r/512954

Change 512954 merged by Andrew Bogott:
[operations/puppet@production] Make cloudcontrol1004 the primary keystone host

https://gerrit.wikimedia.org/r/512954

Icinga downtime for 8:00:00 set by aborrero@cumin1001 on 1 host(s) and their services with reason: rebuilding server

cloudcontrol1003.wikimedia.org

Change 513097 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nfs-exportd: use cloudcontrol1004 endpoint for now

https://gerrit.wikimedia.org/r/513097

Change 513097 merged by Andrew Bogott:
[operations/puppet@production] nfs-exportd: use cloudcontrol1004 endpoint for now

https://gerrit.wikimedia.org/r/513097

Mentioned in SAL (#wikimedia-operations) [2019-05-29T12:33:36Z] <Zppix> [11:58:16] <arturo> T221770 icinga downtime cloudcontrol1003.wikimedia.org for upcoming rebuild as stretch

Change 513117 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nova: more swapping of cloudcontrol1003/1004

https://gerrit.wikimedia.org/r/513117

Change 513117 merged by Andrew Bogott:
[operations/puppet@production] nova: more swapping of cloudcontrol1003/1004

https://gerrit.wikimedia.org/r/513117

Change 513134 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudcontrol1003: install with Debian Stretch

https://gerrit.wikimedia.org/r/513134

Change 513134 merged by Andrew Bogott:
[operations/puppet@production] cloudcontrol1003: install with Debian Stretch

https://gerrit.wikimedia.org/r/513134

Mentioned in SAL (#wikimedia-operations) [2019-05-29T14:45:32Z] <andrewbogott> reimaging cloudcontrol1003 T221770

Andrew claimed this task.

Other than one incorrect icinga warning (addressed by https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/513276/), this is done and things seem to be working fine.