
cloudvps: main/eqiad1 keystone merge
Closed, Resolved (Public)

Description

This task tracks all the work related to the keystone merge in the main/eqiad1 OpenStack deployments.
An email has been sent to the cloud-announce mailing list; operations are scheduled for next Monday, 13 August, at 14:00 UTC (the D day).

Before D day:

  • announcements
  • prepare network connectivity: ACLs, firewalling, etc.
  • prepare new endpoints
  • prepare DB grants and any other SQL commands for the D day
  • prepare puppet patch to introduce the new keystone server/daemon @ cloudcontrol1003.wikimedia.org (https://gerrit.wikimedia.org/r/451314)
  • re-check nova_controller hiera keys (https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/452345/)
  • other DB operations (T201674)
  • review the schedule for the concrete steps on D day, including rollback instructions

On D day:

The expected outcome of the operations is:

  • keystone daemon is running on cloudcontrol1003.wikimedia.org
  • keystone database is on m5
  • keystone has support for 2 regions: eqiad (the old main deployment region) and eqiad1-r (the new eqiad1 deployment region)
  • keystone database contains endpoints for both deployments
  • no local keystone database (mysql) exists in cloudcontrol1003.wikimedia.org or cloudcontrol1004.wikimedia.org
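
The region-related outcomes above could be checked after the switchover with something like the following. This is only a minimal sketch: the shape of the endpoint records (dicts with "region" and "url" keys), the helper name missing_regions and the sample data are assumptions made for illustration, not something taken from keystone or from the deployment.

```python
# Regions that keystone is expected to serve after the merge (from the list above).
EXPECTED_REGIONS = ("eqiad", "eqiad1-r")


def missing_regions(endpoints, expected=EXPECTED_REGIONS):
    """Return the expected regions that have no endpoint in the given list.

    `endpoints` is assumed to be a list of dicts with a "region" key;
    this record shape is hypothetical.
    """
    present = {ep["region"] for ep in endpoints}
    return [region for region in expected if region not in present]


# Made-up sample data, for illustration only.
sample = [
    {"region": "eqiad", "url": "http://keystone.example.org:5000/v3"},
    {"region": "eqiad1-r", "url": "http://keystone.example.org:5000/v3"},
]

result = missing_regions(sample)        # both regions covered
partial = missing_regions(sample[:1])   # only eqiad present

if __name__ == "__main__":
    print("missing regions:", result)
    print("missing regions (partial catalog):", partial)
```

An empty result means every expected region has at least one endpoint; it says nothing about whether those endpoints are enabled or reachable.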

I will be using this etherpad to paste things and track ongoing work: https://etherpad.wikimedia.org/p/cloudvps-keystone-merge

Related docs:
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/keystone_notes
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron_Notes/phases
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Deployments

Event Timeline

aborrero triaged this task as Medium priority.
aborrero updated the task description.

Change 451314 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudvps: merge main/eqiad1 keystone services

https://gerrit.wikimedia.org/r/451314

New endpoints are ready. On D day, we need to:

  • enable eqiad1 endpoints (they have been created but are still disabled)
  • run SQL queries to update main endpoints

In m5-master:

MariaDB [keystone]> UPDATE endpoint set url='http://cloudcontrol1003.wikimedia.org:5000/v3' WHERE url='http://labcontrol1001.wikimedia.org:5000/v3';
MariaDB [keystone]> UPDATE endpoint set url='http://cloudcontrol1003.wikimedia.org:35357/v3' WHERE url='http://labcontrol1001.wikimedia.org:35357/v3';

In case of rollback:

MariaDB [keystone]> UPDATE endpoint set url='http://labcontrol1001.wikimedia.org:5000/v3' WHERE url='http://cloudcontrol1003.wikimedia.org:5000/v3';
MariaDB [keystone]> UPDATE endpoint set url='http://labcontrol1001.wikimedia.org:35357/v3' WHERE url='http://cloudcontrol1003.wikimedia.org:35357/v3';
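
Since the rollback queries are the forward queries with the two hosts swapped, both sets could be generated from a single mapping so they cannot drift apart. A minimal sketch, assuming only the hosts, ports and URL layout shown in the queries above; the helper name endpoint_updates is made up for illustration:

```python
# Hosts and ports taken from the queries above.
OLD_HOST = "labcontrol1001.wikimedia.org"
NEW_HOST = "cloudcontrol1003.wikimedia.org"
PORTS = (5000, 35357)


def endpoint_updates(src_host, dst_host, ports=PORTS):
    """Return one UPDATE statement per port, moving endpoint URLs from src_host to dst_host.

    Hypothetical helper: it only builds strings, it does not talk to the database.
    """
    statements = []
    for port in ports:
        src = f"http://{src_host}:{port}/v3"
        dst = f"http://{dst_host}:{port}/v3"
        statements.append(f"UPDATE endpoint SET url='{dst}' WHERE url='{src}';")
    return statements


forward = endpoint_updates(OLD_HOST, NEW_HOST)
rollback = endpoint_updates(NEW_HOST, OLD_HOST)

if __name__ == "__main__":
    print("-- forward")
    print("\n".join(forward))
    print("-- rollback")
    print("\n".join(rollback))
```

The output is meant to be reviewed and pasted by hand into the m5-master session, as with the queries above.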

Change 451850 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudvps: keystone: allow IPv6 connections from foreign services

https://gerrit.wikimedia.org/r/451850

Change 451850 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudvps: keystone: allow IPv6 connections from foreign services

https://gerrit.wikimedia.org/r/451850

Andrew updated the task description.

Change 452345 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudvps: use keytone_host instead of nova_controller

https://gerrit.wikimedia.org/r/452345

Change 452367 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Horizon: disable during keystone switchover

https://gerrit.wikimedia.org/r/452367

Change 452367 merged by Andrew Bogott:
[operations/puppet@production] Horizon: disable during keystone switchover

https://gerrit.wikimedia.org/r/452367

Mentioned in SAL (#wikimedia-operations) [2018-08-13T14:09:48Z] <andrewbogott> stopping nodepool, downtiming horizon for T201504

Mentioned in SAL (#wikimedia-operations) [2018-08-13T14:10:48Z] <arturo> T201504 disable keystone in main and eqiad1 deployments, all has been downtimed in icinga

Change 451314 merged by Andrew Bogott:
[operations/puppet@production] cloudvps: merge main/eqiad1 keystone services

https://gerrit.wikimedia.org/r/451314

Change 452345 merged by Andrew Bogott:
[operations/puppet@production] cloudvps: use keytone_host instead of nova_controller

https://gerrit.wikimedia.org/r/452345

Change 452381 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] keystone: pass the actual keystone host to openstack::util::envscripts

https://gerrit.wikimedia.org/r/452381

Change 452381 merged by Andrew Bogott:
[operations/puppet@production] keystone: pass the actual keystone host to openstack::util::envscripts

https://gerrit.wikimedia.org/r/452381

Change 452389 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloud vps: main: restore envscripts

https://gerrit.wikimedia.org/r/452389

Change 452393 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Pass keystone_host to profile::openstack::main::nova::common

https://gerrit.wikimedia.org/r/452393

Change 452393 merged by Andrew Bogott:
[operations/puppet@production] Pass keystone_host to profile::openstack::main::nova::common

https://gerrit.wikimedia.org/r/452393

Change 452389 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloud vps: main: restore envscripts and adminscripts

https://gerrit.wikimedia.org/r/452389

Change 452427 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] labs-ip-alias-dump.py: make multi-region

https://gerrit.wikimedia.org/r/452427

Change 452427 merged by Andrew Bogott:
[operations/puppet@production] labs-ip-alias-dump.py: make multi-region

https://gerrit.wikimedia.org/r/452427

Change 452435 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] labs-ip-alias-dump.py.trusty: make multi-region

https://gerrit.wikimedia.org/r/452435

Change 452435 merged by Andrew Bogott:
[operations/puppet@production] labs-ip-alias-dump.py.trusty: make multi-region

https://gerrit.wikimedia.org/r/452435

Change 452448 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nfs-exportd: fix our call to region.list()

https://gerrit.wikimedia.org/r/452448

Change 452448 merged by Bstorm:
[operations/puppet@production] nfs-exportd: fix our call to region.list()

https://gerrit.wikimedia.org/r/452448

Change 452457 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Horizon: set DEFAULT_SERVICE_REGION to eqiad

https://gerrit.wikimedia.org/r/452457

Change 452457 merged by Andrew Bogott:
[operations/puppet@production] Horizon: set DEFAULT_SERVICE_REGION to eqiad

https://gerrit.wikimedia.org/r/452457

There were two categories of issues that cropped up during this change:

  1. Services still trying to hit labcontrol1001. Some of these were puppet mishaps (our hiera model is very tangly) and some of them were unpuppetized services (e.g. openstack-browser) that needed to be informed.
  2. Basically any custom code that enumerated instances immediately started enumerating in the new region (eqiad1-r) and coming up empty. In retrospect, this was a coin-toss -- we got (un)lucky and the default region didn't change when we made this merge in labtest, so I wasn't expecting it.
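
The second category comes down to code relying on whatever region happens to be the default. One way to avoid that is to iterate over every region explicitly. The following is a minimal sketch of the idea, not the code from the patches linked above: list_regions and list_instances are hypothetical callables standing in for whatever client calls a given script uses, and the sample data is made up.

```python
def list_all_instances(list_regions, list_instances):
    """Collect instances from every region instead of relying on the default one.

    list_regions:   hypothetical callable returning an iterable of region names.
    list_instances: hypothetical callable taking a region name and returning
                    the instances found in that region.
    """
    instances = []
    for region in list_regions():
        instances.extend(list_instances(region))
    return instances


# Made-up data: all instances live in eqiad, the new region is still empty.
fake_cloud = {"eqiad": ["vm-a", "vm-b"], "eqiad1-r": []}

all_instances = list_all_instances(
    lambda: sorted(fake_cloud),
    lambda region: fake_cloud[region],
)

if __name__ == "__main__":
    print(all_instances)
```

With this shape, a change of default region does not change the result, because no region is ever left implicit.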

Lots of additional patches (most of them linked above) have addressed all the problems we've found so far.

This change was a huge blocker for the neutron migration -- I'm thrilled to have it done.

Change 452668 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[labs/striker@master] striker/striker.ini: refresh keystone URL

https://gerrit.wikimedia.org/r/452668

Change 452668 merged by jenkins-bot:
[labs/striker@master] striker/striker.ini: refresh keystone URL

https://gerrit.wikimedia.org/r/452668

Change 456288 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] striker: Point at cloudcontrol1003 for OpenStack APIs

https://gerrit.wikimedia.org/r/456288

Change 456288 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] striker: Point at cloudcontrol1003 for OpenStack APIs

https://gerrit.wikimedia.org/r/456288

Change 476232 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: eqiad1: activate keystone extra services

https://gerrit.wikimedia.org/r/476232