
cloudvps: main/eqiad1 keystone merge
Closed, Resolved (Public)

Description

This task is to track all the work related to the keystone merge in the main/eqiad1 openstack deployments.
An email has been sent to the cloud-announce mailing list. Operations are scheduled for Monday, 13 August 2018 at 14:00 UTC (the D day).

Before D day:

  • announcements
  • prepare network connectivity: ACL, firewalling, etc
  • prepare new endpoints
  • prepare DB grants and any other SQL commands for the D day
  • prepare puppet patch to introduce the new keystone server/daemon @ cloudcontrol1003.wikimedia.org (https://gerrit.wikimedia.org/r/451314)
  • re-check nova_controller hiera keys (https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/452345/)
  • other DB operations (T201674)
  • review the schedule and plan for the concrete operations on D day, including rollback instructions

On D day:

The expected outcome of the operations is:

  • keystone daemon is running on cloudcontrol1003.wikimedia.org
  • keystone database is on m5
  • keystone has support for 2 regions: eqiad (the old main deployment region) and eqiad1-r (the new eqiad1 deployment region)
  • keystone database contains endpoints for both deployments
  • no local keystone database (mysql) exists in cloudcontrol1003.wikimedia.org or cloudcontrol1004.wikimedia.org
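
The endpoint-related outcomes above could be checked after the switchover with something like the following. This is only a sketch: the dict shape (`region`, `url`, `enabled`) and the sample rows are assumptions made up for illustration, not the real keystone schema or catalog, and the old host name is the one that appears in the SQL further down in this task.

```python
from typing import Dict, List

# Region names taken from the outcome list above.
EXPECTED_REGIONS = {"eqiad", "eqiad1-r"}
# Old keystone host, as it appears in the SQL queries in this task.
OLD_HOST = "labcontrol1001.wikimedia.org"


def check_endpoints(endpoints: List[Dict[str, object]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # Every expected region needs at least one enabled endpoint.
    enabled_regions = {e["region"] for e in endpoints if e["enabled"]}
    for region in sorted(EXPECTED_REGIONS - enabled_regions):
        problems.append(f"no enabled endpoint for region {region}")
    # No endpoint should still reference the old host.
    for e in endpoints:
        if OLD_HOST in str(e["url"]):
            problems.append(f"endpoint still points at {OLD_HOST}: {e['url']}")
    return problems


# Hypothetical snapshot: eqiad1-r endpoint created but not yet enabled.
sample = [
    {"region": "eqiad",
     "url": "http://cloudcontrol1003.wikimedia.org:5000/v3",
     "enabled": True},
    {"region": "eqiad1-r",
     "url": "http://cloudcontrol1003.wikimedia.org:5000/v3",
     "enabled": False},
]
problems = check_endpoints(sample)
print(problems)
```

With the sample snapshot, the only reported problem is the missing enabled endpoint for eqiad1-r, which matches the "created, but disabled" state described for the new endpoints before D day.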

I will be using this etherpad to paste things and track ongoing work: https://etherpad.wikimedia.org/p/cloudvps-keystone-merge

Related docs:
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/keystone_notes
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron_Notes/phases
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Deployments

Event Timeline

aborrero triaged this task as Medium priority. (Aug 8 2018, 11:30 AM)
aborrero created this task.
aborrero updated the task description.
aborrero updated the task description.
aborrero updated the task description.

Change 451314 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudvps: merge main/eqiad1 keystone services

https://gerrit.wikimedia.org/r/451314

New endpoints are ready. On D day, we need to:

  • enable eqiad1 endpoints (they have been created, but disabled)
  • run SQL queries to update main endpoints

In m5-master:

MariaDB [keystone]> UPDATE endpoint set url='http://cloudcontrol1003.wikimedia.org:5000/v3' WHERE url='http://labcontrol1001.wikimedia.org:5000/v3';
MariaDB [keystone]> UPDATE endpoint set url='http://cloudcontrol1003.wikimedia.org:35357/v3' WHERE url='http://labcontrol1001.wikimedia.org:35357/v3';

In case of rollback:

MariaDB [keystone]> UPDATE endpoint set url='http://labcontrol1001.wikimedia.org:5000/v3' WHERE url='http://cloudcontrol1003.wikimedia.org:5000/v3';
MariaDB [keystone]> UPDATE endpoint set url='http://labcontrol1001.wikimedia.org:35357/v3' WHERE url='http://cloudcontrol1003.wikimedia.org:35357/v3';
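
Because the rollback queries are just the forward queries with the hosts swapped, both sets can be generated from a single definition, which avoids copy-paste mistakes in the URLs. A minimal sketch, assuming only the two hosts and two ports that appear in the queries above (the helper names are made up for this example):

```python
from typing import List

OLD_HOST = "labcontrol1001.wikimedia.org"
NEW_HOST = "cloudcontrol1003.wikimedia.org"
PORTS = (5000, 35357)  # the two ports used in the queries above


def endpoint_url(host: str, port: int) -> str:
    """Build a keystone v3 URL in the form used by the endpoint table."""
    return f"http://{host}:{port}/v3"


def update_statements(src_host: str, dst_host: str) -> List[str]:
    """One UPDATE per port, rewriting src_host URLs to dst_host URLs."""
    return [
        "UPDATE endpoint SET url='{dst}' WHERE url='{src}';".format(
            dst=endpoint_url(dst_host, port),
            src=endpoint_url(src_host, port),
        )
        for port in PORTS
    ]


forward = update_statements(OLD_HOST, NEW_HOST)
rollback = update_statements(NEW_HOST, OLD_HOST)

for statement in forward + rollback:
    print(statement)
```

The script only prints the statements; they would still be reviewed and pasted into the MariaDB session on m5-master by hand.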

Change 451850 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudvps: keystone: allow IPv6 connections from foreign services

https://gerrit.wikimedia.org/r/451850

Change 451850 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudvps: keystone: allow IPv6 connections from foreign services

https://gerrit.wikimedia.org/r/451850

aborrero updated the task description.
Andrew updated the task description.

Change 452345 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudvps: use keytone_host instead of nova_controller

https://gerrit.wikimedia.org/r/452345

aborrero updated the task description.
aborrero updated the task description.

Change 452367 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Horizon: disable during keystone switchover

https://gerrit.wikimedia.org/r/452367

Change 452367 merged by Andrew Bogott:
[operations/puppet@production] Horizon: disable during keystone switchover

https://gerrit.wikimedia.org/r/452367

aborrero updated the task description.

Mentioned in SAL (#wikimedia-operations) [2018-08-13T14:09:48Z] <andrewbogott> stopping nodepool, downtiming horizon for T201504

Mentioned in SAL (#wikimedia-operations) [2018-08-13T14:10:48Z] <arturo> T201504 disable keystone in main and eqiad1 deployments, all has been downtimed in icinga

Change 451314 merged by Andrew Bogott:
[operations/puppet@production] cloudvps: merge main/eqiad1 keystone services

https://gerrit.wikimedia.org/r/451314

Change 452345 merged by Andrew Bogott:
[operations/puppet@production] cloudvps: use keytone_host instead of nova_controller

https://gerrit.wikimedia.org/r/452345

Change 452381 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] keystone: pass the actual keystone host to openstack::util::envscripts

https://gerrit.wikimedia.org/r/452381

Change 452381 merged by Andrew Bogott:
[operations/puppet@production] keystone: pass the actual keystone host to openstack::util::envscripts

https://gerrit.wikimedia.org/r/452381

Change 452389 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloud vps: main: restore envscripts

https://gerrit.wikimedia.org/r/452389

Change 452393 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Pass keystone_host to profile::openstack::main::nova::common

https://gerrit.wikimedia.org/r/452393

Change 452393 merged by Andrew Bogott:
[operations/puppet@production] Pass keystone_host to profile::openstack::main::nova::common

https://gerrit.wikimedia.org/r/452393

Change 452389 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloud vps: main: restore envscripts and adminscripts

https://gerrit.wikimedia.org/r/452389

Change 452427 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] labs-ip-alias-dump.py: make multi-region

https://gerrit.wikimedia.org/r/452427

Change 452427 merged by Andrew Bogott:
[operations/puppet@production] labs-ip-alias-dump.py: make multi-region

https://gerrit.wikimedia.org/r/452427

Change 452435 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] labs-ip-alias-dump.py.trusty: make multi-region

https://gerrit.wikimedia.org/r/452435

Change 452435 merged by Andrew Bogott:
[operations/puppet@production] labs-ip-alias-dump.py.trusty: make multi-region

https://gerrit.wikimedia.org/r/452435

Change 452448 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nfs-exportd: fix our call to region.list()

https://gerrit.wikimedia.org/r/452448

Change 452448 merged by Bstorm:
[operations/puppet@production] nfs-exportd: fix our call to region.list()

https://gerrit.wikimedia.org/r/452448

Change 452457 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Horizon: set DEFAULT_SERVICE_REGION to eqiad

https://gerrit.wikimedia.org/r/452457

Change 452457 merged by Andrew Bogott:
[operations/puppet@production] Horizon: set DEFAULT_SERVICE_REGION to eqiad

https://gerrit.wikimedia.org/r/452457

There were two categories of issues that cropped up during this change:

  1. Services still trying to hit labcontrol1001. Some of these were puppet mishaps (our hiera model is very tangled) and some were unpuppetized services (e.g. openstack-browser) that needed to be told about the new host.
  2. Basically any custom code that enumerated instances immediately started enumerating in the new region (eqiad1-r) and coming up empty. In retrospect, this was basically a coin toss: we got (un)lucky, and the default region didn't change when we made this merge in labtest, so I wasn't expecting it.
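
The second category comes down to code relying on an implicit default region. One way to avoid it is to name the regions explicitly and query each in turn. The sketch below is not the code from any of the linked patches: `list_instances` stands for whatever per-region client call a given script uses, and the inventory is invented for the example.

```python
from typing import Callable, Dict, Iterable, List

# Region names from the task description.
REGIONS = ("eqiad", "eqiad1-r")


def enumerate_all_regions(
    list_instances: Callable[[str], Iterable[str]],
    regions: Iterable[str] = REGIONS,
) -> Dict[str, List[str]]:
    """Call list_instances once per region and collect results by region."""
    return {region: list(list_instances(region)) for region in regions}


# Stand-in for a real per-region client call (hypothetical data).
_FAKE_INVENTORY = {"eqiad": ["vm-a", "vm-b"], "eqiad1-r": []}


def fake_list_instances(region: str) -> List[str]:
    return _FAKE_INVENTORY[region]


by_region = enumerate_all_regions(fake_list_instances)
total = sum(len(names) for names in by_region.values())
print(by_region, total)
```

Keeping the result keyed by region also makes an unexpectedly empty region visible, rather than silently returning an empty list for the whole deployment.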

Lots of additional patches (most of them linked above) have addressed all the problems we've found so far.

This change was a huge blocker for the neutron migration -- I'm thrilled to have it done.

Change 452668 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[labs/striker@master] striker/striker.ini: refresh keystone URL

https://gerrit.wikimedia.org/r/452668

Change 452668 merged by jenkins-bot:
[labs/striker@master] striker/striker.ini: refresh keystone URL

https://gerrit.wikimedia.org/r/452668

Change 456288 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] striker: Point at cloudcontrol1003 for OpenStack APIs

https://gerrit.wikimedia.org/r/456288

Change 456288 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] striker: Point at cloudcontrol1003 for OpenStack APIs

https://gerrit.wikimedia.org/r/456288

Change 476232 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: eqiad1: activate keystone extra services

https://gerrit.wikimedia.org/r/476232