
cloudvps: dedicated openstack database
Closed, Resolved · Public

Description

From T188589 and T202549 it is clear that OpenStack's database usage can be very heavy.

We added some limits, constraints, throttles and other measures, and for now we appear to be in good shape.
But in the long term, as our usage of OpenStack grows, we will eventually need a dedicated database.
Also, @Bstorm pointed out that some design concepts/components in OpenStack assume dedicated, high-capacity database resources.

We have observed in the past that sharing m5-master.eqiad.wmnet can lead to an OpenStack usage peak affecting other services.
So, if only for isolation purposes, we should consider having a dedicated OpenStack database.

Let this ticket be the tracking/discussion point for this subject.

Event Timeline

Marostegui triaged this task as Medium priority. Aug 29 2018, 6:47 AM
Marostegui moved this task from Triage to Meta/Epic on the DBA board.

My thoughts on this:
I agree that, long term, ideally we shouldn't share this service with wikitech, as they can affect each other when we have issues (most likely connection bursts, as we have seen in the past).
In order to isolate the two services, I think we have two options:

  1. Move OpenStack to a different instance on the same host.
  • Convert db1073 to be a multi-instance master (this requires some puppet work, e.g. pt-heartbeat). We do have slaves running multi-instance, but we don't have any multi-instance master.
  • OpenStack config files would need to allow a port other than 3306, as that would no longer be the port MySQL listens on for OpenStack (see the sketch after this list).
  • m5 currently doesn't use HAProxy, but ideally it should, so we'd also need to check how to adapt HAProxy for multi-instance master failovers.
  2. Move OpenStack to a different physical database server.
  • This provides the best isolation level, as OpenStack would have its own set of servers.
  • This requires hardware purchases (a minimum of 4 DB servers: 2 eqiad + 2 codfw) and thus would involve budget owners.
  • The current standard spec for DBs (512 GB RAM + 10x SSDs) would be total overkill for this. New specs would need to be defined.
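For option 1, the OpenStack-side change is mostly about the connection URI: each service's [database] connection setting already embeds host and port, so it could simply point at the non-default port of a multi-instance master. A minimal illustrative check, assuming pymysql is available; the port 3315 and the credentials below are made up for the example, not a proposal:

```python
# Illustrative connectivity check against a hypothetical multi-instance master
# listening on a non-default port (3315 is invented for this example).
# The services themselves would carry the same host:port in their
# [database] connection URI, e.g. mysql+pymysql://nova:****@db1073.eqiad.wmnet:3315/nova
import pymysql

conn = pymysql.connect(host="db1073.eqiad.wmnet", port=3315,
                       user="nova", password="********", database="nova")
with conn.cursor() as cur:
    cur.execute("SELECT @@port, @@hostname")
    print(cur.fetchone())  # confirms we reached the instance on the expected port
conn.close()
```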

Good catch - forgot about that task

That would pretty much unblock this task, and would leave m5 essentially just for OpenStack and Cloud things

Change 463789 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Nova: reduce number of worker nodes

https://gerrit.wikimedia.org/r/463789

Change 463789 merged by Andrew Bogott:
[operations/puppet@production] Nova: reduce number of worker nodes

https://gerrit.wikimedia.org/r/463789

aborrero raised the priority of this task from Medium to High. Nov 14 2018, 1:44 PM
aborrero moved this task from Inbox to Needs discussion on the cloud-services-team (Kanban) board.

Bump. I would like to see if we have any chance of moving forward with this.

What do you prefer? What would you recommend?

Also, @bd808 what do you think?

What do you prefer? What would you recommend?

If I had to choose, I would move this to dedicated infrastructure managed by the Cloud Team, so it can live on its own servers and be managed by those who really know what's going on behind the scenes in terms of future growth, scaling, expansion and whatnot.
We could, of course, help with the specs, installation and migration.

Also, @bd808 what do you think?

If I had to choose, I would move this to dedicated infrastructure managed by the Cloud Team, so it can live on its own servers and be managed by those who really know what's going on behind the scenes in terms of future growth, scaling, expansion and whatnot.
We could, of course, help with the specs, installation and migration.

I suppose the main blocker for dedicated hardware is figuring out how big these hosts need to be and then finding the budget to purchase them. My naive understanding is that data storage for the database is not large at all. The main issue is concurrent connection count as far as I understand it.
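To make that concrete, one quick way to compare the two dimensions is to ask the server for its total on-disk data size and its current connection count. A rough sketch, assuming pymysql and read access to the m5 master; the host and credentials are placeholders:

```python
# Rough comparison of the two sizing dimensions: data volume vs. concurrency.
# Host, user and password are placeholders; adjust for the real m5 master.
import pymysql

conn = pymysql.connect(host="m5-master.eqiad.wmnet", user="readonly",
                       password="********")
with conn.cursor() as cur:
    # Total data + index size across all schemas, in GiB.
    cur.execute("""SELECT ROUND(SUM(data_length + index_length) / POW(1024, 3), 1)
                   FROM information_schema.tables""")
    print("data size (GiB):", cur.fetchone()[0])
    # How many clients are connected right now.
    cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
    print("connections:", cur.fetchone()[1])
conn.close()
```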

I'm not super excited about adding yet another set of hosts to support OpenStack, but I understand the DBA team's point of view that this is not really their core mission either. It's too bad that there is a chicken & egg problem that prevents this from being virtualized.

I wonder if we could actually repurpose the labpuppetmaster100[12] hardware for this after moving their services into instances (T171188)? Those boxes each have 24 physical cores, 32 GB RAM, and 1 TB of RAID1 SATA storage.

As noted in T202889#4541255, T167973: Move database for wikitech (labswiki) to a main cluster section is open as well. Once we move the labswiki database off of m5, little more than the OpenStack schemas would be left running there, per https://wikitech.wikimedia.org/wiki/MariaDB/misc#m5. As far as I know the only blocker for moving labswiki's db to the main wiki db cluster is DBA time.

As far as I know the only blocker for moving labswiki's db to the main wiki db cluster is DBA time.

I think there was something else apart from DBA time - I cannot recall exactly what it was right now; I think it was firewall-related. Maybe @jcrespo can recall exactly what it was?

Also, keep in mind that the m5 servers (and other misc ones) will need to be replaced anyway, as they are old and out of warranty.

Note: we are having more issues with the current connection limit; we just got paged because OpenStack services are struggling with the DB limits.

For reference: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=9&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1073&var-port=9104&from=now%2Fy&to=now

Change 475603 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] nova: Reduce compute workers for eqiad main

https://gerrit.wikimedia.org/r/475603

Change 475603 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] nova: Reduce compute workers for eqiad main

https://gerrit.wikimedia.org/r/475603

It seems we won't be able to implement a robust solution to this (a dedicated database) in the short term.

I just checked this:

Looking at the graph data, do you think we can raise the DB connection limits? It seems we have 8 GB of free RAM and no load average/CPU/network contention at all.
Please consider raising the limits in the short term while we work on other, longer-term solutions.
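For what it's worth, a temporary bump like that is a one-line runtime change on the master. A minimal sketch, assuming DBA-level access via pymysql; the value 600 is purely illustrative, not an agreed number:

```python
# Illustrative only: check headroom and bump max_connections at runtime.
# Credentials are placeholders and the new limit (600) is an invented example.
import pymysql

conn = pymysql.connect(host="m5-master.eqiad.wmnet", user="root",
                       password="********")
with conn.cursor() as cur:
    cur.execute("SHOW GLOBAL VARIABLES LIKE 'max_connections'")
    print("current limit:", cur.fetchone()[1])
    # Runtime change only; the permanent value lives in the my.cnf managed by puppet.
    cur.execute("SET GLOBAL max_connections = 600")
conn.close()
```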

Yeah, moving to the new region will eventually allow us to shut down an entire group of nova workers, but the neutron workers only just got going. This hovers at close to 500 in general. It isn't leaking connections; it has the ability to burst connection numbers as needed for around an hour or two before giving the connections back. The neutron pool has scaled dramatically as we have moved VMs into that environment. It would make sense to add another 100 connections to the limit, since it floats between 490 and 500 now. On Monday we will also review connection usage to see if anything could benefit from limiting on the configuration end (particularly neutron). Since it is already limited, it is likely scaling itself to the required levels.
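A back-of-the-envelope way to frame that review: each service's worst-case demand is roughly its worker count times its per-worker SQLAlchemy pool (base pool plus overflow, i.e. the oslo.db max_pool_size / max_overflow settings). A sketch with invented numbers, just to show the shape of the calculation; the real worker counts and pool settings would come from the puppetised configs:

```python
# Back-of-the-envelope connection budget: workers * (pool size + overflow) per service.
# All numbers below are invented for illustration; real values come from the
# nova/neutron configs (oslo.db max_pool_size / max_overflow) managed in puppet.
services = {
    #               (workers, max_pool_size, max_overflow)
    "nova-api":        (8, 5, 10),
    "nova-conductor":  (8, 5, 10),
    "neutron-server":  (6, 5, 10),
}

total = 0
for name, (workers, pool, overflow) in services.items():
    worst_case = workers * (pool + overflow)
    total += worst_case
    print(f"{name:16s} worst case: {worst_case} connections")

print("combined worst case:", total)
# If this total sits close to the server's max_connections, either the worker
# counts or the pool limits need to come down, or max_connections has to go up.
```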

Change 475607 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] neutron: reduce number of api_workers

https://gerrit.wikimedia.org/r/475607

Change 475607 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] neutron: reduce number of api_workers

https://gerrit.wikimedia.org/r/475607

And on that note, we found another constraint we could put on the newly-busy neutron workers. It looks a lot nicer now with current limits.

My current understanding is:

  1. db1073 is already only used for WMCS purposes
  2. The recurring issue with connections is resolved

So I'm not sure there's anything to do here that isn't already documented in other tickets.

db1073 currently also hosts the wikitech DB (but there's already another task to follow up on its migration)

Marostegui claimed this task.

Closing this for now per T202889#4798131
If someone feels we need to revisit this, please re-open!
Thanks everyone