labsdb chat july 18/19

Authored by • chasemp on Jul 19 2016, 9:15 PM.

Referenced Files

	F4289501: labsdb chat july 18/19
	Jul 19 2016, 9:15 PM

	Labs MySQL infrastructure


	3 replica dbs, 2 toolsdbs

	Problems:

	1. Load

	- Cores usually at 100%, plus swapping: https://grafana.wikimedia.org/dashboard/db/server-board?from=1468766304245&to=1468852464245&var-server=labsdb1003&var-network=eth0

	- Crashy

	2. No real HA solution (see 4 for the current workaround)
	3. RAID0 - disk goes, data and everything is gone (happened to labsdb1002)
	4. Current HA approach is to change what the DNS entries point to.
	Problematic because:
	- Not transparent to users, because user dbs are not replicated
	5. Few tools/users take up majority of resources
	6. TokuDB, used for labs, crashes frequently. It was chosen because it compresses better, which we needed because of the large number of shards on single machines. It also sometimes causes lag / bogus results.
	7. Lag spikes on things like updating tables that don't have indexes; replicas getting 'stuck' and needing a restart. Sometimes corruption (but that is probably a MediaWiki issue)
	8. Having lots of accounts with separate grants makes auditing difficult.
	9. Users can't run EXPLAIN queries to check the theoretical efficiency of their SQL
	10. Sanitizing needs to be both more secure and more automatic
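
To put rough numbers on problem 3 and on the RAID 10 proposal further down, here is a back-of-the-envelope sketch in Python. It is only a toy model: it assumes each disk fails independently with the same probability over some period, ignores rebuild windows and controller failures, and the disk count, disk size and failure probability in the usage note are made up for illustration, not measurements from the labsdb hosts.

```python
def raid0(disks, disk_tb, p_fail):
    """RAID 0: all capacity is usable, but one failed disk loses the array.

    p_fail is an ASSUMED per-disk failure probability over some period.
    """
    usable_tb = disks * disk_tb
    # The array survives only if every single disk survives.
    p_loss = 1 - (1 - p_fail) ** disks
    return usable_tb, p_loss


def raid10(disks, disk_tb, p_fail):
    """RAID 10: striped mirror pairs; data is lost only if both disks of a pair fail."""
    if disks < 4 or disks % 2:
        raise ValueError("RAID 10 needs an even number of disks, at least 4")
    pairs = disks // 2
    usable_tb = pairs * disk_tb
    # A pair is lost with probability p_fail**2 (independence assumed);
    # the array survives only if every pair survives.
    p_loss = 1 - (1 - p_fail ** 2) ** pairs
    return usable_tb, p_loss
```

For example, `raid0(8, 1.0, 0.05)` and `raid10(8, 1.0, 0.05)` compare eight hypothetical 1 TB disks: RAID 10 halves the usable space, which is part of why the proposals below need more disk, and in this simplified model it makes losing the whole array much less likely.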

	Proposed solutions:
	1. Switch to InnoDB compressed, ditch TokuDB. Needs more disk, but we have that now. Will also make re-imports from prod easier
	2. RAID 10 not RAID 0
	3. Possibly use HAProxy, but might need L7 proxying instead. How to handle user dbs on replicas (large pain point for HA)?
	4. Use MariaDB 10.1 "roles" to manage common permissions (<https://mariadb.com/kb/en/mariadb/roles-overview/>)
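
As an illustration of proposal 4, the following Python sketch generates the MariaDB statements that would replace per-account grants with one shared role. The role name, database pattern and account names in the usage note are hypothetical, not the real labsdb setup, and the helper does no quoting or escaping, so it assumes trusted input. `CREATE ROLE`, `GRANT <role> TO <user>` and `SET DEFAULT ROLE ... FOR` are the MariaDB 10.1 statements being relied on.

```python
def role_statements(role, grants, users):
    """Build MariaDB statements defining one shared role and attaching users to it.

    role   -- role name (hypothetical, e.g. "labsdbuser")
    grants -- list of (privileges, target) pairs, e.g. ("SELECT", "`enwiki_p`.*")
    users  -- list of (user, host) pairs

    NOTE: names are interpolated as-is; this sketch assumes trusted input.
    """
    stmts = [f"CREATE ROLE {role};"]
    # Privileges are granted once, to the role, instead of once per account.
    for privileges, target in grants:
        stmts.append(f"GRANT {privileges} ON {target} TO {role};")
    for user, host in users:
        account = f"'{user}'@'{host}'"
        stmts.append(f"GRANT {role} TO {account};")
        # Without a default role the user would have to run SET ROLE on connect.
        stmts.append(f"SET DEFAULT ROLE {role} FOR {account};")
    return stmts
```

Calling `role_statements("labsdbuser", [("SELECT, SHOW VIEW", "`enwiki_p`.*")], [("u1234", "%")])` yields four statements. Auditing (problem 8) then mostly comes down to reviewing the grants on the role plus the list of accounts that hold it.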

Event Timeline

• chasemp created this paste. Jul 19 2016, 9:15 PM
