Paste P3514

labsdb chat july 18/19

Authored by chasemp on Jul 19 2016, 9:15 PM.
Labs MySQL infrastructure
3 replica dbs, 2 toolsdbs
Problems:
1. Load
- Cores usually at 100%, plus swapping: https://grafana.wikimedia.org/dashboard/db/server-board?from=1468766304245&to=1468852464245&var-server=labsdb1003&var-network=eth0
- Crashy
2. No HA solution
3. RAID0 - disk goes, data and everything is gone (happened to labsdb1002)
4. Current HA solution is to change what the DNS entries point to, which is problematic because:
1. It is not transparent to users, since user dbs are not replicated
5. Few tools/users take up majority of resources
6. TokuDB, used for labs, crashes frequently. It was chosen because it compresses better, which we needed because of the large number of shards on single machines. It also sometimes causes lag / bogus results.
7. Lag spikes on things like updates to tables that don't have indexes; replicas get 'stuck' and need restarting. Sometimes corruption (but that is probably a MediaWiki issue)
8. Having lots of accounts with separate grants makes auditing difficult.
9. Users can't run EXPLAIN queries to check the theoretical efficiency of their SQL
10. Sanitizing needs to be both more secure and more automatic
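To put rough numbers on problem 3 (RAID0 losing everything when one disk goes), here is a minimal sketch comparing the chance of total data loss for a striped array against a striped-mirror (RAID 10) layout. It assumes independent disk failures and ignores rebuild windows; the failure probability and disk count are made-up illustration values, not figures from the labsdb hosts.

```python
def raid0_loss_probability(p_disk: float, n_disks: int) -> float:
    """RAID 0: the array is lost if ANY disk fails."""
    return 1.0 - (1.0 - p_disk) ** n_disks


def raid10_loss_probability(p_disk: float, n_disks: int) -> float:
    """RAID 10: the array is lost only if BOTH disks of some mirrored pair fail."""
    if n_disks < 2 or n_disks % 2 != 0:
        raise ValueError("RAID 10 needs an even number of disks (at least 2)")
    pairs = n_disks // 2
    p_pair_lost = p_disk ** 2  # both mirrors of one pair fail
    return 1.0 - (1.0 - p_pair_lost) ** pairs


if __name__ == "__main__":
    # Hypothetical values: 3% failure chance per disk over some period, 8 disks.
    p, n = 0.03, 8
    print(f"RAID 0  loss probability: {raid0_loss_probability(p, n):.4f}")
    print(f"RAID 10 loss probability: {raid10_loss_probability(p, n):.4f}")
```

The trade-off is capacity: under these assumptions RAID 10 gives up half the raw disk space in exchange for surviving any single-disk failure.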
Proposed solutions:
1. Switch to InnoDB compressed, ditch TokuDB. Needs more disk, but we have it now. Will make re-imports from prod easier too
2. RAID 10 not RAID 0
3. Possibly use HAProxy, but might need L7 proxying instead. How to handle user dbs on replicas (*large pain point* for HA)?
4. Use MariaDB 10.1 "roles" to manage common permissions (<https://mariadb.com/kb/en/mariadb/roles-overview/>)
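A rough sketch of what the roles idea (solution 4) could look like: one shared read-only role carries the common grants, and each user account is just given that role, so there is a single set of grants to audit. The script below only generates the SQL statements; it does not connect to anything. The role name, database name, and account name are hypothetical placeholders, and the statements assume MariaDB 10.1 role syntax (CREATE ROLE, GRANT role TO user, SET DEFAULT ROLE).

```python
import re

# Conservative whitelists so generated SQL cannot contain quoting tricks.
_IDENT = re.compile(r"[A-Za-z0-9_]+")
_HOST = re.compile(r"[A-Za-z0-9_.%-]+")


def _check(pattern, value):
    """Return value unchanged if it fully matches pattern, else raise."""
    if not pattern.fullmatch(value):
        raise ValueError(f"unsafe identifier: {value!r}")
    return value


def role_statements(role, databases, users, host="%"):
    """Build the SQL for one shared read-only role and its member accounts."""
    _check(_IDENT, role)
    _check(_HOST, host)
    stmts = [f"CREATE ROLE {role};"]
    # Common permissions live on the role, not on each account.
    for db in databases:
        stmts.append(f"GRANT SELECT ON `{_check(_IDENT, db)}`.* TO {role};")
    # Each account only needs membership plus a default role.
    for user in users:
        account = f"'{_check(_IDENT, user)}'@'{host}'"
        stmts.append(f"GRANT {role} TO {account};")
        stmts.append(f"SET DEFAULT ROLE {role} FOR {account};")
    return stmts


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for line in role_statements("labsdb_reader", ["enwiki_p"], ["u1234"]):
        print(line)
```

Auditing then comes down to checking the grants on the role plus the list of accounts that hold it, rather than comparing separate grants per account.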
