
Measure capacity and utilization of labsdb*** boxes
Closed, Resolved · Public

Description

Measure utilization both in terms of query volume / performance and disk space, to see if we need to get more of these boxes or put more quotas in place.

Event Timeline

yuvipanda raised the priority of this task from to Needs Triage.
yuvipanda updated the task description.
yuvipanda added projects: Cloud-Services, DBA.

This is basically capacity planning for these boxes, since we've been seeing more consistent issues with them lately. @jcrespo, how do you think we should handle these?

I can help and even own this task.

There is an already installed measurement plugin for MySQL (user_stats): https://www.percona.com/doc/percona-server/5.5/diagnostics/user_stats.html?id=percona-server:features:userstatv2 and we are investigating deploying an even more detailed one, performance_schema, at least on production (T99485): https://dev.mysql.com/doc/refman/5.6/en/performance-schema-summary-tables.html
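
For illustration, this is roughly what both look like from the SQL side (a sketch; variable and table names as documented for Percona Server 5.5 and MySQL 5.6, details vary by version):

-- user_stats: collection is toggled with a server variable
SET GLOBAL userstat = 1;

-- once enabled, per-user counters accumulate in information_schema
SELECT user, total_connections, cpu_time, rows_read
  FROM information_schema.user_statistics
 ORDER BY cpu_time DESC
 LIMIT 5;

-- performance_schema (MySQL 5.6+): per-user statement summaries
SELECT user, event_name, count_star, sum_timer_wait
  FROM performance_schema.events_statements_summary_by_user_by_event_name
 ORDER BY sum_timer_wait DESC
 LIMIT 5;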

So measuring is easy, or it will be very soon.

Whatever we want to do with that data, however, is difficult:

  • Will users accept quotas?
  • Built-in support is limited to:
MAX_QUERIES_PER_HOUR
MAX_UPDATES_PER_HOUR
MAX_CONNECTIONS_PER_HOUR
MAX_USER_CONNECTIONS

There are no per-user limits on memory or disk space (unless you do something hacky).
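
For reference, those built-in limits are attached to an account with standard MySQL/MariaDB GRANT syntax (a sketch; the account name and the numbers here are only illustrative):

GRANT USAGE ON *.* TO 's51434'@'%'
    WITH MAX_QUERIES_PER_HOUR 10000
         MAX_UPDATES_PER_HOUR 1000
         MAX_CONNECTIONS_PER_HOUR 500
         MAX_USER_CONNECTIONS 10;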

Things that could be done easily:

  • Put a soft and a hard limit on query execution time with pt-kill / Tendril. Crossing the soft limit would send an email; crossing the hard limit would kill the query (see the sketch below for the underlying SQL).
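
What pt-kill automates can be expressed as plain SQL (a sketch; the thresholds are illustrative):

-- soft limit: find queries running longer than 300 s, to email their owners
SELECT id, user, time, LEFT(info, 100) AS query
  FROM information_schema.processlist
 WHERE command = 'Query' AND time > 300;

-- hard limit: terminate a query that crossed e.g. 3600 s, by its thread id
KILL QUERY 12345;  -- 12345 is a placeholder id taken from the query above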

Things that could be done with some development:

  • Force users to put long queries in a queue (maybe integrating them into Quarry?)
  • Containers with separate instances for problematic users?
  • Add some extra HA / load balancing, both for the replicas and for user dbs, especially for maintenance

Wonderful!

Quotas are probably not going to work, so that'll be the last resort. However, I think just identifying and informing users of intense queries will help a lot. We can also consider adding more hardware if we have the numbers to back it up.

Aside from the aggregated totals on Ganglia and Tendril (which led to discovering OOMs on labsdb1002), you can run queries right now against information_schema, such as:

MariaDB LABS > SELECT * FROM information_schema.user_statistics ORDER BY CPU_TIME DESC  LIMIT 10;
+--------------+-------------------+------------------------+----------------+--------------------+--------------------+----------------+--------------+----------------------+--------------+-------------+--------------+---------------+--------------+-----------------+-----------------+----------------+---------------------+-----------------------+--------------------+------------------+---------------+---------------+
| USER         | TOTAL_CONNECTIONS | CONCURRENT_CONNECTIONS | CONNECTED_TIME | BUSY_TIME          | CPU_TIME           | BYTES_RECEIVED | BYTES_SENT   | BINLOG_BYTES_WRITTEN | ROWS_READ    | ROWS_SENT   | ROWS_DELETED | ROWS_INSERTED | ROWS_UPDATED | SELECT_COMMANDS | UPDATE_COMMANDS | OTHER_COMMANDS | COMMIT_TRANSACTIONS | ROLLBACK_TRANSACTIONS | DENIED_CONNECTIONS | LOST_CONNECTIONS | ACCESS_DENIED | EMPTY_QUERIES |
+--------------+-------------------+------------------------+----------------+--------------------+--------------------+----------------+--------------+----------------------+--------------+-------------+--------------+---------------+--------------+-----------------+-----------------+----------------+---------------------+-----------------------+--------------------+------------------+---------------+---------------+
| s51434       |             22299 |                      0 |        1436841 | 1425651.5465629105 | 1307179.4418262297 |      328749999 |   1978526262 |                    0 | 658074917494 |    14622385 |            0 |        482734 |       125129 |         1157406 |          621740 |          26154 |                3224 |                     1 |                  0 |               10 |             0 |        642030 |
| s51127       |            215514 |                      0 |        1166804 | 1111794.0408123396 |  553111.2763173468 |     1348920625 |   3691625118 |                    0 |  83183738937 |    55590402 |     84309277 |    1684114863 |    173507740 |         1268664 |         3729650 |         802580 |             1992655 |                    20 |                158 |                0 |             2 |        648303 |
| s51704       |              4088 |                      0 |         813233 |  813172.6721980363 |  485524.2997996303 |        1211538 |      1834764 |                    0 | 151808046468 |       11320 |            0 |             0 |            0 |            7349 |               0 |           8382 |                7334 |                    16 |                  0 |                0 |             0 |          1718 |
| s51211       |             89423 |                      0 |         907074 |  687022.6312436978 | 466182.34979867964 |    22224086347 |  44986048334 |                    0 |  79640608147 |   201392566 |            0 |             0 |            0 |       161248586 |               0 |              0 |           161247677 |                     0 |                  0 |              379 |             0 |     160999530 |
| s51419       |              1852 |                      0 |         516209 |  367862.7543620093 | 328065.45904408814 |      411482132 |    537592335 |                    0 |  87593445429 |     2375185 |            0 |             0 |            0 |         2587287 |               0 |              0 |             2587287 |                     0 |                  0 |                0 |             0 |        212102 |
| s51156       |            234018 |                      0 |         673312 |  551110.2557645071 | 294718.45623102057 |    22219436424 | 210738205345 |                    0 |  21453096513 |  3954524446 |            0 |             0 |            0 |        83952539 |               0 |              0 |            83963124 |                    24 |                  0 |                0 |             0 |         18168 |
| s52256       |              8276 |                      0 |         308766 |   300670.671030904 | 218032.89066452847 |     1743478345 |   2300664213 |                    0 |  24266778713 |     7281835 |            0 |             0 |            0 |         5080093 |               0 |        4094165 |             5081828 |                     2 |                  0 |                0 |             0 |       1811747 |
| s51591       |              2268 |                      0 |         283181 | 268265.10337598226 | 162389.26819081898 |      927830702 |    256057419 |                    0 |  33482778007 |     4963983 |            0 |             0 |            0 |          927775 |               0 |           2268 |              927768 |                     7 |                  0 |                0 |             0 |        452786 |
| p50588g50589 |              4417 |                      0 |         265882 |  265858.3057499842 | 153817.60534970183 |         850271 | 305532078003 |                    0 |  33612300607 | 15603822277 |            0 |             0 |            0 |            4417 |               0 |              0 |                4493 |                     1 |                  0 |                0 |             0 |             0 |
| s51187       |            639113 |                      0 |         496665 |  280481.8013343217 | 118965.32272312038 |      613457602 |   2033210162 |                    0 |   9331619766 |    35838127 |            0 |     305296980 |            0 |         1011110 |           66285 |         640488 |             1043585 |                     1 |                308 |                0 |           308 |         17851 |
+--------------+-------------------+------------------------+----------------+--------------------+--------------------+----------------+--------------+----------------------+--------------+-------------+--------------+---------------+--------------+-----------------+-----------------+----------------+---------------------+-----------------------+--------------------+------------------+---------------+---------------+
10 rows in set (0.00 sec)
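
For the disk-space half of the description, a similar look at information_schema.tables works (a sketch, assuming the usual s#####__name convention for user databases):

SELECT table_schema,
       ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb
  FROM information_schema.tables
 WHERE table_schema LIKE 's5%\_\_%'
 GROUP BY table_schema
 ORDER BY size_gb DESC
 LIMIT 10;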

ooooh, that's great!

I don't feel I have the necessary qualifications to own this task. Do you think you can do that, @jcrespo? I guess it'll primarily be just figuring out if / when we need new hardware...

What I lacked was a specific goal; the original description was a bit vague. Now I can own this.

There is no need for specific metrics to know that the current hardware for the labsdb machines is lacking. There is no basic HA, and from time to time we suffer OOMs. There are two options: limit user traffic, or get better machines with more connection capacity (mainly more memory and better disk performance). We also need load balancing.

So if we can find a way to add another machine of similar specs, that'll help improve things as well, right?

As for load balancing: we are currently doing rough DNS-based load balancing, but that falls apart for user dbs. Are you thinking of a different form of load balancing, or just doing what we do now but with HA for user databases?

> Are you thinking of a different form of load balancing, or just doing what we do now but with HA for user databases?

Both. I haven't thought about a particular proposal, let me work on that.

@jcrespo can you spec out an ideal labsdb machine and we can see if we can find budget for it?

Low priority right now; I haven't even started evaluating this on production, and the labsdbs have been working acceptably well in the last few months.

jcrespo raised the priority of this task from Low to High. (Feb 25 2016, 7:48 PM)
Nemo_bis lowered the priority of this task from High to Medium. (Mar 14 2016, 10:24 AM)

Closing despite nothing having been done publicly, because the work has been done privately on procurement at T131363. I will be happy to share the resulting specs ASAP, once approved and stripped of the private parts (but I want to do it with a proper announcement).