
Optimize all the things (=MySQL tables)
Closed, ResolvedPublic

Assigned To
Authored By
Ladsgroup
Nov 18 2025, 2:34 PM
Referenced Files
F70301478: grafik.png
Nov 20 2025, 12:58 PM
F70294722: grafik.png
Nov 20 2025, 12:56 AM

Description

For various reasons it'd be really nice to have all the tables in MySQL optimized from time to time:

  • Refreshing table statistics so the optimizer doesn't pick the wrong query plan (stuff like T305427)
  • Reclaiming space when tables shrink for various reasons.
    • which also would reduce the time to clone a replica.
    • and save space on backups too
  • Reducing random disk lookups caused by fragmented data.

There are two options (that are not mutually exclusive):

  • After every DC switchover, while the whole DC is depooled, run optimize table on every table, with replication.
  • A script that would go around and do the work in the background (reusing the auto_schema code).

I think doing it every quarter or half a year would be fine. More often than that would be too much work for too little gain.
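Whichever option is chosen, candidate tables could be prioritized by how much reclaimable space they hold. A minimal sketch of building such a query against information_schema (the helper name is hypothetical):

```python
def free_space_query(schema: str) -> str:
    # DATA_FREE is allocated-but-unused space in the tablespace;
    # large values suggest a rebuild would reclaim disk.
    return (
        "SELECT table_name, data_free "
        "FROM information_schema.tables "
        f"WHERE table_schema = '{schema}' "
        "ORDER BY data_free DESC;"
    )
```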

Event Timeline

We could consider using analyze table which does not reclaim any space as it doesn't rebuild the table and it is less aggressive in terms of performance.
Unless we are struggling disk wise (which I don't think we are), we could go first for analyze and later if needed go for optimize (I assume it would be transparent for whatever method we implement to run either one or the other one).
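Since either statement could be run by the same tooling, switching between the two is a one-line decision. A sketch of what that might look like (the helper name is hypothetical):

```python
def maintenance_sql(table: str, reclaim_space: bool = False) -> str:
    """Build the maintenance statement for one table.

    ANALYZE TABLE only refreshes index statistics (cheap, no rebuild);
    OPTIMIZE TABLE rebuilds the table and reclaims free space.
    """
    verb = 'OPTIMIZE' if reclaim_space else 'ANALYZE'
    # Backtick-quote the identifier to be safe with reserved words.
    return f'{verb} TABLE `{table}`;'
```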

Marostegui triaged this task as Medium priority.Nov 18 2025, 2:40 PM
Marostegui moved this task from Triage to Refine on the DBA board.

We could consider using analyze table which does not reclaim any space as it doesn't rebuild the table and it is less aggressive in terms of performance.
Unless we are struggling disk wise (which I don't think we are), we could go first for analyze and later if needed go for optimize (I assume it would be transparent for whatever method we implement to run either one or the other one).

I think the biggest problem we had with "optimize" (I don't know if this is the case with "analyze" too) is that it won't be replicated, so the idea of running it on the master of the depooled DC won't work. We explicitly have to run "ALTER TABLE" instead to get it to replicate. Maybe it's fixed now? I doubt it.
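The "ALTER TABLE" workaround mentioned here is typically a null rebuild, which is logged as ordinary DDL and therefore replicates. A sketch of building such a statement (the helper name is hypothetical; InnoDB is assumed):

```python
def replicating_rebuild_sql(table: str) -> str:
    # ALTER TABLE ... FORCE rebuilds the table much like OPTIMIZE TABLE
    # does, but goes through the binary log as DDL and so replicates.
    return f'ALTER TABLE `{table}` FORCE;'
```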

Yeah, that is the bug I created :) https://jira.mariadb.org/browse/MDEV-26618
My idea was to run this without replication, on a per host basis, to avoid massive lags (and also that way we don't need the DC to be depooled, which I believe is what is preferred after each switchover)

Fair. One last point in favor of optimize instead of analyze: I know we don't have disk constraints but it can help us with reducing time to clone a replica from another. Not a big deal though. We can start with analyze and once I have charmed you, we can switch to optimize.

Let's do that then

Mentioned in SAL (#wikimedia-operations) [2025-11-19T17:08:15Z] <ladsgroup@cumin1003> dbctl commit (dc=all): 'Testing all optimize (T410401)', diff saved to https://phabricator.wikimedia.org/P85394 and previous config saved to /var/cache/conftool/dbconfig/20251119-170814-ladsgroup.json

Ran a script on db1185 to clean up s5 and it reduced the size by 180GB(!)
https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=db1185&var-datasource=000000026&var-cluster=mysql&viewPanel=panel-28&from=2025-11-19T14:58:53.459Z&to=2025-11-20T00:53:54.952Z&timezone=utc

grafik.png (924×1 px, 61 KB)

It's a bit misleading because I assume it'll grow faster now but let's see how it develops. I'm going to start running this on all of s6 now.

From what I'm seeing, a big part of the clean-up was 50GB being removed from templatelinks of cebwiki.

import pymysql.cursors
from auto_schema.config import Config
from auto_schema.replication_discovery import HostReplicationDiscovery
from auto_schema.replica_set import ReplicaSet
from auto_schema.bash import run
from auto_schema.host import Host
import time
from datetime import datetime
import re
import sys
from argparse import Namespace
from wmflib.exceptions import WmflibError
from wmflib.interactive import ensure_shell_is_durable

section = 's6'
ticket = 'T410589'
replicas = None
# Very large tables that would take too long to rebuild in one go.
tables_to_skip = ['revision', 'content', 'slots']


args = Namespace(dc_masters=False, section=section, dc='eqiad')
replica_set = ReplicaSet(replicas, section, args=args)
replication_discovery = HostReplicationDiscovery()
config = Config()
try:
    ensure_shell_is_durable()
except WmflibError:
    print('Rolling restarts must be done in screen/tmux/etc, exiting.')
    sys.exit(1)
for host in replica_set.replicas:
    host.downtime('72')
    if replica_set.detect_depool(host):
        depooled = host.depool(ticket)
        if not depooled:
            continue
    # Stop replication so the rebuilds don't race with incoming writes.
    host.run_sql('stop slave;')

    for db_name in replica_set.get_dbs():
        connection = pymysql.connect(
            host=host.fqn,
            port=host.port,
            database=db_name,
            read_default_file="/root/.my.cnf",
            cursorclass=pymysql.cursors.DictCursor,
        )
        connection.autocommit(True)
        tables = []
        with connection:
            with connection.cursor() as cursor:
                cursor.execute('show tables;')
                result = cursor.fetchall()
                for row in result:
                    tables.append(list(row.values())[0])
            for table in tables:
                if table in tables_to_skip:
                    continue
                with connection.cursor() as cursor:
                    print(datetime.now(), ': Optimizing ' + table + ' on ' + db_name)
                    cursor.execute('optimize table ' + table + ';')

    host.run_sql('start slave;')
    if replica_set.detect_depool(host):
        host.repool(ticket)


And right on time this bug has been fixed after 4 years since it was created! I guess that's a Christmas gift 🎁


Merged and will be included in the upcoming 10.11.16 release.

Ladsgroup moved this task from In progress to Done on the DBA board.

I think this is done. The actual run for this year is happening in another ticket.