
Recheck if we need innodb_flush_neighbors=ON
Closed, ResolvedPublic

Description

While I was talking to @Danielgblack about T311106, he pointed out that if our fleet is all on SSDs we might want to re-check whether we really need innodb_flush_neighbors=1, which we kept from when not all of our hosts had SSDs.

More info also at: https://mariadb.com/kb/en/innodb-system-variables/#innodb_flush_neighbors and https://jira.mariadb.org/browse/MDEV-17380

Event Timeline

Marostegui triaged this task as Medium priority.Jul 19 2022, 8:21 AM
Marostegui moved this task from Triage to In progress on the DBA board.

MySQL also agrees: https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_flush_neighbors

When the table data is stored on a traditional HDD storage device, flushing such neighbor pages in one operation reduces I/O overhead (primarily for disk seek operations) compared to flushing individual pages at different times. For table data stored on SSD, seek time is not a significant factor and you can turn this setting off to spread out write operations. For related information, see Section 14.8.3.5, “Configuring Buffer Pool Flushing”.
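
For reference, a minimal sketch of checking (and, if desired, dynamically changing) the setting from Python; the host, user and password are hypothetical placeholders, and it assumes the pymysql package is available:

import pymysql

# Hypothetical sketch: check innodb_flush_neighbors on one host and optionally
# set it to 0 at runtime (it is a dynamic variable). Credentials are placeholders.
conn = pymysql.connect(host='db1169.eqiad.wmnet', user='some_admin_user',
                       password='...', database='information_schema')
try:
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL VARIABLES LIKE 'innodb_flush_neighbors'")
        name, value = cur.fetchone()
        print(f'{name} = {value}')
        if value != '0':
            # Per the docs quoted above, 0 only makes sense on SSD-backed hosts.
            cur.execute('SET GLOBAL innodb_flush_neighbors = 0')
finally:
    conn.close()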

Change 815308 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] mariadb: Set innodb_flush_neighbors to 0 in most dbs

https://gerrit.wikimedia.org/r/815308

This came up on IRC, and I'm proposing to ask the hardware directly, via the Redfish API, what disks it has.

Limitations:

  • Only Dell hosts are supported in the current form; I think it could be made to work for HPs too
  • It might fail on older iDRAC versions
  • Tested only with random hosts
  1. Get a DANGEROUS shell with:

https://wikitech.wikimedia.org/wiki/Spicerack#Test_newly_released_Spicerack_features
but change the logging level to INFO

  2. Set up this code in the above shell:
import logging  # needed for the logging.warning/error calls below and for the loop later

def check_host(hostname, results):
    # netbox_server() expects the short host name; the remote query yields FQDNs
    fqdn = spicerack.netbox_server(hostname.split('.')[0]).mgmt_fqdn
    redfish = spicerack.redfish(fqdn, 'root')
    try:
        storage = redfish.request('get', '/redfish/v1/Systems/System.Embedded.1/Storage').json()
        key = 'Drives'
    except Exception:
        # Older iDRACs expose the drive list under the Controllers endpoint instead
        storage = redfish.request('get', '/redfish/v1/Systems/System.Embedded.1/Storage/Controllers').json()
        key = 'Devices'
    for controller in storage['Members']:
        drives = redfish.request('get', controller['@odata.id']).json()[key]
        if key == 'Devices':
            # The Controllers endpoint does not report MediaType, so just record the drives
            results.setdefault(hostname, []).extend(drives)
            for drive in drives:
                logging.warning('Host %s has unknown media type for drive %s %s', hostname, drive['Manufacturer'], drive['Model'])
        else:
            for drive in drives:
                data = redfish.request('get', drive['@odata.id']).json()
                results.setdefault(hostname, []).append(data)
                if data['MediaType'] != 'SSD':
                    logging.error('Host %s has MediaType %s for drive %s', hostname, data['MediaType'], drive['@odata.id'])
  3. Run it for the hosts you need:
hosts = spicerack.remote().query('YOUR_QUERY_HERE').hosts
results = {}
for host in hosts:
    logging.info(host)
    check_host(host, results)
    
# 'results' now holds all the data, in case you need to inspect any host in more depth without redoing the calls

Extract of example outputs:

WARNING:root:Host db1115.eqiad.wmnet has unknown media type for drive SEAGATE ST4000NM0265 2DC
ERROR:root:Host es1025.eqiad.wmnet has MediaType HDD for drive /redfish/v1/Systems/System.Embedded.1/Storage/Drives/Disk.Bay.0:Enclosure.Internal.0-1:RAID.Slot.4-1
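
As a follow-up, a small hypothetical sketch (run in the same shell, assuming 'results' was populated by the loop above) to list the hosts that reported non-SSD or unknown drives:

non_ssd = {}
for host, drives in results.items():
    for drive in drives:
        # Drives fetched via the Controllers endpoint carry no MediaType at all
        media = drive.get('MediaType', 'unknown')
        if media != 'SSD':
            non_ssd.setdefault(host, []).append(media)

for host, media_types in sorted(non_ssd.items()):
    print(host, media_types)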

Thanks Daniel.
Interestingly, that commit mentions INNODB_TABLESPACES_SCRUBBING, which has a column called ON_SSD.
For a random host with SSDs it shows:

root@db1169.eqiad.wmnet[information_schema]> select * from INNODB_TABLESPACES_SCRUBBING limit 3;
+-------+------------------------+------------+----------------------+-----------------------+------------------------------+---------------------------+-------------------------------+--------+
| SPACE | NAME                   | COMPRESSED | LAST_SCRUB_COMPLETED | CURRENT_SCRUB_STARTED | CURRENT_SCRUB_ACTIVE_THREADS | CURRENT_SCRUB_PAGE_NUMBER | CURRENT_SCRUB_MAX_PAGE_NUMBER | ON_SSD |
+-------+------------------------+------------+----------------------+-----------------------+------------------------------+---------------------------+-------------------------------+--------+
|     0 | innodb_system          |          0 | NULL                 | NULL                  |                         NULL |                         0 |                             0 |      0 |
|   660 | enwiki/__wmf_checksums |          0 | NULL                 | NULL                  |                         NULL |                         0 |                             0 |      0 |
|   661 | enwiki/abuse_filter    |          0 | NULL                 | NULL                  |                         NULL |                         0 |                             0 |      0 |
+-------+------------------------+------------+----------------------+-----------------------+------------------------------+---------------------------+-------------------------------+--------+
3 rows in set (0.001 sec)

That column shows 0 for those files, which are on SSDs. Looking at the documentation, there's no way to tell what 0 means in that column: https://mariadb.com/docs/reference/mdb/information-schema/INNODB_TABLESPACES_SCRUBBING/ does list the column but doesn't really say what it means (I would assume 0 means not on SSD?), and there's nothing in the detailed docs for that table either, https://mariadb.com/kb/en/information-schema-innodb_tablespaces_scrubbing-table/, as it looks outdated.
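
In case it helps, a hypothetical sketch (again with placeholder credentials, assuming pymysql) to get a quick count of how many tablespaces InnoDB flags as on SSD via that column:

import pymysql

conn = pymysql.connect(host='db1169.eqiad.wmnet', user='some_admin_user',
                       password='...', database='information_schema')
try:
    with conn.cursor() as cur:
        cur.execute('SELECT ON_SSD, COUNT(*) FROM INNODB_TABLESPACES_SCRUBBING GROUP BY ON_SSD')
        for on_ssd, count in cur.fetchall():
            # Per the output above, these show 0 even though the files are on SSDs
            print(f'ON_SSD={on_ssd}: {count} tablespaces')
finally:
    conn.close()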

Looking at the commit message:

On Linux, the detection will be based on the files
/sys/block/*/queue/rotational and /sys/block/*/dev.
Especially for USB storage, it is possible that
/sys/block/*/queue/rotational will wrongly report 1 instead of 0.

fil_node_t::on_ssd: Whether the InnoDB data file resides on
solid-state storage.

fil_system_t::ssd: Collection of Linux block devices that reside on
non-rotational storage.

fil_system_t::create(): Detect ssd on Linux based on the contents
of /sys/block/*/queue/rotational and /sys/block/*/dev.

fil_system_t::is_ssd(dev_t): Determine if a Linux block device is
non-rotational. Partitions will be identified with the containing
block device by assuming that the least significant 4 bits of the
minor number identify a partition, and that the "partition number"
of the entire device is 0.
root@db1169:~# cat /sys/block/*/queue/rotational
1
1
root@db1169:~# cat /sys/block/*/dev
254:0
8:0

This host has no spinning disks; they are all SSDs in a RAID 10, so I'm not sure why they're showing 1.
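
For illustration, a small sketch of what the detection described in that commit message would do with these files (this mirrors the quoted description, not the actual InnoDB code):

import glob
import os

# Collect (major, minor) of block devices whose queue/rotational is 0,
# masking off the low 4 bits of the minor so partitions map to their whole device.
non_rotational = set()
for dev_path in glob.glob('/sys/block/*/dev'):
    block_dir = os.path.dirname(dev_path)
    with open(os.path.join(block_dir, 'queue', 'rotational')) as f:
        rotational = f.read().strip() == '1'
    with open(dev_path) as f:
        major, minor = (int(x) for x in f.read().strip().split(':'))
    if not rotational:
        non_rotational.add((major, minor & ~0xF))

def is_ssd(major, minor):
    # True if the device (or its parent, for a partition) was seen as non-rotational
    return (major, minor & ~0xF) in non_rotational

# On db1169 both entries report rotational=1, so is_ssd() would return False
# here even though the underlying disks are SSDs behind the RAID controller.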

We are not setting innodb_flush_neighbors = 1 anywhere in our my.cnf, yet it is reported as set:

root@db1169.eqiad.wmnet[information_schema]> show global variables like '%flush_neigh%';
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| innodb_flush_neighbors | 1     |
+------------------------+-------+
1 row in set (0.001 sec)

So either it is not working as it should, or we are doing something weird there.
I guess it doesn't hurt to explicitly set it to 0 in my.cnf with the patch @Ladsgroup is putting up.

1 is the default for innodb_flush_neighbors (I just corrected the docs). It doesn't really change to 0 on SSD detection; InnoDB simply ignores the setting if an SSD was detected.

It looks like your RAID controller (or the kernel, if software RAID) is passing the wrong rotational status.

So yes, an explicit patch to my.cnf sounds like the right approach.

https://github.com/MariaDB/server/commit/10dd290b4b8b8b235c8cf42e100f0a4415629e79 shows an algorithm of autodetection based on /sys/block/*/queue/rotational contents.

We've found that /sys/block/*/queue/rotational is not reliable, because with some HW RAID controllers SSDs are reported as rotational; see for example T309027 for some additional context. That's why I was suggesting checking them by querying the hardware directly.

Yes, I think it's safe to enable it on hosts that are on SSDs, but before enabling it everywhere I want to run a fleet-wide query to make sure we don't have any hosts left on HDDs (unless Manuel is sure there are none left :D), to avoid perf regressions.

Working on it now


Thanks Daniel!

We can't enable this on ExternalStorage as they are all on HDDs; I'm looking into the rest.

So far I have turned it off on pc1012 and db1131 (s6) and am now monitoring it.

@Ladsgroup how's this going? Can we resume this early next week?

Yeah, I'm sick :P

It's deployed to several hosts with no major changes (good or bad), so I'm hoping to deploy it everywhere next week (when I hopefully feel better) and monitor for any surprises.

Deployed on all of s6. I'll wait for a day and then deploy it everywhere.

I just deployed it everywhere except misc, pc, and x2. pc and x2 wouldn't need it; they don't write to the binlog. I'll do the m sections now.

Change 815308 merged by Ladsgroup:

[operations/puppet@production] mariadb: Set innodb_flush_neighbors to 0 in most dbs

https://gerrit.wikimedia.org/r/815308

Ladsgroup moved this task from In progress to Done on the DBA board.