Page MenuHomePhabricator

Investigate spikes on db1132 (mariadb 10.6 host)
Closed, ResolvedPublic

Description

I have noticed some periodic spikes on db1132 (bullseye+mariadb 10.6). Which aren't present on buster or bullseye with mariadb 10.4

These are some examples:

Captura de pantalla 2022-04-28 a las 11.38.00.png (251×921 px, 43 KB)

Captura de pantalla 2022-04-28 a las 11.39.04.png (523×927 px, 60 KB)

Captura de pantalla 2022-04-28 a las 11.38.55.png (279×935 px, 31 KB)

Captura de pantalla 2022-04-28 a las 11.38.49.png (269×926 px, 33 KB)

Captura de pantalla 2022-04-28 a las 11.39.11.png (233×957 px, 28 KB)

Captura de pantalla 2022-04-28 a las 11.38.26.png (254×935 px, 34 KB)

They do not seem to affect the host performance though.

Event Timeline

Marostegui triaged this task as Medium priority.Apr 28 2022, 9:40 AM
Marostegui moved this task from Triage to In progress on the DBA board.

At a first glance, there's a difference between the % of dirty pages 10.6 and 10.4 have by default. So I am going to change 10.6 to mimic 10.4

Mentioned in SAL (#wikimedia-operations) [2022-04-28T09:41:40Z] <marostegui> Change db1132 innodb_max_dirty_pages_pct from 90% to 75% T307082

Also present on pc2014 since the upgrade to 10.6

Captura de pantalla 2022-04-28 a las 11.46.27.png (794×1 px, 141 KB)

Mentioned in SAL (#wikimedia-operations) [2022-04-28T09:51:29Z] <marostegui> Change pc2014 innodb_max_dirty_pages_pct from 90% to 75% T307082

innodb_max_dirty_pages_pct change didn't prevent the spike after 2 hours, going to keep digging.

From what I have been able to see this is mostly related on how/when we flush pages to disk. As I mentioned before, this is not causing any performance degradation or anything on db1132 (serving on s1 with normal weight) but just some weird graphs. So far I think it might be related to this new flag innodb_lru_flush_size https://mariadb.com/kb/en/innodb-system-variables/#innodb_lru_flush_size
Still investigating

There's also https://mariadb.com/docs/reference/mdb/status-variables/Innodb_checkpoint_max_age/ which is a new flag re-introduced on 10.6 but not present on 10.4 which makes things even more confusing on some of the graphs - this behaviour is probably also happening on 10.4 but we simply could graph it.

I can confirm each spike matches the flush of the dirty pages:

Fri 29 Apr 2022 05:45:48 AM UTC
Innodb_buffer_pool_pages_dirty	447462
Fri 29 Apr 2022 05:45:48 AM UTC
Innodb_buffer_pool_pages_dirty	447777
Fri 29 Apr 2022 05:45:49 AM UTC
Innodb_buffer_pool_pages_dirty	448092
Fri 29 Apr 2022 05:45:49 AM UTC
Innodb_buffer_pool_pages_dirty	448330
Fri 29 Apr 2022 05:45:49 AM UTC
Innodb_buffer_pool_pages_dirty	448625
Fri 29 Apr 2022 05:45:50 AM UTC
Innodb_buffer_pool_pages_dirty	449007
Fri 29 Apr 2022 05:45:50 AM UTC
Innodb_buffer_pool_pages_dirty	449306
Fri 29 Apr 2022 05:45:50 AM UTC
Innodb_buffer_pool_pages_dirty	449571
Fri 29 Apr 2022 05:45:51 AM UTC
Innodb_buffer_pool_pages_dirty	449928
Fri 29 Apr 2022 05:45:51 AM UTC
Innodb_buffer_pool_pages_dirty	450226
Fri 29 Apr 2022 05:45:51 AM UTC
Innodb_buffer_pool_pages_dirty	450490
Fri 29 Apr 2022 05:45:51 AM UTC
Innodb_buffer_pool_pages_dirty	450807
Fri 29 Apr 2022 05:45:52 AM UTC
Innodb_buffer_pool_pages_dirty	451112
Fri 29 Apr 2022 05:45:52 AM UTC
Innodb_buffer_pool_pages_dirty	451371
Fri 29 Apr 2022 05:45:52 AM UTC
Innodb_buffer_pool_pages_dirty	451649
Fri 29 Apr 2022 05:45:53 AM UTC
Innodb_buffer_pool_pages_dirty	451874
Fri 29 Apr 2022 05:45:53 AM UTC
Innodb_buffer_pool_pages_dirty	452138
Fri 29 Apr 2022 05:45:53 AM UTC
Innodb_buffer_pool_pages_dirty	452506
Fri 29 Apr 2022 05:45:54 AM UTC
Innodb_buffer_pool_pages_dirty	452797
Fri 29 Apr 2022 05:45:54 AM UTC
Innodb_buffer_pool_pages_dirty	453051
Fri 29 Apr 2022 05:45:54 AM UTC
Innodb_buffer_pool_pages_dirty	453350
Fri 29 Apr 2022 05:45:54 AM UTC
Innodb_buffer_pool_pages_dirty	453656
Fri 29 Apr 2022 05:45:55 AM UTC
Innodb_buffer_pool_pages_dirty	453965
Fri 29 Apr 2022 05:45:55 AM UTC
Innodb_buffer_pool_pages_dirty	454245
Fri 29 Apr 2022 05:45:55 AM UTC
Innodb_buffer_pool_pages_dirty	454511
Fri 29 Apr 2022 05:45:56 AM UTC
Innodb_buffer_pool_pages_dirty	454900
Fri 29 Apr 2022 05:45:56 AM UTC
Innodb_buffer_pool_pages_dirty	455171
Fri 29 Apr 2022 05:45:56 AM UTC
Innodb_buffer_pool_pages_dirty	455490
Fri 29 Apr 2022 05:45:57 AM UTC
Innodb_buffer_pool_pages_dirty	455839
Fri 29 Apr 2022 05:45:57 AM UTC
Innodb_buffer_pool_pages_dirty	456162
Fri 29 Apr 2022 05:45:57 AM UTC
Innodb_buffer_pool_pages_dirty	456411
Fri 29 Apr 2022 05:45:57 AM UTC
Innodb_buffer_pool_pages_dirty	456738
Fri 29 Apr 2022 05:45:58 AM UTC
Innodb_buffer_pool_pages_dirty	457005
Fri 29 Apr 2022 05:45:58 AM UTC
Innodb_buffer_pool_pages_dirty	457293
Fri 29 Apr 2022 05:45:58 AM UTC
Innodb_buffer_pool_pages_dirty	457628
Fri 29 Apr 2022 05:45:59 AM UTC
Innodb_buffer_pool_pages_dirty	457908
Fri 29 Apr 2022 05:45:59 AM UTC
Innodb_buffer_pool_pages_dirty	458174
Fri 29 Apr 2022 05:45:59 AM UTC
Innodb_buffer_pool_pages_dirty	458533
Fri 29 Apr 2022 05:45:59 AM UTC
Innodb_buffer_pool_pages_dirty	458803
Fri 29 Apr 2022 05:46:00 AM UTC
Innodb_buffer_pool_pages_dirty	459127
Fri 29 Apr 2022 05:46:00 AM UTC
Innodb_buffer_pool_pages_dirty	459397
Fri 29 Apr 2022 05:46:00 AM UTC
Innodb_buffer_pool_pages_dirty	459695
Fri 29 Apr 2022 05:46:01 AM UTC
Innodb_buffer_pool_pages_dirty	459977
Fri 29 Apr 2022 05:46:01 AM UTC
Innodb_buffer_pool_pages_dirty	460296
Fri 29 Apr 2022 05:46:01 AM UTC
Innodb_buffer_pool_pages_dirty	460621
Fri 29 Apr 2022 05:46:02 AM UTC
Innodb_buffer_pool_pages_dirty	460858
Fri 29 Apr 2022 05:46:02 AM UTC
Innodb_buffer_pool_pages_dirty	461214
Fri 29 Apr 2022 05:46:02 AM UTC
Innodb_buffer_pool_pages_dirty	461446
Fri 29 Apr 2022 05:46:02 AM UTC
Innodb_buffer_pool_pages_dirty	461729
Fri 29 Apr 2022 05:46:03 AM UTC
Innodb_buffer_pool_pages_dirty	450837
Fri 29 Apr 2022 05:46:03 AM UTC
Innodb_buffer_pool_pages_dirty	448800
Fri 29 Apr 2022 05:46:03 AM UTC
Innodb_buffer_pool_pages_dirty	448403
Fri 29 Apr 2022 05:46:04 AM UTC
Innodb_buffer_pool_pages_dirty	447829
Fri 29 Apr 2022 05:46:04 AM UTC
Innodb_buffer_pool_pages_dirty	415128
Fri 29 Apr 2022 05:46:04 AM UTC
Innodb_buffer_pool_pages_dirty	413864
Fri 29 Apr 2022 05:46:05 AM UTC
Innodb_buffer_pool_pages_dirty	413175
Fri 29 Apr 2022 05:46:05 AM UTC
Innodb_buffer_pool_pages_dirty	412523
Fri 29 Apr 2022 05:46:05 AM UTC
Innodb_buffer_pool_pages_dirty	382330
Fri 29 Apr 2022 05:46:05 AM UTC
Innodb_buffer_pool_pages_dirty	383903
Fri 29 Apr 2022 05:46:06 AM UTC
Innodb_buffer_pool_pages_dirty	385345
Fri 29 Apr 2022 05:46:06 AM UTC
Innodb_buffer_pool_pages_dirty	385902
Fri 29 Apr 2022 05:46:06 AM UTC
Innodb_buffer_pool_pages_dirty	385191
Fri 29 Apr 2022 05:46:07 AM UTC
Innodb_buffer_pool_pages_dirty	384407
Fri 29 Apr 2022 05:46:07 AM UTC
Innodb_buffer_pool_pages_dirty	383459
Fri 29 Apr 2022 05:46:07 AM UTC
Innodb_buffer_pool_pages_dirty	349641
Fri 29 Apr 2022 05:46:08 AM UTC
Innodb_buffer_pool_pages_dirty	348902
Fri 29 Apr 2022 05:46:08 AM UTC
Innodb_buffer_pool_pages_dirty	348207
Fri 29 Apr 2022 05:46:08 AM UTC
Innodb_buffer_pool_pages_dirty	347570
Fri 29 Apr 2022 05:46:08 AM UTC
Innodb_buffer_pool_pages_dirty	312613
Fri 29 Apr 2022 05:46:09 AM UTC
Innodb_buffer_pool_pages_dirty	311983
Fri 29 Apr 2022 05:46:09 AM UTC
Innodb_buffer_pool_pages_dirty	311004
Fri 29 Apr 2022 05:46:09 AM UTC
Innodb_buffer_pool_pages_dirty	310174
Fri 29 Apr 2022 05:46:10 AM UTC
Innodb_buffer_pool_pages_dirty	276432
Fri 29 Apr 2022 05:46:10 AM UTC
Innodb_buffer_pool_pages_dirty	275693
Fri 29 Apr 2022 05:46:10 AM UTC
Innodb_buffer_pool_pages_dirty	274733
Fri 29 Apr 2022 05:46:11 AM UTC
Innodb_buffer_pool_pages_dirty	273413
Fri 29 Apr 2022 05:46:11 AM UTC
Innodb_buffer_pool_pages_dirty	238148
Fri 29 Apr 2022 05:46:11 AM UTC
Innodb_buffer_pool_pages_dirty	236917
Fri 29 Apr 2022 05:46:11 AM UTC
Innodb_buffer_pool_pages_dirty	236126
Fri 29 Apr 2022 05:46:12 AM UTC
Innodb_buffer_pool_pages_dirty	234991
Fri 29 Apr 2022 05:46:12 AM UTC
Innodb_buffer_pool_pages_dirty	198547
Fri 29 Apr 2022 05:46:12 AM UTC
Innodb_buffer_pool_pages_dirty	197258
Fri 29 Apr 2022 05:46:13 AM UTC
Innodb_buffer_pool_pages_dirty	196086
Fri 29 Apr 2022 05:46:13 AM UTC
Innodb_buffer_pool_pages_dirty	194813
Fri 29 Apr 2022 05:46:13 AM UTC
Innodb_buffer_pool_pages_dirty	160424
Fri 29 Apr 2022 05:46:14 AM UTC
Innodb_buffer_pool_pages_dirty	159380
Fri 29 Apr 2022 05:46:14 AM UTC
Innodb_buffer_pool_pages_dirty	158317
Fri 29 Apr 2022 05:46:14 AM UTC
Innodb_buffer_pool_pages_dirty	157133
Fri 29 Apr 2022 05:46:14 AM UTC
Innodb_buffer_pool_pages_dirty	121455
Fri 29 Apr 2022 05:46:15 AM UTC
Innodb_buffer_pool_pages_dirty	120382
Fri 29 Apr 2022 05:46:15 AM UTC
Innodb_buffer_pool_pages_dirty	119376
Fri 29 Apr 2022 05:46:15 AM UTC
Innodb_buffer_pool_pages_dirty	118359
Fri 29 Apr 2022 05:46:16 AM UTC
Innodb_buffer_pool_pages_dirty	86659
Fri 29 Apr 2022 05:46:16 AM UTC
Innodb_buffer_pool_pages_dirty	81122
Fri 29 Apr 2022 05:46:16 AM UTC
Innodb_buffer_pool_pages_dirty	80142
Fri 29 Apr 2022 05:46:17 AM UTC
Innodb_buffer_pool_pages_dirty	79379
Fri 29 Apr 2022 05:46:17 AM UTC
Innodb_buffer_pool_pages_dirty	78310
Fri 29 Apr 2022 05:46:17 AM UTC
Innodb_buffer_pool_pages_dirty	43063
Fri 29 Apr 2022 05:46:17 AM UTC
Innodb_buffer_pool_pages_dirty	29740
Fri 29 Apr 2022 05:46:18 AM UTC
Innodb_buffer_pool_pages_dirty	30244
Fri 29 Apr 2022 05:46:18 AM UTC
Innodb_buffer_pool_pages_dirty	29654

Captura de pantalla 2022-04-29 a las 7.48.31.png (796×1 px, 92 KB)

Captura de pantalla 2022-04-29 a las 7.49.20.png (800×1 px, 86 KB)

Captura de pantalla 2022-04-29 a las 7.50.06.png (800×1 px, 77 KB)

Captura de pantalla 2022-04-29 a las 7.49.58.png (808×1 px, 71 KB)

Making an aggressive change to innodb_max_dirty_pages_pct and moving it from 75 to 1 a few hours ago did make a big impact on the graphs making the spikes a lot smaller on pc2014 as we are flushing more often and less pages per flush.
Again there's no performance issue having 75, so I want to investigate this on db1132 (as pc2014 doesn't receive reads) next week and see what's the impact of flushing more but less data.

Captura de pantalla 2022-04-29 a las 11.49.24.png (492×915 px, 80 KB)

Captura de pantalla 2022-04-29 a las 11.49.45.png (257×913 px, 62 KB)

Captura de pantalla 2022-04-29 a las 11.49.34.png (265×923 px, 59 KB)

For the record and to see the effect on graphs during the long weekend:

root@db1132.eqiad.wmnet[(none)]> set global innodb_max_dirty_pages_pct = 50;
Query OK, 0 rows affected (0.001 sec)

For the record and to see the effect on graphs during the long weekend:

root@db1132.eqiad.wmnet[(none)]> set global innodb_max_dirty_pages_pct = 50;
Query OK, 0 rows affected (0.001 sec)

Set this back to 10% and will see if there's performance impact on db1132 which is now serving traffic in s1.

root@db1132.eqiad.wmnet[(none)]> set global innodb_max_dirty_pages_pct = 1;
Query OK, 0 rows affected (0.000 sec)

root@db1132.eqiad.wmnet[(none)]>

So, it is not making any difference, perfomance-wise how many % of dirty pages we allow for flushing. db1132 is able to serve as fast as the other 10.4 hosts and nothing is being reported as slow and/or with more latency than any other s1 host serving on main and API traffic. Probably going with the default we used to have is good enough. Going to leave this open until I place the host in s8 and see how that goes T307546

One of the reasons innodb_checkpoint_age is now showing those spikes and we didn't see them in 10.4 is cause in 10.4 this flag was removed and re-added again on 10.6 (https://mariadb.com/docs/reference/mdb/status-variables/Innodb_checkpoint_age/) so it is entirely possible that we had the same behaviour in 10.4 but we simply didn't know about it. Again, this is causing no performance issues at all so far.

We can clearly see this on db1172 which was migrated to 10.6 a few hours ago:

Captura de pantalla 2022-05-09 a las 11.11.26.png (259×918 px, 27 KB)

I have captured db2146 variables when running 10.4 and right after the migration to 10.6 to see what could be causing the flushes.
This is the diff of relevant variables which I will explore individually:

--- db2146_10.4	2022-05-11 05:14:09.024590676 +0000
+++ db2146_10.6	2022-05-11 05:16:43.440802082 +0000



-innodb_adaptive_max_sleep_delay	150000
-innodb_background_scrub_data_check_interval	3600
-innodb_background_scrub_data_compressed	OFF
-innodb_background_scrub_data_interval	604800
-innodb_background_scrub_data_uncompressed	OFF
-innodb_buffer_pool_instances	8
-innodb_checksums	ON
-innodb_commit_concurrency	0
-innodb_concurrency_tickets	5000
+innodb_deadlock_report	full
-innodb_file_format
-innodb_force_load_corrupted	OFF
-innodb_idle_flush_pct	100
-innodb_large_prefix
-innodb_lock_schedule_algorithm	fcfs
-innodb_locks_unsafe_for_binlog	OFF
-innodb_log_checksums	ON
-innodb_log_compressed_pages	ON
-innodb_log_files_in_group	2
-innodb_log_optimize_ddl	OFF
-innodb_lru_scan_depth	1024
-innodb_max_dirty_pages_pct	75.000000
+innodb_lru_flush_size	32
+innodb_lru_scan_depth	1536
+innodb_max_dirty_pages_pct	90.000000
-innodb_page_cleaners	4
-innodb_replication_delay	0
+innodb_read_only_compressed	OFF
-innodb_rollback_segments	128
-innodb_scrub_log	OFF
-innodb_scrub_log_speed	256
-innodb_stats_sample_pages	16
-innodb_stats_transient_sample_pages	16
+innodb_stats_transient_sample_pages	8
-innodb_sync_array_size	1
-innodb_thread_concurrency	0
-innodb_thread_sleep_delay	10000
-innodb_undo_logs	128
-innodb_version	10.4.22
+innodb_version	10.6.7
-max_long_data_size	33554432
-max_recursive_iterations	4294967295
+max_recursive_iterations	1000
-multi_range_count	256
-thread_concurrency	10
+thread_pool_dedicated_listener	OFF
+thread_pool_exact_stats	OFF
-version	10.4.22-MariaDB-log
+version	10.6.7-MariaDB-log

Mentioned in SAL (#wikimedia-operations) [2022-05-11T06:40:20Z] <marostegui> db2146 set global innodb_max_dirty_pages_pct = 75; T307082

root@db1172.eqiad.wmnet[(none)]> set global innodb_stats_transient_sample_pages = 16;
Query OK, 0 rows affected (0.000 sec)

root@db1172.eqiad.wmnet[(none)]> set global innodb_lru_scan_depth = 1024;
Query OK, 0 rows affected (0.001 sec)

I am going to consider this fixed. Looks like we don't have any performance benefit or regression with the changes and how the graph looks like. I think this might be a red herring explained by T307082#7913090