I have noticed some periodic spikes on db1132 (bullseye + MariaDB 10.6) which aren't present on buster or bullseye with MariaDB 10.4.
These are some examples:
They do not seem to affect the host performance though.
Marostegui (Apr 28 2022, 9:39 AM)
Attached screenshots:

F35119220: Captura de pantalla 2022-05-09 a las 11.11.26.png (May 9 2022, 9:11 AM)
F35073103: Captura de pantalla 2022-04-29 a las 11.49.34.png (Apr 29 2022, 9:50 AM)
F35073104: Captura de pantalla 2022-04-29 a las 11.49.45.png (Apr 29 2022, 9:50 AM)
F35073105: Captura de pantalla 2022-04-29 a las 11.49.24.png (Apr 29 2022, 9:50 AM)
F35072911: Captura de pantalla 2022-04-29 a las 7.49.58.png (Apr 29 2022, 5:51 AM)
F35072905: Captura de pantalla 2022-04-29 a las 7.48.31.png (Apr 29 2022, 5:51 AM)
F35072908: Captura de pantalla 2022-04-29 a las 7.49.20.png (Apr 29 2022, 5:51 AM)
F35072912: Captura de pantalla 2022-04-29 a las 7.50.06.png (Apr 29 2022, 5:51 AM)
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Marostegui | T301879 Test MariaDB 10.6 on Bullseye
Resolved | | Marostegui | T307082 Investigate spikes on db1132 (mariadb 10.6 host)
At first glance, there's a difference in the default percentage of dirty pages allowed between 10.6 and 10.4, so I am going to change 10.6 to mimic 10.4.
Mentioned in SAL (#wikimedia-operations) [2022-04-28T09:41:40Z] <marostegui> Change db1132 innodb_max_dirty_pages_pct from 90% to 75% T307082
Mentioned in SAL (#wikimedia-operations) [2022-04-28T09:51:29Z] <marostegui> Change pc2014 innodb_max_dirty_pages_pct from 90% to 75% T307082
The innodb_max_dirty_pages_pct change didn't prevent the spike after 2 hours; going to keep digging.
From what I have been able to see, this is mostly related to how/when we flush pages to disk. As I mentioned before, this is not causing any performance degradation on db1132 (serving in s1 with normal weight), just some weird graphs. So far I think it might be related to the new innodb_lru_flush_size variable: https://mariadb.com/kb/en/innodb-system-variables/#innodb_lru_flush_size
Still investigating
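The flushing-related settings under suspicion can be inspected in one query. A minimal sketch, assuming a local mysql client with access to the server (auth/socket options omitted):

```shell
# Lists the LRU/flushing variables discussed above in one call.
# The bare 'mysql' invocation is an assumption; real hosts may need
# --socket/--defaults-file options.
flush_vars() {
  mysql -NBe "SHOW GLOBAL VARIABLES WHERE Variable_name IN
    ('innodb_lru_flush_size', 'innodb_lru_scan_depth', 'innodb_max_dirty_pages_pct')"
}
```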
There's also Innodb_checkpoint_max_age (https://mariadb.com/docs/reference/mdb/status-variables/Innodb_checkpoint_max_age/), a status variable that was re-introduced in 10.6 but is not present in 10.4, which makes some of the graphs even more confusing. This behaviour is probably also happening on 10.4, but we simply couldn't graph it there.
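To correlate the spikes with specific counters at higher resolution than the dashboards offer, one can poll SHOW GLOBAL STATUS in a tight loop. This is a hypothetical sketch of that kind of sampling loop, not the exact command used for the captures in this task; the mysql client invocation is an assumption:

```shell
# Hypothetical sampling loop: prints one "timestamp variable value" record
# per poll, for a bounded number of iterations.
sample_status() {
  local var="$1" n="$2"
  for _ in $(seq "$n"); do
    printf '%s %s\n' "$(date -u)" \
      "$(mysql -NBe "SHOW GLOBAL STATUS LIKE '${var}'" | tr '\t' ' ')"
    sleep 0.2
  done
}
# e.g.: sample_status Innodb_buffer_pool_pages_dirty 300
```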
I can confirm each spike matches the flush of the dirty pages:
Fri 29 Apr 2022 05:45:48 AM UTC Innodb_buffer_pool_pages_dirty 447462
Fri 29 Apr 2022 05:45:48 AM UTC Innodb_buffer_pool_pages_dirty 447777
Fri 29 Apr 2022 05:45:49 AM UTC Innodb_buffer_pool_pages_dirty 448092
Fri 29 Apr 2022 05:45:49 AM UTC Innodb_buffer_pool_pages_dirty 448330
Fri 29 Apr 2022 05:45:49 AM UTC Innodb_buffer_pool_pages_dirty 448625
Fri 29 Apr 2022 05:45:50 AM UTC Innodb_buffer_pool_pages_dirty 449007
Fri 29 Apr 2022 05:45:50 AM UTC Innodb_buffer_pool_pages_dirty 449306
Fri 29 Apr 2022 05:45:50 AM UTC Innodb_buffer_pool_pages_dirty 449571
Fri 29 Apr 2022 05:45:51 AM UTC Innodb_buffer_pool_pages_dirty 449928
Fri 29 Apr 2022 05:45:51 AM UTC Innodb_buffer_pool_pages_dirty 450226
Fri 29 Apr 2022 05:45:51 AM UTC Innodb_buffer_pool_pages_dirty 450490
Fri 29 Apr 2022 05:45:51 AM UTC Innodb_buffer_pool_pages_dirty 450807
Fri 29 Apr 2022 05:45:52 AM UTC Innodb_buffer_pool_pages_dirty 451112
Fri 29 Apr 2022 05:45:52 AM UTC Innodb_buffer_pool_pages_dirty 451371
Fri 29 Apr 2022 05:45:52 AM UTC Innodb_buffer_pool_pages_dirty 451649
Fri 29 Apr 2022 05:45:53 AM UTC Innodb_buffer_pool_pages_dirty 451874
Fri 29 Apr 2022 05:45:53 AM UTC Innodb_buffer_pool_pages_dirty 452138
Fri 29 Apr 2022 05:45:53 AM UTC Innodb_buffer_pool_pages_dirty 452506
Fri 29 Apr 2022 05:45:54 AM UTC Innodb_buffer_pool_pages_dirty 452797
Fri 29 Apr 2022 05:45:54 AM UTC Innodb_buffer_pool_pages_dirty 453051
Fri 29 Apr 2022 05:45:54 AM UTC Innodb_buffer_pool_pages_dirty 453350
Fri 29 Apr 2022 05:45:54 AM UTC Innodb_buffer_pool_pages_dirty 453656
Fri 29 Apr 2022 05:45:55 AM UTC Innodb_buffer_pool_pages_dirty 453965
Fri 29 Apr 2022 05:45:55 AM UTC Innodb_buffer_pool_pages_dirty 454245
Fri 29 Apr 2022 05:45:55 AM UTC Innodb_buffer_pool_pages_dirty 454511
Fri 29 Apr 2022 05:45:56 AM UTC Innodb_buffer_pool_pages_dirty 454900
Fri 29 Apr 2022 05:45:56 AM UTC Innodb_buffer_pool_pages_dirty 455171
Fri 29 Apr 2022 05:45:56 AM UTC Innodb_buffer_pool_pages_dirty 455490
Fri 29 Apr 2022 05:45:57 AM UTC Innodb_buffer_pool_pages_dirty 455839
Fri 29 Apr 2022 05:45:57 AM UTC Innodb_buffer_pool_pages_dirty 456162
Fri 29 Apr 2022 05:45:57 AM UTC Innodb_buffer_pool_pages_dirty 456411
Fri 29 Apr 2022 05:45:57 AM UTC Innodb_buffer_pool_pages_dirty 456738
Fri 29 Apr 2022 05:45:58 AM UTC Innodb_buffer_pool_pages_dirty 457005
Fri 29 Apr 2022 05:45:58 AM UTC Innodb_buffer_pool_pages_dirty 457293
Fri 29 Apr 2022 05:45:58 AM UTC Innodb_buffer_pool_pages_dirty 457628
Fri 29 Apr 2022 05:45:59 AM UTC Innodb_buffer_pool_pages_dirty 457908
Fri 29 Apr 2022 05:45:59 AM UTC Innodb_buffer_pool_pages_dirty 458174
Fri 29 Apr 2022 05:45:59 AM UTC Innodb_buffer_pool_pages_dirty 458533
Fri 29 Apr 2022 05:45:59 AM UTC Innodb_buffer_pool_pages_dirty 458803
Fri 29 Apr 2022 05:46:00 AM UTC Innodb_buffer_pool_pages_dirty 459127
Fri 29 Apr 2022 05:46:00 AM UTC Innodb_buffer_pool_pages_dirty 459397
Fri 29 Apr 2022 05:46:00 AM UTC Innodb_buffer_pool_pages_dirty 459695
Fri 29 Apr 2022 05:46:01 AM UTC Innodb_buffer_pool_pages_dirty 459977
Fri 29 Apr 2022 05:46:01 AM UTC Innodb_buffer_pool_pages_dirty 460296
Fri 29 Apr 2022 05:46:01 AM UTC Innodb_buffer_pool_pages_dirty 460621
Fri 29 Apr 2022 05:46:02 AM UTC Innodb_buffer_pool_pages_dirty 460858
Fri 29 Apr 2022 05:46:02 AM UTC Innodb_buffer_pool_pages_dirty 461214
Fri 29 Apr 2022 05:46:02 AM UTC Innodb_buffer_pool_pages_dirty 461446
Fri 29 Apr 2022 05:46:02 AM UTC Innodb_buffer_pool_pages_dirty 461729
Fri 29 Apr 2022 05:46:03 AM UTC Innodb_buffer_pool_pages_dirty 450837
Fri 29 Apr 2022 05:46:03 AM UTC Innodb_buffer_pool_pages_dirty 448800
Fri 29 Apr 2022 05:46:03 AM UTC Innodb_buffer_pool_pages_dirty 448403
Fri 29 Apr 2022 05:46:04 AM UTC Innodb_buffer_pool_pages_dirty 447829
Fri 29 Apr 2022 05:46:04 AM UTC Innodb_buffer_pool_pages_dirty 415128
Fri 29 Apr 2022 05:46:04 AM UTC Innodb_buffer_pool_pages_dirty 413864
Fri 29 Apr 2022 05:46:05 AM UTC Innodb_buffer_pool_pages_dirty 413175
Fri 29 Apr 2022 05:46:05 AM UTC Innodb_buffer_pool_pages_dirty 412523
Fri 29 Apr 2022 05:46:05 AM UTC Innodb_buffer_pool_pages_dirty 382330
Fri 29 Apr 2022 05:46:05 AM UTC Innodb_buffer_pool_pages_dirty 383903
Fri 29 Apr 2022 05:46:06 AM UTC Innodb_buffer_pool_pages_dirty 385345
Fri 29 Apr 2022 05:46:06 AM UTC Innodb_buffer_pool_pages_dirty 385902
Fri 29 Apr 2022 05:46:06 AM UTC Innodb_buffer_pool_pages_dirty 385191
Fri 29 Apr 2022 05:46:07 AM UTC Innodb_buffer_pool_pages_dirty 384407
Fri 29 Apr 2022 05:46:07 AM UTC Innodb_buffer_pool_pages_dirty 383459
Fri 29 Apr 2022 05:46:07 AM UTC Innodb_buffer_pool_pages_dirty 349641
Fri 29 Apr 2022 05:46:08 AM UTC Innodb_buffer_pool_pages_dirty 348902
Fri 29 Apr 2022 05:46:08 AM UTC Innodb_buffer_pool_pages_dirty 348207
Fri 29 Apr 2022 05:46:08 AM UTC Innodb_buffer_pool_pages_dirty 347570
Fri 29 Apr 2022 05:46:08 AM UTC Innodb_buffer_pool_pages_dirty 312613
Fri 29 Apr 2022 05:46:09 AM UTC Innodb_buffer_pool_pages_dirty 311983
Fri 29 Apr 2022 05:46:09 AM UTC Innodb_buffer_pool_pages_dirty 311004
Fri 29 Apr 2022 05:46:09 AM UTC Innodb_buffer_pool_pages_dirty 310174
Fri 29 Apr 2022 05:46:10 AM UTC Innodb_buffer_pool_pages_dirty 276432
Fri 29 Apr 2022 05:46:10 AM UTC Innodb_buffer_pool_pages_dirty 275693
Fri 29 Apr 2022 05:46:10 AM UTC Innodb_buffer_pool_pages_dirty 274733
Fri 29 Apr 2022 05:46:11 AM UTC Innodb_buffer_pool_pages_dirty 273413
Fri 29 Apr 2022 05:46:11 AM UTC Innodb_buffer_pool_pages_dirty 238148
Fri 29 Apr 2022 05:46:11 AM UTC Innodb_buffer_pool_pages_dirty 236917
Fri 29 Apr 2022 05:46:11 AM UTC Innodb_buffer_pool_pages_dirty 236126
Fri 29 Apr 2022 05:46:12 AM UTC Innodb_buffer_pool_pages_dirty 234991
Fri 29 Apr 2022 05:46:12 AM UTC Innodb_buffer_pool_pages_dirty 198547
Fri 29 Apr 2022 05:46:12 AM UTC Innodb_buffer_pool_pages_dirty 197258
Fri 29 Apr 2022 05:46:13 AM UTC Innodb_buffer_pool_pages_dirty 196086
Fri 29 Apr 2022 05:46:13 AM UTC Innodb_buffer_pool_pages_dirty 194813
Fri 29 Apr 2022 05:46:13 AM UTC Innodb_buffer_pool_pages_dirty 160424
Fri 29 Apr 2022 05:46:14 AM UTC Innodb_buffer_pool_pages_dirty 159380
Fri 29 Apr 2022 05:46:14 AM UTC Innodb_buffer_pool_pages_dirty 158317
Fri 29 Apr 2022 05:46:14 AM UTC Innodb_buffer_pool_pages_dirty 157133
Fri 29 Apr 2022 05:46:14 AM UTC Innodb_buffer_pool_pages_dirty 121455
Fri 29 Apr 2022 05:46:15 AM UTC Innodb_buffer_pool_pages_dirty 120382
Fri 29 Apr 2022 05:46:15 AM UTC Innodb_buffer_pool_pages_dirty 119376
Fri 29 Apr 2022 05:46:15 AM UTC Innodb_buffer_pool_pages_dirty 118359
Fri 29 Apr 2022 05:46:16 AM UTC Innodb_buffer_pool_pages_dirty 86659
Fri 29 Apr 2022 05:46:16 AM UTC Innodb_buffer_pool_pages_dirty 81122
Fri 29 Apr 2022 05:46:16 AM UTC Innodb_buffer_pool_pages_dirty 80142
Fri 29 Apr 2022 05:46:17 AM UTC Innodb_buffer_pool_pages_dirty 79379
Fri 29 Apr 2022 05:46:17 AM UTC Innodb_buffer_pool_pages_dirty 78310
Fri 29 Apr 2022 05:46:17 AM UTC Innodb_buffer_pool_pages_dirty 43063
Fri 29 Apr 2022 05:46:17 AM UTC Innodb_buffer_pool_pages_dirty 29740
Fri 29 Apr 2022 05:46:18 AM UTC Innodb_buffer_pool_pages_dirty 30244
Fri 29 Apr 2022 05:46:18 AM UTC Innodb_buffer_pool_pages_dirty 29654
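From the capture above, dirty pages fall from a peak of 461729 to 29654 in roughly 15 seconds. A quick back-of-the-envelope check of what one such flush burst writes, assuming the default InnoDB page size of 16 KiB (an assumption; innodb_page_size is not shown in this task):

```shell
# Values copied from the capture above.
peak=461729     # dirty pages just before the flush kicks in
after=29654     # dirty pages once the flush finishes, ~15s later
pages=$((peak - after))
# Assumption: default innodb_page_size of 16 KiB per page.
mib=$((pages * 16 / 1024))
echo "${pages} pages flushed, ~${mib} MiB in ~15s"
```

So each spike flushes on the order of 6-7 GiB in one short burst, which is consistent with the sawtooth shape of the graphs.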
Making an aggressive change to innodb_max_dirty_pages_pct a few hours ago, moving it from 75 to 1, had a big impact on the graphs: the spikes on pc2014 are a lot smaller, as we are flushing more often and fewer pages per flush.
Again, there's no performance issue at 75, so next week I want to investigate this on db1132 (as pc2014 doesn't receive reads) and see what the impact is of flushing more often but less data per flush.
For the record and to see the effect on graphs during the long weekend:
root@db1132.eqiad.wmnet[(none)]> set global innodb_max_dirty_pages_pct = 50;
Query OK, 0 rows affected (0.001 sec)
Set this back to 10%; will see if there's any performance impact on db1132, which is now serving traffic in s1.
root@db1132.eqiad.wmnet[(none)]> set global innodb_max_dirty_pages_pct = 1;
Query OK, 0 rows affected (0.000 sec)
So, performance-wise, it makes no difference what percentage of dirty pages we allow before flushing. db1132 is able to serve as fast as the other 10.4 hosts, and nothing is being reported as slow or with more latency than any other s1 host serving main and API traffic. Going with the default we used to have is probably good enough. Going to leave this open until I place the host in s8 and see how that goes: T307546
One of the reasons innodb_checkpoint_age is now showing those spikes, while we didn't see them on 10.4, is that this status variable was removed in 10.4 and re-added in 10.6 (https://mariadb.com/docs/reference/mdb/status-variables/Innodb_checkpoint_age/). So it is entirely possible that we had the same behaviour on 10.4 but simply didn't know about it. Again, this is causing no performance issues at all so far.
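For completeness, the checkpoint-age counters can be read directly on 10.6, whereas on 10.4 the variable does not exist and the same query comes back empty. A minimal sketch (the bare mysql invocation is an assumption; real hosts may need auth/socket options):

```shell
# Reads the checkpoint-age status counters. On 10.6 this returns
# Innodb_checkpoint_age and Innodb_checkpoint_max_age; on 10.4 the
# result set is empty because the variables were removed there.
checkpoint_age() {
  mysql -NBe "SHOW GLOBAL STATUS LIKE 'Innodb_checkpoint%'"
}
```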
We can clearly see this on db1172 which was migrated to 10.6 a few hours ago:
I have captured db2146 variables when running 10.4 and right after the migration to 10.6 to see what could be causing the flushes.
This is the diff of relevant variables which I will explore individually:
--- db2146_10.4 2022-05-11 05:14:09.024590676 +0000
+++ db2146_10.6 2022-05-11 05:16:43.440802082 +0000
-innodb_adaptive_max_sleep_delay 150000
-innodb_background_scrub_data_check_interval 3600
-innodb_background_scrub_data_compressed OFF
-innodb_background_scrub_data_interval 604800
-innodb_background_scrub_data_uncompressed OFF
-innodb_buffer_pool_instances 8
-innodb_checksums ON
-innodb_commit_concurrency 0
-innodb_concurrency_tickets 5000
+innodb_deadlock_report full
-innodb_file_format
-innodb_force_load_corrupted OFF
-innodb_idle_flush_pct 100
-innodb_large_prefix
-innodb_lock_schedule_algorithm fcfs
-innodb_locks_unsafe_for_binlog OFF
-innodb_log_checksums ON
-innodb_log_compressed_pages ON
-innodb_log_files_in_group 2
-innodb_log_optimize_ddl OFF
-innodb_lru_scan_depth 1024
-innodb_max_dirty_pages_pct 75.000000
+innodb_lru_flush_size 32
+innodb_lru_scan_depth 1536
+innodb_max_dirty_pages_pct 90.000000
-innodb_page_cleaners 4
-innodb_replication_delay 0
+innodb_read_only_compressed OFF
-innodb_rollback_segments 128
-innodb_scrub_log OFF
-innodb_scrub_log_speed 256
-innodb_stats_sample_pages 16
-innodb_stats_transient_sample_pages 16
+innodb_stats_transient_sample_pages 8
-innodb_sync_array_size 1
-innodb_thread_concurrency 0
-innodb_thread_sleep_delay 10000
-innodb_undo_logs 128
-innodb_version 10.4.22
+innodb_version 10.6.7
-max_long_data_size 33554432
-max_recursive_iterations 4294967295
+max_recursive_iterations 1000
-multi_range_count 256
-thread_concurrency 10
+thread_pool_dedicated_listener OFF
+thread_pool_exact_stats OFF
-version 10.4.22-MariaDB-log
+version 10.6.7-MariaDB-log
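A before/after diff like the one above can be captured by dumping SHOW GLOBAL VARIABLES around the upgrade. A sketch, where the filenames and the bare mysql invocation are assumptions, not the literal commands used:

```shell
# Dumps all global variables to a file, sorted for stable diffing.
capture_vars() {
  mysql -NBe "SHOW GLOBAL VARIABLES" | sort > "$1"
}
# capture_vars db2146_10.4    # before the upgrade, on 10.4
# ... upgrade mariadb-server to 10.6 and restart ...
# capture_vars db2146_10.6
# diff -u db2146_10.4 db2146_10.6
```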
Mentioned in SAL (#wikimedia-operations) [2022-05-11T06:40:20Z] <marostegui> db2146 set global innodb_max_dirty_pages_pct = 75; T307082
root@db1172.eqiad.wmnet[(none)]> set global innodb_stats_transient_sample_pages = 16;
Query OK, 0 rows affected (0.000 sec)
root@db1172.eqiad.wmnet[(none)]> set global innodb_lru_scan_depth = 1024;
Query OK, 0 rows affected (0.001 sec)
I am going to consider this fixed. Looks like we get no performance benefit or regression from the changes, regardless of how the graphs look. I think this might be a red herring, explained by T307082#7913090.