
Increase timeout for mariadb replication check
Closed, Declined · Public

Description

dbstore1001 starts running the backups on Wednesday around midnight (UTC), which creates a lot of I/O on the server. During this time the server usually raises alerts like these:

04:46 < icinga-wm> RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave                                                                                                                                                            
04:46 < icinga-wm> RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes                                                                                                                                                 
04:46 < icinga-wm> RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes                                                                                                                                                    
04:46 < icinga-wm> RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
04:46 < icinga-wm> RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave
04:47 < icinga-wm> RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes                                                                                                                                                    
04:47 < icinga-wm> RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes                                                                                                                                                    
04:47 < icinga-wm> RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes                                                                                                                                                    
04:47 < icinga-wm> RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)                                                                                                                         
04:47 < icinga-wm> RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)                                                                                                                         
04:48 < icinga-wm> RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes                                                                                                                                                    
04:48 < icinga-wm> RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes                                                                                                                                                    
04:48 < icinga-wm> RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)                                                                                                                          
04:59 < icinga-wm> PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.                                                                                                                                          
04:59 < icinga-wm> PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.                                                                                                                                          
04:59 < icinga-wm> PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.                                                                                                                                         
04:59 < icinga-wm> PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.                                                                                                                                          
04:59 < icinga-wm> PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.                                                                                                                                          
04:59 < icinga-wm> PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.                                                                                                                                           
05:00 < icinga-wm> PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.                                                                                                                                           
05:00 < icinga-wm> PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.                                                                                                                                           
05:00 < icinga-wm> PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.                                                                                                                                          
05:00 < icinga-wm> PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.                                                                                                                                           
05:00 < icinga-wm> PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.                                                                                                                                          
05:00 < icinga-wm> PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.                                                                                                                                           
05:00 < icinga-wm> PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.                                                                                                                                           
05:00 < icinga-wm> PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.                                                                                                                                          
05:00 < icinga-wm> PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.                                                                                                                                          
05:00 < icinga-wm> RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)                                                                                                                          
05:00 < icinga-wm> RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)                                                                                                                          
05:00 < icinga-wm> RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)                                                                                                                          
05:00 < icinga-wm> RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes                                                                                                                                                     
05:00 < icinga-wm> RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)                                                                                                                          
05:00 < icinga-wm> RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)                                                                                                                          
05:00 < icinga-wm> RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes                                                                                                                                                     
05:00 < icinga-wm> RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave                                                                                                                                                             
05:00 < icinga-wm> RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes                                                                                                                                                  
05:00 < icinga-wm> RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave                                                                                                                                                               
05:00 < icinga-wm> RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes                                                                                                                                                     
05:00 < icinga-wm> RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes                                                                                                                                                     
05:01 < icinga-wm> RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes                                                                                                                                                     
05:01 < icinga-wm> RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
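
To gauge how noisy these flaps are, the excerpt above can be tallied per check. This is only a minimal sketch, assuming the icinga-wm line format shown in the log; `count_timeouts` and the regular expression are hypothetical helpers written for illustration, not part of any existing tooling.

```python
import re
from collections import Counter

# Matches only the NRPE socket-timeout alerts, e.g.:
# "04:59 < icinga-wm> PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is
#  CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds."
TIMEOUT_RE = re.compile(
    r"\d{2}:\d{2} < icinga-wm> PROBLEM - "
    r"(?P<check>MariaDB Slave (?:SQL|IO): \S+) on (?P<host>\S+) is "
    r"CRITICAL: CHECK_NRPE: Socket timeout after \d+ seconds"
)


def count_timeouts(log_text):
    """Count socket-timeout PROBLEM alerts per (host, check) pair.

    Hypothetical helper: scans the whole text rather than line by line,
    so two alerts pasted onto one line are still counted separately.
    """
    counts = Counter()
    for match in TIMEOUT_RE.finditer(log_text):
        counts[(match.group("host"), match.group("check"))] += 1
    return counts
```

Feeding it the pasted log text returns a `Counter` keyed by host and check name; RECOVERY lines and non-timeout problems are ignored.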

We pushed https://gerrit.wikimedia.org/r/#/c/345372/ to reduce the number of processes that run in parallel. It helped prevent alerts like these, but the backups then took more than 24 hours to finish, so the change was reverted.

We might want to increase the timeout of the checks (currently 10 seconds, per the alerts above) to prevent this noise.
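
The effect of a longer timeout can be illustrated with a small sketch. It is not how NRPE or the Icinga checks are implemented; `run_check` is a hypothetical wrapper that runs a check command with a configurable time limit and, like the alerts above, reports a timeout as CRITICAL. The slow check is simulated with a short sleep.

```python
import subprocess
import sys

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_check(cmd, timeout_s):
    """Run a check command, allowing it at most timeout_s seconds.

    Hypothetical wrapper for illustration: returns (status, message).
    A timeout is reported as CRITICAL, mirroring the alerts in the log.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return CRITICAL, f"check timed out after {timeout_s} seconds"
    return proc.returncode, proc.stdout.strip()


# Simulated slow check: answers correctly, but only after half a second,
# standing in for a status query delayed by heavy backup I/O.
slow_check = [
    sys.executable, "-c",
    "import time; time.sleep(0.5); print('OK slave_io_state Slave_IO_Running: Yes')",
]

short = run_check(slow_check, 0.1)  # limit too tight: reported as CRITICAL
longer = run_check(slow_check, 5)   # generous limit: the real result comes back
```

The same check flips from CRITICAL to OK purely because of the time limit, which is the kind of noise described above. The trade-off is that a longer timeout also delays detection when the server really is unresponsive.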

Event Timeline

With the full reimplementation of the backups/dbstore hosts, let's decline this.