Under high load, there are replication check pile-ups on core DBs, especially enwiki API servers
Closed, Resolved · Public

Description

Actionables:

  • Convert heartbeat tables to InnoDB to see if that helps with concurrency
  • Increase the number of concurrent threads on the pool of connections
  • Reimage the API enwiki servers with the latest version of MariaDB and disable query-rewrite

All of the above actions have the potential of making the problem worse, not better.

https://grafana.wikimedia.org/dashboard/db/mysql?panelId=37&fullscreen&var-dc=eqiad%20prometheus%2Fops&var-server=db1065&from=1478800185345&to=1478806077117

Mediawiki logging complains about: "db1065 not replicating" https://logstash.wikimedia.org/goto/9bed049145ba2757e546bea6bc3967b1

jcrespo created this task. Nov 10 2016, 7:27 PM
Restricted Application added a subscriber: Aklapper. Nov 10 2016, 7:27 PM
jcrespo edited the task description.
jcrespo claimed this task. Feb 14 2017, 11:20 AM
jcrespo moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2017-02-14T11:28:35Z] <jynus> performing schema change on all mariadb servers T150474

cat *.hosts | sort | uniq | while read host port; do echo $host $port; mysql --skip-ssl -A -h $host -P $port heartbeat -e "SHOW CREATE TABLE heartbeat\G" | grep ENGINE=MyISAM; done
db1001.eqiad.wmnet 3306
db1009.eqiad.wmnet 3306
db1015.eqiad.wmnet 3306
db1016.eqiad.wmnet 3306
db1018.eqiad.wmnet 3306
db1020.eqiad.wmnet 3306
db1021.eqiad.wmnet 3306
db1022.eqiad.wmnet 3306
db1023.eqiad.wmnet 3306
db1024.eqiad.wmnet 3306
db1026.eqiad.wmnet 3306
db1028.eqiad.wmnet 3306
db1029.eqiad.wmnet 3306
db1030.eqiad.wmnet 3306
db1031.eqiad.wmnet 3306
db1033.eqiad.wmnet 3306
db1034.eqiad.wmnet 3306
db1035.eqiad.wmnet 3306
db1036.eqiad.wmnet 3306
db1037.eqiad.wmnet 3306
db1038.eqiad.wmnet 3306
db1039.eqiad.wmnet 3306
db1040.eqiad.wmnet 3306
db1041.eqiad.wmnet 3306
db1043.eqiad.wmnet 3306
db1044.eqiad.wmnet 3306
db1045.eqiad.wmnet 3306
db1046.eqiad.wmnet 3306
db1047.eqiad.wmnet 3306
db1047.eqiad.wmnet 3306
db1048.eqiad.wmnet 3306
db1049.eqiad.wmnet 3306
db1050.eqiad.wmnet 3306
db1051.eqiad.wmnet 3306
) ENGINE=MyISAM DEFAULT CHARSET=binary
db1052.eqiad.wmnet 3306
) ENGINE=MyISAM DEFAULT CHARSET=binary
db1053.eqiad.wmnet 3306
db1054.eqiad.wmnet 3306
db1055.eqiad.wmnet 3306
) ENGINE=MyISAM DEFAULT CHARSET=binary
db1056.eqiad.wmnet 3306
) ENGINE=MyISAM DEFAULT CHARSET=binary
db1057.eqiad.wmnet 3306
) ENGINE=MyISAM DEFAULT CHARSET=binary
db1059.eqiad.wmnet 3306
db1060.eqiad.wmnet 3306
db1061.eqiad.wmnet 3306
db1062.eqiad.wmnet 3306
db1063.eqiad.wmnet 3306
db1064.eqiad.wmnet 3306
db1065.eqiad.wmnet 3306
db1066.eqiad.wmnet 3306
) ENGINE=MyISAM DEFAULT CHARSET=binary
db1067.eqiad.wmnet 3306
db1068.eqiad.wmnet 3306
db1069.eqiad.wmnet 3311
db1069.eqiad.wmnet 3312
db1069.eqiad.wmnet 3313
db1069.eqiad.wmnet 3314
db1069.eqiad.wmnet 3315
db1069.eqiad.wmnet 3316
db1069.eqiad.wmnet 3317
db1070.eqiad.wmnet 3306
) ENGINE=MyISAM DEFAULT CHARSET=binary
db1071.eqiad.wmnet 3306
) ENGINE=MyISAM DEFAULT CHARSET=binary
db1072.eqiad.wmnet 3306
) ENGINE=MyISAM DEFAULT CHARSET=binary
db1073.eqiad.wmnet 3306
) ENGINE=MyISAM DEFAULT CHARSET=binary
db1074.eqiad.wmnet 3306
) ENGINE=MyISAM DEFAULT CHARSET=binary
db1075.eqiad.wmnet 3306
) ENGINE=MyISAM DEFAULT CHARSET=binary
db1076.eqiad.wmnet 3306
) ENGINE=MyISAM DEFAULT CHARSET=binary
db1077.eqiad.wmnet 3306
) ENGINE=MyISAM DEFAULT CHARSET=binary
db1078.eqiad.wmnet 3306
) ENGINE=MyISAM DEFAULT CHARSET=binary
db1079.eqiad.wmnet 3306
db1080.eqiad.wmnet 3306
) ENGINE=MyISAM DEFAULT CHARSET=binary
db1081.eqiad.wmnet 3306
) ENGINE=MyISAM DEFAULT CHARSET=binary
db1082.eqiad.wmnet 3306
) ENGINE=MyISAM DEFAULT CHARSET=binary
db1083.eqiad.wmnet 3306
) ENGINE=MyISAM DEFAULT CHARSET=binary
db1084.eqiad.wmnet 3306
) ENGINE=MyISAM DEFAULT CHARSET=binary
db1085.eqiad.wmnet 3306
db1086.eqiad.wmnet 3306
db1087.eqiad.wmnet 3306
) ENGINE=MyISAM DEFAULT CHARSET=binary
db1088.eqiad.wmnet 3306
db1089.eqiad.wmnet 3306
) ENGINE=MyISAM DEFAULT CHARSET=binary
db1090.eqiad.wmnet 3306
) ENGINE=MyISAM DEFAULT CHARSET=binary
db1091.eqiad.wmnet 3306
) ENGINE=MyISAM DEFAULT CHARSET=binary
db1092.eqiad.wmnet 3306
) ENGINE=MyISAM DEFAULT CHARSET=binary
db1093.eqiad.wmnet 3306
db1094.eqiad.wmnet 3306
db1095.eqiad.wmnet 3306
ERROR 1045 (28000): Access denied for user 'root'@'10.64.32.20' (using password: YES)
db2010.codfw.wmnet 3306
db2011.codfw.wmnet 3306
db2012.codfw.wmnet 3306
db2016.codfw.wmnet 3306
db2017.codfw.wmnet 3306
db2018.codfw.wmnet 3306
db2019.codfw.wmnet 3306
db2023.codfw.wmnet 3306
db2028.codfw.wmnet 3306
db2029.codfw.wmnet 3306
db2030.codfw.wmnet 3306
db2033.codfw.wmnet 3306
db2034.codfw.wmnet 3306
db2035.codfw.wmnet 3306
db2036.codfw.wmnet 3306
db2037.codfw.wmnet 3306
db2038.codfw.wmnet 3306
db2039.codfw.wmnet 3306
db2040.codfw.wmnet 3306
db2041.codfw.wmnet 3306
db2042.codfw.wmnet 3306
db2043.codfw.wmnet 3306
db2044.codfw.wmnet 3306
db2045.codfw.wmnet 3306
db2046.codfw.wmnet 3306
db2047.codfw.wmnet 3306
db2048.codfw.wmnet 3306
db2049.codfw.wmnet 3306
db2050.codfw.wmnet 3306
db2051.codfw.wmnet 3306
db2052.codfw.wmnet 3306
db2053.codfw.wmnet 3306
db2054.codfw.wmnet 3306
db2055.codfw.wmnet 3306
db2056.codfw.wmnet 3306
db2057.codfw.wmnet 3306
db2058.codfw.wmnet 3306
db2059.codfw.wmnet 3306
db2060.codfw.wmnet 3306
db2061.codfw.wmnet 3306
db2062.codfw.wmnet 3306
db2063.codfw.wmnet 3306
db2064.codfw.wmnet 3306
db2065.codfw.wmnet 3306
db2066.codfw.wmnet 3306
db2067.codfw.wmnet 3306
db2068.codfw.wmnet 3306
db2069.codfw.wmnet 3306
db2070.codfw.wmnet 3306
dbstore1001.eqiad.wmnet 3306
dbstore1002.eqiad.wmnet 3306
dbstore2001.codfw.wmnet 3306
dbstore2002.codfw.wmnet 3306
labsdb1001.eqiad.wmnet 3306
labsdb1003.eqiad.wmnet 3306
labsdb1009.eqiad.wmnet 3306
ERROR 1045 (28000): Access denied for user 'root'@'10.64.32.20' (using password: YES)
labsdb1010.eqiad.wmnet 3306
ERROR 1045 (28000): Access denied for user 'root'@'10.64.32.20' (using password: YES)
labsdb1011.eqiad.wmnet 3306
ERROR 1045 (28000): Access denied for user 'root'@'10.64.32.20' (using password: YES)

which means the following are still pending (either still MyISAM, or access denied so not checked):

db1051.eqiad.wmnet 3306
db1052.eqiad.wmnet 3306
db1055.eqiad.wmnet 3306
db1056.eqiad.wmnet 3306
db1057.eqiad.wmnet 3306
db1066.eqiad.wmnet 3306
db1070.eqiad.wmnet 3306
db1071.eqiad.wmnet 3306
db1072.eqiad.wmnet 3306
db1073.eqiad.wmnet 3306
db1074.eqiad.wmnet 3306
db1075.eqiad.wmnet 3306
db1076.eqiad.wmnet 3306
db1077.eqiad.wmnet 3306
db1078.eqiad.wmnet 3306
db1080.eqiad.wmnet 3306
db1081.eqiad.wmnet 3306
db1082.eqiad.wmnet 3306
db1083.eqiad.wmnet 3306
db1084.eqiad.wmnet 3306
db1087.eqiad.wmnet 3306
db1089.eqiad.wmnet 3306
db1090.eqiad.wmnet 3306
db1091.eqiad.wmnet 3306
db1092.eqiad.wmnet 3306
db1095.eqiad.wmnet 3306
labsdb1009.eqiad.wmnet 3306
labsdb1010.eqiad.wmnet 3306
labsdb1011.eqiad.wmnet 3306
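
The pending list above can be derived mechanically from the check output rather than by eye. A minimal sketch, assuming the output has the shape shown in this task (a "host port" line, optionally followed by an ENGINE=MyISAM line or a client ERROR line); the function name pending_hosts is hypothetical and not part of any existing tooling:

```shell
# pending_hosts (hypothetical helper): read the check loop's output on
# stdin and print every "host port" pair that was followed by a MyISAM
# match or by a client error.
pending_hosts() {
    awk '
        # A "host port" line: remember it as the most recent host.
        NF == 2 && $2 ~ /^[0-9]+$/ { last = $0; next }
        # A MyISAM match or an ERROR line belongs to the remembered host.
        /ENGINE=MyISAM/ || /^ERROR/ { if (last != "") print last; last = "" }
    '
}
```

The mysql client writes its ERROR lines to stderr, so the loop's output would need 2>&1 before being piped into the function for the access-denied hosts to be picked up.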
jcrespo added a comment. Edited · Feb 14 2017, 12:33 PM

Done the labs ones, db1052, db1095, db1057 and db1075. The rest will need depooling.
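
For reference, the per-host conversion could be sketched roughly as below. This is an assumption about the shape of the command, not a record of what was actually run: it reuses the connection flags from the check loop above and a plain ALTER TABLE ... ENGINE=InnoDB, and the helper name convert_cmd is hypothetical. It only prints the commands (dry run) so they can be reviewed first.

```shell
# Hypothetical helper: print (do not run) the conversion command for one host.
convert_cmd() {
    host=$1
    port=$2
    # ALTER TABLE ... ENGINE=InnoDB rebuilds the table under the new engine.
    printf 'mysql --skip-ssl -A -h %s -P %s heartbeat -e "%s"\n' \
        "$host" "$port" 'ALTER TABLE heartbeat ENGINE=InnoDB'
}

# Read "host port" pairs on stdin and print one command per pair.
while read -r host port; do
    convert_cmd "$host" "$port"
done <<'EOF'
db1052.eqiad.wmnet 3306
db1057.eqiad.wmnet 3306
EOF
```

Whether the statement should be kept out of the binary log on a given host depends on the replication topology, which this task does not state.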

Change 337573 had a related patch set uploaded (by Jcrespo):
Depool db1051,66,80,74,77,56,81,70,82 for maintenance

https://gerrit.wikimedia.org/r/337573

Change 337573 merged by jenkins-bot:
Depool db1051,66,80,74,77,56,81,70,82 for maintenance

https://gerrit.wikimedia.org/r/337573

Change 337579 had a related patch set uploaded (by Jcrespo):
Depool db1055,72,83,56,84,76,78,87,71 for maintenance

https://gerrit.wikimedia.org/r/337579

Change 337579 merged by jenkins-bot:
Depool db1055,72,83,56,84,76,78,87,71 for maintenance

https://gerrit.wikimedia.org/r/337579

Change 337586 had a related patch set uploaded (by Jcrespo):
Depool db1073,89,90,91,92 for maintenance

https://gerrit.wikimedia.org/r/337586

Change 337586 merged by jenkins-bot:
Depool db1073,89,90,91,92 for maintenance

https://gerrit.wikimedia.org/r/337586

jcrespo closed this task as "Resolved". Feb 14 2017, 2:50 PM

This is now done. There was indeed contention here. Whether the conversion to InnoDB is enough to fix it, or whether there are more causes of contention, is something we will see with time. Extra corrections may be needed, such as tuning the pool of connections or changing the model for when new connections fail.

Change 337840 had a related patch set uploaded (by Jcrespo):
Increase the concurrent threads of large mariadb servers

https://gerrit.wikimedia.org/r/337840

Change 337840 merged by Jcrespo:
Increase the concurrent threads of large mariadb servers

https://gerrit.wikimedia.org/r/337840