Page MenuHomePhabricator

Backfill data into db1065 and db1066
Closed, ResolvedPublic

Description

When the query to rebuild the recentchanges data (from RebuildRecentchanges::rebuildRecentChangesTablePass1) arrived on db1065 and db1066 (API servers) due to a missing INDEX on rev_timestamp the execution plan was a full table scan and would have taken the replication stuck for a very long time. It was reading ~1100 row/s.

So we have decided to kill it and skip it in the replication.

The reason behind this decision:

  • Not user facing (at least for now)
  • Replica was stuck and estimation of completion was very long
  • The 2 servers are in eqiad (passive DC today) and dedicated to API
  • The inserted data on other servers were just 2682 rows, we can backfill them tomorrow manually or with pt-table-checksum
  • This is temporary data that in any case will go away in ~1-3 months and seems that will not be visible anyway due to application layer filter limitation to 5000 changes given that this is enwiki

The killed query was:

INSERT /* RebuildRecentchanges::rebuildRecentChangesTablePass1  */
INTO `recentchanges` (rc_timestamp,rc_user,rc_user_text,rc_namespace,rc_title,rc_comment,rc_minor,rc_bot,rc_new,rc_cur_id,rc_this_oldid,rc_last_oldid,rc_type,rc_source,rc_deleted)
SELECT rev_timestamp,rev_user,rev_user_text,page_namespace,page_title,rev_comment,rev_minor_edit,0,page_is_new,page_id,rev_id,0, (CASE WHEN page_is_new != 0 THEN 1 ELSE 0 END) , (CASE WHEN page_is_new != 0 THEN 'mw.new' ELSE 'mw.edit' END) ,rev_deleted
FROM `page`,`revision`
WHERE (rev_timestamp > '20160419144741') AND (rev_timestamp < '20160419151018') AND (rev_page=page_id)
ORDER BY rev_timestamp DESC

Execution plan

+------+-------------+----------+------+------------------------------------------------+-------------+---------+---------------------+----------+----------+---------------------------------+
| id   | select_type | table    | type | possible_keys                                  | key         | key_len | ref                 | rows     | filtered | Extra                           |
+------+-------------+----------+------+------------------------------------------------+-------------+---------+---------------------+----------+----------+---------------------------------+
|    1 | SIMPLE      | page     | ALL  | PRIMARY                                        | NULL        | NULL    | NULL                | 41618662 |   100.00 | Using temporary; Using filesort |
|    1 | SIMPLE      | revision | ref  | rev_page_id,page_timestamp,page_user_timestamp | rev_page_id | 4       | enwiki.page.page_id |        6 |   100.00 | Using where                     |
+------+-------------+----------+------+------------------------------------------------+-------------+---------+---------------------+----------+----------+---------------------------------+

Event Timeline

And this is why insert...selects are evil.

For context:

160419 22:42:19 [ERROR] Slave SQL: Error 'Connection was killed' on query. Default database: 'enwiki'. Query: 'INSERT /* RebuildRecentchanges::rebuildRecentChangesTablePass1  */  INTO `recentchanges` (rc_timestamp,rc_user,rc_user_text,rc_namespace,rc_title,rc_comment,rc_minor,rc_bot,rc_new,rc_cur_id,rc_this_oldid,rc_last_oldid,rc_type,rc_source,rc_deleted) SELECT  rev_timestamp,rev_user,rev_user_text,page_namespace,page_title,rev_comment,rev_minor_edit,0,page_is_new,page_id,rev_id,0, (CASE WHEN page_is_new != 0 THEN 1 ELSE 0 END) , (CASE WHEN page_is_new != 0 THEN 'mw.new' ELSE 'mw.edit' END) ,rev_deleted FROM `page`,`revision`   WHERE (rev_timestamp > '20160419144741') AND (rev_timestamp < '20160419151018') AND (rev_page=page_id)  ORDER BY rev_timestamp DESC', Internal MariaDB error code: 1927
160419 22:44:15 [ERROR] Slave SQL: Error 'Connection was killed' on query. Default database: 'enwiki'. Query: 'INSERT /* RebuildRecentchanges::rebuildRecentChangesTablePass1  */  INTO `recentchanges` (rc_timestamp,rc_user,rc_user_text,rc_namespace,rc_title,rc_comment,rc_minor,rc_bot,rc_new,rc_cur_id,rc_this_oldid,rc_last_oldid,rc_type,rc_source,rc_deleted) SELECT  rev_timestamp,rev_user,rev_user_text,page_namespace,page_title,rev_comment,rev_minor_edit,0,page_is_new,page_id,rev_id,0, (CASE WHEN page_is_new != 0 THEN 1 ELSE 0 END) , (CASE WHEN page_is_new != 0 THEN 'mw.new' ELSE 'mw.edit' END) ,rev_deleted FROM `page`,`revision`   WHERE (rev_timestamp > '20160419144741') AND (rev_timestamp < '20160419151018') AND (rev_page=page_id)  ORDER BY rev_timestamp DESC', Internal MariaDB error code: 1927
160419 22:44:30 [Note] 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='./db1065-relay-bin.008253', relay_log_pos='195098609', master_log_name='db1052-bin.003636', master_log_pos='195098321' and new position at relay_log_file='./db1065-relay-bin.008253', relay_log_pos='195099446', master_log_name='db1052-bin.003636', master_log_pos='195099158'
root@neodymium:~$ mysql -h db1065.eqiad.wmnet enwiki -e "SELECT count(*) FROM recentchanges WHERE (rc_timestamp > '20160419144741') AND (rc_timestamp < '20160419151018')"
+----------+
| count(*) |
+----------+
|      240 |
+----------+

root@neodymium:~$ mysql -h db2016.codfw.wmnet enwiki -e "SELECT count(*) FROM recentchanges WHERE (rc_timestamp > '20160419144741') AND (rc_timestamp < '20160419151018')"
+----------+
| count(*) |
+----------+
|     2914 |
+----------+

root@neodymium:~$ mysql -h db2016.codfw.wmnet enwiki -e "SELECT count(*) FROM recentchanges WHERE (rc_timestamp <= '20160419144741') AND (rc_timestamp > '20160419130000')"
+----------+
| count(*) |
+----------+
|    14603 |
+----------+
root@neodymium:~$ mysql -h db1065.eqiad.wmnet enwiki -e "SELECT count(*) FROM recentchanges WHERE (rc_timestamp <= '20160419144741') AND (rc_timestamp > '20160419130000')"
+----------+
| count(*) |
+----------+
|    14603 |
+----------+
root@neodymium:~$ mysql -h db2016.codfw.wmnet enwiki -e "SELECT count(*) FROM recentchanges WHERE (rc_timestamp < '201604200000000') AND (rc_timestamp >= '20160419151018')"
+----------+
| count(*) |
+----------+
|   153811 |
+----------+
root@neodymium:~$ mysql -h db1065.eqiad.wmnet enwiki -e "SELECT count(*) FROM recentchanges WHERE (rc_timestamp < '201604200000000') AND (rc_timestamp >= '20160419151018')"
+----------+
| count(*) |
+----------+
|   153811 |
+----------+

I would have created the index (it would have been relatively fast for innodb > 5.5), then started replication again. But the decision was good. The problem is as usual STATEMENT and an unsafe statement, which if chained could have cause more issues (eg revision and archive tables for deleting and undeleting rows).

Got 2914 rows on neodymium:/home/jynus, ready to import mysql -h db1065 enwiki < T133122.sqlwhen I have a +1:

mysqldump -n -t --compact --skip-extended-insert --insert-ignore --single-transaction --default-character-set=binary -h db1052.eqiad.wmnet enwiki recentchanges --where "(rc_timestamp > '20160419144741') AND (rc_timestamp < '20160419151018')" > T133122.sql

@jcrespo how to avoid duplicates for the 240 already existing on db1065/db1066? Should we dump and delete them to check they are there afterwards?

Check the mysqldump parameters and/or the generated file :-)

Still waking up, missed the insert-ignore :)

I also had to run:

root@neodymium:~$ mysql -h db1065.eqiad.wmnet enwiki -e "DELETE FROM recentchanges WHERE rc_id IN (817302324, 817301655)"
root@neodymium:~$ mysql -h db1066.eqiad.wmnet enwiki -e "DELETE FROM recentchanges WHERE rc_id IN (817302324, 817301655)"

But we may have created wrong ids or suffered from bad charset, because:

mysql -BN -A -h db2016.codfw.wmnet enwiki -e "SELECT * FROM recentchanges WHERE (rc_timestamp > '20160419144741') AND (rc_timestamp < '20160419151018')" | md5sum
e4339aa026abdc60fe0b1d349d369c01  -
mysql -BN -A -h db1065.eqiad.wmnet enwiki -e "SELECT * FROM recentchanges WHERE (rc_timestamp > '20160419144741') AND (rc_timestamp < '20160419151018')" | md5sum
3c5ca2cae74c5eb921a8e63715049079  -
mysql -BN -A -h db1066.eqiad.wmnet enwiki -e "SELECT * FROM recentchanges WHERE (rc_timestamp > '20160419144741') AND (rc_timestamp < '20160419151018')" | md5sum
3c5ca2cae74c5eb921a8e63715049079  -
mysql -BN -A -h db1052.eqiad.wmnet enwiki -e "SELECT * FROM recentchanges WHERE (rc_timestamp > '20160419144741') AND (rc_timestamp < '20160419151018')" | md5sum
e4339aa026abdc60fe0b1d349d369c01  -
root@neodymium:~$ mysql -BN -A -h db1065.eqiad.wmnet enwiki -e "UPDATE recentchanges SET rc_patrolled=1 WHERE rc_id = 817303249"
root@neodymium:~$ mysql -BN -A -h db1066.eqiad.wmnet enwiki -e "UPDATE recentchanges SET rc_patrolled=1 WHERE rc_id = 817303249"
root@neodymium:~$ mysql -BN -A -h db2016.codfw.wmnet enwiki -e "SELECT * FROM recentchanges WHERE (rc_timestamp > '20160419144741') AND (rc_timestamp < '20160419151018')" | md5sum
e4339aa026abdc60fe0b1d349d369c01  -
root@neodymium:~$ mysql -BN -A -h db1065.eqiad.wmnet enwiki -e "SELECT * FROM recentchanges WHERE (rc_timestamp > '20160419144741') AND (rc_timestamp < '20160419151018')" | md5sum
e4339aa026abdc60fe0b1d349d369c01  -
root@neodymium:~$ mysql -BN -A -h db1066.eqiad.wmnet enwiki -e "SELECT * FROM recentchanges WHERE (rc_timestamp > '20160419144741') AND (rc_timestamp < '20160419151018')" | md5sum
e4339aa026abdc60fe0b1d349d369c01  -