T156373 During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time

jcrespo created this task. Jan 26 2017, 2:49 PM

Restricted Application added subscribers: TerraCodes, Aklapper. Jan 26 2017, 2:49 PM

I will back up the current dataset and recover the last backup onto db1048.

Mentioned in SAL (#wikimedia-operations) [2017-01-26T14:52:42Z] <jynus> stopping mysql on db1048 T156373

Looking at db2012 (which uses GTID): every time the slave is started it tries to resume from the same position, meaning it cannot get through even a single transaction:

170126 14:13:03 [Note] Slave I/O thread: connected to master 'repl@db1043.eqiad.wmnet:3306',replication starts at GTID position '0-171970592-200273010,171970592-171970592-8362351'
170126 14:23:45 [Note] Slave I/O thread: connected to master 'repl@db1043.eqiad.wmnet:3306',replication starts at GTID position '0-171970592-200273010,171970592-171970592-8362351'
170126 14:36:49 [Note] Slave I/O thread: connected to master 'repl@db1043.eqiad.wmnet:3306',replication starts at GTID position '0-171970592-200273010,171970592-171970592-8362351'
170126 14:44:44 [Note] Slave I/O thread: connected to master 'repl@db1043.eqiad.wmnet:3306',replication starts at GTID position '0-171970592-200273010,171970592-171970592-8362351'
170126 14:45:53 [Note] Slave I/O thread: connected to master 'repl@db1043.eqiad.wmnet:3306',replication starts at GTID position '0-171970592-200273010,171970592-171970592-8362351'
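The repetition can also be checked mechanically rather than by eye. A minimal sketch, assuming the error-log lines keep the format shown above; the helper names `start_positions` and `parse_gtid_list` are illustrative and not part of any tooling used in this task.

```python
import re
from collections import Counter

# Captures the position reported by the "Slave I/O thread: connected" lines.
START_RE = re.compile(r"replication starts at GTID position '([^']+)'")

def start_positions(log_lines):
    """Count how often each GTID start position appears in the log lines."""
    counts = Counter()
    for line in log_lines:
        match = START_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

def parse_gtid_list(position):
    """Split a MariaDB GTID position into (domain_id, server_id, seq_no) tuples."""
    return [tuple(int(part) for part in gtid.split("-"))
            for gtid in position.split(",")]

# Three of the lines quoted above, copied verbatim.
log_lines = [
    "170126 14:13:03 [Note] Slave I/O thread: connected to master 'repl@db1043.eqiad.wmnet:3306',replication starts at GTID position '0-171970592-200273010,171970592-171970592-8362351'",
    "170126 14:23:45 [Note] Slave I/O thread: connected to master 'repl@db1043.eqiad.wmnet:3306',replication starts at GTID position '0-171970592-200273010,171970592-171970592-8362351'",
    "170126 14:36:49 [Note] Slave I/O thread: connected to master 'repl@db1043.eqiad.wmnet:3306',replication starts at GTID position '0-171970592-200273010,171970592-171970592-8362351'",
]

counts = start_positions(log_lines)
for position, restarts in counts.items():
    print(restarts, "restart(s) at", parse_gtid_list(position))
```

A single distinct position across all restarts is what indicates that the slave never advances.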

That is this:

REPLACE INTO `heartbeat`.`heartbeat` (ts, server_id, file, position, relay_master_log_file, exec_master_log_pos, shard, datacenter) VALUES ('2017-01-26T02:00:21.001210', '171970592', 'db1043-bin.001457', '753447334', 'db1048-bin.001435', '7214391', 'm3', 'eqiad')
/*!*/;
# at 753447703
#170126  2:00:21 server id 171970592  end_log_pos 753447777     Query   thread_id=109713383     exec_time=0     error_code=0
SET TIMESTAMP=1485396021/*!*/;
COMMIT
/*!*/;
# at 753447777
#170126  2:00:21 server id 171970592  end_log_pos 753447815     **GTID 171970592-171970592-8362351**


/*!100001 SET @@session.gtid_domain_id=171970592*//*!*/;
/*!100001 SET @@session.gtid_seq_no=8362351*//*!*/;
BEGIN
/*!*/;
# at 753447815
#170126  2:00:21 server id 171970592  end_log_pos 753447843     Intvar
SET INSERT_ID=161901/*!*/;
# at 753447843
#170126  2:00:21 server id 171970592  end_log_pos 753448238     Query   thread_id=317863984     exec_time=0     error_code=0
use `phabricator_repository`/*!*/;
SET TIMESTAMP=1485396021/*!*/;
SET @@session.sql_mode=4194304/*!*/;
/*!\C utf8mb4 *//*!*/;
SET @@session.character_set_client=45,@@session.collation_connection=45,@@session.collation_server=63/*!*/;
INSERT INTO `repository_pullevent` (`repositoryPHID`, `epoch`, `pullerPHID`, `remoteAddress`, `remoteProtocol`, `resultType`, `resultCode`, `properties`, `phid`) VALUES ('PHID-REPO-xx', '1485396021', NULL, 'xxx', 'ssh', 'pull', '0', 'null', 'xx')
/*!*/;
# at 753448238
#170126  2:00:21 server id 171970592  end_log_pos 753448265     Xid = 6661367891
COMMIT/*!*/;
# at 753448265
#170126  2:00:22 server id 171970592  end_log_pos 753448303   ***** GTID 0-171970592-200272994*****

So it looks like it crashes when executing:

INSERT INTO `repository_pullevent` (`repositoryPHID`, `epoch`, `pullerPHID`, `remoteAddress`, `remoteProtocol`, `resultType`, `resultCode`, `properties`, `phid`) VALUES ('PHID-REPO-xx', '1485396021', NULL, 'xxxx', 'ssh', 'pull', '0', 'null', 'PHID-PULE-xxx')

This is the original crash:

2017-01-26 02:00:22 7fa9cebf6700 InnoDB: FTS Optimize Removing table phabricator_search/#sql2-1145-522051
2017-01-26 02:01:55 7fa9cebf6700 InnoDB: FTS Optimize Removing table phabricator_search/search_documentfield

So it was removing a temporary table called: #sql2-1145-522051
Looks related to this ALTER that happened 5 minutes before the crash:

#170126  1:55:41 server id 171970592  end_log_pos 753455998     Query   thread_id=317863696     exec_time=297   error_code=0
SET TIMESTAMP=1485395741/*!*/;
SET @@session.sql_mode=4194304/*!*/;
/*!\C utf8mb4 *//*!*/;
SET @@session.character_set_client=45,@@session.collation_connection=45,@@session.collation_server=63/*!*/;

ALTER TABLE `phabricator_search`.`search_documentfield` ADD FULLTEXT `key_corpus` (corpus, stemmedCorpus)

/*!*/;
# at 753455998
#170126  2:00:39 server id 171970592  end_log_pos 753456036     GTID 0-171970592-200273011
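The timing can be worked out from the event header itself. A small sketch, assuming the server clock is UTC (the `#170126  1:55:41` header agrees with the UTC conversion of `SET TIMESTAMP`) and that `exec_time` is the statement's duration in seconds as measured on the master:

```python
from datetime import datetime, timedelta, timezone

# Values copied from the ALTER's binlog event above.
alter_start = datetime.fromtimestamp(1485395741, tz=timezone.utc)  # SET TIMESTAMP
exec_time = timedelta(seconds=297)                                 # exec_time=297

alter_end = alter_start + exec_time

print(alter_start.strftime("%Y-%m-%d %H:%M:%S"))  # 2017-01-26 01:55:41
print(alter_end.strftime("%Y-%m-%d %H:%M:%S"))    # 2017-01-26 02:00:38
```

Under those assumptions the ALTER ran on the master from about 01:55:41 to 02:00:38, which brackets the 02:00:22 "FTS Optimize Removing table" message and ends just before the next GTID event at 02:00:39.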

However, it looks like that ALTER never finished on the replica, as the table on db2012 doesn't have the index:

+------------+
| @@hostname |
+------------+
| db2012     |
+------------+
1 row in set (0.01 sec)

*************************** 1. row ***************************
       Table: search_documentfield
Create Table: CREATE TABLE `search_documentfield` (
  `phid` varbinary(64) NOT NULL,
  `phidType` varchar(4) COLLATE utf8mb4_bin NOT NULL,
  `field` varchar(4) COLLATE utf8mb4_bin NOT NULL,
  `auxPHID` varbinary(64) DEFAULT NULL,
  `corpus` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
  `stemmedCorpus` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
  KEY `phid` (`phid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin
1 row in set (0.00 sec)

Whereas the master does:

+------------+
| @@hostname |
+------------+
| db1043     |
+------------+
1 row in set (0.00 sec)

*************************** 1. row ***************************
       Table: search_documentfield
Create Table: CREATE TABLE `search_documentfield` (
  `phid` varbinary(64) NOT NULL,
  `phidType` varchar(4) COLLATE utf8mb4_bin NOT NULL,
  `field` varchar(4) COLLATE utf8mb4_bin NOT NULL,
  `auxPHID` varbinary(64) DEFAULT NULL,
  `corpus` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
  `stemmedCorpus` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
  KEY `phid` (`phid`),
  FULLTEXT KEY `key_corpus` (`corpus`,`stemmedCorpus`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin
1 row in set (0.00 sec)
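The difference between the two definitions can be extracted programmatically instead of compared by eye. A rough sketch, assuming both definitions are available as `SHOW CREATE TABLE` text; the regular expression and the name `index_definitions` are illustrative, and the table bodies below are abbreviated to the lines relevant here.

```python
import re

# Matches index lines in SHOW CREATE TABLE output, with or without a trailing comma, e.g.
#   KEY `phid` (`phid`),
#   FULLTEXT KEY `key_corpus` (`corpus`,`stemmedCorpus`)
KEY_RE = re.compile(
    r"^\s*((?:PRIMARY |UNIQUE |FULLTEXT |SPATIAL )?KEY\b.*?),?\s*$",
    re.MULTILINE,
)

def index_definitions(create_table_sql):
    """Return the set of index definitions found in a CREATE TABLE statement."""
    return {m.group(1) for m in KEY_RE.finditer(create_table_sql)}

replica_sql = """CREATE TABLE `search_documentfield` (
  `phid` varbinary(64) NOT NULL,
  `corpus` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
  KEY `phid` (`phid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin"""

master_sql = """CREATE TABLE `search_documentfield` (
  `phid` varbinary(64) NOT NULL,
  `corpus` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
  KEY `phid` (`phid`),
  FULLTEXT KEY `key_corpus` (`corpus`,`stemmedCorpus`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin"""

missing_on_replica = index_definitions(master_sql) - index_definitions(replica_sql)
print(missing_on_replica)
```

With the definitions above, the only index missing on the replica is the `key_corpus` FULLTEXT key.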

When running the ALTER manually on db2012, the server crashes again.
After mysqldumping the table, dropping it, and re-importing it, I started the slave to let the ALTER go through the normal replication thread: the server crashed, although there is no FTS error anymore.
Running the ALTER manually shows that it is still causing the server to crash.

The only significant difference between the servers is the MariaDB version: db2012 runs 10.0.28 whereas the master runs 10.0.23, which also means a different InnoDB version.

Paladox subscribed. Jan 26 2017, 3:58 PM

Looking at the above ^^, upstream has created a new phabricator_search table, so we no longer need to have it created.

But I also experienced this problem when running anything on that table. I had to drop it and recreate it.

See https://phabricator.wikimedia.org/D509 (the SQL part is the important one)

This looks like a bug introduced when they started supporting InnoDB full-text search.

(Not sure how related what I am talking about is; I am just going by what you found.)

In T156373#2972979, @Paladox wrote:

Looking at the above ^^, upstream has created a new phabricator_search table, so we no longer need to have it created.

But I also experienced this problem when running anything on that table. I had to drop it and recreate it.

See https://phabricator.wikimedia.org/D509

(Not sure how related what I am talking about is; I am just going by what you found.)

Thanks for the help. Dropping and recreating it didn't work for us. @Paladox which mariadb version were you testing on?
Also, the SQL patch you suggest says:

ALTER TABLE {$NAMESPACE}_search.search_documentfield DROP INDEX `corpus`;

But that index doesn't exist on production.

I have set up a completely new box with 10.0.28, imported the table, and tried to run the ALTER; it crashed the server:

Jan 21 13:34:41 clone mysqld: 2017-01-21 13:34:41 7f90033fe700  InnoDB: Assertion failure in thread 140256506537728 in file row0merge.cc line 803
Jan 21 13:34:41 clone mysqld: InnoDB: Failing assertion: b == &block[0] + buf->total_size
Jan 21 13:34:41 clone mysqld: InnoDB: We intentionally generate a memory trap.
Jan 21 13:34:41 clone mysqld: InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
Jan 21 13:34:41 clone mysqld: InnoDB: If you get repeated assertion failures or crashes, even
Jan 21 13:34:41 clone mysqld: InnoDB: immediately after the mysqld startup, there may be
Jan 21 13:34:41 clone mysqld: InnoDB: corruption in the InnoDB tablespace. Please refer to
Jan 21 13:34:41 clone mysqld: InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
Jan 21 13:34:41 clone mysqld: InnoDB: about forcing recovery.
Jan 21 13:34:41 clone mysqld: 170121 13:34:41 [ERROR] mysqld got signal 6 ;

After that I have set up a server with 10.0.23 (the same version the master is running at the moment) and tried the ALTER manually and it worked.

MariaDB [test]> ALTER TABLE search_documentfield ADD FULLTEXT `key_corpus` (corpus, stemmedCorpus);
Stage: 1 of 1 'altering table'   96.8% of stage done

Query OK, 0 rows affected, 1 warning (8 min 24.21 sec)
Records: 0  Duplicates: 0  Warnings: 1

MariaDB [test]>
MariaDB [test]> show warnings;
+---------+------+--------------------------------------------------+
| Level   | Code | Message                                          |
+---------+------+--------------------------------------------------+
| Warning |  124 | InnoDB rebuilding table to add column FTS_DOC_ID |
+---------+------+--------------------------------------------------+
1 row in set (0.00 sec)

So it looks related to 10.0.28.
Next: I will try 10.0.29.

Server version: 10.1.20-MariaDB-1~trusty mariadb.org binary distribution

I found this bug report https://jira.mariadb.org/browse/MDEV-11233 that looks very similar to the error above.

It says it is now fixed in mariadb 10.1.20 so we may want to update from 10.0 to 10.1.20.

Affects Version/s:
10.1.17, 10.1.18
Fix Version/s:
10.1.20, 10.2.3

:)

In T156373#2973052, @Paladox wrote:

I found this bug report https://jira.mariadb.org/browse/MDEV-11233 that looks very similar to the error above.

It says it is now fixed in mariadb 10.1.20 so we may want to update from 10.0 to 10.1.20.

Affects Version/s:
10.1.17, 10.1.18
Fix Version/s:
10.1.20, 10.2.3

:)

Thanks for that. It looks like it also affects 10.0.28, and I just tried 10.0.29 and it crashed too.
I haven't gone through the whole ticket yet (it is quite long), but I will later.

I am going to upgrade to 10.1.21 (the latest stable release) and see whether it crashes there or is fixed as they say.

The alter table worked on 10.1.21.

So to recap:

10.0.23 -> works
10.0.28 -> crashes
10.0.29 -> crashes
10.1.21 -> works

We need to decide what to do with this. Downgrading the slaves is probably not the best idea, but jumping to 10.1.21 might not be the best thing here either.
We also need to keep in mind that this involves the dbstore servers. Luckily, the ones in eqiad are running 10.0.22; otherwise they would have crashed.
The ones in codfw are running 10.0.28 (dbstore2002 still doesn't replicate m3, so it didn't crash).

From the description, have the phab statistics crons been disabled?

In T156373#2973202, @Marostegui wrote:

The alter table worked on 10.1.21.

So to recap:

10.0.23 -> works
10.0.28 -> crashes
10.0.29 -> crashes
10.1.21 -> works

We need to decide what to do with this. Downgrading the slaves is probably not the best idea, but jumping to 10.1.21 might not be the best thing here either.
We also need to keep in mind that this involves the dbstore servers. Luckily, the ones in eqiad are running 10.0.22; otherwise they would have crashed.
The ones in codfw are running 10.0.28 (dbstore2002 still doesn't replicate m3, so it didn't crash).

Maybe we should find the patch that fixed it in 10.1.20, then backport it to 10.0.

In T156373#2973370, @greg wrote:

From the description, have the phab statistics crons been disabled?

Not that I know of, would you be able to do it for us?

I don't have the rights, I believe @jcrespo did it (with verification from @mmodell) last time we did some maint on the phab dbs.

This was the change, I believe: rOPUPc3e686e142b6: Disable crons using the phabricator db slave due to maintenance

In T156373#2973390, @Paladox wrote:

Maybe we should find the patch that fixed it in 10.1.20, then backport it to 10.0.

That is not easy (and probably not even doable). 10.0 and 10.1 are two different major releases, which means there are things that work in 10.0 but not in 10.1, and vice versa.
Not to mention that this touches InnoDB internals, and the two versions ship different versions of InnoDB as well.
So unfortunately I would discard backporting anything from 10.1 to 10.0: not only is it probably difficult (or impossible), it would also be a pain to maintain for future 10.0 releases.

@Marostegui hi, I was talking about doing it upstream.

Not sure this is helpful, but the phabricator_search.search_documentfield table is just a search index and can be rebuilt (albeit very slowly, and at the cost of search not working very well during the rebuild).

Thus, an alternate pathway to apply this schema change which doesn't require an ALTER TABLE is: drop the table; recreate it without any data; separately rebuild the search index. This might not be the best way forward, but is a possible approach if making the ALTER work is burdensome.

@epriestley we are already doing that in parallel. But if it happens to be the key size, as the bug says, it will not help (it is not the alter, but the structure itself).

We can go back to MyISAM, but that was a disaster scalability-wise.

We can backport the patch, but I have yet to see how cleanly it applies to 10.0.29. Maintaining it is a pain, so we can think about migrating to 10.1 only for m3. Maybe we can convince MariaDB to backport the patch themselves. We need to evaluate the options and discuss them with the service owners, then make a decision.

@jcrespo and @Marostegui I found the patch: https://github.com/MariaDB/server/commit/9199d727598d60e2e56cebaadb74f4fb042cbcd4

I got some merge conflicts when cherry picking to the 10.0 branch

~/server$ git status
On branch 10.0
Your branch is up-to-date with 'origin/10.0'.
You are currently cherry-picking commit 9199d727598.

(fix conflicts and run "git cherry-pick --continue")
(use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:

new file: mysql-test/suite/innodb_fts/r/create.result
new file: mysql-test/suite/innodb_fts/t/create.opt
new file: mysql-test/suite/innodb_fts/t/create.test
modified: storage/innobase/fts/fts0fts.cc
modified: storage/innobase/fts/fts0opt.cc
modified: storage/innobase/handler/ha_innodb.cc
modified: storage/innobase/include/fts0fts.h
modified: storage/xtradb/fts/fts0fts.cc
modified: storage/xtradb/fts/fts0opt.cc
modified: storage/xtradb/handler/ha_innodb.cc
modified: storage/xtradb/include/fts0fts.h

Unmerged paths:

(use "git add <file>..." to mark resolution)

both modified: storage/innobase/row/row0ftsort.cc
both modified: storage/xtradb/row/row0ftsort.cc

@jcrespo Ah, sorry, I hadn't actually read the linked bug.

As another possible alternative: it's well within the realm of feasibility to filter all tokens with length >127 bytes from the search index (at the cost of being unable to search for extremely long tokens, of course). I can provide a migration + patch for this if it's an avenue you want to explore.
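For illustration only, the filtering idea could look roughly like the following. This is a sketch in Python rather than the PHP patch offered above, and it assumes whitespace-separated tokens and a limit measured in UTF-8 bytes; InnoDB's own full-text tokenizer splits on different delimiters, so a real patch would have to match it. The names `MAX_TOKEN_BYTES` and `filter_long_tokens` are hypothetical.

```python
MAX_TOKEN_BYTES = 127  # threshold mentioned in the comment above

def filter_long_tokens(text, max_bytes=MAX_TOKEN_BYTES):
    """Drop whitespace-separated tokens whose UTF-8 encoding exceeds max_bytes."""
    kept = [tok for tok in text.split() if len(tok.encode("utf-8")) <= max_bytes]
    return " ".join(kept)

# "a" * 127 is exactly at the limit and is kept; "b" * 128 and "é" * 64
# (2 bytes per character, 128 bytes in total) are both over it and are dropped.
corpus = " ".join(["fix", "a" * 127, "b" * 128, "é" * 64])
filtered = filter_long_tokens(corpus)
print(len(corpus.split()), "->", len(filtered.split()), "tokens")
```

Measuring in bytes rather than characters matters here because the columns are utf8mb4, where one character can take up to four bytes.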

We wouldn't want to maintain this stuff in the Phabricator upstream, but they'd probably take about 20 minutes to write and test and only require something on the order of 10 lines of patch code in Phabricator until the underlying issue is fixed. Not sure how hard patching MariaDB vs Phabricator is in your environment, though.

Actually, it's very easy to fix merge conflicts.

remove this

-<<<<<<< HEAD
-           /* One variable length column, word with its lenght less than
-           fts_max_token_size, add one extra size and one extra byte.
-
-           Since the max length for FTS token now is larger than 255,
-           so we will need to signify length byte itself, so only 1 to 128
-           bytes can be used for 1 bytes, larger than that 2 bytes. */
-           if (t_str.f_len < 128) {
-                   /* Extra size is one byte. */
-                   cur_len += 2;
-           } else {
-                   /* Extra size is two bytes. */
-                   cur_len += 3;
-           }

from both files.

As usual, @epriestley, thank you so much for the support, but this is almost certainly not a Phab issue, and we should solve it by patching MariaDB or at the admin level (upgrade, etc.). Phab is not to blame here. We are recovering our data right now and will then decide what the best way forward is.

One way you can actually help us, however, is to use your developer position of a highly popular product to pressure MariaDB on that ticket to backport such a fix for 10.0, if I can abuse your time. 0:-) I may file a request for that at MariaDB's bug tracker, but it is getting too late here, will leave it for tomorrow probably.

Thank you again.

https://jira.mariadb.org/browse/MDEV-11918

@jcrespo and @Marostegui I've backported the fix here: https://github.com/MariaDB/server/pull/300, though I have no idea if they will just reject it.

Looking at the 10.0 branch, the last change merged there was 3 days ago, so I presume they will do another release.

In T156373#2973965, @Paladox wrote:

@jcrespo and @Marostegui I've back ported the fix here

Thanks, Paladox, that is helpful. Let me test it tomorrow by building a new package without waiting for their merge, will report back.

OK, you're welcome :)

Paladox moved this task from Backlog to Patch proposed upstream on the Upstream board. Jan 27 2017, 1:59 AM

Wow, so awesome to wake up and see so much progress made on this ticket overnight :-)! Brilliant work @Paladox @epriestley @jcrespo!!
Let's see what MariaDB says about the PR from @Paladox and if they are planning to release more packages for 10.0...

In T156373#2973641, @Paladox wrote:

@Marostegui hi, I was talking about doing it upstream.

yes, sorry I misunderstood you :-)

In T156373#2973994, @jcrespo wrote:

In T156373#2973965, @Paladox wrote:

@jcrespo and @Marostegui I've back ported the fix here

Thanks, Paladox, that is helpful. Let me test it tomorrow by building a new package without waiting for their merge, will report back.

If it works it would be a life-saver as we wouldn't need to upgrade the dbstore boxes to 10.1 (2001/2002) or leave them on 10.0.23 (1001/1002). Regarding db1048 and db2012 we could upgrade them to 10.1 to at least have some slaves for phabricator.

It currently looks like upstream are planning another release since the 10.0 branch has changes from a few days ago.

MariaDB reopened the bug to fix it in 10.0 \o/

https://jira.mariadb.org/browse/MDEV-11233?focusedCommentId=91146&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-91146

@Marostegui I don't think they know about the bug @jcrespo filed.

In T156373#2975451, @Paladox wrote:

@Marostegui I don't think they know about the bug @jcrespo filed.

Just linked it there :)

Looks like they fixed it, and the fix will be shipped with 10.0.30:

Resolution: Fixed
Affects Version/s:
10.0.28, 10.0.29
Fix Version/s:
10.0.30
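The version information collected so far can be summarised in a small lookup. This is only a sketch built from the "Affects Version/s" and "Fix Version/s" fields quoted in this task; it deliberately returns "unknown" for anything the tickets do not list, and the names `AFFECTED`, `FIXED_IN`, and `status` are illustrative.

```python
# Versions as listed in the Jira fields quoted in this task.
AFFECTED = {(10, 0, 28), (10, 0, 29), (10, 1, 17), (10, 1, 18)}
FIXED_IN = {(10, 0): (10, 0, 30), (10, 1): (10, 1, 20), (10, 2): (10, 2, 3)}

def parse_version(text):
    """Turn '10.1.20-MariaDB-1~trusty' or '10.0.29' into (10, 1, 20) / (10, 0, 29)."""
    return tuple(int(part) for part in text.split("-")[0].split(".")[:3])

def status(version_text):
    """Classify a version as 'affected', 'fixed', or 'unknown' per the tickets."""
    version = parse_version(version_text)
    if version in AFFECTED:
        return "affected"
    fixed = FIXED_IN.get(version[:2])
    if fixed is not None and version >= fixed:
        return "fixed"
    return "unknown"

for v in ["10.0.23", "10.0.28", "10.0.29", "10.0.30", "10.1.21"]:
    print(v, status(v))
```

Note that 10.0.23 comes back as "unknown" because the tickets do not list it, even though the manual test earlier in this task showed the ALTER working there.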

@Marostegui this is the fix for 10.0 https://github.com/MariaDB/server/commit/732672c3044e60fb0d1dfdb466bd3c3d13ea2f8d

We will have to build it ourselves if we want to deploy the fix, since I don't know when their next 10.0 release is.

Paladox moved this task from Patch proposed upstream to Patch merged upstream on the Upstream board. Jan 27 2017, 8:40 AM

Yes, let's wait for Jaime to build the package with that to see how it works. We could test it on db2012 if needed.

Ok :)

mmodell awarded a token. Jan 27 2017, 3:51 PM

Mentioned in SAL (#wikimedia-operations) [2017-01-27T16:01:14Z] <jynus> submitted wmf-mariadb10_10.0.29-2 for T156373 fix

+1 :)

MariaDB's regression test does work on the new package:
https://github.com/MariaDB/server/blob/732672c3044e60fb0d1dfdb466bd3c3d13ea2f8d/mysql-test/suite/innodb_fts/t/create.test

m3 main replica is now catching up (may take several hours), we will do a last test on production to verify everything is ok and upgrade all slaves to it if it works.
https://grafana.wikimedia.org/dashboard/db/mysql?var-dc=eqiad%20prometheus%2Fops&var-server=db1048

jcrespo updated the task description. Jan 27 2017, 5:19 PM

:)

Awesome, this is really impressive work everyone. Thanks for helping @Paladox!

You're welcome :)

And of course mad props and thanks to @jcrespo, @Marostegui, and @epriestley

+ Wikilove

In T156373#2976990, @jcrespo wrote:

MariaDB's regression test does work on the new package:
https://github.com/MariaDB/server/blob/732672c3044e60fb0d1dfdb466bd3c3d13ea2f8d/mysql-test/suite/innodb_fts/t/create.test

m3 main replica is now catching up (may take several hours), we will do a last test on production to verify everything is ok and upgrade all slaves to it if it works.
https://grafana.wikimedia.org/dashboard/db/mysql?var-dc=eqiad%20prometheus%2Fops&var-server=db1048

Excellent news! Great job @jcrespo and @Paladox!

The production test seemed OK: 10.0.29-2 is not crashing on db1048 (not sure if it took more than 5 minutes). I will now let it catch up.

I have saved sqldata-precrash datadir on /srv in case we want to test it more (coords db1043-bin.001457:753455353).
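As a sanity check on those coordinates, they can be compared with the ALTER's `end_log_pos` quoted earlier. A minimal sketch, assuming the ALTER excerpt comes from the same binlog file (`db1043-bin.001457`); whether 753455353 is exactly where the ALTER's event group begins is not stated in this task. The helper `parse_coords` is illustrative.

```python
def parse_coords(text):
    """Split 'db1043-bin.001457:753455353' into ('db1043-bin', 1457, 753455353)."""
    file_name, _, position = text.partition(":")
    prefix, _, sequence = file_name.rpartition(".")
    return prefix, int(sequence), int(position)

saved = parse_coords("db1043-bin.001457:753455353")

# end_log_pos of the ALTER event quoted earlier; the file name is an assumption.
alter_end = ("db1043-bin", 1457, 753455998)

# Tuples compare element by element: same prefix, then file sequence, then offset.
print(saved < alter_end)  # True
```

The saved position sorts before the end of the ALTER event, which is consistent with the datadir having been captured before the ALTER was applied.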

In T156373#2977727, @jcrespo wrote:

The production test seemed OK: 10.0.29-2 is not crashing on db1048 (not sure if it took more than 5 minutes). I will now let it catch up.

I have saved sqldata-precrash datadir on /srv in case we want to test it more (coords db1043-bin.001457:753455353).

Very good news! The crash used to take less than 5 minutes in db2012 as per my tests yesterday, so it is looking good.
Moreover, the index is already present so I believe we have overcome the bug :-)

root@MISC m3[phabricator_search]> show create table search_documentfield;
+----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Table                | Create Table
+----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| search_documentfield | CREATE TABLE `search_documentfield` (
  `phid` varbinary(64) NOT NULL,
  `phidType` varchar(4) COLLATE utf8mb4_bin NOT NULL,
  `field` varchar(4) COLLATE utf8mb4_bin NOT NULL,
  `auxPHID` varbinary(64) DEFAULT NULL,
  `corpus` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
  `stemmedCorpus` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
  KEY `phid` (`phid`),
  FULLTEXT KEY `key_corpus` (`corpus`,`stemmedCorpus`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin

Well done!!!

🥂 🍾

:)

I have upgraded db2012 to 10.0.29-2 (actually done a full upgrade) and the ALTER has gone through as well. The server is now trying to catch up.

root@MISC m3[(none)]> show create table phabricator_search.search_documentfield\G
*************************** 1. row ***************************
       Table: search_documentfield
Create Table: CREATE TABLE `search_documentfield` (
  `phid` varbinary(64) NOT NULL,
  `phidType` varchar(4) COLLATE utf8mb4_bin NOT NULL,
  `field` varchar(4) COLLATE utf8mb4_bin NOT NULL,
  `auxPHID` varbinary(64) DEFAULT NULL,
  `corpus` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
  `stemmedCorpus` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
  KEY `phid` (`phid`),
  FULLTEXT KEY `key_corpus` (`corpus`,`stemmedCorpus`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin
1 row in set (0.00 sec)

Mentioned in SAL (#wikimedia-operations) [2017-01-30T09:05:51Z] <marostegui> Start slaves from s1 to s7 on dbstore2001 - T156373

Mentioned in SAL (#wikimedia-operations) [2017-01-30T09:06:18Z] <marostegui> Upgrade db2012 to 10.0.29-2 (this was done couple of hours ago, but for the record) - T156373

Marostegui mentioned this in T156758: Drop m3 from dbstore servers. Jan 31 2017, 9:29 AM

db2012 caught up nicely, so I believe this ticket can be closed.
We can discuss m3 on dbstore2001 (and the dbstore servers in general) in this ticket: T156758

Marostegui closed this task as Resolved. Feb 1 2017, 6:48 AM

Marostegui assigned this task to jcrespo.

Marostegui mentioned this in T156905: Phabricator master and slave crashed. Feb 1 2017, 10:06 AM

Paladox mentioned this in T156939: Setup a private elasticsearch cluster for phabricator. Feb 1 2017, 6:16 PM
