Page MenuHomePhabricator

Contention on search phabricator database creating full phabricator outages
Closed, ResolvedPublic

Description

Starting from ~13h UTC today (26 September), phabricator search started not responding, and later the website responded with error messages of phabricator cannot connect to mysql.

Upon investigation, we found that there was table contention:
https://grafana-admin.wikimedia.org/dashboard/db/mysql?panelId=19&fullscreen&from=1474876675832&to=1474899475832&var-dc=eqiad%20prometheus%2Fops&var-server=db1048

This was caused by lots of queries waiting for full table lock on search_documentfield, such as:

Waiting for table level lock
    Info: SELECT
        document.phid
        FROM `search_document` document
           JOIN `search_documentfield` field ON field.phid = document.phid  JOIN `search_documentr
          WHERE MATCH(corpus) AGAINST ('mediawiki' IN BOOLEAN MODE)
        GROUP BY document.phid
          ORDER BY
          IF(documentType = 'USER', 0, 1) ASC,
          MAX(MATCH(corpus) AGAINST ('mediawiki')) DESC
        LIMIT 188, 1

The cause for the full table lock is because this table was in Aria (~MyISAM) format, in which writes lock the full table, and block reads. This caused search to not work, but after a while, brought phabricator down because pileup of these queries filled up the available connections, affecting all connections:

https://grafana-admin.wikimedia.org/dashboard/db/mysql?panelId=9&fullscreen&from=1474890496586&to=1474898713300&var-dc=eqiad%20prometheus%2Fops&var-server=db1048

The cause is believed to be a large amount of DELETEs done on this table, which will cause issues because you cannot have a large amount of writes on a MyISAM tables (causes contention) because full table locks. This are the queries that we believe caused this issue, according to tendril:

Measures taken:

  • MySQL was restarted- connections took too much time to get killed and we failoverd to a read only replica while maintenace was ongoing
  • The table was converted to InnoDB, including the fulltext index- We are unaware of the consequeces of this for phabricator. We assume that the table was in Aria for a reason; We do not know if that will work; even if it will, it may need tuning:
CREATE TABLE `search_documentfield` (
  `phid` varbinary(64) NOT NULL,
  `phidType` varchar(4) COLLATE utf8mb4_bin NOT NULL,
  `field` varchar(4) COLLATE utf8mb4_bin NOT NULL,
  `auxPHID` varbinary(64) DEFAULT NULL,
  `corpus` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
  KEY `phid` (`phid`),
  FULLTEXT KEY `corpus` (`corpus`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
  • As the table was truncated to be able to be converted; (a backup was taken beforhand), a reindexing process was initiated by chase; that may temporarily altered the workflow?

Followup with me to see how to permanently fix this, but that table had legitimate contention issues and probably search should be either on an external service or using InnoDB, not MyISAM.

Related Objects

StatusAssignedTask
Resolvedmmodell
Resolvedjcrespo
Resolvedmmodell
ResolvedNone
ResolvedAklapper
Resolveddemon
Resolveddemon
Resolvedchasemp
Resolvedchasemp
Declinedchasemp
Declinedchasemp
Resolvedchasemp
ResolvedFriedhelmW
Resolvedmmodell
DeclinedNone
Resolved ksmith
ResolvedNone
Resolvedchasemp
Resolvedmmodell
Resolvedmmodell
Resolvedmmodell
Resolvedmmodell
Resolvedmmodell
Resolvedmmodell
DuplicateNone
Resolvedmmodell
Resolvedmmodell
DeclinedNone
ResolvedMarostegui
Resolvedmmodell

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

I doubt myisam supported that in the first place.

Elitre added a subscriber: Elitre.Sep 27 2016, 7:40 AM

It looks like innodb does not support stemming, among other limitations.

Re Stemming: Historically, T95 (We once switched to ElasticSearch but switched back.)

It seems the switch over to innodb may have caused T146789

Looking online MySQL 5.7 supports full text searches and mariadb 10.0.15+ in innodb, are we using a version less then those?

Does phabricator code need updating to support full text searches in innodb if you use a a specific MySQL or mariadb version or higher?

greg added a subscriber: greg.Sep 27 2016, 6:30 PM

It seems the switch over to innodb may have caused T146789

That's unfounded, the reindex is still on-going and would more likely be the cause of that issue.

Oh, but In innodb depending on what MySQL or mariadb version you have dosent support full text support. It started supporting in MySQL 5.7+ and mariadb 10.0.5+

Oh, but In innodb depending on what MySQL or mariadb version you have dosent support full text support. It started supporting in MySQL 5.7+ and mariadb 10.0.5+

We do have innodb support or it wouldn't be working at all right now.

That's unfounded, the reindex is still on-going and would more likely be the cause of that issue.

Greg is correct, T146789 is now resolved though it took an extra phab search index run with --force (details in that task)

jcrespo added a comment.EditedSep 28 2016, 3:53 PM

Here is the things I would like to be done to close this ticket (all ongoing issues solved):

  • Check that search ok-ish (it is stable, and it will not create further outage-like issues and reindexing has finished correctly) - I belive this is done
  • Check old myisam configuration regarding stop words/max length, etc.- I believe phabricator had special MyISAM configuration that will need tuning for InnoDB fulltext search and potentially, rebuild the table.

Once both things are solved, close this ticket dealing with ongoing issues and only then think if search needs a long term plan regarding engine change (e.g. elastic)

It is said that:

The fulltext search feature in innodb is not as mature as MyISAM fulltext. The ranking algorithm is very simplistic and I expect results to be lower quality than before.

And I may challenge with that, and that it requires proper tuning. I agree, however, that MySQL fulltext support in general (both MyISAM and InnoDB) is very basic, and that T146843 is more than worth persuing (elastic is million times a better solution for search than mysql).

Just let's close this ticker first by making sure there are not more ongoing issues. Sorry I am not much on top of this aside from handling the outage for obvious reasons this week; but please communicate with me asynchronously.

jcrespo updated the task description. (Show Details)Sep 28 2016, 3:53 PM

I would add one extra bullet point to the TODO: evaluate other potentially problematic Aria/MyISAM tables and convert them to InnoDB.

Paladox added a comment.EditedSep 28 2016, 4:02 PM

I would add one extra bullet point to the TODO: evaluate other potentially problematic Aria/MyISAM tables and convert them to InnoDB.

Filled as T146910

@jcrespo: Are there other myisam tables? I was under the impression that this was the only one but I will check to be sure.

As far as I can tell, search is working ok and there doesn't seem to be a risk of further outages.

@mmodell yep, there are other myisam tables.

SELECT TABLE_SCHEMA, TABLE_NAME
FROM   information_schema.TABLES
WHERE  Engine = 'MyISAM';
+-------------------------+--------------------------+
| TABLE_SCHEMA            | TABLE_NAME               | 
+-------------------------+--------------------------+
| phabricator_conpherence | conpherence_index        |
| phabricator_metamta     | metamta_applicationemail |
+-------------------------+--------------------------+

@jcrespo: Those two tables ^ appear to be all that remain on myisam. I can't see any reason why we should not convert them.

@jcrespo: So I think what remains for this task is as follows:

  1. Reduce innodb_ft_min_token_size from 3 to 2
  2. Import stopwords from https://github.com/phacility/phabricator/blob/master/resources/sql/stopwords.txt
  3. Convert the following tables to innodb:
phabricator_conpherence.conpherence_index

phabricator_metamta.metamta_applicationemail

"1. Reduce innodb_ft_min_token_size from 3 to 2"

Done here https://gerrit.wikimedia.org/r/#/c/313235/

Change 313235 had a related patch set uploaded (by Paladox):
phabricator: Reduce innodb_ft_min_token_size from 3 to 1

https://gerrit.wikimedia.org/r/313235

@mmodell Thank you very much for investigating, if today is "phabricator's maintenance window", I would like to do them tonight, as it probably will require to go to read only mode for a few minutes (requires table rebuilding).

That will allow us to close this ticket, having fixed all immediate issues.

@jcrespo and @mmodell I found that innodb full search is more accurate then myisam according to the reasearch I did. So setting the value to one in innodb will allow it to be more reliable then it is currently. Setting the value to 3 will be more unreliable then myisam.

@jcrespo The maintenance window was last night. I would be happy to work with you to do the changes at what ever time is most convenient for you. I don't think it's a big deal if we have a few minutes of read-only outside the normal maintenance window but we can wait for next week if you think it will potentially take a long time.

Paladox moved this task from To Triage to Misc on the Phabricator board.Sep 30 2016, 9:27 PM

Sorry, @mmodell I missed it. This was not a good week. I think it will take more than a few minutes; what I am going to do it to apply the changes on the slave; then failover using the proxy.

I still will need you around because it will require restarting the phabricator service; and of course, if anything else goes wrong. Let's meet at some point this week, not too late on European Timezone.

@jcrespo: Just let me know what time works best for you and I should be able to make myself available.

Mentioned in SAL (#wikimedia-operations) [2016-10-03T22:08:55Z] <jynus> running schema change (innodb conversion) on phabricator db hosts T146673

I have done #3 of T146673#2675083.

The stopwords handling requires a patch to adapt the old MyISAM syntax to the InnoDB one, @Paladox's patch is not enough. Also there is a workaround in place for the loading of the stopwords, maybe there is now a way to avoid it (?). There are references to T89274 that I should stop workaround it or apply again to the new variables. Needs a bit more of time to properly assess and fix.

@jcrespo yeh,but https://gerrit.wikimedia.org/r/313235 should improve things more then there are with innodb, and along with a change to stopwords.

@jcrespo Hi, would you know how we can do the stopwords for innodb, I carn't find any where, where we can set it so it uses the stopwords of our choice?

@jcrespo Hi, I'm wondering would you be able to do the patch that create an table for our stopwords please.

I belive we need to use configs

innodb_ft_enable_stopword

innodb_ft_server_stopword_table

Change 314286 had a related patch set uploaded (by Paladox):
Create a phabricator_stopwords phabricator table in sql (innodb)

https://gerrit.wikimedia.org/r/314286

jcrespo claimed this task.Oct 10 2016, 8:05 AM

Change 314286 merged by Jcrespo:
phabricator: Create & configure a phabricator_stopwords table for innodb

https://gerrit.wikimedia.org/r/314286

Change 313235 abandoned by Paladox:
phabricator: Reduce innodb_ft_min_token_size from 3 to 1

Reason:
https://gerrit.wikimedia.org/r/#/c/315057,edit/

https://gerrit.wikimedia.org/r/313235

Change 313235 restored by Paladox:
phabricator: Reduce innodb_ft_min_token_size from 3 to 1

https://gerrit.wikimedia.org/r/313235

Change 313235 abandoned by Paladox:
phabricator: Reduce innodb_ft_min_token_size from 3 to 1

Reason:
https://gerrit.wikimedia.org/r/#/c/315057/

https://gerrit.wikimedia.org/r/313235

Change 315057 had a related patch set (by Paladox) published:
phabricator: Reduce innodb_ft_min_token_size from 3 to 1

https://gerrit.wikimedia.org/r/315057

So I have applied the previous config settings and get rid of Aria on the slave. Reindexing took only:

Query OK, 1833997 rows affected (3 min 24.94 sec)      
Records: 1833997  Duplicates: 0  Warnings: 0

on db1043 ( the current passive slave):

+---------------------------------+------------------------------------------+
| Variable_name                   | Value                                    |
+---------------------------------+------------------------------------------+
| innodb_ft_aux_table             |                                          |
| innodb_ft_cache_size            | 8000000                                  |
| innodb_ft_enable_diag_print     | OFF                                      |
| innodb_ft_enable_stopword       | ON                                       |
| innodb_ft_max_token_size        | 84                                       |
| innodb_ft_min_token_size        | 3                                        |
| innodb_ft_num_word_optimize     | 2000                                     |
| innodb_ft_result_cache_limit    | 2000000000                               |
| innodb_ft_server_stopword_table | phabricator_search/phabricator_stopwords |
| innodb_ft_sort_pll_degree       | 2                                        |
| innodb_ft_total_cache_size      | 640000000                                |
| innodb_ft_user_stopword_table   |                                          |
+---------------------------------+------------------------------------------+
12 rows in set (0.01 sec)

The difference with those changes should not be excessively great- I would like however to perform a master failover to test the proxy again (last time phabricator wasn't properly puppetized) plus I made some TLS updates that we should apply to the master, too.

The main issue is that InnoDB fulltext does not allow to transparently change OR operators into AND, as we used to do with MyISAM with:

set global ft_boolean_syntax = ' |-><()~*:""&^';

So searches such as "hhvm memory leak" gets converted into "hhvm OR memory OR leak", instead of "hhvm AND memory AND leak". Querying: "+hhvm +memory +leak" solves that, but that should be fixed at phabricator level, not the database.

jcrespo removed jcrespo as the assignee of this task.Oct 10 2016, 10:35 AM
jcrespo added a project: DBA.
jcrespo moved this task from Triage to Blocked external/Not db team on the DBA board.

To clarify, pending tasks:

  • let's schedule a downtime for the master failover, which should only require some seconds (the time to restart phabricator).
  • Once that is done, see how to fix the OR -> AND or if we can fix it at all

Or for https://github.com/wikimedia/phabricator/blob/wmf/stable/src/applications/search/fulltextstorage/PhabricatorMySQLFulltextStorageEngine.php#L168

we could do

$query = clone $query;
$query->setParameter(

'query',
addcslashes(
  $query->getParameter('query'), '+-&|!(){}[]^"~*?:\\'));

$q = $query->getParameter('query');

?

Havent tested this idea but wondering would this be the way?

Per how elasticsearch does it https://github.com/wikimedia/phabricator/blob/f477ca9bc79c06130ea61f0cf66d3bbdb5f58141/src/applications/search/fulltextstorage/PhabricatorElasticFulltextStorageEngine.php#L283

Upstream task about InnoDB support: https://secure.phabricator.com/T11741

jcrespo added a comment.EditedOct 11 2016, 6:49 AM

@mmodell: Some answers, if you can send them upstream (copy source):

  • We need to detect that this capability is present (what command can we run to ask MySQL if it supports InnoDB FULLTEXT?)

if SELECT count(@@global.innodb_ft_max_token_size) AS innodb_fulltext_enabled; returns 1 row, it is enabled, else (the query fails), it is disabled. If you want to check versions, it is enabled by default since MySQL 5.6.4 and MariaDB 10.0.5, but detecting the configuration variables is simpler and more reliable. Both have 2 stable versions supporting this feature as of this writing.

  • InnoDB uses a different stopword mechanism: a stopword table instead of a stopword file. We need to generate this table (just unconditionally?) and configure it or have users configure it.

See how we do it at Wikimedia: https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/manifests/role/mariadb.pp;85887e74bb5abd95de5bec729d3b9e9137a263e7$311 and https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/templates/mariadb/phabricator-stopwords-update.sql.erb and https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/templates/mariadb/phabricator.my.cnf.erb;85887e74bb5abd95de5bec729d3b9e9137a263e7$77

  • Does InnoDB FULLTEXT respect ft_boolean_syntax? If no: is there another similar option? If no: we need to parse queries ourselves and submit the +A +B version to the backend (see also T10642).

Innodb does not respect such an option, and it has not an alternative- it has to add the +signs at application layer. It is documented, but see verified bug: https://bugs.mysql.com/bug.php?id=71551

  • Does InnoDB FULLTEXT respect ft_min_word_len? If no: is there another similar option? If no: maybe we need to filter short words out when indexing/querying?

It does not respect ft_min_word_len, but it uses innodb_ft_min_token_size. See our config at: https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/templates/mariadb/phabricator.my.cnf.erb;85887e74bb5abd95de5bec729d3b9e9137a263e7$79

  • Does swapping the ENGINE on an existing MyISAM table also rebuild the index? Do we need to get users to run search index --all after this change? If so, how do we tell them?

We rebuilt the index, but I believe that if the above changes are done (which requires in some cases a server restart), just a database table rebuilt/optimize (including engine conversion) would work. Whenever the fulltext parameters are changed, the index requires rebuilt (as with myisam).

@jcrespo: Thank you for the thoughtful responses, I've posted them upstream!

@mmodell seems the way search now works is the source of lot of confusion and pain for a lot of people. Jaime proposed to try out Paladox suggestion T146673#2703554 which is to hack the code and push that on our setup, that might workaround the issue while we wait for upstream?

I have of course now idea what needs to be done exactly, much less how to properly test nor how much pain it is to maintain a live hack on our Phabricator instance. But that sounds like worth a shot?

From a conversation with Mukunda, D413 should address it and is deployed on https://phab-01.wmflabs.org/ for testing purposes.

@hashar I will need to change the search from elasticsearch since I was testing that.

Also we need to convert the table to innodb too.

When I'm near a pc I can do this, which won't be till around between 3pm or 4pm bst

Paladox added a comment.EditedOct 13 2016, 12:17 PM

I found an ssh app on the iPhone so I used that.

I have installed this https://downloads.mariadb.org/mariadb/repositories/#mirror=coreix&distro=Ubuntu&distro_release=trusty--ubuntu_trusty&version=10.1

Also I installed mariadb-sever and converted one of the tables to innodb.

I think we may need to reindex which I carnt do since I won't be able to leave my iPhone on.

I've also switched from elasticsearch to innodb to test @mmodell change.

Thanks @Paladox! I'll reindex.

Your welcome, I guess we reindex twice now :)

We should deftly deploy this If this improves things more.

@mmodell are we running reindexing on iridium, since It looks like it works https://phabricator.wikimedia.org/search/query/8bAe1tPxgoKI/#R

T146789

@Paladox: reindexing already happened a while ago.

Oh, so we doint need to reindex with your change?

Right. This just changes the query to always include + before every word. So if you search for foo bar it changes to +foo +bar.. but if you search for foo -bar it'll become +foo -bar

Ok, thankyou.

Change 315718 had a related patch set uploaded (by Jcrespo):
dbproxy: Failover phabricator dbs from db1048 to db1043

https://gerrit.wikimedia.org/r/315718

Change 315718 merged by Jcrespo:
dbproxy: Failover phabricator dbs from db1048 to db1043

https://gerrit.wikimedia.org/r/315718

Change 315732 had a related patch set uploaded (by Jcrespo):
wmnet: Set db1048 as the new s3-slave after failover

https://gerrit.wikimedia.org/r/315732

Change 315732 merged by Jcrespo:
wmnet: Set db1048 as the new s3-slave after failover

https://gerrit.wikimedia.org/r/315732

  • Does InnoDB FULLTEXT respect ft_boolean_syntax? If no: is there another similar option? If no: we need to parse queries ourselves and submit the +A +B version to the backend (see also T10642).

It feels like T10642 is a typo. I tried a couple of variations of the task ID (adding 2,000, adding 100,000), but can't figure out which task was meant.

hashar removed a subscriber: hashar.

Dropping the Release-Engineering-Team tag to limit inbox filling. The task is already in Phabricator and has appropriate peoples CCed. :] Unsubscribing myself.

jcrespo closed this task as Resolved.Oct 14 2016, 8:14 AM
jcrespo claimed this task.

@Paladox The first 2 links are unrelated. The third is a 7 year old code.

After seeing the latest updates from @mmodell, seeing there is a small bug, but not a security concern (although I suggest fixing it), I am going to close this bug (ongoing phabricator issues) as resolved. Improvements on InnoDB search or implementing elastic should go to T146843, this was just to fix ongoing availability problems and following degradations of service.

Looks like mysql has stemming if you do

@Paladox: Please stay on-topic. This task is not about stemming but about outages. Thanks!

Change 315057 abandoned by Paladox:
phabricator: Reduce innodb_ft_min_token_size from 3 to 1

https://gerrit.wikimedia.org/r/315057