
Read timeout reached while viewing AbuseLog
Closed, Resolved · Public

Description

I noticed a spike of errors in Logstash, complaining about the read timeout being reached by a query on the abuse_filter_log table. It's been happening since 4:00 UTC, at a constant rate of ~100 errors per hour. The slow queries are also reported on Tendril, and here are the two that I found:

SELECT /* IndexPager::buildQueryInfo (AbuseLogPager) */ * FROM `abuse_filter_log` LEFT JOIN `abuse_filter` ON ((CAST( af_id AS BINARY )=afl_filter)) WHERE afl_deleted = '0' ORDER BY afl_timestamp DESC LIMIT 51
SELECT /* ApiQueryAbuseLog::execute */ afl_timestamp, afl_rev_id, afl_deleted, afl_filter, afl_id, afl_user_text, afl_namespace, afl_title, afl_action, afl_actions, af_public_comments FROM `abuse_filter_log` LEFT JOIN `abuse_filter` ON ((CAST( af_id AS BINARY )=afl_filter)) WHERE afl_deleted = '0' ORDER BY afl_timestamp DESC LIMIT 11

They mostly happen on wikidata and commons, though I guess that's just because the abuse_filter_log table is very big there.

Now, I already know this happens because of a recent patch, which added a CAST for Postgres compatibility. I ran an EXPLAIN for both queries, with and without the CAST. In both cases the query is not very efficient (it performs a full table scan, I think), but the main difference is:

With CAST: Using where; Using temporary; Using filesort
Without CAST: Using where; Using filesort

So apparently the CAST adds a huge cost, due to the need for a temporary table.

The fix itself is easy, and I'll revert the patch in a minute. However, I'm puzzled that a temporary table appears out of the blue merely because of the CAST. Maybe a DBA could provide some explanation?
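(For the record, the regression is easy to reproduce outside production. The sketch below uses SQLite's EXPLAIN QUERY PLAN as a stand-in for MariaDB's EXPLAIN, with empty toy copies of the two tables; the plan wording differs, but the index-vs-scan behavior matches the production EXPLAINs.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE abuse_filter (af_id INTEGER PRIMARY KEY, af_public_comments TEXT);
    CREATE TABLE abuse_filter_log (afl_id INTEGER PRIMARY KEY, afl_filter TEXT,
                                   afl_timestamp TEXT, afl_deleted INTEGER);
    CREATE INDEX afl_timestamp ON abuse_filter_log (afl_timestamp);
""")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail); keep the detail.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# With the CAST: the join cannot use abuse_filter's primary key.
with_cast = plan("""
    SELECT * FROM abuse_filter_log
    LEFT JOIN abuse_filter ON CAST(af_id AS TEXT) = afl_filter
    WHERE afl_deleted = 0 ORDER BY afl_timestamp DESC LIMIT 51
""")

# Without it: every outer row becomes a primary-key lookup.
without_cast = plan("""
    SELECT * FROM abuse_filter_log
    LEFT JOIN abuse_filter ON af_id = afl_filter
    WHERE afl_deleted = 0 ORDER BY afl_timestamp DESC LIMIT 51
""")

print(with_cast)
print(without_cast)
```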

Event Timeline

Daimona created this task. · Apr 18 2019, 11:49 AM
Restricted Application added a subscriber: Aklapper. · Apr 18 2019, 11:49 AM
Daimona triaged this task as Unbreak Now! priority. · Apr 18 2019, 11:49 AM

Train blockers are UBN!, and this is a train blocker because it's impossible to open Special:AbuseLog or even to use the API (plus, a hotfix is coming).

Restricted Application added subscribers: Liuxinyu970226, TerraCodes. · Apr 18 2019, 11:50 AM

Change 504862 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] Revert "Use string cast for Postgres compatibility"

https://gerrit.wikimedia.org/r/504862

Change 504863 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@wmf/1.34.0-wmf.1] Revert "Use string cast for Postgres compatibility"

https://gerrit.wikimedia.org/r/504863

Daimona claimed this task. · Apr 18 2019, 11:51 AM

> Maybe a DBA could provide some explanation?

Not a DBA, but perhaps: because of the LEFT JOIN, the database starts scanning abuse_filter_log and then tries to look up a row in abuse_filter. However, since the primary key is cast, the index on it cannot be used, so the next possibility is to create a temporary table. If it were the other way around, i.e. if afl_filter were cast, I suppose it would work.
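That theory can be sanity-checked with a hypothetical sketch (again using SQLite's query planner as a stand-in, with empty toy copies of the tables): casting afl_filter instead leaves af_id bare, so the expression only involves the outer table and the primary-key lookup survives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE abuse_filter (af_id INTEGER PRIMARY KEY);
    CREATE TABLE abuse_filter_log (afl_filter TEXT);
""")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail); keep the detail.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Cast on the primary key: no index is usable, the inner table is scanned.
cast_on_pk = plan("""SELECT * FROM abuse_filter_log
                     LEFT JOIN abuse_filter ON CAST(af_id AS TEXT) = afl_filter""")

# Cast on the log column instead: the cast expression only involves the outer
# table, so each outer row still turns into a primary-key lookup.
cast_on_log = plan("""SELECT * FROM abuse_filter_log
                      LEFT JOIN abuse_filter ON af_id = CAST(afl_filter AS INTEGER)""")

print(cast_on_pk)
print(cast_on_log)
```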

Change 504862 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] Revert "Use string cast for Postgres compatibility"

https://gerrit.wikimedia.org/r/504862

@matej_suchanek Ah, so the index isn't used because of the cast? That's interesting; I'll do some experiments. However, what mostly confused me is that the DB engine performs the cast implicitly all the same, and I wouldn't expect a performance difference. Probably there's some difference in how that cast is performed. I'll follow up after analyzing the EXPLAIN more thoroughly.

P.S. I didn't mention times in the task description, but they vary from nearly instantaneous without the cast to 60-80 seconds with it.

Change 504863 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@wmf/1.34.0-wmf.1] Revert "Use string cast for Postgres compatibility"

https://gerrit.wikimedia.org/r/504863


My goodness! My EXPLAIN analysis was really, really shallow. No index is used with the cast!

With cast:

+-----+-------------+------------------+------+---------------+------+---------+------+---------+-------------------------------------------------+
| id  | select_type | table            | type | possible_keys | key  | key_len | ref  | rows    | Extra                                           |
+-----+-------------+------------------+------+---------------+------+---------+------+---------+-------------------------------------------------+
| 1.1 | SIMPLE      | abuse_filter_log | ALL  | NULL          | NULL | NULL    | NULL | 2853029 | Using where; Using temporary; Using filesort    |
| 1.2 | SIMPLE      | abuse_filter     | ALL  | NULL          | NULL | NULL    | NULL |     118 | Using where; Using join buffer (flat, BNL join) |
+-----+-------------+------------------+------+---------------+------+---------+------+---------+-------------------------------------------------+

Without cast:

+-----+-------------+------------------+--------+---------------+---------------+---------+------------------------------------------+---------+-----------------------------+
| id  | select_type | table            | type   | possible_keys | key           | key_len | ref                                      | rows    | Extra                       |
+-----+-------------+------------------+--------+---------------+---------------+---------+------------------------------------------+---------+-----------------------------+
| 1.1 | SIMPLE      | abuse_filter_log | index  | NULL          | afl_timestamp | 14      | NULL                                     | 2853029 | Using where; Using filesort |
| 1.2 | SIMPLE      | abuse_filter     | eq_ref | PRIMARY       | PRIMARY       | 8       | wikidatawiki.abuse_filter_log.afl_filter |       1 | Using where                 |
+-----+-------------+------------------+--------+---------------+---------------+---------+------------------------------------------+---------+-----------------------------+

(sorry for the bad format)

Quarry and the SQL optimizer don't let me use FORCE INDEX, so I can't say whether that would help, but indexes are definitely necessary for this query.

Synchronized php-1.34.0-wmf.1/extensions/AbuseFilter/includes/: sync https://gerrit.wikimedia.org/r/c/mediawiki/extensions/AbuseFilter/+/504863 (duration: 01m 00s)

mmodell lowered the priority of this task from Unbreak Now! to Normal. · Apr 18 2019, 5:22 PM

No longer blocking the train, thanks.

Thanks @mmodell! Now we have plenty of time to figure out how to get this working. However, I fear that the only way to keep the cast is to FORCE the index (and I still don't know whether that would work). Other than that, I guess we're left with keeping the Postgres incompatibility for now and jumping straight to T220791. That would also normalize the schema a bit (so that Postgres won't be the only DBMS to benefit from it), but it won't be quick to do.

@Daimona: I don't know how important PostgreSQL support is to you; however, the normalized schema does sound like the better long-term solution. AbuseFilter is a common source of database-related timeouts for sure, so any improvements to the schema would be welcome, I think.

@Daimona if you let me know the query you'd like to try with the FORCE, I can run it on a core host for you.
Also let me know which wiki would be good for that test.

@mmodell Well, I guess we should always try to keep our code compatible across DBMSes, and Postgres compatibility is mandatory for including AbuseFilter in the tarball. That being said, there's no hurry to make it work now, given that it's been broken for 10 years or so... Based on Logstash and Tendril, it doesn't seem like AbuseFilter is causing too many timeouts, aside from a couple of not-so-common queries on abuse_filter_log. Normalization is always great, although, as with every schema change, this one will take some time to be merged and deployed.

@Marostegui thanks! You can try the first query in the task description, adding a FORCE INDEX for afl_timestamp on abuse_filter_log. Ideally on a large wiki, for instance commons, wikidata, or enwiki. I tried an online SQL editor, but it didn't help, and I don't have my gear with me to test it properly.

Marostegui moved this task from Triage to In progress on the DBA board. · Apr 21 2019, 6:31 AM

Unfortunately, it is not really making any difference; the optimizer still thinks it is better to do a full scan (I guess the CAST is the culprit here):

SELECT  * FROM `abuse_filter_log` FORCE INDEX(afl_timestamp) LEFT JOIN `abuse_filter` ON ((CAST( af_id AS BINARY )=afl_filter)) WHERE afl_deleted = '0' ORDER BY afl_timestamp DESC LIMIT 51


root@db1089.eqiad.wmnet[(none)]> show explain for 6822134393;
+------+-------------+------------------+------+---------------+------+---------+------+----------+-------------------------------------------------+
| id   | select_type | table            | type | possible_keys | key  | key_len | ref  | rows     | Extra                                           |
+------+-------------+------------------+------+---------------+------+---------+------+----------+-------------------------------------------------+
|    1 | SIMPLE      | abuse_filter_log | ALL  | NULL          | NULL | NULL    | NULL | 23070264 | Using where; Using temporary; Using filesort    |
|    1 | SIMPLE      | abuse_filter     | ALL  | NULL          | NULL | NULL    | NULL |      900 | Using where; Using join buffer (flat, BNL join) |
+------+-------------+------------------+------+---------------+------+---------+------+----------+-------------------------------------------------+
2 rows in set, 1 warning (0.00 sec)

I ran an ANALYZE on the table to refresh all the stats, but nothing changed.

@Marostegui Thanks! What if a FORCE INDEX (af_id) is added to the join with abuse_filter? Is it still ignored? If so, that's bad, and it means we'll really have to go straight to the schema change. Apparently the engine either isn't casting af_id to a string implicitly, or is doing it differently, or it's a bug in the optimizer; I sincerely don't know.

There is no difference. I don't think it is an optimizer bug; it is an issue with CAST.
I personally don't have much experience with casting, but from the MySQL manual:

> If you convert an indexed column using BINARY, CAST(), or CONVERT(), MySQL may not be able to use the index efficiently.

> There is no difference, I don't think it is an optimizer bug, it is an issue with CAST.
> I personally don't have much experience with casting, but from the mysql manual:
>
> If you convert an indexed column using BINARY, CAST(), or CONVERT(), MySQL may not be able to use the index efficiently.

This makes sense. But then, isn't MySQL implicitly casting af_id when it has to compare a bigint with a varchar? Either it's not casting the column (which sounds weird), or it's doing it in such a way that the index can still be used.
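That distinction, converting the compared value versus converting the column, can be seen in isolation. Another hypothetical sketch using SQLite (MySQL's numeric/string comparison rules differ in detail, but the optimizer's choice matches the EXPLAINs above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE abuse_filter (af_id INTEGER PRIMARY KEY)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail); keep the detail.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Implicit conversion of the *value*: '12' is coerced once, per lookup,
# so the primary-key index is still usable.
implicit = plan("SELECT * FROM abuse_filter WHERE af_id = '12'")

# Explicit CAST of the *column*: every stored af_id must be converted
# before comparing, so the index is useless and the table is scanned.
explicit = plan("SELECT * FROM abuse_filter WHERE CAST(af_id AS TEXT) = '12'")

print(implicit)
print(explicit)
```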

mmodell changed the subtype of this task from "Task" to "Production Error". · Aug 28 2019, 11:07 PM
Daimona closed this task as Resolved. · Sep 13 2019, 5:39 PM

The long-term goal is T220791; the immediate issue was resolved, hence closing.