I noticed a spike of errors in Logstash complaining that the read timeout was reached for a query on the abuse_filter_log table. It has been happening since 4:00 UTC at a steady rate of ~100 errors per hour. The same slow queries are also reported on tendril; here are the two I found:
SELECT /* IndexPager::buildQueryInfo (AbuseLogPager) */ * FROM `abuse_filter_log` LEFT JOIN `abuse_filter` ON ((CAST( af_id AS BINARY )=afl_filter)) WHERE afl_deleted = '0' ORDER BY afl_timestamp DESC LIMIT 51
SELECT /* ApiQueryAbuseLog::execute */ afl_timestamp, afl_rev_id, afl_deleted, afl_filter, afl_id, afl_user_text, afl_namespace, afl_title, afl_action, afl_actions, af_public_comments FROM `abuse_filter_log` LEFT JOIN `abuse_filter` ON ((CAST( af_id AS BINARY )=afl_filter)) WHERE afl_deleted = '0' ORDER BY afl_timestamp DESC LIMIT 11
They mostly happen on wikidata and commons, but I guess that's just because the abuse_filter_log table is very large there.
Now, I already know this happens because of this recent patch, which added a CAST for Postgres compatibility. I ran an EXPLAIN for both queries, with and without the CAST (sketched below). In both cases the query is not very efficient (it performs a full table scan, I think), but the main difference is:
With CAST: Using where; Using temporary; Using filesort
Without CAST: Using where; Using filesort
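For reference, the comparison I ran was roughly the following; the only difference between the two statements is the join condition, assuming the pre-patch join was a plain af_id = afl_filter equality:
EXPLAIN SELECT * FROM `abuse_filter_log` LEFT JOIN `abuse_filter` ON ((CAST( af_id AS BINARY )=afl_filter)) WHERE afl_deleted = '0' ORDER BY afl_timestamp DESC LIMIT 51;
EXPLAIN SELECT * FROM `abuse_filter_log` LEFT JOIN `abuse_filter` ON ((af_id=afl_filter)) WHERE afl_deleted = '0' ORDER BY afl_timestamp DESC LIMIT 51;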
So apparently the CAST adds a huge overhead because it forces the use of a temporary table.
The fix itself is easy: I'll revert the patch in a minute. However, I'm puzzled by that temporary table appearing out of the blue just because of the CAST. Maybe a DBA could provide some explanation?
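For whoever digs into it: running the same comparison with EXPLAIN FORMAT=JSON should show exactly which step of the plan introduces the temporary table (just a suggestion, I haven't captured that output):
EXPLAIN FORMAT=JSON SELECT * FROM `abuse_filter_log` LEFT JOIN `abuse_filter` ON ((CAST( af_id AS BINARY )=afl_filter)) WHERE afl_deleted = '0' ORDER BY afl_timestamp DESC LIMIT 51;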