AFComputedVariable::compute query timeouts
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	aaron
	Oct 25 2015, 11:50 PM

Description

Lots of these in the logs at https://logstash.wikimedia.org/#/dashboard/elasticsearch/wfLogDBError:

AFComputedVariable::compute	10.64.32.21	2013	Lost connection to MySQL server during query (10.64.32.21)	SELECT rev_user_text FROM (SELECT  rev_user_text,rev_timestamp  FROM `revision`   WHERE rev_page = '8972734' AND (rev_timestamp < '20151025233104')  ORDER BY rev_timestamp DESC LIMIT 100  ) AS tmp GROUP BY rev_user_text ORDER BY MAX(rev_timestamp) DESC LIMIT 10

Seems to affect the enwiki API DBs (not sure if that is page/traffic pattern or something with those DBs). The query plan on those looks similar to other servers *except* for the index "key length" used and ref vs range scans. The ones with less errors say 4 (vs 20 for the API ones).

> aaron@terbium:~$ mwscript sql.php --wiki=enwiki --slave=db1065
> EXPLAIN SELECT rev_user_text FROM (SELECT  rev_user_text,rev_timestamp  FROM `revision`   WHERE rev_page = '36395484' AND (rev_timestamp < '20151025072035')  ORDER BY rev_timestamp DESC LIMIT 100  ) AS tmp GROUP BY rev_user_text ORDER BY MAX(rev_timestamp) DESC LIMIT 10;
stdClass Object
(
    [id] => 1
    [select_type] => PRIMARY
    [table] => <derived2>
    [type] => ALL
    [possible_keys] => 
    [key] => 
    [key_len] => 
    [ref] => 
    [rows] => 100
    [Extra] => Using temporary; Using filesort
)
stdClass Object
(
    [id] => 2
    [select_type] => DERIVED
    [table] => revision
    [type] => range
    [possible_keys] => rev_page_id,rev_timestamp,page_timestamp
    [key] => page_timestamp
    [key_len] => 20
    [ref] => 
    [rows] => 173152
    [Extra] => Using where
)

> aaron@terbium:~$ mwscript sql.php --wiki=enwiki --slave=db1073
> EXPLAIN SELECT rev_user_text FROM (SELECT  rev_user_text,rev_timestamp  FROM `revision`   WHERE rev_page = '36395484' AND (rev_timestamp < '20151025072035')  ORDER BY rev_timestamp DESC LIMIT 100  ) AS tmp GROUP BY rev_user_text ORDER BY MAX(rev_timestamp) DESC LIMIT 10;
stdClass Object
(
    [id] => 1
    [select_type] => PRIMARY
    [table] => <derived2>
    [type] => ALL
    [possible_keys] => 
    [key] => 
    [key_len] => 
    [ref] => 
    [rows] => 100
    [Extra] => Using temporary; Using filesort
)
stdClass Object
(
    [id] => 2
    [select_type] => DERIVED
    [table] => revision
    [type] => ref
    [possible_keys] => rev_page_id,rev_timestamp,page_timestamp
    [key] => page_timestamp
    [key_len] => 4
    [ref] => const
    [rows] => 189680
    [Extra] => Using where
)

Details

Subject	Repo	Branch	Lines +/-
MariaDB: Repool db2062 after maintenanace	operations/mediawiki-config	master	+2 -2
Enble puppet on db2062- a one-time experiment for mariadb 10.1	operations/puppet	production	+15 -3
Add FORCE INDEX to AFComputedVariable::{closure}	mediawiki/extensions/AbuseFilter	master	+5 -1
MariaDB: Depool db2062 database server for maintenance	operations/mediawiki-config	master	+2 -2
Compute last-recent-authors result during edit stashing	mediawiki/extensions/AbuseFilter	wmf/1.27.0-wmf.10	+55 -18
Add ParserOutputStashForEdit hook for extension cache warming	mediawiki/core	wmf/1.27.0-wmf.11	+9 -0
Add ParserOutputStashForEdit hook for extension cache warming	mediawiki/core	master	+9 -0
Compute last-recent-authors result during edit stashing	mediawiki/extensions/AbuseFilter	wmf/1.27.0-wmf.11	+55 -18
Compute last-recent-authors result during edit stashing	mediawiki/extensions/AbuseFilter	master	+55 -18
Simplify "load-recent-authors" query	mediawiki/extensions/AbuseFilter	master	+11 -13
Avoid pointless range scan for 'load-recent-authors'	mediawiki/extensions/AbuseFilter	wmf/1.27.0-wmf.3	+1 -7
Avoid pointless range scan for 'load-recent-authors'	mediawiki/extensions/AbuseFilter	master	+1 -7

Related Objects
Search...

Status	Assigned	Task
Resolved	akosiaris	T199073 Perform a datacenter switchover (2018-19 Q1)
Resolved	None	T163495 Mediawiki revision-related queries are failing with high rate for enwiki on codfw
Resolved	Krinkle	T116557 AFComputedVariable::compute query timeouts

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 248989 had a related patch set uploaded (by Aaron Schulz):
Avoid pointless range scan for 'load-recent-authors'

https://gerrit.wikimedia.org/r/248989

gerritbot added a project: Patch-For-Review.Oct 26 2015, 8:34 PM

Change 249032 had a related patch set uploaded (by Ori.livneh):
Avoid pointless range scan for 'load-recent-authors'

https://gerrit.wikimedia.org/r/249032

Change 248989 merged by jenkins-bot:
Avoid pointless range scan for 'load-recent-authors'

https://gerrit.wikimedia.org/r/248989

Change 249032 merged by jenkins-bot:
Avoid pointless range scan for 'load-recent-authors'

https://gerrit.wikimedia.org/r/249032

ReleaseTaggerBot added projects: MW-1.27-release (WMF-deploy-2015-10-27_(1.27.0-wmf.4)), MW-1.27-release (WMF-deploy-2015-10-13_(1.27.0-wmf.3)).Oct 27 2015, 1:00 AM

• jcrespo subscribed.Oct 27 2015, 8:51 PM

• jcrespo moved this task from Triage to Backlog on the DBA board.Oct 27 2015, 9:07 PM

I've just activated profiling on db1065, it will take some time to have proper results, but the query has already been detected:

MariaDB PRODUCTION s1 localhost sys > SELECT * FROM sys.statement_analysis WHERE digest = 'd993cce8555e342b2fcbf039c54d6ec3'\G
*************************** 1. row ***************************
            query: SELECT rev_user_text FROM ( SE ...  rev_timestamp ) DESC LIMIT ? 
               db: enwiki
        full_scan: *
       exec_count: 714
        err_count: 0
       warn_count: 0
    total_latency: 12.62 s
      max_latency: 313.92 ms
      avg_latency: 17.68 ms
     lock_latency: 48.14 ms
        rows_sent: 2163
    rows_sent_avg: 3
    rows_examined: 0
rows_examined_avg: 0
    rows_affected: 0
rows_affected_avg: 0
       tmp_tables: 1428
  tmp_disk_tables: 0
      rows_sorted: 3208
sort_merge_passes: 0
           digest: d993cce8555e342b2fcbf039c54d6ec3
       first_seen: 2015-10-29 10:41:48
        last_seen: 2015-10-29 11:20:51
1 row in set (0.01 sec)

So there are several problems here:

A temporary table is created
A filesort process is needed
160K+ rows are read (it could be more for other queries, and those could be on disk). Even if it is using an index, that is a lot of rows.

MariaDB PRODUCTION s1 localhost sys > use enwiki; FLUSH STATUS; SELECT rev_user_text FROM (SELECT rev_user_text,rev_timestamp FROM `revision` WHERE rev_page = '11238105' ORDER BY rev_timestamp DESC LIMIT 100 ) AS tmp GROUP BY rev_user_text ORDER BY MAX(rev_timestamp) DESC LIMIT 10; SHOW STATUS like 'Hand%';
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
Query OK, 0 rows affected (0.00 sec)

+---------------+
| rev_user_text |
+---------------+
| DeltaQuadBot  |
| Ymblanter     |
| Diannaa       |
| The Anome     |
+---------------+
4 rows in set (2.38 sec)

+----------------------------+--------+
| Variable_name              | Value  |
+----------------------------+--------+
| Handler_commit             | 1      |
| Handler_delete             | 0      |
| Handler_discover           | 0      |
| Handler_external_lock      | 0      |
| Handler_icp_attempts       | 0      |
| Handler_icp_match          | 0      |
| Handler_mrr_init           | 0      |
| Handler_mrr_key_refills    | 0      |
| Handler_mrr_rowid_refills  | 0      |
| Handler_prepare            | 0      |
| Handler_read_first         | 0      |
| Handler_read_key           | 101    |
| Handler_read_last          | 0      |
| Handler_read_next          | 163186 |
| Handler_read_prev          | 0      |
| Handler_read_rnd           | 4      |
| Handler_read_rnd_deleted   | 0      |
| Handler_read_rnd_next      | 106    |
| Handler_rollback           | 0      |
| Handler_savepoint          | 0      |
| Handler_savepoint_rollback | 0      |
| Handler_tmp_update         | 96     |
| Handler_tmp_write          | 104    |
| Handler_update             | 0      |
| Handler_write              | 0      |
+----------------------------+--------+
25 rows in set (0.00 sec)

MariaDB PRODUCTION s1 localhost enwiki > SHOW STATUS like 'sort%';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| Sort_merge_passes         | 0     |
| Sort_priority_queue_sorts | 1     |
| Sort_range                | 1     |
| Sort_rows                 | 104   |
| Sort_scan                 | 1     |
+---------------------------+-------+
5 rows in set (0.00 sec)

MariaDB PRODUCTION s1 localhost enwiki > SHOW STATUS like '%tmp%';
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| Created_tmp_disk_tables | 0     |
| Created_tmp_files       | 27    |
| Created_tmp_tables      | 2     |
| Handler_tmp_update      | 96    |
| Handler_tmp_write       | 104   |
| Rows_tmp_read           | 216   |
+-------------------------+-------+
6 rows in set (0.00 sec)

I have to tweak the profiling to catch the 4 minute queries:

max_latency: 4.71 m

The reason why they are not 4 minutes while executed ad-hoc is that they have already been cached on memory, and is performance increases a lot in that case.

Aside from rewriting the query, maybe the partitioning on the recentchanges slaves may help? Please compare the query stats on those servers (db1051 and db1055). Those are special slaves with slightly different schema to help certain slow queries:

+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 0     |
| Handler_icp_match          | 0     |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 123   |
| Handler_read_last          | 0     |
| Handler_read_next          | 0     |
| Handler_read_prev          | 99    |
| Handler_read_rnd           | 4     |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 106   |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 96    |
| Handler_tmp_write          | 104   |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+
25 rows in set (0.00 sec)

• jcrespo moved this task from Backlog to In progress on the DBA board.Oct 29 2015, 12:21 PM

To make it explicit, my current recommendation is to perform a small test that executes this query on a 'recentchanges' group slave and confirm that this effectively fixes the timeouts and the slow writes.

If the test is positive, maybe we can create a special group for this query or deploy partitioning to more slaves.

aaron moved this task from To-do: Goals prioritized current Quarter to Doing (old) on the Performance-Team board.Nov 2 2015, 7:48 PM

• jcrespo moved this task from In progress to Backlog on the DBA board.Nov 5 2015, 5:16 PM

aaron moved this task from Doing (old) to To-do: Goals prioritized current Quarter on the Performance-Team board.Dec 1 2015, 11:47 AM

Last day of errors (58) is only db1065 and db1066 (evenly).

We could experiment with a query group, though this seems like the kind of boring query that should be able to run everywhere quickly.

Change 260634 had a related patch set uploaded (by Aaron Schulz):
Simplify "load-recent-authors" query

https://gerrit.wikimedia.org/r/260634

Change 260634 merged by jenkins-bot:
Simplify "load-recent-authors" query

https://gerrit.wikimedia.org/r/260634

ReleaseTaggerBot added a project: MW-1.27-release (WMF-deploy-2016-01-12_(1.27.0-wmf.10)).Dec 23 2015, 8:00 PM

aaron lowered the priority of this task from High to Medium.Jan 15 2016, 10:46 PM

Change 266644 had a related patch set uploaded (by Aaron Schulz):
Add ParserOutputStashForEdit hook for extension cache warming

https://gerrit.wikimedia.org/r/266644

Change 266646 had a related patch set uploaded (by Aaron Schulz):
Compute last-recent-authors result during edit stashing

https://gerrit.wikimedia.org/r/266646

aaron moved this task from To-do: Goals prioritized current Quarter to Blocked (old) on the Performance-Team board.Jan 28 2016, 7:55 PM

Change 267137 had a related patch set uploaded (by Ori.livneh):
Compute last-recent-authors result during edit stashing

https://gerrit.wikimedia.org/r/267137

Change 267139 had a related patch set uploaded (by Ori.livneh):
Compute last-recent-authors result during edit stashing

https://gerrit.wikimedia.org/r/267139

Change 266646 merged by jenkins-bot:
Compute last-recent-authors result during edit stashing

https://gerrit.wikimedia.org/r/266646

Change 267137 merged by jenkins-bot:
Compute last-recent-authors result during edit stashing

https://gerrit.wikimedia.org/r/267137

ReleaseTaggerBot added projects: MW-1.27-release (WMF-deploy-2016-01-19_(1.27.0-wmf.11)), MW-1.27-release (WMF-deploy-2016-02-02_(1.27.0-wmf.12)).Jan 28 2016, 10:00 PM

Change 267177 had a related patch set uploaded (by Ori.livneh):
Add ParserOutputStashForEdit hook for extension cache warming

https://gerrit.wikimedia.org/r/267177

Change 266644 merged by jenkins-bot:
Add ParserOutputStashForEdit hook for extension cache warming

https://gerrit.wikimedia.org/r/266644

Change 267177 merged by Ori.livneh:
Add ParserOutputStashForEdit hook for extension cache warming

https://gerrit.wikimedia.org/r/267177

ReleaseTaggerBot added a project: MW-1.27-release-notes.Jan 29 2016, 12:00 AM

Change 267139 abandoned by Krinkle:
Compute last-recent-authors result during edit stashing

https://gerrit.wikimedia.org/r/267139

Only seeing 13 in the last day, so this is less common.

matej_suchanek removed a project: Patch-For-Review.Apr 12 2016, 8:08 PM

Krinkle moved this task from Blocked (old) to To-do: Goals prioritized current Quarter on the Performance-Team board.Aug 5 2016, 1:24 AM

https://logstash.wikimedia.org/goto/430429f471afeb24a32f58c659f57cf1 (mediawiki-errors AFComputedVariable)

Screen Shot 2016-12-02 at 22.59.27.png (247×924 px, 39 KB)

Between 50-100 per day. @aaron So, the underlying problem is that the query is too slow? Or is it caused by something else?

• Gilles moved this task from To-do: Goals prioritized current Quarter to Backlog: Maintenance, non-prioritized on the Performance-Team board.Dec 7 2016, 5:30 PM

https://logstash.wikimedia.org/goto/9b00a481c72faee815a075b28043af4b.
Still upto 100 exceptions per day from this code path for database connection lost during query.

AFComputedVariable::{closure}	errno=2013	Lost connection to MySQL server during query
SELECT rev_user_text FROM revision WHERE rev_page = %  ORDER BY rev_timestamp DESC LIMIT 100

@aaron So, the underlying problem is that the query is too slow? Or is it caused by something else?

I think this got worse on failover- the query planer gets confused between page_user_timestamp and page_timestamp, they are too similar indexes, but the wrong one causes a fileshort (and even worse performance when a page has many revisions).

Normally running ANALYZE fixes it, but we are low on resources on codfw, and it takes right now 3 days to run it (being a blocking statement). We should either put an IGNORE statement there to force the plan or send it to an rc slave, as I suggested earlier. We can do proper fixes later, but this needs a workaround quickly: https://logstash.wikimedia.org/goto/59d22f141508612b4e4bb32857406e0a

• jcrespo added a parent task: T163351: codfw API slaves overloaded during the 2017-04-19 codfw switch.Apr 19 2017, 7:05 PM

Krinkle removed projects: MW-1.27-release-notes, MW-1.27-release (WMF-deploy-2016-02-02_(1.27.0-wmf.12)), MW-1.27-release (WMF-deploy-2016-01-19_(1.27.0-wmf.11)), MW-1.27-release (WMF-deploy-2016-01-12_(1.27.0-wmf.10)), MW-1.27-release (WMF-deploy-2015-10-13_(1.27.0-wmf.3)), MW-1.27-release (WMF-deploy-2015-10-27_(1.27.0-wmf.4)).Apr 19 2017, 7:29 PM

Krinkle moved this task from Backlog: Maintenance, non-prioritized to Inbox, needs triage on the Performance-Team board.

• jcrespo edited parent tasks, added: T163495: Mediawiki revision-related queries are failing with high rate for enwiki on codfw; removed: T163351: codfw API slaves overloaded during the 2017-04-19 codfw switch.Apr 21 2017, 7:48 AM

This is interesting:

root@db2070.codfw.wmnet[enwiki]> EXPLAIN SELECT /* AFComputedVariable::{closure}  */  rev_user_text  FROM `revision` WHERE rev_page = '34681'  ORDER BY rev_timestamp DESC LIMIT 100\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: revision
         type: ref
possible_keys: page_timestamp,page_user_timestamp,rev_page_id
          key: page_timestamp
      key_len: 4
          ref: const
         rows: 5088
        Extra: Using where
1 row in set (0.04 sec)

root@db2070.codfw.wmnet[enwiki]> EXPLAIN SELECT /* AFComputedVariable::{closure}  */  rev_user_text  FROM `revision` WHERE rev_page = '1772647'  ORDER BY rev_timestamp DESC LIMIT 100\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: revision
         type: ref
possible_keys: page_timestamp,page_user_timestamp,rev_page_id
          key: page_user_timestamp
      key_len: 4
          ref: const
         rows: 985
        Extra: Using where; Using filesort
1 row in set (0.04 sec)

root@db2070.codfw.wmnet[enwiki]> pager cat > /dev/null; SELECT /* AFComputedVariable::{closure}  */  rev_user_text  FROM `revision` WHERE rev_page = '34681'  ORDER BY rev_timestamp DESC LIMIT 100;
PAGER set to 'cat > /dev/null'
100 rows in set (0.05 sec)

root@db2070.codfw.wmnet[enwiki]> SELECT /* AFComputedVariable::{closure}  */  rev_user_text  FROM `revision` WHERE rev_page = '1772647'  ORDER BY rev_timestamp DESC LIMIT 100;
100 rows in set (0.10 sec)

So, I say we force the index or send it to the special slaves- this is not a good or long term solution, but something we can do now to minimize issue while we investigate if it is a problem with the mariadb version or our config, or we need better calculated index statistics.

Sorry, I need help, I do not know where this query come from on the code (I cannot find it).

Change 349388 had a related patch set uploaded (by Jcrespo):
[operations/mediawiki-config@master] Depool db2055 for maintenance

https://gerrit.wikimedia.org/r/349388

gerritbot added a project: Patch-For-Review.Apr 21 2017, 8:28 AM

Change 349388 merged by jenkins-bot:
[operations/mediawiki-config@master] MariaDB: Depool db2062 database server for maintenance

https://gerrit.wikimedia.org/r/349388

Mentioned in SAL (#wikimedia-operations) [2017-04-21T09:10:20Z] <jynus> stopping and upgrading/reconfiguring db2062 (depooled) T116557

This is madness:

root@db2062[enwiki]> EXPLAIN SELECT /* AFComputedVariable::{closure}  */  rev_user_text  FROM `revision` WHERE rev_page = '34681'  ORDER BY rev_timestamp DESC LIMIT 100\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: revision
         type: ref
possible_keys: page_timestamp,page_user_timestamp,rev_page_id
          key: page_timestamp
      key_len: 4
          ref: const
         rows: 5097
        Extra: Using where
1 row in set (0.09 sec)

root@db2062[enwiki]> EXPLAIN SELECT /* AFComputedVariable::{closure}  */  rev_user_text  FROM `revision` WHERE rev_page = '1772647'  ORDER BY rev_timestamp DESC LIMIT 100\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: revision
         type: ref
possible_keys: page_timestamp,page_user_timestamp,rev_page_id
          key: page_timestamp
      key_len: 4
          ref: const
         rows: 989
        Extra: Using where
1 row in set (0.06 sec)

root@db2062.codfw.wmnet[enwiki]> EXPLAIN SELECT /* AFComputedVariable::{closure}  */  rev_user_text  FROM `revision`    WHERE rev_page = '1743794'   ORDER BY rev_timestamp DESC LIMIT 100\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: revision
         type: ref
possible_keys: page_timestamp,page_user_timestamp,rev_page_id
          key: page_user_timestamp
      key_len: 4
          ref: const
         rows: 128600
        Extra: Using where; Using filesort
1 row in set (0.03 sec)

Mentioned in SAL (#wikimedia-operations) [2017-04-21T12:07:57Z] <marostegui> Analyze revision, logging and page table on s1 db1080 - T116557

Mentioned in SAL (#wikimedia-operations) [2017-04-21T14:32:40Z] <marostegui> Analyze revision, logging and page table on s1 db1067 - https://phabricator.wikimedia.org/T116557

root@db2062[enwiki]> ANALYZE TABLE revision;                                                             
+-----------------+---------+----------+-----------------------------------------+                       
| Table           | Op      | Msg_type | Msg_text                                |                       
+-----------------+---------+----------+-----------------------------------------+                       
| enwiki.revision | analyze | status   | Engine-independent statistics collected |                       
| enwiki.revision | analyze | status   | OK                                      |                       
+-----------------+---------+----------+-----------------------------------------+                       
2 rows in set (1 day 30 min 44.40 sec)

However:

root@db2062[enwiki]> EXPLAIN SELECT /* AFComputedVariable::{closure}  */  rev_user_text  FROM `revision`    WHERE rev_page = '1743794'   ORDER BY rev_timestamp DESC LIMIT 100\G
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    133531
Current database: enwiki

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: revision
         type: ref
possible_keys: page_timestamp,page_user_timestamp,rev_page_id
          key: page_user_timestamp
      key_len: 4
          ref: const
         rows: 128600
        Extra: Using where; Using filesort
1 row in set (0.05 sec)

So this now 100% repeatable -probably an optimizer bug, and probably https://jira.mariadb.org/browse/MDEV-6402 or https://jira.mariadb.org/browse/MDEV-6657 (Sadly, only fixed on 10.1). I am going to confirm it.

Mentioned in SAL (#wikimedia-operations) [2017-04-23T17:49:47Z] <jynus> disabling puppet on db2062 and upgrading MariaDB package to 10.1 T116557

Nope, it doesn't work either:

root@db2062[enwiki]> SELECT version();
+-----------------+
| version()       |
+-----------------+
| 10.1.22-MariaDB |
+-----------------+
1 row in set (0.00 sec)

root@db2062[enwiki]> EXPLAIN SELECT /* AFComputedVariable::{closure}  */  rev_user_text  FROM `revision`    WHERE rev_page = '1743794'   ORDER BY rev_timestamp DESC LIMIT 100\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: revision
         type: ref
possible_keys: page_timestamp,page_user_timestamp,rev_page_id
          key: page_user_timestamp
      key_len: 4
          ref: const
         rows: 128600
        Extra: Using where; Using filesort
1 row in set (0.13 sec)

CC @Anomie because I think he was hoping 10.1 would solve this or very similar issues, but it doesn't seem to be the case. :-/

In T116557#3204426, @jcrespo wrote:

root@db2062[enwiki]> ANALYZE TABLE revision;                                                             
+-----------------+---------+----------+-----------------------------------------+                       
| Table           | Op      | Msg_type | Msg_text                                |                       
+-----------------+---------+----------+-----------------------------------------+                       
| enwiki.revision | analyze | status   | Engine-independent statistics collected |                       
| enwiki.revision | analyze | status   | OK                                      |                       
+-----------------+---------+----------+-----------------------------------------+                       
2 rows in set (1 day 30 min 44.40 sec)

However:

root@db2062[enwiki]> EXPLAIN SELECT /* AFComputedVariable::{closure}  */  rev_user_text  FROM `revision`    WHERE rev_page = '1743794'   ORDER BY rev_timestamp DESC LIMIT 100\G
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    133531
Current database: enwiki

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: revision
         type: ref
possible_keys: page_timestamp,page_user_timestamp,rev_page_id
          key: page_user_timestamp
      key_len: 4
          ref: const
         rows: 128600
        Extra: Using where; Using filesort
1 row in set (0.05 sec)

And for some reason db1080 and db1067 do work after the analyze, they do not filesort anymore:

root@db1080[enwiki]> EXPLAIN SELECT /* AFComputedVariable::{closure}  */  rev_user_text  FROM `revision`    WHERE rev_page = '17437194'   ORDER BY rev_timestamp DESC LIMIT 100\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: revision
         type: ref
possible_keys: page_timestamp,page_user_timestamp,rev_page_id
          key: page_timestamp
      key_len: 4
          ref: const
         rows: 2
        Extra: Using where
1 row in set (0.00 sec)

root@db1067[enwiki]> EXPLAIN SELECT /* AFComputedVariable::{closure}  */  rev_user_text  FROM `revision`    WHERE rev_page = '17437194'   ORDER BY rev_timestamp DESC LIMIT 100\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: revision
         type: ref
possible_keys: rev_page_id,page_timestamp,page_user_timestamp
          key: page_timestamp
      key_len: 4
          ref: const
         rows: 2
        Extra: Using where
1 row in set (0.01 sec)

I don't see that:

root@db1080[enwiki]> EXPLAIN SELECT /* AFComputedVariable::{closure}  */  rev_user_text  FROM `revision`    WHERE rev_page = '1743794'   ORDER BY rev_timestamp DESC LIMIT 100\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: revision
         type: ref
possible_keys: page_timestamp,page_user_timestamp,rev_page_id
          key: page_user_timestamp
      key_len: 4
          ref: const
         rows: 137288
        Extra: Using where; Using filesort

That is _so strange_ same query, just some time later and now it filesorts?!!!

It is not the same query:

rev_page = '1743794'

rev_page= '17437194'

root@db2062[enwiki]> SELECT count(*) FROM revision WHERE rev_page = '1743794';
+----------+
| count(*) |
+----------+
|    77861 |
+----------+
1 row in set (0.03 sec)

root@db2062[enwiki]> SELECT count(*) FROM revision WHERE rev_page= '17437194';
+----------+
| count(*) |
+----------+
|        2 |
+----------+
1 row in set (0.02 sec)

It makes sense, using only the first part of the index is considered more advantageous even if a filesort has to be done:

+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 0     |
| Handler_icp_match          | 0     |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 1     |
| Handler_read_last          | 0     |
| Handler_read_next          | 77861 |
| Handler_read_prev          | 0     |
| Handler_read_retry         | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 0     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 0     |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+
26 rows in set (0.00 sec)

+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 0     |
| Handler_icp_match          | 0     |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 1     |
| Handler_read_last          | 0     |
| Handler_read_next          | 0     |
| Handler_read_prev          | 99    |
| Handler_read_retry         | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 0     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 0     |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+
26 rows in set (0.00 sec)

The LIMIT starts doing that for LIMIT 23 -> LIMIT 24.

In T116557#3204495, @jcrespo wrote:

CC @Anomie because I think he was hoping 10.1 would solve this or very similar issues, but it doesn't seem to be the case. :-/

Let's add a force- If someone tells me where that code is, I can add it.

In T116557#3206859, @jcrespo wrote:

Let's add a force- If someone tells me where that code is, I can add it.

Looks like it's here.

Thank you- seeing the code, I think there may be a change for a more elegant solution later- as the query is not really what it is wanted, it was already a compromise. I will just add the FORCE now and we can open later a separate ticket for alternative solutions.

https://gerrit.wikimedia.org/r/350006

In T116557#3206807, @Anomie wrote:

:(

Don't worry, we will do more tests with 10.2 and Oracle's MySQL.

Change 350006 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] Add FORCE INDEX to AFComputedVariable::{closure}

https://gerrit.wikimedia.org/r/350006

ReleaseTaggerBot added a project: MW-1.29-release (WMF-deploy-2017-04-25_(1.29.0-wmf.21)).Apr 24 2017, 9:00 PM

The fix was merged and is going out in this weeks branch. Should start applying to Wikipedia sometime tomorrow.

Looking at the last 7 days and 24 hours in log stash, this error is now gone as of 21:00 UTC today - when the patch rolled out to all wikis as part of 1.29.0-wmf.21.

https://logstash.wikimedia.org/goto/98a6a55853f5ce1976d239515a11f222

Screen Shot 2017-04-27 at 20.06.42.png (434×2 px, 78 KB)

Krinkle moved this task from Inbox, needs triage to Doing (old) on the Performance-Team board.Apr 28 2017, 3:08 AM

Indeed:

AFComputerVariable.png (726×2 px, 93 KB)

Change 353015 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] Enble puppet on db2062- a one-time experiment for mariadb 10.1

https://gerrit.wikimedia.org/r/353015

Change 353015 merged by Jcrespo:
[operations/puppet@production] Enble puppet on db2062- a one-time experiment for mariadb 10.1

https://gerrit.wikimedia.org/r/353015

Change 354092 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] MariaDB: Repool db2062 after maintenanace

https://gerrit.wikimedia.org/r/354092

Change 354092 merged by jenkins-bot:
[operations/mediawiki-config@master] MariaDB: Repool db2062 after maintenanace

https://gerrit.wikimedia.org/r/354092

Mentioned in SAL (#wikimedia-operations) [2017-05-18T06:05:43Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Repool db2062 - T116557 (duration: 00m 39s)

Daimona mentioned this in T71367: page_recent_contributors leaks revdeleted user names (CVE-2021-31545).Sep 18 2019, 5:12 PM

Aklapper removed a subscriber: Anomie.Oct 16 2020, 5:41 PM

Daimona mentioned this in T272050: Move LazyVariableComputer::getLastPageAuthors to MW core.Jan 14 2021, 4:16 PM

Daimona mentioned this in T281579: Check whether `FORCE INDEX page_timestamp` is still needed in LazyVariableComputer.php.Apr 30 2021, 2:25 PM