Page MenuHomePhabricator

[Timebox: 8 hours] Database reports tool experiencing SQL timeouts
Open, Needs TriagePublicSpike

Description

The Database Reports tool has been spitting a lot of SQL timeouts into its error log. It's not clear if this started happening more before or after the actor timeout.

We should probably take this opportunity to add timestamps to the error log so we'll have that information next time.

This is the existing error:

Traceback (most recent call last):
  File "/data/project/database-reports/reports/database-reports/main.py", line 82, in <module>
    main(sys.argv)
  File "/data/project/database-reports/reports/database-reports/main.py", line 25, in main
    method()
  File "/data/project/database-reports/reports/database-reports/main.py", line 46, in pages_with_most_revisions
    self.rep.pages_with_most_revisions()
  File "/mnt/nfs/labstore-secondary-tools-project/database-reports/reports/database-reports/reports.py", line 92, in pages_with_most_revisions
    cur.execute( query )
  File "/mnt/nfs/labstore-secondary-tools-project/database-reports/reports/database-reports/venv/local/lib/python2.7/site-packages/pymysql/cursors.py", line 170, in execute
    result = self._query(query)
  File "/mnt/nfs/labstore-secondary-tools-project/database-reports/reports/database-reports/venv/local/lib/python2.7/site-packages/pymysql/cursors.py", line 328, in _query
    conn.query(q)
  File "/mnt/nfs/labstore-secondary-tools-project/database-reports/reports/database-reports/venv/local/lib/python2.7/site-packages/pymysql/connections.py", line 517, in query
    self._affected_rows = self._read_query_result(unbuffered=unbuffered)
  File "/mnt/nfs/labstore-secondary-tools-project/database-reports/reports/database-reports/venv/local/lib/python2.7/site-packages/pymysql/connections.py", line 732, in _read_query_result
    result.read()
  File "/mnt/nfs/labstore-secondary-tools-project/database-reports/reports/database-reports/venv/local/lib/python2.7/site-packages/pymysql/connections.py", line 1075, in read
    first_packet = self.connection._read_packet()
  File "/mnt/nfs/labstore-secondary-tools-project/database-reports/reports/database-reports/venv/local/lib/python2.7/site-packages/pymysql/connections.py", line 657, in _read_packet
    packet_header = self._read_bytes(4)
  File "/mnt/nfs/labstore-secondary-tools-project/database-reports/reports/database-reports/venv/local/lib/python2.7/site-packages/pymysql/connections.py", line 707, in _read_bytes
    CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")
pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')

Event Timeline

MaxSem created this task.Jun 14 2019, 2:45 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJun 14 2019, 2:45 AM
Niharika added a subscriber: Niharika.

@MaxSem Can you quantify "a lot"? How often and which report(s)? That will help me prioritise this.

Can't tell because logs don't have timestamps, but the error log is pretty large.

aezell renamed this task from Database reports SQL timeouts to Database reports tool experiencing SQL timeouts.Jul 9 2019, 11:20 PM
aezell updated the task description. (Show Details)
Niharika renamed this task from Database reports tool experiencing SQL timeouts to [Timebox: 8 hours] Database reports tool experiencing SQL timeouts.Jul 9 2019, 11:21 PM
Niharika changed the subtype of this task from "Task" to "Spike".

The query that is timing out is at https://github.com/wikimedia/database-reports/blob/master/reports.py#L87-L91

To simplify, let's take out the JOIN on page (which isn't slowing it down):

SELECT COUNT(*) AS revisions, rev_page
FROM revision
GROUP BY rev_page
ORDER BY revisions DESC
LIMIT 1000

The query plan is pretty awful. I'm not sure how you could break this out into separate queries.

aezell added a subscriber: aezell.Jul 10 2019, 1:20 PM

It's the GROUP BY that causes the query plan to do a filesort which is the problem as the optimizer tool tells us. I think the COUNT(*) AS revisions makes this use a temporary column and the GROUP BY is on that column. As far as I know, an index can't help that problem though the query is making use of the rev_page index apparently.

We can eliminate that filesort by using an ORDER BY NULL instead of ORDER BY revisions DESC. We could sort the resultset in code at that point. But, then, we're faced with possibly not getting all the data that we want as long as that LIMIT is there.

In testing this, the implied sorting by the GROUP BY isn't enough because the resultset is too large even at the LIMIT 1000. A LIMIT 10000 takes nearly a minute to complete on the replicas and it still doesn't reliably give you all the rows at the top end.

In short, I don't know that there's a way to avoid the filesort for this kind of query with the current schema. To answer this query, we have to scan all the rows in the table and even with an index, it's just a lot of rows at over 700 million according to the EXPLAIN.