
[Timebox: 8 hours] Database reports tool experiencing SQL timeouts
Closed, Resolved (Public, Spike)

Description

The Database Reports tool has been logging a lot of SQL timeouts to its error log. It's not clear whether this started happening, or started happening more often, before or after the actor timeout.

We should probably take this opportunity to add timestamps to the error log so we'll have that information next time.

This is the existing error:

Traceback (most recent call last):
  File "/data/project/database-reports/reports/database-reports/main.py", line 82, in <module>
    main(sys.argv)
  File "/data/project/database-reports/reports/database-reports/main.py", line 25, in main
    method()
  File "/data/project/database-reports/reports/database-reports/main.py", line 46, in pages_with_most_revisions
    self.rep.pages_with_most_revisions()
  File "/mnt/nfs/labstore-secondary-tools-project/database-reports/reports/database-reports/reports.py", line 92, in pages_with_most_revisions
    cur.execute( query )
  File "/mnt/nfs/labstore-secondary-tools-project/database-reports/reports/database-reports/venv/local/lib/python2.7/site-packages/pymysql/cursors.py", line 170, in execute
    result = self._query(query)
  File "/mnt/nfs/labstore-secondary-tools-project/database-reports/reports/database-reports/venv/local/lib/python2.7/site-packages/pymysql/cursors.py", line 328, in _query
    conn.query(q)
  File "/mnt/nfs/labstore-secondary-tools-project/database-reports/reports/database-reports/venv/local/lib/python2.7/site-packages/pymysql/connections.py", line 517, in query
    self._affected_rows = self._read_query_result(unbuffered=unbuffered)
  File "/mnt/nfs/labstore-secondary-tools-project/database-reports/reports/database-reports/venv/local/lib/python2.7/site-packages/pymysql/connections.py", line 732, in _read_query_result
    result.read()
  File "/mnt/nfs/labstore-secondary-tools-project/database-reports/reports/database-reports/venv/local/lib/python2.7/site-packages/pymysql/connections.py", line 1075, in read
    first_packet = self.connection._read_packet()
  File "/mnt/nfs/labstore-secondary-tools-project/database-reports/reports/database-reports/venv/local/lib/python2.7/site-packages/pymysql/connections.py", line 657, in _read_packet
    packet_header = self._read_bytes(4)
  File "/mnt/nfs/labstore-secondary-tools-project/database-reports/reports/database-reports/venv/local/lib/python2.7/site-packages/pymysql/connections.py", line 707, in _read_bytes
    CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")
pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')

Event Timeline

Niharika subscribed.

@MaxSem Can you quantify "a lot"? How often and which report(s)? That will help me prioritise this.

Can't tell, because the logs don't have timestamps, but the error log is pretty large.

aezell renamed this task from Database reports SQL timeouts to Database reports tool experiencing SQL timeouts. (Jul 9 2019, 11:20 PM)
aezell updated the task description.
Niharika renamed this task from Database reports tool experiencing SQL timeouts to [Timebox: 8 hours] Database reports tool experiencing SQL timeouts. (Jul 9 2019, 11:21 PM)
Niharika changed the subtype of this task from "Task" to "Spike".

The query that is timing out is at https://github.com/wikimedia/database-reports/blob/a901b301a29652c39a20eab971990ba87df479ba/reports.py#L102-L106

To simplify, let's take out the JOIN on page (which isn't slowing it down):

SELECT COUNT(*) AS revisions, rev_page
FROM revision
GROUP BY rev_page
ORDER BY revisions DESC
LIMIT 1000

The query plan is pretty awful. I'm not sure how you could break this out into separate queries.

It's the GROUP BY that forces the query plan into a filesort, which is the problem, as the optimizer tool tells us. I think COUNT(*) AS revisions is materialized as a temporary column, and the ORDER BY sorts on that column. As far as I know, an index can't help with that, even though the query does apparently make use of the rev_page index.

We can eliminate that filesort by using ORDER BY NULL instead of ORDER BY revisions DESC, and sort the result set in code instead. But then, with that LIMIT in place, we risk not getting all the data we want.
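A minimal sketch of that idea (the query is the simplified one above; the sorting helper and the sample rows are hypothetical, not the tool's actual code):

```python
# ORDER BY NULL suppresses the sort entirely, so MySQL never has to
# filesort the aggregated rows.
QUERY = """
SELECT COUNT(*) AS revisions, rev_page
FROM revision
GROUP BY rev_page
ORDER BY NULL
"""

def top_pages(rows, limit=1000):
    """Sort (revisions, rev_page) rows client-side, largest first."""
    return sorted(rows, key=lambda row: row[0], reverse=True)[:limit]

# Stand-in for cur.fetchall() on the real replica connection:
rows = [(3, 101), (12, 202), (7, 303)]
print(top_pages(rows, limit=2))  # [(12, 202), (7, 303)]
```

The catch is that without the server-side ORDER BY, the client has to fetch one grouped row per page before it can pick the top 1000, which is its own scalability problem on a table this size.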

In testing this, the implicit ordering from the GROUP BY isn't enough, because the result set is too large even at LIMIT 1000. A LIMIT 10000 takes nearly a minute to complete on the replicas, and it still doesn't reliably return all of the rows at the top end.

In short, I don't know of a way to avoid the filesort for this kind of query with the current schema. To answer this query we have to scan all the rows in the table, and even with an index that's just a lot of rows: over 700 million, according to the EXPLAIN.

I'm going to go ahead and move to Needs Review. I think T225774#5319960 and T225774#5320840 summarize the issue. Basically there doesn't appear to be a workable solution. This query was always really slow, and apparently we're at the point where there are so many revisions that it just won't finish. I suggest we retire the "Pages with most revisions" database report.

Who is the primary audience for these reports? Where do they end up?

Also, you know it's trouble when I'm making proclamations about the EXPLAIN :)

Pages with the Most Revisions is the report that fails on occasion. We will leave it in, because it does work sometimes and some people get value from it.

> Who is the primary audience for these reports? Where do they end up?

Most of them end up here: https://en.wikipedia.org/wiki/Wikipedia:Database_reports, such as https://en.wikipedia.org/wiki/Wikipedia:Database_reports/Forgotten_articles.
Not all of these are ours.
The bot also runs on a few other language wikis, posting similar lists there.
The primary audience is community members who work off these reports as backlogs, improving or fixing pages as needed.