Page MenuHomePhabricator

Mysql connection lost during query from toolforge
Closed, ResolvedPublicBUG REPORT

Description

Description:

I am trying to run a query to get the number of pages a certain module is transcluded in from all the wikis from respective databases. The query I am using is this:

SELECT page_id, COUNT(tl_from) AS transcluded_in
FROM page
INNER JOIN templatelinks
    ON page_title=tl_title 
    AND page_namespace=828 
    AND page_content_model='Scribunto' 
    AND tl_namespace=828 
GROUP BY page_id

I ran a script in Grid from toolforge to do this. This query works fine for all wikis except enwiki, which gives OperationalError(2013, 'Lost connection to MySQL server during query') error. I used LIMIT 500 OFFSET x to fetch 500 rows at a time due to memory constraints. I got similar MySQL connection errors for another query which got solved by reducing the LIMIT value. But for this particular query in enwiki database, using LIMIT as low as 2 does not seem to work.

To test, I ran the same script in PAWS. Although a little slow, it seems to work just fine and it is executing this query for enwiki with LIMIT 500.

Steps to Reproduce python script:

  1. Login to toolforge
  2. Clone repo (https://github.com/wikimedia/abstract-wikipedia-data-science/tree/db-info-fetcher)
  3. Create a file named missed_db_info.txt in the cloned folder and write the line get_templatelinks_info enwiki in it.
  4. Run the python script (.py file in GitHub) in Grid like this jsub -N test-run python3 fetch_db_info.py -gm
  5. Wait. It takes 1-2 hours to finish (with an error)

Steps to Reproduce in PAWS:

Run this notebook : enwiki-transclusions notebook
The only difference in these two codes is the exception handling procedures. I had to be more extensive with catching errors in Grid as it runs as a script.

Actual Result:

Getting MySQL connection lost, even after multiple tries.

Expected Result:

Expected same or better speed from running in Grid from toolforge.

Solution:

Use analytics cluster to connect to database replicas.

Event Timeline

@dr0ptp4kt @LostEnchanter
Hi, I described as much as possible, let me know if more information could be useful in debugging this situation.

tanny411 updated the task description. (Show Details)
tanny411 updated the task description. (Show Details)
tanny411 set Final Story Points to 0.
tanny411 set Final Story Points to .

I've run this script from my local PC using ssh tunnels. I tried different variations:

  1. Connecting through ssh to meta database and connecting to enwiki database;
  2. Using pandas to fetch the result and using basic pymysql cursor.fetchall()
  3. Using LIMIT 500 and LIMIT 2 OFFSET 100

In all cases and variations code seems to be slow, but work nethertheless. So looks like it's really something related to Grid execution.

[16:39]  <    bd808> tanny411: 'Lost connection to MySQL server during query' generally means that the "query killer" running on the wiki replica servers themselves are closing your connection for taking too long.
[16:40]  <    bd808> A "LIMIT n" is not going to help with a query like this that contains a "group by" clause. All the limit does in that case is truncate the returned results, but the full result set must be computed by the database server first.
[16:41]  <    bd808> tanny411: one thing to check is which wiki replica server class you are querying. The "analytics" hostnames are the ones you need to use for long queries.

Thanks @bd808 that might be it. We used analytics when we connected locally.

@LostEnchanter the toolforge library connects to web by default, I guess I have to change that. Not sure why it's working from PAWS though, as its the same library.

Thanks all, this issue is now resolved!

tanny411 updated the task description. (Show Details)
tanny411 updated the task description. (Show Details)