Description:
I am trying to run a query to get the number of pages a certain module is transcluded in from all the wikis from respective databases. The query I am using is this:
SELECT page_id, COUNT(tl_from) AS transcluded_in FROM page INNER JOIN templatelinks ON page_title=tl_title AND page_namespace=828 AND page_content_model='Scribunto' AND tl_namespace=828 GROUP BY page_id
I ran a script in Grid from toolforge to do this. This query works fine for all wikis except enwiki, which gives OperationalError(2013, 'Lost connection to MySQL server during query') error. I used LIMIT 500 OFFSET x to fetch 500 rows at a time due to memory constraints. I got similar MySQL connection errors for another query which got solved by reducing the LIMIT value. But for this particular query in enwiki database, using LIMIT as low as 2 does not seem to work.
To test, I ran the same script in PAWS. Although a little slow, it seems to work just fine and it is executing this query for enwiki with LIMIT 500.
Steps to Reproduce python script:
- Login to toolforge
- Clone repo (https://github.com/wikimedia/abstract-wikipedia-data-science/tree/db-info-fetcher)
- Create a file named missed_db_info.txt in the cloned folder and write the line get_templatelinks_info enwiki in it.
- Run the python script (.py file in GitHub) in Grid like this jsub -N test-run python3 fetch_db_info.py -gm
- Wait. It takes 1-2 hours to finish (with an error)
Steps to Reproduce in PAWS:
Run this notebook : enwiki-transclusions notebook
The only difference in these two codes is the exception handling procedures. I had to be more extensive with catching errors in Grid as it runs as a script.
Actual Result:
Getting MySQL connection lost, even after multiple tries.
Expected Result:
Expected same or better speed from running in Grid from toolforge.
Solution:
Use analytics cluster to connect to database replicas.