
PAWS crash: User has exceeded the 'max_user_connections' resource (current value: 10)
Open, Needs Triage · Public · BUG REPORT

Description

Steps to reproduce:

What happens?:

  • When I run the last line, I get the following error:
OperationalError: (1226, "User 'p71826986' has exceeded the 'max_user_connections' resource (current value: 10)")

What should have happened instead?:

I wish I could keep running SQL queries.

Software version (skip for WMF-hosted wikis like Wikipedia):

PAWS / Jupyter Notebook

Other information (browser name/version, screenshots, etc.):

Browser: Chrome
Perhaps related to https://phabricator.wikimedia.org/T325501

Event Timeline

This is not so much a PAWS issue as a connection-limit issue on the replicas. I'm not quite sure what pymysql and Pandas are doing there, but it seems connections are leaking beyond each iteration of

def query_wikis(query, df):
    wikis = df.database_code.unique().tolist()
    combined_result = pd.DataFrame()
    for wiki in wikis:
        result = pd_query(query, wiki)
        result['wiki_db'] = wiki
        combined_result = pd.concat([combined_result, result])
    return combined_result

You might have some luck by explicitly closing the connection after getting the results of each wiki.

Running your code, I realized the for loop only runs once, so even if it is leaking connections, that probably is not your main issue. But since max_user_connections is per user, the connections might be coming from another notebook or process.
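
If you can still open a single connection, one way to check what is holding them might be SHOW PROCESSLIST (an unprivileged user only sees their own threads). This is just a sketch, assuming the connect(wiki) helper from your notebook returns a PyMySQL-style connection:

# List the connections/threads currently held by your replica user
connection = connect('cswiki')  # any wiki you can reach works for this check
try:
    with connection.cursor() as cursor:
        cursor.execute('SHOW PROCESSLIST')
        rows = cursor.fetchall()
        print(f'{len(rows)} threads open for this user')
        for row in rows:
            print(row)
finally:
    connection.close()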

Thank you @Chicocvenancio for reading my question and adding your ideas! Yeah, I just ran the code with another Wiki account I have and it went well, so I'm guessing every user gets a limited number of SQL runs?


Every user gets a limited number of open connections. Once the connections are closed, you can make more queries.

Oh! How does one close connections? Sorry @Chicocvenancio, I'm a super newbie!

Should I change my function as follows?

def pd_query(query, wiki):
    warnings.filterwarnings('ignore')
    connection = connect(wiki)
    result = decode_df(pd.read_sql(query, connection))
    connection.close()  # Closing the connection
    return result
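
A variant that also releases the connection when the query itself fails might look like this (just a sketch; connect() and decode_df() are the same helpers as above):

def pd_query(query, wiki):
    warnings.filterwarnings('ignore')
    connection = connect(wiki)
    try:
        return decode_df(pd.read_sql(query, connection))
    finally:
        # Close even when read_sql raises, so the connection never lingers
        # and counts against max_user_connections
        connection.close()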


There is a limit on concurrent connections to the database servers. Generally, SQL engines keep their connections in a pool (keeping them open even when they're not actually needed for querying), which causes issues like the one you're facing.

In the Wikimedia environment, you theoretically only need at most ~10 connections to any servers, as there are 10 database sections that serve all our projects. In fact, if your goal is to query all wikis (or most of them), you also need at most ~10 individual queries (which can often speed your script up). Unfortunately, there are currently no Wikimedia-specific libraries that work with those assumptions. That said, you can do the following for a global query:

  1. Determine which wiki is served by which database section. This can be done using the data at https://noc.wikimedia.org/conf/dblists/s1.dblist (replace s1 with s1-s8).
  2. Build queries along the following example: SELECT 'cswiki' AS wiki, page_id, page_namespace, page_title FROM cswiki_p.page UNION SELECT 'nlwiki' AS wiki, page_id, page_namespace, page_title FROM nlwiki_p.page, one per database section. The trick is that if you connect to one wiki, all other wikis in the same database section are also available for cross-database querying (in this case, a UNION SELECT). Since both nlwiki and cswiki are in s2, they can be queried via one query. This step will give you at most 8 queries, one per database section.
  3. Run all of the queries you created in step 2, each with its own connection. You can connect via any wiki from the respective section, as long as you always fully qualify the database in your actual queries (see the sketch after this list).
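
A rough sketch of those three steps (hypothetical helper names; it reuses the connect(wiki) helper from the notebook above and assumes the requests library is available for fetching the dblists):

import requests
import pandas as pd

def wikis_by_section(sections=range(1, 9)):
    # Map each database section (s1-s8) to the wiki databases it serves,
    # using the dblist files: one database name per line, '#' starts a comment
    mapping = {}
    for n in sections:
        url = f'https://noc.wikimedia.org/conf/dblists/s{n}.dblist'
        lines = requests.get(url).text.splitlines()
        mapping[f's{n}'] = [line.strip() for line in lines
                            if line.strip() and not line.startswith('#')]
    return mapping

def build_union_query(wikis, per_wiki_sql):
    # Combine one SELECT per wiki into a single query for the whole section;
    # per_wiki_sql must contain a {wiki} placeholder
    return '\nUNION ALL\n'.join(per_wiki_sql.format(wiki=wiki) for wiki in wikis)

# Example: count main-namespace, non-redirect pages on every wiki, 8 queries in total
per_wiki_sql = ("SELECT '{wiki}' AS wiki, COUNT(*) AS pages "
                "FROM {wiki}_p.page WHERE page_namespace = 0 AND page_is_redirect = 0")
frames = []
for section, wikis in wikis_by_section().items():
    connection = connect(wikis[0])  # any wiki in the section will do
    try:
        frames.append(pd.read_sql(build_union_query(wikis, per_wiki_sql), connection))
    finally:
        connection.close()
combined = pd.concat(frames)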

An example of this approach can be found at https://public-paws.wmcloud.org/User:Martin_Urbanec/Archive/query-spamblacklist.ipynb, which queries the logging table on all of our wikis.

Alternatively, you might want to try my wmpaws library (which I'm yet to document on Wikitech), which deals with the connection issue by simply closing the connection explicitly once a query finishes. It comes at a performance cost (there are cases when a connection is dropped only to be re-established a millisecond later), but it helps avoid the max_connections limit. The library is installed in PAWS by default; an example is below:

import wmpaws

wmpaws.run_sql('''
SELECT COUNT(*)
FROM page
WHERE
    page_namespace = 0
AND page_is_redirect = 0
''', 'cswiki')

run_sql returns a Pandas dataframe, which is convenient when you're already working with that framework (as you are).
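
For instance, your original query_wikis loop could be rewritten on top of it; this is a sketch that uses only the run_sql(query, wiki) call shown above:

import pandas as pd
import wmpaws

def query_wikis(query, df):
    # run_sql opens and closes its own connection for each call,
    # so connections never accumulate across wikis
    frames = []
    for wiki in df.database_code.unique().tolist():
        result = wmpaws.run_sql(query, wiki)
        result['wiki_db'] = wiki
        frames.append(result)
    return pd.concat(frames)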

To resuscitate your environment under your main account, you need to shut down all processes that are currently running under it, as @Chicocvenancio notes. The easiest way to do so is to go to File -> Hub Control Panel, then click Stop My Server (which will stop the container, including everything you started in it). Then log in to PAWS again and you should have a fresh environment (your files are preserved, of course) with zero established database connections.

Hope this helps!