Page MenuHomePhabricator

Database error: Unable to connect to s7.web.db.svc.eqiad.wmflabs
Closed, ResolvedPublic

Description

https://tools.wmflabs.org/guc/?user=Framawiki shows

Error: Database error: Unable to connect to s7.web.db.svc.eqiad.wmflabs

Event Timeline

Framawiki created this task.

Is there any clue if this is a DNS issue or a connection limit issue?

My trivial test works:

$ mysql --defaults-file=$HOME/replica.my.cnf -h s7.web.db.svc.eqiad.wmflabs
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 3972596
Server version: 10.1.29-MariaDB MariaDB Server

Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

(u3518@s7.web.db.svc.eqiad.wmflabs) [(none)]>

Looks solved... An intermittent problem ?

In the last couple of days I regularly encountered this error with varying hosts (s4.web..., s5.web..., ...). Reloading the page sometimes works, sometimes it throws the same error again.

This might be related to this task which made changes to the server selection: https://phabricator.wikimedia.org/T176686

My naive guess is that GUC is hitting the per client simultaneous connection limits that are enforced on the new Wiki Replica cluster. The default limit for each username is 10 simultaneous connections. If GUC hits all 7 slices to lookup contribs then the limit of 10 would get chewed up pretty quickly. A ticket (maybe even this one) can be opened to request a higher limit for the tool's user with a brief explanation of why it is needed.

Getting the same issues repeatedly now throughout the day.

So far I've seen centralauth.web, s1.web and s2.web at different times failing to connect.

Krinkle added projects: Data-Services, Toolforge.

@bd808 Hereby requesting to raise the connection limit on *.web.db.svc.eqiad.wmflabs for tools.guc. Right now it seems that there that only 1 user can make 1 request to the tool at the same time. Whereas previously it was common to have at least 10 concurrent users via web, of which at least a few will have 2 or 3 tabs open. In addition, there are now 8 logical db sections. This means I'd expect the tool to make upto (8 * 10 * 1.5) 120 connections in parallel. Or, I'd need a new strategy for how to do all-wiki queries.

Right now we already consolidate or re-use connections for the same logical db section, reducing connections per user query from upto 800 to 8 or less.

bd808 added subscribers: Marostegui, jcrespo.

@Marostegui / @jcrespo : Is there a standard process yet to change the per-account connection limit? Is this something that the cloud-services-team should add to the service that provisions the accounts or something that we should manage elsewhere?

bd808,

120 connections in parallel I would say is too much; I wouldn't mind increasing concurrent connections from 10 to 15 or 20 if the usages are reasonable and justified, and the tools are super-important for the community, but 120 connections means that a single users reserves the usage of the database exclusively by itself for a long time- a database can only run efficiently 30-60 connections simultaneously (I mean literally simultaneously, there can be hundreds of connections open but some may be scheduled out of active execution/waiting for disk. Wikireplicas are supposed to be a highly shared resource, and practices like this later lead to complains by other users of "databases being slow", because their other queries have to be queued waiting for others to finish. Enabling a single user to connect 120 means exclusive usage of the database, something we recommend against (reserving connections) at https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#Connection_handling_policy

Note that policies can be discussed and this is not a definitive answer, we can open a discussion about what we should allow and what we disallow, as long as all or most of the service users are ok with it, and know the consequences in terms of penalties for others, I can make any changes requested, but personally, I find amusing that, the largest server that provides data for the English wikipedia (which is exactly as powerful as one wikireplica server) can serve 20 000 - 30 000 queries per second with only 100-150 concurrent connections, but we need more than those for a single wikireplica user (and we have thousands of those!)? https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1083&var-port=9104

@jcrespo Perhaps there's a terminology mismatch here, or perhaps I'm doing something in my tool that could be improved upon.

The webservice in question (https://tools.wmflabs.org/guc/) provides users with the ability to look up contributions (revisions) from a given user across all wikis. Essentially an "all wikis" version of Special:Contributions.

Input:

  • Username or IP address.
  • Source: revision, recent changes (last 30 days), recent changes (last 1 hour)

Output:

  • The 20 most recent edits from the given source from each wiki.
  • Grouped either by wiki, or by date.

Over the many years of maintenance, I've changed the way it performs its queries many times. Sometimes to reduce load, sometimes to make the queries faster, sometimes both. The current implementation is as follows:

  1. Get list of all wikis from meta_p.wiki. The query looks like SELECT * FROM meta_p.wiki WHERE is_closed = 0 AND family != 'centralauth';. This is the first outgoing query, and results in the first connection attempt. The query doesn't require any particular section, and defaults to s1. The result will provide the app with a list of all known wiki dbnames, server names, and their preferred database section name (e.g. s3 etc.).
  2. Batch query each section found in the previous query (one query for s1, one query for s2 etc.). The query is for either revision_userindex or recentchanges_userindex, performs a COUNT(), filters by user_text, and uses UNION ALL as a way to search many wikis at once with a single connection. The result will provide the app with the subset of wikis that have non-zero contributions, as well as a count of how many within the source (e.g. hour, month, or all time).
  3. Now that we know which wikis have non-zero contributions, we iterate over the tuple of (dbname, section) and for each, sequentially, make a more detailed query to the given source table with ORDER BY rev_timestamp DESC, LIMIT 20. Also, if the input was a username instead of an IP address, we first query the user table to get a user ID, so that we can filter by rev_user instead of rev_user_text, which is presumably faster.
  • At step 1, one connection is established to s1, and sends it one query.
  • Immediately after that in step 2, it sends another query to s1, and also connects to the other sections and sends no more than 1 query to each.
  • Immediately following that, the connections for the sections with non-zero edits are used once more for 1 or 2 more queries.
  • Lastly, all connections are closed.

This typically takes between 0.5s and 5s in total.

Clarifying notes:

  • No persistent connections or connection pools are used (other than in-process, as described above).
  • Web requests are handled by PHP, and no external service is used for sharing connections or anything like that.

It requires 8 (previously 7, but 8 given s8) connections to answer a single user request. (Used to be much more, but now down to 7.) Given the average user has 1.5 open tabs with the tool, and I'd like to be able to handle at least 10 concurrent users (probably more at peak, it's quite popular, but I'm trying to be cheap here.), that brings me to 8 * 10 * 1.5 = 120 potential connections.

In having written this up, I already see two areas for improvement:

  • In Step 3, first close connections for sections with zero edits found (instead of waiting until end of process).
  • In Step 3, when iterating over the wikis with non-zero edits, query them in order of section, and close the section connection when done with a group. Right now I query them in alphabetical order for display purposes, which has the benefit of being able to get output to the user much faster, and using less memory. If I re-arrange these, I'll have to buffer all data, sort later, and keep all output in-memory, thus increasing memory usage on the web server.

the largest server that provides data for the English wikipedia (which is exactly as powerful as one wikireplica server) can serve 20 000 - 30 000 queries per second with only 100-150 concurrent connections,

This particular sentence makes me question whether we use the same terminology, and in what way the tool is acting differently from MediaWiki. As far as I know, MediaWiki also re-uses connections in-process, which makes me wonder why the concurrent connection count is so low? I suppose to some extent, web requests don't all make their connection at exactly the same moment, so there can be multiple a second without their connections being in parallel, but still...

@Krinkle- I am surprised we are not more in line here; let me compare with mediawiki-production: production databases like enwiki have 20 available databases and mostly 1 user, we optimize for low latency; wikireplicas have 1 (out of 3, depending on the usage) server and thousands of users- not only you- which means we have to optimize for eficiency of resources per user -so that all users have fair usage of the available resources. Regarding connections- there is scheduling inside of mysql which limits the number of active threads per second, beyond the ones that are actually connected. The servers allow for thousands of concurrent connections, but not all can be in state running at the same time- plus we kill idle connections so they do not consume memory.

Again, this is just an explanation why we want to limit the number of connections per user- wikireplicas is not a dedicated resource for each tool- so some limits have to be agreed -whatever those are- so a single user does not take over all resources at the same time. "Just give me more resources" is not always a solution (but I also say it is not discarded), because a balance has to be found between making tools fast and using too many resouces, making other tools slow. A normal web application should not take more than 1 database connection per webrequest (less, if possible), and disconnect even before the response has been fully calculated and ready to be sent.

In other words- the same way we ask api.php users to use serialized calls if posible to do a fair usage of resources, we ask normally to avoid high parallelism on wikireplicas. I hope at least my comment is understood and we can keep talking about why a single tool needs so many connections, or if it could work with less of those. For example, I have found in the past tools that didn't properly close open database connections and that wasn't intended by the maintainer- maybe that is happening here?

@jcrespo I have indeed found some potential db handles not being closed. I'll look into that. But the question remains, aside from obviously bad things (unclosed connections), what is the best practice for querying wiki replicas from tools?

I'm currently following what I perceive to be a general industry best practice. The same as we do in MediaWiki: Lazily connect at the first query, re-use queries during the main part request handling to avoid needing to repeatedly re-establish the same connection for each query, and after getting all data, close connections, process the data, and send to the client.

The closest equivalent I can think of in production is probably Special:CentralAuth, e.g. https://en.wikipedia.org/wiki/Special:CentralAuth/Krinkle, which queries centralauth db for all wikis I am active on, and then queries each wiki's local user table for the current edit count and user groups. After each local-wiki query, it passes the db handle to $lb::reuseConnection(), but regardless, during the main portion of this web request, I'm fairly certain it has upto 8 concurrent active connection objects. One slave connection for each production db section.

The only alternative I see would be a radical approach to proactively close the MySQL connection after every single query, and re-establish a connection for every query. Is that what you are recommending as best practice? Is that what we should document for Toolforge? That is not currently the case in any tool I know of. It'll mean an increase in latency, but if that is what is required to make this fair and scalable, we should do it, and we'll need to document and annouce that widely as the recommended way for web tools to perform queries.

If that ends up being the case, I imagine it would make a lot of sense to invest in a local proxy on each Toolforge web server sort that can retain a connection without state (e.g. no selection snapshot, and disallow use <db> command in favour of db-prefixed select from) and let tools connect to that and send queries. Similar to what we do with nutcracker in production.

Alternatively, individual tools could try and maintain a socket file. E.g. max 1 socket for each db section and all concurrent web requests would use the same connection socket. And maybe a refcounter somewhere to automatically open/close it. It'd be extra work and ideally not something part of individual tools (given our the persistent storage there is NFS..), but it would make sense to prioritise a way to make it work if connection count is our main bottleneck.

Alternatively, we could introduce a hostname (or document an existing one) that will have all wikis on one server, and recommend web tools to use that for all their queries. Instead of the current approach which is that we connect based on db-section. Of course, that would make scaling more difficult in the future, but would reduce the connection count.

No, I do not expect you to close the connection after every query- but I expect, unlike dedicated databases, to close and not reuse connections OR limit the number of connections open to less than 5-10 a bit more if the resources are slightly increased. Persistent connections ("reusing connections") were, and are still disallowed- as reflected on https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#Connection_handling_policy even if they impact negatively your latency. Please also provide your account name, it is possible that other problems are happening, like you getting banned automatically by the db if you create too much traffic, etc.

I think the problem is that you see 8-800 different hosts and you think you need to reconnect. This is something we would like to have, but right now, there is only 1 single instance backing web services serving all the databases, because all tools have not adapted yet. If you could set a switch to change "connect to new db host", and change it to "use db", that allows in the future change to the multi-host configuration trivially, that would help. The main issue is that in the past, there used to be 3 hosts for everything, now there are 1 only for web, with another only being a backup if it fails (that other is not idle, it is serving analytics hosts).

Alternatively, we could introduce a hostname (or document an existing one) that will have all wikis on one server, and recommend web tools to use that for all their queries.

You can do that now, just use any wiki at random (all wikis are in all hosts); but we will not be able to do that in the future- as wiki grows, we will have to delete some wikis from some hosts (probably based on sections). E.g. wikidata is growing a lot.

To avoid programming overhead, my recommendation for your special case is to use dns to resolve the dns host- if it is the same ip, just reuse (change/use database) the connection, if it is different, create a new connections. That will make things compatible, no matter the underlying (and changing) topology. You can also query which host you are using once connected to by doing SELECT @@hostname. But even if you connect and disconnect every time you use a different section, I do not see why you have to keep the old connections open- query one section at a time, and close it when you are done, even if you do other processing on other "virtual" hosts. Keeping connections open that are idle because they are waiting on processing on other queries is bad. Mediawiki is guity of not even having a way to close connections, and that was really bad- but mediawiki can get away with it because it is a single user- there are thousands of other users/tools that must have equal opportunity to access those resources.

You can double your connection count by querying the analytics hosts, which currently is backed by 1-2 servers (1 is currently, at this very moment, offline due to ongoing issues T186579 ). Otherwise, stay in the low 10-20 per host and user. If you think you need more privileges than that, you do not have to convince me- send an email to the cloud list and convince the rest of the users that your tool need to have more privileges resources than the rest (maybe it does, I do not know how popular/important is) of the users OR advocate for more machines to be purchased by the cloud team, to sustain more users, which we would happily administrate if done so. E.g. maybe you can show me numbers that you reasonably use resources (120 connections) because you are actually only using that 1/1millionth of the time, and not use those the rest of the time.

Krinkle claimed this task.

Closing per T186436: Improve GUC database handling. The connection use is now down to at most 1 concurrent connection from an individual GUC web request.

I suppose that means there can still not be globally more than 10 web requests for GUC at the same time, but that's better than the previous situation which would trigger a problem whenever a second request started given 1 request used 9 connections, and the second one would always reach the limit.

For anyone getting "Database error: Unable to connect to" from GUC again in the future, please do re-open this task!

As I said, 10 increased to 15-20 is something I am open to do, and we did for others, e.g. if more than 10 people use it at the same time- that is reasonable. I still don't know which is your database user name to do so.