Page MenuHomePhabricator

Provide an API so external tools can run SQL queries via Quarry
Closed, DeclinedPublic

Description

Request for a Quarry API so external tools can do SQL queries via Quarry. The queries are variable user-driven so no way to setup a query ahead of time and access via page ID. Please close if this is an open request elsewhere.

Event Timeline

zhuyifei1999 renamed this task from Quary API to Quarry API.Oct 22 2017, 2:51 AM
zhuyifei1999 updated the task description. (Show Details)

If you are developing a tool or bot, you can run it on Wikimedia Cloud VPS or Toolforge and connect to the wiki replicas directly. Please see https://wikitech.wikimedia.org/wiki/Portal:Data_Services#Wiki_Replicas. I would not recommend using quarry as an machine interface (rather than human usage) to wiki replicas, as the result of every single query run has to be stored somewhere indefinitely.

Do you have a specific use case for an API here?

It is also possible to run a bot a normal way (from personal computer or your own server) and ssh tunnel to Toolforge to run a database query (I believe it is easy to google how to do it for popular languages, I did not have trouble doing it in Java). It sucks comparing to direct access when you are running on Toolforge itself, the data transfer makes it so that sometimes it is faster to request stuff via Mediawiki API than like this (I am talking from experience), but it is still possible. Well, and it is the only way if you use proprietary components in your code so cannot comply with Toolforge rules. (Unless you create your own API there)

As background I've written a unix command-line tool called Wikiget for accessing certain Wikimedia API functions to generate lists of article titles

https://github.com/greencardamom/Wikiget

I'd like to be able to offer a new switch so users can send SQL queries from the command-line with Wikiget via Quarry API. If the Quarry API requires OAUTH that's doable as I wrote a similar tool to access another OAUTH-based API so have some experience with it. Users would of course need to supply their own tokens and keys.

The queries in this case would be human-generated through the external tool Wikiget. I don't expect a lot of usage honestly but who knows what users will do with it. Personally I would like to use it for searching Wikipedia since the Wikimedia API search caps out at 10000 results due to a hard limit in Elasticsearch.

I'd like to be able to offer a new switch so users can send SQL queries from the command-line with Wikiget via Quarry API

At the current state, I don't think this is within the current scope of Quarry, which provides a means for researchers to share and review each other's queries, rather than command-line automation.

Also, some practical considerations:

  • All SQL code Quarry are licensed with CC0. Is there a way to guarantee that the writer of those command line arguments, which would be sent to quarry, agree on the license?
  • Quarry grant each query a query id (the 1 in https://quarry.wmflabs.org/query/1), and each query may have multiple revisions. If a user runs wikiget twice using the same query, how do you de-duplicate them?
  • Quarry may not be able to store large result sets, and that's a WONTFIX, since the storage space for result sets are limited. (The query status actually stays "running" forever, see T172086)

Personally I would like to use it for searching Wikipedia since the Wikimedia API search caps out at 10000 results due to a hard limit in Elasticsearch.

FYI, the text table is unavailable in the Wiki Replicas so you will be unable to search on the wikitect in Quarry.

If you really want to execute SQL queries, I recommend @Base's method, tunneling the connection to the wiki replicas through Toolforge, and allowing the user of wikiget to specify the database hostname, port, username, password, database name, via the command line arguments. Alternatively, I would not oppose anyone download Quarry query results, which are documented at https://meta.wikimedia.org/wiki/Research:Quarry#Downloading_a_resultset.

Aklapper renamed this task from Quarry API to Provide an API so external tools can run SQL queries via Quarry.Oct 22 2017, 12:04 PM

Ok I'll close this it doesn't sound like it will work but thanks for the explanations and good to have explored the idea anyway.