
MediaWiki API does not support parallel pagination
Closed, Declined · Public

Description

Problem
In T178715 we discovered that we needed to get a list of all of the pages a user has touched. It doesn't matter how many times the user has edited a page; we need to know whether that user has ever edited it at any point in history.

As an example, if User A edits a page in 2012 and User B edits the same page in 2017, and I am looking for revisions in 2017, we would need to know that User A edited the page (at some point in time), even though we are only requesting revisions in 2017.

The request for getting this information is something like this:
https://en.wikipedia.org/w/api.php?action=query&generator=allrevisions&garvuser=Kaldari&garvlimit=500

However, if you are querying for a user who has edited a huge number of pages, like Kaldari, you will have to paginate through over 100 pages of page IDs to get all of the pages they have edited.

This wouldn't be a huge problem, except that the requests have to be made serially (i.e. one after another). This can take roughly 20-30 seconds to complete in the user's browser (where the requests are being made).
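
Roughly, that serial loop looks like this (a minimal browser-side TypeScript sketch; the parameter names mirror the query URL above, while the function name, typing, and lack of error handling are just illustrative):

```ts
const API = 'https://en.wikipedia.org/w/api.php';

async function fetchAllEditedPageIds(user: string): Promise<Set<number>> {
  const pageIds = new Set<number>();
  // The first request carries no continuation; every later request has to wait
  // for the previous response, because that is where the continue token arrives.
  let continueParams: Record<string, string> | undefined = {};

  while (continueParams) {
    const params = new URLSearchParams({
      action: 'query',
      generator: 'allrevisions',
      garvuser: user,
      garvlimit: '500',
      format: 'json',
      origin: '*', // anonymous cross-origin request from the browser
      ...continueParams,
    });
    const data = await (await fetch(`${API}?${params}`)).json();

    for (const page of Object.values(data.query?.pages ?? {})) {
      pageIds.add((page as { pageid: number }).pageid);
    }
    // Merge the returned continue object into the next request; when the API
    // omits it, the result set is complete.
    continueParams = data.continue;
  }
  return pageIds;
}
```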

I understand that it is bad etiquette to make the requests in parallel; however, the cost in usability and performance is really high.

Solution
A potential solution could be to give clients the ability to pass a param to the API that will return all of the continue ids (and might not even return any results). Another solution could be to switch to incremental page numbers and return the total number of pages.

The API should cache the results of a query and allow the same query to be executed without being throttled. But when a query hits the origin server it should be throttled (by IP Address?). This will allow clients to execute the requests in parallel and allow the server to return the results as quickly as it wants to. I do not think it should be the responsibility of the client to throttle themselves (unless they want to).

Workaround
Make a tool on Toolforge that queries the database and returns the results in JSON format. In other words: Make Your Own API™

Event Timeline

Krinkle renamed this task from "MediaWiki API does not support asynchronous pagination" to "MediaWiki API does not support parallel pagination". Nov 8 2017, 11:09 PM
Krinkle updated the task description.

As an aside, you should nearly always use limit=max, and also look at the apihighlimits right too.

In the case of the user interaction tool, "Make Your Own API" is probably the right solution, although I wouldn't be opposed to making the MediaWiki API pagination more flexible. Generally though, in cases where you are sifting through huge amounts of data, you'll want to query against a database directly rather than sucking all the data over via API and then querying it.

> also look at the apihighlimits right too

These are anonymous cross-domain requests directly from the web browser.

> In the case of the user interaction tool, "Make Your Own API" is probably the right solution, although I wouldn't be opposed to making the MediaWiki API pagination more flexible. Generally though, in cases where you are sifting through huge amounts of data, you'll want to query against a database directly rather than sucking all the data over via API and then querying it.

Completely agree. It just seems like kind of an anti-pattern, but yeah there will always be use cases that are outside the use cases the API supports.

> Completely agree. It just seems like kind of an anti-pattern, but yeah there will always be use cases that are outside the use cases the API supports.

If you are in Toolforge (it sounds like you are or you would just be writing the Action API module you need), the Wiki Replicas allow you to query the revision tables directly. This kind of analytics/data mining work is a core use case for those servers existing.
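
For what it's worth, a rough sketch of that approach from a Node.js tool using the mysql2 client (the hostname, the credential environment variables, and the revision_userindex view with its 2017-era column names are assumptions to verify against the current Toolforge/Wiki Replicas documentation):

```ts
import mysql from 'mysql2/promise';

async function pagesEverEditedBy(userName: string): Promise<number[]> {
  const conn = await mysql.createConnection({
    host: 'enwiki.analytics.db.svc.wikimedia.cloud', // assumed replica host
    user: process.env.TOOL_REPLICA_USER,             // assumed credential env vars
    password: process.env.TOOL_REPLICA_PASSWORD,
    database: 'enwiki_p',
  });
  try {
    // One DISTINCT scan on the replica replaces 100+ paginated API round trips.
    const [rows] = await conn.execute(
      'SELECT DISTINCT rev_page FROM revision_userindex WHERE rev_user_text = ?',
      [userName]
    );
    return (rows as { rev_page: number }[]).map((row) => row.rev_page);
  } finally {
    await conn.end();
  }
}
```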

Anomie subscribed.

> we would need to know that User A edited the page (at some point in time), even though we are only requesting revisions in 2017.

> The request for getting this information is something like this:
> https://en.wikipedia.org/w/api.php?action=query&generator=allrevisions&garvuser=Kaldari&garvlimit=500

If you need to know if User A edited a page at some point in time, try https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=PAGE&rvuser=Kaldari&rvlimit=1. That should be reasonably efficient for logged-in users since it should use the page_user_timestamp index.
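
A minimal sketch of that check from the browser (hypothetical function name; as clarified further down in this thread, the page entry always comes back, and the absence of a revisions list is what indicates the user has never edited the page):

```ts
async function hasUserEverEdited(title: string, user: string): Promise<boolean> {
  const params = new URLSearchParams({
    action: 'query',
    prop: 'revisions',
    titles: title,
    rvuser: user,
    rvlimit: '1',
    format: 'json',
    origin: '*', // anonymous cross-origin request
  });
  const data = await (
    await fetch(`https://en.wikipedia.org/w/api.php?${params}`)
  ).json();
  const pages = Object.values(data.query?.pages ?? {}) as Array<{ revisions?: unknown[] }>;
  // A page with at least one matching revision means the user has edited it.
  return pages.some((page) => (page.revisions?.length ?? 0) > 0);
}
```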

> I understand that it is bad etiquette to make the requests in parallel

Indeed.

> A potential solution could be to give clients the ability to pass a param to the API that will return all of the continue ids (and might not even return any results).

Not possible without selecting all the potentially millions of results from the database.

> Another solution could be to switch to incremental page numbers and return the total number of pages.

We actively try to avoid that in the API. For one, that causes issues with missing or repeated results when items are added or removed on "earlier" pages in the list. For another, SQL queries with LIMIT X OFFSET Y get really slow as Y increases, see for example T59176.

> The API should cache the results of a query and allow the same query to be executed without being throttled.

At some point "cache everything" fails because your cache gets far too large.

> But when a query hits the origin server it should be throttled (by IP Address?).

IP address-based throttling has significant drawbacks for things like Toolforge. See T167906 for a very handwavey proposal along those lines that refused to consider those sorts of issues, and see T64615 for somewhat more concrete discussion.

> I do not think it should be the responsibility of the client to throttle themselves (unless they want to).

It is the responsibility of clients to not try to DOS the servers, even if the servers have DOS protection.

> If you need to know if User A edited a page at some point in time, try https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=PAGE&rvuser=Kaldari&rvlimit=1. That should be reasonably efficient for logged-in users since it should use the page_user_timestamp index.

As I mentioned before:

> These are anonymous cross-domain requests directly from the web browser.

I tried an article I know Kaldari hasn't edited, and it shows up in the results:
https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Frontier%20City&rvuser=Kaldari&rvlimit=1
so that doesn't really work for our use-case, but thanks for the suggestion anyways!

> I tried an article I know Kaldari hasn't edited, and it shows up in the results:
> https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Frontier%20City&rvuser=Kaldari&rvlimit=1
> so that doesn't really work for our use-case, but thanks for the suggestion anyways!

I think you're mis-interpreting the results. Compare https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=User:Legoktm&rvuser=Legoktm&rvlimit=1& and your query. The fact that there's no revisions listed in your Kaldari query means that he's never edited the article.

> I think you're mis-interpreting the results. Compare https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=User:Legoktm&rvuser=Legoktm&rvlimit=1& and your query. The fact that there's no revisions listed in your Kaldari query means that he's never edited the article.

Oh! I see. Thank you! That is very helpful.

> If you need to know if User A edited a page at some point in time, try https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=PAGE&rvuser=Kaldari&rvlimit=1. That should be reasonably efficient for logged-in users since it should use the page_user_timestamp index.
>
> As I mentioned before:
>
> These are anonymous cross-domain requests directly from the web browser.

Besides the misinterpretation Lego pointed out, my bad word choice in saying "logged-in" rather than "registered" may have been confusing. I was trying to say that it won't work efficiently if you supply a value for rvuser that doesn't correspond to a registered user, e.g. "127.0.0.1" or "SomeUserNameThatIsn'tRegistered". It doesn't matter whether the query is being made while logged-in or not.

The reason for that is because with a registered rvuser it can use the page_user_timestamp index to directly look for edits to the page by the user ID, while for an IP or non-registered name it has to use either page_timestamp (all edits to the page) then filter by user or usertext_timestamp (all edits by username) then filter by page.

> Besides the misinterpretation Lego pointed out, my bad word choice in saying "logged-in" rather than "registered" may have been confusing. I was trying to say that it won't work efficiently if you supply a value for rvuser that doesn't correspond to a registered user, e.g. "127.0.0.1" or "SomeUserNameThatIsn'tRegistered". It doesn't matter whether the query is being made while logged-in or not.
>
> The reason for that is because with a registered rvuser it can use the page_user_timestamp index to directly look for edits to the page by the user ID, while for an IP or non-registered name it has to use either page_timestamp (all edits to the page) then filter by user or usertext_timestamp (all edits by username) then filter by page.

Ah! Thank you for clarifying.

Also, I found that there are pros and cons to different pagination strategies:
http://blog.vermorel.com/journal/2015/5/8/nearly-all-web-apis-get-paging-wrong.html

The token-based pagination that we have now is the safest, but it obviously has other limitations.