Throttling access to Special Pages that make potentially expensive queries
Open, High, Public

Description

So, AllPages isn't cached, and it probably shouldn't be. But it can be used to make expensive queries. And users can make a lot of simultaneous requests, which isn't good.

It'd be reasonable to have a way to limit simultaneous queries by a single user in these cases.
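As a rough illustration of the kind of limit being asked for, here is a minimal Python sketch of a per-user cap on simultaneous requests. It is not MediaWiki code: the class name `PerUserConcurrencyLimiter` and its methods are hypothetical, and the counters live in a single process, whereas a real deployment would need state shared across application servers.

```python
import threading
from contextlib import contextmanager


class PerUserConcurrencyLimiter:
    """Hypothetical helper: caps in-flight expensive requests per user."""

    def __init__(self, max_concurrent):
        self._max = max_concurrent
        self._active = {}  # user -> number of requests currently running
        self._lock = threading.Lock()

    def try_acquire(self, user):
        """Reserve a slot for `user`; return False if the cap is reached."""
        with self._lock:
            count = self._active.get(user, 0)
            if count >= self._max:
                return False
            self._active[user] = count + 1
            return True

    def release(self, user):
        """Give back a slot previously reserved with try_acquire()."""
        with self._lock:
            count = self._active.get(user, 0)
            if count <= 1:
                self._active.pop(user, None)
            else:
                self._active[user] = count - 1

    @contextmanager
    def slot(self, user):
        """Run a block while holding a slot; raise if none is free."""
        if not self.try_acquire(user):
            raise RuntimeError("too many simultaneous requests for " + user)
        try:
            yield
        finally:
            self.release(user)
```

A request handler would wrap the expensive query in `with limiter.slot(user): ...` and turn the `RuntimeError` into an error response for the client.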

Event Timeline

Reedy created this task. Mar 20 2017, 4:08 PM

What about doing pagination?

Reedy added a comment. Mar 20 2017, 4:15 PM

It is using pagination

BurstPower added a subscriber: BurstPower. Edited Mar 20 2017, 4:30 PM

@Reedy I am guessing that each time it sorts the entire table by URL or article name? I guess that is extremely expensive compared to sorting by ID.

Also, any chance you could provide me with the URLs of all articles?

The main page currently shows:

Welcome to Wikipedia,
the free encyclopedia that anyone can edit.
5,363,335 articles in English

Reedy added a comment. Mar 20 2017, 4:32 PM

https://dumps.wikimedia.org/enwiki/20170301/enwiki-20170301-all-titles-in-ns0.gz is a list of all titles in NS 0 (a.k.a. "content" pages), which on enwiki is the only content namespace:

reedy@tin:/srv/mediawiki-staging/php-1.29.0-wmf.16$ mwscript eval.php enwiki
> var_dump( $wgContentNamespaces );
array(1) {
  [0]=>
  int(0)
}

It is using pagination

It's most likely a bug, as it uses Elasticsearch. Unless it is using both MySQL and Elasticsearch for searching. The default Elasticsearch limit on search results is 10,000.

Reedy added a comment. Mar 20 2017, 4:37 PM

It's nothing to do with ElasticSearch

Tgr added a subscriber: Tgr. Mar 20 2017, 5:35 PM

Why is it potentially expensive? We have an index on NS + title.

@Reedy thank you very much, that file really helps me. It contains redirects as well, but I can handle them :)

Tgr added a comment. Edited Mar 20 2017, 5:40 PM

Ah, OK, the redirect filter. Just remove it? Unindexed queries like that should not be exposed in miser mode.
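To illustrate the point about the index, here is a small Python/SQLite sketch, not MediaWiki's actual query code. The table and column names (`page`, `page_namespace`, `page_title`, `page_is_redirect`) follow MediaWiki's schema, but the index name `demo_ns_title`, the sample rows, and the `list_titles` helper are made up for the example. Paging by namespace and title can walk the index directly, while the redirect filter is not part of that index, so the database may have to read and discard an unbounded number of rows before it fills one page.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory table shaped like a tiny MediaWiki `page` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page ("
        " page_id INTEGER PRIMARY KEY,"
        " page_namespace INTEGER NOT NULL,"
        " page_title TEXT NOT NULL,"
        " page_is_redirect INTEGER NOT NULL DEFAULT 0)"
    )
    # Hypothetical index name; it mirrors the "index on NS + title".
    conn.execute(
        "CREATE UNIQUE INDEX demo_ns_title ON page (page_namespace, page_title)"
    )
    rows = [
        (0, "Apple", 0), (0, "Apples", 1), (0, "Banana", 0),
        (0, "Bananas", 1), (0, "Cherry", 0), (1, "Apple", 0),
    ]
    conn.executemany(
        "INSERT INTO page (page_namespace, page_title, page_is_redirect)"
        " VALUES (?, ?, ?)",
        rows,
    )
    return conn


def list_titles(conn, namespace, start, limit, hide_redirects=False):
    """Return one page of titles, starting at `start` in title order."""
    sql = "SELECT page_title FROM page WHERE page_namespace = ? AND page_title >= ?"
    if hide_redirects:
        # Not covered by the index: matching rows are found by scanning
        # index entries and checking each row until `limit` rows pass.
        sql += " AND page_is_redirect = 0"
    sql += " ORDER BY page_title LIMIT ?"
    return [row[0] for row in conn.execute(sql, (namespace, start, limit))]
```

With `hide_redirects=True`, a long run of consecutive redirects forces the scan to continue far past `limit` rows, which is the expensive case discussed here.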

@Reedy I need some more help with redirects.

The redirect doesn't show up as the ResponseUri in HttpWebRequest.

So my only option is to encode the title to obtain the real URL.

However, I see that the space character is encoded as _ instead of +.

Are there any additional rules?

@Tgr @Paladox

Reedy added a comment. Mar 20 2017, 6:12 PM

Well, " " being exposed as "_" is very typical in MediaWiki; look at the page URLs to see this.

The API will resolve redirects for you, for example https://en.wikipedia.org/w/api.php?action=query&titles=WP:AWB&redirects
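A minimal Python sketch of using that API call follows. The helper names `build_redirect_query` and `resolve_targets` are hypothetical, and the sample payload in the usage note is hand-written to show the general shape of the `normalized` and `redirects` lists rather than captured from a live response, so check the real output before relying on it.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_redirect_query(titles):
    """Build an action=query URL that asks the API to resolve redirects."""
    params = {
        "action": "query",
        "titles": "|".join(titles),  # multiple titles are pipe-separated
        "redirects": "1",
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def resolve_targets(response, titles):
    """Map each requested title to its final title.

    Follows the "from" -> "to" pairs in the response's `normalized` and
    `redirects` lists; titles with no entry map to themselves.
    """
    query = response.get("query", {})
    mapping = {}
    for key in ("normalized", "redirects"):
        for entry in query.get(key, []):
            mapping[entry["from"]] = entry["to"]
    resolved = {}
    for title in titles:
        current, seen = title, set()
        while current in mapping and current not in seen:  # guard against loops
            seen.add(current)
            current = mapping[current]
        resolved[title] = current
    return resolved
```

For example, given the illustrative payload `{"query": {"normalized": [{"from": "WP:AWB", "to": "Wikipedia:AWB"}], "redirects": [{"from": "Wikipedia:AWB", "to": "Wikipedia:AutoWikiBrowser"}]}}`, `resolve_targets(payload, ["WP:AWB"])` maps `WP:AWB` to `Wikipedia:AutoWikiBrowser`.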

@Reedy thank you for the answer. However, it still doesn't show the encoded URL. I need encoded URLs to match against the same pages in other languages :)

I wish it did a server-side redirect instead of a client-side one. That way I would have the absolute final URL.

But I wonder about this:

Can we assume that the absolute final URL is the title of the page (obtained from the H1), with spaces replaced by _ and then URL-encoded?

Reedy added a comment. Mar 20 2017, 6:37 PM

I note this is really the wrong task for these discussions.

What sort of encoding do you have? What sort of encoding do you need?

Can we assume that the absolute final URL is the title of the page (obtained from the H1), with spaces replaced by _ and then URL-encoded?

You can't take the title of the page from the <h1>, as it can be modified for display (e.g. iPhone with lowercase first letter or various articles with HTML formatting).

But… if you're already able to access the page HTML, you must already have a title, and in fact, you must already have the URL you used to access the page… why do you need to build it again?

FWIW, canonical page URLs are just https://en.wikipedia.org/wiki/ + page title with spaces replaced with _ and other special characters percent-encoded as usual.
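That rule can be sketched in a few lines of Python. The function name `canonical_url` is hypothetical, and the set of punctuation characters left unescaped (`SAFE`) is an assumption about what MediaWiki keeps readable in URLs; the sketch also assumes the title is already in its stored form (for example, with the correct first-letter capitalisation).

```python
from urllib.parse import quote

BASE_URL = "https://en.wikipedia.org/wiki/"

# Assumption: punctuation left unescaped in page URLs. Adjust if the
# URLs you compare against encode any of these characters.
SAFE = ";@$!*(),/~:"


def canonical_url(title):
    """Build a page URL: spaces become "_", the rest is percent-encoded."""
    return BASE_URL + quote(title.replace(" ", "_"), safe=SAFE)
```

Non-ASCII characters are encoded as UTF-8 before percent-encoding, so `canonical_url("Café")` ends in `Caf%C3%A9`, and `canonical_url("Main Page")` ends in `Main_Page`.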

matmarex removed a subscriber: matmarex. Mar 20 2017, 6:38 PM
Tgr added a comment. Mar 20 2017, 6:58 PM

The API can resolve redirects for you (see the redirects parameter); that, unlike filtering, should not raise any performance concerns.

@BurstPower here is a list of all pages in enwiki's mainspace minus redirects: http://tools.wmflabs.org/betacommand-dev/reports/en_articles.txt