
Throttling access to Special Pages that make potentially expensive queries
Open, High, Public

Description

So, AllPages isn't cached, and it probably shouldn't be. But it can be used to make expensive queries, and a user can make a lot of simultaneous requests, which isn't good.

It would be reasonable to have a way to limit simultaneous queries by a user in these cases.
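
To make the request concrete, here is a minimal sketch of a per-user concurrency cap. It is not how MediaWiki does or would do this; the class `PerUserLimiter`, its `slot` method, and the choice to refuse rather than queue are all assumptions made for illustration, and the sketch only covers a single process.

```python
import threading
from collections import defaultdict
from contextlib import contextmanager


class PerUserLimiter:
    """Cap how many expensive requests one user may have in flight (hypothetical helper)."""

    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self._lock = threading.Lock()
        self._in_flight = defaultdict(int)  # user -> number of running requests

    @contextmanager
    def slot(self, user):
        # Reserve a slot, or refuse immediately instead of queueing the request.
        with self._lock:
            if self._in_flight[user] >= self.max_concurrent:
                raise RuntimeError(f"too many simultaneous requests for {user}")
            self._in_flight[user] += 1
        try:
            yield
        finally:
            # Release the slot even if the expensive query raised.
            with self._lock:
                self._in_flight[user] -= 1
                if self._in_flight[user] == 0:
                    del self._in_flight[user]
```

A request handler would wrap the expensive query in `with limiter.slot(user): ...` and turn the `RuntimeError` into an error response. A real deployment with several web servers would need a shared counter rather than in-process state.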

Event Timeline

@Reedy I am guessing that each time it sorts the entire table by URL or by the name of the article? I guess that is extremely expensive compared to sorting by ID.

Also, any chance you could provide me with the URLs of all articles?

It currently shows:

Welcome to Wikipedia,
the free encyclopedia that anyone can edit.
5,363,335 articles in English

https://dumps.wikimedia.org/enwiki/20170301/enwiki-20170301-all-titles-in-ns0.gz is a list of all NS 0 (a.k.a. "content") pages; on enwiki, NS 0 is the only content namespace:
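
For anyone consuming that file, a minimal sketch of reading it after downloading might look like the following. The function name `iter_titles` is made up for illustration, and the sketch assumes the file is gzip-compressed UTF-8 text with one title per line and possibly a `page_title` header as its first line.

```python
import gzip


def iter_titles(path):
    """Yield page titles from a downloaded all-titles-in-ns0 .gz file, one per line."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for i, line in enumerate(fh):
            title = line.rstrip("\n")
            # Assumption: the first line may be a column header rather than a title.
            if i == 0 and title == "page_title":
                continue
            if title:
                yield title
```

Streaming the file line by line avoids holding several million titles in memory at once.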

reedy@tin:/srv/mediawiki-staging/php-1.29.0-wmf.16$ mwscript eval.php enwiki
> var_dump( $wgContentNamespaces );
array(1) {
  [0]=>
  int(0)
}

It is using pagination.

It's most likely a bug, as it uses Elasticsearch, unless it is using both MySQL and Elasticsearch for searching. Elasticsearch's default limit for search results is 10,000.

It has nothing to do with Elasticsearch.

Why is it potentially expensive? We have an index on NS + title.

@Reedy thank you very much, that file really helps me. It contains redirects as well, but I can handle them :)

Ah, OK, the redirect filter. Just remove it? Unindexed queries like that should not be exposed in miser mode.

@Reedy I need some more help with redirects.

HttpWebRequest doesn't show the redirect target as ResponseUri.

So my only option is to encode the title to obtain the real URL.

However, I see that the space character is encoded as _ instead of +.

Are there any additional rules?

@Tgr @Paladox

Well, " " being exposed as "_" is very typical in MediaWiki; look at the page URLs to see this.

The API will resolve redirects for you, for example https://en.wikipedia.org/w/api.php?action=query&titles=WP:AWB&redirects
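
As a rough sketch of using that from a script: the code below builds the same kind of query URL and reads the redirect mapping out of a JSON response. The helper names are made up for illustration, `format=json` is added to get a JSON body, and the sample response is an illustrative stand-in rather than output captured from the live API; the sketch assumes the redirects appear as a list of `from`/`to` pairs under `query`.

```python
import json
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def redirect_query_url(titles):
    """Build an action=query URL that asks the API to resolve redirects."""
    params = {
        "action": "query",
        "titles": "|".join(titles),  # several titles can share one request
        "redirects": 1,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def redirect_targets(response_text):
    """Map each redirect source title to its target, given a JSON response body."""
    data = json.loads(response_text)
    redirects = data.get("query", {}).get("redirects", [])
    return {r["from"]: r["to"] for r in redirects}


# Illustrative response body with placeholder titles, not real API output.
sample = '{"query": {"redirects": [{"from": "Example redirect", "to": "Example target"}]}}'
print(redirect_query_url(["WP:AWB"]))
print(redirect_targets(sample))
```

Batching several titles into one request keeps the number of API calls down when working through a long title list.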

@Reedy thanks for the answer. However, it still doesn't show the encoded URL. I need encoded URLs to match against the same pages in other languages :)

I wish it did a server-side redirect instead of a client-side one. That way I would have the absolute final URL.

But I wonder about this:

Can we assume that the absolute final URL is the title of the page (obtained from the H1), with the space character replaced by _, and then URL-encoded?

I note this is really the wrong task for these discussions.

What sort of encoding do you have? What sort of encoding do you need?

You can't take the title of the page from the <h1>, as it can be modified for display (e.g. iPhone with lowercase first letter or various articles with HTML formatting).

But… if you're already able to access the page HTML, you must already have a title, and in fact, you must already have the URL you used to access the page… why do you need to build it again?

FWIW, canonical page URLs are just https://en.wikipedia.org/wiki/ + page title with spaces replaced with _ and other special characters percent-encoded as usual.
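
A minimal sketch of that rule, for illustration only: the function name `page_url` is made up, and the choice to leave `/` and `:` unencoded is an assumption. MediaWiki may leave further punctuation unencoded, so compare by decoded title rather than by raw URL string if exact matching matters.

```python
from urllib.parse import quote

BASE = "https://en.wikipedia.org/wiki/"


def page_url(title):
    """Build a page URL: spaces become "_", other special characters are percent-encoded."""
    # Assumption: "/" and ":" stay unencoded, as seen in subpage and namespace URLs.
    return BASE + quote(title.replace(" ", "_"), safe="/:")


print(page_url("Albert Einstein"))
print(page_url("Café"))
```

Non-ASCII characters are encoded as UTF-8 bytes, so "é" becomes `%C3%A9`.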

The API can resolve redirects for you (see the redirects parameter); that, unlike filtering, should not raise any performance concerns.