Option on API lists to only have count of links/categories/whatever returned, rather than a full resultset
Open, Normal, Public

Description

Would it be possible for (possibly all?) queries that get links on a page, categories on a page, pages in a category, etc. to have an option (or even do this by default) to return the number of results?

I.e. for the link above,

instead of

<categories>
  <cl ns="14" title="Category:1879 births" />
  <cl ns="14" title="Category:1955 deaths" />
  <cl ns="14" title="Category:Academics of the Charles University" />
  <cl ns="14" title="Category:Albert Einstein" />
  <cl ns="14" title="Category:American humanists" />
  <cl ns="14" title="Category:American pacifists" />
  <cl ns="14" title="Category:American philosophers" />
  <cl ns="14" title="Category:American physicists" />
  <cl ns="14" title="Category:American socialists" />
  <cl ns="14" title="Category:American vegetarians" />
</categories>

something like

<categories count="10">
  <cl ns="14" title="Category:1879 births" />
  <cl ns="14" title="Category:1955 deaths" />
  <cl ns="14" title="Category:Academics of the Charles University" />
  <cl ns="14" title="Category:Albert Einstein" />
  <cl ns="14" title="Category:American humanists" />
  <cl ns="14" title="Category:American pacifists" />
  <cl ns="14" title="Category:American philosophers" />
  <cl ns="14" title="Category:American physicists" />
  <cl ns="14" title="Category:American socialists" />
  <cl ns="14" title="Category:American vegetarians" />
</categories>

And an option to pass &countonly (or something similar) and just get back

<categories count="10" />

Please?

Thanks


Version: 1.14.x
Severity: enhancement
URL: http://en.wikipedia.org/w/api.php?action=query&prop=categories&titles=Albert%20Einstein

Details

Reference
bz17993
Reedy created this task. Mar 15 2009, 9:07 PM

(In reply to comment #0)

> Would it be possible for (possibly all?) queries that get links on a page,
> categories on a page, pages in a category, etc. to have an option (or even
> do this by default) to return the number of results?
>
> I.e. for the link above,
>
> instead of
>
> <categories>
>   <cl ns="14" title="Category:1879 births" />
>   <cl ns="14" title="Category:1955 deaths" />
>   <cl ns="14" title="Category:Academics of the Charles University" />
>   <cl ns="14" title="Category:Albert Einstein" />
>   <cl ns="14" title="Category:American humanists" />
>   <cl ns="14" title="Category:American pacifists" />
>   <cl ns="14" title="Category:American philosophers" />
>   <cl ns="14" title="Category:American physicists" />
>   <cl ns="14" title="Category:American socialists" />
>   <cl ns="14" title="Category:American vegetarians" />
> </categories>
>
> something like
>
> <categories count="10">
>   <cl ns="14" title="Category:1879 births" />
>   <cl ns="14" title="Category:1955 deaths" />
>   <cl ns="14" title="Category:Academics of the Charles University" />
>   <cl ns="14" title="Category:Albert Einstein" />
>   <cl ns="14" title="Category:American humanists" />
>   <cl ns="14" title="Category:American pacifists" />
>   <cl ns="14" title="Category:American philosophers" />
>   <cl ns="14" title="Category:American physicists" />
>   <cl ns="14" title="Category:American socialists" />
>   <cl ns="14" title="Category:American vegetarians" />
> </categories>

I don't see the use here, as counting results on the client side is trivial and inexpensive.
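As a rough illustration of the client-side counting mentioned here, a minimal Python sketch could parse the XML and count the `<cl>` elements. The sample is the `<categories>` fragment from the report; the function name `count_categories` is made up for this example and is not part of any API.

```python
import xml.etree.ElementTree as ET

# The <categories> fragment quoted in the report above.
SAMPLE = """<categories>
  <cl ns="14" title="Category:1879 births" />
  <cl ns="14" title="Category:1955 deaths" />
  <cl ns="14" title="Category:Academics of the Charles University" />
  <cl ns="14" title="Category:Albert Einstein" />
  <cl ns="14" title="Category:American humanists" />
  <cl ns="14" title="Category:American pacifists" />
  <cl ns="14" title="Category:American philosophers" />
  <cl ns="14" title="Category:American physicists" />
  <cl ns="14" title="Category:American socialists" />
  <cl ns="14" title="Category:American vegetarians" />
</categories>"""


def count_categories(xml_text: str) -> int:
    """Count <cl> elements anywhere below the root of an XML response."""
    root = ET.fromstring(xml_text)
    return len(root.findall(".//cl"))


if __name__ == "__main__":
    print(count_categories(SAMPLE))
```

Note that this only counts what a single response contains; it says nothing about results beyond the query limit.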

> And an option to pass &countonly (or something similar) and just get back
>
> <categories count="10" />

I get that this would maybe save bandwidth; I'll look into implementing it.

Reedy added a comment. Mar 16 2009, 2:28 PM

Fair enough..

The idea came about in cases where, in AWB, I want to get a count of the categories on a page but couldn't care less what they are, so I might as well just get a count.

The <categories count="10" /> style, if implemented on most of the query types, would be useful!

But I suppose you're right: if you want the list of categories, the count is redundant there.

Thanks

*** Bug 20504 has been marked as a duplicate of this bug. ***

catlow wrote:

This would indeed be useful. Primarily in cases where the size of the result set is greater than the limit for a single query (for example, you want to know how many backlinks there are without having to execute a large number of consecutive queries). I don't see why this should be a problem (I don't know SQL, but thinking abstractly - if you're able to establish that exactly 10 pages are backlinks in a single query, you should be able to establish that exactly 10 pages are not backlinks, i.e. that N-10 pages are backlinks, and similarly for all numbers in between.)

(In reply to comment #4)

> This would indeed be useful. Primarily in cases where the size of the result
> set is greater than the limit for a single query (for example, you want to know
> how many backlinks there are without having to execute a large number of
> consecutive queries). I don't see why this should be a problem (I don't know
> SQL, but thinking abstractly - if you're able to establish that exactly 10
> pages are backlinks in a single query, you should be able to establish that
> exactly 10 pages are not backlinks, i.e. that N-10 pages are backlinks, and
> similarly for all numbers in between.)

It doesn't work this way. It's *possible* to find out how many backlinks there are without listing them all, but it's not *efficient* enough to be acceptable on Wikipedia.

catlow wrote:

So can you explain how it does work? Is the list of backlinks maintained explicitly in some table? Or does the software compile the list each time, by looking through the table of forward links? (I just can't imagine how there would be any efficiency difference between counting N positive results and effectively counting N negative ones.)

(In reply to comment #6)

> So can you explain how it does work? Is the list of backlinks maintained
> explicitly in some table? Or does the software compile the list each time, by
> looking through the table of forward links? (I just can't imagine how there
> would be any efficiency difference between counting N positive results and
> effectively counting N negative ones.)

They're in the pagelinks table, which has the fields pl_from (page ID of the source page), pl_namespace and pl_title (NS+title of the target page). There is an index on the table which allows us to retrieve data sorted by pl_namespace, then pl_title, then pl_from. Since we're looking for rows with e.g. pl_namespace=0 and pl_title=Foo, all rows we're looking for are consecutive, and the first one can easily be located using binary search.

This means we're not examining any rows that aren't in our list: we know our list is consecutive and where it starts. However, counting how many items are in the list still requires us to examine all of them, which means examining an arbitrary and possibly very large number of rows, which we don't want to do for performance reasons.

Another caveat is that the N-10 approach assumes that rows that don't satisfy our criterion are rare, which is definitely not the case in the pagelinks table for an enwiki-sized wiki.

(Disclaimer: all of this is based on my limited and possibly misguided understanding of how MySQL indexes work; all of this stuff happens in MySQL, not in MediaWiki)
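To make the table layout described above concrete, here is a small sketch using Python's built-in `sqlite3` as a stand-in for MySQL. The column names come from the comment; the index name `pl_target`, the sample rows, and the helper functions are assumptions made up for this example, and SQLite's planner is not MySQL's.

```python
import sqlite3

# Query shape described in the comment: fix namespace and title, read pl_from in order.
BACKLINKS_SQL = (
    "SELECT pl_from FROM pagelinks "
    "WHERE pl_namespace = ? AND pl_title = ? "
    "ORDER BY pl_from LIMIT ?"
)


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory pagelinks table with a (namespace, title, from) index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pagelinks (pl_from INTEGER, pl_namespace INTEGER, pl_title TEXT)"
    )
    # Hypothetical index name; the column order follows the comment above.
    conn.execute(
        "CREATE INDEX pl_target ON pagelinks (pl_namespace, pl_title, pl_from)"
    )
    rows = [(page_id, 0, "Foo") for page_id in range(1, 201)]
    rows += [(page_id, 0, "Bar") for page_id in range(1, 51)]
    conn.executemany("INSERT INTO pagelinks VALUES (?, ?, ?)", rows)
    return conn


def first_backlinks(conn: sqlite3.Connection, namespace: int, title: str, limit: int):
    """Return up to `limit` source page IDs linking to the given target."""
    return [row[0] for row in conn.execute(BACKLINKS_SQL, (namespace, title, limit))]


def query_plan(conn: sqlite3.Connection) -> str:
    """Return SQLite's description of how it would run the backlinks query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + BACKLINKS_SQL, (0, "Foo", 10)).fetchall()
    return " ".join(row[-1] for row in rows)


if __name__ == "__main__":
    db = build_demo_db()
    print(first_backlinks(db, 0, "Foo", 10))
    print(query_plan(db))
```

The plan output is only there to check that the index is what locates the rows; it does not show how many rows a count would have to touch.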

Bryan.TongMinh wrote:

This can only be implemented as returning "x", where x is a specific number or something that indicates "more than y". Would such a feature still be useful, or should this be closed as WONTFIX?

It's still useful, ish.

The use case was for the damn stupid "A page is an orphan if it has fewer than X incoming links" rule. And other such stupid responses.

Technically, the point of the request is to avoid being sent all the information that we're just going to ignore. We care about the count, not what the items are (in most cases).

To an extent, just setting the request limit to, say, wanted + 1 and seeing what we get back would do. But we're still getting useless information.

But returning "x", where x is a specific number or something that indicates "more than y", amounts to the same thing.

Though, technically, just doing the request without the SELECT columns (well, we'd need to select something trivial) would do it, surely? And then using the DB object to do a row count?
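The "limit of wanted + 1" workaround from this comment can be sketched on the client side roughly as follows. The `fetch` callable is a placeholder for whatever performs the request (for backlinks that would presumably be a `list=backlinks` query with `bllimit` set to the given limit); `count_up_to` and `is_orphan` are hypothetical names, not existing API features.

```python
from typing import Callable, List

# A fetch function takes a limit and returns at most that many results.
Fetch = Callable[[int], List[str]]


def count_up_to(fetch: Fetch, threshold: int) -> str:
    """Return the exact count if it is at most `threshold`, else 'more than N'."""
    results = fetch(threshold + 1)  # one extra row tells us whether there is more
    if len(results) > threshold:
        return f"more than {threshold}"
    return str(len(results))


def is_orphan(fetch: Fetch, min_links: int) -> bool:
    """True if there are fewer than `min_links` results (asking for N is enough)."""
    return len(fetch(min_links)) < min_links


if __name__ == "__main__":
    # Stand-in data instead of a real request.
    links = [f"Page {i}" for i in range(7)]
    fake_fetch = lambda limit: links[:limit]
    print(count_up_to(fake_fetch, 10))
    print(count_up_to(fake_fetch, 5))
    print(is_orphan(fake_fetch, 10))
```

As the comment says, this still transfers the titles themselves; it only bounds how many of them are sent.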

a.d.bergi wrote:

Isn't there even an SQL command to return just the size of the matched set instead of the items themselves? Of course Roan's explanation is good, but this is only the price of one query. If my aim were to count the set, I would have to make all the continue queries, which means the same searching through the table as it would have been for one query. Of course, this might be an argument for lifting any API limits, but the real advantage is the saving in bandwidth and PHP requests.

A script that could profit from this would be http://de.wikipedia.org/wiki/MediaWiki:Gadget-revisionCounter.js. Here just two queries would have to be done:

  • api.php?action=query&prop=revisions&titles=Foo&countonly
  • api.php?action=query&prop=revisions&titles=Foo&rvuser=Bar&countonly

Another example would be a tool to retrieve the number of template transclusions, just like http://toolserver.org/~jarry/templatecount/. There a simple call to api.php?action=query&list=embeddedin&eititle=Template:Foo&countonly would be enough.

The countonly parameter should work for all properties but "info", "categoryinfo" and "pageprops", and for all lists but "random". The "search" list already provides a "totalhits" parameter, which might be interesting. I don't think counting would be useful for meta queries.

(In reply to comment #10)

> Isn't there even an SQL command to return just the size of the matched set
> instead of the items themselves?

Yes. It's COUNT(*)

> Of course Roan's explanation is good, but this
> is only the price of one query.

It's not the same price. A LIMIT 50 query only inspects 50 rows (or maybe a bit more if there's a WHERE clause that can't be done with an index), whereas a COUNT(*) query will inspect all rows in the entire result set in order to count them. That could be a million rows in extreme cases (e.g. counting the number of category members of Living_people on enwiki, I think that's like 750k members). It should be obvious that a query examining 100 rows is much, much faster than a query that inspects almost a million.
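The difference in rows examined can be illustrated with a toy model in Python. This is a deliberately simplified sketch, not how MySQL is implemented: the `ToyIndex` class is invented for this example, it finds the first matching row by binary search and then walks rows one at a time, as the explanation above describes. (A plain in-memory sorted list could find the end of the range with a second binary search; the model does not do that, so that it mirrors the behaviour described in this thread.)

```python
from bisect import bisect_left
from typing import List, Optional, Tuple

Row = Tuple[int, str, int]  # (pl_namespace, pl_title, pl_from)


class ToyIndex:
    """Sorted rows plus a counter of how many rows a scan had to look at."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = sorted(rows)
        self.examined = 0

    def scan(self, namespace: int, title: str, limit: Optional[int] = None) -> List[int]:
        # Binary search locates the first row for (namespace, title).
        pos = bisect_left(self.rows, (namespace, title))
        found: List[int] = []
        while pos < len(self.rows) and self.rows[pos][:2] == (namespace, title):
            if limit is not None and len(found) >= limit:
                break  # a LIMIT-style query stops early
            self.examined += 1
            found.append(self.rows[pos][2])
            pos += 1
        return found


if __name__ == "__main__":
    rows = [(0, "Foo", i) for i in range(10_000)] + [(0, "Bar", i) for i in range(5)]
    index = ToyIndex(rows)

    index.scan(0, "Foo", limit=50)
    print("rows examined with a limit:", index.examined)

    index.examined = 0
    total = len(index.scan(0, "Foo"))  # counting means walking the whole range
    print("rows examined to count", total, "rows:", index.examined)
```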

> If my aim were to count the set, I would have
> to make all the continue queries, which means the same searching through the
> table as it would have been for one query. Of course, this might be an argument
> for lifting any API limits, but the real advantage is the saving in bandwidth and
> PHP requests.

Yes, it means the entire result set will be scanned eventually. But there's an advantage to not doing that all at once. Count queries of a high magnitude can easily take a minute, and at some point things start timing out (PHP max exec time limit, timeouts on the client side, timeouts in intermediate caching proxies)
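For comparison, counting a full result set by following continuation in bounded pages might look like the sketch below. The `fetch_page` callable and the fake API are assumptions for illustration only: a real client would issue one api.php request per call and pass back whatever continuation value the previous response contained.

```python
from typing import Callable, List, Optional, Tuple

# A page of results plus an opaque continuation token (None when finished).
Page = Tuple[List[str], Optional[str]]


def count_all(fetch_page: Callable[[Optional[str]], Page]) -> int:
    """Count every result by requesting one bounded page at a time."""
    total = 0
    token: Optional[str] = None
    while True:
        items, token = fetch_page(token)
        total += len(items)
        if token is None:
            return total


def make_fake_api(items: List[str], page_size: int) -> Callable[[Optional[str]], Page]:
    """Build a stand-in for the API that serves `items` in pages of `page_size`."""

    def fetch_page(token: Optional[str]) -> Page:
        start = int(token) if token is not None else 0
        chunk = items[start:start + page_size]
        next_start = start + page_size
        next_token = str(next_start) if next_start < len(items) else None
        return chunk, next_token

    return fetch_page


if __name__ == "__main__":
    api = make_fake_api([f"Page {i}" for i in range(1234)], page_size=500)
    print(count_all(api))
```

Each request stays small and bounded, which is the advantage described above, at the cost of several round trips.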

So we could return a count, but we'd have to keep paging in place.

Anomie moved this task from Unsorted to Needs Code on the MediaWiki-API board. Feb 20 2015, 7:51 PM