Special:LinkSearch fails over 10,000
Closed, Resolved · Public

Description

Special:LinkSearch will not return links after the first 10,000. You were previously able to manually change the URL in order to display them, but this no longer seems to work.

E.g., https://en.wikipedia.org/w/index.php?title=Special:LinkSearch&limit=5000&offset=5000&target=http%3A%2F%2F*.highbeam.com works, but https://en.wikipedia.org/w/index.php?title=Special:LinkSearch&limit=100&offset=10000&target=http%3A%2F%2F*.highbeam.com returns "There are no results for this report" (and there should be results). https://en.wikipedia.org/w/index.php?title=Special:LinkSearch&limit=5000&offset=6000&target=http%3A%2F%2F*.highbeam.com returns 4000 results rather than the 5000 requested (and again, I know that there are results beyond that).
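
The three observations above are consistent with a hard cap of 10,000 on offset + limit. A minimal sketch of that model follows; it only reproduces the reported numbers and is not the MediaWiki implementation. The function name, the cap parameter, and the total of 30,000 matching links are assumptions for illustration.

```python
def visible_rows(total, limit, offset, cap=10_000):
    """Rows a pager would show if it never reads past row number `cap`.

    Hypothetical model: `total` is the number of matching links and `cap`
    is the highest row the pager is willing to reach.
    """
    end = min(offset + limit, cap, total)
    return max(0, end - offset)


# Assumed total of 30,000 matching links (illustrative only).
TOTAL = 30_000
first = visible_rows(TOTAL, limit=5000, offset=5000)    # full page
beyond = visible_rows(TOTAL, limit=100, offset=10000)   # nothing past the cap
partial = visible_rows(TOTAL, limit=5000, offset=6000)  # truncated page
```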

Event Timeline

Restricted Application added a subscriber: Aklapper. · Mar 15 2016, 7:25 PM
Florian triaged this task as Medium priority. · Mar 24 2016, 5:36 PM
Florian moved this task from To triage to SpecialPage system on the MediaWiki-Special-pages board.

How big an issue is this in practice?

Maybe we should increase the limit beyond 10,000. 20,000?

The ideal solution, of course, is to give QueryPage subclasses an option to page using WHERE conditions instead of an offset.
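
Paging with WHERE conditions can be sketched roughly as follows: each page starts after the last row already seen, so the database never has to skip over earlier rows the way a large offset does. This uses an in-memory SQLite table as a stand-in; the simplified schema, the example.org data, and the function names are assumptions, and nothing here is QueryPage code.

```python
import sqlite3


def make_db(n_rows):
    """Build a small stand-in for the externallinks table (simplified schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE externallinks ("
        "el_id INTEGER PRIMARY KEY, el_from INTEGER, el_index TEXT)"
    )
    conn.executemany(
        "INSERT INTO externallinks (el_from, el_index) VALUES (?, ?)",
        [(i % 7, f"http://org.example./page/{i // 2:05d}") for i in range(n_rows)],
    )
    conn.execute("CREATE INDEX el_index_id ON externallinks (el_index, el_id)")
    return conn


def page_by_where(conn, pattern, limit, after=None):
    """Fetch one page; `after` is (el_index, el_id) of the last row already seen."""
    sql = "SELECT el_index, el_id, el_from FROM externallinks WHERE el_index LIKE ?"
    params = [pattern]
    if after is not None:
        # Resume strictly after the previous page's last row.
        sql += " AND (el_index > ? OR (el_index = ? AND el_id > ?))"
        params += [after[0], after[0], after[1]]
    sql += " ORDER BY el_index, el_id LIMIT ?"
    params.append(limit)
    return conn.execute(sql, params).fetchall()


def walk_all(conn, pattern, limit):
    """Collect every matching row by following pages until one comes back empty."""
    after, rows = None, []
    while True:
        page = page_by_where(conn, pattern, limit, after)
        if not page:
            return rows
        rows.extend(page)
        after = (page[-1][0], page[-1][1])
```

The el_id tiebreaker matters because several rows can share the same el_index; without it, rows with a duplicated el_index could be skipped at a page boundary.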

Nikkimaria added a comment. · Edited · Mar 24 2016, 11:24 PM

If we could have an option to effectively limit by namespace, that would be great. Even then, you'd need a much higher limit in order to not have an issue. The example I gave should return over 30,000 results, and that's not uncommon. For example, when I'm doing patrols for unreliable/inappropriate external links, I frequently do searches in the 30,000-40,000-results-range because the list is so full of non-mainspace links. But even limiting to mainspace, doing multi-domain searches often exceeds 10,000.

Ocaasi added a subscriber: Ocaasi. · Edited · Mar 25 2016, 6:29 PM

For The Wikipedia Library, we need the ability to query up to 100,000 links for many of our publisher partners. Without that ability we can't demonstrate to them the effectiveness or impact of research access. We send publishers monthly and quarterly reports showing total links and increases in the number of links in a given time period. These reports legitimize the value of our relationships. Beyond paywalled publishers, a robust link search is also critical to demonstrating the impact of open or free-to-read sources on Wikipedia. Without the ability to show that there are a high number of open sources on Wikipedia, it is more difficult to demonstrate to the publishing community that open licenses benefit the spread of their content through Wikipedia citation and traffic.

A 10,000-link cap ignores the top 300 most referenced sources on English Wikipedia (and that only includes references in article space). See P587 for the report. Special:LinkSearch is also an important tool for looking at non-article-space linking by spammers of major websites (for example, we don't want miles of Google Book links hanging out in some back page).

Moreover, there is a substantial use case on other wikis (on Spanish Wikipedia, the cap would exclude over 15 URLs on P654).

We are working on a different way to report changes in URL usage at T115119. We could use support in completing the existing bug there, but that only deals with changes in URLs and doesn't account for the volumes of material already on other wikis.

Adding a few: @Legoktm @Samwalton9

Sadads added a comment. · Apr 6 2016, 4:27 PM

@Florian & @Bawolff, is there any chance we can raise the cap?

matmarex reopened this task as Open. · Apr 14 2016, 7:51 PM
matmarex added a subscriber: matmarex.

T47237 would be a proper fix for this problem, but I guess it's not an exact duplicate.

Until/unless LinkSearch is changed to allow for more results, I'd suggest just running database queries to get the link count (since, as I understand it, the count is all you care about). They should be quick enough to run via Quarry.


For example, to find out how many pages link to jstor.org:

use enwiki;
select count(distinct el_from) from externallinks
where ( el_index like 'http://org.jstor.%' or el_index like 'https://org.jstor.%' );

I just ran this query and the result is 339954 pages (query took 8 min 24.69 sec).


Or, to find out how many pages in the main namespace link to jstor.org:

(This query is slower and will probably time out on Quarry, at least for some domains. Try running it on Analytics slaves.)

use enwiki;
-- page_namespace = 0 restricts the count to the main (article) namespace
select count(distinct el_from) from externallinks
join page on page_id = el_from
where ( el_index like 'http://org.jstor.%' or el_index like 'https://org.jstor.%' )
and page_namespace = 0;

I just ran this query and the result is 110773 pages (query took 17 min 29.39 sec).

Change 283753 had a related patch set uploaded (by Brian Wolff):
Put a high max limit of 60,000 on Special:LinkSearch

https://gerrit.wikimedia.org/r/283753

Nikkimaria's request above for an option to effectively limit by namespace is T12593.

Change 283753 merged by jenkins-bot:
Put a high max limit of 60,000 on Special:LinkSearch

https://gerrit.wikimedia.org/r/283753

@Nikkimaria and everyone: The limit should be increased to 60,000 come Thursday. I hope this will at least help things in the short term.

For The Wikipedia Library's use case described above (reporting link counts to publisher partners), wouldn't it be much easier to run a count(*)-type query on quarry.wmflabs.org?

Correction to the above: the limit should be increased to 60,000 next Thursday (next week), not this one. There are no deployments this week.

Restricted Application added a subscriber: TerraCodes. · Apr 19 2016, 3:33 PM
matmarex closed this task as Resolved. · Apr 19 2016, 3:34 PM
matmarex assigned this task to Bawolff.

The query might still time out for you for high offsets. That can't be worked around other than by fixing T47237 (or using a service like Quarry to do queries), so please watch that bug.

TheDJ added a subscriber: TheDJ. · May 2 2016, 12:48 PM

Note that on Quarry, the command is use enwiki_p; (where _p indicates the public version of the replicated database).

@matmarex How does the URL syntax work for the Quarry query? I want to search for http://www.jstor.org/stable/ links, but I'm not convinced I understand that syntax enough to alter it to search for this. Thanks in advance.

In el_index, the domain name is "reversed" and a trailing dot is added, to allow efficient querying for things like *.jstor.org (this is implemented in wfMakeUrlIndexes()). To find any page under http://www.jstor.org/stable/, you'll want to query for el_index like 'http://org.jstor.www./stable/%' (you might also want to check for versions with 'https' and without 'www', if they're allowed?).
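
The reversal described above can be sketched as a small helper that turns an ordinary URL into an el_index-style LIKE pattern. This is an illustration of the transformation only, not a port of wfMakeUrlIndexes(); the function names are made up, and it ignores ports, query strings, protocol-relative links, and escaping of % or _ in the URL.

```python
from urllib.parse import urlsplit


def el_index_prefix(url):
    """Reverse the host labels, add the trailing dot, keep scheme and path."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    reversed_host = ".".join(reversed(host.split("."))) + "."
    return f"{parts.scheme}://{reversed_host}{parts.path}"


def like_pattern(url):
    """LIKE pattern matching anything stored under the given URL prefix."""
    return el_index_prefix(url) + "%"


stable = like_pattern("http://www.jstor.org/stable/")
any_subdomain = like_pattern("http://jstor.org")
```

Here stable comes out as the pattern used in this comment, and any_subdomain as the 'http://org.jstor.%' pattern from the earlier count queries, which is why a reversed host makes *.jstor.org a simple prefix match.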

@matmarex Thanks! I'm doing a "el_index like 'http://org.jstor.www./stable/%' or el_index like 'https://org.jstor.www./stable/%'" query now, but it doesn't seem to be finding as many links as LinkSearch did previously - any idea why? I'm expecting around 80,000 links, but only getting approximately 40000 with the query. The LinkSearch syntax was "http://www.jstor.org/stable/" (which, on an odd unrelated note, picks up https links).

No idea. They should be the same. I'd say you must be querying for something slightly different than before (e.g., previously you checked both with 'www' and without, or you checked just 'jstor.org/' and not 'jstor.org/stable/').

Oh, I see it now – Special:LinkSearch shows every link for every page it is used on, while the query just counts the number of distinct links. If you remove the distinct from the query, the result is 120208 rows. (Although that might count protocol-relative links twice, ugh.)

This query should return the same number of results as paging through Special:LinkSearch:

-- count(*) without distinct counts every (page, link) pair, as Special:LinkSearch lists them
select count(*) from page, externallinks
where page_id = el_from
and el_index like 'http://org.jstor.www./stable/%';

Apparently, that is 76714 combinations of link target and page.
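
To make the difference between the two kinds of count concrete, here is a toy reproduction using an in-memory SQLite database. The five rows are invented for illustration and the schema is cut down to the columns used in this thread; only the shape of the two queries follows the ones above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER);
    CREATE TABLE externallinks (el_from INTEGER, el_index TEXT);
    INSERT INTO page VALUES (1, 0), (2, 0), (3, 1);
    INSERT INTO externallinks VALUES
      (1, 'http://org.jstor.www./stable/100'),
      (1, 'http://org.jstor.www./stable/200'),
      (2, 'http://org.jstor.www./stable/100'),
      (3, 'http://org.jstor.www./stable/300'),
      (3, 'http://org.example.www./other');
    """
)
pattern = "http://org.jstor.www./stable/%"

# Number of distinct pages carrying at least one matching link.
pages = conn.execute(
    "SELECT count(DISTINCT el_from) FROM externallinks WHERE el_index LIKE ?",
    (pattern,),
).fetchone()[0]

# Number of (page, link) pairs, i.e. one per row a link listing would show.
pairs = conn.execute(
    "SELECT count(*) FROM page JOIN externallinks ON page_id = el_from "
    "WHERE el_index LIKE ?",
    (pattern,),
).fetchone()[0]
```

With this toy data, pages is 3 and pairs is 4, because page 1 links to two different /stable/ URLs and is counted once by the distinct query but twice by the pair count.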

Excellent, thanks!