URLs in page source should be properly indexed so they're findable using search
Open, LowestPublic

Description

Author: sumanah

Description:

  1. Note that https://test2.wikipedia.org/w/index.php?title=Birch_beer&oldid=57684 includes a link to growstuff.org.
  2. Search test2wiki for "growstuff.org" - https://test2.wikipedia.org/w/index.php?search=growstuff.org&title=Special%3ASearch
  3. Empty results set.

What is the desired behavior here? If a page does not *mention* growstuff.org but does *link* to it, should we include it in the results set?


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=62058

Details

Reference
bz52905

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 1:49 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz52905.

I _think_ we should include it. One way to think of this is, if we did include it, how would you like it highlighted? Another thing to consider is that we're mostly optimized for searching for words, so we might not notice a URL in the stream and could end up splitting it or (heaven forbid) stemming it.
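To illustrate the splitting concern, here is a minimal Python sketch. It is not the analyzer CirrusSearch uses; both functions are hypothetical stand-ins that contrast a word-oriented tokenizer with one that leaves a URL intact.

```python
import re


def naive_word_tokens(text):
    """Lowercase, then split on every character that is not a letter or digit.

    This mimics (very roughly) a tokenizer tuned for words, not URLs.
    """
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def whitespace_tokens(text):
    """Lowercase, then split on whitespace only, so a URL stays one token."""
    return text.lower().split()


sample = "See http://growstuff.org for crops"
print(naive_word_tokens(sample))
print(whitespace_tokens(sample))
```

With the word-oriented splitter the URL turns into the separate tokens `http`, `growstuff` and `org`, while the whitespace splitter keeps `http://growstuff.org` as a single token.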

Like Bug 53013, my gut says set the priority to low because we're mostly concerned with searching words. So I'm setting the priority to low. We should revisit this once we're comfortable with other issues.

demon added a comment. Dec 20 2013, 8:06 PM

I've been pondering this, and I'm not convinced we should index it. I can't think of a sane way of doing so, or how to reinsert it into the content (which we've already stripped of all wikitext and html).

We have Special:LinkSearch, does it not work?

*** Bug 59205 has been marked as a duplicate of this bug. ***

Bug 59205 showed us that folks do expect link searches to work. Options:

0. Do nothing.

  1. Detect a link in the search and send people to Special:LinkSearch. If folks are searching for full uris without extra terms this would probably work.
  2. Index links in their own multivalued field like section heading but with a uri or non-splitting analyzer and display them like file contents matches. Search them all the time. This would find links to places in the results.
  3. #2 but only search them with terms that "look like" uris. This one makes more sense if users are searching for whole uris AND other terms at the same time.
  4. Figure out some way to get the uris back into the text but strip them out on matches for which they were not explicitly searched. This would produce results similar to what works now but is technically more difficult (changes to how we get parsed output, changes to cirrus, probably changes to Elasticsearch to strip the uris during the highlighting phase).

Chad got us indexing the links: https://gerrit.wikimedia.org/r/#/c/104986/

Now I'll grab searching them.

I'm going to shoot for option #3 in comment 4. So we'll only look in the link field if one of the terms looks like a URI.

An important point I didn't realize at first: if a term "looks like" a link, we can't just search the links. We have to OR that together with searching the text. No big deal, just more syntax we have to send to Elasticsearch.
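A rough sketch of that OR in Python, building an Elasticsearch-style query body as a plain dict. This is not the patch itself: the field names `text` and `external_link`, the `looks_like_uri` heuristic and the `build_query` helper are all assumptions made for illustration.

```python
import re

# Hypothetical heuristic: a scheme, then "://", then at least one more character.
URI_PATTERN = re.compile(r"^[a-z][a-z0-9+.-]*://\S+$", re.IGNORECASE)


def looks_like_uri(term):
    """Return True when a single search term has the shape scheme://rest."""
    return bool(URI_PATTERN.match(term))


def build_query(user_query, text_field="text", link_field="external_link"):
    """Build a query body that always searches the text and, when a term
    looks like a URI, ORs in an exact match on the (assumed) link field."""
    text_clause = {"match": {text_field: user_query}}
    uri_terms = [t for t in user_query.split() if looks_like_uri(t)]
    if not uri_terms:
        # No URI-like term: plain text search only.
        return {"query": text_clause}
    link_clauses = [{"term": {link_field: t}} for t in uri_terms]
    return {
        "query": {
            "bool": {
                # "should" clauses are ORed; at least one must match.
                "should": [text_clause] + link_clauses,
                "minimum_should_match": 1,
            }
        }
    }


print(build_query("birch beer"))
print(build_query("http://growstuff.org crops"))
```

Under this heuristic a bare host such as `growstuff.org` has no scheme, so it would not trigger the link clause at all.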

Another point: Sumana's original query still wouldn't find her growstuff link. You'd have to search for it as http://growstuff.org. Still, we're better off than we were.

Change 105202 had a related patch set uploaded by Manybubbles:
Search links

https://gerrit.wikimedia.org/r/105202

(In reply to Nik Everett from comment #8)

Another point: Sumana's original query still wouldn't find her growstuff
link. You'd have to search for it as http://growstuff.org. Still, we're
better off then we were.

Just a note that being able to search for partial URL strings is quite useful when trying to combat spam, or when updating links to sites that reorganized their directory structure without leaving proper redirects.

Hence, option 1 from comment #4 might be a good addition. Thanks!

Change 105202 abandoned by Manybubbles:
Search links

https://gerrit.wikimedia.org/r/105202

The patch was abandoned as it wasn't relevant.

We will possibly redirect users to [[Special:LinkSearch]] if they type a URL into the search box, as it will serve the user's needs.
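One way such a redirect could be decided, sketched in Python. Nothing here is implemented behavior: the `redirect_target` helper and the URI heuristic are hypothetical, and the sketch assumes Special:LinkSearch accepts the URL to look up in a `target` query parameter.

```python
import re
from urllib.parse import urlencode

# Hypothetical heuristic: the whole query is one token shaped like scheme://rest.
BARE_URI = re.compile(r"^[a-z][a-z0-9+.-]*://\S+$", re.IGNORECASE)


def redirect_target(base, query):
    """Return a Special:LinkSearch URL when the query is a lone URI, else None."""
    query = query.strip()
    if not BARE_URI.match(query):
        return None  # ordinary search terms, or a URI mixed with other terms
    # The "target" parameter name is an assumption about Special:LinkSearch.
    return base + "?" + urlencode({"title": "Special:LinkSearch", "target": query})


base = "https://test2.wikipedia.org/w/index.php"
print(redirect_target(base, "http://growstuff.org"))
print(redirect_target(base, "birch beer"))
```

A query that mixes a URI with other words returns `None` in this sketch, so it would fall through to the normal search.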

Now that insource: is available, it is at least possible to find the desired content. E.g. https://test2.wikipedia.org/w/index.php?title=Special%3ASearch&profile=default&search=insource%3Agrowstuff.org&fulltext=Search
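For anyone scripting this, the link above can be rebuilt with a few lines of Python. The `insource_search_url` helper is hypothetical; the parameter names are simply the ones visible in the example URL.

```python
from urllib.parse import urlencode


def insource_search_url(base, needle):
    """Build a Special:Search URL that looks for `needle` in page source."""
    params = {
        "title": "Special:Search",
        "profile": "default",
        "search": "insource:" + needle,
        "fulltext": "Search",
    }
    # urlencode percent-encodes ":" as %3A, matching the example link.
    return base + "?" + urlencode(params)


print(insource_search_url("https://test2.wikipedia.org/w/index.php", "growstuff.org"))
```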

Perhaps we could somehow add "insource:" as an option (or text-hint) at Advanced Search, in order to remind editors of that feature? (Because only crazy people like me are actually going to hunt their way to [[mw:Help:CirrusSearch#insource:]] ;)

demon removed a subscriber: demon. Aug 19 2015, 4:07 PM
Restricted Application added a project: Discovery. Aug 19 2015, 4:07 PM
Restricted Application added a subscriber: Aklapper.
Deskana renamed this task from "should include link URLs in search?" to "URLs in page source should be properly indexed so they're findable using search". Dec 30 2015, 9:29 PM
Deskana removed Manybubbles as the assignee of this task.
Deskana lowered the priority of this task from Medium to Lowest.
Deskana set Security to None.
Deskana moved this task from Needs triage to Search on the Discovery board.
Restricted Application added a project: Discovery-Search. Jun 16 2017, 11:45 AM

Yup! I saw the hackathon showcase demo, and am looking forward to it! Thanks for the followup on this task, though. :)

debt added a subscriber: debt.

We don't use external links as part of the scoring process right now. Since T143310 might 'fix' this, I'll remove it from our sprint board.

Another alternative is to use 'insource' for your query.