
Support searching for external links in CirrusSearch
Open, Medium, Public, 5 Estimated Story Points

Description

A friendly neighborhood IP (I mean that sincerely) left a suggestion on the CirrusSearch extension talk page regarding the ability to search for specific external links in queries.

Problem

As a reader, I want to find articles that contain a specific link (e.g. a news story, a hoax, or an untrustworthy site) so I can verify its validity.

As an editor, I want to find articles that contain a specific link together with certain keywords, so I can eliminate spam, vandalism, or hoaxes.

Background

Currently, CirrusSearch allows searching for internal links, but not for external links. This means one has to use a page such as Special:LinkSearch, or a complicated "insource" regex that may not always find the link, because links can be constructed by templates in hard-to-find ways, e.g. "{{{mainsite}}}.com/{{stringsub}}".

Proposed solution

A new search "keyword" or predicate that indexes external links, e.g.:

banana cures aids extlinksto:/*.hoaxysite.com/
-extlinksto:/*.hoaxysite.com/
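To make the proposed syntax concrete, here is a minimal sketch of one possible matching semantics for such a predicate, where a leading "*." means "any subdomain of". The function name and the exact pattern rules are illustrative assumptions, not a design decision from this task:

```python
from urllib.parse import urlparse

def matches_extlinksto(url: str, pattern: str) -> bool:
    """Hypothetical matcher for an extlinksto: pattern.

    A leading "*." is treated as "any subdomain of (or the domain
    itself)"; otherwise the host must match exactly. This is only
    one possible semantics.
    """
    host = urlparse(url).netloc.lower()
    pattern = pattern.strip("/").lower()
    if pattern.startswith("*."):
        suffix = pattern[2:]
        return host == suffix or host.endswith("." + suffix)
    return host == pattern

# External links found in a hypothetical page:
links = ["http://www.hoaxysite.com/banana-cures", "https://example.org/"]
print(any(matches_extlinksto(u, "/*.hoaxysite.com/") for u in links))  # True
```

In a search backend this check would of course happen at index/query time rather than per page, but the intended user-visible behavior is the same.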

Before I created this task I spoke to one of the Discovery engineers about this suggestion. Their thoughts:

"overall i don't think it's crazy hard and most of the work will just be figuring out what the right analysis chain is for it and perhaps creating a second field for external_link_domains that ignores the rest for searching"

Event Timeline

Restricted Application added a subscriber: Aklapper.

Overall, I think we could tokenize the existing external_links field, and then add a new field, external_link_domains, with a reversed sub-field allowing prefix search with the * at the beginning. I wonder about the use cases, and whether the domain search is perhaps the only particularly useful part.
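One way to read the "reversed field" idea above: storing each domain reversed turns a leading-wildcard ("ends with") query into a cheap prefix ("starts with") query. A minimal sketch of the trick, with illustrative data, not CirrusSearch's actual implementation:

```python
def index_value(domain: str) -> str:
    # Store the domain reversed so that an "ends with .wikipedia.org"
    # query becomes a "starts with" query, which indexes handle well.
    return domain.lower()[::-1]

# Hypothetical indexed values for a few pages' external-link domains:
indexed = [index_value(d) for d in
           ["aa.wikipedia.org", "wikipedia.org", "notwikipedia.org"]]

# A query for "*.wikipedia.org" reverses its suffix into a prefix:
prefix = ".wikipedia.org"[::-1]  # "gro.aidepikiw."
hits = [v[::-1] for v in indexed if v.startswith(prefix)]
print(hits)  # ['aa.wikipedia.org']
```

Note that the bare "wikipedia.org" entry is not matched here, and "notwikipedia.org" is correctly excluded; whether the bare domain should match is exactly the open question raised below.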

debt triaged this task as Medium priority. Apr 6 2017, 5:12 PM
debt moved this task from needs triage to This Quarter on the Discovery-Search board.
debt subscribed.

We could do this and then do a re-index; we just need to figure out the right way to analyze the field. It shouldn't be too hard to do.

We talked about this a little at the Wednesday meeting; a few thoughts:

  • It's unclear what the final shape of this should be: do we want to match only against domain names, or also against the path portion of URLs?
  • What about sites like archive.org: should extlinksto:somesite.org find archive.org mirrors of links to that site?
  • Should extlinksto:wikipedia.org match aa.wikipedia.org, or should the user be required to enter extlinksto:*.wikipedia.org?
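The archive.org question above is a path-matching question in disguise: the target domain appears inside the mirror URL's path, not its host. A small illustrative heuristic (the function and its exact rules are assumptions for discussion, not a proposed implementation):

```python
from urllib.parse import urlparse

def hosts_in(url: str) -> list[str]:
    """Collect the outer host plus the host of any URL embedded in the
    path, as in web.archive.org mirror links (illustrative heuristic)."""
    parsed = urlparse(url)
    hosts = [parsed.netloc.lower()]
    for scheme in ("http://", "https://"):
        i = parsed.path.find(scheme)
        if i != -1:
            hosts.append(urlparse(parsed.path[i:]).netloc.lower())
    return [h for h in hosts if h]

print(hosts_in("https://web.archive.org/web/2019/http://somesite.org/page"))
# ['web.archive.org', 'somesite.org']
```

Whether extlinksto:somesite.org should see the second host is precisely the policy question; the analysis chain has to decide it either way at index time.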

There are probably more questions, but the answers to these will inform what kind of text analysis we should apply and what kind of query should be used. Some ideas (not exhaustive):

  • Use pattern tokenizer to extract the domain from the url, match full domains
  • Tokenize on . and /, and use phrase search to do the matching. This doesn't allow anchoring, though, so foo.co would also match foo.co.uk.
  • Write a custom analysis component in our plugin. This would allow some form of shingling anchored at the TLD, and the use of Java URL parsing to handle various unexpected URL forms.
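The anchoring problem with the second idea can be shown in a few lines. This sketch mimics how an unanchored phrase match would behave over tokens split on . and / (the helper names are illustrative; a real implementation would live in the Elasticsearch analysis chain):

```python
import re

def tokens(domain: str) -> list[str]:
    # Split on "." and "/" the way the proposed tokenizer would.
    return [t for t in re.split(r"[./]", domain.lower()) if t]

def phrase_match(haystack: str, needle: str) -> bool:
    """Unanchored phrase match over tokens, roughly how a phrase
    query would behave on this tokenization."""
    h, n = tokens(haystack), tokens(needle)
    return any(h[i:i + len(n)] == n for i in range(len(h) - len(n) + 1))

print(phrase_match("foo.co.uk", "foo.co"))  # True: the false positive
print(phrase_match("foo.com", "foo.co"))    # False
```

Because the phrase ["foo", "co"] can start anywhere in the token stream, foo.co.uk is a hit even though the user almost certainly wanted the foo.co domain only; that is what anchoring at the TLD would prevent.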