
Support searching for external links in CirrusSearch
Open, Medium, Public, 5 Estimated Story Points

Description

A friendly neighborhood IP (I mean that sincerely) left a suggestion on the CirrusSearch extension talk page regarding the ability to search for specific external links in queries.

Problem

As a reader, I want to find articles that contain a specific link (e.g. a news story, a hoax, or an untrustworthy site) so I can verify its validity.

As an editor, I want to find articles that contain a specific link together with certain keywords, so I can eliminate spam, vandalism, or hoaxes.

Background

Currently, CirrusSearch allows searching for internal links, but not for external links. This means one has to use a page such as Special:LinkSearch, or a complicated "insource" regex that may not always find the link, because links can be constructed by templates in hard-to-find ways, e.g. "{{{mainsite}}}.com/{{stringsub}}".

Proposed solution

A new search "keyword" or predicate that indexes external links, e.g.:

banana cures aids extlinksto:/*.hoaxysite.com/
-extlinksto:/*.hoaxysite.com/
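To make the proposed syntax concrete, here is a minimal sketch of one possible matching semantics for such a predicate, where a leading "*." means "any subdomain of". The function name and the exact pattern rules are illustrative assumptions, not a design decision from this task:

```python
from urllib.parse import urlparse

def matches_extlinksto(url: str, pattern: str) -> bool:
    """Hypothetical matcher for an extlinksto: pattern.

    A leading "*." is treated as "any subdomain of (or the domain
    itself)"; otherwise the host must match exactly. This is only
    one possible semantics.
    """
    host = urlparse(url).netloc.lower()
    pattern = pattern.strip("/").lower()
    if pattern.startswith("*."):
        suffix = pattern[2:]
        return host == suffix or host.endswith("." + suffix)
    return host == pattern

# External links found in a hypothetical page:
links = ["http://www.hoaxysite.com/banana-cures", "https://example.org/"]
print(any(matches_extlinksto(u, "/*.hoaxysite.com/") for u in links))  # True
```

In a search backend this check would of course happen at index/query time rather than per page, but the intended user-visible behavior is the same.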

Before I created this task I spoke to one of the Discovery engineers about this suggestion. Their thoughts:

"overall i don't think it's crazy hard and most of the work will just be figuring out what the right analysis chain is for it and perhaps creating a second field for external_link_domains that ignores the rest for searching"

Event Timeline

Restricted Application added a subscriber: Aklapper.

Overall, I think we could tokenize the existing external_links field, and then add a new field, external_link_domains, with a reversed sub-field allowing prefix search with the * at the beginning. I wonder about the use cases, and whether the domain search is perhaps the only particularly useful part.
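One way to read the "reversed field" idea above: storing each domain reversed turns a leading-wildcard ("ends with") query into a cheap prefix ("starts with") query. A minimal sketch of the trick, with illustrative data, not CirrusSearch's actual implementation:

```python
def index_value(domain: str) -> str:
    # Store the domain reversed so that an "ends with .wikipedia.org"
    # query becomes a "starts with" query, which indexes handle well.
    return domain.lower()[::-1]

# Hypothetical indexed values for a few pages' external-link domains:
indexed = [index_value(d) for d in
           ["aa.wikipedia.org", "wikipedia.org", "notwikipedia.org"]]

# A query for "*.wikipedia.org" reverses its suffix into a prefix:
prefix = ".wikipedia.org"[::-1]  # "gro.aidepikiw."
hits = [v[::-1] for v in indexed if v.startswith(prefix)]
print(hits)  # ['aa.wikipedia.org']
```

Note that the bare "wikipedia.org" entry is not matched here, and "notwikipedia.org" is correctly excluded; whether the bare domain should match is exactly the open question raised below.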

debt triaged this task as Medium priority. Apr 6 2017, 5:12 PM
debt moved this task from needs triage to This Quarter on the Discovery-Search board.
debt subscribed.

We could do this and then do a re-index; we just need to figure out the right way to analyze the field. It shouldn't be too hard to do.

We talked about this a little at the Wednesday meeting; a few thoughts:

  • It's unclear what the final shape of this should be: do we want to match only against domain names, or also against the path portion of URLs?
  • What about sites like archive.org: should extlinksto:somesite.org find archive.org mirrors of links to that site?
  • Should extlinksto:wikipedia.org match aa.wikipedia.org, or should the user be required to enter extlinksto:*.wikipedia.org?
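The archive.org question above is a path-matching question in disguise: the target domain appears inside the mirror URL's path, not its host. A small illustrative heuristic (the function and its exact rules are assumptions for discussion, not a proposed implementation):

```python
from urllib.parse import urlparse

def hosts_in(url: str) -> list[str]:
    """Collect the outer host plus the host of any URL embedded in the
    path, as in web.archive.org mirror links (illustrative heuristic)."""
    parsed = urlparse(url)
    hosts = [parsed.netloc.lower()]
    for scheme in ("http://", "https://"):
        i = parsed.path.find(scheme)
        if i != -1:
            hosts.append(urlparse(parsed.path[i:]).netloc.lower())
    return [h for h in hosts if h]

print(hosts_in("https://web.archive.org/web/2019/http://somesite.org/page"))
# ['web.archive.org', 'somesite.org']
```

Whether extlinksto:somesite.org should see the second host is precisely the policy question; the analysis chain has to decide it either way at index time.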

There are probably more questions, but the answers to these will inform what kind of text analysis we should apply and what kind of query should be used. Some ideas (not exhaustive):

  • Use pattern tokenizer to extract the domain from the url, match full domains
  • Tokenize on . and /, and use phrase search to do the matching. This doesn't allow anchoring, though, so foo.co would also match foo.co.uk.
  • Write a custom analysis component in our plugin. This would allow some form of shingling anchored at the TLD, and the use of Java URL parsing to handle various unexpected URL forms.
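The anchoring problem with the second idea can be shown in a few lines. This sketch mimics how an unanchored phrase match would behave over tokens split on . and / (the helper names are illustrative; a real implementation would live in the Elasticsearch analysis chain):

```python
import re

def tokens(domain: str) -> list[str]:
    # Split on "." and "/" the way the proposed tokenizer would.
    return [t for t in re.split(r"[./]", domain.lower()) if t]

def phrase_match(haystack: str, needle: str) -> bool:
    """Unanchored phrase match over tokens, roughly how a phrase
    query would behave on this tokenization."""
    h, n = tokens(haystack), tokens(needle)
    return any(h[i:i + len(n)] == n for i in range(len(h) - len(n) + 1))

print(phrase_match("foo.co.uk", "foo.co"))  # True: the false positive
print(phrase_match("foo.com", "foo.co"))    # False
```

Because the phrase ["foo", "co"] can start anywhere in the token stream, foo.co.uk is a hit even though the user almost certainly wanted the foo.co domain only; that is what anchoring at the TLD would prevent.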