Make subphrase matching the default search option on all Wikisources
Open, Normal, Public

Description

In a discussion with Nik at WM2014, we talked about a CirrusSearch refinement that would be useful for English Wikisource (at least initially, as a pilot of the functionality among the Wikisources).

The discussion indicated to have a community discussion to have enabled an existing component that enabled indexing for typeahead functionality for subpages, so subpages of works, especially biographical dictionaries and encyclopaedias, would be visible in the typeahead search function.

Discussion raised at https://en.wikisource.org/wiki/Special:PermanentLink/5004782#Propose_search_change_in_typeahead_lookup ; there was no dissent to the proposal.


Version: master
Severity: enhancement
See Also:
T74285: Restring search to subpages of a page

Details

Reference
bz69658

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 3:37 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz69658.
bzimport added a subscriber: Unknown Object (MLST).

I don't know the technical details here, hence leaving this to Nik for commenting

I talked with billinghurst about this at WMF - it's not on the top of my list but it's pretty easy to implement. In fact we already have the feature but it isn't "cluster ready" so it really just needs some cleanup.

Move to CirrusSearch so we can fix up Cirrus to make this work properly.

I have closed the added bug as it was a search with the "prefix" component, and that is existing functionality. I have left instructions on an available template.

@Manybubbles Can I ask where this is on your list now that we have all the wikis wrapped up with their migration? Anything that I can do to assist to get some progress on this baby? Thanks.

Regards, Billinghurst

Deskana removed a subscriber: Deskana. Mar 10 2015, 4:17 PM

With the migration to cirrus well complete, are we able to get some traction on this request?

Restricted Application added a project: Discovery. Jun 28 2015, 12:13 PM
Billinghurst raised the priority of this task from Low to Normal. Jun 28 2015, 12:14 PM
Billinghurst set Security to None.
Restricted Application added a subscriber: Matanya. Jun 28 2015, 12:14 PM
demon removed a subscriber: demon. Aug 19 2015, 4:06 PM
Restricted Application added a subscriber: Aklapper. Aug 19 2015, 4:06 PM
Magnus added a subscriber: Magnus. Jan 30 2017, 1:05 PM

Without having looked at the code, I think a new config option for the search module would be the most efficient option. A per-namespace regexp rewrite comes to mind. So that, for Wikisource, a title could be (additionally) added to the search index after a transformation like

s/^(.+?)\/(.+?)$/$2 ($1)/

Since the only user input involved is the page title, which is limited to ~250 characters, there should be little security impact from this.
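As a rough illustration of what that rewrite would produce, here is a minimal Python sketch. The function name `subpage_alias` and the example titles are hypothetical; this only shows the effect of the regular expression and says nothing about how CirrusSearch actually builds its index.

```python
import re

# Python equivalent of the Perl-style rewrite s/^(.+?)\/(.+?)$/$2 ($1)/
_SUBPAGE = re.compile(r"^(.+?)/(.+?)$")


def subpage_alias(title: str) -> str:
    """Return 'Subpage (Basepage)' for 'Basepage/Subpage'.

    Titles without a slash are returned unchanged.
    """
    return _SUBPAGE.sub(r"\2 (\1)", title)


# Hypothetical titles, for illustration only.
print(subpage_alias("Some Encyclopaedia/Some Article"))
print(subpage_alias("Some Work/Volume 1/Chapter 2"))
print(subpage_alias("Standalone Page"))
```

Because the first group is lazy, a title with several slashes is split at the first one, so deeper subpage levels stay together in the leading part of the alias.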

Restricted Application added a project: Discovery-Search. Jan 30 2017, 1:05 PM
Nemo_bis updated the task description. Jan 30 2017, 4:32 PM
Nemo_bis added a subscriber: Nemo_bis.

The discussion indicated to have a community discussion to have enabled an existing component that enabled indexing for typeahead functionality for subpages, so subpages of works, especially biographical dictionaries and encyclopaedias, would be visible in the typeahead search function.

I'm unable to parse this sentence to extract any meaning from it, btw.

The comment mixes together technical solutions with use cases, so it's hard to parse. I believe what the user is requesting is "As a Wikisource reader, I want my typeahead queries to also search subpages so that I can find relevant results in subpages".

In the past few months the Search Team added this as a search option in Special:Preferences on Wikisource, namely "Subphrase matching (recommended for longer article titles)"; see screenshot below for an example of how this works both without and with the preference enabled.

I believe it is currently not possible for us to enable this by default for performance reasons. In the meantime, people with accounts can turn the preference on and benefit from it.

Without preference: [screenshot not included]

With preference: [screenshot not included]

I chatted with @EBernhardson about this. Enabling subphrase matching on larger wikis is out of the question for performance reasons. However, the volume of views on Wikisource may be small enough for us to enable this without serious performance issues. Is that something that people would like? If so, it can be considered.

Deskana renamed this task from "Refining typeahead search results for subpages at English Wikisource" to "Make subphrase matching the default search option on all Wikisources". Mar 30 2017, 5:24 PM
Deskana added a subscriber: TJones. Mar 30 2017, 5:30 PM

I've changed the title of this to reflect what we'd want to do with it, as I stated above in T71658#2995052.

It seems logical that this would be an improvement for search on Wikisource, but I am very hesitant to make such a large sweeping change based only on intuition. I need some kind of data demonstrating that it is unlikely that this could be harmful. This could be quantitative data from an A/B test, or qualitative data like lots of people from Wikisource trying the preference and saying it's helpful or us evaluating a set of queries with both preferences and seeing which preference gives better results. Perhaps @TJones might have some thoughts on the last bit there, or on, well, any of this.

I welcome input. :-)

TJones added a subscriber: Deskana. Mar 30 2017, 8:21 PM

... or us evaluating a set of queries with both preferences and seeing which preference gives better results. Perhaps @TJones might have some thoughts on the last bit there ...

We could test the impact of the change on typical searches if we had an API flag or other mechanism that enabled the sub-page searching (possibly only supporting it on RelForge if we wanted to prevent anyone from thinking it was officially supported) and we ran a random batch of users' completion suggestion queries through the completion suggester with and without sub-page matching.

There are a couple of potential concerns:

  • For completion suggester queries, "sessions" are more important than for fulltext search, since the "search while you type" aspect of it sends several versions of the same query. For example, if no one stops typing before getting to 5 letters, do we care how 3 letter queries perform? Or, maybe all shorter versions of queries kinda suck, so people keep typing until it looks good, and it's only the very last query or two in a session that matter. I don't think we have a good model of user behavior (actual or desired), and I don't think we have a good template for sampling to reflect a model other than "queries are random and generally unrelated."
  • RelForge doesn't support completion suggester queries as part of its automated comparison (unless someone added some features when I wasn't looking, which is quite possible!). We could add such support (and work out how to do sampling better) if we wanted to do a lot of testing.
    • Alternatively, we could do a manual comparison—grab 50-100 wikisource queries, run them using the API with and without sub-page matching, and do a semi-manual diff. It wouldn't be definitive, but it would give a sense of the order of magnitude of the effect on average queries (as opposed to only looking at example queries where we know it is useful). Doing the manual review is tedious but possibly quicker than updating RelForge for a one-off comparison. It's not too horrible if you only want to do it once. If you want to do it a few times or more, upgrading RelForge is the way to go—faster and less painful.
  • As always, when using other people's queries, it can be hard to determine intent, and therefore hard to judge quality of changes. ZRR and top-3 re-ordering give a sense of impact, but not whether the impact is good or bad.

And of course, we have A/B testing (which may also require engineering support before it's possible), and impressionistic comparisons from random volunteers. All are reasonably valid and have different strengths and weaknesses.
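The semi-manual diff described above could be summarised along these lines. This is only a sketch: it assumes the suggestions for each query have already been collected into two plain dictionaries (how they are fetched is left out on purpose), and the function name, the sample queries, and the sample titles are all made up for illustration.

```python
def compare_suggestions(baseline, variant, top_n=3):
    """Summarise how two sets of completion suggestions differ.

    baseline, variant: dict mapping query -> ordered list of suggested titles
    (e.g. without and with sub-page matching). Only queries present in both
    are compared.
    """
    queries = sorted(set(baseline) & set(variant))
    if not queries:
        raise ValueError("no queries in common")
    n = len(queries)
    # Zero results rate (ZRR): share of queries that return no suggestions.
    zrr_baseline = sum(1 for q in queries if not baseline[q]) / n
    zrr_variant = sum(1 for q in queries if not variant[q]) / n
    # Queries whose top-N suggestions differ in content or order.
    changed = [q for q in queries if baseline[q][:top_n] != variant[q][:top_n]]
    return {
        "queries": n,
        "zrr_baseline": zrr_baseline,
        "zrr_variant": zrr_variant,
        "top_n_changed": len(changed) / n,
        "changed_queries": changed,
    }


# Made-up data, not real Wikisource results.
baseline = {"q1": [], "q2": ["A", "B", "C"], "q3": ["A", "B", "C", "D"], "q4": ["X"]}
variant = {"q1": ["S/q1"], "q2": ["B", "A", "C"], "q3": ["A", "B", "C", "E"], "q4": ["X"]}
print(compare_suggestions(baseline, variant))
```

As noted in the list, these numbers only show how much changed; the `changed_queries` list is what a person would still need to review by hand to judge whether the change is good or bad.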

or on, well, any of this.

It might be too late in the discussion, but since you asked.... I agree with Erasmo Barresi in the Scriptorium discussion. Search that gets the right result would meet my expectations, too, as I tend to hit enter before I read the completion suggester suggestions.

However, that makes me think of another way to encourage testing by real users. If there is a solid consensus that enabling sub-page searching in the completion suggester is a good thing, then announce as widely and loudly as possible that it's coming in X weeks, and encourage people to try it out via their preferences. You'll get plenty of feedback that way, and if no one comments, be bold and go for it. Logged in users could still turn it off through prefs, right? That might also mitigate some concerns.

Okay, I'm out of ideas for now.

debt added a subscriber: debt. Apr 27 2017, 5:18 PM

Can someone from Wikisource let us know how to evaluate this, please?

Can someone from Wikisource let us know how to evaluate this, please?

Adding a temporary preference for users to test and give quick feedback, as TJones suggests, is definitely possible and sensible. I could help with the communication on various wikis.

Manual comparison would be nice but I'm not sure how I could help with that.

I don't think we have a good model of user behavior (actual or desired), and I don't think we have a good template for sampling to reflect a model other than "queries are random and generally unrelated."

As far as I can understand the goal here is to make it easier to find subpages by autocompletion of their subpage name, without a need for the user to resort to full-text search nor to type long basepage names (or use autocomplete multiple times until the last, desired subpage level).

So, would it be possible to "just" count, for each click on a search suggestion or search result, how many keys were pressed before it, and then see how the median/average moves? If we manage to reduce the "cost" of finding a subpage, then some movement should be visible.
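A minimal sketch of that measurement, assuming click events were available as (group, keystrokes before the click) pairs. No such log format is described in this task, so the event shape and the group names "default" and "subphrase" are purely hypothetical.

```python
from statistics import median


def median_keystrokes(click_events):
    """Median number of keys pressed before a click, per group.

    click_events: iterable of (group, keystrokes_before_click) pairs, where
    group identifies which completion behaviour the user had.
    """
    by_group = {}
    for group, keystrokes in click_events:
        by_group.setdefault(group, []).append(keystrokes)
    return {group: median(values) for group, values in by_group.items()}


# Made-up events, for illustration only.
events = [
    ("default", 12), ("default", 9), ("default", 15),
    ("subphrase", 6), ("subphrase", 8), ("subphrase", 7),
]
print(median_keystrokes(events))
```

If subphrase matching really lowers the cost of reaching a subpage, the median for that group should come out lower than for the default group.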

I have been using it and it is very useful, though very long work titles make it hard to see the matching component.

If you would like to see it in action, search for "william ewart" at https://en.wikisource.org and it gives some good hits for Gladstone.

As a general comment, on the preferences page we say "Default (recommended)" for your search type. Can we remove the "(recommended)" bit? That it is the default is sufficient to indicate that, and as we are discussing, it may not be recommended for all.

I had looked to how enWS could put in some helpful material to guide our users about search tweaking for our site, and within the preferences there is not really reasonable scope. I would also like to be able to retain the general wikimedia defaults and have our information be supplementary. We have text on "search-summary" that points to the preference, though once there no ability to make commentary. I hesitated adding that detail to "mw:Extension:CirrusSearch/CompletionSuggester" though maybe that is preferred.

@debt Are we able to get a count of the wikis and their use of the setting?

pinging various wikis
@VIGNERON , @Micru , @Aubrey, @Bodhisattwa @Ankry

Sharing the conversation a little wider through the WSes
@Aleator @Aschroet @DixonD @Amire80 @Butko @Tarawneh @Shizhao
in case this is relevant to your WS.

EBernhardson added a comment. Edited May 11 2017, 5:58 PM

It turns out very few people have tried out this feature. Taking the top 5 Wikisources by page views since Jan 1, here is the number of people who have either fuzzy or normal subpage search enabled right now:

lang   # users
en     15
fr     1
ru     1
ar     0
zh     0

debt added a comment. May 11 2017, 6:03 PM

@Billinghurst - we did a quick check of the database—we don't have events for properties turning on or off, but if a user has never changed their preference, they don't have the property. If they have ever changed it, then they do have the property.

Based on our check - a grand total of 19 people have at one time or another tried subphrases on enwikisource. Of those 19 people, 4 have changed back (if we understand the data correctly). Here's the breakdown:

2 classic
9 fuzzy-subphrases
6 normal-subphrases
2 strict
and everyone else gets the default

So, it would be making a big guess to say people who have their preference set to classic or strict have tried the fuzzy and changed back.

Hope this helps!

Zdzislaw added a comment. Edited May 13 2017, 11:35 PM

@debt @EBernhardson Could you give similar information regarding plwikisource?

Z.

For plwikisource: 12 users have enabled normal-subphrase, and 1 has enabled fuzzy-subphrase.

We'd need to do an auto-complete A/B test to see if this does actually work well. Let's take a look at this during this quarter.

@TJones - do you think this would be something that you can get started or at least an outline of how to do the test? :)

TJones added a comment. Jul 6 2017, 8:09 PM

@debt, I wasn't sure why Dan asked for my input before, since I wasn't working on this at all, so I was just spit-balling.

I don't have any insight into the technical details of running an A/B test or RelForge test on the completion suggester—though I don't think RelForge is set up to replay completion suggester queries in a useful way.

Analyzing completion suggester results for an A/B test or RelForge test is weird, too, since the multiple sub-queries that come up as you type are related. For A/B tests, do we sample sessions vs queries properly? Most completion suggester queries get no clicks because they are followed by another query as people keep typing, so only the last one in a sequence is relevant, I think. For RelForge, it's not clear how to measure things if we did run the intermediate queries. One system might give more results sooner than the other, so do we try to measure when the eventually-clicked suggestion came up, or how high it is ranked, etc? This is part of what I meant about not having a model of user behavior for completion suggester use: is it the rank in the list that matters? Or how many keystrokes it takes to get it on the list at all? Etc. etc.
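To make the two candidate measures concrete, here is a small sketch that, for one typing session, reports when the eventually-clicked suggestion first showed up and at what rank. The session structure is an assumption (a list of prefixes with the suggestions shown for each), keystrokes are approximated by prefix length (which ignores backspaces and pasted text), and the function name and data are invented.

```python
def first_appearance(session, clicked):
    """Find when the clicked title first appeared in a typing session.

    session: list of (prefix, suggestions) pairs in the order they were typed.
    clicked: the title the user eventually clicked.
    Returns (keystrokes, rank) for the first prefix whose suggestions include
    the clicked title (rank is 1-based), or None if it never appeared.
    """
    for prefix, suggestions in session:
        if clicked in suggestions:
            return len(prefix), suggestions.index(clicked) + 1
    return None


# Made-up session, for illustration only.
session = [
    ("gl", ["Glossary"]),
    ("gla", ["Glasgow", "Gladstone"]),
    ("glad", ["Gladstone", "Glasgow"]),
]
print(first_appearance(session, "Gladstone"))
```

Which of the two numbers matters more (fewer keystrokes, or a better rank once the title is on the list) is exactly the open modelling question raised above; the sketch just makes both available per session.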

So, I got lots of questions, but not a lot of answers.

It looks like enwikisource has added a suggestion to Special:Search to enable subphrase completion matching. This has resulted in ~421 users that have turned it on, and we don't see any that have turned it back off for the default. This suggests we could possibly move forward with making subphrase matching the default on wikisource.

The consensus for that change has existed for a number of years. It is an excellent ability that works well. As long as the ability exists for users to change it back, then having it as the default would be desired.

This has resulted in ~421 users that have turned it on, and we don't see any that have turned it back off for the default.

Thank you for looking into it. I agree this is a useful and encouraging statistic and that we can proceed with changing the default given the existing consensus.