Add new columns for Glent Method 1
Closed, Resolved (Public)

Description

  • We need a new column parallel to q1q2LevenDist for M1, holding the token-aware edit distance value. "q1q2TokenAwareDist" might work as a name, though it's a bit long. "q1q2TokAwareDist"? "q1q2TAEDist"?
  • We need a new column parallel to queryNorm for M1, holding the deduped version of the normalized query (i.e., with repeated characters removed). "queryNormDedupe" sounds good.
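
For illustration, here is a minimal Python sketch of what the deduped value could look like. It assumes "repeated characters removed" means collapsing each run of identical adjacent characters to a single character. The function name `dedupe_chars` is hypothetical and is not taken from glent.

```python
from itertools import groupby


def dedupe_chars(query_norm: str) -> str:
    """Collapse each run of identical adjacent characters to one character.

    Assumption: only adjacent repeats are collapsed, so "abab" is unchanged.
    """
    # groupby yields one (key, group) pair per run of equal adjacent characters.
    return "".join(ch for ch, _ in groupby(query_norm))


print(dedupe_chars("helllo woorld"))  # helo world
```

Under this assumption the deduped form is a pure function of the normalized query, so it can be recomputed from queryNorm whenever it is needed.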

Event Timeline

TJones updated the task description.

I'm sure these would be needed/created in the SimilarQueriesSuggester when generating suggestions. Do we also need to write them to the suggestions table for use by SuggestionAggregator? Essentially, are these needed when merging suggestions from multiple algorithms and deciding on the best single suggestion?

Per discussion at last week's Wednesday meeting:

  • Rename q1q2LevenDist to q1q2EditDist and change it to a float. This will need to be applied to the data already stored in Hive, along with adjusting the appropriate bits of glent.
  • The deduped version of the normalized query won't need to be saved; it can be constructed on demand from the data already stored.

Change 583491 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[search/glent@master] similar queries: Apply all-pairs matching against character deduplicated queries

https://gerrit.wikimedia.org/r/583491

Change 583491 merged by Tjones:
[search/glent@master] similar queries: Apply all-pairs matching against character deduplicated queries

https://gerrit.wikimedia.org/r/583491

We will need to run the included Hive migration scripts just before deploying the new glent version that includes this patch.