[S] Test commons search with synonyms turned on
Closed, ResolvedPublic

Description

NOTE: this should only be done once wmf/1.38.0-wmf.18 is on production so that the synonyms optimisation patch is available

To make sure that searching with synonyms doesn't negatively affect search:

  • run AnalyzeResults.php locally, using the commons search api endpoint with mediasearch_synonyms set, to prepopulate the caches
  • run AnalyzeResults.php twice more, once with mediasearch_synonyms set and once without, and paste the results in a comment. Also gather response times (stored in resultset.searchExecutionTime_ms) with the synonyms profile on and off, and paste them here (e.g. median response time, average response time of the slowest 10% of calls, average response time of the fastest 10% of calls)
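The response-time summary requested above could be computed along these lines. This is only a sketch: the function name and the sample values are hypothetical, and it assumes the searchExecutionTime_ms values have already been collected into a plain list.

```python
import statistics


def summarize_response_times(times_ms):
    """Summarize a list of response times given in milliseconds."""
    if not times_ms:
        raise ValueError("need at least one response time")
    ordered = sorted(times_ms)
    # Size of each 10% tail; always use at least one sample.
    tail = max(1, len(ordered) // 10)
    return {
        "median": statistics.median(ordered),
        "average": statistics.fmean(ordered),
        "fastest_10pct_avg": statistics.fmean(ordered[:tail]),
        "slowest_10pct_avg": statistics.fmean(ordered[-tail:]),
    }


# Hypothetical values standing in for resultset.searchExecutionTime_ms.
sample = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
summary = summarize_response_times(sample)
```

Running it once per profile (synonyms on and off) gives two summaries that can be pasted side by side.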

If neither precision@25 nor response times are significantly worse, then we can make search-with-synonyms default on commons. If that's the case, create a new ticket to turn on the config. If not, discuss next steps in the comments.

Event Timeline

CBogen renamed this task from Test commons search with synonyms turned on to [S] Test commons search with synonyms turned on. Jan 26 2022, 5:58 PM
CBogen updated the task description.
Scores

Light improvement, but to be taken with a grain of salt.

current

F1 Score | 0.65874139069724
Precision@1 | 0.94964028776978
Precision@3 | 0.92439024390244
Precision@10 | 0.88924661944623
Precision@25 | 0.85384615384615
Precision@50 | 0.82200772200772
Precision@100 | 0.77702285361334
Recall | 0.60978165938865
Average precision | 0.51507706084643

synonyms

F1 Score | 0.66441314553991
Precision@1 | 0.95316159250585
Precision@3 | 0.93892215568862
Precision@10 | 0.90568475452196
Precision@25 | 0.87420460107685
Precision@50 | 0.8398285268901
Precision@100 | 0.79800498753117
Recall | 0.60344533515265
Average precision | 0.51973424941461
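For readers unfamiliar with these metrics, here is a minimal sketch of the textbook definitions of precision@k, recall and F1 for a single query. It is not taken from AnalyzeResults.php, which may aggregate across queries or treat unlabeled results differently; all names below are hypothetical.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k returned results that are labeled relevant."""
    top = ranked_ids[:k]
    if not top:
        return 0.0
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    # Divide by the number of results actually returned (at most k).
    return hits / len(top)


def recall(ranked_ids, relevant_ids):
    """Fraction of all relevant documents that were returned."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids) & set(relevant_ids)) / len(relevant_ids)


def f1_score(precision, recall_value):
    """Harmonic mean of precision and recall."""
    if precision + recall_value == 0:
        return 0.0
    return 2 * precision * recall_value / (precision + recall_value)


# Hypothetical query: four results returned, three documents labeled relevant.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
p3 = precision_at_k(ranked, relevant, 3)
r = recall(ranked, relevant)
f1 = f1_score(precision_at_k(ranked, relevant, len(ranked)), r)
```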

Response times

They're mostly ok, but I'm a little disappointed in the worst 5% response times with synonyms.
Curiously, they're worse than when I ran this locally.
I have another quick idea, though - will give that a quick shot tomorrow.

current

average: 371ms
median: 319ms
10%: 266ms
90%: 421ms
95%: 498ms
99%: 1491ms

synonyms

average: 522ms
median: 354ms
10%: 289ms
90%: 630ms
95%: 984ms
99%: 4231ms
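To put the two tables above side by side, the relative slowdown per statistic can be computed as below. The numbers are copied from the tables; the variable names are only illustrative.

```python
# Response times in milliseconds, copied from the two tables above.
current = {"average": 371, "median": 319, "10%": 266,
           "90%": 421, "95%": 498, "99%": 1491}
synonyms = {"average": 522, "median": 354, "10%": 289,
            "90%": 630, "95%": 984, "99%": 4231}

# Ratio > 1 means the synonyms profile was slower for that statistic.
slowdown = {key: synonyms[key] / current[key] for key in current}

for key, ratio in slowdown.items():
    print(f"{key}: {ratio:.2f}x")
```

The median and 10% rows stay close to 1x, while the 95% and 99% rows show where the synonyms profile falls behind.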

Change 757906 had a related patch set uploaded (by Matthias Mullie; author: Matthias Mullie):

[mediawiki/extensions/WikibaseMediaInfo@master] Further narrow down which synonyms to use, and cap them

https://gerrit.wikimedia.org/r/757906

I submitted another patch to optimize synonyms.
Here's a rough overview of how synonyms are (to be) used:

Profile currently in use:

  • we only use wikidata entities to match via P180/P6243
  • no synonyms at all

Synonyms profile currently in master:

  • all of the above default existing mediasearch stuff
  • gather English label & aliases of the entities best matching the search term (where we have a score > 0.5, based on their similarity to the search term & their position)
  • remove duplicate terms after normalizing (lowercasing, removing redundant whitespace & punctuation)
  • remove terms that are a superset of another term (e.g. "wikimedia commons" will be stripped if "commons" is already being searched for) because they would yield no additional results that would not already be found
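The deduplication and superset steps above can be sketched as follows. This is an illustration, not the WikibaseMediaInfo code: the function names are made up, and it assumes "superset" means "contains every word of another term", which may differ from what the extension actually does.

```python
import re


def normalize(term):
    """Lowercase, turn punctuation into spaces, collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", " ", term.lower())
    return " ".join(no_punct.split())


def dedupe_and_strip_supersets(terms):
    """Drop duplicate terms, then terms that contain all words of another term."""
    unique = []
    for term in terms:
        norm = normalize(term)
        if norm and norm not in unique:
            unique.append(norm)

    kept = []
    for term in unique:
        words = set(term.split())
        # Strict subset check: terms with identical word sets do not
        # eliminate each other.
        is_superset = any(set(other.split()) < words for other in unique)
        if not is_superset:
            kept.append(term)
    return kept


terms = dedupe_and_strip_supersets(
    ["Commons", "commons ", "Wikimedia Commons", "Cat"]
)
```

Here "wikimedia commons" is dropped because "commons" is already being searched for, matching the example in the list above.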

Synonyms profile in patch in CR:

  • all of the above existing mediasearch + synonyms stuff
  • remove synonyms that are too short (<= 1 byte; i.e. any single Latin character or single digit): they're too generic (expensive and of little value)
  • remove synonyms whose normalized form deviates too much from the original term once whitespace/punctuation is stripped (e.g. "C#" !== "c"): they're not relevant/specific/accurate enough
  • remove synonyms that are too similar to other terms already being searched (e.g. "cat" & "cats", or "house cat" and "housecat"): they often add no extra value (the term would already be matched via the other version once stemmed), or very little compared to searching another completely different variant
  • sort synonyms by score and "differentness", and remove all but the top 5: if one or multiple of the terms match a massive amount of documents, it'd leave a massive amount of documents to be scored; at least there's now an upper limit, and we're still getting the most valuable synonyms
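A rough sketch of the extra filtering and capping steps above. It is not the code from the patch: the two thresholds, the use of difflib as the similarity measure, and the single greedy pass (instead of a separate sort by score and "differentness") are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

MAX_SYNONYMS = 5             # cap mentioned in the list above
MIN_NORMALIZED_RATIO = 0.8   # hypothetical threshold
MAX_SIMILARITY = 0.85        # hypothetical threshold


def normalize(term):
    """Lowercase, turn punctuation into spaces, collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", " ", term.lower())
    return " ".join(no_punct.split())


def similarity(a, b):
    """Similarity between 0 and 1 (stand-in for the real measure)."""
    return SequenceMatcher(None, a, b).ratio()


def select_synonyms(search_term, candidates):
    """candidates: list of (term, score); returns at most MAX_SYNONYMS terms."""
    searched = [normalize(search_term)]
    selected = []
    # Highest-scoring candidates get the first chance to be kept.
    for term, _score in sorted(candidates, key=lambda c: c[1], reverse=True):
        norm = normalize(term)
        if len(norm.encode("utf-8")) <= 1:
            continue  # too short, too generic
        if similarity(term.lower(), norm) < MIN_NORMALIZED_RATIO:
            continue  # normalization changed the term too much
        if any(similarity(norm, other) >= MAX_SIMILARITY
               for other in searched + selected):
            continue  # too close to a term already being searched
        selected.append(norm)
        if len(selected) == MAX_SYNONYMS:
            break  # hard upper limit on the number of synonyms
    return selected


picked = select_synonyms("cat", [
    ("cats", 0.9), ("house cat", 0.8), ("housecat", 0.7),
    ("C#", 0.95), ("x", 0.6), ("feline", 0.75),
])
```

In this example "C#" and "x" are dropped as too short after normalization, "cats" is too close to "cat", and "housecat" is too close to "house cat", which leaves "house cat" and "feline".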

I ran all 3 profiles through our set of 3000 labeled terms and here's what I've got in terms of result "accuracy" & response times.
Note that the results are probably underestimated for synonyms (because our labeled data was retrieved via the original profile: we're unable to tell which additional good matches synonyms bring in). But even in this case, we're seeing a minor improvement in precision.
Note that the response times are probably an overestimation for all 3 - there's some extra latency because I'm running these through an SSH tunnel.

I suggest we merge the new patch: it seems to have no measurable negative impact on the results of the synonyms vs the current version, but it does further improve response times (and puts a hard cap on the extent to which they can increase).

Current (no synonyms)

average: 898ms
median: 820ms
10%: 734ms
90%: 965ms
95%: 1120ms
99%: 3339ms

This last one is odd: the other 2 profiles were faster, even though they can't possibly *be* faster. It was probably caused by external factors, which may also explain why my results with the current synonyms profile last week were so unexpectedly high.

F1 Score | 0.65478244201648
Precision@1 | 0.93012048192771
Precision@3 | 0.9047619047619
Precision@10 | 0.88442565186751
Precision@25 | 0.85282051282051
Precision@50 | 0.82153209109731
Precision@100 | 0.78375733855186
Recall | 0.59668122270742
Average precision | 0.49995289721047

Synonyms currently in master:

average: 1017ms
median: 883ms
10%: 755ms
90%: 1246ms
95%: 1635ms
99%: 3190ms

F1 Score | 0.66101533479379
Precision@1 | 0.93821510297483
Precision@3 | 0.91122071516646
Precision@10 | 0.89148351648352
Precision@25 | 0.86802030456853
Precision@50 | 0.83518747424804
Precision@100 | 0.79908972691808
Recall | 0.59407635678822
Average precision | 0.50591556848914

Synonyms after latest patch in CR:

average: 949ms
median: 865ms
10%: 742ms
90%: 1137ms
95%: 1406ms
99%: 2544ms

F1 Score | 0.66117111995452
Precision@1 | 0.93793103448276
Precision@3 | 0.91432068543452
Precision@10 | 0.89178082191781
Precision@25 | 0.86646433990895
Precision@50 | 0.83414832925835
Precision@100 | 0.79780715898097
Recall | 0.5961045617632
Average precision | 0.50701830107509

Change 757906 merged by jenkins-bot:

[mediawiki/extensions/WikibaseMediaInfo@master] Further narrow down which synonyms to use, and cap them

https://gerrit.wikimedia.org/r/757906

Response times

They're mostly ok, but I'm a little disappointed in the worst 5% response times with synonyms.
Curiously, they're worse than when I ran this locally.
I have another quick idea, though - will give that a quick shot tomorrow.

Note: I have since discovered that the normalizeFulltextScores trick, which was supposed to be disabled (T296631), was accidentally re-enabled.
This was already known to have a significant impact on synonyms.
That's mostly why the earlier results were unexpectedly worse.
After disabling it again, and with these new optimizations, I suspect we'll find that synonyms no longer have a pronounced impact on response times.

Ran numbers again.

  • Precision has improved by a few percent
  • We can't assess changes in recall, which is where the biggest impact should be
  • I can no longer spot an obvious performance issue; they are all within the same range - extremes (both low and high) seem inconsistent but similar enough between both profiles

Current (no synonyms)

Response times (limit 1) #1

average: 568ms
median: 530ms
10%: 308ms
90%: 647ms
95%: 870ms
99%: 1994ms

Response times (limit 1) #2

average: 797ms
median: 695ms
10%: 556ms
90%: 959ms
95%: 1264ms
99%: 2348ms

Response times (limit 500) #1

average: 1011ms
median: 839ms
10%: 505ms
90%: 1362ms
95%: 1471ms
99%: 2464ms

Response times (limit 500) #2

average: 1036ms
median: 853ms
10%: 505ms
90%: 1375ms
95%: 1524ms
99%: 4498ms

Scores

F1 Score | 0.65810593900482
Precision@1 | 0.94199535962877
Precision@3 | 0.92466585662211
Precision@10 | 0.88996763754045
Precision@25 | 0.85397590361446
Precision@50 | 0.82021604938272
Precision@100 | 0.77815594059406
Recall | 0.60873362445415
Average precision | 0.51359377920299

With synonyms

Response times (limit 1) #1

average: 613ms
median: 546ms
10%: 491ms
90%: 698ms
95%: 992ms
99%: 1832ms

Response times (limit 1) #2

average: 638ms
median: 546ms
10%: 488ms
90%: 744ms
95%: 1186ms
99%: 1917ms

Response times (limit 500) #1

average: 1087ms
median: 1157ms
10%: 506ms
90%: 1448ms
95%: 1724ms
99%: 3372ms

Response times (limit 500) #2

average: 1093ms
median: 1087ms
10%: 486ms
90%: 1483ms
95%: 1745ms
99%: 4053ms

Scores

F1 Score | 0.66196520921486
Precision@1 | 0.94808126410835
Precision@3 | 0.93816884661118
Precision@10 | 0.90665796344648
Precision@25 | 0.875
Precision@50 | 0.84133858267717
Precision@100 | 0.79797660448941
Recall | 0.60109289617486
Average precision | 0.51693857371811

Testing on our labeled dataset is complete & satisfactory.
I've created another ticket to actually switch on the profile by default & monitor production impact: T301559
Closing.