Compare ZRR for query features across other search engines
Closed, Resolved · Public · 10 Estimated Story Points

Authored By

	mpopov
	May 27 2016, 1:02 AM

Description

@JustinOrmont had a neat idea based on T128118 wherein we could take a sample of queries exhibiting particular features (and/or combinations of features) and then compare our ZRR with Google's/Bing's/site:wikipedia.org/etc. to see which high-ZRR features have significantly lower ZRR on other search engines.

This could highlight certain query categories for us and help us prioritize our work on improving ZRR.

Related Objects

Mentioned In: T149143: Investigate what we'd need to do to ignore double quotes in search queries
Mentioned Here: T128118: Investigate how search query features affect result sets

Event Timeline

mpopov created this task. May 27 2016, 1:02 AM

Restricted Application added subscribers: Zppix, Aklapper. May 27 2016, 1:02 AM

debt triaged this task as Low priority. May 31 2016, 8:26 PM

debt moved this task from Needs triage to Later on the Discovery-Analysis board.

debt moved this task from Later to Up Next on the Discovery-Analysis board. Oct 4 2016, 8:32 PM

debt edited projects, added Discovery-ARCHIVED, Discovery-Analysis (Current work); removed Discovery-Analysis. Oct 18 2016, 8:21 PM

mpopov set the point value for this task to 10. Oct 18 2016, 8:32 PM

mpopov moved this task from Backlog to In progress on the Discovery-Analysis (Current work) board.

Like halfway done with this: https://github.com/wikimedia-research/Discovery-Search-Adhoc-SearchEngineComparison

Should be done by tomorrow or most likely Monday since I can see it taking all weekend to run the queries.

The bots encountered problems performing some searches, so the table below is incomplete but should provide a good starting point for discussion. Because not all queries were searched successfully, each ZRR is reported as the fraction of zero-result SERPs out of the queries that were successfully searched on that engine. The "Proportion" column represents each feature set's share (%) of full-text searches performed on EnWiki by U.S.-based website visitors on desktops in September and October of 2016.

Note: there is a copy of this table on GitHub that has prettier formatting.

Features	Proportion	Cirrus ZRR	Google ZRR	Yahoo ZRR	Bing ZRR	DDG ZRR
[is simple]	91.417227%	12% (12/100)	1% (1/100)	3% (3/100)	2% (2/100)	2% (2/100)
[has even double quotes]	5.888506%	78% (78/100)	20% (20/100)	27% (27/100)	25% (25/100)	41% (30/74)
[ends with ?, has wildcard]	0.569473%	9% (9/100)		0% (0/100)	0% (0/100)	0% (0/99)
[has wildcard]	0.088912%	62% (62/100)		41% (41/100)	42% (40/95)	18% (16/90)
[has one double quote, has odd double quotes]	0.059571%	45% (45/100)	6% (6/100)	8% (8/99)	7% (7/100)	21% (21/98)
[has logic inversion (-)]	0.042736%	20% (20/100)		6% (6/100)	6% (6/100)	11% (11/97)
[has wildcard, has even double quotes]	0.009066%	43% (43/100)		43% (43/100)	41% (40/98)	36% (29/80)
[has logic inversion (-), has even double quotes]	0.007691%	46% (46/100)		26% (26/100)	25% (25/100)	43% (40/92)
[ends with ?, has wildcard, has even double quotes]	0.007300%	34% (34/100)	4% (1/25)	7% (7/100)	6% (6/100)	13% (13/100)
[has logic inversion (!)]	0.004586%	14% (14/100)		7% (7/100)	6% (6/100)	5% (5/94)
[has odd double quotes]	0.003391%	65% (65/100)	28% (28/99)	44% (44/99)	44% (44/99)	59% (49/83)
[has wildcard, has one double quote, has odd double quotes]	0.002089%	65% (65/100)	35% (35/100)	51% (49/97)	45% (45/100)	64% (54/84)
[ends with ?, has wildcard, has one double quote, has odd double quotes]	0.001306%	51% (51/100)		10% (10/99)	10% (10/100)	36% (35/96)
[has logic inversion (-), has wildcard]	0.000575%	60% (40/67)		33% (22/67)	32% (21/65)	16% (8/51)
[ends with ?, has logic inversion (-), has wildcard]	0.000379%	43% (39/91)	8% (7/91)	20% (17/87)	19% (17/91)	3% (3/89)
[has quot, has even double quotes]	0.000302%	100% (74/74)	0% (0/1)	5% (4/74)	5% (4/74)	23% (6/26)
[has logic inversion (-), has wildcard, has even double quotes]	0.000261%	13% (8/63)		5% (3/63)	5% (3/61)	16% (10/63)
[has wildcard, has odd double quotes]	0.000220%	76% (41/54)	56% (30/54)	57% (31/54)	56% (30/54)	79% (38/48)
[has quot]	0.000200%	17% (8/48)		0% (0/48)	2% (1/48)	0% (0/48)
[has logic inversion (!), has wildcard]	0.000175%	56% (24/43)	42% (18/43)	53% (23/43)	58% (25/43)	18% (7/39)
[has logic inversion (!), has even double quotes]	0.000151%	57% (21/37)		27% (10/37)	24% (9/37)	24% (8/34)
[has logic inversion (-), has one double quote, has odd double quotes]	0.000135%	67% (22/33)		30% (10/33)	33% (11/33)	58% (18/31)
[ends with ?]	0.000073%	84% (16/19)		100% (19/19)	100% (19/19)	100% (12/12)
[has logic inversion (!), has one double quote, has odd double quotes]	0.000073%	50% (9/18)	12% (2/17)	28% (5/18)	24% (4/17)	53% (9/17)
[ends with ?, has wildcard, has odd double quotes]	0.000069%	94% (16/17)	24% (4/17)	24% (4/17)	24% (4/17)	82% (14/17)
[has logic inversion (-), has odd double quotes]	0.000057%	57% (8/14)	21% (3/14)	36% (5/14)	36% (5/14)	42% (5/12)
[ends with ?, has logic inversion (!), has wildcard]	0.000020%	20% (1/5)	20% (1/5)	20% (1/5)	20% (1/5)	0% (0/5)
[has logic inversion (!), has wildcard, has one double quote, has odd double quotes]	0.000020%	80% (4/5)	20% (1/5)	60% (3/5)	60% (3/5)	0% (0/3)
[has logic inversion (!), has wildcard, has even double quotes]	0.000016%	75% (3/4)	75% (3/4)	75% (3/4)	75% (3/4)	50% (2/4)
[ends with ?, has logic inversion (-), has wildcard, has even double quotes]	0.000012%	67% (2/3)		33% (1/3)	33% (1/3)	67% (2/3)
[ends with ?, has logic inversion (-), has wildcard, has one double quote, has odd double quotes]	0.000012%	33% (1/3)	33% (1/3)	0% (0/3)	0% (0/3)	67% (2/3)
[has logic inversion (-), has wildcard, has one double quote, has odd double quotes]	0.000012%	33% (1/3)		33% (1/3)	33% (1/3)	67% (2/3)
[has logic inversion (!), has odd double quotes]	0.000012%	67% (2/3)	67% (2/3)	67% (2/3)	33% (1/3)	67% (2/3)
[ends with ?, has wildcard, has quot]	0.000004%	100% (1/1)	100% (1/1)	100% (1/1)	100% (1/1)	0% (0/1)
[has logic inversion (-), has logic inversion (!), has even double quotes]	0.000004%	100% (1/1)		100% (1/1)	100% (1/1)
[has logic inversion (-), has wildcard, has odd double quotes]	0.000004%	100% (1/1)	100% (1/1)	100% (1/1)	100% (1/1)	100% (1/1)
[has wildcard, has quot]	0.000004%	100% (1/1)	100% (1/1)	100% (1/1)	100% (1/1)
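
For anyone reading the cells above, here is a minimal sketch of how an entry such as "41% (30/74)" can be derived from per-engine counts. The function name `zrr_cell` is hypothetical and not taken from the analysis codebase, and treating a blank cell as "no successful searches" is an assumption on my part.

```python
def zrr_cell(zero_results: int, successful: int) -> str:
    """Format a zero results rate as 'P% (zero/successful)'.

    The denominator is the number of queries the bot searched
    successfully on that engine, not the number sampled.
    """
    if successful == 0:
        # Assumption: a blank cell means no search succeeded on that engine.
        return ""
    pct = round(100 * zero_results / successful)
    return f"{pct}% ({zero_results}/{successful})"


# DDG, [has even double quotes]: 30 zero-result SERPs out of 74 successes.
print(zrr_cell(30, 74))
```

Because the denominators differ per engine, cells in the same row are not always directly comparable when one engine had many failed searches.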

mpopov moved this task from In progress to Done on the Discovery-Analysis (Current work) board. Oct 24 2016, 7:14 PM

Taking Proportion * ZRR as a measure of impact for each feature set, it looks like double quotes are the feature to look at.

Other than "simple", it's the worst for all the search engines, but the impact ranges from 1.18% (Google) to 2.41% (DDG) for the others, versus 4.59% for Cirrus. Searching again without quotes seems like the obvious approach.
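
A quick sketch of the Proportion * ZRR arithmetic for the [has even double quotes] row, using the rounded percentages from the table. The names `impact` and `PROPORTION` are illustrative only.

```python
def impact(proportion_pct: float, zrr_pct: float) -> float:
    """Share of all searches (in %) that have the feature AND get zero results."""
    return proportion_pct * zrr_pct / 100


# [has even double quotes] row: proportion and rounded ZRR per engine.
PROPORTION = 5.888506
zrr = {"Cirrus": 78, "Google": 20, "Yahoo": 27, "Bing": 25, "DDG": 41}

impacts = {engine: round(impact(PROPORTION, z), 2) for engine, z in zrr.items()}
for engine, value in impacts.items():
    print(f"{engine}: {value:.2f}%")
```

Using the raw DDG counts (30/74) instead of the rounded 41% would give a slightly lower figure for DDG, so treat the second decimal place as approximate.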

In other news, we may want to rethink how we classify ?, since we did change its behavior in Cirrus. \? is the wildcard now, and simple ? is ignored (though the posing of an apparent ?-final question is still a decent predictor of poor search performance).

Cool stuff!

Quite interesting. To add to TJones' comments, have you looked at what the
ZRR rate would be if you dropped the double quotes from properly quoted ZRR
queries? Aka simulate the automatic removal of double quotes if no (or few)
results were found.

--justin

In T136377#2740605, @JustinOrmont wrote:

... have you looked at what the ZRR rate would be if you dropped the double quotes from properly quoted ZRR queries?

Yep. I did that back when Mikhail first did his Zero to Hero report, looking at quotes and question marks. For quotes, replacing quotes with spaces (to deal with "this"kind of thing) dropped the ZRR by almost half, putting it into DDG territory.

We went with question marks first—the problem was bigger and the solution even easier—but it's no surprise that quotes are the next biggest thing.
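
To make the quote-handling idea concrete, here is a rough sketch of replacing double quotes with spaces and retrying only when the original query returns nothing. This is not how Cirrus implements anything; `strip_double_quotes`, `search_with_fallback`, and the `search` callable are all hypothetical stand-ins.

```python
import re
from typing import Callable, List


def strip_double_quotes(query: str) -> str:
    """Replace each double quote with a space, then collapse whitespace.

    Using a space (rather than deleting the quote) keeps '"this"kind'
    from fusing into 'thiskind'.
    """
    return re.sub(r"\s+", " ", query.replace('"', " ")).strip()


def search_with_fallback(query: str, search: Callable[[str], List[str]]) -> List[str]:
    """Run the query as typed; if it has quotes and no results, retry unquoted."""
    results = search(query)
    if results or '"' not in query:
        return results
    return search(strip_double_quotes(query))


# Toy stand-in for a search backend (assumption, for illustration only).
index = {"this kind of thing": ["Example page"]}
results = search_with_fallback('"this"kind of thing', lambda q: index.get(q, []))
print(results)
```

Retrying only on zero results leaves quoted queries that already work untouched, which matches the "if no (or few) results were found" framing above.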

debt mentioned this in T149143: Investigate what we'd need to do to ignore double quotes in search queries. Oct 25 2016, 10:43 PM

Thanks, all, resolving this ticket!
