Search terms data request
Closed, ResolvedPublic

Description

One of the aspects of the search experience we have been discussing in T255603 is how to format the page title of each search result (see T255603#6369734). We believe that having some basic data on what search terms get entered could help guide this decision, as well as inform our general understanding of how search is used.

I'm not sure how to best formulate this data request. Perhaps it would make sense to look at the top search terms for ten large Wikipedias and ten small Wikipedias?

Any guidance/recommendations you all can provide would be much appreciated.

Event Timeline

LGoto triaged this task as Medium priority.
LGoto edited projects, added Product-Analytics (Kanban); removed Product-Analytics.

@alexhollender

I completed a review of the most common search terms on ten large and ten small wikis. See summary of results below and notebook for further details:

Notes re Data/Approach:

  • Data includes search events recorded in August 2020 in the SearchSatisfaction table.
  • As a person types in the search header box, autocomplete search events are generated, so the query input is recorded multiple times as the user types. For example, if someone is searching for Paris, there might be records for "Pa", "Par", and finally "Paris", depending on how fast the person types. To help isolate complete searches (i.e. the user was done typing), I took the longest search term entered in each search session and set a minimum search query length of 3 characters.
  • The search terms below reflect what the person entered into the search header box, not what was selected in the drop-down menu. For example, if the person just typed "Par" and then clicked "Paris" in the search results shown in the drop-down, the search term would be recorded as "Par".
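
The filtering described in the notes above could be sketched roughly as follows. This is only an illustration, not the query used in the notebook: the `(session_id, query)` event layout, the function name `final_queries`, and the lowercasing step are assumptions made for the example, and the real SearchSatisfaction schema may differ.

```python
from collections import Counter

MIN_QUERY_LENGTH = 3  # minimum query length stated in the notes above


def final_queries(events):
    """Keep the longest query typed in each search session.

    `events` is an iterable of (session_id, query) pairs. This layout is
    hypothetical and stands in for the autocomplete event records.
    """
    longest = {}
    for session_id, query in events:
        # Lowercasing is an assumption, so "Paris" and "paris" count together.
        query = query.strip().lower()
        if len(query) > len(longest.get(session_id, "")):
            longest[session_id] = query
    # Drop sessions whose longest query is still shorter than the minimum.
    return {s: q for s, q in longest.items() if len(q) >= MIN_QUERY_LENGTH}


events = [
    ("s1", "Pa"), ("s1", "Par"), ("s1", "Paris"),  # typed out in full
    ("s2", "Par"),                                 # clicked a suggestion early
    ("s3", "co"),                                  # too short, dropped
]
top_terms = Counter(final_queries(events).values()).most_common()
print(top_terms)
```

Note that session "s2" is counted as "par" rather than "paris", which mirrors the caveat above about recording what was typed rather than what was selected.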

Top search terms for a set of 10 large Wikipedias (English, Spanish, German, French, Japanese, Russian, Italian, Chinese, Portuguese, and Polish) in August 2020.
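
| # | search_query | num_searches |
| --- | --- | --- |
| 1 | nasdaq | 156726 |
| 2 | 2020 | 31542 |
| 3 | part of an url | 22403 |
| 4 | tenet | 22117 |
| 5 | kamala | 17500 |
| 6 | covid | 14902 |
| 7 | lucifer | 14334 |
| 8 | belarus | 14111 |
| 9 | joe biden | 13479 |
| 10 | the | 13173 |
| 11 | kamala harris | 12880 |
| 12 | dark | 8601 |
| 13 | the boys | 8423 |
| 14 | donald trump | 8208 |
| 15 | the batman | 7988 |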

Top search terms for a set of 10 small Wikipedias (Persian, Catalan, Serbian, Indonesian, Norwegian, Korean, Finnish, Hungarian, Czech, and Serbo-Croatian) in August 2020. Note: I selected wikis with at least 100,000 articles to have enough data for the analysis and to avoid any privacy/sensitive data concerns that may result from reviewing wikis with fewer searches.

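| # | search_query | num_searches |
| --- | --- | --- |
| 1 | libanon | 1067 |
| 2 | bělorusko | 983 |
| 3 | 2020 | 720 |
| 4 | ledek | 668 |
| 5 | suomi | 535 |
| 6 | سوپر جام | 441 |
| 7 | covid | 344 |
| 8 | usa | 341 |
| 9 | ایران | 322 |
| 10 | praha | 318 |
| 11 | česká republika | 309 |
| 12 | سکس | 292 |
| 13 | valko | 277 |
| 14 | 대한민국 | 271 |
| 15 | česko | 268 |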


Based on the above, most of the top search terms on the small and large wikis reviewed in August are single words. "Kamala" received more searches than "Kamala Harris", indicating that a larger number of people selected a result from the drop-down menu or pressed the search button before entering her entire name into the search box. A couple of terms, such as "part of an url", are likely caused by unidentified bots.

I also reviewed the frequency of searches by the number of words entered into the search widget. One-word and two-word searches together account for 85.5% of autocomplete searches, with one-word searches conducted 24.7% more often than two-word searches.

See results below:

| num_words | n_searches | prop_searches (%) |
| --- | --- | --- |
| 1 | 914678 | 47.46 |
| 2 | 733311 | 38.05 |
| 3 | 174888 | 9.07 |
| 4 | 55795 | 2.90 |
| 5 | 21810 | 1.13 |
| 6 | 10633 | 0.55 |
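
As a quick check, the two summary figures quoted above (85.5% and 24.7%) can be reproduced from the table. The sketch below uses only numbers copied from the table. The `num_words` helper is a hypothetical illustration of how a query might be bucketed, assuming words are split on whitespace; the notebook may define words differently.

```python
# Counts copied from the table above (num_words -> n_searches).
n_searches = {1: 914678, 2: 733311, 3: 174888, 4: 55795, 5: 21810, 6: 10633}
# Shares copied from the table above, in percent.
prop_searches = {1: 47.46, 2: 38.05}

# Combined share of one-word and two-word searches.
one_or_two_share = prop_searches[1] + prop_searches[2]
# How many more one-word searches there were than two-word searches, in percent.
one_vs_two = (n_searches[1] / n_searches[2] - 1) * 100

print(f"one- or two-word searches: {one_or_two_share:.1f}%")
print(f"one-word vs. two-word searches: {one_vs_two:.1f}% more")


def num_words(query):
    """Count whitespace-separated words (an assumed definition of a word)."""
    return len(query.split())
```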

Let me know if this helps or you need any additional details or breakdowns.

Thanks!

@alexhollender - Any follow-up questions about these results?

Apologies for the delayed response. This report is great — thank you so much. The only follow-up question I have currently is: what percentage of total search terms do the 15 terms listed above account for?

@ovasileva @RHo @Volker_E please see Megan's report above. Thoughts:

  1. The report gives me additional confidence that we should bold the non-matching part of the search result. Looking at a few of the popular search terms listed in the report and comparing the two patterns of bolding for the suggested results:
[Screenshots: ten suggested-results mockups in a two-column table, "bold matching part" vs. "bold non-matching part"]

One concern/question this raises for me: does the top result look somehow less attractive since it's not bolded?

  2. Seeing the report gives me confidence that most people know how to use search : )
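
For reference, the "bold non-matching part" pattern from point 1 could be sketched as below. This is a hypothetical illustration, not the search widget's actual code: the function name `bold_non_matching`, the `<strong>` markup, and the case-insensitive prefix match are all assumptions.

```python
def bold_non_matching(query, suggestion):
    """Bold the part of `suggestion` that the user has not typed yet.

    Assumes a simple case-insensitive prefix match. If the suggestion does
    not start with the query, it is returned unchanged.
    """
    if not query or not suggestion.lower().startswith(query.lower()):
        return suggestion
    typed, rest = suggestion[:len(query)], suggestion[len(query):]
    # An exact match has no remaining part, so nothing is bolded.
    return f"{typed}<strong>{rest}</strong>" if rest else typed


for suggestion in ["Paris", "Paris Saint-Germain F.C.", "Parasite"]:
    print(bold_non_matching("paris", suggestion))
```

In this sketch an exact match such as "Paris" ends up with no bold text at all, which is the case the concern above is about.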

Thanks @MNeisler - resolving this as of right now! @alexhollender - definitely agree on the bolding of the non-matching part. Agree that it makes the first results seem a bit strange, but interestingly enough here the first term seems to not be the most likely one in these examples...