
Full review of small sample (~1K) of full text queries to categorize them all
Closed, Resolved, Public

Description

In addition to looking for big patterns, we need to identify categories that can't be readily detected other than by manual inspection (e.g., typos and gibberish) and gauge their extent.

This also gives us a sample of typos sent through the API to see how many would get suggestions if suggestions were enabled.

Event Timeline

TJones claimed this task.
TJones raised the priority of this task to High.
TJones updated the task description.
TJones added subscribers: EBernhardson, dcausse.

Done:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Zero-Results_Queries#Full_manual_review_of_a_1K_enwiki_sample

Highlights:
About 14.6% were typos, plus another 8.9% that look like incomplete strings (almost all from the API, so they are probably apps trying to do prefix searches, but I don't know for sure).

Of the typos, 13.1% had mistakes in the first two characters of a search term, so a reverse index might be helpful!
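
To illustrate why a reverse index could help with those early-character typos, here is a minimal sketch. It assumes the suggester only considers terms that share an exact prefix with the query (the `prefix_len=2` setting is an assumption for illustration, not a confirmed configuration), and the function names and toy vocabulary are hypothetical. Reversing both the query and the indexed terms moves the exact-match requirement to the end of the word, so a typo in the first two characters no longer rules out the right term.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def suggest(query, vocabulary, prefix_len=2, max_edits=1):
    """Terms within max_edits of query that share its first prefix_len characters.

    The exact-prefix requirement is an assumed constraint, used here only
    to show the effect described above.
    """
    prefix = query[:prefix_len]
    return [t for t in vocabulary
            if t.startswith(prefix) and edit_distance(query, t) <= max_edits]


def suggest_reversed(query, vocabulary, prefix_len=2, max_edits=1):
    """Same lookup against reversed terms, so the exact match applies to the suffix."""
    reversed_vocab = {t[::-1]: t for t in vocabulary}
    hits = suggest(query[::-1], reversed_vocab, prefix_len, max_edits)
    return [reversed_vocab[h] for h in hits]


# Hypothetical toy vocabulary; "rlephant" has a typo in its first character.
vocabulary = ["elephant", "element", "telephone"]
forward = suggest("rlephant", vocabulary)            # [] (prefix "rl" matches nothing)
backward = suggest_reversed("rlephant", vocabulary)  # ["elephant"]
late_typo = suggest("elephamt", vocabulary)          # ["elephant"]
```

The forward lookup still handles typos later in the word, so under these assumptions a reverse index would complement the existing one rather than replace it.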

Of the typos, 50% got the obviously correct results with autosuggestions. 25% got something, and 25% got nothing.

7.2% of queries were in or mostly in a foreign language.

28.0% were not encyclopedic in my estimation.

5.0% were junk.

And about 1% was someone searching for addresses in Las Vegas.

TJones set Security to None.