
Begin to quantify why people use Search Engines instead of Wikimedia search
Closed, Resolved · Public · 3 Estimated Story Points

Description

The initial hypothesis is that there are two big reasons why people use Google (or other search engines) instead of Wikimedia search: (1) external search engines give better results than Wikimedia search, and (2) it's just habit or convenience, or they don't know about our search capabilities.

We could begin to quantify this by looking at queries that come from search engines and lead to Wikimedia pages (at least those with referrers) and test whether those same queries (possibly minus "wiki" or "wikipedia" and similar search terms) give the destination page as a result using our search (say, top 5 results).

If they don't, then it gives credence to the idea that people are using external search engines because they give better results.

If they do, then it's habit or convenience of using an external engine or ignorance of our search capabilities.

In the former case, we have examples of what we need to work on in our search engine. In the latter case, we need to work on our advertising!
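The top-5 check described above could be sketched roughly as follows. This is only an illustration: `title_from_url`, `destination_in_top_n`, and the `search` callable are hypothetical names, and the sketch assumes destinations look like `<host>/wiki/<Title>` and that `search` returns page titles in rank order. A real `search` might wrap the MediaWiki API's `action=query&list=search`, but any backend would do.

```python
from typing import Callable, List
from urllib.parse import unquote


def title_from_url(url: str) -> str:
    """Turn 'en.wikipedia.org/wiki/George_Clooney' into 'George Clooney'.

    Assumes the URL contains '/wiki/' followed by the page title.
    """
    slug = url.split("/wiki/", 1)[1]
    return unquote(slug).replace("_", " ")


def destination_in_top_n(
    query: str,
    destination_url: str,
    search: Callable[[str], List[str]],  # hypothetical: returns ranked titles
    n: int = 5,
) -> bool:
    """True if the destination page's title is among the first n results."""
    return title_from_url(destination_url) in search(query)[:n]
```

A miss counts as evidence for reason (1), a hit as evidence for reason (2). Exact title matching is a simplification; redirects and case differences would need handling in practice.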

Initial scope: we could limit our initial investigation to some set of specific search engines (Google, DuckDuckGo, Bing) and/or to some specific set of Wikis (the large wikis, or ones we have in labs) to see if there's a big effect from the biggest search engines to the biggest wikis.

Results should probably be broken down by search engine and wiki or at least by language—maybe Google users are using difficult queries to get into the Hungarian Wiktionary, but not into German Wikipedia, while DuckDuckGo users create difficult queries to get content out of French Wikipedia, but not Finnish Wikipedia. These are jokey examples, but there's some insight to be gained from this breakdown, especially by language. Maybe our support of English is great, but our support of Finnish is much weaker than Bing's, for example.
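One possible way to produce that breakdown, assuming each checked query has already been reduced to an `(engine, wiki, hit)` record (a hypothetical shape, where `hit` says whether our search returned the destination page in its top results):

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def hit_rates(
    records: Iterable[Tuple[str, str, bool]],
) -> Dict[Tuple[str, str], float]:
    """Fraction of queries whose destination was found, per (engine, wiki)."""
    counts = defaultdict(lambda: [0, 0])  # (engine, wiki) -> [hits, total]
    for engine, wiki, hit in records:
        entry = counts[(engine, wiki)]
        entry[0] += int(hit)
        entry[1] += 1
    return {key: hits / total for key, (hits, total) in counts.items()}
```

Keying on `(engine, language)` instead would give the by-language view; small groups will have noisy rates, so the totals are worth reporting alongside.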

Caveat: There are other reasons why someone might use an external search engine—like searching multiple wikis at once—that we won't necessarily be able to detect here. However, we should be able to find obvious shortcomings, such as language support, typo correction, and magical inference of user intent.

  • Step 1: extract queries ("gorge clooney wiki"), sources ("Google"), and destination ("en.wikipedia.org/wiki/George_Clooney")
  • Step 2?: normalize queries (should we drop "wiki" if it's not the only query term, since that seems to be info for Google, or run both with and without "wiki"?)
  • Step 3: run the referring queries against the relevant wiki
  • Step 4: analysis (profit?!)
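Steps 1 and 2 might look something like the sketch below. It is not the code linked from this task. The `ENGINES` and `WIKI_TERMS` tables and both function names are made up for illustration, and it assumes the referrer carries the query in a `q` parameter, which many referrers will not.

```python
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlparse

# Hypothetical lookup tables; extend as needed.
ENGINES = {"google": "Google", "bing": "Bing", "duckduckgo": "DuckDuckGo"}
WIKI_TERMS = {"wiki", "wikipedia"}


def extract_query(referrer: str) -> Optional[Tuple[str, str]]:
    """Return (engine, query) from a referrer URL, or None if unusable."""
    parsed = urlparse(referrer)
    labels = (parsed.hostname or "").split(".")
    engine = next((name for key, name in ENGINES.items() if key in labels), None)
    if engine is None:
        return None
    terms = parse_qs(parsed.query).get("q")  # assumes the query is in 'q'
    if not terms:
        return None
    return engine, terms[0]


def normalize(query: str) -> str:
    """Drop 'wiki'/'wikipedia' unless that would leave nothing to search for."""
    kept = [t for t in query.split() if t.lower() not in WIKI_TERMS]
    return " ".join(kept) if kept else query.strip()


print(extract_query("https://www.google.com/search?q=gorge+clooney+wiki"))
# ('Google', 'gorge clooney wiki')
print(normalize("gorge clooney wiki"))
# gorge clooney
```

Running Step 3 on both the raw and the normalized query would answer the open question in Step 2 empirically rather than by guesswork.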

Event Timeline

TJones raised the priority of this task from to Needs Triage.
TJones updated the task description.
TJones added a project: CirrusSearch.
TJones added subscribers: TJones, Deskana, Ironholds.
Restricted Application added a subscriber: Aklapper.

Done! Code here; @TJones, the file is on stat1002 at /home/ironholds/matched_google_searches.tsv