Begin to quantify why people use Search Engines instead of Wikimedia search
Closed, Resolved · Public · 3 Story Points


The initial hypothesis is that there are two big reasons why people use Google (or other search engines) instead of Wikimedia search: (1) external search engines give better results than Wikimedia search, and (2) habit or convenience, or people simply don't know about our search capabilities.

We could begin to quantify this by looking at queries that come from search engines and lead to Wikimedia pages (at least those with referrers) and testing whether those same queries (possibly minus "wiki", "wikipedia", and similar search terms) return the destination page in our own search results (say, within the top 5).

If they don't, then it gives credence to the idea that people are using external search engines because they give better results.

If they do, then it's habit or convenience of using an external engine or ignorance of our search capabilities.

In the former case, we have examples of what we need to work on in our search engine. In the latter case, we need to work on our advertising!
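The check described above could be sketched roughly as follows. This is only an illustration: `classify`, `normalize_title`, and the two label strings are hypothetical names, and the title comparison assumes that treating underscores and spaces as equivalent is enough to match a destination page against a result title.

```python
def normalize_title(title):
    # Assumption: underscores (URL form) and spaces (display form)
    # refer to the same page title.
    return title.replace("_", " ").strip()


def classify(destination, results, k=5):
    """Label one referred query by whether our search finds its destination.

    destination: title of the page the external search engine led to.
    results: titles returned by our own search, best first.
    k: how many top results count as "found" (top 5 in the proposal).
    """
    top = [normalize_title(t) for t in results[:k]]
    if normalize_title(destination) in top:
        # Our search finds the page too, so quality is not the obstacle.
        return "habit_or_convenience"
    # Our search misses the page: an example of something to improve.
    return "better_external_results"
```

Queries labelled `better_external_results` would be the examples to work on in the search engine; a large share of `habit_or_convenience` would point at awareness instead.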

Initial scope: we could limit our initial investigation to some set of specific search engines (Google, DuckDuckGo, Bing) and/or to some specific set of Wikis (the large wikis, or ones we have in labs) to see if there's a big effect from the biggest search engines to the biggest wikis.

Results should probably be broken down by search engine and wiki or at least by language—maybe Google users are using difficult queries to get into the Hungarian Wiktionary, but not into German Wikipedia, while DuckDuckGo users create difficult queries to get content out of French Wikipedia, but not Finnish Wikipedia. These are jokey examples, but there's some insight to be gained from this breakdown, especially by language. Maybe our support of English is great, but our support of Finnish is much weaker than Bing's, for example.
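As a rough sketch of that breakdown, assuming each checked query has already been reduced to a hypothetical `(engine, wiki, found)` record where `found` says whether our search returned the destination page in its top results:

```python
from collections import defaultdict


def hit_rate_by_group(records):
    """Return {(engine, wiki): share of queries our search got right}.

    records: iterable of (engine, wiki, found) tuples; this record
    shape is an assumption made for illustration.
    """
    counts = defaultdict(lambda: [0, 0])  # (engine, wiki) -> [hits, total]
    for engine, wiki, found in records:
        bucket = counts[(engine, wiki)]
        bucket[0] += int(found)
        bucket[1] += 1
    return {key: hits / total for key, (hits, total) in counts.items()}
```

A low share for one engine and wiki pair next to a high share for another would be the kind of per-language gap worth a closer look.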

Caveat: There are other reasons why someone might use an external search engine—like searching multiple wikis at once—that we won't necessarily be able to detect here. However, we should be able to find obvious shortcomings, such as language support, typo correction, and magical inference of user intent.

  • Step 1: extract queries ("gorge clooney wiki"), sources ("Google"), and destination ("")
  • Step 2?: normalize queries (should we drop "wiki" if it's not the only query term, since that seems to be info for Google, or, run both with and without "wiki", etc.)
  • Step 3: run the referring queries against the relevant wiki
  • Step 4: analysis (profit?!)
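
Steps 1 to 3 could be sketched along these lines. This is a minimal illustration, not the code that was eventually run: the function names, the `ENGINES` and `NOISE_TERMS` tables, and the assumption that the referrer carries the query in a `q` parameter are all hypothetical. The last function only builds a request URL for the public MediaWiki search API (`action=query`, `list=search`) and does not send it.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumption: a search engine is recognised by a label in the referrer host.
ENGINES = {"google": "Google", "bing": "Bing", "duckduckgo": "DuckDuckGo"}
# Assumption: these terms are hints for the external engine, not for us.
NOISE_TERMS = {"wiki", "wikipedia"}


def extract_query(referrer):
    """Step 1: return (engine, query) from a referrer URL, or None."""
    parsed = urlparse(referrer)
    labels = (parsed.hostname or "").split(".")
    engine = next((name for key, name in ENGINES.items() if key in labels), None)
    if engine is None:
        return None
    terms = parse_qs(parsed.query).get("q")
    if not terms:
        return None  # referrer present, but the query was not passed along
    return engine, terms[0]


def normalize_query(query):
    """Step 2: drop "wiki"-like terms unless they are the whole query."""
    words = query.split()
    kept = [w for w in words if w.lower() not in NOISE_TERMS]
    return " ".join(kept) if kept else " ".join(words)


def build_search_url(wiki_host, query, limit=5):
    """Step 3: build a MediaWiki API search URL for the relevant wiki."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return "https://" + wiki_host + "/w/api.php?" + urlencode(params)
```

For the example query in Step 1, `normalize_query("gorge clooney wiki")` gives `"gorge clooney"`. Running both the raw and the normalized form, as Step 2 suggests, would just mean calling `build_search_url` twice.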
TJones created this task. Sep 8 2015, 8:20 PM
TJones updated the task description.
TJones raised the priority of this task from to Needs Triage.
TJones added a project: CirrusSearch.
TJones added subscribers: TJones, Deskana, Ironholds.
Restricted Application added a project: Discovery. Sep 8 2015, 8:20 PM
Restricted Application added a subscriber: Aklapper.
Ironholds moved this task from Needs triage to Analysis on the Discovery board. Sep 8 2015, 8:21 PM
Ironholds set Security to None.
Ironholds edited a custom field.
Ironholds claimed this task. Sep 9 2015, 6:09 PM
Ironholds moved this task from Backlog to In progress on the Discovery-Analysis (Current work) board.

Done! Code here; @TJones the file is in stat1002 at /home/ironholds/matched_google_searches.tsv

Deskana closed this task as Resolved. Sep 24 2015, 4:06 AM
Deskana moved this task from Done to Resolved on the Discovery-Analysis (Current work) board.
Restricted Application added a subscriber: StudiesWorld. Dec 31 2015, 5:07 AM