The goal is to release a public dataset of daily search engine referrals to Wikipedia, ideally sliced by country, language, and device. Public datasets already cover how much daily traffic overall comes from search engines and how much daily traffic comes from different device types; a dataset on daily top-viewed pages by country is coming soon; and the early clickstream datasets contained information about referrals by article from a few specific search engines. There is currently no public data, though, on what the top referrers to Wikimedia sites are. Given that Wikipedia is a globally popular website, this dataset would provide valuable insight into the platforms that individuals use to find knowledge on the web.
Ideally this will be a daily dataset. If the privacy restrictions make this dataset largely non-useful -- i.e. we have to throw out too much data -- a monthly dataset might (also) be useful.
The ideal set of facets is three (though device is actually two):
- country -- i.e. where is the reader accessing Wikipedia from (based on their IP address)
- device -- both OS (e.g., iOS, Linux, Windows) and browser family (e.g., Chrome, Safari, Firefox) as in this dashboard. For privacy/data reasons, we only need browser_family and os_family and can aggregate all browser/OS versions together.
- language -- assuming we just focus on Wikipedia, this would be the associated language (e.g., "en" for English Wikipedia and no data on Wiktionary etc.). We could expand to include other projects too and then it would be project (e.g., en.wikipedia, en.wiktionary, etc.).
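As a rough sketch of how these facets could be combined, the snippet below counts pageviews per (country, language, browser_family, os_family) and keeps only Wikipedia projects. The record layout and the names Pageview, wikipedia_language, and facet_counts are assumptions made for illustration, not the actual pageview schema.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class Pageview(NamedTuple):
    # Hypothetical record layout; field names are assumptions.
    country: str
    project: str  # e.g. "en.wikipedia" or "en.wiktionary"
    browser_family: str
    os_family: str


def wikipedia_language(project: str) -> Optional[str]:
    """Return the language code for a Wikipedia project, else None."""
    lang, _, family = project.partition(".")
    return lang if family == "wikipedia" and lang else None


def facet_counts(pageviews: Iterable[Pageview]) -> Counter:
    """Count pageviews per (country, language, browser_family, os_family)."""
    counts: Counter = Counter()
    for pv in pageviews:
        lang = wikipedia_language(pv.project)
        if lang is None:
            continue  # skip non-Wikipedia projects such as Wiktionary
        counts[(pv.country, lang, pv.browser_family, pv.os_family)] += 1
    return counts
```

Expanding to other projects would mean keying on the full project string (e.g., en.wiktionary) instead of the extracted language.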
Identifying search engines from referrer data
The current system for classifying referrers on pageviews uses a series of ~20 regexes that are matched to known search engines, plus a catch-all regex that should identify search engines that are not explicitly named but use a search-engine-ish URL pattern. It should be largely trivial (though what ever is?) to reapply these regexes to build a more fine-grained dataset of top search engines for public release. My sense (see limitations below) is that this set of 20 is still pretty globally complete.
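To make the named-regex-plus-catch-all idea concrete, here is a minimal sketch. The patterns are illustrative stand-ins, not the actual regexes in SearchEngine.java, and classify_referrer and its return labels are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Illustrative patterns only; NOT the regexes used in SearchEngine.java.
NAMED_ENGINES = [
    ("Google", re.compile(r"(^|\.)google\.")),
    ("Bing", re.compile(r"(^|\.)bing\.")),
    ("DuckDuckGo", re.compile(r"(^|\.)duckduckgo\.")),
]

# Simplified stand-in for the catch-all: hosts with a "search." label.
CATCH_ALL = re.compile(r"(^|\.)search\.")


def classify_referrer(referrer: str) -> str:
    """Map a referrer URL to a named engine, the catch-all, or neither."""
    host = urlsplit(referrer).hostname or ""
    if not host:
        return "none"
    for name, pattern in NAMED_ENGINES:
        if pattern.search(host):
            return name
    if CATCH_ALL.search(host):
        return "Predicted Other"
    return "not a search engine"
```

Named engines are checked first so that the catch-all only picks up hosts that no explicit regex claims.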
Following the example set by T207171, the initial thought is to apply a few filters:
- Potentially remove certain countries where this data is deemed more sensitive
- Enforce a minimum number of pageviews or unique devices to be included on the list
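A minimal sketch of the two filters, assuming counts are already aggregated per (country, search engine). The function name, the key layout, and any threshold or country list passed in are hypothetical; the real values would come out of the privacy review.

```python
from typing import Dict, Set, Tuple

Key = Tuple[str, str]  # (country, search_engine); layout is an assumption


def apply_release_filters(
    counts: Dict[Key, int],
    excluded_countries: Set[str],
    min_pageviews: int,
) -> Dict[Key, int]:
    """Drop rows from excluded countries and rows below the threshold."""
    return {
        key: n
        for key, n in counts.items()
        if key[0] not in excluded_countries and n >= min_pageviews
    }
```

The same shape would work with unique devices instead of pageviews as the thresholded quantity.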
Caveats / Limitations
- Referrer information is incomplete because browsers / apps still haven't really figured this out. It's harder to know how much this affects Search traffic, but apps are where the problem tends to be most salient: for example, I was able to determine at one point that 40% of YouTube traffic comes in without referrer information (T195880#6207748). There really is no easy fix for this.
- I just checked the list of top external referrers not identified as search engines (data from Dec 10th) and the only things I'd consider changing are:
  - startpage.com is not currently caught by the startpage regex (which expects .startpage.), so that regex should probably be redesigned to capture it.
  - suche.t-online.de isn't caught because it uses the German word for search ("suche").
- I also checked what falls under PREDICTED_OTHER (regex: "(^.?|(?<!re)|(^|\\.)(pre|secure))search") in the currently identified search engines to see if there are any large search engines that should be bumped to their own regex. None exceeded 20k pageviews on December 10th. The only one that seemed country-specific, and therefore potentially important to include, is ukr.net, but someone with better knowledge of Ukraine would have to look into whether it should be included.
- We can optionally expand this to some pre-defined set of non-search-engines as well (e.g., Facebook, Reddit, Twitter, and YouTube could reasonably be added, like we do in the Social Media Traffic Report)
- Browser family data largely depends on a standard Java user_agent parser library that I believe pulls from a widely-used public YAML, so anything not listed in there won't be picked up. This is clearest with, e.g., browsers on Android that build on the generic Chrome WebView API (and will therefore be labeled as Chrome WebView).
- The usual error around geocoding IP addresses applies. MaxMind claims very high accuracy (99.8%) for countries but there might be regions where this is lower.
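The two regex gaps noted above (startpage.com and suche.t-online.de) could be closed along these lines. This is only a sketch: CURRENT_STYLE is reconstructed from the description "expects .startpage." rather than copied from the code, and both proposed patterns are hypothetical.

```python
import re

# Assumed shape of the existing pattern: requires a dot before "startpage".
CURRENT_STYLE = re.compile(r"\.startpage\.")

# Hypothetical redesign: also accept "startpage" at the start of the host.
PROPOSED_STARTPAGE = re.compile(r"(^|\.)startpage\.")

# Hypothetical extension that also treats German "suche" as a search label.
SEARCH_OR_SUCHE = re.compile(r"(^|\.)(search|suche)\.")

for host in ("startpage.com", "www.startpage.com", "suche.t-online.de"):
    print(
        host,
        bool(CURRENT_STYLE.search(host)),
        bool(PROPOSED_STARTPAGE.search(host)),
        bool(SEARCH_OR_SUCHE.search(host)),
    )
```

Anchoring on (^|\.) keeps the pattern from matching hosts that merely end in the same letters, such as notstartpage.com.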
 Existing search engine regexes: https://github.com/wikimedia/analytics-refinery-source/blob/81744162364493d65ad746ab500f0302c0080ac6/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/SearchEngine.java#L25
 Notebook analysis of referrer data: stat1004:/home/isaacj/notebooks/Search_Engine_Traffic.ipynb
 Wikimedia UA Parser code: https://github.com/wikimedia/analytics-refinery-source/blob/81744162364493d65ad746ab500f0302c0080ac6/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/UAParser.java
 Standard regexes for user agents: https://raw.githubusercontent.com/ua-parser/uap-core/master/regexes.yaml