Release dataset on top search engine referrers by country, device, and language
Closed, ResolvedPublic

Assigned To

Authored By

	Isaac
	Dec 14 2020, 10:36 PM

Description

The goal is to release a public dataset of daily search engine referrals to Wikipedia that is ideally sliced by country, language, and device. Currently, there are public datasets on how much daily traffic overall comes from search engines [1] and how much daily traffic comes from different device types [2]; soon there will be a dataset on daily top-viewed pages by country [3]; and the early clickstream datasets contained information about referrals by article from a few specific search engines [4]. There is currently no data, though, on what the top referrers to Wikimedia sites are. Given that Wikipedia is a globally popular website, this dataset would provide valuable insight into the platforms that individuals use to find knowledge on the web.

Temporal

Ideally this will be a daily dataset. If the privacy restrictions make this dataset largely non-useful -- i.e. we have to throw out too much data -- a monthly dataset might (also) be useful.

Facets

The ideal set of facets is three (though device is actually two):

Country -- i.e., where the reader is accessing Wikipedia from (based on their IP address)
Device -- both OS (e.g., iOS, Linux, Windows) and browser family (e.g., Chrome, Safari, Firefox) as in this dashboard. For privacy/data reasons, we just need browser_family and os_family but can aggregate all browser/OS versions together.
Language -- assuming we just focus on Wikipedia, this would be the associated language (e.g., "en" for English Wikipedia and no data on Wiktionary etc.). We could expand to include other projects too and then it would be project (e.g., en.wikipedia, en.wiktionary, etc.).

Identifying search engines from referrer data

With the current system we use for classifying referrers on pageviews, there are a series of ~20 regexes that are matched to known search engines as well as a catch-all regex that should identify search engines that are not explicitly named but use a search-engine-ish URL pattern [5]. It should be largely trivial (though what ever is?) to just reapply these regexes to building a more fine-grained dataset of top search engines for public release. My sense (see below in limitations) is that this set of 20 is still pretty globally complete.

Privacy

Following the example set by T207171, the initial thought is to apply a few filters:

Potentially remove certain countries where this data is deemed more sensitive
Enforce a minimum number of pageviews or unique devices to be included on the list
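
As a minimal sketch of how the two filters above could combine, assuming the data has already been aggregated into one row per group. The deny list entry, the threshold value, and the function name `releasable` are all hypothetical placeholders, not decisions made in this task:

```python
# Hypothetical inputs: the deny list and the threshold are placeholders,
# not values decided in this task.
DENIED_COUNTRIES = {"Country A"}
MIN_REFERRALS = 500


def releasable(rows, denied=DENIED_COUNTRIES, min_referrals=MIN_REFERRALS):
    """Keep rows that pass both privacy filters.

    rows: iterable of dicts with 'country' and 'num_referrals' keys.
    """
    return [
        row for row in rows
        if row["country"] not in denied
        and row["num_referrals"] >= min_referrals
    ]


# Toy rows with invented numbers, for illustration only.
rows = [
    {"country": "Country A", "search_engine": "Google", "num_referrals": 9000},
    {"country": "Country B", "search_engine": "Google", "num_referrals": 750},
    {"country": "Country B", "search_engine": "Bing", "num_referrals": 40},
]
print(releasable(rows))
```

With these toy rows only the second row survives: the first is dropped by the deny list and the third by the minimum count.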

Caveats / Limitations

Referrer information is incomplete because browsers / apps still haven't really figured this out. It's harder to know how much this affects Search traffic but apps are where the problem tends to be most salient and, for example, I was able to determine at one point that 40% of Youtube traffic comes in without referrer information (T195880#6207748). There really is no easy fix for this.
I just checked the list of top external referers not identified as search engines (data from Dec 10th [6]) and the only things I'd consider changing are:
- startpage.com is not currently caught by the startpage regex (which expects .startpage.) and so that regex should probably be redesigned to capture that.
- suche.t-online.de isn't caught because it uses the German word for search
I also checked what falls under PREDICTED_OTHER (regex: "(^.?|(?<!re)|(^|\\.)(pre|secure))search") in the currently identified search engines to see if there are any large search engines that should be bumped to their own regex. None exceeded 20k pageviews on December 10th [6] and the only one that I saw that seemed country-specific and therefore potentially important to include is ukr.net but someone with better knowledge of Ukraine would have to look into that to see if it should be included.
We can optionally expand this to some pre-defined set of non-search-engines as well (e.g., Facebook, Reddit, Twitter, Youtube could reasonably be added like we do in the Social Media Traffic Report)
Browser family data largely depends on a standard Java user_agent parser library [7] that I believe pulls from a widely used public YAML file [8], so anything not listed there won't be picked up. This is clearest with, e.g., browsers on Android that build on Chrome's generic WebView API (and will therefore be labeled as Chrome WebView).
The usual error around geocoding IP addresses. MaxMind claims very high accuracy (99.8%) for countries but there might be regions where this is lower [9].
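
The startpage.com caveat above can be checked with a few lines of Python. This is only a sketch: it assumes the pattern is applied to the hostname of the referer, and `PROPOSED_STARTPAGE` is a hypothetical redesign for illustration, not the regex deployed in refinery:

```python
import re
from urllib.parse import urlparse

# Pattern quoted in this task for Startpage: requires a dot on both sides.
CURRENT_STARTPAGE = re.compile(r"\.(startpage|ixquick)\.")
# Hypothetical replacement (an assumption, not the deployed regex):
# also accept the name at the very start of the hostname.
PROPOSED_STARTPAGE = re.compile(r"(^|\.)(startpage|ixquick)\.")


def host_of(referer):
    """Return the hostname of a referer URL, or '' if it has none."""
    return urlparse(referer).hostname or ""


def matches(pattern, referer):
    """True if the pattern is found in the referer's hostname."""
    return pattern.search(host_of(referer)) is not None


for url in ("https://startpage.com/do/search",
            "https://www.startpage.com/do/search"):
    print(url, matches(CURRENT_STARTPAGE, url), matches(PROPOSED_STARTPAGE, url))
```

The bare `startpage.com` host has no leading dot, so the current pattern misses it, while the `www.` variant is matched by both.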

References

[1] https://discovery.wmflabs.org/external/
[2] https://analytics.wikimedia.org/dashboards/browsers/#mobile-site-by-browser
[3] T207171
[4] https://figshare.com/articles/Wikipedia_Clickstream/1305770
[5] Existing search engine regexes: https://github.com/wikimedia/analytics-refinery-source/blob/81744162364493d65ad746ab500f0302c0080ac6/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/SearchEngine.java#L25
[6] Notebook analysis of referrer data: stat1004:/home/isaacj/notebooks/Search_Engine_Traffic.ipynb
[7] Wikimedia UA Parser code: https://github.com/wikimedia/analytics-refinery-source/blob/81744162364493d65ad746ab500f0302c0080ac6/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/UAParser.java
[8] Standard regexes for user agents: https://raw.githubusercontent.com/ua-parser/uap-core/master/regexes.yaml
[9] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Geolocation#Data

Details

	Subject	Repo	Branch	Lines +/-
	Add daily referrers Hive table and Oozie job	analytics/refinery	master	+635 -0


Related Objects

Mentioned In: T348504: [Data Platform] Update referer job to use global country deny list instead of a hard-coded one
T112284: Create new table for 'referer' aggregated data
T228802: External traffic breakdown in Druid/Turnilo/Superset
Mentioned Here: T195880: % of "none" referers seems too high
T207171: Have a way to show the most popular pages per country

Event Timeline

Isaac renamed this task from Release dataset on top search engine referrers by country, OS, and language to Release dataset on top search engine referrers by country, device, and language. Dec 14 2020, 10:36 PM

Isaac created this task.

Some data from December 10th to help us think about privacy. Raw data can be found in isaacj.search_engine_data in Hive and data pipeline in stat1004:/home/isaacj/notebooks/Search_Engine_Traffic.ipynb. Specifically looking at how much data we'd have for each country if our daily threshold was at least 500 pageviews. I'll try to provide some additional analyses on how different privacy thresholds and k values (e.g., only reporting top 100) affect how much data is made available.

list_size: number of triplets of language, device (browser family), and search engine (e.g., Google) that exceed 500 pageviews threshold for that country and that day
langs: number of different Wikipedia languages that were in at least one triplet for that country -- e.g., en for English, fr for French
devices: number of unique devices (browser family) that were in at least one triplet for that country -- e.g., Chrome Mobile, Safari, etc.
search_engines: number of unique search engines that were in at least one triplet for that country -- e.g., Google, Bing, DuckDuckGo. This can also include search engines that aren't on our regex list but match the generic `search` regex such as search.pch.com, but these are quite rare.
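
For reference, the four columns defined above could be derived roughly as follows. This is a sketch, not the notebook code: it assumes the input has one row per (country, language, device, search engine) with its pageview count, treats the threshold as "at least 500", and uses invented toy numbers:

```python
from collections import defaultdict


def summarize(rows, threshold=500):
    """Compute list_size, langs, devices, and search_engines per country.

    rows: iterable of (country, lang, device, search_engine, pageviews),
    assumed to be already aggregated to one row per unique combination.
    """
    kept = defaultdict(list)
    for country, lang, device, engine, pageviews in rows:
        if pageviews >= threshold:
            kept[country].append((lang, device, engine))

    summary = {}
    for country, triplets in kept.items():
        summary[country] = {
            "list_size": len(triplets),
            "langs": len({t[0] for t in triplets}),
            "devices": len({t[1] for t in triplets}),
            "search_engines": len({t[2] for t in triplets}),
        }
    return summary


# Toy rows with invented numbers, for illustration only.
toy_rows = [
    ("Exampleland", "en", "Chrome", "Google", 1200),
    ("Exampleland", "en", "Safari", "Google", 800),
    ("Exampleland", "fr", "Chrome", "Bing", 499),
    ("Exampleland", "de", "Chrome", "Google", 500),
]
print(summarize(toy_rows))
```

In the toy data the French/Bing row falls one pageview short, so it contributes to none of the counts.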

+--------------------------------+---------+-----+-------+--------------+
|country                         |list_size|langs|devices|search_engines|
+--------------------------------+---------+-----+-------+--------------+
|United States                   |349      |43   |33     |17            |
|Germany                         |324      |36   |31     |11            |
|France                          |221      |25   |25     |12            |
|United Kingdom                  |209      |28   |27     |11            |
|Japan                           |190      |14   |28     |19            |
|Spain                           |167      |23   |19     |10            |
|India                           |161      |30   |26     |8             |
|Netherlands                     |160      |22   |19     |8             |
|Canada                          |160      |23   |22     |9             |
|Italy                           |160      |21   |25     |9             |
|Russia                          |140      |21   |30     |5             |
|Switzerland                     |134      |17   |17     |7             |
|Belgium                         |119      |15   |16     |6             |
|Poland                          |117      |14   |23     |8             |
|Sweden                          |116      |19   |17     |6             |
|Austria                         |114      |20   |18     |7             |
|Ukraine                         |104      |12   |23     |5             |
|South Korea                     |104      |11   |17     |7             |
|Australia                       |95       |17   |18     |8             |
|Czechia                         |93       |12   |19     |7             |
|Hong Kong                       |88       |11   |20     |4             |
|Brazil                          |84       |9    |21     |7             |
|Finland                         |83       |13   |17     |6             |
|Turkey                          |83       |11   |19     |5             |
|Serbia                          |81       |11   |17     |4             |
|Indonesia                       |81       |17   |18     |8             |
|Mexico                          |78       |8    |21     |8             |
|Philippines                     |78       |12   |19     |8             |
|Norway                          |77       |13   |16     |5             |
|Malaysia                        |76       |11   |18     |5             |
|Romania                         |74       |12   |16     |5             |
|Israel                          |72       |8    |17     |7             |
|Taiwan                          |70       |9    |18     |4             |
|Denmark                         |69       |13   |15     |5             |
|Singapore                       |68       |16   |19     |5             |
|Greece                          |66       |13   |16     |5             |
|Portugal                        |63       |7    |16     |5             |
|Ireland                         |63       |13   |18     |5             |
|Kazakhstan                      |60       |4    |17     |3             |
|Bulgaria                        |57       |9    |16     |4             |
|Croatia                         |56       |8    |16     |3             |
|Hungary                         |56       |9    |16     |4             |
|Slovakia                        |55       |7    |15     |3             |
|Vietnam                         |55       |7    |20     |4             |
|Iran                            |53       |9    |21     |3             |
|Egypt                           |53       |9    |18     |3             |
|Thailand                        |53       |7    |16     |4             |
|Argentina                       |53       |6    |19     |7             |
|Morocco                         |50       |5    |16     |3             |
|Unknown                         |47       |13   |14     |4             |
|Azerbaijan                      |47       |4    |16     |2             |
|United Arab Emirates            |44       |11   |17     |4             |
|Saudi Arabia                    |42       |7    |14     |3             |
|Slovenia                        |42       |9    |13     |3             |
|Belarus                         |42       |5    |18     |2             |
|Republic of Lithuania           |41       |5    |13     |5             |
|Georgia                         |40       |7    |11     |4             |
|Estonia                         |39       |5    |12     |4             |
|Bangladesh                      |39       |4    |18     |3             |
|Bosnia and Herzegovina          |39       |6    |10     |2             |
|Algeria                         |39       |4    |17     |3             |
|Latvia                          |39       |4    |12     |5             |
|Colombia                        |38       |5    |16     |5             |
|Republic of Moldova             |38       |5    |12     |2             |
|Uzbekistan                      |38       |4    |16     |2             |
|Chile                           |37       |5    |15     |5             |
|Luxembourg                      |37       |6    |10     |3             |
|Peru                            |35       |4    |17     |5             |
|North Macedonia                 |35       |8    |10     |2             |
|Albania                         |34       |8    |11     |3             |
|South Africa                    |33       |4    |17     |5             |
|New Zealand                     |33       |4    |16     |5             |
|Pakistan                        |32       |5    |19     |4             |
|Tunisia                         |31       |4    |11     |3             |
|Armenia                         |30       |3    |11     |2             |
|Ecuador                         |29       |2    |14     |5             |
|Kyrgyzstan                      |28       |6    |12     |2             |
|Iraq                            |28       |6    |12     |2             |
|Dominican Republic              |27       |3    |12     |4             |
|Venezuela                       |25       |2    |16     |4             |
|Hashemite Kingdom of Jordan     |25       |3    |11     |3             |
|Sri Lanka                       |24       |3    |16     |3             |
|Nigeria                         |24       |4    |18     |2             |
|Lebanon                         |24       |4    |9      |2             |
|Panama                          |23       |2    |12     |4             |
|Myanmar                         |23       |3    |14     |2             |
|Costa Rica                      |22       |2    |12     |2             |
|Nepal                           |22       |4    |16     |3             |
|Mongolia                        |22       |3    |10     |2             |
|Kuwait                          |22       |3    |11     |2             |
|Cyprus                          |21       |3    |9      |2             |
|Kenya                           |20       |2    |16     |3             |
|Montenegro                      |20       |6    |6      |1             |
|Puerto Rico                     |20       |2    |9      |2             |
|Oman                            |20       |3    |9      |2             |
|Qatar                           |20       |3    |11     |2             |
|Bolivia                         |19       |2    |14     |2             |
|Bahrain                         |19       |2    |9      |2             |
|Cameroon                        |18       |2    |12     |2             |
|Uruguay                         |18       |2    |12     |2             |
|Cambodia                        |18       |5    |9      |2             |
|China                           |18       |3    |8      |2             |
|Ghana                           |17       |1    |16     |2             |
|Iceland                         |17       |3    |8      |3             |
|Paraguay                        |17       |3    |11     |2             |
|Tanzania                        |17       |2    |13     |2             |
|Ivory Coast                     |17       |3    |12     |2             |
|Guatemala                       |16       |2    |10     |2             |
|Sudan                           |16       |2    |9      |1             |
|Angola                          |16       |3    |11     |2             |
|Syria                           |15       |3    |9      |1             |
|Honduras                        |15       |2    |10     |2             |
|DR Congo                        |15       |2    |10     |1             |
|Senegal                         |14       |4    |9      |2             |
|El Salvador                     |14       |2    |10     |2             |
|Macao                           |14       |2    |8      |2             |
|Ethiopia                        |14       |2    |11     |3             |
|Mozambique                      |12       |2    |8      |1             |
|Haiti                           |12       |3    |8      |1             |
|Libya                           |12       |2    |8      |1             |
|Zambia                          |12       |1    |12     |2             |
|Palestine                       |12       |3    |8      |1             |
|Malta                           |12       |2    |9      |2             |
|Uganda                          |12       |1    |12     |2             |
|Tajikistan                      |11       |3    |6      |1             |
|Cuba                            |11       |2    |8      |1             |
|Réunion                         |11       |2    |9      |2             |
|Mauritius                       |11       |2    |8      |2             |
|Madagascar                      |10       |2    |8      |1             |
|Zimbabwe                        |10       |1    |10     |2             |
|Trinidad and Tobago             |10       |1    |9      |2             |
|Yemen                           |10       |2    |7      |1             |
|Brunei                          |10       |2    |8      |1             |
|Jamaica                         |10       |1    |9      |2             |
|Nicaragua                       |9        |2    |6      |1             |
|Kosovo                          |9        |4    |3      |1             |
|Laos                            |9        |4    |4      |1             |
|Benin                           |9        |2    |7      |1             |
|Afghanistan                     |8        |2    |4      |1             |
|Turkmenistan                    |8        |4    |4      |1             |
|Martinique                      |8        |2    |7      |2             |
|Suriname                        |8        |2    |4      |1             |
|Togo                            |8        |2    |6      |1             |
|Guadeloupe                      |8        |2    |7      |2             |
|Mali                            |7        |3    |5      |1             |
|Gabon                           |7        |2    |6      |1             |
|Somalia                         |7        |3    |5      |1             |
|Mauritania                      |7        |3    |4      |1             |
|Andorra                         |7        |3    |4      |1             |
|Rwanda                          |7        |2    |5      |1             |
|Burkina Faso                    |7        |2    |6      |1             |
|New Caledonia                   |6        |1    |6      |1             |
|French Polynesia                |6        |1    |6      |1             |
|Bahamas                         |6        |1    |6      |2             |
|Curaçao                         |6        |2    |3      |1             |
|Guam                            |5        |1    |5      |1             |
|Faroe Islands                   |5        |2    |3      |1             |
|Barbados                        |5        |1    |5      |1             |
|Guinea                          |5        |2    |4      |1             |
|Malawi                          |5        |1    |5      |1             |
|Sierra Leone                    |4        |1    |4      |1             |
|Fiji                            |4        |1    |4      |1             |
|Guyana                          |4        |1    |4      |1             |
|French Guiana                   |4        |1    |4      |1             |
|Botswana                        |4        |1    |4      |1             |
|Monaco                          |4        |2    |2      |1             |
|Namibia                         |4        |1    |4      |1             |
|Jersey                          |4        |1    |4      |1             |
|Liberia                         |4        |1    |4      |1             |
|Isle of Man                     |4        |1    |4      |1             |
|Niger                           |4        |2    |3      |1             |
|Bermuda                         |4        |1    |4      |1             |
|Guernsey                        |4        |1    |4      |1             |
|Burundi                         |4        |1    |4      |1             |
|Cabo Verde                      |4        |2    |3      |1             |
|Maldives                        |4        |1    |4      |1             |
|Liechtenstein                   |4        |1    |4      |1             |
|Åland                           |4        |2    |3      |1             |
|Cayman Islands                  |3        |1    |3      |1             |
|Bhutan                          |3        |1    |3      |1             |
|Gambia                          |3        |1    |3      |1             |
|U.S. Virgin Islands             |3        |1    |3      |1             |
|Saint Lucia                     |3        |1    |3      |1             |
|Mayotte                         |3        |1    |3      |1             |
|Aruba                           |3        |1    |3      |1             |
|Gibraltar                       |3        |1    |3      |1             |
|Eswatini                        |3        |1    |3      |1             |
|Belize                          |3        |1    |3      |1             |
|East Timor                      |3        |3    |1      |1             |
|Sint Maarten                    |3        |1    |3      |1             |
|Lesotho                         |3        |1    |3      |1             |
|Congo Republic                  |3        |1    |3      |1             |
|San Marino                      |3        |1    |3      |1             |
|Antigua and Barbuda             |3        |1    |3      |1             |
|Eritrea                         |3        |1    |3      |1             |
|Djibouti                        |3        |2    |2      |1             |
|Papua New Guinea                |2        |1    |2      |1             |
|Saint Vincent and the Grenadines|2        |1    |2      |1             |
|Grenada                         |2        |1    |2      |1             |
|South Sudan                     |2        |1    |2      |1             |
|Seychelles                      |2        |1    |2      |1             |
|Chad                            |2        |2    |1      |1             |
|Dominica                        |2        |1    |2      |1             |
|Central African Republic        |1        |1    |1      |1             |
|Equatorial Guinea               |1        |1    |1      |1             |
|Turks and Caicos Islands        |1        |1    |1      |1             |
|Anguilla                        |1        |1    |1      |1             |
|St Kitts and Nevis              |1        |1    |1      |1             |
|Comoros                         |1        |1    |1      |1             |
|São Tomé and Príncipe           |1        |1    |1      |1             |
|Solomon Islands                 |1        |1    |1      |1             |
|Northern Mariana Islands        |1        |1    |1      |1             |
|Guinea-Bissau                   |1        |1    |1      |1             |
+--------------------------------+---------+-----+-------+--------------+

JAllemandou subscribed. Dec 15 2020, 8:38 AM

We groomed this today, and here are our thoughts:

We think it'll take about a week to write the job, a week to deploy and publish the data, and at least two weeks to work out the privacy implications (probably more like a month).

We don't have time to focus on this until next quarter at the earliest, but we're happy to show Fabian around so he can get started earlier.

Our recommendation is that we start with Privacy and let James run this new dataset through the privacy process.

Isaac updated the task description. Jan 4 2021, 8:08 PM

Isaac updated the task description.

fkaelin subscribed. Jan 7 2021, 4:19 PM

Thanks @Milimetric -- @bmansurov will be leading the technical work on this so we're going to start work on this and greatly appreciate whatever code review / support Analytics is able to provide along the way. My assumption is that this will be a reportupdater query like the existing browser/OS stats but I'm largely ambivalent about where/how the data is generated.

I'll start working on some examples to help with determining thresholds for inclusion and the privacy review.

Thanks @Isaac, hi @bmansurov!! Actually, this should be an oozie job. They're a bit more of a pain to write, but I can help with that. The major benefit is that we get alerts if the pipeline breaks or gets stuck, and it's easier to rerun and backfill. Do ping me everywhere if I'm not responsive enough.

JFishback_WMF added a project: Privacy Engineering. Jan 11 2021, 8:44 PM

o/ @Milimetric!

I think I created an Oozie job once ;) Would you say this is a good starting point: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Oozie? Any other links? Maybe you can share a similar job that I can take an inspiration from?

@bmansurov that's a good overview but it can be too detailed and not all of it is relevant. My suggestion is to look at the pageview hourly job, because you'll be writing something very similar. You're basically depending on the pageview_actor dataset, and transforming the data. That's what this job does: https://github.com/wikimedia/analytics-refinery/tree/master/oozie/pageview/hourly The xml is just setting up that dependency and this HQL query does the transformation: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/pageview/hourly/pageview_hourly.hql

Ping me if you have Q

I might be slightly ahead but I wanted to give an update on where we are with the queries / privacy analyses. This first post focuses on the dataset generation code. I think it's pretty straightforward (thanks to wmf.pageview_actor) and just requires a few choices to be made around privacy (future posts will address that). Basic code below and some notes:

# Basic data schema
# I leave `num_referrals` kinda vague. It either can be:
# * total pageviews from a referral source
# * total # unique "actors" from a referral source

CREATE TABLE IF NOT EXISTS isaacj.search_engine_data (
        country          STRING  COMMENT 'Reader country per IP geolocation',
        lang             STRING  COMMENT 'Wikipedia language -- e.g., en for English',
        browser_fam      STRING  COMMENT 'Browser family from user-agent',
        os_family        STRING  COMMENT 'OS family from user-agent',
        search_engine    STRING  COMMENT 'One of 20 standard search engines (e.g., Google) or host of referer URL if not listed',
        num_referrals    INT     COMMENT 'Number of referrals from this source'
    )
    PARTITIONED BY (
        year             INT     COMMENT 'Unpadded year of request',
        month            INT     COMMENT 'Unpadded month of request',
        day              INT     COMMENT 'Unpadded day of request')

# Search engine regexes to map raw referrer URLs to canonical search engine names
# I imagine this will be replaced with IdentifySearchEngine: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/SearchEngineClassifier.java

import re

regexes = {"Google": "google\\.",
      "Yahoo": "search\\.yahoo\\.",
      "Bing": "\\.bing\\.",
      "Yandex": "yandex\\.",
      "Baidu": "\\.baidu\\.",
      "DuckDuckGo": "duckduckgo\\.",
      "Ecosia": "\\.ecosia\\.",
      "Startpage": "\\.(startpage|ixquick)\\.",
      "Naver": "search\\.naver\\.",
      "Docomo": "\\.docomo\\.",
      "Qwant": "qwant\\.",
      "Daum": "search\\.daum\\.",
      "MyWay": "search\\.myway\\.",
      "Seznam": "\\.seznam\\.",
      "AU": "search\\.auone\\.",
      "Ask": "\\.ask\\.",
      "Lilo": "\\.lilo\\.",
      "Coc Coc": "coccoc\\.",
      "AOL": "search\\.aol\\.",
      "Rakuten": "\\.rakuten\\."
}
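# Compile each pattern once so the UDF does not recompile it for every row.
regexes = {site: re.compile(pattern) for site, pattern in regexes.items()}

def getSE(referer):
    # Return the first search engine whose pattern matches the referer URL.
    if referer is None:
        return 'other'
    for site, r in regexes.items():
        if r.search(referer):
            return site
    return 'other'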
    
spark.udf.register('getSE', getSE, 'String')

# Pipeline
# COUNT(1) AS num_referrals can be replaced with COUNT(DISTINCT(actor)) AS num_referrals
# This would likely be for privacy reasons but it's also not bad from a data standpoint to curb the influence of power users on this dataset.
# Right now, because actors with >800 pageviews are identified as `automated`, the maximum referrals a single user could contribute to this dataset each day is 800.

WITH search_engine_referrals AS (
    SELECT
      geocoded_data['country'] AS country,
      normalized_host.project AS lang,
      user_agent_map['browser_family'] AS browser_fam,
      user_agent_map['os_family'] AS os_family,
      getSE(referer) AS search_engine,
      actor_signature_per_project_family AS actor
    FROM wmf.pageview_actor
    WHERE
      year = {YEAR}
      AND month = {MONTH}
      AND day = {DAY}
      AND is_pageview
      AND agent_type = 'user'
      AND referer_class = 'external (search engine)'
      AND normalized_host.project_class = 'wikipedia'
)
INSERT OVERWRITE TABLE isaacj.search_engine_data
PARTITION(year={YEAR}, month={MONTH}, day={DAY})
SELECT
  country,
  lang,
  browser_fam,
  os_family,
  search_engine,
  COUNT(1) AS num_referrals
FROM search_engine_referrals
GROUP BY
  country,
  lang,
  browser_fam,
  os_family,
  search_engine
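
The difference between the two counting options mentioned in the pipeline comments, COUNT(1) versus COUNT(DISTINCT(actor)), can be illustrated outside Hive with a small sketch. The function name, the group key, and the actor IDs below are invented for illustration:

```python
from collections import Counter, defaultdict


def count_referrals(events):
    """Count pageviews and distinct actors per group.

    events: iterable of (group_key, actor) pairs, one per pageview.
    Returns (pageviews_per_group, distinct_actors_per_group).
    """
    pageviews = Counter()
    actors = defaultdict(set)
    for key, actor in events:
        pageviews[key] += 1      # analogous to COUNT(1)
        actors[key].add(actor)   # analogous to COUNT(DISTINCT(actor))
    return dict(pageviews), {key: len(ids) for key, ids in actors.items()}


# Toy events: one heavy reader (a1) and one light reader (a2).
group = ("Exampleland", "en", "Chrome", "Linux", "Google")
events = [(group, "a1")] * 3 + [(group, "a2")]
print(count_referrals(events))
```

Here the group has 4 pageviews but only 2 distinct actors, which is the sense in which counting actors curbs the influence of power users.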

Regarding privacy, there are always various options for how to implement this that James will help guide. This dataset is perfect for differential privacy but unfortunately I assume we're not ready yet to apply it. In the future though, I'd love to come back to do that. In the meantime, I'm going to assume we're using our most straightforward approach of defining a threshold of data that a datapoint must exceed to be released -- e.g., at least 1000 pageviews to a given country+lang+browser+OS+search engine for that data point to be released. This is a simple thing to enforce in the data pipeline -- e.g., add HAVING COUNT(1) >= 1000 after the GROUP BY clause of the example pipeline above. This is an update from an earlier analysis above when we were just considering browser family and not OS (so the data was more aggregated).

I show one reasonable threshold below (1000) but can easily look into others and report aggregate statistics on how much data is released based on that threshold. To me, we get the vast majority of data at a threshold like 1000 (good) but there are a lot of browsers/OSes that only show up if we e.g., drop the threshold to 100. For example, at a threshold of 100, the United States data goes from 31 to 62 unique languages represented, 29 to 47 unique browsers represented, 12 to 16 unique OSes represented, and 12 to 14 unique search engines represented. This was based on one day's worth of data in January so obviously the numbers would shift with other days but I think it's as good as any dataset to use for making design decisions. Data columns:

list_size: how many unique sets of lang+browser+OS+searchengine are in the resulting dataset for a given country
pct_pv_data: what percentage of all search engine referrals are covered by the data release
langs: how many unique languages make the list -- e.g., enwiki, frwiki. There are 93 at this threshold.
browsers: how many unique browser families make the list -- e.g., Mobile Safari, Edge. There are 41 at this threshold.
OSes: how many unique OSes make the list -- e.g., Android, Linux. There are 12 at this threshold.
search_engines: how many unique search engines make the list -- e.g., Google, Yahoo. There are 21 at this threshold.

+---------------------------+---------+-----------+-----+--------+----+--------------+
|country                    |list_size|pct_pv_data|langs|browsers|OSes|search_engines|
+---------------------------+---------+-----------+-----+--------+----+--------------+
|United States              |304      |99.4       |31   |29      |12  |12            |
|Germany                    |283      |98.4       |28   |23      |10  |10            |
|France                     |200      |98.5       |17   |21      |10  |11            |
|United Kingdom             |174      |98.6       |23   |22      |10  |9             |
|Japan                      |162      |99.4       |11   |26      |8   |10            |
|Spain                      |141      |98.0       |20   |18      |7   |6             |
|Canada                     |135      |98.1       |16   |18      |8   |8             |
|Russia                     |134      |98.9       |13   |26      |9   |5             |
|Netherlands                |134      |96.2       |15   |18      |7   |7             |
|Italy                      |128      |98.8       |14   |20      |8   |7             |
|Switzerland                |116      |94.2       |10   |17      |7   |6             |
|India                      |112      |99.2       |21   |20      |10  |8             |
|Belgium                    |104      |95.1       |13   |13      |7   |5             |
|Poland                     |100      |98.6       |9    |20      |9   |5             |
|Ukraine                    |94       |98.4       |9    |20      |8   |4             |
|Sweden                     |93       |96.3       |12   |17      |7   |4             |
|Austria                    |90       |95.0       |15   |17      |6   |6             |
|South Korea                |86       |96.8       |9    |15      |6   |6             |
|Czechia                    |82       |95.9       |10   |17      |7   |5             |
|Brazil                     |77       |98.9       |7    |19      |8   |7             |
|Australia                  |77       |97.6       |10   |18      |7   |6             |
|Indonesia                  |72       |99.1       |11   |18      |6   |6             |
|Mexico                     |72       |99.0       |5    |19      |7   |8             |
|Serbia                     |71       |96.0       |9    |16      |6   |4             |
|Hong Kong                  |69       |96.7       |7    |16      |6   |3             |
|Malaysia                   |69       |97.6       |9    |18      |6   |5             |
|Finland                    |68       |96.1       |7    |15      |7   |4             |
|Romania                    |68       |97.3       |10   |16      |6   |5             |
|Turkey                     |65       |97.8       |8    |18      |5   |5             |
|Taiwan                     |62       |98.4       |6    |18      |6   |3             |
|Israel                     |61       |97.0       |6    |15      |6   |5             |
|Philippines                |61       |98.3       |7    |19      |6   |8             |
|Norway                     |59       |93.8       |8    |13      |7   |4             |
|Greece                     |58       |96.9       |9    |16      |6   |4             |
|Kazakhstan                 |57       |98.8       |3    |16      |5   |2             |
|Hungary                    |55       |97.1       |7    |16      |6   |4             |
|Portugal                   |55       |95.8       |7    |14      |6   |4             |
|Vietnam                    |51       |97.4       |5    |16      |6   |4             |
|Iran                       |51       |99.1       |5    |20      |6   |2             |
|Denmark                    |51       |93.0       |8    |13      |7   |4             |
|Singapore                  |50       |94.0       |7    |17      |7   |4             |
|Bulgaria                   |49       |96.1       |6    |16      |7   |3             |
|Slovakia                   |49       |94.4       |6    |14      |5   |2             |
|Croatia                    |47       |94.3       |7    |13      |5   |3             |
|Argentina                  |46       |98.4       |5    |17      |6   |5             |
|Egypt                      |44       |96.8       |6    |16      |4   |3             |
|Thailand                   |40       |97.3       |4    |15      |4   |3             |
|Ireland                    |38       |93.7       |7    |14      |7   |5             |
|Morocco                    |37       |94.5       |5    |12      |4   |3             |
|Bangladesh                 |37       |97.4       |3    |18      |6   |3             |
|Latvia                     |36       |93.3       |3    |10      |4   |2             |
|Belarus                    |36       |95.0       |3    |16      |5   |2             |
|Saudi Arabia               |36       |98.0       |4    |13      |4   |3             |
|Slovenia                   |35       |92.3       |5    |12      |5   |3             |
|Republic of Lithuania      |34       |93.6       |4    |11      |4   |3             |
|Georgia                    |34       |93.5       |6    |10      |4   |4             |
|United Arab Emirates       |34       |93.1       |6    |13      |5   |3             |
|Colombia                   |34       |97.9       |2    |15      |6   |3             |
|Algeria                    |33       |95.6       |5    |12      |3   |2             |
|Estonia                    |32       |91.6       |3    |11      |4   |2             |
|South Africa               |31       |95.3       |2    |17      |8   |3             |
|Pakistan                   |31       |97.9       |4    |17      |7   |3             |
|Peru                       |30       |98.3       |2    |13      |6   |4             |
|Chile                      |30       |97.5       |2    |13      |6   |3             |
|Venezuela                  |29       |97.5       |2    |15      |7   |4             |
|Bosnia and Herzegovina     |29       |90.8       |6    |8       |3   |1             |
|Azerbaijan                 |29       |94.3       |4    |9       |4   |1             |
|Unknown                    |28       |72.6       |8    |12      |4   |3             |
|New Zealand                |28       |94.3       |2    |14      |7   |3             |
|Republic of Moldova        |26       |90.6       |3    |10      |3   |2             |
|Dominican Republic         |25       |97.1       |3    |12      |5   |4             |
|Nigeria                    |24       |97.1       |4    |17      |5   |2             |
|Luxembourg                 |24       |74.4       |3    |8       |4   |1             |
|Uzbekistan                 |24       |91.6       |4    |11      |3   |2             |
|Tunisia                    |23       |91.2       |4    |9       |4   |3             |
|Puerto Rico                |23       |94.4       |2    |9       |4   |2             |
|Ecuador                    |22       |97.2       |2    |10      |5   |4             |
|Hashemite Kingdom of Jordan|22       |95.0       |3    |9       |4   |2             |
|Costa Rica                 |21       |93.6       |2    |10      |4   |2             |
|Albania                    |21       |89.9       |4    |8       |3   |3             |
|Sri Lanka                  |21       |93.1       |3    |13      |4   |3             |
|Iraq                       |21       |93.3       |5    |8       |4   |1             |
|Lebanon                    |20       |89.6       |3    |9       |4   |2             |
|North Macedonia            |20       |86.7       |6    |7       |3   |1             |
|Kuwait                     |20       |94.3       |2    |9       |4   |2             |
|Kenya                      |19       |93.6       |2    |13      |5   |2             |
|Nepal                      |19       |93.7       |3    |13      |4   |2             |
|Panama                     |19       |94.8       |2    |10      |4   |2             |
|Uruguay                    |18       |94.2       |2    |10      |5   |1             |
|Myanmar                    |18       |85.8       |2    |13      |4   |2             |
|Oman                       |18       |92.4       |2    |8       |4   |2             |
|Kyrgyzstan                 |18       |93.8       |4    |9       |3   |2             |
|Armenia                    |17       |90.6       |3    |6       |4   |1             |
|Mongolia                   |17       |88.0       |3    |7       |4   |1             |
|Qatar                      |16       |87.3       |2    |9       |4   |2             |
|Cyprus                     |16       |81.2       |3    |8       |4   |1             |
|Ghana                      |15       |93.5       |1    |12      |5   |1             |
|Guatemala                  |14       |92.9       |2    |9       |4   |1             |
|Iceland                    |14       |81.3       |3    |7       |4   |1             |
|Syria                      |14       |92.0       |3    |8       |3   |1             |
|Bolivia                    |14       |93.9       |2    |9       |3   |2             |
|Montenegro                 |13       |81.9       |5    |4       |3   |1             |
|Honduras                   |13       |91.4       |2    |8       |3   |2             |
|China                      |13       |78.0       |3    |7       |4   |2             |
|Bahrain                    |13       |86.9       |2    |7       |4   |1             |
|Ivory Coast                |13       |89.7       |2    |10      |4   |1             |
|Tanzania                   |13       |84.6       |2    |10      |4   |1             |
|Macao                      |12       |84.8       |2    |6       |4   |1             |
|Sudan                      |12       |87.1       |2    |7       |4   |1             |
|Senegal                    |12       |85.2       |3    |8       |3   |1             |
|Palestine                  |12       |92.1       |2    |8       |3   |1             |
|Cambodia                   |12       |78.0       |3    |7       |4   |1             |
|Angola                     |12       |82.9       |2    |7       |4   |2             |
|El Salvador                |11       |90.1       |2    |7       |3   |1             |
|Cameroon                   |11       |84.5       |2    |8       |4   |1             |
|DR Congo                   |10       |85.9       |2    |8       |4   |1             |
|Libya                      |10       |86.3       |2    |7       |3   |1             |
|Paraguay                   |10       |89.9       |2    |7       |3   |1             |
|Trinidad and Tobago        |9        |88.1       |1    |8       |4   |2             |
|Jamaica                    |9        |87.6       |1    |7       |5   |2             |
|Malta                      |9        |77.4       |2    |7       |4   |2             |
|Kosovo                     |8        |76.5       |3    |3       |3   |1             |
|Tajikistan                 |8        |79.1       |3    |5       |3   |1             |
|Réunion                    |8        |81.0       |1    |7       |4   |2             |
|Ethiopia                   |8        |77.9       |1    |7       |4   |1             |
|Madagascar                 |8        |75.1       |2    |6       |4   |1             |
|Nicaragua                  |7        |87.5       |2    |4       |3   |1             |
|Benin                      |7        |79.4       |2    |5       |4   |1             |
|Laos                       |7        |69.8       |3    |4       |3   |1             |
|Afghanistan                |7        |76.4       |2    |4       |3   |1             |
|Yemen                      |7        |87.5       |2    |5       |3   |1             |
|Mauritius                  |7        |80.2       |2    |5       |3   |1             |
|Mozambique                 |7        |77.5       |2    |4       |3   |1             |
|Turkmenistan               |6        |74.3       |3    |4       |3   |1             |
|Burkina Faso               |6        |77.4       |1    |6       |4   |1             |
|Haiti                      |6        |85.8       |2    |5       |3   |1             |
|Cuba                       |6        |82.4       |2    |5       |3   |1             |
|Suriname                   |6        |73.6       |2    |3       |3   |1             |
|Uganda                     |6        |80.7       |1    |6       |4   |1             |
|Martinique                 |5        |74.6       |1    |5       |4   |1             |
|Zambia                     |5        |77.4       |1    |5       |4   |1             |
|French Polynesia           |5        |69.4       |1    |5       |4   |1             |
|Somalia                    |5        |77.3       |3    |3       |3   |1             |
|Zimbabwe                   |5        |72.5       |1    |5       |3   |1             |
|New Caledonia              |4        |66.4       |1    |4       |3   |1             |
|Brunei                     |4        |69.4       |2    |3       |3   |1             |
|Guadeloupe                 |4        |69.5       |1    |4       |4   |1             |
|Isle of Man                |4        |70.2       |1    |4       |4   |1             |
|Andorra                    |4        |44.2       |1    |4       |4   |1             |
|Jersey                     |4        |67.9       |1    |4       |4   |1             |
|Bahamas                    |3        |68.0       |1    |3       |3   |1             |
|French Guiana              |3        |58.8       |1    |3       |3   |1             |
|Sierra Leone               |3        |64.7       |1    |2       |2   |1             |
|Bhutan                     |3        |80.6       |1    |3       |3   |1             |
|Maldives                   |3        |72.3       |1    |3       |3   |1             |
|Gibraltar                  |3        |58.3       |1    |3       |3   |1             |
|Gabon                      |3        |66.2       |1    |3       |3   |1             |
|Bermuda                    |3        |62.1       |1    |3       |3   |1             |
|Mauritania                 |3        |52.7       |2    |2       |2   |1             |
|Mali                       |3        |62.4       |1    |3       |3   |1             |
|Guam                       |3        |63.7       |1    |3       |3   |1             |
|Barbados                   |3        |67.8       |1    |3       |3   |1             |
|Cayman Islands             |3        |62.8       |1    |3       |3   |1             |
|Eritrea                    |3        |71.0       |1    |2       |2   |1             |
|Namibia                    |3        |68.2       |1    |3       |3   |1             |
|Belize                     |3        |67.6       |1    |3       |3   |1             |
|Botswana                   |3        |75.1       |1    |3       |3   |1             |
|Guyana                     |3        |72.4       |1    |3       |3   |1             |
|Eswatini                   |2        |61.4       |1    |2       |1   |1             |
|Guinea                     |2        |61.2       |1    |2       |2   |1             |
|San Marino                 |2        |48.6       |1    |2       |2   |1             |
|Rwanda                     |2        |49.9       |1    |2       |2   |1             |
|Faroe Islands              |2        |34.4       |1    |2       |2   |1             |
|Cabo Verde                 |2        |61.8       |1    |2       |2   |1             |
|Guernsey                   |2        |44.2       |1    |2       |2   |1             |
|Djibouti                   |2        |50.0       |1    |2       |2   |1             |
|Aruba                      |2        |33.9       |1    |2       |2   |1             |
|Togo                       |2        |51.3       |1    |2       |2   |1             |
|Liechtenstein              |2        |33.0       |1    |2       |2   |1             |
|Fiji                       |2        |66.9       |1    |2       |2   |1             |
|Saint Lucia                |2        |61.0       |1    |2       |2   |1             |
|U.S. Virgin Islands        |2        |45.5       |1    |2       |2   |1             |
|South Sudan                |2        |62.4       |1    |2       |2   |1             |
|Papua New Guinea           |2        |65.1       |1    |2       |2   |1             |
|Malawi                     |2        |57.6       |1    |2       |2   |1             |
|Liberia                    |1        |55.9       |1    |1       |1   |1             |
|Lesotho                    |1        |54.3       |1    |1       |1   |1             |
|Burundi                    |1        |25.1       |1    |1       |1   |1             |
|Guinea-Bissau              |1        |49.9       |1    |1       |1   |1             |
|Chad                       |1        |39.2       |1    |1       |1   |1             |
|Curaçao                    |1        |17.3       |1    |1       |1   |1             |
|Antigua and Barbuda        |1        |39.6       |1    |1       |1   |1             |
|Gambia                     |1        |38.7       |1    |1       |1   |1             |
|Mayotte                    |1        |27.7       |1    |1       |1   |1             |
|Niger                      |1        |47.5       |1    |1       |1   |1             |
|Congo Republic             |1        |38.2       |1    |1       |1   |1             |
|Monaco                     |1        |16.8       |1    |1       |1   |1             |
|Seychelles                 |1        |31.8       |1    |1       |1   |1             |
+---------------------------+---------+-----------+-----+--------+----+--------------+
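For anyone wanting to reproduce the columns above at a different threshold, here is a minimal Python sketch of how they could be derived from per-combination counts. It assumes the input is a dict keyed by (country, lang, browser, OS, search engine) tuples; the function name summarize_release is hypothetical and this is not the code that produced the table.

```python
def summarize_release(counts, threshold=1000):
    """Per-country summary of what a given threshold would release.

    counts maps (country, lang, browser, os_family, engine) -> pageviews.
    """
    totals = {}  # all search engine referrals per country
    kept = {}    # rows per country that meet the threshold
    for (country, lang, browser, os_family, engine), n in counts.items():
        totals[country] = totals.get(country, 0) + n
        if n >= threshold:
            kept.setdefault(country, []).append((lang, browser, os_family, engine, n))

    summary = {}
    for country, rows in kept.items():
        released = sum(r[4] for r in rows)
        summary[country] = {
            "list_size": len(rows),
            "pct_pv_data": round(100 * released / totals[country], 1),
            "langs": len({r[0] for r in rows}),
            "browsers": len({r[1] for r in rows}),
            "OSes": len({r[2] for r in rows}),
            "search_engines": len({r[3] for r in rows}),
        }
    return summary
```

Countries with no combination reaching the threshold do not appear in the output at all, which matches how the table only lists countries with at least one released row.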

Thanks, both.

Change 655804 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[analytics/refinery@master] WIP: Oozie: Add search engine referrers

https://gerrit.wikimedia.org/r/655804

gerritbot added a project: Patch-For-Review. Jan 13 2021, 1:52 AM

JFishback_WMF moved this task from Incoming to Backlog on the Privacy Engineering board. Jan 25 2021, 4:18 PM

@JFishback_WMF I just wanted to check in to see whether there is anything I can get started to make the privacy review as simple as possible for you -- e.g., a doc I could start filling out or any statistics that you know you'll need?

@bmansurov FYI I found this browser dataset oozie job that implements a privacy filter that I assume will look very similar to what we want (compute all the data then go through and change fields to "other" if they don't exceed a threshold and re-group/sum the data). I haven't rewritten the code yet, but if you get to a point where you'd want something very close to the final code, let me know and I'll update.
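To make the "change fields to other and re-group" idea concrete, here is a minimal Python sketch. It is not the browser dataset job's actual code: which fields get collapsed (browser and OS here, via the collapse indices), the "other" label, and the function name bucket_small_rows are all assumptions for illustration.

```python
from collections import Counter


def bucket_small_rows(counts, threshold, collapse=(2, 3), label="other"):
    """Relabel selected fields of below-threshold rows, then re-sum.

    counts maps (country, lang, browser, os_family, engine) -> pageviews.
    collapse holds the key positions to overwrite (assumed: browser and OS).
    """
    regrouped = Counter()
    for key, n in counts.items():
        if n < threshold:
            key = tuple(label if i in collapse else v for i, v in enumerate(key))
        regrouped[key] += n  # rows that now share a key are summed together
    return dict(regrouped)
```

Note that a merged "other" row can itself still fall below the threshold, so a real pipeline would need to decide whether to drop such rows or collapse further fields.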

@Isaac thanks for the link. I've been working on this patch. This is the main file. Feel free to add comments to the patch with new additions or fixes.

@bmansurov thanks for letting me know -- I'll switch over to leaving comments there when there's specific things related to the code etc.

Status update:

Huge huge thanks to @bmansurov and @Milimetric for getting the code to a good place!
At this point, the job to generate the data feels almost ready (just a few dangling comments) so we can probably start running it as soon as the privacy strategy is set.
That still leaves the question about making the data publicly accessible. @Milimetric what's your recommendation for how to surface the data publicly via a dashboard? I assume via Dashiki like the browser statistics or is there a better option? My only concern with Dashiki is that I think the referer dataset is a good bit larger than other ones it handles (with current parameterization, it would produce almost 10,000 rows per day). I assume that means we want to split it out somehow but this is segmented by country instead of wiki, so I don't know if that is straightforward or not...

JFishback_WMF moved this task from Backlog to In Progress on the Privacy Engineering board. Mar 18 2021, 6:46 PM

I'm so sorry, is this stuck on me? Maybe a quick meeting would unstick it. My view right now:

privacy review is ongoing
https://gerrit.wikimedia.org/r/c/analytics/refinery/+/655804 is basically ready to merge (am I needed there or is @fkaelin taking it?)
We have two dashboarding options: Wikistats via AQS (so we'd have to change the oozie job to load the data into Cassandra), or Dashiki. Either way we need some glue code, so we should talk pros & cons. Initially I was thinking this would just be released as text.

I'm so sorry, is this stuck on me?

No worries -- as you said, privacy review is still ongoing. FYI @JFishback_WMF I just discovered that we've been publishing some of this data in a highly aggregated form here for the past several years. Not sure if that helps with the privacy review at all.

We have two dashboarding options: Wikistats via AQS (so we'd have to change the oozie job to load the data into Cassandra), or Dashiki. Either way we need some glue code, so we should talk pros & cons. Initially I was thinking this would just be released as text.

Probably good to get started on this as the current idea is to get the dashboard up and not just rely on e.g., tsv files. I'll put time on your calendar so we can discuss and I can figure out what I need to do.

Hello all, I've completed the privacy risk analysis and shared it with the original requester: Due to the low impact of harm and low probability of malicious use of this data, coupled with the mitigation described above, the residual risk of collecting and retaining this data is considered LOW so the risk is automatically accepted by WMF under current policy.

Huge huge thanks to @JFishback_WMF for the privacy review! Everything makes sense from my side.

I'll add comments directly on the patch @Milimetric but summary:

The oozie job already uses 500 as the minimum number of pageviews so no change is needed for that
There are a few countries that we will remove from the data. It's only seven, so the list differs from the geoeditors blocklist. I assume the easiest approach is to just add an AND country NOT IN (...) clause.

Then from my standpoint we're ready to push the code. I'll inspect the data to make sure it matches expectations and then write the glue to push it to the dashboard.

Change 655804 merged by Milimetric:

[analytics/refinery@master] Add daily referrers Hive table and Oozie job

https://gerrit.wikimedia.org/r/655804

Maintenance_bot removed a project: Patch-For-Review. Apr 27 2021, 9:10 PM

Isaac updated the task description. Apr 29 2021, 3:46 PM

mpopov mentioned this in T228802: External traffic breakdown in Druid/Turnilo/Superset. Oct 1 2021, 7:38 PM

Milimetric mentioned this in T112284: Create new table for 'referer' aggregated data. Oct 22 2021, 7:56 PM

odimitrijevic added a project: Data-Engineering. Jan 6 2022, 3:22 AM

odimitrijevic moved this task from Incoming (new tickets) to Datasets on the Data-Engineering board. Jan 6 2022, 3:41 AM

odimitrijevic removed a project: Analytics. Jan 12 2022, 12:19 AM