Identify data / questions that we can(not) answer regarding external reuse
Address the following questions:

  • What data do we currently have available to us regarding external reuse?
  • What data are we currently missing regarding external reuse? What are potential methods for getting that data?
  • What questions can we answer with the data that we currently have?
  • What questions can we not answer with the data that we currently have?

In the course of the literature review, I identified the data that we have available to us, along with improvements that would make this data more useful for understanding traffic.

I identified the following important types of data that we do not have:

  • Voice Assistants: we have no data on usage of voice assistants beyond anecdotal evidence and research datasets that suggest Wikipedia's importance in answering queries.
  • Fine-grained Click-through Rate: while we have some click-through rate data from the Google Search Console, this is limited to a subset of articles with high impressions / clicks. It is also not broken out by whether the clicks come from search results or knowledge panels.
  • Fine-grained Referral Data: due to HTTPS (yay!), we only receive domain-level information in referrers. So we might be able to count the number of people who view an article from Reddit, but we cannot directly determine which subreddit or post led to the clicks (if we wanted to study the context in which the link was surfaced).
  • Section Views: See T240359
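
To make the referral limitation above concrete, here is a minimal Python sketch of what domain-level counting looks like. The referrer strings are made up and the helper name `count_referrer_domains` is hypothetical; neither comes from our actual logs or pipeline. The sketch assumes referrers arrive as origin-only URLs, which is the situation described in the bullet above.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_referrer_domains(referrers):
    """Count pageviews per referring host (hypothetical helper, not pipeline code)."""
    counts = Counter()
    for ref in referrers:
        host = urlsplit(ref).hostname  # None when the referrer is empty
        counts[host or "(none)"] += 1
    return counts


# Made-up sample of origin-only referrers (assumption: the path has been stripped).
sample = [
    "https://www.reddit.com/",
    "https://www.reddit.com/",
    "https://www.google.com/",
    "",
]

counts = count_referrer_domains(sample)

# Every non-empty referrer has the bare path "/", so nothing in the data
# identifies which subreddit or post sent the reader.
paths = {urlsplit(ref).path for ref in sample if ref}
```

With data like this we can say how many views came from Reddit overall, but any question about the context of the link needs a different data source.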

From this evaluation of the data available / unavailable, I have developed a number of research questions. The high-level ones are as follows:

  • What is the effect of different attribution designs on readers' brand awareness / click-through?
  • How could we adjust our internal pageview metrics to also reflect external readership?
  • What are the most valuable pieces of information to surface in snippets of articles?
  • What is the impact of showing users translated content?