
Get a fresh read on Google-referred traffic
Closed, Resolved · Public · 4 Estimated Story Points

Description

Wes, Kevin and I met to review the state of the data we have on Google-referred traffic, after receiving a note from Comms. To my knowledge, we don't have a recent read on the data we presented at the December 2014 traffic update (complete slides), particularly on the proportion of Google-referred pageviews over total human pageviews as a function of platform (mobile site vs. desktop site) and time: the most recent datapoint is November 2014 (slide 17 in the above deck).

It would be useful – regardless of press inquiries – to monitor referred traffic and determine if something dramatic happened to readership funneled from Google in the first 7 months of 2015. Unfortunately, the new refined pageview tables generated by Analytics don't include referrer domain (only a binary flag for internal vs external requests), so the only source of data we have covering 2015 with intact referrer information is the sampled logs.

The goal of this task is to refresh the data used in the deck above using the sampled logs or identify another metric that could give us a fresh read on trend changes in Google-referred traffic over the first half of this calendar year.
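For illustration, the shape of the computation is simple once the data is extracted. Below is a minimal sketch in Python, assuming a flat extract of the sampled logs with hypothetical dt, referer, and site columns; the real sampled-log schema and the canonical list of Google domains will differ:

```python
# Minimal sketch, not the production query: monthly share of
# Google-referred pageviews by platform from a sampled-log extract.
# The file name and columns ("dt", "referer", "site") are assumptions.
from urllib.parse import urlsplit

import pandas as pd

def is_google_referred(referer: str) -> bool:
    # Hostname-based check; a real classifier needs the full list of
    # Google country-code domains (google.de, google.co.in, ...).
    host = urlsplit(referer).hostname or ""
    return host == "google.com" or ".google." in f".{host}."

logs = pd.read_csv("sampled_logs_extract.csv", parse_dates=["dt"])
logs["google"] = logs["referer"].fillna("").map(is_google_referred)

share = (
    logs.groupby([logs["dt"].dt.to_period("M"), "site"])["google"]
        .mean()  # proportion of pageviews that are Google-referred
        .rename("google_share")
)
print(share)
```

Grouping by month and site gives the same platform/time breakdown the December 2014 deck used, so the refreshed series can be compared directly against the old one.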

I am assigning the task to @Deskana (per conversation with @Wwes) and copying @JKatzWMF from Readership and @Tbayer so we can determine the most appropriate owner for these requests going forward.

I am also adding @kevinator so this can be on Analytics Dev's radar. Kevin, can you link here the related task you created to include referring-domain info in the pageview tables?

Event Timeline

DarTar assigned this task to Deskana.
DarTar raised the priority of this task from to Needs Triage.
DarTar updated the task description. (Show Details)
DarTar added a project: Research.
DarTar moved this task to Radar on the Research board.

Who is this request coming from? Given that prioritising this work will jeopardise the Discovery Department's ability to achieve its quarterly goals, it's important to have that on record here.

This request has apparently come from Lila and the Board of Trustees, so I am prioritising this accordingly.

However, I am noting here that this prioritisation is performed under protest. The last-minute, urgent nature of this request, which has bypassed our planning processes, seriously jeopardises the ability of the Discovery Analysis Team to achieve its Q1 goal, as it is now necessary to deprioritise T105355 to fulfil this request.

Deskana set Security to None.
Deskana added a subscriber: Ironholds.

@Ironholds will need to handle this.

> This request has apparently come from Lila and the Board of Trustees, so I am prioritising this accordingly.
>
> However, I am noting here that this prioritisation is performed under protest. The last-minute, urgent nature of this request, which has bypassed our planning processes, seriously jeopardises the ability of the Discovery Analysis Team to achieve its Q1 goal, as it is now necessary to deprioritise T105355 to fulfil this request.

I fully endorse Dan's reading of this, and his protest. It is reasonable for the executive team to have urgent, high-priority work in response to changing events. It is not reasonable for this work to trickle down and cripple the ability of teams to complete the same goals the executive team will be holding them to. This is the first request of this nature I have been asked to handle in my role in Discovery, but it is most definitely not the first that analysts overall have handled, and I would strongly encourage the executive team, if they have a consistent need for ad-hoc analytics (which they do), to consider hiring an analyst to provide immediate support. I am happy to recommend some excellent people.

The purpose of Phabricator, as I understand it, is to complete task-oriented work.

The tone of the response in this ticket thus far is divisive and not productive to the task. This is not a random request: it responds to an organizational perception, and to something we need to understand. The request followed procedure by coming in via the task system and to the PM, rather than as a random one-off request. It is quite specific, states the owners, builds on institutional knowledge that we need, and concerns a metric that requires expertise currently limited to the Discovery team.

This is NOT a US-vs-THEM thing and should not be framed in that manner; we are a TEAM working to solve issues, as they arise, on behalf of a larger set of users. Nor is this a channel for dealing with other organizational challenges. Please use this tool to complete the given task and proactively solve the issue; otherwise the tool loses its value. We are already working on addressing the other challenges, and restating them in an action-oriented ticket is counterproductive and distracting from the task at hand. So that resourcing gets the correct impact and attention, I will follow up with you individually.

@Deskana, thank you for prioritizing the work. Let's consider whether this is a metric that belongs on our dashboard, and set up a meeting with @JKatzWMF and @kevinator to discuss, as it could easily live in both audiences (Discovery and Readership) and may deserve a proper place on the overall metric boards run by Analytics.

@Ironholds:

  • How long will the query take to run, based on the requirements above? When will it be done?
  • Is it repeatable as an ongoing, automated analysis task? Please estimate the hours of work and the resources involved, create that as a separate task, and reference it here.

Thanks.

> @Deskana, thank you for prioritizing the work. Let's consider whether this is a metric that belongs on our dashboard, and set up a meeting with @JKatzWMF and @kevinator to discuss, as it could easily live in both audiences (Discovery and Readership) and may deserve a proper place on the overall metric boards run by Analytics.

I'd say this is outside the scope of the Discovery dashboards at http://searchdata.wmflabs.org, which are currently intended to provide metrics around search. We're also getting to the stage where we have a lot of graphs on the dashboard, so I'm not eager to bloat it further. Organisation-wide traffic trends such as this should probably live in a separate dashboard intended for that purpose.

It is indeed for task-oriented work. I apologise unreservedly for my tone; this type of discussion is not appropriate for Phabricator, and wider concerns about resourcing should have been raised in other venues and approached more constructively.

In terms of the ticket itself, the actual task is not that complex. It will take several days, but that's parallelised computation time, not my writing time; I estimate that the data should be provided by EOD Monday. Building out something to track /just this metric/ (that's important; see below) is maybe a week, tops.

What I'd also like us to think about is long-term questions around this metric - not just basic referer data but how we track incoming requests from searches and how Google interacts with us. The obvious answer is a mix of server-side tracking at our end and Google's Webmaster Tools - I'd like us to seriously consider building out a distinct set of dashboards and infrastructure for these sorts of metrics as a Q2 goal for the Discovery analytics team.

The worry I have around that is resourcing; this is a lot of work, as in "as much work as setting up the dashboards for internal metrics was". We have to build out the pipelines and make them work. I feel uncomfortable making any guarantees about being able to do that /and/ our other ongoing work (building out the user satisfaction KPI, supporting existing dashboards, providing A/B test analysis for search, the portal, the frontend, etc) without more people.

> What I'd also like us to think about is long-term questions around this metric - not just basic referer data but how we track incoming requests from searches and how Google interacts with us.

Fully agreed. As a first stop-gap measure, check out T108886: I asked @kevinator if we can prioritize it, because this piece of data got lost in the transition from the sampled logs + manual parsing to the refined pageview tables in Hadoop.

Makes sense. That should be relatively easy to do with the inbuilt parse functions and a couple of tweaks.
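For concreteness, the classification being discussed looks roughly like the sketch below, written in Python for illustration only; the actual change would use Hive's built-in URL-parsing functions over the pageview tables, and the domain lists here are placeholders:

```python
# Illustrative referer bucketing, not the T108886 implementation.
from urllib.parse import urlsplit

SEARCH_ENGINE_MARKERS = (".google.", ".bing.", ".yahoo.")  # placeholder list

def classify_referer(referer: str) -> str:
    """Bucket a raw referer URL into none / internal / search / other."""
    host = urlsplit(referer or "").hostname or ""
    if not host:
        return "none"
    dotted = f".{host}."
    if dotted.endswith((".wikipedia.org.", ".wikimedia.org.")):
        return "internal"
    if any(marker in dotted for marker in SEARCH_ENGINE_MARKERS):
        return "external (search engine)"
    return "external (other)"

assert classify_referer("https://www.google.com/search?q=x") == "external (search engine)"
assert classify_referer("https://en.wikipedia.org/wiki/Main_Page") == "internal"
assert classify_referer("") == "none"
```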

Okay, the queries have been kicked off, running from January to August 2015. https://github.com/wikimedia-research/memo is where the code lives.

Thanks @Ironholds

One note triggered by the SimilarWeb post: the author reports a sudden drop in traffic from Wikipedia to their domain, which sounds very familiar.

Since most outbound traffic now originates from HTTPS and we haven't set an explicit referrer policy, we are technically sending a vast amount of unidentified traffic to external sites that don't support HTTPS. This may partly explain the drops in inbound traffic reported by third-party SEO/tracking tools: the traffic is still there, but it's invisible. One quick fix, which we can test, would be to set an explicit "origin" referrer policy. See T87276 and T99174.
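To make the behaviour concrete, here is a small illustration (a sketch of the specified behaviour, not production code) of what a destination site sees as the referrer under the default policy versus an explicit "origin" policy:

```python
# Sketch of referrer behaviour for an HTTPS page linking out, under the
# default policy versus an explicit "origin" policy. Illustrative only.
from urllib.parse import urlsplit

def referer_sent(page_url: str, target_url: str, policy: str = "default") -> str:
    page, target = urlsplit(page_url), urlsplit(target_url)
    if policy == "origin":
        # "origin" always sends scheme://host/ and never the full path.
        return f"{page.scheme}://{page.netloc}/"
    # Default behaviour: an HTTPS page sends no referrer to an HTTP
    # target, so the destination sees the visit as "direct"/unknown.
    if page.scheme == "https" and target.scheme == "http":
        return ""
    return page_url

print(referer_sent("https://en.wikipedia.org/wiki/Example", "http://example.org/"))
# -> ""  (no referrer at all: invisible to the destination's analytics)
print(referer_sent("https://en.wikipedia.org/wiki/Example", "http://example.org/", "origin"))
# -> "https://en.wikipedia.org/"  (the site is identified, the article is not)
```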

The initial read on this data has been completed and can be found at https://github.com/wikimedia-research/memo/blob/master/report.pdf. It is being communicated to upper management separately.

@Ironholds Agreed, that is pretty awesome.

Since I am surprised by the results and there is still the 'unknown' traffic bucket, I would like us, when we talk to Google, to ask them about getting click-through data (people clicking our item in search results) in table form, and for more than 90 days back. I couldn't see any trends in click-through rate while manually comparing dates, but that's not a great methodology.

@Wwes, are you going to be reaching out to them in the near future?

@JKatzWMF yes, I plan to follow up with them shortly to get validation of our interpretation, and will ask for this as well.

@Ironholds nice work!!

Now @JKatzWMF, @Deskana and @kevinator: please create a follow-up task to discuss ongoing tracking for this as an automated service, and to expand the metadata (geo, platform, etc.).