[Spike] How can we measure seen page previews with as high a degree of accuracy as possible?
Closed, Resolved (Public)

Description

Background

With the deployment of Page Previews, we introduce a new form of reading Wikipedia content apart from the standard pageviews. We need to measure this for the same reasons as we do for pageviews. These include providing executives with accurate numbers on the overall level of usage of our content, and the editor community with accurate numbers on the readership of the individual articles and projects they are working on. In particular, based on the previous A/B tests, we expect that the deployment of previews on a wiki will cause the total pageviews to decrease for that wiki, but that "page interactions" – any intentional interaction with a page, i.e. page previews + pageviews – will increase. We would like a way to track this metric over time.

Requirements/Constraints

  1. Client-side, implement a way to register every preview that is seen by the reader (defined as having been visible for at least 1000ms), e.g. by sending an EventLogging/beacon request as soon as that threshold time has passed
  2. Server-side, implement a way to store, query and count these requests
  3. The page interaction data that we collect from Page Previews should eventually be available as aggregated Hive tables like [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly | wmf.pageview_hourly ]]
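
A minimal sketch of the "seen" rule from requirement 1, written in Python only to pin down the logic (the real client code would be JavaScript). The names `PreviewDisplay`, `SEEN_THRESHOLD_MS` and `is_seen` and the event shape are hypothetical; only the 1000ms threshold comes from the requirement itself.

```python
from dataclasses import dataclass
from typing import Optional

# Threshold from requirement 1: a preview counts as "seen" once it has
# been visible for at least 1000 ms.
SEEN_THRESHOLD_MS = 1000


@dataclass
class PreviewDisplay:
    """One display of a preview (hypothetical shape, for illustration only)."""
    shown_at_ms: int                # when the preview became visible
    dismissed_at_ms: Optional[int]  # None while the preview is still open


def is_seen(display: PreviewDisplay, now_ms: int) -> bool:
    """Return True once the preview has been visible for the threshold time."""
    if display.dismissed_at_ms is not None:
        end_ms = display.dismissed_at_ms
    else:
        end_ms = now_ms  # still open: measure up to the current time
    return end_ms - display.shown_at_ms >= SEEN_THRESHOLD_MS
```

Since the request would be sent as soon as the threshold has passed, a still-open preview can be registered without waiting for it to be dismissed.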

Acceptance Criteria

Determine a way to report on the following (hourly, daily, weekly, monthly, yearly):

  • Total page previews
  • Page previews per project
  • Page previews per previewed page
  • Page previews by other applicable dimensions that are currently used for pageviews, e.g. country or browser type
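
To make the reporting shape concrete, here is a rough sketch of counting seen previews per period and per dimension. It is not the eventual Hive job: the record fields (`timestamp`, `project`, `page_title`, `country`) and the function names are assumptions chosen for illustration.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Mapping, Sequence


def period_key(ts: datetime, granularity: str) -> str:
    """Map a timestamp to the reporting period it falls into."""
    if granularity == "weekly":
        iso_year, iso_week = ts.isocalendar()[:2]
        return f"{iso_year}-W{iso_week:02d}"
    formats = {
        "hourly": "%Y-%m-%dT%H",
        "daily": "%Y-%m-%d",
        "monthly": "%Y-%m",
        "yearly": "%Y",
    }
    return ts.strftime(formats[granularity])


def count_previews(
    events: Iterable[Mapping],
    granularity: str,
    dimensions: Sequence[str] = (),
) -> Counter:
    """Count seen previews per period, optionally split by extra dimensions."""
    counts: Counter = Counter()
    for event in events:
        key = (period_key(event["timestamp"], granularity),)
        key += tuple(event[name] for name in dimensions)
        counts[key] += 1
    return counts
```

With no dimensions this gives total page previews per period; passing `("project",)`, `("project", "page_title")` or `("country",)` gives the other breakdowns listed above.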

Notes

As in the current Popups schema, we should also record information on the source page from which the preview was viewed (similar to the internal referer data we log for pageviews).

Event Timeline

ovasileva triaged this task as Medium priority. Dec 8 2017, 1:36 PM
ovasileva created this task.
ovasileva raised the priority of this task from Medium to High. Dec 8 2017, 1:37 PM
phuedx renamed this task from [Spike] Determine ways of counting page previews with highest sampling possible to [Spike] How can we count previews with as high a degree of accuracy as possible?. Dec 12 2017, 5:52 AM
phuedx updated the task description.

Possible Solutions

  1. Just leave the EventLogging instrumentation running and refine this information from the event.popups Hive table
    • i.e. make this Not Our (Readers Web's) Problem™
  2. Modify the EventLogging instrumentation to track how long the preview is visible for.
  3. Update the statsv reducer to track when a preview is shown and dismissed and then increment a counter if the preview was visible for longer than X milliseconds.
    • This seems incompatible with the long-term goal of having all page interaction data in Hive.
  4. Introduce a new kind of instrumentation that makes a request (via the Beacon API) to /beacon/preview?duration=X&uri=Y, like the Multimedia Viewer does.
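
For option 4, a small sketch of what building and reading back such a beacon request might look like. Only the path and the `duration` and `uri` parameters come from the option as written; the function names and the choice to treat `duration` as milliseconds are assumptions. Python is used here just to show the round trip, the client itself would be JavaScript.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit

BEACON_PATH = "/beacon/preview"


def build_beacon_url(duration_ms: int, page_uri: str) -> str:
    """Build the relative URL that the client would request."""
    query = urlencode({"duration": duration_ms, "uri": page_uri})
    return f"{BEACON_PATH}?{query}"


def parse_beacon_url(url: str) -> Optional[dict]:
    """Recover duration and previewed URI; None if this is not a valid beacon."""
    parts = urlsplit(url)
    if parts.path != BEACON_PATH:
        return None
    params = parse_qs(parts.query)
    try:
        return {
            "duration_ms": int(params["duration"][0]),
            "uri": params["uri"][0],
        }
    except (KeyError, ValueError):
        # Missing parameter or non-numeric duration.
        return None
```

Because the request is an ordinary HTTP request, whatever is logged for normal requests is also available for it, which is what would make a pageview-like refinement possible.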

Notes

  • #2 and #4 could and should include simplifying and/or removing the existing EventLogging instrumentation and A/B test code.
  • #2 and #4 require QA and Research (@Tbayer's) involvement.
  • #1 requires involvement from Analytics Engineering and Ops as we're looking to:
    • Store this data for as long as we can; and
    • Increase the sampling size as much as we can in order to improve accuracy.
  • In reality, #1 only punts #2 and #4 to some later date. Analytics Engineering (AE) would have to introduce a shim to refine the event.popups table into a pageview_hourly-like table when there's already precedent for refining requests to /beacon/foo. This sounds like technical debt and I'd expect AE to push back.
Jdlrobson moved this task from Upcoming to Needs Prioritization on the Web-Team-Backlog board.
Jdlrobson subscribed.

I will set up a meeting for all engineers when I return from vacation so that we can talk through the potential solutions and work out how to make this happen.

Tbayer renamed this task from [Spike] How can we count previews with as high a degree of accuracy as possible? to [Spike] How can we measure seen page previews with as high a degree of accuracy as possible?. Dec 12 2017, 5:49 PM
Tbayer reassigned this task from Jdlrobson to ovasileva.
Tbayer reassigned this task from ovasileva to Jdlrobson.
Tbayer updated the task description.

@ovasileva Rewrote the task description further as discussed on Friday. To highlight one point in particular: WMF already tracks pageview numbers with relatively high precision (with a lot of work having gone into refining definitions and building and debugging aggregation steps over the past few years), so I don't think we should focus on creating a separate permanent "preview + pageviews" metric here (which would require us to maintain the custom pageviews part too). Also, we won't be able to test the hypothesis that it (or pageviews per se) will increase or decrease just by observing the timeline - that's what the A/B tests were for.

> I will setup a meeting for all engineers when I return from vacation so all the engineers can talk through the potential solutions and work out how to make this happen.

reminder to also add @Tbayer to the meeting

> Notes
>
>   • #2 and #4 could and should include simplifying and/or removing the existing EventLogging instrumentation and A/B test code.

Should we add this to the criteria? Is everyone fine with this happening? (From my side, 👍)


Re: solutions, I think #2 and #4 are quite clear as far as the Page-Previews client code base is concerned and sound fine, but I'm personally not sure what they would entail on the backend (regarding extra work or collaborations).

> ...but I'm personally not sure what they would entail on the backend (regarding extra work or collaborations).

Meaning, we'll dig into this in the meeting, reach out to other teams as needed to clarify things, and get back to you in this task when we know more.

I have looked a bit into how the aggregation step could work for the beacon approach (#4). The structure of the corresponding query that generates the pageview_hourly table is fairly simple. I think I could take a stab at adapting it for seen previews, once we have found a way to send that kind of "virtual" beacon request which conveys the necessary information (e.g. project, page name), analogous to the MediaViewer solution. It could look very similar to the existing EL beacon requests that we already send for Popups (which already has those fields), but would be processed differently on the backend.

We would then need some support from AE to make this query into an Oozie job, but I can't imagine that this will be a big challenge for them.
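
As a rough illustration of the "necessary information" such a request would need to convey, here is a sketch that derives a project and page name from a previewed page's URI. The `project` naming (e.g. `en.wikipedia`), the handling of mobile hosts and the function names are assumptions made for the example, not the definitions used by the actual pageview_hourly query.

```python
from typing import NamedTuple, Optional
from urllib.parse import unquote, urlsplit


class PreviewedPage(NamedTuple):
    project: str     # e.g. "en.wikipedia" (assumed naming)
    page_title: str  # e.g. "Cat"


def previewed_page(uri: str) -> Optional[PreviewedPage]:
    """Extract project and page title from an article URI, or None."""
    parts = urlsplit(uri)
    host = parts.hostname or ""
    prefix = "/wiki/"
    if not host.endswith(".org") or not parts.path.startswith(prefix):
        return None
    title = unquote(parts.path[len(prefix):])
    if not title:
        return None
    # Drop the ".org" suffix and any mobile "m" label so that desktop and
    # mobile hosts map to the same project (an assumption for this sketch).
    labels = [label for label in host[: -len(".org")].split(".") if label != "m"]
    return PreviewedPage(".".join(labels), title)
```

Rows shaped like this, together with the request timestamp, would be enough to group seen previews by hour, project and page.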

^ @MBinder_WMF, @ovasileva: Reflecting reality. The majority of Readers Web had a meeting to discuss the answer to this question yesterday (Thursday, 11th January 2018).

Etherpad: https://etherpad.wikimedia.org/p/t182414
Notes: I've created the following implementation tasks: T184793: [EPIC] Instrument page interactions and T173952: Remove A/B testing instrumentation code. Note well that the former blocks the latter.

Per the action item "SS: Ping Analytics Engineering and Operations about which URL that we should request. Operations need a traffic estimate.", I'll be reaching out to AE and Ops by email and will follow up on those implementation tasks.

phuedx moved this task from Ready for Signoff to Doing on the Readers-Web-Kanbanana-Board-Old board.

Sorry. I'll add a note about why we can't use the existing EventLogging instrumentation to the description.

> Possible Solutions
>
>   1. Just leave the EventLogging instrumentation running and refine this information from the event.popups Hive table
>     • i.e. make this Not Our (Readers Web's) Problem™
>   2. Modify the EventLogging instrumentation to track how long the preview is visible for.
>   3. Update the statsv reducer to track when a preview is shown and dismissed and then increment a counter if the preview was visible for longer than X milliseconds.
>     • This seems incompatible with the long-term goal of having all page interaction data in Hive.
>   4. Introduce a new kind of instrumentation that makes a request (via the Beacon API) to /beacon/preview?duration=X&uri=Y, like the Multimedia Viewer does.

#1 and #2 were dismissed because the EventLogging stack doesn't record IPs, for example, and so the data can't be refined to something pageview-like.

Sam, is this still blocked? I think we got an answer on list, right?

> Sam is this still blocked?

No. Mostly.

> I think we got an answer on list, right?

I believe https://lists.wikimedia.org/pipermail/analytics/2018-January/006136.html summarises the outcome of the on- and off-list discussions. For whatever reason, the formatting of that email seems to have been lost, so I'll quote/reformat it here:

Hullo all,

It seems like we've arrived at an implementation for the client-side (JS) part of this problem: use EventLogging to track a page interaction from within the Page Previews code. This'll give us the flexibility to take advantage of a stream processing solution if/when it becomes available, to push the definition of a "Page Previews page interaction" to the client, and to rely on any events that we log in the immediate future ending up in tables that we're already familiar with.

I'll update T184793: [EPIC] Instrument page interactions with additional detail and suggestions from AE.

Over to you @ovasileva! Signing this off should boil down to making sure that T184793: [EPIC] Instrument page interactions is actionable.