[Spike] How can we measure seen page previews with as high a degree of accuracy as possible?
Closed, ResolvedPublic

Authored By

	ovasileva
	Dec 8 2017, 1:36 PM

Description

Background

With the deployment of Page Previews, we introduce a new form of reading Wikipedia content apart from the standard pageviews. We need to measure this for the same reasons as we do for pageviews. These include providing executives with accurate numbers on the overall level of usage of our content, and the editor community with accurate numbers on the readership of the individual articles and projects they are working on. In particular, based on the previous A/B tests, we expect that the deployment of previews on a wiki will cause the total pageviews to decrease for that wiki, but that "page interactions" – any intentional interaction with a page, i.e. page previews + pageviews – will increase. We would like a way to track this metric over time.

Requirements/Constraints

Client-side, implement a way to register every preview that is seen by the reader (defined as having been visible for at least 1000ms), e.g. by sending an EventLogging/beacon request as soon as that threshold time has passed
Server-side, implement a way to store, query and count these requests
The page interaction data that we collect from Page Previews should eventually be available as aggregated Hive tables like [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly | wmf.pageview_hourly ]]
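
To make the "seen" definition above concrete, here is a minimal, language-neutral sketch of the threshold logic, written in Python for brevity. `PreviewTimer`, `show`, `dismiss` and `poll` are hypothetical names, not part of the Popups codebase; the real client would be JavaScript and would send the EventLogging/beacon request at the point where `poll` returns `True`.

```python
from typing import Optional

# Threshold from the requirement: a preview counts as seen after 1000 ms.
SEEN_THRESHOLD_MS = 1000


class PreviewTimer:
    """Hypothetical helper: decides when a visible preview becomes 'seen'."""

    def __init__(self) -> None:
        self.shown_at_ms: Optional[int] = None
        self.logged = False

    def show(self, now_ms: int) -> None:
        # A new preview became visible; reset the per-preview state.
        self.shown_at_ms = now_ms
        self.logged = False

    def dismiss(self) -> None:
        # The preview was hidden; it can no longer cross the threshold.
        self.shown_at_ms = None

    def poll(self, now_ms: int) -> bool:
        """Return True exactly once, when the visible preview crosses the threshold."""
        if self.shown_at_ms is None or self.logged:
            return False
        if now_ms - self.shown_at_ms >= SEEN_THRESHOLD_MS:
            self.logged = True
            return True
        return False
```

Returning `True` only once per preview keeps a long hover from being counted more than once.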

Acceptance Criteria

Determine a way to report on the following (hourly, daily, weekly, monthly, yearly):

Total page previews
Page previews per project
Page previews per previewed page
Page previews by other applicable dimensions that are currently used for pageviews, e.g. country or browser type
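
As an illustration of the reports listed above, the following sketch counts seen previews per project, page and hour, then rolls the hourly counts up per project. It assumes each logged event is a dict with `project`, `page_title` and `timestamp` keys; these field names are assumptions for the example, not the actual schema, and a production version would be a Hive query rather than Python.

```python
from collections import Counter
from datetime import datetime


def hour_bucket(ts: datetime) -> str:
    # Truncate a timestamp to the start of its hour.
    return ts.strftime("%Y-%m-%dT%H:00")


def aggregate_hourly(events) -> Counter:
    """Count seen previews per (project, page_title, hour)."""
    counts: Counter = Counter()
    for event in events:
        key = (event["project"], event["page_title"], hour_bucket(event["timestamp"]))
        counts[key] += 1
    return counts


def rollup_by_project(hourly: Counter) -> Counter:
    """Sum the hourly counts into one total per project."""
    totals: Counter = Counter()
    for (project, _page_title, _hour), n in hourly.items():
        totals[project] += n
    return totals
```

Daily, weekly, monthly and yearly totals follow the same pattern with a coarser bucket function, and further dimensions such as country would simply extend the key.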

Notes

As in the current Popups schema, we should also record information on the source page from which the preview was viewed (similar to the internal referer data we log for pageviews).

Related Objects

Status	Assigned	Task
Resolved	Dereckson	T68374 Enable Hovercards on se.wikimedia.org (Swedish chapter wiki)
Resolved	Jdlrobson	T70860 [GOAL] Graduate Page Previews feature (Popups extension) out of Beta Feature
Resolved	ovasileva	T154635 [EPIC] Deploy page previews to English and German Wikipedia
Duplicate	None	T184801 Remove A/B testing code and EventLogging instrumentation
Open	None	T193524 Publish data on seen page previews
Resolved	mforns	T186728 Record and aggregate page previews
Resolved	ovasileva	T192622 [EPIC] Page previews post-deploy cleanup
Resolved	Jdlrobson	T173952 Remove A/B testing instrumentation code
Duplicate	None	T167433 Switch all projects to the new (and yet to be built) summary-html endpoint for page previews
Duplicate	None	T167429 Make enwiki and dewiki fetch previews from the summary-html RESTBase endpoint
Resolved	ovasileva	T165018 Page previews can consume new summary-HTML endpoint
Resolved	ovasileva	T168332 HTML previews' layout breaks text multi-line text truncation
Resolved	ovasileva	T113094 [EPIC] The Page Summary API needs to provide useful content for the majority of articles
Resolved	phuedx	T113633 Spike: Alternative to TextExtracts for Popups, Gather, Read more
Resolved	phuedx	T109867 Infobox template parameters showing in Hovercard
Duplicate	None	T109869 Hovercards should not show <noinclude> content in the extract
Resolved	Jdlrobson	T74629 Reference lists appearing in extracts for some articles
Resolved	Jdlrobson	T73023 Invalid HTML markup standalone LI element when using references on main page
Duplicate	None	T112137 Hovercards loses manual superscript formatting by requesting plain text
Duplicate	ovasileva	T152414 Previews must display useful text for the majority of articles
Resolved	ovasileva	T159388 Text Extracts used in Page Previews shows strange characters from template
Duplicate	None	T162219 Parentheticals: Words incorrectly concatenated due to too simple removing of spaces
Declined	None	T159065 Fix formulas in HTML extracts
Duplicate	Jdlrobson	T163442 Math formulas are not rendering in preview page dialogs
Resolved	Mhurd	T155573 [Regression] exclude pronunciation guides from article extracts
Declined	• JMinor	T164100 Consider excluding pronunciation guides from TextExtracts
Resolved	phuedx	T112226 Text extracts unable to summarise content contains {{dts}} templates
Resolved	• Pchelolo	T165017 Setup RESTBase HTML extract endpoint
Resolved	Jdlrobson	T165619 Spike: Specify changes to page-summary endpoint
Declined	• Pchelolo	T166163 Create a summary content migration filter
Resolved	phuedx	T167852 "<translate>" and <tvar\|*> tag visible in Page Previews: HtmlFormatter only flattens spans
Declined	Jdlrobson	T168329 exsentences does not work correctly when HTML output used
Duplicate	None	T168625 Make the Page Summary API return an "intro" for a page
Resolved	• Fjalapeno	T167022 Inventory requirements for summary 2.0 endpoint across Reading teams
Resolved	phuedx	T168400 [EPIC] Add notion of emptiness to page summary API
Declined	pmiazga	T168328 Extracts for Wikimedia List articles display partial previews
Declined	None	T168391 Page previews should display generic preview for disambiguation pages
Resolved	phuedx	T114418 Disable TextExtracts on file pages
Declined	None	T168418 Page Previews should accept the service's notion of emptiness
Resolved	• Pchelolo	T164291 Make title-related properties consistent
Resolved	• bearND	T169761 Review Summary 2.0 Spec
Resolved	• Mholloway	T170692 Return common URLs in summary API so clients do not have to perform bug prone string manipulation
Resolved	MSantos	T177619 Return variant URLs and titles in the metadata response
Resolved	• Mholloway	T178446 Expose display titles for a page in all available language variants through the action API
Invalid	None	T69232 Hovercards: Redirect titles should be bolded in the extract
Resolved	Jdlrobson	T170617 Adjust expectations for API consumers when using the TextExtracts API
Invalid	None	T185472 API problems with double spaces in wiki sections
Duplicate	None	T171053 Complete entire page previews test plan for page summary api
Resolved	Jdlrobson	T171052 Add disambiguation page handling in Page Summary API
Resolved	Jdlrobson	T168848 Bootstrap an initial version of the Page Summary API in MCS
Resolved	Jdlrobson	T172021 It shouldn't be possible for coordinates to be the lead paragraph
Resolved	Jdlrobson	T174698 Parenthetical stripping is too aggressive
Resolved	phuedx	T171065 Expose disambiguation property (ppprop=disambiguation) in mobileview api
Resolved	• bearND	T175286 Do side by side comparison of old summary endpoint against new summary endpoint
Resolved	• Mholloway	T176974 Mime vulnerability blocking merges to MCS
Duplicate	None	T176063 Page preview shows only the first two words for a specific article
Resolved	Jdlrobson	T176517 Pages that are redirects give empty summary objects
Resolved	Jdlrobson	T176519 Old templates can lead to sup elements inside summary
Resolved	phuedx	T176521 data-mw attributes should be stripped from summary before scrubbing parentheticals
Resolved	Jdlrobson	T176522 Intro property incorrectly identified in formatted endpoint
Resolved	Jdlrobson	T176525 Parenthetical: New edge case with nested brackets
Resolved	• bearND	T177007 Should we flatten spans in summary output
Resolved	Jdlrobson	T178125 Metadata (coordinates) is sometimes surfaced as lead intro/summary
Resolved	• Mholloway	T178333 Move RESTBase page summary logic to MCS
Resolved	• Mholloway	T178420 What is a "content namespace" for purposes of the summary 2.0 endpoint?
Resolved	Jdlrobson	T183833 [Bug report] Removing parentheses breaks chemical formulas
Resolved	Jdlrobson	T185050 Run comparison of html extracts again
Resolved	• bearND	T185161 Stripping parentheticals can result in trailing punctuation
Resolved	• Tbayer	T182315 Write and publish report on enwiki and dewiki a/b test analysis for page previews
Resolved	None	T184793 [EPIC] Instrument page interactions
Resolved	ovasileva	T182414 [Spike] How can we measure seen page previews with as high a degree of accuracy as possible?

Event Timeline

ovasileva triaged this task as Medium priority. Dec 8 2017, 1:36 PM

ovasileva created this task.

Restricted Application added a subscriber: Aklapper. Dec 8 2017, 1:36 PM

ovasileva raised the priority of this task from Medium to High. Dec 8 2017, 1:37 PM

ovasileva moved this task from Incoming to Upcoming on the Web-Team-Backlog board. Dec 11 2017, 12:14 PM

phuedx renamed this task from [Spike] Determine ways of counting page previews with highest sampling possible to [Spike] How can we count previews with as high a degree of accuracy as possible?. Dec 12 2017, 5:52 AM

phuedx updated the task description.

Possible Solutions

Just leave the EventLogging instrumentation running and refine this information from the event.popups Hive table
- i.e. make this Not Our (Readers Web's) Problem™
Modify the EventLogging instrumentation to track how long the preview is visible for.
~~Update the statsv reducer to track when a preview is shown and dismissed and then increment a counter if the preview was visible for longer than X milliseconds.~~
- This seems incompatible with the long-term goal of having all page interaction data in Hive.
Introduce a new kind of instrumentation that makes a request (via the Beacon API) to /beacon/preview?duration=X&uri=Y, like the Multimedia Viewer does.
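
For option #4, the request target could be assembled as in the sketch below. Only the `/beacon/preview?duration=X&uri=Y` shape comes from the list above; the function name and the choice to percent-encode the full page URI are assumptions, and the real client would issue the request from JavaScript via the Beacon API.

```python
from urllib.parse import urlencode


def build_preview_beacon_path(duration_ms: int, page_uri: str) -> str:
    """Build the path and query string for a hypothetical seen-preview beacon."""
    # urlencode percent-encodes the URI so it survives as a single query value.
    query = urlencode({"duration": duration_ms, "uri": page_uri})
    return f"/beacon/preview?{query}"
```

For example, `build_preview_beacon_path(1200, "https://en.wikipedia.org/wiki/Cat")` yields `/beacon/preview?duration=1200&uri=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FCat`.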

Notes

#2 and #4 could and should include simplifying and/or removing the existing EventLogging instrumentation and A/B test code.
#2 and #4 require QA and Research (@Tbayer's) involvement.
#1 requires involvement from Analytics Engineering and Ops as we're looking to:
- Store this data for as long as we can; and
- Increase the sampling size as much as we can in order to improve accuracy.
In reality, #1 only punts #2 and #4 to some later date. Analytics Engineering (AE) will have to introduce a shim to refine the event.popups table into a pageview_hourly-like table when there's precedent to refine requests to /beacon/foo. This sounds like technical debt and I'd expect AE to push back.

ovasileva moved this task from Backlog to Next Up on the Page-Previews board. Dec 12 2017, 3:48 PM

I will set up a meeting when I return from vacation so that all the engineers can talk through the potential solutions and work out how to make this happen.

MBinder_WMF awarded a token. Dec 12 2017, 5:36 PM

• Tbayer renamed this task from [Spike] How can we count previews with as high a degree of accuracy as possible? to [Spike] How can we measure seen page previews with as high a degree of accuracy as possible?. Dec 12 2017, 5:49 PM

• Tbayer reassigned this task from Jdlrobson to ovasileva.

• Tbayer reassigned this task from ovasileva to Jdlrobson.

• Tbayer updated the task description.

@ovasileva Rewrote the task description further as discussed on Friday. To highlight one point in particular: WMF already tracks pageview numbers with relatively high precision (with a lot of work having gone into refining definitions and building and debugging aggregation steps over the past few years), so I don't think we should focus on creating a separate permanent "preview + pageviews" metric here (which would require us to maintain the custom pageviews part too). Also, we won't be able to test the hypothesis that it (or pageviews per se) will increase or decrease just by observing the timeline - that's what the A/B tests were for.

In T182414#3831704, @Jdlrobson wrote:

I will set up a meeting for all engineers when I return from vacation so all the engineers can talk through the potential solutions and work out how to make this happen.

reminder to also add @Tbayer to the meeting

In T182414#3830586, @phuedx wrote:

Notes

#2 and #4 could and should include simplifying and/or removing the existing EventLogging instrumentation and A/B test code.

Should we add this to the criteria? Is everyone fine with this happening? (From my side, 👍)

Re: solutions, I think #2 and #4 are quite clear as far as the Page-Previews client codebase is concerned, and they sound fine, but I'm personally not sure what they would entail on the backend (in terms of extra work or collaboration).

In T182414#3882256, @Jhernandez wrote:

...but I'm personally not sure what they would entail on the backend (regarding extra work or collaborations).

Meaning, we'll dive into this in the meeting, reach out to other teams as needed to clarify things, and get back to you in this task when we know more.

Meeting set for Thursday. Thanks @phuedx!

I have looked a bit into how the aggregation step could work for the beacon approach (#4). The structure of the corresponding query that generates the pageview_hourly table is fairly simple. I think I could take a stab at adapting it for seen previews, once we have found a way to send that kind of "virtual" beacon request which conveys the necessary information (e.g. project, page name), analogous to the MediaViewer solution. It could look very similar to the existing EL beacon requests that we already send for Popups (which already has those fields), but would be processed differently on the backend.

We would then need some support from AE to make this query into an Oozie job, but I can't imagine that this will be a big challenge for them.
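
As a rough illustration of what the refinement step would need to extract from each "virtual" beacon request, the sketch below parses a request of the `/beacon/preview?duration=X&uri=Y` form proposed earlier in this task. The output field names, the use of the hostname as the project, and the `/wiki/` path prefix are all assumptions for the example; the real job would be a Hive query run via Oozie, not Python.

```python
from typing import Optional
from urllib.parse import parse_qs, unquote, urlsplit

WIKI_PATH_PREFIX = "/wiki/"  # assumed article path prefix


def parse_preview_beacon(request_path: str) -> Optional[dict]:
    """Extract project, page title and duration from a hypothetical beacon request."""
    parts = urlsplit(request_path)
    if parts.path != "/beacon/preview":
        return None
    params = parse_qs(parts.query)  # parse_qs also percent-decodes the values
    if "uri" not in params or "duration" not in params:
        return None
    try:
        duration_ms = int(params["duration"][0])
    except ValueError:
        return None
    target = urlsplit(params["uri"][0])
    if not target.path.startswith(WIKI_PATH_PREFIX):
        return None
    return {
        "project": target.netloc,  # hostname stands in for the project here
        "page_title": unquote(target.path[len(WIKI_PATH_PREFIX):]),
        "duration_ms": duration_ms,
    }
```

Requests that do not match the expected shape return `None`, so malformed hits are dropped rather than counted.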

phuedx mentioned this in T184801: Remove A/B testing code and EventLogging instrumentation. Jan 12 2018, 2:20 PM

phuedx added a project: Readers-Web-Kanbanana-Board-Old. Jan 12 2018, 2:23 PM

phuedx moved this task from To Do to Doing on the Readers-Web-Kanbanana-Board-Old board.

^ @MBinder_WMF, @ovasileva: Reflecting reality. The majority of Readers Web had a meeting to discuss the answer to this question yesterday (Thursday, 11th January 2018).

phuedx mentioned this in T184793: [EPIC] Instrument page interactions. Jan 12 2018, 2:48 PM

Etherpad: https://etherpad.wikimedia.org/p/t182414
Notes: I've created the following implementation tasks: T184793: [EPIC] Instrument page interactions and T173952: Remove A/B testing instrumentation code. Note well that the former blocks the latter.

Per the

SS: Ping Analytics Engineering and Operations about which URL that we should request. Operations need a traffic estimate.

action item, I'll be reaching out to AE and Ops by email and will follow up on those implementation tasks.

phuedx moved this task from Doing to Ready for Signoff on the Readers-Web-Kanbanana-Board-Old board. Jan 12 2018, 2:59 PM

Sorry. I'll add a note about why we can't use the existing EventLogging instrumentation to the description.

In T182414#3830586, @phuedx wrote:

Possible Solutions

Just leave the EventLogging instrumentation running and refine this information from the event.popups Hive table

i.e. make this Not Our (Readers Web's) Problem™

Modify the EventLogging instrumentation to track how long the preview is visible for.

~~Update the statsv reducer to track when a preview is shown and dismissed and then increment a counter if the preview was visible for longer than X milliseconds.~~

This seems incompatible with the long-term goal of having all page interaction data in Hive.

Introduce a new kind of instrumentation that makes a request (via the Beacon API) to /beacon/preview?duration=X&uri=Y, like the Multimedia Viewer does.

#1 and #2 were dismissed because the EventLogging stack doesn't record IPs, for example, and so the data can't be refined to something pageview-like.

phuedx mentioned this in T173952: Remove A/B testing instrumentation code. Jan 15 2018, 11:19 AM

Jdlrobson moved this task from Needs Prioritization to 2017-18 Q2 on the Web-Team-Backlog board. Jan 16 2018, 9:05 PM

This is now blocked on a discussion on the analytics@ mailing list.

ovasileva moved this task from 2017-18 Q2 to 2017-18 Q3 on the Web-Team-Backlog board. Jan 17 2018, 6:43 PM

phuedx added a parent task: T184793: [EPIC] Instrument page interactions. Jan 22 2018, 5:44 PM

Sam, is this still blocked? I think we got an answer on list, right?

phuedx updated the task description. Jan 31 2018, 1:07 PM

In T182414#3928498, @Jdlrobson wrote:

Sam is this still blocked?

No. Mostly.

I think we got an answer on list, right?

I believe https://lists.wikimedia.org/pipermail/analytics/2018-January/006136.html summarises the outcome of the on- and off-list discussions. For whatever reason, the formatting of that email seems to have been lost so I'll quote/reformat it here:

Hullo all,

It seems like we've arrived at an implementation for the client-side (JS) part of this problem: use EventLogging to track a page interaction from within the Page Previews code. This'll give us the flexibility to take advantage of a stream processing solution if/when it becomes available, to push the definition of a "Page Previews page interaction" to the client, and to rely on any events that we log in the immediate future ending up in tables that we're already familiar with.

I'll update T184793: [EPIC] Instrument page interactions with additional detail and suggestions from AE.

phuedx moved this task from Blocked on Others to Doing on the Readers-Web-Kanbanana-Board-Old board. Jan 31 2018, 1:19 PM

Done. See T184793#3906745 to T184793#3933982.

phuedx moved this task from Doing to Ready for Signoff on the Readers-Web-Kanbanana-Board-Old board. Jan 31 2018, 1:58 PM

phuedx removed phuedx as the assignee of this task. Feb 1 2018, 10:07 AM

Over to you @ovasileva! Signing this off should boil down to making sure that T184793: [EPIC] Instrument page interactions is actionable.

ovasileva closed this task as Resolved. Feb 6 2018, 6:10 PM

I'll reply to the thread asking for a review of T184793: [EPIC] Instrument page interactions.

Jdlrobson mentioned this in T193051: Remove all page previews instrumentation code. Apr 25 2018, 4:26 PM
