Newcomer tasks: create separate EventLogging schema for newcomer task impressions
Open, Needs TriagePublic

Description

Currently newcomer task impressions are logged as part of the HomepageModule schema. The post-edit dialog will also involve task impressions, and some of the HomepageModule fields don't make sense in that situation. In the future tasks might show up in even more places (e.g. if we move them from the homepage to a dedicated special page, as planned in one of the variant tests). To have the task impression data in one place, it needs a schema of its own. This will also make it more semantic as we are logging almost a dozen fields for a task impression, but currently these are serialized into a string.

Event Timeline

Tgr created this task.Apr 30 2020, 2:24 PM
Restricted Application added a subscriber: Aklapper. Apr 30 2020, 2:24 PM
Tgr added a comment.Apr 30 2020, 2:38 PM

Tentative name: NewcomerTaskImpression.

Task-related fields (same as what we log in action_data for HomepageModule):

  • taskType
  • topic (the best-matching one of the topics which the user has selected, not present when the user has not selected any topics)
  • matchScore (topic matching score, float, bigger is stronger match, only present when topic is)
  • maintenanceTemplates (array of template names without namespace prefix)
  • revisionId
  • pageId
  • pageTitle
  • hasImage (whether the task card has an image)
  • pageviews
  • ordinalPosition (of the task in the result list, 0-based)

ordinalPosition won't always make sense but there's no harm in setting it to zero when we only show a single suggestion.
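As a rough illustration of the field list above, here is a minimal Python sketch of how a task impression record could be modelled and validated. The class name `TaskImpression`, the `to_event` helper and the validation rules are assumptions made for this example; only the field names and their described meaning come from the list above.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class TaskImpression:
    """Hypothetical in-memory form of the proposed task-related fields."""

    taskType: str
    revisionId: int
    pageId: int
    pageTitle: str
    hasImage: bool
    pageviews: int
    ordinalPosition: int = 0  # 0-based; zero when only one suggestion is shown
    maintenanceTemplates: List[str] = field(default_factory=list)
    topic: Optional[str] = None         # absent when the user selected no topics
    matchScore: Optional[float] = None  # only present when topic is

    def __post_init__(self) -> None:
        # matchScore is described as "only present when topic is".
        if self.matchScore is not None and self.topic is None:
            raise ValueError("matchScore requires topic")
        if self.ordinalPosition < 0:
            raise ValueError("ordinalPosition is 0-based and cannot be negative")

    def to_event(self) -> dict:
        """Return the fields as a dict, omitting unset optional fields."""
        return {k: v for k, v in asdict(self).items() if v is not None}
```

A single post-edit suggestion would simply keep the default `ordinalPosition` of zero, and `to_event()` would leave out `topic` and `matchScore` when no topics are selected.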

Context fields:

  • One option would be to always log two events, a NewcomerTaskImpression and a HomepageModule/HelpPanel/whatever else, with some sort of token to bind the two together. That would allow more flexibility for context data - in the post-edit dialog, HomepageModule would collect page ID and so on. So the only other field would be some sort of taskimpression_token and maybe a source that tells whether this is a homepage or postedit impression.
    • Or maybe it's enough to be able to join on the pageview (which can include many task impressions), so the schema could have a pageview token, which would be the same as homepage_pageview_token or help_panel_session_id.
  • The other (not necessarily mutually exclusive) option is to try to collect the relevant context data - user ID, editcount, not sure what else would be needed.
Tgr claimed this task.Apr 30 2020, 2:39 PM
Tgr added a subscriber: TgrTest2.
Tgr removed a subscriber: TgrTest2.
MMiller_WMF renamed this task from Create separate EventLogging schema for newcomer task impressions to Newcomer tasks: create separate EventLogging schema for newcomer task impressions.Apr 30 2020, 5:44 PM
MMiller_WMF added subscribers: Catrope, kostajh, marcella and 3 others.

@Tgr @nettrom_WMF -- we've decided that this is a blocker for releasing guidance, right? Because we want to make this change to facilitate guidance instrumentation, and instrumentation is a blocker? If so, I will put it in Ready for Development.

nettrom_WMF moved this task from Triage to Tracking on the Product-Analytics board.

@MMiller_WMF : yes, that's my interpretation of the situation, that we'll need this implemented in order to track post-edit task impressions, so it should go to Ready for Development.

@Tgr : The proposed name and fields all make sense to me, and the latter should since it's what's captured in HomepageModule already :)

A couple of things I thought of:

  1. Should this schema also be used to capture se-task-pseudo-impression events? I'm not sure whether we should regard those as a type of impression event that should go here, or regard those as an error that's better kept in the HomepageModule. Part of the reason why I'm unsure is that it might add a couple of fields to the schema that'll be (relatively) rarely used, but I do see se-task-impression and se-task-pseudo-impression as very similar events.
  2. When it comes to context fields, we use homepage_pageview_token to join HomepageVisit and HomepageModule, and to group HomepageModule events together (and also to join with EditAttemptStep for edits). So it would make sense to have a token field to join Homepage task impressions with other events in HomepageModule. If I understand the plan correctly, we'll be using the HelpPanel schema to store guidance. That schema does capture a lot of the context information and I'd lean towards keeping it there, meaning I'm leaning towards logging two events (one for the guidance post-edit context and one for the task impression itself). We might want to have those events use the same token as the edit so they can easily be connected, but I'm not sure how easy it is to keep tokens around.
Tgr added a comment.May 1 2020, 3:11 PM

Thanks @nettrom_WMF, I also prefer the approach of logging two events that can be joined. Keeping tokens around for multiple requests is possible (we'll keep some other data like choice of editor around anyway) but I think here the opposite is needed: we want to generate a new token for every task shown, and there could be multiple tasks during a single pageview (we only show one on postedit now, but that seems like something that could change). So it's better to just use a one-off random token. So the HelpPanel event and the NewcomerTask event (I used that name in the end, can still be changed of course, since it will include not just task impressions but also task clickthroughs) can be joined by that token, and the HelpPanel event is still connected to earlier events like EditAttemptStep via session_token. (If we want to connect by something more specific, like editing_session_id, that should be added to HelpPanel, not NewcomerTask, IMO. Having that kind of connection would be useful for non-task-related events too, in any case.)
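A minimal Python sketch of the one-off token idea, for illustration only. The function names, the `schema` key and the 10-byte token length are assumptions; `action_data.newcomerTaskToken` and `newcomer_task_token` are the field names used elsewhere in this discussion. The real instrumentation lives in the GrowthExperiments extension and may look quite different.

```python
import secrets
from typing import Any, Dict, Tuple

Event = Dict[str, Any]


def new_task_token() -> str:
    """One-off random token; the 10-byte length is an arbitrary choice here."""
    return secrets.token_hex(10)


def build_event_pair(context_schema: str, action: str, task_data: Event) -> Tuple[Event, Event]:
    """Build a context event and a NewcomerTask event bound by a fresh token."""
    token = new_task_token()
    context_event = {
        "schema": context_schema,  # e.g. "HelpPanel" or "HomepageModule"
        "action": action,
        "action_data": {"newcomerTaskToken": token},
    }
    task_event = {"schema": "NewcomerTask", "newcomer_task_token": token, **task_data}
    return context_event, task_event
```

Because every call generates a new token, several tasks shown during one pageview each get their own pair of events.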

Pseudo-impressions don't have any task data, so I don't think it makes sense to involve the NewcomerTask schema. They don't seem that conceptually related, either - the user seeing an error page or a "no more tasks" message seems not very relevant for analysis of what tasks work well.

Tgr added a comment.EditedMay 1 2020, 3:11 PM

New schema: NewcomerTask

Changes: HomepageModule, HelpPanel (includes other changes for T245790 as well).

Change 593972 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] Instrumentation schema changes

https://gerrit.wikimedia.org/r/593972

Thanks @Tgr, this is looking good to me! Regarding the pseudo-impressions, what you say makes sense at this point. I think we're mainly concerned about them in the context of the Homepage, where the user has the ability to affect them by changing topics and tasks.

I gave some more thought to the way tokens should work, and find that I agree with the way you've implemented that. It's tempting to want to connect more user activity together, i.e. such that the post-edit dialog connects to the previous guidance, but I think it makes sense to generate a new token on the post-guidance screen. Then that's where a new task impression and potential edit starts, and I expect us to be interested in the conversion rate of those into saved edits, which we'll be able to measure.

Tgr added a comment.May 5 2020, 3:16 PM

It's tempting to want to connect more user activity together, i.e. such that the post-edit dialog connects to the previous guidance, but I think it makes sense to generate a new token on the post-guidance screen. Then that's where a new task impression and potential edit starts, and I expect us to be interested in the conversion rate of those into saved edits, which we'll be able to measure.

Currently we have three tokens: newcomer_task_token (on relevant events only), which just serves as a foreign key to the task data, session_token which identifies the user's browsing session (but not the specific browser tab) and help_panel_session_id which identifies the pageview. So there is some amount of connection to the guidance events; if we want something more specific (browser tab specific user sessions, basically) I'd add that as a new token or maybe change the semantics of session_token.

nettrom_WMF added a comment.EditedMay 6 2020, 5:22 PM

Hm, I think I should've been clearer in my previous comment and laid out the joins we do in our measurements, so let me do that. Hopefully we're then all on the same page and we know that the new schema structure will serve our needs.

In the Homepage context, we currently make the following joins in order to understand usage of the Homepage and the Newcomer Tasks funnel.

HomepageVisit -> HomepageModule -> EditAttemptStep

These joins use homepage_pageview_token in the Homepage schemas and editing_session_id in EditAttemptStep.
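To make the join concrete, here is a small Python sketch of that funnel over in-memory event lists. It assumes the editing session carries the same value as the homepage pageview token; the helper names and the `saveSuccess` action value are assumptions for the example, and real analyses would run against the EventLogging tables instead.

```python
from typing import Any, Dict, Iterable, Set

Event = Dict[str, Any]


def tokens_where(events: Iterable[Event], key: str, **conditions: Any) -> Set[str]:
    """Collect the values of `key` from events that match all `conditions`."""
    return {e[key] for e in events if all(e.get(k) == v for k, v in conditions.items())}


def homepage_funnel(visits: Iterable[Event], module_events: Iterable[Event],
                    edit_steps: Iterable[Event]) -> Dict[str, int]:
    """Count pageviews surviving each step: visit -> task click -> saved edit."""
    visited = tokens_where(visits, "homepage_pageview_token")
    clicked = tokens_where(module_events, "homepage_pageview_token",
                           action="se-task-click") & visited
    # Assumes editing_session_id holds the homepage pageview token.
    saved = tokens_where(edit_steps, "editing_session_id",
                         action="saveSuccess") & clicked
    return {"visits": len(visited), "task_clicks": len(clicked),
            "sessions_with_save": len(saved)}
```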

Once Guidance is in place, I think we end up with two different contexts where we want to make joins. The first is an extension of the current one, that allows us to understand usage of the Guidance panel during editing when clicking on a task on the Homepage:

HomepageModule -> NewcomerTask -> EditAttemptStep -> HelpPanel

In this case, we'd use homepage_pageview_token in HomepageModule, newcomer_task_token in NewcomerTask, editing_session_id in EditAttemptStep, and help_panel_session_id in the HelpPanel schema. Joining with the HomepageModule schema isn't strictly necessary, but I suspect we'd want to be able to for whenever we want to dig deeper.

The second context is the post-edit screen. Based on how I understand the instrumentation, we get the following joins:

HelpPanel -> NewcomerTask -> EditAttemptStep -> HelpPanel

(So yes, it's similar to the previous one, except in this case HelpPanel is the context for the NewcomerTask, instead of HomepageModule)

As I understand it, we'll generate a new pageview token and therefore a new help_panel_session_id in the post-edit screen, so that creates a new context and makes sure that editing_session_id in the subsequent session is unique (I think that's an assumption in EditAttemptStep that we shouldn't break). That'll then allow us to connect the chain of events from clicking on a task in the post-edit screen through edits and guidance usage.

In short, I think what we have covers our needs. Let me know if I've missed or misunderstood something.

Tgr added a comment.EditedMay 7 2020, 9:18 PM

@nettrom_WMF this is the current connection between the schemas AIUI (current as in, some of the patches are still pending review), with everything in the same column being equal:

| Schema / event | Pageview or session token | Task token |
| --- | --- | --- |
| HomepageVisit | homepage_pageview_token | |
| HomepageModule | homepage_pageview_token | action_data.newcomerTaskToken |
| NewcomerTask (on homepage) | | newcomer_task_token |
| HelpPanel (guidance) | help_panel_session_id | |
| HelpPanel (in editor) | help_panel_session_id | |
| EditAttemptStep | editing_session_id | |
| (more HelpPanel / EditAttemptStep events without a page save, e.g. reloading the editor or cancelling the edit) | same session id | |
| HelpPanel (post-edit) | help_panel_session_id (different) | action_data.newcomerTaskToken |
| NewcomerTask (on post-edit panel) | | newcomer_task_token |
| more HelpPanel, EditAttemptStep and NewcomerTask events if the user makes more edits to the same task, or starts a new task (via post-edit task card) | same session ID (until the next save) | |

All HelpPanel and EditAttemptStep events are also linked by having the same session_token, but that's shared by all browser tabs so it's barely more specific than the user id. Some EditAttemptStep events are also linked by having the same page_token, but that's heavily implementation-dependent.

So the EditAttemptStep session IDs are not quite unique. (Note this has been the case to a lesser extent since we deployed suggested edits: editing sessions inherit the homepage token, so if you open a bunch of task cards in a new tab, all those tabs use the same session ID.) The schema documentation is not very clear on whether they should be, either; it says "unique to the current page view session" but doesn't say what a page view session is.

If I understand correctly, this is not exactly what you had in mind, and the "more HelpPanel / EditAttemptStep events without a page save" row should reset the session ID?

@Tgr : Thanks for putting this table together, that's a great way to describe the events and how the data works out!

I also appreciate the notes about how EditAttemptStep session IDs won't necessarily be unique. That's an assumption it was useful to have corrected.

From your layout, the only thing that I'd like to change is that newcomer_task_token is connected through action_data. Instead, I'd like to see it have the same value as either homepage_pageview_token (if the events happen on the Homepage) or help_panel_session_id (if the event happens in the post-edit dialogue). Then the way the tokens are used would be consistent across our features, and connecting the various schemas and events is straightforward.

When it comes to the "more HelpPanel / EditAttemptStep events without a page save" row, I would prefer if it didn't reset the token, but if I remember correctly there are cases where that's really difficult to engineer (e.g. if the user cancels the edit and then opens the editor again, we might have a different editing_session_id), and that's okay.

The key measurement case for me in all of this is the funnel from task impression to first saved edit. That's the key thing we track in the reporting notebook that @MMiller_WMF has. Once guidance is in place, we want to track that for newcomer tasks clicked on the Homepage, or in the post-edit guidance screen.

Tgr added a comment.May 12 2020, 12:46 PM

From your layout, the only thing that I'd like to change is that newcomer_task_token is connected through action_data. Instead, I'd like to see it have the same value as either homepage_pageview_token (if the events happen on the Homepage) or help_panel_session_id (if the event happens in the post-edit dialogue). Then the way the tokens are used would be consistent across our features, and connecting the various schemas and events is straightforward.

That seems like a bad idea to me. If you page through four task cards and click on the fifth one, there will be six homepage events (five impressions and a click), all with the same homepage_pageview_token. If we use that to connect schemas, how do you tell which is the NewcomerTask event that corresponds to the task the user actually selected? (For the post-edit dialog we don't have that problem right now since there is only one card, but it doesn't seem that unlikely that we'd have some sort of navigation interface there as well in the future.)
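The ambiguity can be shown with a few lines of Python. This is only a toy model: the `pageview_token` field on the task events stands in for the proposal of reusing the pageview token, and all token values are made up.

```python
from typing import Any, Dict, List

Event = Dict[str, Any]


def matches(click: Event, task_events: List[Event], click_key: str, task_key: str) -> List[Event]:
    """Return the task events whose `task_key` equals the click's `click_key`."""
    return [t for t in task_events if t[task_key] == click[click_key]]


PAGEVIEW = "pv-1"  # made-up token shared by everything on this pageview

# Five task cards shown during one homepage pageview.
task_events = [
    {"pageview_token": PAGEVIEW, "newcomer_task_token": f"task-{i}", "ordinal_position": i}
    for i in range(5)
]

# The user clicks on the fifth card.
click = {
    "action": "se-task-click",
    "homepage_pageview_token": PAGEVIEW,
    "newcomerTaskToken": "task-4",
}

by_pageview = matches(click, task_events, "homepage_pageview_token", "pageview_token")
by_task = matches(click, task_events, "newcomerTaskToken", "newcomer_task_token")
```

Joining on the pageview token leaves five candidate tasks for the click, while joining on the per-task token identifies exactly the card at ordinal position 4.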

If you want to easily connect everything in a single session, maybe it would be better to add a new NewcomerTask.session_id field and store homepage_pageview_token / help_panel_session_id there. Or, if the problem is that using a sub-field of action_data is hard, maybe we should promote action_data.newcomerTaskToken to a top-level field?

When it comes to the "more HelpPanel / EditAttemptStep events without a page save" row, I would prefer if it didn't reset the token, but if I remember correctly there are cases where that's really difficult to engineer (e.g. if the user cancels the edit and then opens the editor again, we might have a different editing_session_id), and that's okay.

My current approach (which has not yet been validated by code review and testing, though) is to store a session token in the browser's sessionStorage (kinda like a cookie, except 1) does not get automatically included in web requests, 2) is not shared with other browser tabs and only lives as long as the browser tab does) whenever the user clicks on a task card, and then run a small script on every new pageview that decides whether that pageview is still part of the session.

If that works out, it makes it pretty easy to apply almost arbitrary business logic to when the token should be reset. Currently it's done after the user saves the page (so it does survive cancelling out and editing again, and it also survives reloading the page which happens a lot on mobile when you put the tab in the background, which was my main motivation for doing it this way), but it could be done more or less often just as easily.
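A language-neutral model of that logic, written in Python for illustration. The real code would run in the browser against `sessionStorage`; here a plain dict stands in for one tab's storage. The class name, the storage key, and the choice to replace the token right after a save are all assumptions of this sketch, not a description of the actual patch.

```python
import secrets
from typing import MutableMapping, Optional

SESSION_KEY = "newcomer-task-session"  # hypothetical storage key


class TaskSession:
    """Models a per-tab session token kept in a sessionStorage-like mapping."""

    def __init__(self, storage: MutableMapping[str, str]) -> None:
        self.storage = storage  # one mapping per browser tab

    def on_task_card_click(self) -> str:
        """Start a session when the user clicks on a task card."""
        token = secrets.token_hex(10)
        self.storage[SESSION_KEY] = token
        return token

    def on_pageview(self, follows_page_save: bool) -> Optional[str]:
        """Decide on each pageview whether the session continues.

        The token survives reloads and cancelled edits; it is replaced
        only on the pageview that follows a page save.
        """
        if follows_page_save and SESSION_KEY in self.storage:
            self.storage[SESSION_KEY] = secrets.token_hex(10)
        return self.storage.get(SESSION_KEY)
```

Changing when the token resets only requires changing the condition in `on_pageview`, which is the flexibility described above.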

That seems like a bad idea to me.

I went back and read through things, and agree that it seems like a bad idea. One thing that had slipped my mind in this discussion is that we're logging two events for an impression/click, one in HomepageModule/HelpPanel with the context, and one in NewcomerTask with the task-specific information. Then it makes total sense to use action_data as a foreign key to newcomer_task_token in the way you've set it up. Sorry for not catching that earlier!

The way you've described how token storage would work through sessionStorage in the browser also sounds good to me. So, in short: all of this looks fine to me!

@Tgr : I went and had another look at Schema:NewcomerTask. I noticed that page_id is not required, but page_title is required. In analyses, working with page IDs is preferable to titles, because the latter changes with page moves. In other words, for me page_id would also be required. Maybe there are cases where we have a title but not the page ID?

Tgr added a comment.May 15 2020, 10:30 AM

@Tgr : I went and had another look at Schema:NewcomerTask. I noticed that page_id is not required, but page_title is required. In analyses, working with page IDs is preferable to titles, because the latter changes with page moves. In other words, for me page_id would also be required. Maybe there are cases where we have a title but not the page ID?

Not right now; there will be if we add translation or new article creation as a task type.
(Also on development setups we often use fake page names so there's no ID, but I don't think making it required would interfere with that.)

Tgr added a comment.May 15 2020, 10:34 AM

I guess technically it could also happen that the user queries the task API in the short timeframe between deleting a page and the ElasticSearch index being updated, so they receive a task card about an article that does not exist anymore, and since the ID is coming from a DB lookup and not ElasticSearch, it would be missing from the task data. It's a pretty unlikely edge case though.

Change 593972 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Instrumentation schema changes

https://gerrit.wikimedia.org/r/593972

Change 605696 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] Fix NewcomerTask schema field names

https://gerrit.wikimedia.org/r/605696

Change 605696 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Fix NewcomerTask schema field names

https://gerrit.wikimedia.org/r/605696

For reviewing by @MMiller_WMF and @nettrom_WMF

After testing on betalabs (and in the Console in production), the following needs to be reviewed.

(1) I could not see the following values for context on betalabs (and not in the Console for production):
"homepage-click"
"postedit-click"

"postedit-impression" and "homepage-impression" appear all right. And for Schema:HelpPanel all the events such as "postedit-impression", "postedit-close", "postedit-link-click", and "postedit-task-click" were present.

Also, the following were consistently absent (though required=false):
topic
match_score

(2) ordinal_position will be always 0 for "postedit-impression" (of course).

(3) Full page_title is displayed for Schema:NewcomerTask for both context values - "postedit-impression" and "homepage-impression". It's not going to be hidden for any of the context?

HelpPanel schema does not record page title for postedit-task-click and for postedit-link-click.

I checked this out, and yes, I agree with @Etonkovidova that the big thing I'm confused about is where we are recording that a user clicked on the post-edit task suggestion. I was expecting to see that in the NewcomerTasks schema, but instead, I see an event for that in the HelpPanel schema. Is it supposed to be in both places? If so, why? Below are counts for the NewcomerTasks schema in arwiki right now, grouping by context. Meanwhile, I also see that the HomepageModule schema is also recording the se-task-click event. How does this all work?

Tgr added a comment.Jun 29 2020, 1:37 PM

(1) I could not see the following values for context on betalabs (and not in the Console for production):
"homepage-click"
"postedit-click"

Click events do not reliably show up on the console because the page gets unloaded by the time the EventLogging request is handled by the server. You can see them in the Network tab in Chrome in a "pending" status. I think that's just a DevTools anomaly and the requests do succeed (EventLogging uses the sendBeacon API which browsers provide specifically for the use case of making requests that do not need a response but need to survive the page unloading). I'm not sure there's a good way to test clicks other than looking them up in EventLogging data.

Also, the following were consistently absent (though required=false):
topic
match_score

They are only present when you are filtering for some topics. This was part of the spec for T242052, and it is also a technical restriction related to T243478: Newcomer tasks: fetch ElasticSearch data for search results (although we might want to fix that anyway for performance reasons).

Also note that even when present, the scores are faked on beta. They only work properly when searching for tasks local to the wiki.

(3) Full page_title is displayed for Schema:NewcomerTask for both context values - "postedit-impression" and "homepage-impression". It's not going to be hidden for any of the context?
HelpPanel schema does not record page title for postedit-task-click and for postedit-link-click.

Should it be? That tells what task the user is working on, while HelpPanel would potentially log any page the user visits, so it is a lot less discriminating.

I checked this out, and yes, I agree with @Etonkovidova that the big thing I'm confused about is where we are recording that a user clicked on the post-edit task suggestion. I was expecting to see that in the NewcomerTasks schema, but instead, I see an event for that in the HelpPanel schema. Is it supposed to be in both places? If so, why? Below are counts for the NewcomerTasks schema in arwiki right now, grouping by context. Meanwhile, I also see that the HomepageModule schema is also recording the se-task-click event. How does this all work?

Yeah, task events should be recorded in two places. It's basically data normalization: there is a lot of data about a task, so instead of adding all those fields to both the HomepageModule and HelpPanel schemas, they are recorded in NewcomerTasks and the original HomepageModule/HelpPanel just contains the ID of the NewcomerTasks event instead of containing the event data directly. We can denormalize that if it's causing problems.

Tgr added a comment.Jun 29 2020, 1:44 PM

Click events do not reliably show up on the console

Actually that's probably not the reason. We are reusing NewcomerTasks events within the same request, so logs look something like this:

| interaction | HelpPanel event | NewcomerTask event |
| --- | --- | --- |
| user sees post-edit panel | action: postedit-impression, action_data.newcomerTaskToken=ff236f447233c9a4647c | newcomer_task_token=ff236f447233c9a4647c |
| user clicks on task | action: postedit-click, action_data.newcomerTaskToken=ff236f447233c9a4647c | - |

(HomepageModule events similarly reuse the task data. Also for multiple impression events of the same task, when navigating back and forth.)

The assumption was that reports will join HelpPanel and NewcomerTask using the newcomer_task_token key, in which case omitting the event on click doesn't really matter - it's the same task, with the same data, so both HelpPanel events will be associated with the same NewcomerTask event. (Although it also means that the context field for NewcomerTask is fake / not really meaningful, something I did not realize at the time.)
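A short Python sketch of that join, using the token from the example log above. The `page_id` value and the helper name are made up for illustration; the point is only that the impression and the click both resolve to the one NewcomerTask event that was logged.

```python
from typing import Any, Dict, List

Event = Dict[str, Any]

TOKEN = "ff236f447233c9a4647c"

help_panel_events: List[Event] = [
    {"action": "postedit-impression", "action_data": {"newcomerTaskToken": TOKEN}},
    {"action": "postedit-click", "action_data": {"newcomerTaskToken": TOKEN}},
]

# Only one NewcomerTask event is logged; the click reuses it.
newcomer_task_events: List[Event] = [
    {"newcomer_task_token": TOKEN, "page_id": 12345, "ordinal_position": 0},  # made-up task data
]


def attach_task_data(context_events: List[Event], task_events: List[Event]) -> List[Event]:
    """Attach the matching NewcomerTask event (or None) to each context event."""
    by_token = {t["newcomer_task_token"]: t for t in task_events}
    joined = []
    for event in context_events:
        token = event.get("action_data", {}).get("newcomerTaskToken")
        joined.append({**event, "task": by_token.get(token)})
    return joined


joined = attach_task_data(help_panel_events, newcomer_task_events)
```

Both joined rows point at the same task record, so the missing NewcomerTask event on click loses no information in a report built this way.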

Tgr added a comment.Jun 29 2020, 9:35 PM

So basically it is like this (apologies for the shoddy drawing):


where things in the same row happen at the same time, and the arrow means the two records are connected by having the same newcomer_task_token value.

If that works for you and @nettrom_WMF then I think the only thing left here is to remove the NewcomerTask.context field which is misleading and there was never any need for it. Otherwise, it's easy to change the code to duplicate the NewcomerTask events so that each HelpPanel or HomepageModule event has its own copy (or merge task data back into those schemas, even).

Change 608733 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] Remove context field from NewcomerTask schema

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/608733

If that works for you and @nettrom_WMF

That works and makes sense to me. The way this is set up means that post-edit task impressions and clicks in the HelpPanel schema work the same way as for the Homepage. In other words, we only need to join with the NewcomerTask schema when we need more information about the task itself.

Change 608733 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Remove context field from NewcomerTask schema

https://gerrit.wikimedia.org/r/608733

Thanks @Tgr for explanation in https://phabricator.wikimedia.org/T251526#6264138 comment! I see the connection between schemas, as in https://phabricator.wikimedia.org/T251526#6265832 and since it all makes sense to @nettrom_WMF - all is fine.

context field is removed - Schema:NewcomerTask.