Homepage: HomepageVisit schema specification
Closed, ResolvedPublic

Description

From the measurement specification draft:

  1. General
    1. What percent of users access the homepage?
      1. What percent of those users access the homepage multiple times?
      2. When users access the homepage multiple times, over what timeline does that happen?
      3. Are there differences between mobile and desktop users?
    2. How soon after creating their account does a user first visit the homepage?
    3. How do users get to the homepage?
      1. By clicking on their username in the top navigation?
      2. By clicking on the tab from their User or User talk page?
      3. Via another driver, such as a banner on wiki or link in an email?
      4. If the user clicked on a link to the Homepage from the top navigation or a tab, we want to capture the context they were in. Specifically, we want to know what namespace they were in when they clicked it, and whether they were reading or editing.
    4. How much time do users spend on the homepage per visit?
      1. During their first visit?
      2. During subsequent visits?
    5. What percent of users follow links from at least one module? (covered by T219435)
    6. Are users who have access to the homepage more or less likely to create a user page?
    7. Are users who have access to the homepage more or less likely to interact with other users through user talk pages?
    8. Are users who have access to the homepage more or less likely to interact with the community through article and project talk pages?
  2. Overall rules
    1. We will record an impression event every time a user visits the home page.
      1. For these events, we will also capture a list of the modules that are rendered.
      2. We will also capture impression events for specific modules as listed below in order to gather metadata relevant to the analysis of those modules.

Proposed schema, to be logged-to from the execute() method in Special:Homepage.

{
    "description": "Log visits to Special:Homepage (provided by Extension:GrowthExperiments) from the server-side.",
    "properties": {
        "is_mobile": {
            "type": "boolean",
            "required": true,
            "description": "If the event is associated with the mobile web frontend."
        },
        "referer_route": {
            "type": "string",
            "required": false,
            "description": "The route the user took to arrive at the Special:Homepage. Calculated by looking at the query parameter.",
            "enum": [
                "userpagetab",
                "usertalkpagetab",
                "personaltoolslink",
                "direct",
                "other"
            ]
        },
        "referer_namespace": {
            "type": "integer",
            "required": false,
            "description": "The namespace associated with the MediaWiki Title (e.g. 0, for Main_Page) that is the referer to this page. Calculated by attempting to load a MediaWiki title from parsing the HTTP REFERER header"
        },
        "referer_action": {
            "type": "string",
            "required": false,
            "enum": [
                "view",
                "edit",
                "other"
            ],
            "description": "The action associated with the user activities on the MediaWiki Title that is the referer to this page. Calculated by looking at the action parameter in the query string of the HTTP REFERER header."
        },
        "user_editcount": {
            "type": "integer",
            "required": true,
            "description": "The user edit count."
        },
        "user_id": {
            "type": "integer",
            "required": true,
            "description": "User ID, needed for tracking across login sessions."
        },
        "impact_module_state": {
            "type": "string",
            "required": true,
            "enum": [
                "activated",
                "unactivated"
            ],
            "description": "Activation state of the impact module."
        },
        "start_tutorial_state": {
            "type": "string",
            "required": true,
            "enum": [
                "complete",
                "incomplete"
            ],
            "description": "Completion state of the tutorial module."
        },
        "start_userpage_state": {
            "type": "string",
            "required": true,
            "enum": [
                "complete",
                "incomplete"
            ],
            "description": "Completion state of the userpage module."
        },
        "start_email_state": {
            "type": "string",
            "required": true,
            "enum": [
                "noemail",
                "unconfirmed",
                "confirmed"
            ],
            "description": "Completion state of the email module."
        },
        "homepage_pageview_token": {
            "type": "string",
            "required": true,
            "description": "One-time token per page load. This is a random user session ID that will be exported to the client-side, so that HomepageModule schema events can be associated with this HomepageVisit event."
        }
    }
}
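
For illustration, an event conforming to the schema above might look like the following. This is a sketch in Python, not the extension's code: the field values are invented and `missing_required` is a hypothetical helper that only checks which required properties are absent.

```python
# Required properties, copied from the schema above.
REQUIRED = {
    "is_mobile",
    "user_editcount",
    "user_id",
    "impact_module_state",
    "start_tutorial_state",
    "start_userpage_state",
    "start_email_state",
    "homepage_pageview_token",
}


def missing_required(event):
    """Return the sorted names of required properties absent from event."""
    return sorted(REQUIRED - set(event))


# Hypothetical event; every value here is invented for illustration.
sample_event = {
    "is_mobile": False,
    "referer_route": "userpagetab",
    "referer_namespace": 2,
    "referer_action": "view",
    "user_editcount": 3,
    "user_id": 12345,
    "impact_module_state": "unactivated",
    "start_tutorial_state": "incomplete",
    "start_userpage_state": "complete",
    "start_email_state": "unconfirmed",
    "homepage_pageview_token": "0123456789abcdef",
}
```

The three `referer_*` properties are optional in the schema, so an event for a visit with no usable referer could simply omit them.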

Event Timeline

kostajh added subscribers: nettrom_WMF, MMiller_WMF.

@nettrom_WMF @MMiller_WMF how important is "and whether they were reading or editing" to answering the research questions? If it's a must-have then I think we would need to log visits to Special:Homepage on the client-side. On the server-side we could look at action=edit parameter but that's not necessarily going to be accurate in all editing scenarios.

I think that's a question for @nettrom_WMF. @kostajh, are you sure the copy/paste worked right? It looks cut off to me.

I think everything in this task can be accomplished by adding referrer to the EditorJourney schema, extending it beyond 24h (when the target title or referrer are Special:Homepage), and adding source params to a few links (userpage, tabs).

[...] On the server-side we could look at action=edit parameter but that's not necessarily going to be accurate in all editing scenarios.

In which scenarios would edit or vedit not be sufficient or accurate?

For example, editing one's user page in plain wikitext mode (no VE), switching to VE, then clicking on the tab to Special:Homepage. In this case, editor journey records action as view, although I also see that the referer sends us http://default.web.mw.localhost/mediawiki/index.php?title=User:Admin&action=edit, so I guess we could parse the referer string to override the action, if it's set there.
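
A rough sketch of parsing the referer string as described above, written in Python for brevity rather than as the actual patch. The function name is hypothetical, and treating `veaction` and `action=submit` as editing is an assumption that is not confirmed anywhere in this task.

```python
from urllib.parse import urlparse, parse_qs


def referer_action(referer):
    """Classify the action in a referer URL as "view", "edit" or "other"."""
    query = parse_qs(urlparse(referer).query)
    # Assumption: a VisualEditor session shows up as a "veaction" parameter.
    if "veaction" in query:
        return "edit"
    # A referer without an action parameter is treated as a plain page view.
    action = query.get("action", ["view"])[0]
    # Assumption: "submit" (saving or previewing an edit) counts as editing.
    if action in ("edit", "submit"):
        return "edit"
    return "view" if action == "view" else "other"
```

With the referer quoted above (`...index.php?title=User:Admin&action=edit`) this returns "edit", even though the user had since switched editors.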

are you sure the copy/paste worked right? It looks cut off to me.

@MMiller_WMF I tried to copy only the relevant text from the measurement document, if you think I've missed anything please add it to the description.

kostajh renamed this task from Homepage: Homepage schema to Homepage: [REVIEW] Modify EditorJourney schema for Homepage visit logging. Mar 28 2019, 3:43 PM

I'm a bit confused at this point about what we're proposing to do here, so here's a question to try to clear that up: is the suggestion here to use EditorJourney for capturing this, rather than have an event in the proposed HomePage schema with action=impression?

Regarding @kostajh's question: how important is "and whether they were reading or editing" to answering the research questions?

After thinking about this, I think it's a "nice to have". My main thought here is that we want to understand whether they are using the Homepage as their central place to find help, particularly around editing. For example, that they're editing and thinking "I'm struggling with this, where can I get help? Oh, I remember, there's help on my homepage!" In this case, these wikis also have the Help Panel, so we'd expect them to use that instead.

I'm a bit confused at this point about what we're proposing to do here, so here's a question to try to clear that up: is the suggestion here to use EditorJourney for capturing this, rather than have an event in the proposed HomePage schema with action=impression?

Yes. The proposal is to log to EditorJourney schema rather than create a new one.

After thinking about this, I think it's a "nice to have". My main thought here is that we want to understand whether they are using the Homepage as their central place to find help, particularly around editing.

I think by using the referer, as mentioned in the task description, we should be able to have this without too much difficulty.

I'm not sure I understand the rationale behind that. Is the concern that if we were to have a "HomePage" schema, its only purpose would be to log these impressions?

One of my concerns with this is that we now have an EditorJourney schema that has different purposes across different projects, and I just realized that these projects also have different retention policies. We're proposing to have a schema that'll contain all Homepage visits, but not be able to retain that data for the entirety of the experiment. I'll have to think more about this as there might be workarounds (e.g. aggregate data for the questions related to this), but I'm not sure this is a great idea.

No, I think the idea was that since we already have code writing to a schema that looks very close to the requirements of this task, we could reuse. But your points about 1) retention policies and 2) clearly separating concerns are well-taken, and I think we should revise the task description to have a HomepageVisits schema. I'll do that shortly, so we can discuss it in our meeting later today.

kostajh renamed this task from Homepage: [REVIEW] Modify EditorJourney schema for Homepage visit logging to Homepage: [REVIEW] HomepageVisits schema. Mar 28 2019, 7:47 PM
kostajh updated the task description.
kostajh renamed this task from Homepage: [REVIEW] HomepageVisits schema to Homepage: HomepageVisits schema specification. Mar 29 2019, 1:55 AM
kostajh updated the task description.

@MMiller_WMF @nettrom_WMF I updated the schema per our meeting earlier today. I'm not sure how you'd like to have the start module and impact module states logged @nettrom_WMF, so please have a look at those fields in particular and if you'd prefer a different format let me know.

kostajh renamed this task from Homepage: HomepageVisits schema specification to Homepage: HomepageVisit schema specification. Mar 29 2019, 1:58 AM

Thanks for getting this update out quickly, @kostajh! I think the way you've chosen to log the start and impact module states makes sense. Should we be concerned about flexibility for future modules? In other words, would a key/value approach work better than separate fields? I'm not sure if that's something we need to be concerned about at this point, and I don't remember if we nailed it down in the meeting either.

In other words, would a key/value approach work better than separate fields?

Do you mean doing something like "module_states": "impact=unactivated;start_tutorial=complete;start_userpage=incomplete"? That could work, although it seems like that would be cumbersome to parse, while adding a new property each time a new module is added doesn't seem like too much work. I'll defer to you on this one.

Another thing to think about, I currently have the states as booleans

"start_email_completed": {
	"type": "boolean",
	"required": true,
	"description": "If the email module has been completed"
},

should it instead be a string, in a future where we might have modules with multiple states?

More questions:

  1. Do you just care about reading vs editing, or do you want to know if the user was looking at e.g. page info (action=info), or page history (action=history), etc.
  2. referer_namespace: for now this is set based on the assumption that the user has clicked on 1) the homepage tab from their user page or user talk page, or 2) the personal tools link from any page. There is another scenario, where someone has [[Special:Homepage]] on some arbitrary page; do we want to attempt to get the namespace from visits when a user clicks on a link on some random page? I assume we don't need that right now, but if we do please let me know.

In other words, would a key/value approach work better than separate fields?

Do you mean doing something like "module_states": "impact=unactivated;start_tutorial=complete;start_userpage=incomplete"? That could work, although it seems like that would be cumbersome to parse, while adding a new property each time a new module is added doesn't seem like too much work. I'll defer to you on this one.

A string like that is what I imagined, and you are correct that it would be slightly cumbersome to parse. It would also mean that we're doing a lot of substring searches rather than matching on column values. I think having a column for each module as you proposed is a better idea, it makes querying more straightforward and should be reasonably future-friendly, so let's go with that!
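
To make the trade-off concrete, here is a small Python sketch of what consumers of the packed string would have to do before they could filter on a single module. `parse_module_states` is a hypothetical helper; with one column per module this step disappears and the value can be matched directly.

```python
def parse_module_states(packed):
    """Split a "name=state;name=state" string into a dict."""
    return dict(pair.split("=", 1) for pair in packed.split(";") if pair)


# The packed format floated earlier in the thread.
packed = "impact=unactivated;start_tutorial=complete;start_userpage=incomplete"
states = parse_module_states(packed)

# With separate fields, the equivalent lookup is just event["start_tutorial_state"].
tutorial_done = states["start_tutorial"] == "complete"
```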

Another thing to think about, I currently have the states as booleans

"start_email_completed": {
	"type": "boolean",
	"required": true,
	"description": "If the email module has been completed"
},

should it instead be a string, in a future where we might have modules with multiple states?

How about we make them enums? That will allow us to switch from binary to multiple states in the future for any of them, whereas if we keep them boolean we'd have to create a new field with a different name to change the type. With enums, we get value checking for free since the EL code checks whether the value is any of the allowed ones (although it will also mean that we can't change the states without also updating the schema code).
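
As a sketch of the kind of value checking an enum gives, the following Python snippet rejects any state that is not in the allowed list. It is not the EventLogging validation code; `ALLOWED` copies the enum values from the schema in the task description, and `invalid_states` is a hypothetical helper.

```python
# Allowed values per field, as listed in the schema in the task description.
ALLOWED = {
    "impact_module_state": {"activated", "unactivated"},
    "start_tutorial_state": {"complete", "incomplete"},
    "start_userpage_state": {"complete", "incomplete"},
    "start_email_state": {"noemail", "unconfirmed", "confirmed"},
}


def invalid_states(event):
    """Return the fields whose value is not one of the allowed enum values."""
    return {
        field: event[field]
        for field, allowed in ALLOWED.items()
        if field in event and event[field] not in allowed
    }
```

Adding a new state later means adding a value to the relevant set, which mirrors the caveat above: the states cannot change without the schema changing too.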

More questions:

  1. Do you just care about reading vs editing, or do you want to know if the user was looking at e.g. page info (action=info), or page history (action=history), etc.
  2. referer_namespace: for now this is set based on the assumption that the user has clicked on 1) the homepage tab from their user page or user talk page, or 2) the personal tools link from any page. There is another scenario, where someone has [[Special:Homepage]] on some arbitrary page; do we want to attempt to get the namespace from visits when a user clicks on a link on some random page? I assume we don't need that right now, but if we do please let me know.

Regarding reading vs editing: how many of these actions are there? I found this documentation, which lists a whole lot of them, but maybe not all of those are available on our target Wikipedias? I think I'm mainly interested in understanding whether they're reading the page's contents, trying to edit, or "something else". In other words: reading, editing, or "other". Does that make sense?

As for namespaces, our questions focus on what the context was when the user clicks the link in the personal tools, or on one of the tabs. We have a general question about accessing the homepage through other means, but in that case we're interested in the extent to which users do that. I think at this point that I'm happy with not being concerned about namespace in those cases, and if we see a lot of traffic coming in that way we might prioritize writing code to capture that. Does that sound good?

I think I'm mainly interested in understanding whether they're reading the page's contents, trying to edit, or "something else". In other words: reading, editing, or "other". Does that make sense?

Sounds good. We will track "view", "edit" or "other".

As for namespaces, our questions focus on what the context was when the user clicks the link in the personal tools, or on one of the tabs. We have a general question about accessing the homepage through other means, but in that case we're interested in the extent to which users do that. I think at this point that I'm happy with not being concerned about namespace in those cases, and if we see a lot of traffic coming in that way we might prioritize writing code to capture that. Does that sound good?

Sounds good! This is done with the current patch.

I just glanced over this, and I think it looks good. The only questions I have are:

  • @nettrom_WMF -- I think we probably discussed this and I forgot exactly -- we are recording the states of the modules in the schema so that it is easy to answer the question of "what state was the homepage in when the user visited?" Is that right? To which of our research questions is that related?
  • You were talking about how to record the states of modules, and whether they should be binary. Doesn't the email module have three states? You might need to add your email, need to confirm your email, or be complete.
  • The HomepageModule schema also records the states of modules, but does it differently, by saying whether a module is "complete" or "incomplete" in the action data. Is it bad to record the same sort of information in two different ways in two different schemas?
  • @kostajh -- if someone refreshes the web browser on their homepage, what will be recorded?

Doesn't the email module have three states? You might need to add your email, need to confirm your email, or be complete.

I'll let @nettrom_WMF speak to this, but just noting that in the UI we only present an incomplete or complete state to the user, so that's why I chose a binary representation for the schema. We could change this if you prefer though.

if someone refreshes the web browser on their homepage, what will be recorded?

The second event should look identical to the first with the exception of the homepage_pageview_token.

That said, when this was in code review, we noticed some oddities with reloading that I'm not able to reproduce at the moment, so it's worth reviewing and testing this in betalabs and again in testwiki.
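
The expected reload behaviour can be sketched as follows in Python. This is an illustration only: `build_visit_event` is a hypothetical function, and generating the token with `secrets.token_hex` is an assumption, since the task only says the token is random and issued once per page load.

```python
import secrets


def build_visit_event(base_fields):
    """Copy the per-user fields and attach a fresh token for this page load."""
    event = dict(base_fields)
    # Assumption: any sufficiently random string can serve as the token.
    event["homepage_pageview_token"] = secrets.token_hex(10)
    return event


# Hypothetical field values for one user.
base = {"is_mobile": False, "user_id": 12345, "user_editcount": 3}
first_load = build_visit_event(base)
reload = build_visit_event(base)


def without_token(event):
    """Return the event minus its per-load token, for comparison."""
    return {k: v for k, v in event.items() if k != "homepage_pageview_token"}
```

Here `without_token(first_load)` equals `without_token(reload)`, while the two tokens differ, so HomepageModule events can still be tied to the specific page load that produced them.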

@kostajh -- so you're saying that the email module has two states in the sense that the checkmark is either green or not green? From the product design perspective, I definitely consider it to have three states: no email, unconfirmed email, and confirmed email. I think we should distinguish between all three.

And I'm not sure if my other question was for @kostajh or @nettrom_WMF:

The HomepageModule schema also records the states of modules, but does it differently, by saying whether a module is "complete" or "incomplete" in the action data. Is it bad to record the same sort of information in two different ways in two different schemas?

I just glanced over this, and I think it looks good. The only questions I have are:

  • @nettrom_WMF -- I think we probably discussed this and I forgot exactly -- we are recording the states of the modules in the schema so that it is easy to answer the question of "what state was the homepage in when the user visited?" Is that right? To which of our research questions is that related?

RQs 6, 7, and 8. In RQ6, we're asking what modules users engage with, and the state of modules plays into that (e.g. do users interact with the impact module in both its non-activated and activated states?) RQ7 asks if we effectively personalize the home page, which I interpret as asking what the state of the page is when users see it. RQ8 asks about users customizing the homepage. While that can be gathered through user preferences and the PrefUpdate schema, storing it in the HomepageVisit schema enables us to understand what layouts are viewed more often.

  • The HomepageModule schema also records the states of modules, but does it differently, by saying whether a module is "complete" or "incomplete" in the action data. Is it bad to record the same sort of information in two different ways in two different schemas?

I don't think so. Storing it in both places makes it easy for us to do both homepage-based analysis and module-based analysis. I would also expect some of the interactions with the modules to change their state, which would then be reflected in the associated HomepageModule event but not in the HomepageVisit event until the homepage is loaded again.

Thanks, @nettrom_WMF. Those answers help.

Okay so I put this back In Progress so we can record the three different states of the email module.

Schema updated in https://meta.wikimedia.org/w/index.php?title=Schema:HomepageVisit&oldid=18998622, please let me know if that looks good and I'll update the code.

As an aside, we now have two very similar schemas in HomepageVisit and HomepageModule, the differences are:

  1. referer_route (HomepageVisit)
  2. referer_namespace (HomepageVisit)
  3. referer_action (HomepageVisit)
  4. action (HomepageModule)
  5. action_data (HomepageModule)
  6. module (HomepageModule)

If we wanted to use a single schema, we could consolidate referer_action, referer_route and referer_namespace into action_data. For server-side logged visits to Special:Homepage we'd log homepage as the module property.

I'm not sure if this would make things easier for analysis @nettrom_WMF, but putting it out there for consideration.

Change 502340 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[mediawiki/extensions/GrowthExperiments@master] Record multiple states for email module

https://gerrit.wikimedia.org/r/502340

@nettrom_WMF and I decided to keep two separate schemas. I've made the change to record three different states for email module.
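
The three-way mapping can be sketched like this in Python. The function and its two boolean inputs are hypothetical and are not taken from the merged patch; only the three return values come from the schema's `start_email_state` enum.

```python
def email_state(has_email, is_confirmed):
    """Map a user's email status to one of the three schema values."""
    if not has_email:
        return "noemail"
    return "confirmed" if is_confirmed else "unconfirmed"
```

The UI can keep showing a binary complete/incomplete checkmark, for example by treating only "confirmed" as complete, while the logged value distinguishes all three cases.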

@nettrom_WMF please confirm that https://meta.wikimedia.org/w/index.php?title=Schema%3AHomepageVisit&type=revision&diff=18999420&oldid=18984461 is OK with you. Thanks!

Change 502340 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Record multiple states for email module

https://gerrit.wikimedia.org/r/502340

Noting that I've updated P8308 with the current state of the schema, so that's reflected here as well.

I've again updated P8308 with the current state of the schema, to reflect changes to referer_route, referer_namespace, and impact_module_state.