
Understanding first day: define questions, metrics, and data
Closed, Resolved · Public


In this task, we will define questions that the instrumentation data needs to be able to help us answer.

We will determine the concrete metrics to be derived from the instrumentation, and define the concrete data that the instrumentation will need to generate for this purpose.

We will reality-check the instrumentation definition to ensure that the client's event model can actually generate the data.


These are the questions that this effort is trying to answer.

  • What are the most common workflows that Czech and Korean new account holders go through during their first 24 hours?
  • What percent of those new account holders go through each of those workflows?
  • Which workflows do and do not tend to lead to edits?

To do this, we will develop a new “EditorJourney” EventLogging schema that records all page views of new account holders in Czech and Korean Wikipedias during the 24 hours after they create their accounts. This will record URLs along with User IDs and timestamps. Using User ID, we will connect this with a set of existing EventLogging schemas to compose the full journey of the new editor.
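As a rough illustration of how records from several schemas could be stitched together by User ID, here is a minimal Python sketch. The `Event` record, its field names, and the `compose_journeys` helper are assumptions made for this example; they are not the actual EditorJourney schema or any existing tooling.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Event:
    """Hypothetical, simplified event record (not the real schema)."""
    user_id: int
    timestamp: float  # seconds since the Unix epoch
    source: str       # name of the schema the event came from
    detail: str       # e.g. a URL or an action name


def compose_journeys(*streams: Iterable[Event]) -> dict[int, list[Event]]:
    """Merge events from several schemas into one timeline per user."""
    journeys: dict[int, list[Event]] = defaultdict(list)
    for stream in streams:
        for event in stream:
            journeys[event.user_id].append(event)
    # Order each user's events chronologically so the journey reads in sequence.
    for events in journeys.values():
        events.sort(key=lambda e: e.timestamp)
    return dict(journeys)
```

Calling `compose_journeys(pageviews, editor_events)` with two lists of `Event` objects would return one chronologically ordered list per user ID, regardless of which schema each event came from.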

Overall rules

  • For all new accounts created in Czech and Korean Wikipedias that are not auto-created, we want to capture User ID along with the information below for the first 24 hours of the account’s existence.
  • For all measurements, we will want to:
    • Distinguish by wiki.
    • Analyze how much time elapses between each action.
    • Know the context from which the account was created.
    • Know whether each action was taken from desktop or mobile. We should also capture user-agent in order to parse additional details of how the user accessed the page.
  • Our implementation will record an event for each pageview. For most namespaces, we will capture the exact page title/ID. For the namespaces listed below, we will capture only the namespace, plus a hash of the page that lets us see whether users repeatedly visit the same page without revealing which page it is:
    • Article (0)
    • Article talk (1)
    • File (6)
    • File talk (7)
    • Portal (100)
    • Portal talk (101)
    • Draft (118)
    • Draft talk (119)
    • Exception: when the user is on Special:Search, we also want to drop the search query parameters, and we need to take care to scrub data for the namespaces listed above when Special:Search responds with a 302 redirect.
  • Data should be aggregated, anonymized, or deleted after 90 days.
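The rules above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the function names, the field names `title` and `title_hash`, and the choice of a salted HMAC-SHA256 as the hash are all hypothetical and are not taken from the actual implementation.

```python
import hashlib
import hmac

# Namespaces from the list above: page titles are hashed rather than stored.
HASHED_NAMESPACES = {0, 1, 6, 7, 100, 101, 118, 119}
FIRST_DAY_SECONDS = 24 * 60 * 60


def in_first_day(registered_at: float, event_at: float) -> bool:
    """True if the event falls within 24 hours after account creation."""
    return 0 <= event_at - registered_at <= FIRST_DAY_SECONDS


def page_fields(namespace: int, title: str, salt: bytes) -> dict:
    """Return the page fields to log for one pageview (hypothetical names)."""
    if namespace in HASHED_NAMESPACES:
        # Assumption: a salted HMAC, so equal titles give equal hashes
        # while the title itself is not stored.
        digest = hmac.new(salt, title.encode("utf-8"), hashlib.sha256).hexdigest()
        return {"namespace": namespace, "title_hash": digest}
    return {"namespace": namespace, "title": title}


def gaps_between_actions(timestamps: list[float]) -> list[float]:
    """Seconds elapsed between each pair of consecutive actions."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]
```

With this shape, repeat visits to the same article produce the same `title_hash`, so return visits remain countable even though the title is never recorded.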

Specific rules

The outline below lays out the specific questions we want to answer with this instrumentation effort. Most of these questions will be answered with the new EventLogging schema being built for this project. Some questions, especially those under #4 ("After the editing experience begins") will be answered by connecting with existing EventLogging schemas built to measure existing features.

  1. How often do accounts get created from the different possible account creation contexts?
    • Homepage
    • Reading experience (sub-divided by namespace)
    • Editing experience (sub-divided by namespace)
  2. After account creation, when shown the “Personalized first day” survey, do users respond to one or more questions in the survey, or skip the survey altogether and go back to what they were doing prior to account registration?
  3. After the “Personalized first day” survey, what are the common workflows that new account holders go through before making an edit, or before leaving without ever making one? We want to count the frequency of workflows such as, but not limited to, the following. The list is open-ended because we don’t yet know which workflows we will discover.
    • Reading many pages in the Article namespace and then either leaving or editing.
    • Consuming some sort of nurturing/learning content and then either leaving or editing. This content is found in namespaces other than Article namespace, or through certain actions that may not be captured in page views or existing schemas:
      • Clicking on a link in a welcome message on their own user talk page
      • Clicking on external help links, such as the Outreach Dashboard.
      • Opening and reading notifications
      • Opening and updating account settings / preferences
      • Verifying or adding/updating email address.
    • Going straight to editing
      • This can either happen because the account was created from the editing experience, or the reader opened the editor from the reading experience soon after account creation.
      • Is this the creation of a new page?
      • When opening the editor, some wikis display a GuidedTour or GettingStarted. Did the user click on anything in GuidedTour or GettingStarted?
    • Any combination of the above, such as a workflow in which users read some articles, then read a help page, then start and abandon an edit, and then make a successful edit.
  4. After the editing experience begins, what percent of users successfully save an edit? And for those who abort their edits, what do they do in the editor before aborting?
    • On what page is the attempted edit happening?
    • How often do users quickly exit the editor without actually interacting with the page?
    • How often do users do a substantial amount of interaction with the page before aborting?
    • How often do users switch the type of editor:
      • In Czech, there is a tab for “Edit” and a tab for “Edit source”. Which one do users click, and then do they switch to the other?
      • In Korean, there is one tab for “Edit”, which opens Visual Editor. Do users then switch to the other editor?
    • If the edit was saved, how many bytes changed in the edit?
    • Was the resulting edit reverted or thanked?
  5. After saving or aborting an edit, what happens next? (Return to question 3.)
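To make the workflow questions above more concrete, here is a minimal Python sketch of how workflow frequencies and edit rates might be tallied once each user's journey has been reduced to a sequence of action labels. The labels (`"read"`, `"help"`, `"edit_saved"`) and both function names are hypothetical, and collapsing consecutive repeats into one step is just one possible way to define a workflow.

```python
from collections import Counter
from itertools import groupby


def split_at_first_edit(actions, edit_label="edit_saved"):
    """Return (workflow before the first saved edit, whether an edit was saved)."""
    edited = edit_label in actions
    before = actions[: actions.index(edit_label)] if edited else actions
    # Collapse runs of the same action, so read, read, read becomes one step.
    signature = tuple(label for label, _ in groupby(before))
    return signature, edited


def summarize_workflows(journeys, edit_label="edit_saved"):
    """Count each workflow, its share of users, and how often it led to an edit."""
    seen = Counter()
    led_to_edit = Counter()
    for actions in journeys.values():
        signature, edited = split_at_first_edit(actions, edit_label)
        seen[signature] += 1
        if edited:
            led_to_edit[signature] += 1
    total = sum(seen.values())
    return [
        {
            "workflow": signature,
            "users": count,
            "share": count / total,
            "edit_rate": led_to_edit[signature] / count,
        }
        for signature, count in seen.most_common()
    ]
```

Here `journeys` maps a user ID to that user's ordered list of action labels. The result is sorted by how many users followed each workflow, which lines up with the questions about the most common workflows and which of them tend to lead to edits.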

Event Timeline

This task now contains the questions, metrics, and data we plan to gather as of 2018-10-24. This plan served as the requirements for the technical details in T205763. Our plans may evolve, and this task may not always be kept up to date. See this MediaWiki page for important updates to these plans.