
ERI Metrics: Establish baseline for RC page tool usage prior to beta release
Closed, Resolved · Public

Description

Are people using the new RC page tools? Which ones are they using? Has that increased their use of the page, and presumably their satisfaction with it? The ERI success metrics this task proposes are crude, but should provide some answers to these questions.

It's important we establish a baseline for RC Page tool usage before we release the beta. (Establishing the baseline after beta release would be bad, since many of our most active users will be in the beta.) This exercise will also flush out any issues with the tracking mechanisms we've put in place.

Proposed metrics

  • Tool usage profile: What filters and other tools (e.g., highlighting) do reviewers use most often? Can we come up with a profile of current tool usage?
    • Note that for Highlight tools, it will be important to know which filters users are highlighting. That is, we can't just say the Highlight tool was used X times (though we will want that total too); we'll especially want to know that users were, for example, highlighting Newcomers. What matters is the combination: Highlight + filter.
    • How can the profile numbers best be expressed? I think the best approach would be to express each tool as the % of total tool "settings" it represents (per week or month?). So, if filters were selected 100K times in a week, and Newcomer was selected 5K times, its usage would be 5%. Conceptually, we're talking about a pie chart.
  • Page popularity (sessions): Can we establish a valid metric, and then a baseline figure, for something we might call "page sessions"? (E.g., a period of RC page and other page use that could be said to be terminated if the user does not return to the RC page for 30 minutes.) A rough sketch of this sessionization idea follows this list.
  • Tool engagement: What proportion of "page sessions" use only default settings vs. those that involve tool selections? (If this goes up with the new system, we can conclude the interface has made the tools more accessible.)
  • Session length: Another traditional measure of engagement is length of session, the theory being that if users like the tool they will use it longer.
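
As a rough illustration (not an implementation), the 30-minute "page session" idea could be computed from timestamped RC page hits along the lines below; the event format, field names, and cutoff are assumptions that would need to be checked against what we actually log.

```python
from datetime import timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def count_sessions(events):
    """Count "page sessions" from (user_id, timestamp) pairs.

    `events` is assumed to be an iterable of RC page hits sorted by
    timestamp; a new session starts whenever a user's gap since their
    previous hit exceeds the 30-minute timeout.
    """
    last_seen = {}  # user_id -> timestamp of that user's previous hit
    sessions = 0
    for user_id, ts in events:
        prev = last_seen.get(user_id)
        if prev is None or ts - prev > SESSION_TIMEOUT:
            sessions += 1
        last_seen[user_id] = ts
    return sessions
```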

Questions/Issues

  • For sessions in which users have employed a saved URL that includes non-default filters, the non-default filters should be counted towards the tool usage profile and tool engagement stats. Otherwise we're liable to ignore the habits of our most prolific and savvy users.
  • I'm suggesting "page sessions" instead of page views as a gross measure of page popularity for a number of reasons. E.g., page views might actually go down after beta release if users are finding what they need more easily.
    • What is currently being counted as a "page view" now anyway? E.g., is it every time the page reloads? Or every time a user chooses a new tool? etc.
  • I'm thinking a week or a month might be the relevant period for this type of analysis, to smooth out normal weekly rhythms.
  • While session length is indeed a traditional engagement metric, on the theory that people will spend more time doing things they like, there is some question as to whether it's meaningful here. We just don't know enough about users' habits. E.g., it's possible that users simply have a certain, inelastic amount of time they devote to edit review. Or it's possible they have certain fixed goals they seek to attain; if more efficient tools let them attain those goals more quickly, session time might decrease. So what I'd say is, this would be interesting if we can achieve it without a lot of work.
  • Do we need to produce the baseline figures now for all wikis we will ever want to measure? Or will the data be available indefinitely?

Steps

  • We don't need to build a graphing tool out of the gate. Our goal, I think, is to produce a spreadsheet from which we can extract meaningful conclusions. If we want to automate analysis, we can do that later.

My sense of the best way to proceed is this:

  1. Investigate the issues, determine what is possible and how involved the project will be, then report back.
  2. If required, put in place whatever tools are necessary to get the data we want.
  3. Make a trial run at producing analysis for two of the ORES wikis, one large and one small. Say en.wiki and pl.wiki?
  4. Refine methodology/technology as needed.
  5. Rerun the analysis of the two wikis above.
  6. [If baseline figures will not be available indefinitely (see above), determine a test set and run baselines]

Event Timeline

For "Tool usage profile", we already have data on filter usage going back months, and Stephane's got a patch that adds tracking for highlight usage. Re the percentage idea, remember that the percentages would add up to >100% because one selection can include multiple filters.

Re "Page sessions": yes, we can do that, but 1) it would be some work and 2) we would be required to purge the session ID after 90 days, after that we'd only be allowed to store aggregate statistics derived from that data. "Tool engagement" and "Session length" seem easy to compute from data with session IDs in an automated fashion, so that data could be computed from each day's data before it's purged.

Re "non-default filters": yes, those will be counted. Re "what counts as a page view", I'm not sure, but probably every time you change your filters, both in the old UI (permanently) and the new UI (for now, eventually it won't work that way).

Re data availability: the data I mentioned will be available indefinitely, except for session IDs which will be purged after 90 days. We can store aggregate data based on session IDs, but of course that has to be computed before the session IDs are purged.

Re the percentage idea, remember that the percentages would add up to >100% because one selection can include multiple filters.

I don't understand. The goal here is to find out how often a given tool is used, in a way that's meaningful. What I'd like to know is how often it's used as a % of "searches," but I don't know what a "search" is with our dynamic system. So that's why I suggested as a % of all filter usage. Do you have a good suggestion?

Re "Page sessions": yes, we can do that, but 1) it would be some work and 2) we would be required to purge the session ID after 90 days, after that we'd only be allowed to store aggregate statistics derived from that data.

Please explain. Re having to purge the IDs: does that mean we could only ever go back 90 days? Or would the "store aggregate statistics" part of this get around that limitation? The session concept is a useful one, given the ambiguity I mention above about what constitutes a "search." But maybe there is an easier way to achieve a meaningful result?

"Tool engagement" and "Session length" seem easy to compute from data with session IDs in an automated fashion, so that data could be computed from each day's data before it's purged.

So it sounds like these can be done if we do the "some work" to do Page sessions?

Re the percentage idea, remember that the percentages would add up to >100% because one selection can include multiple filters.

I don't understand. The goal here is to find out how often a given tool is used, in a way that's meaningful. What I'd like to know is how often it's used as a % of "searches," but I don't know what a "search" is with our dynamic system. So that's why I suggested as a % of all filter usage. Do you have a good suggestion?

My point is, let's say we have a history of three searches:

  1. humans, newpages, pageedits, verylikelygoodfaith
  2. humans, pageedits, likelydamaging
  3. newpages, pageedits, categorization, wikidata

Then pageedits was used 3 times; humans and newpages 2 times each; and verylikelygoodfaith, likelydamaging, categorization and wikidata 1 time each. If you convert that to percentages, you'd get 100% for pageedits, 67% for humans, etc. That would add up to more than 100%, which is fine; you just need to realize that that'll happen (and it therefore doesn't lend itself to a pie chart too well).

Alternatively, we could define the "total" as the number of selected filters across all searches (11 in this case), and then the percentages would add up to 100%, but having "humans" be 18% is not as meaningful IMO. (Of course, we can always compute both of these figures independently from the same data.)
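
For illustration, here is a rough sketch of computing both figures from the three searches above (the filter names are from the example; everything else, including the script itself, is just made up to show the arithmetic):

```python
from collections import Counter

# The three example searches above, each a set of selected filters.
searches = [
    {"humans", "newpages", "pageedits", "verylikelygoodfaith"},
    {"humans", "pageedits", "likelydamaging"},
    {"newpages", "pageedits", "categorization", "wikidata"},
]

counts = Counter(f for search in searches for f in search)
total_selections = sum(counts.values())  # 11 selected filters overall

for name, n in counts.most_common():
    pct_of_searches = 100 * n / len(searches)       # can add up to >100%
    pct_of_selections = 100 * n / total_selections  # adds up to exactly 100%
    print(f"{name}: {pct_of_searches:.0f}% of searches, "
          f"{pct_of_selections:.0f}% of all filter selections")
```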

Re "Page sessions": yes, we can do that, but 1) it would be some work and 2) we would be required to purge the session ID after 90 days, after that we'd only be allowed to store aggregate statistics derived from that data.

Please explain. Re having to purge the IDs: does that mean we could only ever go back 90 days? Or would the "store aggregate statistics" part of this get around that limitation? The session concept is a useful one, given the ambiguity I mention above about what constitutes a "search." But maybe there is an easier way to achieve a meaningful result?

Aggregating gets around part of this limitation. What happens is, we have entries like:

  1. Session 123 clicked on rollback
  2. Session 123 clicked on edit
  3. Session 123 clicked on rollback
  4. Session 128 clicked on undo

And after 90 days those turn from "session N clicked on X" to "someone clicked on X", and we won't know which entries came from the same session any more (so after 90 days, we'd lose the information that #1, #2 and #3 came from the same session but #4 from a different one). We can get around this by computing aggregate data like "there was a session that clicked rollback twice and edit once", or "on this day there were 57 sessions that clicked rollback more than once". But we lose the ability to compute new aggregates after the IDs have been removed. Does that make sense?
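
Roughly, the per-day aggregation we would need to run before the session IDs are purged could look like the sketch below (the record layout and counts here are invented for illustration, not our actual schema):

```python
from collections import Counter, defaultdict

# Invented example records, mirroring the four entries listed above:
# (session_id, action) pairs from a single day's log.
records = [
    (123, "rollback"),
    (123, "edit"),
    (123, "rollback"),
    (128, "undo"),
]

# Per-session action counts: this has to be computed while the session
# IDs still exist.
per_session = defaultdict(Counter)
for session_id, action in records:
    per_session[session_id][action] += 1

# An aggregate we are allowed to keep after the IDs are purged, e.g.
# "how many sessions clicked rollback more than once today".
sessions_with_repeat_rollback = sum(
    1 for actions in per_session.values() if actions["rollback"] > 1
)
print(sessions_with_repeat_rollback)  # -> 1
```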

"Tool engagement" and "Session length" seem easy to compute from data with session IDs in an automated fashion, so that data could be computed from each day's data before it's purged.

So it sounds like these can be done if we do the "some work" to do Page sessions?

Yes. Which, as @Mooeypoo just pointed out, is actually a bit trickier than I thought, because I know how to do session IDs on the client side but I don't know what, if anything, is the recommended way to get them server-side (the same technique is not available there).

@Catrope explains:
... we lose the ability to compute new aggregates after the IDs have been removed. Does that make sense?

I want to be sure I understand the import of what you're saying here. It sounds like we could still go back to the "aggregate data" and get the metrics in the Description: 1) the number of sessions in a period, 2) the avg length of session in the period, 3) "tool engagement" (sessions with explicit settings vs. without). Yes?

What types of analyses would NOT be supported? E.g., can we store all the tools used in particular sessions, and then go back and see whether the tool-usage profile was different for short vs. long sessions?

Let's make a plan to do this. The process would be something like the following:

Test subjects

  • Define a group of beta users across, say, 3 ORES wikis and 3 non-ORES wikis?
  • Compare their RC page usage before and after the beta over a period of, say, 2 weeks before and 2 weeks after?
  • We might, it occurs to me, look for "regular" RC page users by dropping those who did not visit the RC page at least twice during the sample weeks (see the sketch below).
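
As a rough sketch only (the input format and names are hypothetical), the "regular user" filter might look like this:

```python
from collections import Counter

def regular_users(visits, min_visits=2):
    """Return users who visited the RC page at least `min_visits` times.

    `visits` is assumed to be an iterable of (user_id, timestamp) pairs
    covering the sample weeks.
    """
    counts = Counter(user_id for user_id, _ in visits)
    return {user for user, n in counts.items() if n >= min_visits}
```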

Results we want (see further definitions in Description)

  • Tool usage profile: What filters and other tools (e.g., highlighting) do reviewers use most often?
  • Tool engagement: What proportion of "page sessions" use only default settings vs. those that involve tool selections?
  • Session length: the theory being that if users like the tool they will use it longer.

@Catrope

  • Who could best take on an assignment like this?
  • When can we do it? Is the week of the 15th too soon?
  • How much time do you think it would take to do a project like this, assuming the deliverable is just a spreadsheet? A week?
  • Is everything here possible?