
Store changes made to on-wiki GrowthExperiments configuration somewhere
Closed, ResolvedPublic

Description

Background

By implementing T274520: Move Growth configuration to on-wiki JSON file, we will transfer some control over whether a feature is enabled or not to the communities. Right now, when we change a feature's state on a wiki, we can note that internally and use this information when interpreting changes in report data. Once communities are able to turn features on or off, we will no longer be able to record this ourselves, so we need a process to capture when a feature was turned on or off, so that we can account for it when analyzing.

Requirements

We need to be able to know when a feature was enabled or disabled, or when other GrowthExperiments configuration was changed. This task aims to decide how we want to store this information, as well as to implement the chosen solution.

Possible solutions
  1. Create a new EventLogging schema, similar to PrefUpdate, but for wikis
  2. Store current configuration state in all of our current eventlogging schemas
  3. Use native MediaWiki history and the JSON blob itself.
Analysis of option 1

This requires a new schema to be created, reviewed, and deployed to the analytics systems. On the other hand, it stores less data than option 2, and makes it easy to create a derived dataset similar to what option 2 would provide.

Analysis of option 2

This stores a lot of redundant data: every event would carry the current configuration state, even though configuration changes will probably be infrequent.

Analysis of option 3

This is the only option that wasn't discussed previously. The configuration will live in an on-wiki JSON file, which will have a form so communities can make edits easily. That means MediaWiki history will already provide us with a lot of information (see https://cs.wikipedia.org/w/index.php?title=MediaWiki:NewcomerTasks.json&action=history for an example of how it could work). In my (@Urbanecm_WMF) opinion, it isn't hard to fetch one of the old JSON files to get the data as needed, or to derive the datasets that would be provided by options 1 or 2, using only MediaWiki history.

This does not require any additional work, and stores no additional information. On the other hand, the data will not live directly inside the analytics systems, and it might be harder to work with.
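To make option 3 concrete, the sketch below shows how "when was the configuration changed, and by whom" could be reconstructed from MediaWiki page history alone. The response shape follows the Action API's query/revisions output (formatversion=2), but the sample data is made up for illustration:

```python
def revision_timeline(api_response):
    """Return (timestamp, user) pairs, one per configuration edit,
    from a decoded MediaWiki Action API query/revisions response."""
    page = api_response["query"]["pages"][0]
    return [(r["timestamp"], r["user"]) for r in page.get("revisions", [])]

# Sample response shaped like the API output for the page linked
# above; timestamps and usernames here are invented.
sample = {
    "query": {
        "pages": [{
            "title": "MediaWiki:NewcomerTasks.json",
            "revisions": [
                {"timestamp": "2021-01-10T09:00:00Z", "user": "ExampleAdmin"},
                {"timestamp": "2021-02-02T14:30:00Z", "user": "ExampleAdmin"},
            ],
        }]
    }
}

print(revision_timeline(sample))
```

In practice the response would come from an HTTP request with `action=query&prop=revisions&rvprop=timestamp|user&rvlimit=max`; the parsing step is separated out here so it can be reused per wiki.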

@nettrom_WMF I would appreciate your opinions on whether this would work for you.

Event Timeline

From the analytics perspective, I think there are three things we need to consider:

  1. What's the basis of our analyses?
  2. How often do we run them?
  3. Do they involve complex processing in order to understand conditions?

Regarding the first point, the larger pieces of analysis that we've done so far are done on a per-wiki basis, with random treatment group assignment on a per-user basis (within each wiki), run for some specific period of time. We've also required wikis to meet certain requirements (e.g. translations of messages, definition of task categories) in order to get our features. Given the scaling work we're doing, I think it's going to be impossible to require wikis to meet certain requirements to be part of experiments (perhaps with the exception of five main wikis?). We'll instead have to either gather data to understand the conditions and control for those in our analyses, or see if there are ways to change our unit of analysis (for example, we can treat "number of edits made" as a second-order result and instead focus on users completing a recommended task if they clicked on it).

We currently don't run analyses very often. This means that having access to configuration data in the Data Lake isn't critical, and taking the time to write some Python to grab and parse JSON off wikis won't slow us down (it should also be a "write once, reuse" process, and the number of wikis is known and generally a low N). I also suspect that configuration changes won't happen very often, and lastly that the configuration itself isn't going to be complex. In other words, translating the configuration settings to some kind of "table structure" that we can join with other data is going to be relatively easy.
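The "write once, reuse" translation step could look something like the sketch below, which flattens one revision of the JSON blob into rows that join against other datasets. The configuration keys shown are invented placeholders, not the real schema from T274520:

```python
import json

def config_to_rows(wiki, timestamp, config_json):
    """Flatten one revision of an on-wiki JSON configuration blob into
    (wiki, timestamp, setting, value) rows for joining with other data."""
    config = json.loads(config_json)
    return [(wiki, timestamp, key, value) for key, value in sorted(config.items())]

# Hypothetical blob; the real keys depend on the schema chosen in T274520.
blob = '{"GEHelpPanelEnabled": true, "GEMentorshipEnabled": false}'
print(config_to_rows("cswiki", "2021-02-02T14:30:00Z", blob))
```

Running this once per revision per wiki would yield the kind of "table structure" described above, keyed by wiki and timestamp.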

The third point is where my main concern is. I think one example here is the mentorship module, which as far as I know isn't shown to users unless there are mentors in the mentor list. This means that the "mentor module available" condition requires combining the JSON configuration and the mentor list to figure out. At the moment, I don't think we have many of these "system-derived configurations", but it's something to keep in mind together with the first point on the list, because if we get more of them we might switch our basis of analysis towards "what did users who saw intervention X do?" In the case of the mentorship module, that might mean that we rely on the HomepageModule data and analyze behavior of users who actually saw the mentorship module, taking away the need for understanding the configuration altogether. Or we might find that we then need to log configuration change events so we can keep track of them.
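The mentorship example is the kind of "system-derived" condition that needs more than the JSON blob. A minimal illustration of why (the field and list names are assumptions for this sketch, not the actual GrowthExperiments code):

```python
def mentorship_module_available(config, mentor_list):
    """The module is only shown when it is both enabled in the on-wiki
    configuration AND at least one mentor is signed up, so neither
    data source alone is enough to determine the condition."""
    return bool(config.get("GEMentorshipEnabled")) and len(mentor_list) > 0

print(mentorship_module_available({"GEMentorshipEnabled": True}, []))         # enabled, but no mentors
print(mentorship_module_available({"GEMentorshipEnabled": True}, ["Alice"]))  # enabled with a mentor
```

As more of these derived conditions appear, each one adds another join like this to the analysis, which is the concern raised above.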

In summary: I think we will currently be fine with grabbing the on-wiki JSON configuration when running analyses. If the conditions or our needs change, we can revisit this decision, because, as mentioned above, what we need can depend on how we do our analyses.

I'm tagging @MMiller_WMF as well, so he can review and follow up if needed.

Thanks for your comments, @nettrom_WMF. Closing, going with the on-wiki JSON blob only for now.