Wed, Oct 16
I grabbed data from the mediawiki_history table in the Data Lake for users who registered between Sept 2018 and Sept 2019 on our four target wikis (Czech, Korean, Vietnamese, and Arabic). This allows for 14 days of editing after registration, as well as 48 hours for a revert to occur. Auto-created accounts and accounts created by others were excluded, as were any accounts identified as bots. For specifics, the notebook is on GitHub.
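For reference, the cohort selection is roughly along these lines (a sketch only; column names follow the public mediawiki_history schema, the snapshot value is a placeholder, and the notebook on GitHub is the authoritative version):

```
# Sketch of the cohort query; treat column names and the snapshot as assumptions.
cohort_query = """
SELECT wiki_db, event_user_id, event_timestamp AS registration_timestamp
FROM wmf.mediawiki_history
WHERE snapshot = '2019-09'
  AND wiki_db IN ('cswiki', 'kowiki', 'viwiki', 'arwiki')
  AND event_entity = 'user'
  AND event_type = 'create'
  AND event_timestamp >= '2018-09-01' AND event_timestamp < '2019-09-01'
  -- exclude auto-created accounts and accounts created by someone else
  AND event_user_is_created_by_self = TRUE
  AND NOT event_user_is_created_by_system
  AND NOT event_user_is_created_by_peer
  -- exclude anything flagged as a bot
  AND (event_user_is_bot_by IS NULL OR SIZE(event_user_is_bot_by) = 0)
"""
```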
Tue, Oct 15
@mforns : thanks for taking care of this! I've verified that the table doesn't contain any data prior to Oct 1st. Everything looks good here, so I'm closing this.
This work is done.
Mon, Oct 14
@Aklapper : thanks for the ping on this! This task should still be open as we've got a couple of related tasks open; once those are resolved, I'll make sure to close this task as well. I've removed the deadline date since that's no longer relevant.
Thu, Oct 10
@fdans : Can confirm that the range to be deleted is the beginning of time (which is like April 2019) up to Oct 1. And yes, all fields are to be deleted. Thanks!
Tue, Oct 8
The second part of this, deleting the initial set of data, has been done:
This makes no sense as we'd probably just ask to whitelist it again in a month's time. Declining.
Mon, Oct 7
Ran the following query to back up the data:
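(The query itself isn't reproduced in this log. Purely as an illustration, backing up a Hive table before a deletion generally looks something like the sketch below; both table names and the partition column are placeholders, not the real ones.)

```
# Hypothetical sketch only; the actual backup query is not shown here.
backup_query = """
CREATE TABLE backup_db.events_backup_pre_oct1 AS
SELECT *
FROM source_db.events
WHERE dt < '2019-10-01'
"""
```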
One aspect that has come up during the work on the Newcomer Tasks measurement plan is how we handle A/B testing variants of the intervention for existing users. We'll discuss that within the team and see what we end up with.
Thu, Oct 3
I used the current dataset from EditorJourney (last 90 days). For each of the three wikis where the Welcome Survey is deployed, I counted the number of views of the survey (rather than registrations, because users can go back to the survey and access the links), as well as the number of views of the Tutorial and Help Desk pages. For the latter two, I only counted views that have source=survey in the request, to identify views that originated from the Welcome Survey. I counted the Tutorial and Help Desk links separately, and also split between views on Desktop and Mobile as requested. Finally, I calculated the percentage of clicks on these links relative to the number of views of the Welcome Survey. The results are as follows:
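As a rough illustration of that calculation (made-up data and assumed column names, not the real EditorJourney schema):

```
import pandas as pd

# Toy event-level data standing in for the EditorJourney pageviews;
# column names and values are illustrative assumptions.
views = pd.DataFrame([
    ("cswiki", "desktop", "welcome_survey", None),
    ("cswiki", "desktop", "tutorial", "survey"),
    ("cswiki", "mobile", "welcome_survey", None),
    ("cswiki", "mobile", "help_desk", "survey"),
    ("kowiki", "desktop", "welcome_survey", None),
    ("kowiki", "desktop", "help_desk", "survey"),
], columns=["wiki", "platform", "page", "source"])

# Denominator: views of the Welcome Survey itself, per wiki and platform.
survey_views = (views[views["page"] == "welcome_survey"]
                .groupby(["wiki", "platform"]).size()
                .rename("survey_views"))

# Numerators: Tutorial / Help Desk views that originated from the survey
# (source=survey in the request), counted separately per link.
link_views = (views[views["page"].isin(["tutorial", "help_desk"]) &
                    (views["source"] == "survey")]
              .groupby(["wiki", "platform", "page"]).size()
              .rename("link_views"))

# Click-through percentage relative to Welcome Survey views.
ctr = (link_views.reset_index()
       .merge(survey_views.reset_index(), on=["wiki", "platform"]))
ctr["pct_of_survey_views"] = 100 * ctr["link_views"] / ctr["survey_views"]
print(ctr)
```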
Wed, Oct 2
Moving this to the Icebox on the Product Analytics board. Not sure when/if this is going to be an issue (we're currently nowhere near danger territory, with 20–30 events/sec being the high end of the dataset). Adding @MMiller_WMF to the subscriber list as well, as I figured he should know if something changes here.
Tue, Oct 1
@Tgr : likewise, thanks for letting me know about these changes. I don't have any current analysis or reporting that will be affected, so we're good there as well. (My main concern was Marshall's reports, but since he already checked those, all is well.)
Mon, Sep 30
Draft measurement plan has been created, reassigning to @MMiller_WMF for reviews and revisions.
Sat, Sep 28
Fri, Sep 27
@Mayakp.wiki has led the QA work on this, thanks for helping out!
Verified that the obfuscated namespaces are indeed obfuscated, and that the others are not. We don't have any data older than 24 hours. Also spot-checked the dataset and didn't see anything of concern. Closing as resolved.
The draft measurement plan is being worked on, so I've claimed this task, updated tags, and moved it to the right columns.
Wed, Sep 25
Tue, Sep 24
As in the previous comments (which I've deleted because they used erroneous data), I decided to reuse the code and graphs from our analysis around emails (T204785), where we calculated proportions of registrations. In order to capture any changes around the deployment of the Homepage to Czech and Korean Wikipedias in May, the data gathering starts on 2019-01-01. Auto-created accounts are excluded from the analysis.
Investigating this further revealed an error in the underlying query that gathered the data. On 2019-06-03, a switch was made in how the logging table stores the user ID for an account creation: instead of storing it in the log_user column, it is now stored indirectly through a reference (log_actor) to the actor table. Updating the query to follow this new convention results in graphs that show no particular change around this period. I'll follow up in the parent task with updated graphs and analysis.
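For reference, the updated lookup follows the standard MediaWiki logging/actor join, roughly like this (a sketch; the conditions in the real query may differ):

```
# Sketch of resolving the account-creation user ID via the actor table.
accounts_query = """
SELECT actor.actor_user AS user_id,
       logging.log_timestamp
FROM logging
JOIN actor ON logging.log_actor = actor.actor_id
WHERE logging.log_type = 'newusers'
-- before 2019-06-03 the user ID could be read directly from log_user;
-- afterwards it has to be resolved through log_actor -> actor_user
"""
```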
Here are updated versions of the graphs going back to Jan 1, so we get a few months of data before the initial release of the Homepage to Czech and Korean. As far as I can tell, there's no indication that the release of the Homepage (in May on Czech and Korean, and in July on Vietnamese and Arabic) has had a clear impact on email verification. That being said, two caveats. First, this data reflects the current state of all users' email addresses; we could restrict it to those who provided one within a certain amount of time after registration to see if time plays a role. Second, the impact of the Homepage might be detectable, but it might also be small.
Sep 20 2019
[comment deleted due to graphs using erroneous data, see updated comment below]
Sep 18 2019
We have a set of slides for sharing this with stakeholders, reassigning to @MMiller_WMF for review.
@ifried or @aezell : not sure which one of you to contact, so I'm pinging you both, sorry! It would be great if someone on the team could review this. If the team has any EventLogging schemas that are whitelisted, and that whitelisting retains a token, a patch to the whitelist should be created to hash the related tokens.
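As a rough illustration of what "hashing a token" means here (this is not the actual whitelist mechanism; the salt and hash function are just assumptions for the example):

```
import hashlib

# Illustration only: a salted hash means the raw token can't be recovered,
# while identical tokens still map to the same value within a salt period.
def hash_token(token: str, salt: str) -> str:
    return hashlib.sha256((salt + token).encode("utf-8")).hexdigest()

print(hash_token("example-session-token", salt="2019-10"))
```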
Schema:AutoblockIpBlock is not whitelisted, so if that is the only schema that this team is connected to, then no hashing is needed. I'm closing this as resolved, please reopen if necessary.
@MMiller_WMF : given that we've had the Year in Review, can we close this task?
As far as I'm concerned my part of this is now done, so reassigning to @MMiller_WMF for review.
I forgot the second part of this analysis. In this case, I'm using data from the final deployment (26 July) and 4 weeks onwards. All registrations are counted, and the visit_mobile flag below shows whether the first visit was on the mobile (True) or desktop (False) site. Again, visits have to occur within 48 hours of registration. Percentages are calculated within each group (wiki and desktop/mobile).
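As a rough illustration of the within-group percentage calculation (made-up numbers and assumed column names):

```
import pandas as pd

# Toy per-registration data; visit_mobile is True/False for the platform of the
# first Homepage visit within 48h of registration, None if there was no visit.
regs = pd.DataFrame([
    ("cswiki", True), ("cswiki", False), ("cswiki", None),
    ("kowiki", True), ("kowiki", True), ("kowiki", None),
], columns=["wiki", "visit_mobile"])

# Count registrations per wiki and visit_mobile value, then turn the counts
# into percentages within each group so the groups can be compared directly.
counts = (regs.groupby(["wiki", "visit_mobile"], dropna=False)
              .size().rename("n").reset_index())
counts["pct"] = 100 * counts["n"] / counts.groupby("wiki")["n"].transform("sum")
print(counts)
```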
@MMiller_WMF : thanks for asking about why the numbers are different. I went back and had another look at the leading indicators, and then found that I'd forgotten to exclude the control group from the denominator in the current analysis. I'll update the numbers in a minute (spoiler alert: we're still above 100% increase on both wikis).
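To illustrate the correction with toy numbers (made-up values only):

```
# The control group must be excluded from the denominator when computing the
# visit rate for the treatment (Homepage) group.
visitors = 400          # treatment-group registrations that visited
treatment_total = 1000  # registrations in the treatment group
control_total = 1000    # registrations in the control group

wrong = 100 * visitors / (treatment_total + control_total)  # 20.0 -- too low
right = 100 * visitors / treatment_total                     # 40.0
print(wrong, right)
```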
Sep 17 2019
I grabbed data from Czech and Korean Wikipedias for June, July, and August. Excluded from the analysis are auto-created accounts, known test accounts, and users who turned the Homepage on or off in their preferences. From this, I calculated the daily proportion of registrations that visited the Homepage within 48 hours of registration, and excluded the last two days of registrations so that every account has had the full 48-hour window.
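The calculation is roughly as sketched below (made-up data and assumed column names; the actual notebook is authoritative):

```
import pandas as pd

# Toy per-registration data; column names are illustrative assumptions.
dates = ["2019-06-01", "2019-06-02", "2019-06-03", "2019-06-04"]
regs = pd.DataFrame({
    "wiki": ["cswiki"] * 4 + ["kowiki"] * 4,
    "reg_date": pd.to_datetime(dates * 2),
    "visited_homepage_48h": [True, False, True, True, False, True, False, True],
})

# Drop the last two days of registrations so every account has had the full
# 48-hour window, then compute the daily proportion that visited the Homepage.
cutoff = regs["reg_date"].max() - pd.Timedelta(days=2)
daily = (regs[regs["reg_date"] <= cutoff]
         .groupby(["wiki", "reg_date"])["visited_homepage_48h"]
         .mean().rename("prop_visited").reset_index())
print(daily)
```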
@MMiller_WMF : while I'm unsure how many users actually read these, I'm wondering if we're sure that this won't affect their motivation to respond? Partly because it's not clear how we're using these answers to improve their experience, and partly because we might be benefiting from a generally positive view of Wikipedia when it comes to willingness to respond.
Sep 13 2019
Here's another update: I looked into matching IPs against blocks, and I can do this across the entire dataset in the Data Lake instead of using the replicated databases. This means that I should be able to identify the proportion of IP edits where the IP was subsequently blocked, for all wikis that we have data for. I'll continue working on that next week.
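The rough idea is sketched below (not the final query; column names, event_type values, and whether IP blocks are represented this way in mediawiki_history are assumptions I still need to verify):

```
# Sketch only: join anonymous (IP) edits against later block events,
# both read from wmf.mediawiki_history.
ip_block_query = """
WITH ip_edits AS (
  SELECT wiki_db, event_user_text AS ip, event_timestamp AS edit_ts
  FROM wmf.mediawiki_history
  WHERE snapshot = '2019-08'
    AND event_entity = 'revision'
    AND event_user_is_anonymous = TRUE
),
blocks AS (
  SELECT wiki_db, user_text AS ip, event_timestamp AS block_ts
  FROM wmf.mediawiki_history
  WHERE snapshot = '2019-08'
    AND event_entity = 'user'
    AND event_type = 'alterblocks'
)
SELECT wiki_db,
       COUNT(*) AS ip_edits,
       SUM(later_blocked) AS ip_edits_later_blocked
FROM (
  SELECT e.wiki_db, e.ip, e.edit_ts,
         MAX(IF(b.ip IS NOT NULL, 1, 0)) AS later_blocked
  FROM ip_edits e
  LEFT JOIN blocks b
    ON e.wiki_db = b.wiki_db AND e.ip = b.ip AND b.block_ts >= e.edit_ts
  GROUP BY e.wiki_db, e.ip, e.edit_ts
) flagged
GROUP BY wiki_db
"""
```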
I think what we do here depends on whether we see Newcomer Tasks as a new experiment or a continuation of the existing Homepage experiment. If we see it as a continuation, then I'm primarily concerned about the users who are currently in the Homepage treatment group, and I'd then turn on the Help Panel for all of those so they can get the new features.
Sep 12 2019
@Niharika : I've updated the page on meta with stats split up by project group (wikisource, wikibooks, etc.), and by language within each group. In addition to monthly averages for the number of IP edits, I've also added the minimum and maximum monthly percentage of IP contributions across the 12 months in the dataset, so it's possible to see to what degree the proportion of IP edits varies over a year. I think that should be sufficient to answer the first question in this task.
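For reference, the aggregation is roughly along these lines (a sketch with made-up numbers and assumed column names, not the actual code):

```
import pandas as pd

# Toy monthly counts; real numbers come from the Data Lake.
monthly = pd.DataFrame({
    "project_group": ["wikisource"] * 3 + ["wikibooks"] * 3,
    "language": ["fr"] * 3 + ["en"] * 3,
    "month": ["2018-09", "2018-10", "2018-11"] * 2,
    "ip_edits": [120, 90, 150, 40, 60, 30],
    "total_edits": [1000, 900, 1100, 500, 550, 450],
})
monthly["pct_ip"] = 100 * monthly["ip_edits"] / monthly["total_edits"]

# Per project group and language: monthly average of IP edits, plus the
# min and max monthly percentage of IP contributions across the dataset.
summary = (monthly.groupby(["project_group", "language"])
                  .agg(avg_monthly_ip_edits=("ip_edits", "mean"),
                       min_pct_ip=("pct_ip", "min"),
                       max_pct_ip=("pct_ip", "max"))
                  .reset_index())
print(summary)
```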
Sep 11 2019
@MMiller_WMF : given the low usage and that we're only removing part of the Help Panel feature (the question-asking ability) rather than the entire panel, I don't see this having a major impact on our Homepage experiment, so it can happen whenever the engineers feel like picking it up.
@MMiller_WMF : I don't see this interfering with anything that's currently going on, no. We don't have any experiments relying on those answers, and we've got a couple of months of data from Arabic at this point, which should give us a stable picture of the responses there. Feel free to have an engineer pick it up.
Sep 6 2019
@Niharika : I've started working on the first part of this, and am wondering if there's a particular place or format you'd want for the results? Maybe there's a wiki page somewhere that could contain the tables as wiki-tables, so it's easy to refer to in discussions?
@JAllemandou : Thanks for the quick turnaround, the excellent explanation, and for updating the documentation, very much appreciated!