I've updated the measurement specification to incorporate the Visual Editor measurements that we're interested in. This involved distinguishing between Media Search on Commons and image search in Visual Editor, so there are now separate sections for each where I thought it was reasonable.
Tue, Nov 24
Mon, Nov 23
@nettrom_WMF I may have already asked you this elsewhere, but I'll ask again here so we have an officially documented answer.
Do any of these event streams need client IP and/or geocoded data? If not, it will be removed as part of this migration.
Fri, Nov 20
Thu, Nov 19
Since the description mentions "those measurements": what, specifically, are we trying to measure?
Wed, Nov 18
We're quickly running out of time in Q2, so moving this to Q3.
The first pass of this analysis is now complete. I've put the notebooks for this in the NEWTEA GitHub repository; they're numbered 12 through 17. The notebooks numbered 9 through 11 are the same analysis using all edits, where we found a significant increase in the Homepage group compared to the control group.
Mon, Nov 16
I've made T258183 a subtask of this task, since we can't build the dashboard until the instrumentation is in place.
Now that the subtask is resolved and the notebook is accessible, I'm closing this task as well.
I also received word from Mikhail that he'd reviewed it and everything's good to go! Closing as resolved.
Wed, Nov 11
@MMiller_WMF : you've asked me to determine what the duration of the Variant C/D experiment should be. Here's what I've come up with.
@Tgr : Thanks for chiming in here and volunteering to help out! Your suggested approach for reconstructing the edits is what I also came up with. I identified three conditions for exclusion: initializing the Newcomer Task module, changing topics, and changing difficulties. In all three cases the module is loaded/refreshed and the link is correct, so the tag would be applied.
Tue, Nov 10
Thu, Nov 5
Also, @nettrom_WMF, can you confirm whether we should migrate all those at the exact same time, or just migrate them close enough?
@Ottomata : It would be helpful to have ServerSideAccountCreation grouped with these; I've updated the task description to reflect that.
I've added Product Analytics so the team's aware of this; we have our board refinement coming up today. I also see that our team members get subscribed to the child tasks as they're created so they're individually aware of them. Thanks for doing that!
Wed, Nov 4
Tue, Nov 3
I've updated the notebook on GitHub, adding text aiming to make it more accessible to everyone.
Mon, Nov 2
Oct 30 2020
Oct 28 2020
@Milimetric : Thanks for clarifying that, and for your patience while I got back on this! I chatted with the Product Analytics team about this, and we're fine with waiting for the re-sanitization to come around in early November to fill the gap in the sanitized data.
@JAllemandou : Yes, and I'm expecting to see some checksum-based reverts not having the tag because the tag only checks the last 15 edits.
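To illustrate why a 15-edit lookback can miss checksum-based reverts, here is a minimal sketch. It assumes a simplified model in which a revert is any revision whose SHA-1 matches an earlier, non-adjacent revision of the same page. `revert_targets` and its `window` parameter are hypothetical names for this illustration, and the tag's actual logic may differ.

```python
def revert_targets(sha1s, window=None):
    """Map each reverting revision's index to the index of the revision it restores.

    sha1s  -- content checksums of one page's revisions, oldest first.
    window -- how many preceding edits to inspect (None means the whole history).
    """
    reverts = {}
    for i, sha in enumerate(sha1s):
        start = 0 if window is None else max(0, i - window)
        # Start at i - 2: matching the immediate predecessor is a null edit,
        # not a revert, so at least one revision must sit in between.
        for j in range(i - 2, start - 1, -1):
            if sha1s[j] == sha:
                reverts[i] = j
                break
    return reverts


# Revision 21 restores the content of revision 0, with 20 edits in between.
history = ["a"] + [f"r{k}" for k in range(20)] + ["a"]
print(revert_targets(history))             # full-history checksum comparison
print(revert_targets(history, window=15))  # lookback limited to 15 edits
```

In this toy history the full checksum comparison finds the revert, while the 15-edit lookback does not, which is the kind of mismatch described above.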
Oct 27 2020
Oct 26 2020
@mpopov : being able to provide query_hive with a list of parameters and have it replace placeholders would be really useful, I definitely support that!
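A rough sketch of what placeholder replacement could look like, using Python's standard `string.Template`. The helper `fill_placeholders` is hypothetical and is not part of `query_hive`; it only prepares the query text before it would be passed on. It does no escaping, so it assumes trusted parameter values.

```python
from string import Template


def fill_placeholders(query, params):
    """Replace $name placeholders in a query string with values from params.

    Raises KeyError if the query uses a placeholder that params does not define.
    """
    return Template(query).substitute(
        {key: str(value) for key, value in params.items()}
    )


# Hypothetical query and parameters, for illustration only.
query = "SELECT COUNT(*) FROM $table WHERE year = $year AND month = $month"
filled = fill_placeholders(
    query, {"table": "event.prefupdate", "year": 2020, "month": 9}
)
print(filled)
```

Failing loudly on a missing placeholder is a deliberate choice here, since a silently unfilled query is harder to debug than an immediate error.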
All of the proposed changes have been implemented and @MMiller_WMF now has the notebook for testing.
Oct 23 2020
@Isaac : you wanted me to tag you when I filed the task for getting information about revision tag changes into MediaWiki history. Here's said tag. I don't remember what changes you were interested in, maybe they'll fit here too?
Oct 20 2020
I spoke too soon! I've written up a query following the above-mentioned idea, but it turns out not to work in practice. The issue is that a wiki can use a file from Commons but also have a local file description page. Attendekall.jpg on Nynorsk Wikipedia is an example of that. The actual file is on Commons, but it has a local description page to categorize it into the local programming category. This means that the page table isn't an authoritative source for whether a file exists locally on the wiki.
Oct 19 2020
Oct 16 2020
This statistic was mentioned in the Technology Department's Quarter in Review for Q4 of FY 19/20. Looking further, I found out that it comes from the Understanding Engagement with Images in Wikipedia research project. More detailed statistics can be found on the First Round of Analysis page, which I'll dig into further. Looks like T250154 is the parent task for this work.
Created subtasks for all five points, changing this to an epic and moving it to the Epics column on the Product Analytics board.
There's the MediaViewer schema, and there's data from it in the Data Lake. An investigation would be needed to understand what data is actually logged and whether that can answer this.
As far as I know, there is not any live instrumentation that would allow us to measure this. The SearchSatisfaction schema measures dwell time, but requires the user to reach a page through an on-wiki search, and we know that's not representative of how visitors reach us.
Based on my conversations with @cchen and @mpopov it looks like this will not be straightforward to do any time soon. If we're interested in understanding this based on existing edits we'll need to extract and process diffs between revisions.
I've previously discussed something similar with @jwang in relation to T247417. We can do this on a monthly basis by using the sqooped tables in wmf_raw in the Data Lake. We'll left join mediawiki_imagelinks twice: first with the mediawiki_page table to identify local files, and a second time with the mediawiki_page table to identify files used from Commons. If a file isn't found in either of those it should be a redlink, and we can mark it as such.
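The double left join described above can be sketched as follows. This is an illustration only: it uses an in-memory SQLite database with toy `imagelinks` and `page` tables standing in for the wmf_raw tables, the column names (`wiki_db`, `il_to`, `page_namespace`, `page_title`) are modeled on the MediaWiki schema, and `build_toy_db` and `classify_image_links` are hypothetical helpers. Namespace 6 is the File namespace. Note that another comment in this feed reports that local description pages for Commons files make this classification unreliable in practice.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with toy stand-ins for the sqooped tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE page (wiki_db TEXT, page_id INTEGER,
                           page_namespace INTEGER, page_title TEXT);
        CREATE TABLE imagelinks (wiki_db TEXT, il_from INTEGER, il_to TEXT);
    """)
    conn.executemany("INSERT INTO page VALUES (?, ?, ?, ?)", [
        ("nnwiki", 1, 0, "Some_article"),
        ("nnwiki", 2, 6, "Local.jpg"),        # file page on the local wiki
        ("commonswiki", 10, 6, "Shared.jpg"),  # file page on Commons
    ])
    conn.executemany("INSERT INTO imagelinks VALUES (?, ?, ?)", [
        ("nnwiki", 1, "Local.jpg"),
        ("nnwiki", 1, "Shared.jpg"),
        ("nnwiki", 1, "Missing.jpg"),          # no file page anywhere
    ])
    return conn


def classify_image_links(conn, wiki_db, commons_db="commonswiki"):
    """Label each linked file as 'local', 'commons', or 'redlink'."""
    query = """
        SELECT DISTINCT il.il_to,
               CASE
                   WHEN local_page.page_id IS NOT NULL THEN 'local'
                   WHEN commons_page.page_id IS NOT NULL THEN 'commons'
                   ELSE 'redlink'
               END AS file_source
        FROM imagelinks AS il
        LEFT JOIN page AS local_page
            ON local_page.wiki_db = il.wiki_db
            AND local_page.page_namespace = 6
            AND local_page.page_title = il.il_to
        LEFT JOIN page AS commons_page
            ON commons_page.wiki_db = :commons_db
            AND commons_page.page_namespace = 6
            AND commons_page.page_title = il.il_to
        WHERE il.wiki_db = :wiki_db
    """
    rows = conn.execute(query, {"wiki_db": wiki_db, "commons_db": commons_db})
    return dict(rows.fetchall())


print(classify_image_links(build_toy_db(), "nnwiki"))
```

The `CASE` expression checks the local match first, so a file that has a page on both the local wiki and Commons is labeled local.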
I agree with @MNeisler that using the VisualEditorFeatureUse schema makes sense since we're asking questions about user behaviour around features in VE specifically.
Also, I think storing previous and current state of the filters is a great way to do it! Perhaps particularly if we switch to a map type for storing additional action parameters/values. The only other alternative I was going to suggest was having a combination of value and is_default fields (similar to how PrefUpdate does it), where is_default is true if the value is set back to whatever the default is, and false otherwise. Looking at it again, I think storing the previous and current state is a better option.
@egardner : Thanks for the updates and work so far. Thanks also for your patience while I work on getting feedback to you on this. I met with @mpopov last week and discussed a lot of things around this schema, and should've relayed information to you sooner, sorry!
Oct 13 2020
Hmm, I spoke too soon. We rely on wgWMEUnderstandingFirstDay being set in order to oversample in Schema:EditAttemptStep (in WikimediaEvents' shouldSchemaEditAttemptStepOversample()), so we need to disentangle the configuration value from that method before we can switch off EditorJourney logging. It shouldn't be that complicated -- I think instead of checking to see if wgWMEUnderstandingFirstDay is true, we instead want to see if the GrowthExperiments extension is enabled, because we want to oversample edit attempts for all GrowthExperiments users regardless of whether they are opted-in to the Homepage experiment. @nettrom_WMF does that sound right to you?
Oct 9 2020
@Milimetric : It looks like there's no data in event_sanitized.prefupdate for 2020-09-19 through 2020-09-21, and it looks like there's partial data on 2020-09-22. Would it be possible to re-sanitize that date range, or will we need to wait for the re-sanitization script to stop by?
BTW, I came back to this because of T252391, and noticed that when looking at the two-year registration rate on Vietnamese it looks like the time period where we ran our Welcome Survey A/B test had substantially higher registration rates than expected. If we decide to run another experiment, we should consider fitting a time-series model to the data and using it to predict the number of registrations, so we can tell whether registrations fall outside what's expected.
@kostajh : Thanks for picking this up and pinging me about it. I think we should switch off EditorJourney since we're not actively using the data in any ongoing analysis.
@Milimetric : Not a problem, definitely understand that this would be a non-standard request! I've reached out to the PA team and will report back, probably some time on Tuesday.
@Milimetric : I inspected the sanitized data by looking at the event structs of random partitions and aggregating some random months across various years from 2017 onwards, and in all cases the sanitized data looks correct to me.
Oct 7 2020
@mpopov : Thanks for your patience while I work on juggling tasks and finding time to come back to this. I've discussed the schemas with the SD team and we found that the MultimediaViewer and UploadWizard schemas could be marked for deprecation. As I didn't have edit permission for the Google Doc, I left a couple of comments to that effect. I think this concludes everything, so I'm handing it to you for sign-off!
Oct 6 2020
If there is a better/standard way to capture some of these things I'm happy to re-work the schema (but specific guidance would be helpful).
Oct 5 2020
I've dug into this a bit to get an understanding of what data is available through the VisualEditorFeatureUse schema. I also met with @MNeisler on the Product Analytics team to get a check on whether my understanding of the data was correct, and it appears to be.
Oct 1 2020
With these new upgrades happening, I wanted to move my Jupyter notebooks from stat1008 to stat1006 as stat1008 has been very busy lately. After rsync'ing my files, I started reinstalling my R libraries and had them error out because one of them wasn't available for R v3.3. That surprised me, because Debian Buster ships with R v3.5 (as can be found on stat1005 and stat1008).
This is awesome work so far! I've read through this task, its parent task, and the proposed patch and updated the measurement specification to reflect the set of questions mentioned by @CBogen in T263875#6495409. From what I can tell, the proposed schema allows us to answer our current set of questions.
Sep 29 2020
@mpopov : Ah, feel free to reopen this if you want me to ping the SD team and have them come back to me with a list of schemas.
A huge thanks to @mpopov for doing a lot of work on this, improving the data processing code and figuring out ways to massage the data from SearchSatisfaction to pull out the insights!
I've gone through the spreadsheet and added information for all known Growth-related schemas. Looks like the Multimedia team already went through and marked theirs as well. Don't think this needs any peer review, so closing it as resolved.
Sep 24 2020
We're unsure if the finding is trustworthy. I'm moving this back to "Doing" to dig further into this.