Wed, Feb 19
I've looked into the patterns in the first case, where a user activates the Newcomer Tasks module but we don't have an action = "se-activate" event recorded in HomepageModule. To do that, I found all users (excluding known test accounts) who have two consecutive visits in HomepageVisit that showed the Newcomer Tasks module going from "inactive" to "active". Then I joined the first of those sessions with HomepageModule in order to learn if we had any client-side data at all about that visit (which is where they would activate the module). Split by wiki and desktop/mobile, here's whether we have data for those:
@MMiller_WMF : my TL;DR of this is that ~0.5% of users who visit the Homepage do not have their user preferences set correctly to reflect the state of their group assignment in Homepage experiments. I find that to be a concern for two reasons: 1) we've assumed that these assignments are reliable; and 2) it's unclear at this point how this will work when we move to larger wikis.
I dug into the data we have for the two users that Marshall sent over information about. For one of the users, I could identify when they activated the Newcomer Tasks module (by comparing consecutive entries in the HomepageVisit schema). For those two visits, we had no data at all logged in HomepageModule. I noticed that the user was on mobile at the time, but don't yet know if that's a pattern.
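For anyone who wants to reproduce the comparison of consecutive HomepageVisit entries, here is a minimal sketch in plain Python. It is not the query I ran: the rows are made up, and the field names (user ID, visit token, timestamp, module state) as well as the helper names `find_activation_visits` and `summarize` are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical stand-in for HomepageVisit rows:
# (user_id, visit_token, ISO timestamp, Newcomer Tasks module state)
visits = [
    (1, "a1", "2020-02-01T10:00", "inactive"),
    (1, "a2", "2020-02-01T11:00", "inactive"),
    (1, "a3", "2020-02-02T09:00", "active"),
    (2, "b1", "2020-02-01T12:00", "inactive"),
    (2, "b2", "2020-02-03T08:00", "active"),
    (3, "c1", "2020-02-01T13:00", "active"),
]

# Hypothetical stand-in for HomepageModule rows: (visit_token, action)
module_events = [
    ("a1", "impression"),
    ("a2", "impression"),
    ("a2", "se-activate"),
    ("c1", "impression"),
]


def find_activation_visits(visits):
    """Return (user_id, visit_token) for each visit where the module was
    "inactive" and the same user's next visit shows it as "active"."""
    by_user = defaultdict(list)
    for user_id, token, ts, state in visits:
        by_user[user_id].append((ts, token, state))

    activations = []
    for user_id in sorted(by_user):
        rows = sorted(by_user[user_id])  # ISO timestamps sort chronologically
        for (_, token, state), (_, _, next_state) in zip(rows, rows[1:]):
            if state == "inactive" and next_state == "active":
                activations.append((user_id, token))
    return activations


def summarize(activations, module_events):
    """For each activation visit, report how many client-side events were
    logged and whether an "se-activate" action is among them."""
    actions = defaultdict(list)
    for token, action in module_events:
        actions[token].append(action)

    return [
        {
            "user_id": user_id,
            "visit_token": token,
            "n_events": len(actions.get(token, [])),
            "has_activate": "se-activate" in actions.get(token, []),
        }
        for user_id, token in activations
    ]


for row in summarize(find_activation_visits(visits), module_events):
    print(row)
```

In this toy data, visit "b1" is the kind of case described above: the module state flips between two consecutive visits, but no client-side events at all are found for the visit where the activation should have happened.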
Tue, Feb 18
This is now completed and Legal has been notified.
Verified in Hive that data is not available for any of the three schemas up through 2019-11-04 and this task can be closed. Thanks for your work on this @fdans !
Sun, Feb 16
The analysis can be found in this notebook. For simplicity, I used the most recent 90 days of data, with known test accounts removed. I used HomepageModule as the authoritative source of data, because if a user blocks EventLogging we of course won't have any data at all.
Tue, Feb 11
The two listed schemas are the ones I know about that use the editing session ID. If there are others, please do add them!
@Abit and @Ramsey-WMF : could you let me know if you have any concerns about the work done in this analysis? If not, then I'll go ahead and get this ticket closed next time Product Analytics reviews their board.
Mon, Feb 10
Hi @Ragesoss, acknowledging that the Product Analytics team has received this ticket and will review and prioritize it during our next board review meeting, which is on 2020-02-17.
I've dug into this using data from launch of Newcomer Tasks up until Feb 5, 2020. I've excluded known test accounts from the data gathering.
Thu, Feb 6
Moving this to "Next up", as I believe the next step here is to publish the results on-wiki, but that's currently not a high priority.
Wed, Feb 5
These tables have now been deleted, ref the terminal session shown below. Closing this task as resolved.
T244312 has been created for tracking Analytics Engineering's part of this, deleting the data from the relevant sanitized EventLogging tables.
Tue, Feb 4
@MMiller_WMF has now run the final report using all available Homepage data, which means that we're ready to start deleting old data. Closing this task.
Mon, Feb 3
@Abit and @Ramsey-WMF : I've completed a first pass of an update to the search analysis, and am concerned about some of the underlying data. Since the last update of this analysis a year ago, EventLogging data is only available in Hive, so I updated the queries to work there. However, some of the analysis, e.g. the graph of daily search activity, shows a lack of data starting on 2019-12-10. I'm not sure if there were any changes to the EventLogging code for search at that point? If you or someone on your team knows, that would help me understand what's going on here and whether to dig further into the data.
Jan 24 2020
@Mayakp.wiki : pull request has been reviewed, conflicts resolved, and it's now merged. Thanks for your work on this!
Jan 23 2020
I'll add the subtask for Analytics Engineering to delete data once the notebook is ready to run.
@Abit & @Ramsey-WMF : I'm getting back into the swing of things, and am wondering about the priority and deadline for the two search related measurements above. Are they still needed, and if so, is the end of January also the deadline for those?
Jan 7 2020
Dec 24 2019
I'll be following up on this after the holidays. While switching to the Spark backend does appear to work, it also has its own quirks. Some of the queries used in Homepage reports are slow, and querying mediawiki_history was also problematic. Here, "slow" means not finishing in several hours, which is a significant performance decrease compared to what we had with Hive. There might be configuration parameters that can alleviate this and I've had success with some, but want to discuss them with Analytics folks to understand tradeoffs in a shared environment. Now that we also have @mpopov's Hive CLI solution, we should consider the benefits and drawbacks of all these and what to do moving forward.
Dec 23 2019
I checked event_sanitized.homepagevisit, and the whitelisting started on 2019-12-19. Closing this as done, thanks!
Dec 22 2019
@Abit and @Ramsey-WMF : I've today updated the notebook used for the analysis of Question 2, as the analysis I did with @Mayakp.wiki on Question 1 identified a couple additional types of edit comments that should be included. This mainly affects the number of SDC edits for the last two quarters of this year, and shifts the proportions quite drastically upwards (above 10%) for those quarters.
I ran some tests on notebook1004 and notebook1003. From what I can tell, the behavior of the library is the same in both cases. I did run into a couple of issues during installation/use that were not related to queries or data formats. I'll document those below.
Dec 19 2019
From our discussions, the best option moving forward appears to be to switch the backend from Hive to Spark. With thanks to @Ottomata, there's now a pull request that does this. Before merging it, we'll need some testing. From @Neil_P._Quinn_WMF, the following things need to be tested:
Dec 16 2019
@Abit and @Ramsey-WMF : we discussed in our last meeting that the Information and Artwork template had been updated to pull SDC data in through Lua, and that you were interested in understanding the impact of that. Given the limited scope of this, I decided to go ahead and dig around in the data to see if I could figure it out. The work has been documented in this notebook on GitHub.
Dec 13 2019
Based on our conversation in T231952, a couple of files on Commons that I suspect have been affected by this bug are:
Dec 11 2019
Dec 10 2019
@Abit and @Ramsey-WMF : thanks for your patience with me getting this ready. It turned out that I jumped the gun during our meeting and gave you numbers based on all edits, not just the ones made some number of days after upload. Anyways, we've got numbers, and while they might not be as impressive, I think they're still positive!
Dec 9 2019
Dec 4 2019
@Nuria : I can confirm what @mforns mentions. During my conversations with him yesterday, it became clear to me that how the Growth team is using EventLogging is an in-between case. Since we're running fairly long experiments, we need data for longer than the default 90 days, but we also need richer data than what we'd limit ourselves to if we were to store data indefinitely. Hence a 270 day sliding window for our sanitized data would work well for us. (By the way, this is also why we asked for deletion of sanitized data in T234870 as we completed the Help Panel experiment: we could no longer keep that data around.)
Dec 3 2019
I've now completed a preliminary analysis of question 3, the quarterly measurement of media containing structured fields using non-English languages. As discussed in our meeting last week, this translates to "files with captions in a non-English language". The code behind the analysis can be found in this notebook on GitHub.
Dec 2 2019
Nov 27 2019
I think this is partly a design issue, which @RHo should chime in on, and partly a measurement issue. With regards to the design part, I'm trying to think ahead to how things might work with guidance. If the user is recommended the Egg tart article, clicks through to it, goes somewhere else, and then later returns to it, should they again see the guidance (meaning we treat them as if they came through from the Homepage)? I think the answer to that affects how we connect the Homepage schema to the Help Panel and EditAttemptStep schemas in that situation.
Nov 26 2019
Using the top 10 wikis based on the Wiki segmentation's size ranking (geometric mean of monthly active editors and monthly unique devices), I grabbed the number of accounts for each of them, the number of accounts with an email address set, and the number who have verified their email address. While verification numbers weren't requested, I already had that in the query I reused for this, and maybe the differences between wikis would be meaningful to the Community Tech team.
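To make the size ranking concrete, here is a small Python sketch of ranking wikis by the geometric mean of monthly active editors and monthly unique devices. The wiki names and numbers are invented, `size_score` and `top_wikis` are hypothetical helpers, and the Wiki segmentation's own computation may differ in details (for example, which months it averages over).

```python
import math

# Invented example data: wiki -> (monthly active editors, monthly unique devices)
metrics = {
    "wiki_a": (100, 10_000),
    "wiki_b": (400, 400),
    "wiki_c": (10, 1_000),
}


def size_score(active_editors, unique_devices):
    """Geometric mean of the two size measures."""
    return math.sqrt(active_editors * unique_devices)


def top_wikis(metrics, n=10):
    """Return the n wiki names with the largest size score, largest first."""
    return sorted(metrics, key=lambda wiki: size_score(*metrics[wiki]), reverse=True)[:n]


print(top_wikis(metrics, n=2))
```

The geometric mean keeps one very large measure (unique devices are typically orders of magnitude larger than editor counts) from dominating the ranking the way it would with an arithmetic mean.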
Nov 25 2019
Like so many others, I'd like to request my credentials for access on stat100x and notebook100x. My username is nettrom. I'll keep an eye out on my Gmail spam folder as well, cheers! :)
Nov 18 2019
Nov 7 2019
Nov 1 2019
What happens here is most likely related to T237124.