Tue, Dec 11
Mon, Dec 10
Sat, Dec 8
Fri, Dec 7
After discussing this with @Neil_P._Quinn_WMF, we propose to remove the information that identifies what page the user was attempting to edit. I've updated the patch so that the page_id, page_title, and revision_id fields are deleted.
Thu, Dec 6
Wed, Dec 5
Tue, Dec 4
To capture the further discussions we had about this today, the current proposal is:
I had a look at this and have a bunch of questions and thoughts that we should discuss and make decisions on.
Fri, Nov 30
Mon, Nov 26
I looked through the questions we're asking about new users, particularly around account creation, and this issue isn't a problem for us.
We're only concerned about the context in which the account was created (reading/editing, which is answered by the ServerSideAccountCreation schema) rather than more specifically what the users were reading prior to creating their account. It's still worthwhile to take note of this, so I've created T210434 to track the things I need to keep in mind as I'm working with the data.
Sat, Nov 17
Oh, my bad, sorry! Should've asked first. I somehow ended up thinking that it was in production since I was seeing the logout events in the data.
@kostajh : Good news: Logout events are now captured by the schema! Bad news: user_id = 0 for all events, so we don't know which user logged out.
Fri, Nov 16
@Niharika : You're welcome! stat1006 doesn't have access to Hive (the query tool for the Data Lake), so you'll need access to either stat1004 or stat1007. There's also a browser interface called Hue, but I'm unsure if that's a tool that Analytics recommends.
Based on our findings, I don't see a reason for not closing this at this point.
@kaldari asked me to look into this, so I did and here's a bit of info. Data for this schema is being captured (ref this dashboard) and stored in the Data Lake (see more info below), but does not appear to be whitelisted to be ingested into MariaDB (ref this documentation and this file). I'm not sure what the process is to get this data into MariaDB if that's needed, but Analytics can hopefully advise.
We wrote up our experiment plan and put it under our team pages on mw.org.
Thu, Nov 15
I updated the "How do I" section of Kafka on Wikitech as well, as it was mentioned there.
@Tbayer : that is very useful, thanks so much for bringing it to our attention!
Thanks for helping make this happen, @revi!
Wed, Nov 14
Tue, Nov 13
Nov 10 2018
Current status on my end is that I have not found any significant issues with the data.
Nov 9 2018
@revi : that sounds great, thanks!
@revi : after a bit more discussion with @SBisson and @MMiller_WMF, we have two options on how to do this. I'll first describe our preferred approach (option 1), and then provide an alternative (option 2).
Nov 7 2018
I'd like to keep things transparent, so here's an update on why I edited this comment to remove my suggestion. The proposal was to add the returnto and returntoquery parameters to the signup link in the warning message shown to users who try to edit without logging in (the anoneditwarning message). Currently, neither of those is present, which means that after signing up, the user is returned to the main page. Adding both would change the user experience so that they return to the page they tried to edit, and the editor will load.
Nov 6 2018
@MMiller_WMF : Good testing! I hadn't caught that the text editor and Visual Editor have different tab behavior.
Today I learned that my data gathering is most likely biased, shifting accounts from the "editing" to the "reading" context. @SBisson showed me that the English Wikipedia's warning message shown to users who try to edit without being logged in does not contain the returntoquery parameter that we check for. This means that if a user clicks on the link to create an account from that warning message, the account creation is not counted as coming from the editing context. Note that the "Create account" link in the upper right-hand corner on English Wikipedia does contain the returntoquery parameter if an edit is attempted, so those account creations are counted correctly.
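To make the bias concrete, here is a minimal sketch of the kind of check described above. It is not the actual query: the function name signup_context, the example URLs, and the rule "an edit action inside returntoquery means editing, anything else means reading" are assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlparse


def signup_context(signup_url: str) -> str:
    """Classify a signup link as coming from an 'editing' or 'reading' context.

    Hypothetical rule (an assumption, not the production logic): the context
    is 'editing' only if returntoquery carries action=edit or veaction=edit.
    """
    params = parse_qs(urlparse(signup_url).query)
    # parse_qs URL-decodes the value, so returntoquery is itself a query string.
    returntoquery = params.get("returntoquery", [""])[0]
    inner = parse_qs(returntoquery)
    actions = inner.get("action", []) + inner.get("veaction", [])
    return "editing" if "edit" in actions else "reading"


# Link that carries returntoquery (like the "Create account" link during an edit).
with_query = (
    "https://en.wikipedia.org/w/index.php?title=Special:CreateAccount"
    "&returnto=Example&returntoquery=action%3Dedit"
)
# Link without returntoquery (like the one in the warning message).
without_query = (
    "https://en.wikipedia.org/w/index.php?title=Special:CreateAccount"
    "&returnto=Example"
)

print(signup_context(with_query))
print(signup_context(without_query))
```

Under this assumed rule, a signup that starts from the warning message falls into "reading" even though the user was trying to edit, which is the shift described above.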
Nov 2 2018
Nov 1 2018
@SBisson : This makes sense, thanks for explaining that! I don't see a need to change the way groups are set up, because we can define the names of the groups in such a way that it would be possible to track experiments and assignments. If the format is set to something like "[experiment name]_[group name]" it's easy for me to split that later if I need to. So for a hypothetical first experiment with two groups "survey" and "control", the group names could be "exp1_survey" and "exp1_control".
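As a rough illustration of how that split could look on my end, here is a small sketch. The function name split_group_name is hypothetical, and it assumes the experiment name itself never contains an underscore, so the first underscore is always the separator.

```python
from typing import Tuple


def split_group_name(name: str) -> Tuple[str, str]:
    """Split '[experiment name]_[group name]' into its two parts.

    Assumption: the experiment name has no underscore, so we split on the
    first one. Group names may contain further underscores.
    """
    experiment, separator, group = name.partition("_")
    if not separator or not experiment or not group:
        raise ValueError(f"not in '[experiment]_[group]' format: {name!r}")
    return experiment, group


for value in ("exp1_survey", "exp1_control"):
    print(split_group_name(value))
```

If experiment names might ever contain underscores, a different separator or two separate fields would be safer than relying on this convention.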
Patch submitted, adding Analytics so they can triage and review.
Oct 31 2018
@SBisson : I thought about this, and I think it would be fantastic if we can record data for all users so that we know what experiment and group they were in and have that readily available for all users that were in an experiment. My main concern is that we'll switch between experiments and conditions, and knowing exactly when they started & ended, what the various groups were, and so forth, will become complicated over time. If we store it explicitly, we don't need to document the set of rules needed to infer it.
I updated the work log today to expand the description of the contexts. This was done to make it clear that we capture 100% of the relevant contexts, and that the "reading" context does not exclusively mean the user was reading a wiki page. It will also capture, for instance, looking at search results.
Oct 30 2018
The main question being asked here is whether the survey we are adding has a detrimental effect on user activity. I've discussed this with @MMiller_WMF, and also with the Product-Analytics team. Out of those meetings come the following recommendations and questions:
@MMiller_WMF : I looked into the queries behind this and found that I was incorrect. As far as I can tell, the query captures 100% of reading/editing contexts but the "reading" context does not mean only reading an article. It will also capture accounts created from a context that has a query associated with it (e.g. a search). I don't think it's feasible to separate out queries, because they can also be used to read articles.
@Neil_P._Quinn_WMF : Could you look over the specification and see if you spot any errors or omissions?
@MMiller_WMF : I see that I maybe shouldn't phrase loose thoughts as questions when there's a dozen other people subscribed to a task, sorry! It was mainly intended as a mental note that these two tasks appear to be connected, so I should keep an eye on the progress made over there since it looked like they're working on it. But, I could also make sure everyone knows about all the symptoms, so thanks for tagging @kostajh!
In preparation for this task, I looked into whether EventLogging data for our EditorJourney schema would be available in the Data Lake while being tested (i.e. on betalabs). My understanding is that the raw data will be available as it goes through Kafka, ref this part of the EL documentation.
Oct 29 2018
Per T202348#4697469, the page_token and session_token properties have been updated to be optional.
Oct 25 2018
@Neil_P._Quinn_WMF : I wanted to let you know, and at the same time document, that since the proposed Schema:Edit2 uses snake_case (per recommendations from AE), the editingSessionId property in the Edit schema becomes editing_session_id in the new version. In this (VisualEditorFeatureUse) schema it's camelCase. This might make joins between the two schemas somewhat confusing. Wanted to flag that so you can consider whether to keep it or not.
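One way to sidestep the naming mismatch at analysis time is to normalize property names before joining. The sketch below is only an illustration: the helper names and the sample events are made up, and it assumes simple camelCase names like editingSessionId (no runs of capitals).

```python
import re
from typing import Any, Dict


def camel_to_snake(name: str) -> str:
    """Convert a simple camelCase name to snake_case."""
    # Insert an underscore before each capital that follows a lowercase
    # letter or digit, then lowercase the whole name.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


def normalize_keys(event: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the event with all top-level keys in snake_case."""
    return {camel_to_snake(key): value for key, value in event.items()}


# Made-up sample events; only the property names matter here.
feature_use_event = {"editingSessionId": "abc123", "feature": "link"}
edit2_event = {"editing_session_id": "abc123", "action": "ready"}

normalized = normalize_keys(feature_use_event)
print(normalized["editing_session_id"] == edit2_event["editing_session_id"])
```

Names that are already snake_case pass through unchanged, so the same helper can be applied to both schemas.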
I've created Schema:Edit2, and its talk page has the standard template as well as documentation of its relationship and some information about the properties. Once it's finalized, I'll also update the talk page of Schema:Edit to reflect its successor.