Page MenuHomePhabricator

QuickSurveys: Schema changes
Closed, ResolvedPublic1 Estimated Story Points

Description

Furthe schema change requests:

unhashed IP
userAgent
x_forwarded_for
browser_language
uri_query (bonus)

Summary of changes:

Background

For quicksurveys, we need two more data points to be collected in the related schema: event_pageId and event_pageTitle. The reason we need these are:

  1. Eye-balling any issue in EL data collection is much easier if event_pageTitle is collected. Being able to eye-ball the responses from different pages can help us identify possible sampling issues more easily and quickly.
  1. We want to analyze webrequest logs with the additional data collected via quicksurveys and for that, we will need event_pageId and event_pageTitle, on top of clientip, userAgent, and timestamp to be able to correctly match a survey response to a webrequest log.

Other reading schema can also benefit from this information, since what page the user is on when an event happens is a critical point of information. Whether this is added to all schema or not is subject to more technical discussion.

Event Timeline

leila raised the priority of this task from to Needs Triage.
leila updated the task description. (Show Details)
leila subscribed.

Editing and discovery team - are there any reasons why all your schemas wouldn't benefit from making this a standard property provided by EventLogging ?

Jdlrobson reassigned this task from Jdlrobson to JKatzWMF.
Jdlrobson set Security to None.

Following up from email, I'm on board to add these columns to the schema. I don't see it as a blocker for the initial test, but given where the project is with technical work is ahead of the design of the survey it's fine to go ahead with this. Results will be richer.

Currently I'm not clear whether we are proposing doing this for all existing schemas or just quick surveys ones. @leila seemed to think it would be useful for ensuring uniqueness in all of our pages, and given most schemas track this information already they are becoming a defacto standard anyway.

If we were to do this for all schemas @Tfinc @Jdforrester-WMF would this be problematic for your schemas?

That would be useful data to have. @Deskana, are we already tracking it elsewhere? let's avoid duplication if so

There are two separate questions here.

  1. Should page ID and page title be standard fields in EventLogging, like timestamp, user agent, and so on?
    1. My answer is no. EventLogging is a fairly generic event tracking system, and there are more than a few schemas where asking what the associated pageId is does not make sense (e.g. the search funnel in the apps is its own interface not based on any other page).
  2. Should page ID and page title be recorded for the schemas used for quick surveys?
    1. My answer is yes. It helps control for biases in the surveys, as the page the user is on could conceivably alter the way the user answers the question.

I think this field would be invaluable if it provides a reasonable way for product or data analysts to connect the event to a user's web logs. However, given that there is not yet a way for us to do this, I agree with @Deskana that this field should only be added on an ad hoc basis according to the research team's needs.

@leila, I would love to internally productionize the linking you are proposing as this is would fill a sizable gap in our ability to make product decisions. I am specifically interested in segmentation, session depth, and follow-on behavior:

  • segmentation: users who visit more than Y pages are likely to click on X
  • session depth: users tend to click X on their first page or not at all
  • follow-on behavior: users who click X, go on to visit Y more pages than the average user

Do you have code that can generate this sort of thing?

@csteipp as he might have some reasonable security concerns around all of this.

Sounds like we have our answer for this specific and updating the task title accordingly.

atgo renamed this task from Add event_pageId and event_pageTitle to quicksurvey or all schema to Add event_pageId and event_pageTitle to quicksurvey [not all] schema.Oct 6 2015, 8:17 PM

In general, we should limit the number of eventlogging schemas that can be correlated to webrequests. If there's a specific research need for it here, then let's do it, but let's not make that default.

@leila, how are you planning to do the correlation from this data? Are you planning to store the plaintext IP in eventlogging, in addition to the hashed version?

Wouldn't just adding event_pageId suffice? Why do we need both?

Jdlrobson renamed this task from Add event_pageId and event_pageTitle to quicksurvey [not all] schema to Schema changes.Oct 15 2015, 11:56 PM
Jdlrobson updated the task description. (Show Details)

@Jdlrobson, for eye-balling data quality, being able to see page title is very helpful. For joining the data with other tables in production, having page_id is useful.

@csteipp,

In general, we should limit the number of eventlogging schemas that can be correlated to webrequests. If there's a specific research need for it here, then let's do it, but let's not make that default.

+1

@leila, how are you planning to do the correlation from this data? Are you planning to store the plaintext IP in eventlogging, in addition to the hashed version?

Yes, we need event_ip as plaintext IP, the same for event_userAgent. We will take these two information, plus page_title and timestamp and find the corresponding webrequest logs (The more info we have from EL the more accurate the matching will be, for example, x_forwarded_for information can become handy to resolve for multiple matches).

I think this field would be invaluable if it provides a reasonable way for product or data analysts to connect the event to a user's web logs. However, given that there is not yet a way for us to do this, I agree with @Deskana that this field should only be added on an ad hoc basis according to the research team's needs.

@leila, I would love to internally productionize the linking you are proposing as this is would fill a sizable gap in our ability to make product decisions. I am specifically interested in segmentation, session depth, and follow-on behavior:

  • segmentation: users who visit more than Y pages are likely to click on X
  • session depth: users tend to click X on their first page or not at all
  • follow-on behavior: users who click X, go on to visit Y more pages than the average user

Do you have code that can generate this sort of thing?

We don't have any code other than what Adam has already access to. We will keep all the code on GitHub as usual and share as we learn more. :-)

Yes, we need event_ip as plaintext IP, the same for event_userAgent. We will take these two information, plus page_title and timestamp and find the corresponding webrequest logs (The more info we have from EL the more accurate the matching will be, for example, x_forwarded_for information can become handy to resolve for multiple matches).

Yikes. That gives a straight correlation of IP to all eventlogging records for that user. This is only for users who actually answer, not displays, correct? how long does the IP address need to live in EL before we can delete it?

Jdlrobson triaged this task as Medium priority.Oct 26 2015, 5:12 PM

@csteipp, correct, we don't need widget impression data so we should register data only if the user responds to the survey.

Re when we can delete it from EL? I'm not sure about other plans the Reading team may have for this schema. For the specific research that the Research team will be working on (and this schema is changed for that research), all data can be deleted from EL within 30 days after the data is collected, though we (Research) may keep a copy of the data in the cluster for no more than 90 days. Does this work?

@leila, all.

With the rest of the data that we're now saving in Eventlogging like Schema:Edit and AccountCreation, that would give a straight lookup between editors and the IP of anyone answering one of these surveys-- i.e., checkuser data. Eventlogging lacks any authorization or audit, which is part of the reason why we don't currently store any sensitive data there, and my team isn't resourced to take on securing that system at this time. So no, I'm not OK with storing the user's IP address for any amount of time in Eventlogging.

I think there are other ways we can add a marker between this one schema and webrequests, if that is really needed, but IP should not be used. @leila, have you looked at other options that can be used? I happy to work with you to come up with a solution, or escalate this to Toby/Wes if it needs more resourcing from within Reading to come up with a appropriate solution.

@csteipp I understand your point. Thanks for providing more context (I did not have a big picture of all the other schema and what they collect.)

Would the concern be addressed if the data is collected only from logged-out users?

If we're logging the "username" (which is the IP address for logged-out users currently), then I'm fine with that! Then if there's an error in the logic detecting if a user is logged out, or if we ever manage to change our policy and start protecting IP's of logged out users, this will Just Work for compliance.

@csteipp I am confused about your last comment. We're talking of the possibility of logging IP addresses for readers, not contributors. The username field was designed for logging non-PII about contributors: IP addresses of contributors are public data, since they are subject to the terms of use that users accept when editing and are released in perpetuity under CC0.

I understand the username solution addresses your immediate security concerns but it doesn't address the more fundamental question, which is pretty simple: can instrumentation data that includes IP addresses for readers that are not contributors be collected in clear via EventLogging? The answer should be a straight yes or no.

DarTar renamed this task from Schema changes to QuickSurveys: Schema changes.Oct 28 2015, 5:29 PM
DarTar added a project: Research.
DarTar moved this task from Backlog to Radar on the Research board.

I'm also slightly confused, but let me try:

@csteipp I am confused about your last comment. We're talking of the possibility of logging IP addresses for readers, not contributors.

Logging IP addresses of readers in Eventlogging is something we should not do. We currently store the key-hash of the IP as part of the standard Eventlogging collection to prevent this.

The username field was designed for logging non-PII about contributors: IP addresses of contributors are public data, since they are subject to the terms of use that users accept when editing and are released in perpetuity under CC0.

IP addresses of logged-out contributors are public, yes, and used as the contributor's "username" in mediawiki. I'm not clear what you're saying about the field being designed for non-PII. Are you referring to the mediawiki db field? Or a field of a specific schema?

I understand the username solution addresses your immediate security concerns but it doesn't address the more fundamental question, which is pretty simple: can instrumentation data that includes IP addresses for readers that are not contributors be collected in clear via EventLogging? The answer should be a straight yes or no.

I think the answer here is "no", for two reasons.

  1. Just tracking the IP of readers is something we should not do in Eventlogging. We're already collecting that in webrequests, and my team isn't at the place where we can work on securing eventlogging too.
  2. Unless you build something to prevent a non-contributing-reader from later contributing, then as soon as they make an edit, you have checkuser data, because the hashed ip in all the eventlogging schemas is the same, so you can correlate the IP with any other action that was logged in eventlogging.

s/make an edit/make a logged-in edit/

If they never log in, then it's not checkuser data. But #1 still is a concern for me in that case as well.

Change 250029 had a related patch set uploaded (by Jhobs):
Add pageId and pageTitle to EventLogging schema

https://gerrit.wikimedia.org/r/250029

Patch above updates to new revision. Does not include anything around the discussion of logging IP addresses.

Change 250029 merged by jenkins-bot:
Add pageId and pageTitle to EventLogging schema

https://gerrit.wikimedia.org/r/250029

Release needed so I can test...

@Jdlrobson: are we capturing

unhashed IP
userAgent
x_forwarded_for
browser_language
uri_query (bonus)

in the updated schema?

@Jdlrobson: are we capturing

unhashed IP
userAgent
x_forwarded_for
browser_language
uri_query (bonus)

in the updated schema?

The only additional data points we're collecting are pageId and pageTitle. You can see the full schema here: https://meta.wikimedia.org/wiki/Schema:QuickSurveysResponses (revision 14136037).

Thanks @jhobs. Those were also captured as part of this task, I don't think we should close this task before addressing those. I will have a conversation with @csteipp to see how/if those can be collected in EL.

They were captured as "future requests" though, so it might be worth closing this one and splitting those out to another task (this would also help our sprint process). Thoughts @Jdlrobson?

"furthe requests" which was really "further". ;-) I'm happy to open another ticket for this @jhobs. Whatever works better for your workflow. I just added a blocking task to T114164, we should move that to the new task if we create one.

Oh I misread that line.

Regardless, I think T117577 should suffice as a "new task" for that work and can probably be pulled out of this one's blocking tasks with a link to this one for the discussion history. I'll add the Reading Web project so we can keep track of the discussion that occurs there.

@Jdlrobson: are we capturing

unhashed IP
userAgent
x_forwarded_for
browser_language
uri_query (bonus)

in the updated schema?

userAgent is captured as part of the event capsule.