Collect data for https://meta.wikimedia.org/wiki/Research:Mobile_User_Behavioural_Differences
Closed, ResolvedPublic

Description

Following needs to happen:

See https://meta.wikimedia.org/wiki/Research:Mobile_User_Behavioural_Differences for more information.

Event Timeline

yuvipanda renamed this task from Collect data for https://meta.wikimedia.org/wiki/Research:Mobile_User_Behavioural_Differences (Tracking) to Collect data for https://meta.wikimedia.org/wiki/Research:Mobile_User_Behavioural_Differences. Feb 23 2016, 3:59 AM
yuvipanda updated the task description.

@Mpaulson , @DarTar : if you could sign off here (or not) for max transparency?

Note that Michelle and Dario have both now given permission in email for this to go ahead.

The issue with collecting this data is not just the ethics of collecting it, but also the extra strain it places on the engineers involved in the implementation and ongoing maintenance of the instrumentation and the facilities to house the data it collects, especially since it runs counter to the current efforts by the analytics team to make data easier to share publicly. I am not willing to commit analytics resources to this, given the current load.

I wasn't aware anyone involved in the research had asked for any analytics engineers' support? Absent "does anyone know where I'd look for examples of how to implement X"

To clarify on resources: I have volunteered my time to do this.

I think at the very least https://meta.wikimedia.org/wiki/Research:Mobile_User_Behavioural_Differences and/or this ticket need more detail on sampling rate, data retention, etc. I'm also wary of anything involving a new client UUID being logged, given recent history on related issues, potential damage to our already foundering community goodwill, and the fact that we're all currently trying to get through some big organizational changes that impact related policy and decisions.

Indeed, I'll expand the documentation. Legal has given their clearance as part of our conversation with them but we should move the useful bits of that thread on to metawiki.

I wasn't aware anyone involved in the research had asked for any analytics engineers' support?

That does not mean that you are not implicitly laying claim to them, by planning to use analytics infrastructure to collect and store that data for you.

Sure, fair enough. It sort of sounds like, then, if you don't want to commit analytics resources due to load, and by resources mean infrastructure usage, there's a performance issue with temporarily storing this? Could you explain more? Apologies if I'm misunderstanding.

Just for transparency I'm rewriting/expanding the documentation on meta as we speak to include the data sanitisation and privacy concerns, the sampling approach, and the checkins we've done with other teams on utility and appropriateness.

Now expanded! Please let me know if there are specific questions it does not answer you would like clarified.

There are many considerations here (philosophical and technical)

I am just highlighting the biggest technical issue that will prevent the schema https://meta.wikimedia.org/wiki/Schema:MobileBehaviouralDifferences from working:

"The user's XFF-resolved IP address, without salting and hashing (necessary for geolocation and IP classification)"

To be clear, this data has never been available on eventlogging schemas. We recently dropped IP from eventlogging completely, but even before that change we never stored IPs in the clear. The service has always encrypted IPs, and we have never offered any schema the ability to retrieve raw IPs.

Any chance the EL geolocation can get connection type as well, then? If so a hash would do fine, but I was operating under the understanding it couldn't (hence the request for a raw IP)

If EL geolocation comes from the Varnish GeoIP support: we could add netspeed to the available data there (and that doesn't seem like an unreasonable thing to do), but it would probably get blocked up on available engineers' time to implement the change.

If EL geolocation comes from the Varnish GeoIP support:

EL does not have geo-location, never has.

Any chance the EL geolocation can get connection type as well, then? If so a hash would do fine, but I was operating under the understanding it couldn't (hence the request for a raw IP)

If EL geolocation comes from the Varnish GeoIP support: we could add netspeed to the available data there (and that doesn't seem like an unreasonable thing to do), but it would probably get blocked up on available engineers' time to implement the change.

Yeah, that was my understanding and worry, hence the initial data collection aiming at "problematic" as an alternative to "impractical". I don't particularly want to occupy a ton of opsen time.

Ok, ping us in person to see if we can find a less problematic path for the research you want to do. I *think* it should be possible to test the decay of IP/UA combos using statistical tooling on our existing dataset.

Or we could just give up connection-type geolocation and instead rely on per-country data, which means we could use the varnish geolocation and wouldn't have to store the raw IPs, just a hash for pseudo-UUID generation. We'd lose the ability to contrast connection types, but it sounds like it'd be a lot less effort and a lot cleaner for everyone.

Okay, so is everyone okay with this if we drop connection-type data and, by extension, drop the need for the IP address? And just return a hashed and salted IP and the country-code, instead of the unhashed IP.

Eventlogging no longer has access to IP information; we made these changes recently.

Okay, then how does the geolocation work, and could however the geolocation works also pass through a hashed and salted IP?

There is no geolocation on eventlogging data, only on webrequest data (pageview data that comes via varnish)

I feel like we're talking at cross-threads here.

Clearly there is the ability to incorporate geolocation data into the eventlogging stream before it hits the databases, otherwise, well, we wouldn't have any country codes in any of the eventlogging tables, which we do. So I'm asking (1) if dropping the unsanitised IP element of things would make people comfortable and (2) if you know whether whatever is making the country codes appear in the EL data could also pass through a hashed/salted IP or not.

Clearly there is the ability to incorporate geolocation data into the eventlogging stream before it hits the databases,
otherwise, well, we wouldn't have any country codes in any of the eventlogging table

JavaScript has the ability to determine geolocation (see, for example, the NavigationTiming extension), so the country can be sent from the client, and that is how it ends up in the NavigationTiming table:
https://github.com/wikimedia/mediawiki-extensions-NavigationTiming/blob/master/modules/ext.navigationTiming.js#L140

And can the client-side JS collect a hashed IP too?

The client side JS cannot collect IP - it used to be able to, and it cannot now (in a non-hostile way, at least) and consensus seems to be against adding back this ability.

So to summarise, it is now impossible to conduct this research in the way specified?

Yes - looks like IP collection is pretty much no go without significant resources being dedicated to it in some form or other.

There is geoIP information available in clientside JS, so we can log that. So technically, if we replace unsanitisedIP in the schema with geoLocation (as provided by the geoIP stuff we have available in varnish atm), we can still run this (at a technical level).

Sure, but we need something with the granularity and changerate of the IP address, is the thing.

Right, so in that case I *think* we might be screwed, primarily from a consensus perspective.

Namely: it is possible, it is endorsed by both the research team and our actual product teams, it has ethical and legal +1s from legal, but AnEng don't want to permit it?

Kind of, but complicated?

So IP address isn't available in clientside GeoIP info anymore, so providing it would require that we do one of the following:

  1. Re-add client IP to GeoIP. This however is a long lasting change that would mean IP is available to all client side JS all the time, not just for this code. This would also probably need to be done at the varnish level - a pretty specialized area of code that very few people are comfortable touching.

  2. Write a custom logger beacon that is run just for this experiment. This allows us to collect all the info we need without the big hammer of #1 and is probably the right technical choice. This is complicated by the current move to Varnish 4, but if all we need to do is run this for a month it is probably not too hard.

#1 is not really an option for various technical, organizational and privacy reasons. #2 I feel is still reasonable and fairly self contained, but given previous discussion I guess it would require Analytics Engineering to sign off on me doing the work to make it happen.

Indeed, although unless people are going to substantially change their minds it might be better to bump this so we can come up with a resolution. Let's look at it after the weekend.

In meta it says:

"Once we have a week of data we will geolocate the IP addresses, looking for country of origin and connection type, hash the IP and user agent together to provide a pseudo-UUID, and dispose of the original IP address. On reader behavioural differences we will look at the length of each session, the amount of time on page, the number of pages viewed during a session, and the number of sessions "

The goal here is to have data for this research; let's remember that capturing IPs is not a goal in itself. To be clear, we have done similar research for Visual editor without using IPs. Now, is the intent to run this experiment across every possible feature of every possible wiki, or rather to focus on a sliver of our user population/feature?

The idea behind it is that the pseudo-UUID is calculated client side and never stored, but sent in every request. We have done that with success in the past; let's talk about this further either in IRC or in a short meeting.
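
For illustration only, here is a minimal Python sketch of the "hash the IP and user agent together" step quoted from meta above. It is not the implementation anyone on this task proposed: the function name `pseudo_uuid`, the use of HMAC-SHA256, the separator byte, and the salt handling are all assumptions made for the example.

```python
import hashlib
import hmac
import secrets


def pseudo_uuid(ip: str, user_agent: str, salt: bytes) -> str:
    """Return a keyed SHA-256 digest of the IP and user agent (hypothetical helper)."""
    # Join with a NUL separator, which cannot appear in an IP address, so that
    # ("1.2.3.4", "5Foo") and ("1.2.3.45", "Foo") produce different messages.
    message = f"{ip}\x00{user_agent}".encode("utf-8")
    return hmac.new(salt, message, hashlib.sha256).hexdigest()


# Assumed salt handling: generated once per collection period and discarded
# afterwards, so digests cannot be recomputed later from a known IP.
salt = secrets.token_bytes(32)

# 203.0.113.7 is a documentation-only address (RFC 5737), not real data.
token = pseudo_uuid("203.0.113.7", "Mozilla/5.0 (example)", salt)
same = pseudo_uuid("203.0.113.7", "Mozilla/5.0 (example)", salt)
other = pseudo_uuid("203.0.113.8", "Mozilla/5.0 (example)", salt)
```

With a fixed salt the same IP/UA pair always maps to the same token, which is what allows events to be grouped. Once the salt is thrown away the mapping can no longer be reproduced from a known IP.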

Sure; as I've left the Foundation I'm somewhat hindered in setting up a meeting. I'm basically free for the next week so please set one up whenever works for you.

@Ironholds_backup @Nuria @yuvipanda please rope me in if/when there's an agreement on a proposed solution. If the client-side option is viable and we have resources (and Analytics sign-off) to implement it, I'd strongly support it over #1. The output of this study will directly benefit work currently conducted by Research in collaboration with Reading and our collaborators at Stanford, but we can't create more headaches on data collection at a time where a lot of effort is going into explaining publicly WMF's data collection and retention policies and how they impact reader research.

Also, heads up that I'll be AFK (and unreachable except for emergencies) from 3/30 until I am back in the office on 4/11. The NDA (T128037) may have to wait until I am back, unless this task is closed and you can coordinate directly with @Wwes (who signs NDAs for research collaborations as our C-level) before then.

Shall do, although I don't think they're comparable (unless I count as a corporate server outside WMF control already? That was fast ;p)

Nuria, please let me know when you'd like to meet. As said, this task is on you, rather than me, since I can't see your calendar.

"On reader behavioural differences we will look at the length of each session (1), the amount of time on page (2), the number of pages viewed during a session (3), and the number of sessions within a fixed period (say, 24 hours) (4)"

Please note that it is possible to measure 2 from the client side easily. 1 and 3 are possible/easy to measure client side using a pseudo-UUID, depending on the feature; we have done it before for VE and the text editor. Now, without storing ids/tokens it is not possible to do 4.

Also, doing 1, 2 and 3 doesn't touch on the IP/UA combo decay research (5) also mentioned on the meta page.
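
As a rough illustration of metrics 1 to 4, the Python sketch below derives them from a list of (token, timestamp) page-view events. The event shape, the function names, and the 30-minute inactivity cutoff are assumptions made for the example, not anything specified on this task or on the meta page.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity cutoff, not from this task


def sessionise(events):
    """Group (token, timestamp) page views into per-token sessions.

    Returns {token: [[timestamp, ...], ...]}, one inner list per session.
    """
    by_token = defaultdict(list)
    for token, ts in events:
        by_token[token].append(ts)

    sessions = {}
    for token, stamps in by_token.items():
        stamps.sort()
        current = [stamps[0]]
        grouped = [current]
        for ts in stamps[1:]:
            if ts - current[-1] > SESSION_GAP:
                # Gap too long: start a new session for this token.
                current = [ts]
                grouped.append(current)
            else:
                current.append(ts)
        sessions[token] = grouped
    return sessions


def session_metrics(session):
    """Return (session length, pages viewed, gaps between page views)."""
    gaps = [later - earlier for earlier, later in zip(session, session[1:])]
    return session[-1] - session[0], len(session), gaps


t0 = datetime(2016, 3, 1, 12, 0)
events = [
    ("a", t0),
    ("a", t0 + timedelta(minutes=5)),
    ("a", t0 + timedelta(hours=2)),
    ("b", t0 + timedelta(minutes=1)),
]
sessions = sessionise(events)
length, pages, gaps = session_metrics(sessions["a"][0])
sessions_for_a = len(sessions["a"])  # metric 4, all events fall within 24 hours
```

The gaps between page views only approximate time on page (2), and the last page of a session has no gap at all, which is one reason measuring it directly on the client is simpler. Counting sessions per token (4) only works if the same token is seen across sessions, which matches the point above about needing stored ids/tokens.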

Ironholds_backup claimed this task.

Switching to a new ticket now that the scope has changed dramatically. Thank you to Nuria for working through the engineering issues with us.