Paste P4267


Authored by RobLa-WMF on Oct 19 2016, 10:04 PM.
Referenced Files
F4627671: ArchCom-RFC-2016W42-irc-E323.txt
Oct 19 2016, 10:04 PM
21:03:59 <robla> #startmeeting ArchCom RFC meeting: T145472: Survey Cookies/Local Storage usage on Wikimedia sites
21:03:59 <wm-labs-meetbot> Meeting started Wed Oct 19 21:03:59 2016 UTC and is due to finish in 60 minutes. The chair is robla. Information about MeetBot at
21:03:59 <wm-labs-meetbot> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:03:59 <wm-labs-meetbot> The meeting name has been set to 'archcom_rfc_meeting__t145472__survey_cookies_local_storage_usage_on_wikimedia_sites'
21:03:59 <stashbot> T145472: Survey Cookies/Local Storage usage on Wikimedia sites -
21:04:07 <robla> o/
21:04:37 <bd808> robla: I guess the bot being blocked from topic changes in this channel saves a step :/
21:05:54 <robla> hi zzhou_ !
21:05:58 <zzhou_> hi everyone
21:06:43 <robla> gwicke, Krinkle and I just briefly discussed this in the ArchCom Planning meeting last hour
21:07:31 <robla> ...and I've been talking to bawolff and dapatrick about this for a little while
21:07:43 * bawolff waves
21:08:05 * dapatrick waves also
21:08:50 <robla> does anyone have any questions about this RFC before I prompt zzhou_ to ask questions that he has?
21:10:19 <bd808> I love the idea of automating the audit process.
21:10:34 <bd808> But I wonder who is going to watch the logs?
21:10:37 <Scott_WUaS> (Hi All)
21:10:56 <bd808> and if we are going to spam flood ourselves each time a new cookie is introduced
21:11:31 <zzhou_> so I am not sure who will watch the logs right now - probably someone on the legal team?
21:11:37 <Zppix|mobile> ??
21:11:40 <zzhou_> and ideally over time people will be more careful about introducing these so
21:11:49 <zzhou_> that there will be fewer alerts over time
21:12:00 <zzhou_> as we better communicate to them about this issue
21:12:14 <zzhou_> of course the first step is to understand the issue (hence this RFC) and figure out the scale of it
21:12:43 <zzhou_> if there really are that many cookies that we need to track down, we might have to think of more scalable ways of managing this for now
21:14:05 <robla> bawolff: can you describe how you used mwgrep to help out zzhou_ ?
21:14:42 <bd808> I'm not so much concerned about distinct cookies. It's more about the sheer number of requests that are likely to come between the cookie being introduced and a change to the extension to make it as expected.
21:15:05 <dapatrick> I think that it's necessary to have some process/application which consumes the logged cookie information, stores unique cookies, associated wiki name, and number of times observed. But I'm getting a little ahead of where we are in the meeting.
21:15:18 <bd808> we may have to sample the logs
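The consumer dapatrick describes (unique cookies, associated wiki name, number of times observed), combined with the sampling bd808 suggests, might look roughly like the sketch below. The record shape, the function name `aggregate_cookie_log`, and the wiki and cookie names are all assumptions made for illustration; nothing in the meeting specifies a log format.

```python
import random
from collections import Counter


def aggregate_cookie_log(records, sample_rate=1.0, rng=None):
    """Count sightings of each (wiki, cookie_name) pair in a sampled log.

    `records` is an iterable of (wiki, cookie_name) tuples. This shape is
    hypothetical; the real log format was not decided in the meeting.
    """
    rng = rng or random.Random()
    counts = Counter()
    for wiki, cookie_name in records:
        # Keep roughly `sample_rate` of the records; 1.0 keeps everything.
        if sample_rate >= 1.0 or rng.random() < sample_rate:
            counts[(wiki, cookie_name)] += 1
    return counts


# Made-up records, for illustration only.
records = [
    ("enwiki", "gadgetState"),
    ("enwiki", "gadgetState"),
    ("jvwikisource", "gadgetState"),
    ("enwiki", "hideBanner"),
]

counts = aggregate_cookie_log(records)
for (wiki, name), seen in sorted(counts.items()):
    print(wiki, name, seen)
```

With sampling enabled, the counts become estimates: dividing each count by `sample_rate` gives an approximate total, and rarely seen cookies may be missed entirely.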
21:15:25 <bawolff> We tried to find instances of setting cookies in js pages on wiki
21:15:50 <bawolff> mwgrep returned like 1500 results
21:15:56 <Zppix|mobile> #idea test this out on meta or mediawiki before completely rolling the change out
21:16:00 <bd808> o_O
21:16:20 <bawolff> so that sort of simple static analysis was kind of unfeasible
21:16:25 <bd808> bawolff: mostly gadgets storing state?
21:16:43 <bawolff> Seemed like it
21:17:16 <bawolff> often the same gadget or similar gadgets copied across multiple wikis
21:18:37 <bd808> dapatrick: talk to tgr and get sentry setup in prod :)
21:18:39 <robla> #info <bawolff> We tried to find instances of setting cookies in js pages on wiki; mwgrep returned like 1500 results
21:18:58 <dapatrick> bd808 Will do.
21:19:31 <robla> #info 14:18:36 <bd808> dapatrick: talk to tgr and get sentry setup in prod :)
21:19:32 <bawolff> One concern i have is it seems kind of like we are approaching this backwards - we want to know when personal information is stored so we are looking at cookies
21:19:49 <bawolff> but.. really cookies are just a means
21:20:18 <bawolff> and its no different if personal info is stored some other way
21:20:36 <bawolff> So it feels like we are looking at a symptom
21:20:43 <bd808> well, in my mind cookies == correlation == possible tracking
21:20:50 <Krenair> a gadget could store personal info in an api preference?
21:20:59 <bawolff> but i dont have any better suggestions to address this issue
21:21:15 <bawolff> krenair: yeah. Or in a public wiki page
21:21:23 <bd808> database and EL schema audits?
21:21:35 <bawolff> or probably other ways i havent thought of
21:22:47 <zzhou_> so one reason for tracking down cookies/local storage is that we have a table currently that lists all the cookies/local storage we use so ideally that information will be up-to-date:
21:24:39 <tgr> if we found out a gadget is storing cookies on, say, the Javanese Wikisource, what would we do about it?
21:25:22 <Krenair> presumably the same thing we'd do if we found one on the English Wikisource
21:25:27 <tgr> seems to me like we are planning to collect loads of non-actionable data
21:25:54 <dapatrick> zzhou_, for discussions' sake, could we include a statement on that page that says something to the effect of 'WMF uses these cookies, but there may be others created by Gadgets, extensions, etc. deployed by administrators of individual wikis/projects'?
21:25:59 <zzhou_> If this is being used for say all users to Javanese Wikisource automatically, without their consent, ideally, we would list that in the Cookies table
21:26:27 <zzhou_> dapatrick: that’s also a possibility
21:26:43 <tgr> how do you check whether it's set for all users and without consent?
21:27:04 <zzhou_> but I think perhaps after we figure out the scope of the issue
21:27:08 <dapatrick> tgr, I think the answer there is source code and setting analysis.
21:27:13 <bawolff> I dont think there is any instance of any cookie anywhere that asks for consent
21:27:39 <zzhou_> there could be implied consent when you use Cookies statements like this
21:27:47 <zzhou_> or at least warning to the user
21:27:59 <Krenair> extensions aren't deployed by administrators of individual wikis/projects
21:28:29 <zzhou_> a script that just loads automatically when a user visits a wiki page without the end user knowing about it would be more problematic
21:28:36 <dapatrick> Thanks for that clarification, Krenair. This is not the final wording of such an addition.
21:29:47 <Krenair> who is going to be doing all this log checking and source code analysis?
21:30:41 <tgr> IMO 1) looking through gadget code on thousands of wikis (possibly written in the local language, possibly broken for ages and/or no one still active knowing what it does) is not realistic
21:31:23 <tgr> 2) even if you want to do that, logging cookies does not seem very helpful data for that kind of review
21:32:00 <tgr> I guess one could do horrible hacks with replacing document.cookie and then logging stack traces
21:33:05 <bawolff> If you assume cookie names are relatively unique having the cookie name is a good start to finding the relevant code
21:33:23 <bawolff> but indeed, not an easy task
21:34:16 <robla> I'm sitting in the same room with zzhou_ now, and I'm going to try to restate the point he's trying to make
21:34:26 <tgr> the JS code might be loaded from another wiki or an external domain
21:34:38 <Scott_WUaS> Zzhou: I noticed that, in addition to having a law degree from Columbia, you "spent a semester studying Chinese law at Peking University in Beijing". Will the cookie questions you're asking have different potential effects for Wikipedia/Wikimedia working in China, do you think? Are these considerations to plan for in any way, both legally and in the various languages of China too?
21:34:44 <tgr> (well hopefully not external external but tool labs)
21:35:03 <tgr> it might come from a browser plugin etc
21:35:24 <robla> bawolff gave zzhou_ the output of mwgrep, which is basically just a list of cookie setting calls in the MediaWiki: namespace on all of our wikis (I think)
21:35:27 <bawolff> CSP will hopefully solve the external problem one day :p
21:35:48 <robla> bawolff's run gave back 1500 results
21:36:10 <robla> zzhou_ is basically saying "1500 is a lot, but *that's* manageable, right?"
21:36:47 <bawolff> Tgr: thatd have to be a pretty broken browser plugin but presumably thatd only be a small number of users so in the long tail
21:36:47 <zzhou_> yea I am saying even if we end with a list of 1500 unique cookies, we can take time to go through them
21:37:06 <zzhou_> I am not sure we will have this table anymore at that time since it is too large
21:37:08 <bd808> zzhou_: I guess that depends on who has to do that audit and what it keeps them from doing otherwise
21:37:41 <robla> where "a list of 1500 unique cookies" is "1500 places in the MediaWiki: namespace Javascript that seem to be setting cookies"
21:38:07 <tgr> how much time you'd estimate for dealing with one cookie?
21:38:22 <bawolff> Based on a naive regex that probably missed a lot
21:38:24 <Zppix|mobile> Would anyone be viewing the info received for the cookies ??
21:38:39 <zzhou_> yea, I wasn’t proposing someone go over 1500 cookies necessarily - I think potentially past a certain large number, we will just rethink our strategy of listing all the cookies
21:39:25 <bd808> the point isn't just to list them though is it? it's to audit why they exist
21:39:50 <tgr> probably most of the cookies are opt-in (even if the user is not specifically told they are opting into a cookie, but they would have to enable a gadget or something)
21:39:52 <bd808> and likely to stop using them if there isn't a very good reason?
21:40:00 <zzhou_> not necessarily, since we don’t even know the scale of the issue yet
21:40:27 <zzhou_> and to clarify by *1500 unique cookies I meant 1500 unique cookie names
21:41:07 <Zppix|mobile> If there will be people who aren't employees viewing the info received i would suggest having some sort of confidentiality document (not a lawyer/legal team member but thats just my 2 cents)
21:41:33 <dapatrick> Zppix|mobile, zzhou_ is on the legal team.
21:41:37 <bd808> so the logging is just going to end up with a set of N strings. Then someone will need to pore through source code on-wiki and on the server side to see if they can find those same strings
21:42:04 <bd808> Then they will need to determine who "owns" the code that sets the cookie
21:42:08 <Reedy> Zppix|mobile: Also, that's what the generic NDA's cover anyway
21:42:09 <bawolff> So what is the ultimate goal we have here?
21:42:18 <bd808> and then contact those persons to find out why they are doing so
21:42:25 <dapatrick> bd808 Right, then determine from source code, documentation, or conversation with the project owner the reason for the existence of the cookie.
21:42:43 <dapatrick> bd808, sorry, what you said when you finished your thought. :)
21:42:45 <zzhou_> bd808: correct, but potentially, a lot of the scripts are just copies of one another and they are really using the same cookie names so maybe we don’t have as many other cookies as the mwgrep suggests
21:42:57 <bawolff> Do we basically want to explain ourselves to our users(?)
21:43:02 <Reedy> did you not run it through uniq?
21:43:17 <bd808> even if someone gets really really good at that process thats going to take an hour a cookie
21:43:40 <Reedy> Nearly a year full time work
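The estimate in the two lines above can be checked with a little arithmetic. The figures of 1500 results and one hour per cookie come from the discussion; the 40-hour working week is an assumption added here.

```python
COOKIES = 1500        # result count from the mwgrep run discussed above
HOURS_PER_COOKIE = 1  # bd808's estimate for a well-practised reviewer
HOURS_PER_WEEK = 40   # assumption: a standard full-time week

total_hours = COOKIES * HOURS_PER_COOKIE
weeks = total_hours / HOURS_PER_WEEK

print(total_hours, "hours, or", weeks, "full-time weeks")
```

Under those assumptions the audit comes to 37.5 full-time weeks, about nine months before holidays and other work are counted, which is consistent with "nearly a year".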
21:43:57 <Zppix|mobile> Could a bot handle the tedious source editing or no?
21:44:28 <bawolff> Reedy: at the time the output didnt seem amenable to processing like that
21:44:34 <bawolff> at least not easily
21:44:41 <bawolff> Zppix: no
21:45:15 <Reedy> where's the list?
21:45:20 <Zppix|mobile> ^
21:45:42 <bawolff> Currently only on a private email thread
21:45:58 * robla would love to make mwgrep public, and short of that, make it so that we run mwgrep scans and publish the static logs
21:46:00 <bawolff> i can pastebin it once i find it again
21:46:10 <Krenair> there's a ticket for that robla
21:46:37 <Reedy> Also needs my no private patch merging ;)
21:46:38 <robla> Krenair: I heard about that from Krinkle ....please do tell!
21:46:56 <tgr> so I guess working from the mwgrep list is not realistic, that leaves logging what cookies are set, doing some sort of honeypot approach, working with the community and leaving it to them to identify cookies, or just ignoring the issue
21:47:00 <Zppix|mobile> Is the main time consumer translation for the cookie? I cant think of any other reason
21:47:22 <Krenair> robla,
21:48:18 <bawolff> So umm. What about if we just put the cookie table on meta, and tell people to add items when they introduce new cookies
21:48:32 <Reedy> !bug 1 | bawolff
21:48:32 <wm-bot> bawolff:
21:48:39 <bawolff> and then use cookie logging to gauge how complete the table is
21:50:01 * robla is sad that the bug 1 link above doesn't go to
21:50:23 <zzhou_> bawolff: you mean a separate table to help us to chase down the cookies (not the cookies table for the end user we have right now)?
21:50:29 <Reedy> !botbrain
21:51:17 <Reedy> !bug del
21:51:18 <wm-bot> Sorry, you are not authorized to perform this
21:51:29 <Krenair> !bug del
21:51:29 <wm-bot> Sorry, you are not authorized to perform this
21:51:30 <Reedy> just needs !bug is$1
21:51:32 <Zppix|mobile> Lol
21:52:00 <tgr> say we get a table with 100 cookies and we log 1000 unique cookie names (let's optimistically assume there are no dynamically named cookies)
21:52:01 <bawolff> Zzhou_: a crowd sourced table
21:52:09 <Zppix|mobile> Maybe !bug should change from bugzilla.wikimedia to phabricator.wikimedia
21:52:10 <tgr> again, what would we do with the data?
21:52:25 <Reedy> Zppix|mobile: bugzilla had bugs, phab has tasks
21:52:26 <tgr> would someone have to go through the 900 missing names and check?
21:52:32 <Reedy> if the url is right, it will redirect correctly
21:52:41 <dapatrick> tgr, Yes.
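tgr's scenario (a table of 100 cookies against 1000 logged names) reduces to a set difference. The sketch below assumes every documented cookie also shows up in the logs and that names are static rather than dynamically generated; the function name and the cookie names are made up for illustration.

```python
def undocumented_cookies(logged_names, documented_names):
    """Return logged cookie names that are missing from the documented table."""
    return sorted(set(logged_names) - set(documented_names))


# Hypothetical data matching the numbers in tgr's example.
documented = {f"cookie{i}" for i in range(100)}
logged = {f"cookie{i}" for i in range(1000)}

missing = undocumented_cookies(logged, documented)
print(len(missing), "names would still need a manual check")
```

Computing the gap is cheap; the cost tgr is asking about is the manual review of each remaining name.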
21:52:47 <bawolff> Zppix: but then we cant make snide comments about bug #1
21:52:50 * robla has a meeting to go to in 7 minutes, so will end this abruptly
21:52:55 <bawolff> :p
21:53:13 <robla> also, we can keep the conversation generally going on Phab and on #wikimedia-tech
21:53:29 <tgr> what are the chances of ending up with an amount of data that does not take man-months to sort through?
21:53:51 <zzhou_> tgr: if we have that many cookies, we might need some sort of disclaimer like Dapatrick suggested earlier as it would not be feasible to go over all that many and furthermore, it is not disclosure to the user if we just present them with a list of 1000 cookies
21:54:21 <tgr> zzhou_: so can we just start with that disclaimer and skip the intermediate steps? :)
21:55:07 <zzhou_> that’s an option - it's def. less ideal than having a cookies table that’s up to date
21:55:27 <zzhou_> (assuming the size of the table is still limited)
21:56:52 <robla> 180 seconds until abrupt end of meeting....
21:57:01 <zzhou_> does everyone think it is likely we have many hundreds to thousands of unique cookie (names) lying around?
21:57:13 <bawolff> Perfectly timed with my battery dying ;)
21:57:27 <Reedy> the list sorted and uniq'd will remove dupes
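A rough equivalent of the `sort | uniq -c` step Reedy mentions is sketched below. It assumes the cookie names have already been extracted one per line, which bawolff noted earlier was not easy with the raw output; the function name and the sample names are hypothetical.

```python
from collections import Counter


def sort_uniq(lines):
    """Collapse duplicate lines and return sorted (line, count) pairs."""
    cleaned = (line.strip() for line in lines)
    # Blank lines are skipped; identical names are counted together.
    return sorted(Counter(line for line in cleaned if line).items())


# Made-up extracted cookie names, one per hit.
hits = ["hideBanner", "gadgetState", "gadgetState ", "", "hideBanner", "gadgetState"]

for name, count in sort_uniq(hits):
    print(count, name)
```

Gadgets copied between wikis would collapse into a single entry only where the copies kept the same cookie name.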
21:58:16 <robla> #info next week's tentative topic: T138783 SVG stuff
21:58:16 <stashbot> T138783: SVG Upload should (optionally) allow the xhtml namespace -
21:58:18 <bd808> I think we need the logging to find out honestly. Probably not too hard to add into the wikimedia messages extension or something similar
21:58:19 <zzhou_> Reedy: yea, perhaps that’s the first step
21:58:19 <bawolff> I think the distribution will have a long tail
21:58:21 <dapatrick> zzhou_ I believe there may be a possibly untenable number. I do not believe it will be many hundreds of thousands.
21:58:36 <zzhou_> sorry I meant hundred to thousands ;)
21:58:58 <dapatrick> Ah. I also read you wrong.
21:59:07 <bd808> across all projects and languages? I wouldn't doubt high hundreds
21:59:10 <robla> 45 seconds to end of meeting
21:59:18 <dapatrick> Hundreds to thousands is about what I expect.
21:59:19 <tgr> yeah, the long tail will be long
21:59:27 <zzhou_> ok
21:59:34 <robla> thanks everyone! those that want to keep talking can use #wikimedia-tech
22:00:04 <zzhou_> alright I will pop to #wikimedia-tech in case people have time, I want to follow-up a little
22:00:07 <robla> #endmeeting