ArchCom RFC Meeting W42: Surveying Cookie Use (2016-10-19 #wikimedia-office)

Hosted by daniel on Oct 19 2016, 9:00 PM - 10:00 PM.



Meeting summary

  • IDEA: test this out on meta or mediawiki before completely rolling the change out (Zppix|mobile, 21:15:56)
  • <bawolff> We tried to find instances of setting cookies in js pages on wiki; mwgrep returned like 1500 results (robla, 21:18:39)
  • 14:18:36 <bd808> dapatrick: talk to tgr and get sentry setup in prod :) (robla, 21:19:31)
  • next week's tentative topic: T138783 SVG stuff (robla, 21:58:16)

Meeting ended at 22:00:07 UTC.
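
The mwgrep finding above (about 1500 hits from what bawolff calls a naive regex in the log, many of them copies of the same gadget) suggests collapsing the hits to unique cookie names before any manual review, as Reedy proposes in the log. The following is a minimal sketch of that step in Python. The two patterns, the sample lines, and the cookie names are hypothetical illustrations, not the regex or output from the meeting, and dynamically named cookies are missed by design.

```python
import re
from collections import Counter

# Naive patterns for two common ways of setting a cookie in on-wiki JavaScript.
# Both are assumptions for illustration, not the regex used for the mwgrep run.
COOKIE_PATTERNS = [
    # $.cookie('name', value): the trailing comma skips plain reads like $.cookie('name')
    re.compile(r"""\$\.cookie\(\s*['"]([^'"]+)['"]\s*,"""),
    # document.cookie = "name=value; ..."
    re.compile(r"""document\.cookie\s*=\s*['"]([^'"=;]+)="""),
]


def cookie_names(js_lines):
    """Count how often each literal cookie name is set in the given lines."""
    names = Counter()
    for line in js_lines:
        for pattern in COOKIE_PATTERNS:
            for match in pattern.finditer(line):
                names[match.group(1)] += 1
    return names


# Hypothetical search hits, standing in for lines of gadget source.
sample_hits = [
    "$.cookie('exampleHideBanner', '1', { expires: 30 });",
    'document.cookie = "exampleHideBanner=1; path=/";',
    '$.cookie( "exampleFontSize", size );',
    "document.cookie = name + '=' + value;",  # dynamic name, not matched
]

unique_names = cookie_names(sample_hits)
for name, hits in sorted(unique_names.items()):
    print(name, hits)
```

Counting per name rather than only deduplicating keeps a rough signal of how widely a gadget has been copied, which may help decide what to review first.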


21:03:59 <robla> #startmeeting ArchCom RFC meeting: T145472: Survey Cookies/Local Storage usage on Wikimedia sites
21:03:59 <wm-labs-meetbot> Meeting started Wed Oct 19 21:03:59 2016 UTC and is due to finish in 60 minutes. The chair is robla. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:03:59 <wm-labs-meetbot> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:03:59 <wm-labs-meetbot> The meeting name has been set to 'archcom_rfc_meeting__t145472__survey_cookies_local_storage_usage_on_wikimedia_sites'
21:03:59 <stashbot> T145472: Survey Cookies/Local Storage usage on Wikimedia sites - https://phabricator.wikimedia.org/T145472
21:04:07 <robla> o/
21:04:37 <bd808> robla: I guess the bot being blocked from topic changes in this channel saves a step :/
21:05:54 <robla> hi zzhou_ !
21:05:58 <zzhou_> hi everyone
21:06:43 <robla> gwicke, Krinkle and I just briefly discussed this in the ArchCom Planning meeting last hour
21:07:31 <robla> ...and I've been talking to bawolff and dapatrick about this for a little while
21:07:43 * bawolff waves
21:08:05 * dapatrick waves also
21:08:50 <robla> does anyone have any questions about this RFC before I prompt zzhou_ to ask questions that he has?
21:10:19 <bd808> I love the idea of automating the audit process.
21:10:34 <bd808> But I wonder who is going to watch the logs?
21:10:37 <Scott_WUaS> (Hi All)
21:10:56 <bd808> and if we are going to spam flood ourselves each time a new cookie is introduced
21:11:31 <zzhou_> so I am not sure who will watch the logs right now - probably someone on the legal team?
21:11:37 <Zppix|mobile> ??
21:11:40 <zzhou_> and ideally over time people will be more carefula bout introducing these so
21:11:49 <zzhou_> that there will be less alerts over time
21:12:00 <zzhou_> as we better communicate to them about this issue
21:12:14 <zzhou_> of course the first step is to understand the issue (hence this RFC) and figure out the scale of it
21:12:43 <zzhou_> if there really that many cookies that we need to track down, we might have to think of more scalable ways of managing this for now
21:14:05 <robla> bawolff: can you describe how you used mwgrep to help out zzhou_ ?
21:14:42 <bd808> I'm not so much concerned about distinct cookies. It's more about the sheer number of requests that are likely to come between the cookie being introduced and a change to the extension to make it as expected.
21:15:05 <dapatrick> I think that it's necessary to have some process/application which consumes the logged cookie information, stores unique cookies, associated wiki name, and number of times observed. But I'm getting a little ahead of where we are in the meeting.
21:15:18 <bd808> we may have to sample the logs
21:15:25 <bawolff> We tried to find instances of setting cookies in js pages on wiki
21:15:50 <bawolff> mwgrep returned like 1500 results
21:15:56 <Zppix|mobile> #idea test this out on meta or mediawiki before completely rolling the change out
21:16:00 <bd808> o_O
21:16:20 <bawolff> so that sort of simple static analysis was kind of unfeasible
21:16:25 <bd808> bawolff: mostly gadgets storing state?
21:16:43 <bawolff> Seemed like it
21:17:16 <bawolff> often the same gadget or similar gadgets copied across multiple wikis
21:18:37 <bd808> dapatrick: talk to tgr and get sentry setup in prod :)
21:18:39 <robla> #info <bawolff> We tried to find instances of setting cookies in js pages on wiki; mwgrep returned like 1500 results
21:18:58 <dapatrick> bd808 Will do.
21:19:31 <robla> #info 14:18:36 <bd808> dapatrick: talk to tgr and get sentry setup in prod :)
21:19:32 <bawolff> One concern i have is it seems kind of like we are approaching this backwards - we want to know when personal information is stored so we are looking at cookies
21:19:49 <bawolff> but.. really cookies are just a means
21:20:18 <bawolff> and its no different if personal info is stored some other way
21:20:36 <bawolff> So it feels like we are looking at a symptom
21:20:43 <bd808> well, in my mind cookies == correlation == possible tracking
21:20:50 <Krenair> a gadget could store personal info in an api preference?
21:20:59 <bawolff> but i dont have any better suggestions to address this issue
21:21:15 <bawolff> krenair: yeah. Or in a public wiki page
21:21:23 <bd808> database and EL schema audits?
21:21:35 <bawolff> or probably other ways i havent thought of
21:22:47 <zzhou_> so one reason on for tracking down cookies/local storage is that we have a table currently that lists all the cookies/local storage we use so ideally that information will be up-to-date: https://wikimediafoundation.org/wiki/Cookie_statement#3._What_types_of_cookies_does_Wikimedia_use.3F
21:24:39 <tgr> if we found out a gadget is storing cookies on, say, the Javanese Wikisource, what would we do about it?
21:25:22 <Krenair> presumably the same thing we'd do if we found one on the English Wikisource
21:25:27 <tgr> seems to me like we are planning to collect loads of non-actionable data
21:25:54 <dapatrick> zzhou_, for discussions' sake, could we include a statement on that page that says something to the effect of 'WMF uses these cookies, but there may be others created by Gadgets, extensions, etc. deployed by administrators of individual wikis/projects'?
21:25:59 <zzhou_> If this is being used for say all users to Javanese Wikisource automatially, without their consent, ideally, we would list that in the Cookies table
21:26:27 <zzhou_> dapatrick: that’s also a possibility
21:26:43 <tgr> how do you check whether it's set for all users and without consent?
21:27:04 <zzhou_> but I think perhaps after we figure out the scope of the issue
21:27:08 <dapatrick> tgr, I think the answer there is source code and setting analysis.
21:27:13 <bawolff> I dont think there is any instance of any cookie anywhere that asks for conscent
21:27:39 <zzhou_> there’s could be implied consent when you use Cookies statements like this
21:27:47 <zzhou_> or at least warning to the user
21:27:59 <Krenair> extensions aren't deployed by administrators of individual wikis/projects
21:28:29 <zzhou_> a script that just loads automatically when a page visits a Wiki page without the end user knowing about it would be more problematic
21:28:36 <dapatrick> Thanks for that clarification, Krenair. This is not the final wording of such an addition.
21:29:47 <Krenair> who is going to be doing all this log checking and source code analysis?
21:30:41 <tgr> IMO 1) looking through gadget code on thousands of wikis (possibly written in the local language, possibly broken for ages and/or no one still active knowing what it does) is not realistic
21:31:23 <tgr> 2) even if you want to do that, logging cookies does not seem very helpful data for that kind of review
21:32:00 <tgr> I guess one could do horrible hacks with replacing document.cookie and then logging stack traces
21:33:05 <bawolff> If you assume cookie names are relatively unique having the cookie name is a good start to finding the relavent code
21:33:23 <bawolff> but indeed, not an easy task
21:34:16 <robla> I'm sitting in the same room with zzhou_ now, and I'm going to try to restate the point he's trying to make
21:34:26 <tgr> the JS code might be loaded from another wiki or an external domain
21:34:38 <Scott_WUaS> Zzhou: I noticed you in addition to having a law degree from Columbia that you "spent a semester studying Chinese law at Peking University in Beijing" Will the cookie questions you're asking have differential potential effects for Wikipedia /Wikimedia working in China do you think? Are these considerations to plan for in any way - both legally and in various languages in China too?
21:34:44 <tgr> (well hopefully not external external but tool labs)
21:35:03 <tgr> it might come from a browser plugin etc
21:35:24 <robla> bawolff gave zzhou_ the output of mwgrep, which is basically just a list of cookie setting calls in the MediaWiki: namespace on all of our wikis (I think)
21:35:27 <bawolff> CSP will hopefully solve the external problem one day :p
21:35:48 <robla> bawolff's run gave back 1500 results
21:36:10 <robla> zzhou_ is basically saying "1500 is a lot, but *that's* manageable, right?"
21:36:47 <bawolff> Tgr: thatd have to be a pretty broken browser plugin but presumably thatd only be a small number of users so in the long tail
21:36:47 <zzhou_> yea I am saying even if we end with a list of 1500 unique cookies, we can take time to go through them
21:37:06 <zzhou_> I am not sure we will have this table anymore at that time since it is too large
21:37:08 <bd808> zzhou_: I guess that depends on who has to do that audit and what it keeps them from doing otherwise
21:37:41 <robla> where "a list of 1500 unique cookies" is "1500 places in the MediaWiki: namespace Javascript that seem to be setting cookies"
21:38:07 <tgr> how much time you'd estimate for dealing with one cookie?
21:38:22 <bawolff> Based on a naive regex that probably missed a lot
21:38:24 <Zppix|mobile> Would anyone be viewing the info received for the cookies ??
21:38:39 <zzhou_> yea, I wasn’t proposing someone go over 1500 cookies necessarily - I think potentially past a certain large number, we will just rethink our strategy of listing all the cookies
21:39:25 <bd808> the pint isn't just to list them though is it? its to audit why they exist
21:39:50 <tgr> probably most of the cookies are opt-in (even if the user is not specifically told they are opting into a cookie, but they would have to enable a gadget or something)
21:39:52 <bd808> and likely to stop using them if there isn't a very good reason?
21:40:00 <zzhou_> not necessarily, since we don’t even know the scale of the issue yet
21:40:27 <zzhou_> and to clarify by *1500 unqiue cookies I meant 1500 unqiue cookie names
21:41:07 <Zppix|mobile> If there will be people whom arent employees viewing the info received i would suggest having some sort of confidentially document (not a lawyer/legal team member but thats just my 2 cents)
21:41:33 <dapatrick> Zppix|mobile, zzhou_ is on the legal team.
21:41:37 <bd808> so the logging is just going to end up with a set of N strings. Then someone will need to pour through source code on-wiki and on the server side to see if they can find those same strings
21:42:04 <bd808> Then they will need to determine who "owns" the code that sets the cookie
21:42:08 <Reedy> Zppix|mobile: Also, that's what the generic NDA's cover anyway
21:42:09 <bawolff> So what is the ultimate goal we have here?
21:42:18 <bd808> and then contact those persons to find out why they are doing so
21:42:25 <dapatrick> bd808 Right, then determine from source code, documentation, or conversation with the project owner the reason for the existence of the cookie.
21:42:43 <dapatrick> bd808, sorry, what you said when you finished your thought. :)
21:42:45 <zzhou_> bd808: correct, but potentially, a lot of the scripts are just copies one of another and they are really using the same cookie names so maybe we don’t have as many other cookies as the mgrep suggests
21:42:57 <bawolff> Do we basically want to explain ourselves to our users(?)
21:43:02 <Reedy> did you not run it through uniq?
21:43:17 <bd808> even if someone gets really really good at that process thats going to take an hour a cookie
21:43:40 <Reedy> Nearly a year full time work
21:43:57 <Zppix|mobile> Could a bot handle the tideous source editing or no?
21:44:28 <bawolff> Reedy: at the time the output didnt seem amenable to processing like that
21:44:34 <bawolff> at least not easily
21:44:41 <bawolff> Zppix: no
21:45:15 <Reedy> where's the list?
21:45:20 <Zppix|mobile> ^
21:45:42 <bawolff> Currently only on a private email thread
21:45:58 * robla would love to make mwgrep public, and short of that, make it so that we run mwgrep scans and publish the static logs
21:46:00 <bawolff> i can pastebin it once i find it again
21:46:10 <Krenair> there's a ticket for that robla
21:46:37 <Reedy> Also needs my no private patch merging ;)
21:46:38 <robla> Krenair: I heard about that from Krinkle ....please do tell!
21:46:56 <tgr> so I guess working from the mwgrep list is not realistic, that leaves logging what cookies are set, doing some sort of honeypot approach, working with the community and leaving it to them to identify cookies, or just ignoring the issue
21:47:00 <Zppix|mobile> Is the main time consumer translation for the cookie? I cant think of any other reason
21:47:22 <Krenair> robla, https://phabricator.wikimedia.org/T71489
21:48:18 <bawolff> So umm. What about if we just put the cookie table on meta, and tell people to add items when they introduce new cookies
21:48:32 <Reedy> !bug 1 | bawolff
21:48:32 <wm-bot> bawolff: https://bugzilla.wikimedia.org/show_bug?id=1
21:48:39 <bawolff> and then use cookie logging to guage how complete the table is
21:50:01 * robla is sad that the bug 1 link above doesn't go to https://phabricator.wikimedia.org/T2001
21:50:23 <zzhou_> bawolff: you mean a separate table to help us to chase down the cookies (not the cookies table for the end user we have right now)?
21:50:29 <Reedy> !botbrain
21:51:17 <Reedy> !bug del
21:51:18 <wm-bot> Sorry, you are not authorized to perform this
21:51:29 <Krenair> !bug del
21:51:29 <wm-bot> Sorry, you are not authorized to perform this
21:51:30 <Reedy> just beeds !bug is https://bugzilla.wikimedia.org/$1
21:51:32 <Zppix|mobile> Lol
21:52:00 <tgr> say we get a table with 100 cookies and we log 1000 unique cookie names (let's optimistically assume there are no dynamically named cookies)
21:52:01 <bawolff> Zzhou_: a crowd sourced table
21:52:09 <Zppix|mobile> Maybe !bug should change from bugzilla.wikimedia to phabricator.wikimedia
21:52:10 <tgr> again, what would we do with the data?
21:52:25 <Reedy> Zppix|mobile: bugzilla had bugs, phab has tasks
21:52:26 <tgr> would someone have to go through the 900 missing names and check?
21:52:32 <Reedy> if the url is right, it will redirect correctly
21:52:41 <dapatrick> tgr, Yes.
21:52:47 <bawolff> Zppix: but then we cant make snide comments about bug #1
21:52:50 * robla has a meeting to go to in 7 minutes, so will end this abruptly
21:52:55 <bawolff> :p
21:53:13 <robla> also, we can keep the conversation generally going on Phab and on #wikimedia-tech
21:53:29 <tgr> what are the chances of ending up with an amount of data that does not take man-months to sort through?
21:53:51 <zzhou_> tgr: if we have that many cookies, we might need some sort of disclaimer like Dapatrick suggested earlier as it would not be feasible to go over all that many and furthermore, it is not disclosure to the user if we just present them with a list of 1000 cookies
21:54:21 <tgr> zzhou_: so can we just start with that disclaimer and skip the intermediate steps? :)
21:55:07 <zzhou_> that’s an option - it def. less ideal than having a cookies table that’s up to date
21:55:27 <zzhou_> (assuming the size of the table is still limited)
21:56:52 <robla> 180 seconds until abrupt end of meeting....
21:57:01 <zzhou_> does everyone think it is likely we have many hundreds to thousands of unqiue cookie (names) lying around?
21:57:13 <bawolff> Perfectly timed with my battery dying ;)
21:57:27 <Reedy> the list sorted and uniq'd will remove dupes
21:58:16 <robla> #info next week's tentative topic: T138783 SVG stuff
21:58:16 <stashbot> T138783: SVG Upload should (optionally) allow the xhtml namespace - https://phabricator.wikimedia.org/T138783
21:58:18 <bd808> I think we need the logging to find out honestly. Probably not too hard to add into the wikimedia messages extension or something similar
21:58:19 <zzhou_> Reedy: yea, perhaps that’s the first step
21:58:19 <bawolff> I think the distribution will have a long tail
21:58:21 <dapatrick> zzhou_ I believe there may be a possibly untenable number. I do not be believe it will be many hundres of thousands.
21:58:36 <zzhou_> sorry I meant hundred to thousands ;)
21:58:58 <dapatrick> Ah. I also read you wrong.
21:59:07 <bd808> across all projects and languages? I wouldn't doubt high hundreds
21:59:10 <robla> 45 seconds to end of meeting
21:59:18 <dapatrick> Hundreds to thousands is about what I expect.
21:59:19 <tgr> yeah, the long tail will be long
21:59:27 <zzhou_> ok
21:59:34 <robla> thanks everyone! those that want to keep talking can use #wikimedia-tech
22:00:04 <zzhou_> alright I will pop to #wikimedia-tech in case people have time, I want to follow-up a little
22:00:07 <robla> #endmeeting
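
Three ideas from the log fit together as one small pipeline: dapatrick's process that consumes logged cookie information and stores unique cookies, the associated wiki name, and the number of times observed; bd808's note that the logs may have to be sampled; and bawolff's suggestion to use cookie logging to gauge how complete the cookie table is. The following is a minimal sketch in Python, not something agreed in the meeting. The function names, the (wiki, cookie name) record shape, and the example cookie names are all assumptions for illustration.

```python
import random
from collections import Counter


def aggregate(observations, sample_rate=1.0, rng=None):
    """Count (wiki, cookie_name) pairs.

    Each observation is kept with probability sample_rate, so a rate
    below 1.0 trades exact counts for a smaller volume of log data.
    """
    rng = rng or random.Random()
    counts = Counter()
    for wiki, cookie_name in observations:
        if rng.random() < sample_rate:
            counts[(wiki, cookie_name)] += 1
    return counts


def undocumented(counts, documented_names):
    """Return logged cookie names that are missing from the cookie table."""
    logged = {name for (_wiki, name) in counts}
    return sorted(logged - set(documented_names))


# Hypothetical log records: (wiki, cookie name).
observations = [
    ("enwiki", "exampleHideBanner"),
    ("enwiki", "exampleHideBanner"),
    ("jvwikisource", "exampleFontSize"),
]

# Hypothetical contents of the documented cookie table.
documented = {"exampleHideBanner"}

counts = aggregate(observations)
for (wiki, name), seen in sorted(counts.items()):
    print(wiki, name, seen)
print("not in table:", undocumented(counts, documented))
```

With sampling enabled, rarely set cookies in the long tail may not be observed at all, so a low sample rate would understate how incomplete the table is.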

People present (lines said)

  • bawolff (33)
  • zzhou_ (29)
  • robla (24)
  • bd808 (20)
  • tgr (18)
  • dapatrick (13)
  • Reedy (12)
  • Zppix|mobile (9)
  • Krenair (7)
  • wm-labs-meetbot (3)
  • wm-bot (3)
  • stashbot (2)
  • Scott_WUaS (2)

Other meetings

Architecture meetings
13:00 PT ArchCom Planning Meetings (upcoming; all since 2016-03-30)
14:00 PT ArchCom-RFC Meetings (upcoming; all since 2015-09-09)

Recurring Event

Event Series
This event is an instance of E66: ArchCom RFC Meeting Wxx: <topic TBD> (<see "Starts" field>, #wikimedia-office), and repeats every week.

Event Timeline

RobLa-WMF renamed this event from ArchCom RFC Meeting Wxx: <topic TBD> (<see "Starts" field>, #wikimedia-office) to ArchCom RFC Meeting W42: Surveying Cookie Use (2016-10-19 #wikimedia-office). (Oct 18 2016, 4:27 AM)
RobLa-WMF updated the event description.
RobLa-WMF changed the end date for this event from Oct 19 2016, 9:00 PM to Oct 19 2016, 10:00 PM.
daniel renamed this event from ArchCom RFC Meeting W42: Surveying Cookie Use (2016-10-19 #wikimedia-office) to ArchCom RFC Meeting Wxx: <topic TBD> (<see "Starts" field>, #wikimedia-office). (Nov 21 2016, 6:11 PM)
daniel changed the host of this event from RobLa-WMF to daniel.
daniel updated the event description.
daniel renamed this event from ArchCom RFC Meeting Wxx: <topic TBD> (<see "Starts" field>, #wikimedia-office) to ArchCom RFC Meeting W42: Surveying Cookie Use (2016-10-19 #wikimedia-office).