Agenda
- Location: #wikimedia-office IRC channel
- Meeting type: ArchCom RFC meeting
- Time: 2016-10-19, Wednesday 21:00 UTC (2pm PDT, 23:00 CEST)
- Topic: T145472: Survey Cookies/Local Storage usage on Wikimedia sites
Meeting summary
- IDEA: test this out on meta or mediawiki before completely rolling the change out (Zppix|mobile, 21:15:56)
- <bawolff> We tried to find instances of setting cookies in js pages on wiki; mwgrep returned like 1500 results (robla, 21:18:39)
- 14:18:36 <bd808> dapatrick: talk to tgr and get sentry setup in prod :) (robla, 21:19:31)
- next week's tentative topic: T138783 SVG stuff (robla, 21:58:16)
Meeting ended at 22:00:07 UTC.
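The mwgrep figure in the summary counts matching places in on-wiki JavaScript, not distinct cookies. As a rough illustration of the deduplication step raised in the log (sorting and uniq'ing the list), here is a minimal Python sketch. The input format, the function name `cookie_names`, and the three regex patterns are assumptions for illustration only; the actual mwgrep output was not published, and a naive regex like this will miss dynamically named cookies.

```python
import re
from collections import Counter

# Naive patterns (assumed, not the ones used in the meeting) for three common
# ways JavaScript sets a cookie. Only string-literal names are matched.
COOKIE_PATTERNS = [
    # jQuery cookie plugin setter: $.cookie('name', value, ...)
    re.compile(r"""\$\.cookie\(\s*['"]([^'"]+)['"]\s*,"""),
    # MediaWiki helper: mw.cookie.set('name', value)
    re.compile(r"""mw\.cookie\.set\(\s*['"]([^'"]+)['"]"""),
    # Raw assignment: document.cookie = 'name=value; ...'
    re.compile(r"""document\.cookie\s*=\s*['"]([^'"=;]+)="""),
]


def cookie_names(snippets):
    """Count how many snippets set each literal cookie name."""
    counts = Counter()
    for snippet in snippets:
        for pattern in COOKIE_PATTERNS:
            counts.update(pattern.findall(snippet))
    return counts


if __name__ == "__main__":
    # Hypothetical search hits; real gadget code will be messier.
    hits = [
        "$.cookie('hideBanner', '1', { expires: 30 });",
        'mw.cookie.set("hideBanner", "1");',
        "document.cookie = 'dismissNotice=1; path=/';",
        "var v = $.cookie('hideBanner');",  # a read, not a write: ignored
    ]
    for name, count in cookie_names(hits).most_common():
        print(name, count)
```

If many gadgets are copies of one another, the number of distinct names returned this way could be much smaller than the number of raw hits.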
Log
1 | 21:03:59 <robla> #startmeeting ArchCom RFC meeting: T145472: Survey Cookies/Local Storage usage on Wikimedia sites |
---|---|
2 | 21:03:59 <wm-labs-meetbot> Meeting started Wed Oct 19 21:03:59 2016 UTC and is due to finish in 60 minutes. The chair is robla. Information about MeetBot at http://wiki.debian.org/MeetBot. |
3 | 21:03:59 <wm-labs-meetbot> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. |
4 | 21:03:59 <wm-labs-meetbot> The meeting name has been set to 'archcom_rfc_meeting__t145472__survey_cookies_local_storage_usage_on_wikimedia_sites' |
5 | 21:03:59 <stashbot> T145472: Survey Cookies/Local Storage usage on Wikimedia sites - https://phabricator.wikimedia.org/T145472 |
6 | 21:04:07 <robla> o/ |
7 | 21:04:37 <bd808> robla: I guess the bot being blocked from topic changes in this channel saves a step :/ |
8 | 21:05:54 <robla> hi zzhou_ ! |
9 | 21:05:58 <zzhou_> hi everyone |
10 | 21:06:43 <robla> gwicke, Krinkle and I just briefly discussed this in the ArchCom Planning meeting last hour |
11 | 21:07:31 <robla> ...and I've been talking to bawolff and dapatrick about this for a little while |
12 | 21:07:43 * bawolff waves |
13 | 21:08:05 * dapatrick waves also |
14 | 21:08:50 <robla> does anyone have any questions about this RFC before I prompt zzhou_ to ask questions that he has? |
15 | 21:10:19 <bd808> I love the idea of automating the audit process. |
16 | 21:10:34 <bd808> But I wonder who is going to watch the logs? |
17 | 21:10:37 <Scott_WUaS> (Hi All) |
18 | 21:10:56 <bd808> and if we are going to spam flood ourselves each time a new cookie is introduced |
19 | 21:11:31 <zzhou_> so I am not sure who will watch the logs right now - probably someone on the legal team? |
20 | 21:11:37 <Zppix|mobile> ?? |
21 | 21:11:40 <zzhou_> and ideally over time people will be more careful about introducing these so |
22 | 21:11:49 <zzhou_> that there will be less alerts over time |
23 | 21:12:00 <zzhou_> as we better communicate to them about this issue |
24 | 21:12:14 <zzhou_> of course the first step is to understand the issue (hence this RFC) and figure out the scale of it |
25 | 21:12:43 <zzhou_> if there are really that many cookies that we need to track down, we might have to think of more scalable ways of managing this for now |
26 | 21:14:05 <robla> bawolff: can you describe how you used mwgrep to help out zzhou_ ? |
27 | 21:14:42 <bd808> I'm not so much concerned about distinct cookies. It's more about the sheer number of requests that are likely to come between the cookie being introduced and a change to the extension to make it as expected. |
28 | 21:15:05 <dapatrick> I think that it's necessary to have some process/application which consumes the logged cookie information, stores unique cookies, associated wiki name, and number of times observed. But I'm getting a little ahead of where we are in the meeting. |
29 | 21:15:18 <bd808> we may have to sample the logs |
30 | 21:15:25 <bawolff> We tried to find instances of setting cookies in js pages on wiki |
31 | 21:15:50 <bawolff> mwgrep returned like 1500 results |
32 | 21:15:56 <Zppix|mobile> #idea test this out on meta or mediawiki before completely rolling the change out |
33 | 21:16:00 <bd808> o_O |
34 | 21:16:20 <bawolff> so that sort of simple static analysis was kind of unfeasible |
35 | 21:16:25 <bd808> bawolff: mostly gadgets storing state? |
36 | 21:16:43 <bawolff> Seemed like it |
37 | 21:17:16 <bawolff> often the same gadget or similar gadgets copied across multiple wikis |
38 | 21:18:37 <bd808> dapatrick: talk to tgr and get sentry setup in prod :) |
39 | 21:18:39 <robla> #info <bawolff> We tried to find instances of setting cookies in js pages on wiki; mwgrep returned like 1500 results |
40 | 21:18:58 <dapatrick> bd808 Will do. |
41 | 21:19:31 <robla> #info 14:18:36 <bd808> dapatrick: talk to tgr and get sentry setup in prod :) |
42 | 21:19:32 <bawolff> One concern i have is it seems kind of like we are approaching this backwards - we want to know when personal information is stored so we are looking at cookies |
43 | 21:19:49 <bawolff> but.. really cookies are just a means |
44 | 21:20:18 <bawolff> and its no different if personal info is stored some other way |
45 | 21:20:36 <bawolff> So it feels like we are looking at a symptom |
46 | 21:20:43 <bd808> well, in my mind cookies == correlation == possible tracking |
47 | 21:20:50 <Krenair> a gadget could store personal info in an api preference? |
48 | 21:20:59 <bawolff> but i dont have any better suggestions to address this issue |
49 | 21:21:15 <bawolff> krenair: yeah. Or in a public wiki page |
50 | 21:21:23 <bd808> database and EL schema audits? |
51 | 21:21:35 <bawolff> or probably other ways i havent thought of |
52 | 21:22:47 <zzhou_> so one reason for tracking down cookies/local storage is that we have a table currently that lists all the cookies/local storage we use so ideally that information will be up-to-date: https://wikimediafoundation.org/wiki/Cookie_statement#3._What_types_of_cookies_does_Wikimedia_use.3F |
53 | 21:24:39 <tgr> if we found out a gadget is storing cookies on, say, the Javanese Wikisource, what would we do about it? |
54 | 21:25:22 <Krenair> presumably the same thing we'd do if we found one on the English Wikisource |
55 | 21:25:27 <tgr> seems to me like we are planning to collect loads of non-actionable data |
56 | 21:25:54 <dapatrick> zzhou_, for discussions' sake, could we include a statement on that page that says something to the effect of 'WMF uses these cookies, but there may be others created by Gadgets, extensions, etc. deployed by administrators of individual wikis/projects'? |
57 | 21:25:59 <zzhou_> If this is being used for say all users to Javanese Wikisource automatically, without their consent, ideally, we would list that in the Cookies table |
58 | 21:26:27 <zzhou_> dapatrick: that’s also a possibility |
59 | 21:26:43 <tgr> how do you check whether it's set for all users and without consent? |
60 | 21:27:04 <zzhou_> but I think perhaps after we figure out the scope of the issue |
61 | 21:27:08 <dapatrick> tgr, I think the answer there is source code and setting analysis. |
62 | 21:27:13 <bawolff> I dont think there is any instance of any cookie anywhere that asks for consent |
63 | 21:27:39 <zzhou_> there could be implied consent when you use Cookies statements like this |
64 | 21:27:47 <zzhou_> or at least warning to the user |
65 | 21:27:59 <Krenair> extensions aren't deployed by administrators of individual wikis/projects |
66 | 21:28:29 <zzhou_> a script that just loads automatically when someone visits a Wiki page without the end user knowing about it would be more problematic |
67 | 21:28:36 <dapatrick> Thanks for that clarification, Krenair. This is not the final wording of such an addition. |
68 | 21:29:47 <Krenair> who is going to be doing all this log checking and source code analysis? |
69 | 21:30:41 <tgr> IMO 1) looking through gadget code on thousands of wikis (possibly written in the local language, possibly broken for ages and/or no one still active knowing what it does) is not realistic |
70 | 21:31:23 <tgr> 2) even if you want to do that, logging cookies does not seem very helpful data for that kind of review |
71 | 21:32:00 <tgr> I guess one could do horrible hacks with replacing document.cookie and then logging stack traces |
72 | 21:33:05 <bawolff> If you assume cookie names are relatively unique having the cookie name is a good start to finding the relevant code |
73 | 21:33:23 <bawolff> but indeed, not an easy task |
74 | 21:34:16 <robla> I'm sitting in the same room with zzhou_ now, and I'm going to try to restate the point he's trying to make |
75 | 21:34:26 <tgr> the JS code might be loaded from another wiki or an external domain |
76 | 21:34:38 <Scott_WUaS> Zzhou: I noticed you in addition to having a law degree from Columbia that you "spent a semester studying Chinese law at Peking University in Beijing" Will the cookie questions you're asking have differential potential effects for Wikipedia /Wikimedia working in China do you think? Are these considerations to plan for in any way - both legally and in various languages in China too? |
77 | 21:34:44 <tgr> (well hopefully not external external but tool labs) |
78 | 21:35:03 <tgr> it might come from a browser plugin etc |
79 | 21:35:24 <robla> bawolff gave zzhou_ the output of mwgrep, which is basically just a list of cookie setting calls in the MediaWiki: namespace on all of our wikis (I think) |
80 | 21:35:27 <bawolff> CSP will hopefully solve the external problem one day :p |
81 | 21:35:48 <robla> bawolff's run gave back 1500 results |
82 | 21:36:10 <robla> zzhou_ is basically saying "1500 is a lot, but *that's* manageable, right?" |
83 | 21:36:47 <bawolff> Tgr: thatd have to be a pretty broken browser plugin but presumably thatd only be a small number of users so in the long tail |
84 | 21:36:47 <zzhou_> yea I am saying even if we end with a list of 1500 unique cookies, we can take time to go through them |
85 | 21:37:06 <zzhou_> I am not sure we will have this table anymore at that time since it is too large |
86 | 21:37:08 <bd808> zzhou_: I guess that depends on who has to do that audit and what it keeps them from doing otherwise |
87 | 21:37:41 <robla> where "a list of 1500 unique cookies" is "1500 places in the MediaWiki: namespace Javascript that seem to be setting cookies" |
88 | 21:38:07 <tgr> how much time you'd estimate for dealing with one cookie? |
89 | 21:38:22 <bawolff> Based on a naive regex that probably missed a lot |
90 | 21:38:24 <Zppix|mobile> Would anyone be viewing the info received for the cookies ?? |
91 | 21:38:39 <zzhou_> yea, I wasn’t proposing someone go over 1500 cookies necessarily - I think potentially past a certain large number, we will just rethink our strategy of listing all the cookies |
92 | 21:39:25 <bd808> the point isn't just to list them though is it? it's to audit why they exist |
93 | 21:39:50 <tgr> probably most of the cookies are opt-in (even if the user is not specifically told they are opting into a cookie, but they would have to enable a gadget or something) |
94 | 21:39:52 <bd808> and likely to stop using them if there isn't a very good reason? |
95 | 21:40:00 <zzhou_> not necessarily, since we don’t even know the scale of the issue yet |
96 | 21:40:27 <zzhou_> and to clarify by *1500 unique cookies I meant 1500 unique cookie names |
97 | 21:41:07 <Zppix|mobile> If there will be people who aren't employees viewing the info received i would suggest having some sort of confidentiality document (not a lawyer/legal team member but thats just my 2 cents) |
98 | 21:41:33 <dapatrick> Zppix|mobile, zzhou_ is on the legal team. |
99 | 21:41:37 <bd808> so the logging is just going to end up with a set of N strings. Then someone will need to pore through source code on-wiki and on the server side to see if they can find those same strings |
100 | 21:42:04 <bd808> Then they will need to determine who "owns" the code that sets the cookie |
101 | 21:42:08 <Reedy> Zppix|mobile: Also, that's what the generic NDA's cover anyway |
102 | 21:42:09 <bawolff> So what is the ultimate goal we have here? |
103 | 21:42:18 <bd808> and then contact those persons to find out why they are doing so |
104 | 21:42:25 <dapatrick> bd808 Right, then determine from source code, documentation, or conversation with the project owner the reason for the existence of the cookie. |
105 | 21:42:43 <dapatrick> bd808, sorry, what you said when you finished your thought. :) |
106 | 21:42:45 <zzhou_> bd808: correct, but potentially, a lot of the scripts are just copies of one another and they are really using the same cookie names so maybe we don’t have as many other cookies as the mwgrep suggests |
107 | 21:42:57 <bawolff> Do we basically want to explain ourselves to our users(?) |
108 | 21:43:02 <Reedy> did you not run it through uniq? |
109 | 21:43:17 <bd808> even if someone gets really really good at that process thats going to take an hour a cookie |
110 | 21:43:40 <Reedy> Nearly a year full time work |
111 | 21:43:57 <Zppix|mobile> Could a bot handle the tedious source editing or no? |
112 | 21:44:28 <bawolff> Reedy: at the time the output didnt seem amenable to processing like that |
113 | 21:44:34 <bawolff> at least not easily |
114 | 21:44:41 <bawolff> Zppix: no |
115 | 21:45:15 <Reedy> where's the list? |
116 | 21:45:20 <Zppix|mobile> ^ |
117 | 21:45:42 <bawolff> Currently only on a private email thread |
118 | 21:45:58 * robla would love to make mwgrep public, and short of that, make it so that we run mwgrep scans and publish the static logs |
119 | 21:46:00 <bawolff> i can pastebin it once i find it again |
120 | 21:46:10 <Krenair> there's a ticket for that robla |
121 | 21:46:37 <Reedy> Also needs my no private patch merging ;) |
122 | 21:46:38 <robla> Krenair: I heard about that from Krinkle ....please do tell! |
123 | 21:46:56 <tgr> so I guess working from the mwgrep list is not realistic, that leaves logging what cookies are set, doing some sort of honeypot approach, working with the community and leaving it to them to identify cookies, or just ignoring the issue |
124 | 21:47:00 <Zppix|mobile> Is the main time consumer translation for the cookie? I cant think of any other reason |
125 | 21:47:22 <Krenair> robla, https://phabricator.wikimedia.org/T71489 |
126 | 21:48:18 <bawolff> So umm. What about if we just put the cookie table on meta, and tell people to add items when they introduce new cookies |
127 | 21:48:32 <Reedy> !bug 1 | bawolff |
128 | 21:48:32 <wm-bot> bawolff: https://bugzilla.wikimedia.org/show_bug?id=1 |
129 | 21:48:39 <bawolff> and then use cookie logging to gauge how complete the table is |
130 | 21:50:01 * robla is sad that the bug 1 link above doesn't go to https://phabricator.wikimedia.org/T2001 |
131 | 21:50:23 <zzhou_> bawolff: you mean a separate table to help us to chase down the cookies (not the cookies table for the end user we have right now)? |
132 | 21:50:29 <Reedy> !botbrain |
133 | 21:51:17 <Reedy> !bug del |
134 | 21:51:18 <wm-bot> Sorry, you are not authorized to perform this |
135 | 21:51:29 <Krenair> !bug del |
136 | 21:51:29 <wm-bot> Sorry, you are not authorized to perform this |
137 | 21:51:30 <Reedy> just needs !bug is https://bugzilla.wikimedia.org/$1 |
138 | 21:51:32 <Zppix|mobile> Lol |
139 | 21:52:00 <tgr> say we get a table with 100 cookies and we log 1000 unique cookie names (let's optimistically assume there are no dynamically named cookies) |
140 | 21:52:01 <bawolff> Zzhou_: a crowd sourced table |
141 | 21:52:09 <Zppix|mobile> Maybe !bug should change from bugzilla.wikimedia to phabricator.wikimedia |
142 | 21:52:10 <tgr> again, what would we do with the data? |
143 | 21:52:25 <Reedy> Zppix|mobile: bugzilla had bugs, phab has tasks |
144 | 21:52:26 <tgr> would someone have to go through the 900 missing names and check? |
145 | 21:52:32 <Reedy> if the url is right, it will redirect correctly |
146 | 21:52:41 <dapatrick> tgr, Yes. |
147 | 21:52:47 <bawolff> Zppix: but then we cant make snide comments about bug #1 |
148 | 21:52:50 * robla has a meeting to go to in 7 minutes, so will end this abruptly |
149 | 21:52:55 <bawolff> :p |
150 | 21:53:13 <robla> also, we can keep the conversation generally going on Phab and on #wikimedia-tech |
151 | 21:53:29 <tgr> what are the chances of ending up with an amount of data that does not take man-months to sort through? |
152 | 21:53:51 <zzhou_> tgr: if we have that many cookies, we might need some sort of disclaimer like Dapatrick suggested earlier as it would not be feasible to go over all that many and furthermore, it is not disclosure to the user if we just present them with a list of 1000 cookies |
153 | 21:54:21 <tgr> zzhou_: so can we just start with that disclaimer and skip the intermediate steps? :) |
154 | 21:55:07 <zzhou_> that’s an option - it's def. less ideal than having a cookies table that’s up to date |
155 | 21:55:27 <zzhou_> (assuming the size of the table is still limited) |
156 | 21:56:52 <robla> 180 seconds until abrupt end of meeting.... |
157 | 21:57:01 <zzhou_> does everyone think it is likely we have many hundreds to thousands of unique cookie (names) lying around? |
158 | 21:57:13 <bawolff> Perfectly timed with my battery dying ;) |
159 | 21:57:27 <Reedy> the list sorted and uniq'd will remove dupes |
160 | 21:58:16 <robla> #info next week's tentative topic: T138783 SVG stuff |
161 | 21:58:16 <stashbot> T138783: SVG Upload should (optionally) allow the xhtml namespace - https://phabricator.wikimedia.org/T138783 |
162 | 21:58:18 <bd808> I think we need the logging to find out honestly. Probably not too hard to add into the wikimedia messages extension or something similar |
163 | 21:58:19 <zzhou_> Reedy: yea, perhaps that’s the first step |
164 | 21:58:19 <bawolff> I think the distribution will have a long tail |
165 | 21:58:21 <dapatrick> zzhou_ I believe there may be a possibly untenable number. I do not believe it will be many hundreds of thousands. |
166 | 21:58:36 <zzhou_> sorry I meant hundreds to thousands ;) |
167 | 21:58:58 <dapatrick> Ah. I also read you wrong. |
168 | 21:59:07 <bd808> across all projects and languages? I wouldn't doubt high hundreds |
169 | 21:59:10 <robla> 45 seconds to end of meeting |
170 | 21:59:18 <dapatrick> Hundreds to thousands is about what I expect. |
171 | 21:59:19 <tgr> yeah, the long tail will be long |
172 | 21:59:27 <zzhou_> ok |
173 | 21:59:34 <robla> thanks everyone! those that want to keep talking can use #wikimedia-tech |
174 | 22:00:04 <zzhou_> alright I will pop to #wikimedia-tech in case people have time, I want to follow-up a little |
175 | 22:00:07 <robla> #endmeeting |
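Several ideas from the log (counting how often each cookie is seen per wiki, sampling the logs, and using cookie logging to gauge how complete a cookie table is) can be sketched together. This is a hypothetical illustration in Python, not an existing Wikimedia tool: the `(wiki, cookie_name)` record shape, the function names `aggregate` and `undocumented`, and the sample cookie names are all assumptions.

```python
import random
from collections import Counter


def aggregate(observations, sample_rate=1.0, rng=None):
    """Count (wiki, cookie_name) pairs.

    Each observation is kept with probability sample_rate, so a busy site
    does not have to process every request that carries a cookie.
    """
    rng = rng or random.Random()
    counts = Counter()
    for wiki, name in observations:
        if rng.random() < sample_rate:
            counts[(wiki, name)] += 1
    return counts


def undocumented(counts, documented):
    """List cookie names seen in the logs but missing from the documented set.

    Returns (name, total_count) pairs, most frequently seen first.
    """
    totals = Counter()
    for (_wiki, name), seen in counts.items():
        if name not in documented:
            totals[name] += seen
    return totals.most_common()


if __name__ == "__main__":
    # Made-up log records: (wiki, cookie name).
    observations = [
        ("enwiki", "hideBanner"),
        ("enwiki", "hideBanner"),
        ("jvwikisource", "gadgetState"),
        ("enwiki", "listedCookie"),
    ]
    counts = aggregate(observations)  # sample_rate=1.0 keeps every record
    for name, seen in undocumented(counts, documented={"listedCookie"}):
        print(name, seen)
```

Sorting by frequency puts the most widely seen names first, which may help if the distribution has the long tail several participants expected.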
People present (lines said)
- bawolff (33)
- zzhou_ (29)
- robla (24)
- bd808 (20)
- tgr (18)
- dapatrick (13)
- Reedy (12)
- Zppix|mobile (9)
- Krenair (7)
- wm-labs-meetbot (3)
- wm-bot (3)
- stashbot (2)
- Scott_WUaS (2)