My latest patchset on the change above is just a draft implementing some of the thoughts so far, so that we have a place to start from when we finalize our thinking on privacy here:
Fri, Aug 23
Mon, Aug 19
I could use some collaboration on the list of countries to blacklist. The paper that Nuria mentions includes: China, Cuba, Egypt, Indonesia, Iran, Kazakhstan, Pakistan, Russia, Saudi Arabia, South Korea, Syria, Thailand, Turkey, Uzbekistan, Vietnam. But the reason for censorship is quite different in each country, and not all of them seem to need blacklisting. I took a guess at a first draft of the blacklist, but honestly I'm not sure. The governments in not just those countries but those whole regions seem pretty troubling to me, and I don't have enough knowledge to judge when something goes from troubling to dangerous.
@Yair_rand, that's what we're trying to prevent, yes. The value of the data is great, and the risk will be minimized as much as possible. As Asaf points out above, we have had this conversation for a very long time. Our legal and security teams have thought about the potential danger of this dataset and signed off on us publishing it. Nevertheless, I personally would like to protect this dataset as much as possible and that's why I'm looking into how to make it harder to determine the country of specific editors. Does that make sense? Do you have additional concerns?
Sun, Aug 18
Thanks very much. I also would much prefer not to rehash the conversations we've already had. We're ready to release this data, and the work we're doing to find the best way to release it is just preparation until the privacy framework is ready. We just want to be able to justify our decisions. Bucketing and blacklisting seem like they fit into the privacy framework drafts I've seen so far, and I'll take some time during paternity leave to fill out that rationale if needed. So, here's where we are so far:
Fri, Aug 16
Quick status update. I am currently evaluating ways to release this data. This is just while we wait for our privacy framework to be finished. As soon as that's done, we can evaluate our possible solutions here and execute the release fairly quickly.
Aug 15 2019
Where are (or will be) the details of the new server?
Right, but this task description mentions the CheckUser change, so either this description should be updated or subtasks should be added here, no? Otherwise how would people like me know where to look?
@dmaza: yes, and keep in mind that data is automatically purged after 90 days.
@MarcoSwart it hasn't changed, but maybe spider traffic has been steadily rising. There's also an unrelated but confusing bug where the time period on the dashboard changes by itself, the fix for that is being deployed soon.
Aug 14 2019
The numbers make sense to me. Did you split by agent type? You can see a lot of spider (crawler) traffic: https://stats.wikimedia.org/v2/#/nl.wiktionary.org/reading/total-page-views/normal|bar|2-year|agent~user*spider|monthly
Aug 13 2019
@leila: I looked over the survey and made some edits to the first part. The questions look great to me. Thanks again for putting this together.
Aug 8 2019
We just want to be involved if you want to whitelist data to be kept more than 90 days. Other than that, you don't need any approval from us. We're happy to look over your schema and give advice about how easy data would be to load into, for example, Druid.
Aug 7 2019
Aug 5 2019
Aug 2 2019
Rough draft of a blurb about why this dataset is useful:
Aug 1 2019
FYI: this is already possible in the API; it's just not implemented in the UI. For example
(note split by agent type and selecting only user to match your Hive query)
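For reference, the agent split mentioned above can be done through the public pageviews REST API's `aggregate` route. A minimal sketch, assuming the standard `project/access/agent/granularity/start/end` path layout; the project and date range here are only illustrative placeholders:

```python
# Sketch: build pageviews aggregate URLs split by agent type, so human
# ("user") and crawler ("spider") traffic can be compared side by side.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate"

def pageviews_url(project, agent="user", granularity="monthly",
                  start="2017080100", end="2019080100", access="all-access"):
    """Build the aggregate pageviews URL for one agent type."""
    return f"{BASE}/{project}/{access}/{agent}/{granularity}/{start}/{end}"

# Request each agent type separately to compare traffic:
user_url = pageviews_url("nl.wiktionary.org", agent="user")
spider_url = pageviews_url("nl.wiktionary.org", agent="spider")
```

Fetching each URL and comparing the monthly totals should reproduce the user-vs-spider split visible on the dashboard.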
@Ramsey-WMF: we shouldn't make granular referer information public but you can always access the raw data. Talk to Product Analytics or jump on the Hadoop cluster and take a look.
Vega seems to allow you to control headers via the dataHeaders property, but it's not really documented; I found it here: https://github.com/vega/vega/blob/af5cc1df42eb5aaf2f478d0bda69313643fe0532/docs/releases/v1.5.4/vega.js#L378
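Since dataHeaders is undocumented, here is a hypothetical sketch of how a Vega 1.x spec fragment might use it, written as a plain dict. The property name comes from the linked vega.js source; the shape of its value (a simple name-to-value mapping) and everything else here are assumptions, not confirmed API:

```python
# Hypothetical Vega 1.x data entry using the undocumented dataHeaders
# property to attach HTTP headers to the data-loading request.
# URL and header values are made-up placeholders.
spec_fragment = {
    "data": [
        {
            "name": "pageviews",
            "url": "https://example.org/data.json",  # placeholder URL
            "format": {"type": "json"},
            # Assumption: headers sent with the data request, as a
            # name -> value mapping (per the linked vega.js code).
            "dataHeaders": {"Accept": "application/json"},
        }
    ]
}
```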
TL;DR: the user agent is not set; it just shows up as "-", so that's what WDQS sees. From what @Smalyshev says above, this is what's causing the 403s, right?
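If the empty user agent is indeed the problem, the fix on the client side is to send a descriptive User-Agent instead of leaving it unset. A minimal sketch using the standard library; the tool name and contact address are placeholders, not a real policy-approved identifier:

```python
from urllib.request import Request

# Give the client a descriptive User-Agent instead of leaving it unset
# (which shows up as "-" in the logs, as noted above).
# Tool name and contact details below are placeholders.
USER_AGENT = "MyWDQSTool/0.1 (https://example.org/contact; tool@example.org)"

def make_request(url):
    """Build an HTTP request that identifies the client."""
    return Request(url, headers={"User-Agent": USER_AGENT})

req = make_request("https://query.wikidata.org/sparql?query=...")
```

With the header set, the request should no longer appear as "-" on the server side.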
I looked at webrequest for the four hours around the time of Petr's post, and I couldn't see any 403s to wikidata.org/w/api.php. If someone knows when the error showed up, it would be easy to find in the webrequest table:
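A sketch of the kind of Hive query meant here, built as a string for readability. The column and partition names (`http_status`, `uri_host`, `uri_path`, `webrequest_source`, and the `year/month/day/hour` partitions) follow the `wmf.webrequest` table as I understand it; treat them as assumptions and adjust to the real schema:

```python
# Build a HiveQL query counting 403 responses to wikidata.org/w/api.php
# in a single hour-partition of wmf.webrequest.
def webrequest_403_query(year, month, day, hour):
    return f"""
    SELECT dt, uri_path, user_agent, COUNT(*) AS hits
    FROM wmf.webrequest
    WHERE webrequest_source = 'text'
      AND year = {year} AND month = {month}
      AND day = {day} AND hour = {hour}
      AND http_status = 403
      AND uri_host = 'www.wikidata.org'
      AND uri_path = '/w/api.php'
    GROUP BY dt, uri_path, user_agent
    """

# Example: the hour around the suspected incident (placeholder timestamp).
query = webrequest_403_query(2019, 7, 25, 12)
```

Restricting to one hour-partition keeps the scan cheap; widen the partition predicates to cover a longer window if the error time is only roughly known.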
@Neil_P._Quinn_WMF they weren't migrated, they only exist in the old repo: https://github.com/wikimedia/analytics-limn-edit-data
I meant the operator_user= and iorg_domain_internal= prefixes, which we don't add in any way, so these failures are expected, right?
Jul 31 2019
Jul 30 2019
Jul 29 2019
@leila: we can of course iterate on the format in the future. Eventually we'll have a public API to query the whole dataset. But for now we just want some idea of common / high priority use cases that we can try to serve with a simpler release. Thank you so much for looking into it.
Quick question: the description shows plans to refactor cu_changes.cuc_comment to cuc_comment_id. The activity on this task seems to have stopped and this refactor doesn't seem to have happened yet. Will it happen eventually or has it been abandoned?
Jul 26 2019
So, I looked into the code history more carefully. There's literally one code change in AQS in 2019, and it doesn't touch pageviews handling at all. npm saw fit to update some of the repository references for kad, swagger-ui, and json-stable-stringify. I suppose we could look into those, but it would be pretty bad luck if they were the cause. I think the logical next place to look is the layer in front of AQS; the problem is 99% likely to come from there. Pinging @Pchelolo to see if this sounds familiar. Petr, basically we're seeing a lot more 429s since around April 2019, and we see two different kinds:
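Whatever layer is emitting the 429s, clients can at least cope with them gracefully in the meantime. An illustrative sketch (not tied to the actual rate limiter in front of AQS) of retrying with exponential backoff while honoring a Retry-After header when the server supplies one; `fetch` is a stand-in for any HTTP call returning `(status, headers, body)`:

```python
import time

# Retry a request on 429, backing off exponentially unless the server
# provides an explicit Retry-After hint.
def get_with_backoff(fetch, url, max_tries=5, base_delay=1.0, sleep=time.sleep):
    delay = base_delay
    for attempt in range(max_tries):
        status, headers, body = fetch(url)
        if status != 429:
            return status, body
        # Prefer the server's hint; otherwise back off exponentially.
        wait = float(headers.get("Retry-After", delay))
        sleep(wait)
        delay *= 2
    return status, body  # still rate-limited after max_tries
```

The `sleep` parameter is injectable so the behavior can be tested without real waiting.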
Jul 25 2019
Jul 24 2019
My fault - I confused this with our mediawiki-storage repo; I should've read the title more carefully. Will work on fixing it.
I think that's right; we have a pretty good working group of the people who care about this. If they agree, I don't see much need for an RfC; there would be too much context for someone else to catch up on. So I would vote to close this as invalid. As for requirements/constraints, I think the phab tasks describe those for now, and we'll document them as we go forward (there's still thinking/testing to do on what the stack should be).
Wanted to mention this in today's meeting but couldn't find it in time: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/NotErrorLogging. The main reason not to use EL for error logging, which I agree with, is that if EL goes down it's not a big deal, but if we're blind to client-side errors, that affects our users directly.
@daniel sorry I didn't update the group on this, but this project is being led by Filippo and progressing nicely. There's a working version deployed in beta and progress on a production launch is tracked here: T226986. I'm not sure at this point how that interacts with this RfC. Maybe we should update it when we have a good idea what the production stack should look like.
This can be tricky to diagnose because we don't really know what, if any, upstream changes were made to Hyperswitch. Do you have a more accurate idea of when you started seeing this? Was it when you created the task, at the beginning of April this year?
Jul 23 2019
@leila / @nettrom_WMF: fyi I'm working on this now. I've started a draft page where I'm thinking out loud about how to publish: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Public
Jul 22 2019
@Gopavasanth: regarding OAuth, here's some debugging by someone with I think a similar issue, maybe it helps: https://github.com/milimetric/passport-mediawiki-oauth/issues/2#issuecomment-513711239
@faidon: we don't have any updaters on our end, we just move the databases around and keep backups for historical use. But let us know if you run into any problems.
Jul 18 2019
This is due to a schema change from mediawiki-page-create version 3 to version 4.
The dashboard is working fine now. It's possible there was just a temporary server hiccup; reportupdater recovers from a lot of problems by itself.