- Erik compiled data from sampled logs which are no longer available
I tried to enable the vagrant role again to test, and I got this error, so I figure either my Vagrant setup is messed up or this will prevent me from getting a clean test. If it works for other folks on a clean Vagrant, it's fine; I'll try to test again when I look at the unit testing RFC next.
Anticipating release, docs are here: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Geoeditors/Public
Thu, Oct 17
Tue, Oct 15
I wouldn't worry too much about how others are using this in their own mediawiki installs. Not that it's not important, just that we can't possibly guess as to how they might want to do that. Just having three repos with some common stuff will allow for plenty of flexibility and refactoring later on.
I like a single namespace, especially because having "common" as a root would be too vague. This might be useful:
Fri, Oct 11
Thu, Oct 10
I looked at this and it's a clever way to get some rough information to answer the main question. But I just wanted to point out: it's not by accident that this kind of correlation is hard to do. We made a conscious decision a while back that we should not optimize access to the reading patterns of specific users. Being able to get this data is problematic for privacy reasons, and more so if we can get it quickly. So, nice work, but I would hesitate before optimizing it much further.
Yes, unfortunately we don't have mobile data going back before that. Before 2014, we had:
Wed, Oct 9
ok, I restarted the monthly job and this column will be populated going forward. The first time it will be inserted is November 1st, 2019, when the job processes the month of October.
I'm thinking about the generated documentation that may now be confusing to newcomers / people who haven't read this policy yet. For example, if you have two public methods, and one of them is annotated with @stable while the other is not. I would read the documentation and use any public method and expect it to be stable unless something jumped out at me telling me not to make this assumption. So, ideally, docs would color/explain both annotated methods and public methods that are not annotated.
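To make the concern concrete, here is a toy sketch of the kind of check a documentation generator could surface: scan for public methods whose doc block lacks a stability annotation, so those methods can be colored or flagged. The parsing here is deliberately naive (a regex over PHP-ish source) and the function name is invented; a real doc tool would use a proper parser.

```python
import re

def unannotated_public_methods(source: str):
    """Return names of public methods whose doc block lacks a @stable tag.

    Toy illustration only: pairs each 'public function foo' with the doc
    comment immediately above it. A real documentation generator would use
    a proper parser; this only shows the kind of check the generated docs
    could surface.
    """
    missing = []
    # Capture an optional /** ... */ block followed by a public method name.
    pattern = re.compile(r"(/\*\*.*?\*/\s*)?public function (\w+)", re.DOTALL)
    for doc, name in pattern.findall(source):
        if "@stable" not in doc:
            missing.append(name)
    return missing
```

Methods with no doc block at all are also flagged, since the optional group comes back empty for them.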
The column has been added and I'm restarting the job so it will be filled going forward. Should we backfill this data as far back as we have the raw source (90 days)?
@Ijon I'm working on a blacklist, and wanted to check with you to see how it would impact the usefulness of the dataset. I'll write more details but basically, on the advice of folks more familiar with censorship of Wikipedia, I'm using scores from Reporters Without Borders and Freedom on the Net. It makes more sense to blacklist the top wikis used in each of these countries, but that is often English Wikipedia and things get confusing. So sticking with the simpler approach of just blacklisting the countries, here's a list of the worst offenders according to those two sources:
Mon, Oct 7
ok, we found the problem - we deployed a filter to exclude requests from domains we don't have on a whitelist (like the many Wikipedia mirrors that randomly run our JS code and send bad data). Since this instrumentation comes from wikipedia.org, we were excluding it. We'll put that on the whitelist and rerun refine for this time period.
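For anyone following along, the filter roughly works like this (a minimal sketch; the function and list names are hypothetical, not the actual refinery code):

```python
# Hypothetical sketch of a domain-whitelist filter for incoming requests.
# The real deployment logic lives elsewhere; names here are invented.
WHITELISTED_DOMAINS = {
    "wikipedia.org",        # the portal itself -- this was the missing entry
    "commons.wikimedia.org",
}

def is_whitelisted(domain: str) -> bool:
    """Accept a domain if it, or any parent domain, is on the whitelist.

    This keeps subdomains (e.g. 'de.wikipedia.org') while rejecting
    Wikipedia mirrors that run our JS from unrelated hosts.
    """
    parts = domain.lower().split(".")
    for i in range(len(parts) - 1):
        if ".".join(parts[i:]) in WHITELISTED_DOMAINS:
            return True
    return False
```

The suffix walk is why adding the bare `wikipedia.org` entry is enough to stop excluding portal traffic.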
low priority until someone asks for this in Druid - and let's be careful how we expose the IP/UA/user id column.
The fact that there are some events means the data is flowing through. My first guess is to take a look at the eventerror table for your schema and see if there's a spike of errors and, if so, what they are. If there's a corresponding rise there, it means your events are not matching the schema. If not, let us know and we can help look further. This also looks ok; events flowing in seem relatively consistent (no 100x drops): https://grafana.wikimedia.org/d/000000018/eventlogging-schema?orgId=1&var-schema=WikipediaPortal&from=now-90d&to=now
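Assuming you've already pulled daily error counts for your schema out of the eventerror table, a quick way to eyeball a spike might look like this (purely illustrative; in practice the Grafana dashboard above is the easier check):

```python
def find_spikes(daily_counts, factor=5.0):
    """Flag days whose error count exceeds `factor` times the median.

    `daily_counts` maps a date string to that day's eventerror count for
    one schema. A crude heuristic, only meant to illustrate "look for a
    corresponding rise" -- not a production alerting rule.
    """
    counts = sorted(daily_counts.values())
    median = counts[len(counts) // 2]
    baseline = max(median, 1)  # avoid flagging everything when median is 0
    return [day for day, n in sorted(daily_counts.items())
            if n > factor * baseline]
```

If the flagged days line up with the drop you're seeing in valid events, the events are probably failing schema validation.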
data's only in Druid for these datasources for 3 months, we decided we don't need to reindex monthly
nothing to do, maybe monitor and see if this causes other performance problems?
Fri, Oct 4
Can't wait to help as much as I can with this effort. I think it's going to be a big challenge with an even bigger payoff. I didn't get to be part of the FAWG but I've used several mainstream and several weird frameworks in my career and have hopefully useful opinions about how to understand pros and cons and decide on this specific kind of technology. Looking forward to helping in any way I can.
Ok, so, good progress here. The technical committee would like to see concerns from the March 2015 version reviewed, to ensure we don't repeat any of those problems. As stated in the description, the issues were raised in and around comments T18691#1051560 and T18691#1098570, and include:
Tue, Oct 1
All the data requested here is available in the mediawiki history dataset. We have not had any requests to query this data from our user-facing interface. It's certainly possible, but not trivial: it's just too much data to allow arbitrary querying without putting it on a *monster* cluster. But, for example, if we just had to answer the questions listed here, we could probably do it much more efficiently. We just need someone to stand up and say "this is important". Also, right now this data is computed monthly. If we need more frequent updates, the same principle applies. It's very hard to update everything incrementally, but for a limited set of questions and queries, we could update near real-time.
Mon, Sep 30
While I would love to argue for a .NET deployment (so that everyone I love and I can enjoy programming again), what do we need from schemastore? I didn't think we needed any fancy features outside of Stream Config.
I fixed a broken link, the task we need to follow is now T233004, and it looks like work is going forward. So we'll need a patch here. I'm happy to take this but will wait for grooming to jump back in the dance.
@dr0ptp4kt and how will this work? The graphs will start out as some placeholder image as the vega module is brought in async?
Oh, glad to hear, @revi, I'm happy to help edit the docs (I just got back from paternity leave), I'll look for updates here
Thu, Sep 26
I'm back and getting up to speed. I'd need to hear what the unresolved issues are, so we can collaborate on a way forward. The patch that Jon proposed for core seemed like an improvement to me, for example. And it wouldn't affect how other extensions use qunit tests, but it seems to improve testing in general. So yeah, let's brainstorm together.
I agree the npm and submodule ideas are the best two. I prefer the submodule idea, after working through what I think are the likely scenarios.
Aug 23 2019
Aug 19 2019
My latest patchset on that change above is just a draft implementing some of the thoughts so far. It implements the following so that we have a place to start from when we finalize our thoughts on privacy here:
I could use some collaboration on the list of countries to blacklist. The paper that Nuria mentions includes: China, Cuba, Egypt, Indonesia, Iran, Kazakhstan, Pakistan, Russia, Saudi Arabia, South Korea, Syria, Thailand, Turkey, Uzbekistan, Vietnam. But the reason for censorship is pretty different in each country, and they don't all seem like they need a blacklist. I tried to guess at a first draft of the blacklist but honestly I'm not sure. The governments in not just those countries but those regions seem pretty troubling to me. And I don't have enough knowledge to know when something goes from troubling to dangerous.
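Mechanically, the redaction itself is simple once we settle on the list; the hard part is only which countries go on it. A minimal sketch, assuming the published rows carry an ISO country code (the codes below are a hypothetical subset drawn from the paper's list, not a decided blacklist, and the row shape is invented):

```python
# Illustrative only: the actual blacklist is still under discussion.
# These ISO 3166-1 alpha-2 codes are a hypothetical subset of the
# countries named in the paper above.
BLACKLIST = {"CN", "CU", "IR", "SA", "SY", "TR", "UZ"}

def redact_countries(rows, blacklist=BLACKLIST):
    """Drop per-country editor rows whose country code is blacklisted.

    `rows` is an iterable of dicts with at least a 'country' key, a
    guess at how the public geoeditors dataset might be shaped.
    """
    return [row for row in rows if row["country"] not in blacklist]
```

Since the filtering is this cheap, the list can be revised later without reworking the pipeline; the open question is purely which countries belong on it.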
@Yair_rand, that's what we're trying to prevent, yes. The value of the data is great, and the risk will be minimized as much as possible. As Asaf points out above, we have had this conversation for a very long time. Our legal and security teams have thought about the potential danger of this dataset and signed off on us publishing it. Nevertheless, I personally would like to protect this dataset as much as possible and that's why I'm looking into how to make it harder to determine the country of specific editors. Does that make sense? Do you have additional concerns?