Thu, Jun 10
seems related to T230915, so I'm going to look at both.
Wed, Jun 9
:) not to toot my own horn, but for planning purposes, I do happen to be very fast at stuff like that. I could probably move off Semantic and to something like Vuetify in a week or two at most. Or a WMF design system, if it has a good chunk of the components we need.
Tue, Jun 8
If you run this query on presto:
Mon, Jun 7
Just a quick note to say that I ran the query for May 17th, and still found mismatches on both sides. I will find a way to do a better analysis that we can easily re-run every time we make improvements.
Thu, Jun 3
Should I create another feature request for that? Or is this idea too far-fetched?
Thanks for the work, this is great. I think the static nature of the site isn't too much of a problem; we've solved similar problems with bundles and, worst case, some Apache config magic.
Would we need to ask for a security review for exporting aggregated data out of Hadoop?
Tue, Jun 1
As far as I understand, our experiments with Extension:Sentry were replaced by the Client errors data pipeline.
Much love to @epriestley for starting this project, helping us move to it, and incorporating our feedback and contributions. In my opinion, moving to Phab was the best decision we've made in the almost 9 years I've been here. Thank you, @epriestley, for your part in that.
Thu, May 20
I think Andrew has some ideas, we'll get to the bottom of this one way or another. Then, once Neil's issue is resolved I'd like to reframe this or add a subtask to go over logging on the cluster in general. Lots of background noise like SLF4J warnings clutter the already cluttered logs, and make maintenance harder.
Wed, May 19
@SNowick_WMF, Reportupdater is fine; it's what's available right now. We don't want to slow you down waiting for Airflow.
Tue, May 18
Mon, May 17
Thanks very much for following through with that. Seeing your prototype makes it very clear what you need and why. I think ideally we would create a better pipeline from community-requested statistics to on-wiki infographics. This is something that's been hard for WMF to prioritize, but something I care about, and will continue to think about.
Ok, weird, I can't reproduce this... maybe it's some weird access problem? We'll triage and look into it
It takes up about 15G so honestly it's not that big a deal to keep around, even if there are only a few downloaders. I can't tell about our mirrors of course, but even from our own web server there are a few downloaders that aren't bots. So, meh. Keep?
There's no need for a fancy tool; this would be a few lines of Spark to read the data and save it to, probably, a Hive table with an explicit schema. It should take a day to set up and some time after that to run some analysis. We just don't have the capacity, as there's a lot of higher-priority stuff going on right now. But it's relatively easy for anyone to play with. The only concern for me that's a bit time-sensitive is that there are a bunch of IPs in the logs.
Thanks, the wmde-qwerty group would happily self-merge reportupdater-queries while just CC'ing WMF Analytics.
The description says we're keeping the raw JSON import, just not the rest of the pipeline. I'm fine with deleting any of it, since unused data is just confusing; I'm just making sure everyone expects the same thing.
https://stats.wikimedia.org/ runs https://gerrit.wikimedia.org/r/admin/repos/analytics/wikistats2 and we'd be happy to migrate to GitLab. We merge translation commits from translatewiki.net, and have Jenkins build set up, so not sure if that's tricky for the migration, but we're happy to do it together.
May 15 2021
I'm just as lost as you are so far... it's expected behavior and not a bug, but I can't figure out what configuration triggers it and why "https://github.com/apache/superset/blob/9773aba522e957ed9423045ca153219638a85d2f/superset/translations/en/LC_MESSAGES/messages.json#L1017"
May 13 2021
The culprit is an uppercase/lowercase mismatch, so the Druid jobs weren't finding the data: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/virtualpageview/druid/daily/coordinator.properties#L48
This would have to happen after data governance, so any help before that is appreciated (I can review patches to the pipeline anytime)
cc-ing @JAllemandou, who said he wanted to look at it. We'll triage with him Monday when he's back from vacation.
May 12 2021
May 10 2021
This is still high priority for us as we look to make some of our datasets incremental. We're not focusing on it right now.
Might be a good task for Ben (starting soon).
To be done in concert with move to Apache Iceberg and overhaul of how we handle the time dimension more generally
Making this high priority for privacy reasons; anyone should feel free to grab it.
ping Product-Analytics, any interest in taking a look at this?
We'll look at possible ways to improve this as we move data quality jobs to Airflow.
This feels to me like it will be part of the data governance effort, so definitely something I care about
@awight so it seems like you're good with the secondary event schemas repo, and you (WMDE technical wishes team) just need access to reportupdater-queries? I'm happy to add this, what gerrit group/list of folks should I use?
Sorry, this is not really possible for privacy reasons. Even if you were logged in, we throw away most of the data that would be needed to compile these stats.
ping us if you need any support, @Nuria
I can find the handling code in the eventgate server implementation, but it seems there's no way to send a "guaranteed" event from the eventlogging client yet? Would it make sense to expose this in the client API, or does that belong in a new / different client implementation? In other words, should "Event Logging" always be sent hastily, and we introduce a new abstraction for sending to the same endpoint but synchronously?
May 7 2021
So I'm not sure if you're talking about other problems but I hear two:
May 6 2021
It looks like the https://dumps.wikimedia.org/other/wikistats_1.0/ folder is empty, so it can be deleted.
The format looks like Common Log Format with two additional fields, "full URI requested" and "user agent"
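For anyone who wants to poke at these logs, here is a rough sketch of a parser for that format. It assumes the two extra fields are double-quoted and appended after the standard Common Log Format fields, in the order "full URI requested" then "user agent"; I haven't confirmed that against the real files. The names `LINE_RE` and `parse_line`, the group names, and the sample line are all made up for illustration.

```python
import re

# Standard Common Log Format: host ident authuser [date] "request" status bytes
# ASSUMPTION: the two additional fields are double-quoted and come last,
# in the order "full URI requested" then "user agent".
LINE_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<full_uri>[^"]*)" '
    r'"(?P<user_agent>[^"]*)"$'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # CLF writes "-" when no bytes were sent; treat that as zero.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


# Hypothetical sample line (documentation-range IP, invented URI and agent).
sample = (
    '192.0.2.1 - - [06/May/2021:12:00:00 +0000] '
    '"GET /other/example.gz HTTP/1.1" 200 1024 '
    '"https://dumps.example.org/other/example.gz" "ExampleAgent/1.0"'
)
record = parse_line(sample)
```

Since the host field holds an IP, it would be safer to drop or hash `host` before saving parsed rows anywhere.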
Thanks, good point, I added a note at the beginning. It's not quite deprecated yet, we may decide it's a good idea and refresh it, wouldn't be terribly hard. But for now, the note will help nice folks like you not waste time.