Milimetric (WMF)
- User Since: Oct 8 2014, 5:48 PM
Mon, Mar 27
Fri, Mar 24
Just to have some fun I counted current usernames that start with all of these proposed prefixes (and I added a goofy proposal of my own, 73^^9):
Thu, Mar 23
Thu, Mar 16
This is temporarily in review to get opinions on the way I handled the interaction between the delayed daily timetable and our datasets idea. It's somewhat hard-coded, but I think simpler than a more flexible approach. Let me know what you think. (assigned to Sandra but anyone is welcome to comment)
Thanks for this context, @Niharika! I think it's ok to wait and see when the temp user cookies roll out how those numbers compare with these. If we wanted to know ahead of time, we could probably devise some kind of sampled test by setting a simple cookie on edits and incrementing it... But I don't think it's worth the trouble.
Wed, Mar 15
Tue, Mar 14
+1 to archive then, thanks Federico! And maybe let me know if we can do anything in wikistats 2 to help you out.
Sorry we left it open so long. I just have to check with Nemo; I'll reply within a few days.
Mon, Mar 13
note to self, look at T331892: Move eventlogging_to_druid_ jobs to airflow in order to rely on cluster spark mechanism instead of client
Quick update from our last conversation:
Fri, Mar 10
Found a blocker: the Druid pageview jobs need pageview_hourly to continue creating _SUCCESS files, but those files weren't spelled out in the migration doc, so this was missed. I'll update the Airflow jobs on Monday and resume deployment.
Thu, Mar 9
@apaskulin: created as site ID 24 (note: 24 backwards is 42! :P )
you can see the tracking code here: https://piwik.wikimedia.org/index.php?module=CoreAdminHome&action=trackingCodeGenerator&idSite=24&period=day&date=yesterday&updated=false
Feb 22 2023
Feb 15 2023
Feb 14 2023
Feb 13 2023
Feb 10 2023
Feb 9 2023
Thanks for catching that! Merged; it will auto-deploy.
I'm putting this in review, but there are three jobs being migrated so I'll send them in separate patches.
Feb 8 2023
I'm working on a merge request for this, testing the jobs (it's going slow 'cause I'm on ops week)
Jan 31 2023
+1 to Timo's suggestion. The change required is fairly contained right now.
Jan 30 2023
dibs! Yaay :)
Oooh, I'd love to work on this
Jan 26 2023
@EChetty: The old Analytics tag should auto-tag Data-Engineering or be archived/deleted so folks can't use it. I've heard a lot of confusion around the team name lately, and I think the phab tags may be a primary source for that.
Jan 25 2023
Note: per @EChetty, this task is currently deprioritized, since there is a plan to reduce tech debt and use the new mediawiki.page-change stream instead of mediawiki_history for building data pipelines. See T311129 for more.
Jan 20 2023
just to have this on record: the Wikistats annotation display system was thrown together quickly. It can be easily modified to include ranges, and it probably should be in this case and a few other cases. Annotations are great UX and I still think putting them on wiki is a great way to get the community involved.
Jan 19 2023
Jan 9 2023
note for myself: https://github.com/apache/iceberg/pull/6182/files is recent activity about supporting deletes in future Flink / Iceberg APIs
Jan 6 2023
Jan 5 2023
Jan 4 2023
At this point, we are making significant progress on near-real-time dumps generation. If that project continues to work out, we could have an alternate view available for querying, one that would include data from all wikis, identified by perhaps a wiki_db column or similar. Just putting it out there as a possible solution to this very old problem (I haven't forgotten about this, progress here is just... slow)
@Ottomata I don't have any preference here, it just occurred to me that you could also work around the $ref problem like this:
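(The snippet from the original comment is missing in this snapshot. As a purely generic sketch of the idea, with all field names invented for illustration, one way to work around a $ref is to inline the referenced definition at the point of use:)

```yaml
# Instead of referencing a shared fragment:
#   properties:
#     meta:
#       $ref: '#/definitions/meta'
# inline the definition directly (names below are hypothetical):
properties:
  meta:
    type: object
    properties:
      dt:
        type: string
        format: date-time
```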
Jan 3 2023
+1 on T288301#8487410, @BPirkle
Dec 20 2022
Dec 16 2022
Dec 13 2022
@taavi: for a little more context, the dashboard basically keeps track of statistics about access to content provided by GLAMs. The long term plan is to make better APIs that can directly serve their needs, but for now they need to parse mediarequest dumps and store the results. This would be easier in more ways than one on our cloud infrastructure.
Dec 1 2022
Nov 30 2022
Works! Not an issue in hive either, @BTullis
Nov 23 2022
@Volans asked me, basically, how come count(distinct ip) gives slightly inaccurate results in superset -> druid queries. I didn't know, but found out that Druid by default has: useApproximateCountDistinct: true. See more at: https://support.imply.io/hc/en-us/articles/360056362993-Getting-exact-count-distinct-results-using-druid-SQL. Here's an example, and how to go about getting exact answers without tuning that setting.
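(The example from the original comment is missing in this snapshot. A minimal sketch of the two query shapes, with the datasource and column names as assumptions:)

```sql
-- Approximate: what Druid returns by default while
-- useApproximateCountDistinct = true (datasource name is hypothetical).
SELECT COUNT(DISTINCT ip) AS approx_uniques
FROM webrequest_sampled;

-- Exact, without touching the global setting: either pass the query
-- context {"useApproximateCountDistinct": false} with the SQL request,
-- or force exactness by counting rows of a grouped subquery:
SELECT COUNT(*) AS exact_uniques
FROM (SELECT ip FROM webrequest_sampled GROUP BY ip);
```

The grouped-subquery form is exact because the inner GROUP BY materializes one row per distinct value, so the outer COUNT(*) needs no sketch.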
Just for anyone that grabs this, we already define "active wikis" and use it in datasets like public geoeditors, the query for a dataset would be something like:
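(The query itself is missing from this snapshot. A hedged sketch of the shape it might take; the table, join column, and status values are assumptions modeled on the public geoeditors datasets mentioned above:)

```sql
-- Hypothetical: restrict a dataset to "active wikis" by joining
-- against a canonical wiki list (names here are illustrative).
SELECT d.*
FROM some_dataset d
JOIN canonical_data.wikis w
  ON d.wiki_db = w.database_code
WHERE w.status = 'open'          -- column and value names are assumptions
  AND w.editability = 'public';
```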
Nov 22 2022
Nov 21 2022
Deciding against Flink, at least for now. Documenting as a decision record here.
Nov 15 2022
Nov 9 2022
Nov 7 2022
@Michael: to add a little more detail on what Joseph said, querying 5 days of webrequest (only text) means moving 5 * 1.3T = 6.5T over the network. So there are two important points here.
Nov 1 2022
I was wrong to think I'd finish this by the end of the week. It's just been a series of errors with no docs to help. Current state: Iceberg is having trouble reading metadata; it seems like it somehow doesn't know how to use HDFS.
Oct 26 2022
Got the basics set up in the Flink SQL client and am updating my code from before. I think I'm going to leave Flink SQL here. The problem is that it has pretty bad actual SQL support (no built-in timestamp functions, for example), so to do the kinds of transformations we need, we'd have to build timestamp-parsing UDFs and the like. I feel that if you're writing Java/Scala anyway, you might as well stay in Java and write the whole job there; that way all the logic is in one place and understanding the code doesn't require understanding multiple environments. Maybe if we do more work to make the Flink SQL environment painless, we can come back to this. For now, a Scala or Python Flink job seems to me the best way forward.
Oct 21 2022
Ok, looks good, please check and let me know. If any other language is ready, just file a task and let us know. We'd have no way of knowing that on our own, because even if all the messages are done, they may not be ready for use.
Thanks @Aftabuzzaman, I didn't know about Bengali. Releasing a new language is a manual process at the moment. I'm building and deploying now. It should be available in the next half hour or so.
Oct 19 2022
I think we should do this. We can limit the pages we look at with the import log, as Neil says, and then just mark all the revisions whose revision IDs are much larger than their parents' (via rev_parent_id) as revision_is_probably_imported.
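The flagging step described above could be sketched roughly like this; the threshold is a made-up illustration, not a vetted value, and the table/column names assume the standard MediaWiki revision schema:

```sql
-- Hypothetical sketch: flag revisions whose id is far ahead of their
-- parent's as probably imported. The 10000 gap is illustrative only.
SELECT
  rev_id,
  rev_parent_id,
  (rev_parent_id IS NOT NULL
   AND rev_id - rev_parent_id > 10000) AS revision_is_probably_imported
FROM revision;
```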