
Milimetric (Dan Andreescu)
Staff Engineer (Data Engineering)

User Details

User Since: Oct 8 2014, 5:48 PM (442 w, 2 d)
Availability: Available
IRC Nick: Milimetric
LDAP User: Milimetric
MediaWiki User: Milimetric (WMF)

Recent Activity

Mon, Mar 27

Milimetric updated the task description for T333004: Setup config to allow lineage instrumentation.
Mon, Mar 27, 8:21 PM · Data Pipelines, Data-Engineering-Planning
Milimetric created T333223: Adding user_is_temp to the user table.
Mon, Mar 27, 8:07 PM · Data-Persistence, Data-Engineering, IP Masking

Fri, Mar 24

Milimetric added a comment to T332805: Decide the prefix character for temporary usernames.

Just to have some fun I counted current usernames that start with all of these proposed prefixes (and I added a goofy proposal of my own, 73^^9):

Fri, Mar 24, 3:26 AM · IP Masking

Thu, Mar 23

Milimetric moved T330200: mediawiki-history-check-denormalize job migration from In Review to Ready to Deploy on the Data Pipelines (sprint 10) board.
Thu, Mar 23, 5:35 PM · Data Pipelines (sprint 10)

Thu, Mar 16

Milimetric moved T324485: [Migration] Oozie jobs for Druid from In Progress to In Review on the Data Pipelines (sprint 10) board.

This is temporarily in review to get opinions on how I handled the interaction between the delayed daily timetable and our datasets idea. It's kind of hard-coded, but I think it's simpler than a more flexible approach. Let me know what you think. (Assigned to Sandra, but anyone is welcome to comment.)

Thu, Mar 16, 9:30 PM · Data Pipelines (Sprint 11), Patch-For-Review
Milimetric added a comment to T327365: Estimate the number of temporary accounts that would be created once IP Masking goes into effect.

Thanks for this context, @Niharika! I think it's ok to wait and see when the temp user cookies roll out how those numbers compare with these. If we wanted to know ahead of time, we could probably devise some kind of sampled test by setting a simple cookie on edits and incrementing it... But I don't think it's worth the trouble.

Thu, Mar 16, 8:38 PM · Product-Analytics (Kanban), IP Masking

Wed, Mar 15

Milimetric added a comment to T332004: Archive analytics/wikistats.

What is the replacement for analytics/wikistats? If it exists, maybe @Nemo_bis can migrate to it rather than relying on an abandoned code base?

Wed, Mar 15, 6:41 PM · Patch-For-Review, Data-Engineering-Wikistats, Analytics-Radar, translatewiki.net, Wikimedia-GitHub, Diffusion-Repository-Administrators, Projects-Cleanup, Data-Engineering

Tue, Mar 14

Milimetric added a comment to T332004: Archive analytics/wikistats.

+1 to archive then, thanks Federico! And maybe let me know if we can do anything in wikistats 2 to help you out.

Tue, Mar 14, 10:56 PM · Patch-For-Review, Data-Engineering-Wikistats, Analytics-Radar, translatewiki.net, Wikimedia-GitHub, Diffusion-Repository-Administrators, Projects-Cleanup, Data-Engineering
Milimetric created T332070: 2 additional new wikis.
Tue, Mar 14, 7:19 PM · Data Pipelines (Sprint 11), Data-Engineering-Planning, Product-Analytics
Milimetric added a comment to T332004: Archive analytics/wikistats.

sorry we left it open so long. I just have to check with Nemo, I'll reply back within a few days.

Tue, Mar 14, 3:40 PM · Patch-For-Review, Data-Engineering-Wikistats, Analytics-Radar, translatewiki.net, Wikimedia-GitHub, Diffusion-Repository-Administrators, Projects-Cleanup, Data-Engineering
Milimetric moved T331898: Update pageview_actor with referrer data from Next Up to Ready to Deploy on the Data Pipelines (sprint 10) board.
Tue, Mar 14, 3:01 PM · Data Pipelines (sprint 10), Product-Analytics

Mon, Mar 13

Milimetric added a comment to T324485: [Migration] Oozie jobs for Druid.

note to self, look at T331892: Move eventlogging_to_druid_ jobs to airflow in order to rely on cluster spark mechanism instead of client

Mon, Mar 13, 5:01 PM · Data Pipelines (Sprint 11), Patch-For-Review
Milimetric moved T329119: 13 new wikis missing from mediawiki_history from In Review to Ready to Deploy on the Data Pipelines (sprint 10) board.
Mon, Mar 13, 4:44 PM · Data Pipelines (sprint 10), Data-Engineering-Planning, Product-Analytics
Milimetric claimed T329119: 13 new wikis missing from mediawiki_history.
Mon, Mar 13, 4:41 PM · Data Pipelines (sprint 10), Data-Engineering-Planning, Product-Analytics
Milimetric updated Other Assignee for T329119: 13 new wikis missing from mediawiki_history, added: Milimetric.
Mon, Mar 13, 4:40 PM · Data Pipelines (sprint 10), Data-Engineering-Planning, Product-Analytics
Milimetric moved T329119: 13 new wikis missing from mediawiki_history from Ready to In Review on the Data Pipelines (sprint 10) board.
Mon, Mar 13, 4:40 PM · Data Pipelines (sprint 10), Data-Engineering-Planning, Product-Analytics
Milimetric added a comment to T328049: Investigate the effects of IP Masking on Data Eng systems.

Quick update from our last conversation:

Mon, Mar 13, 4:11 PM · Data Pipelines (Sprint 11)

Fri, Mar 10

Milimetric moved T329646: [Migration] Oozie Migration jobs for Pageviews dumps from In Review to Ready to Deploy on the Data Pipelines (Sprint 11) board.

Found a blocker: the Druid pageview jobs need pageview_hourly to continue to create _SUCCESS files, but that requirement wasn't spelled out in the migration doc, so it was missed. I'll update the Airflow jobs on Monday and resume deployment.
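
The _SUCCESS dependency described above follows a common convention: a job writes an empty marker file once its output is complete, and downstream jobs wait for that marker before reading. A minimal local-filesystem sketch, assuming hypothetical helper names (`mark_success`, `is_ready`) and a plain directory standing in for an HDFS partition; it is not the actual Airflow or Oozie job code:

```python
from pathlib import Path

# Conventional name of the completion marker file.
SUCCESS_FLAG = "_SUCCESS"


def mark_success(partition_dir: Path) -> Path:
    """Create an empty _SUCCESS marker once a partition is fully written."""
    partition_dir.mkdir(parents=True, exist_ok=True)
    flag = partition_dir / SUCCESS_FLAG
    flag.touch()
    return flag


def is_ready(partition_dir: Path) -> bool:
    """Downstream check: a partition counts as ready only if the marker exists."""
    return (partition_dir / SUCCESS_FLAG).is_file()
```

A downstream job would call `is_ready` on the upstream partition path and skip or retry until it returns `True`.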

Fri, Mar 10, 8:50 PM · Data Pipelines (sprint 10), Patch-For-Review

Thu, Mar 9

Milimetric created T331647: Grant Hal deployment rights.
Thu, Mar 9, 5:11 PM · SRE, SRE-Access-Requests
Milimetric moved T330234: Differential privacy airflow-dags merge request from In Progress to Ready to Deploy on the Data Pipelines (Sprint 11) board.
Thu, Mar 9, 5:03 PM · Data Pipelines (sprint 10), Data-Engineering
Milimetric added a comment to T308617: Page view statistics for the Developer Portal.

@apaskulin: created as site ID 24 (note: 24 backwards is 42! :P )
you can see the tracking code here: https://piwik.wikimedia.org/index.php?module=CoreAdminHome&action=trackingCodeGenerator&idSite=24&period=day&date=yesterday&updated=false

Thu, Mar 9, 4:32 PM · Wikimedia-Developer-Portal

Feb 22 2023

Milimetric moved T330234: Differential privacy airflow-dags merge request from Next Up to In Review on the Data Pipelines (Sprint 11) board.
Feb 22 2023, 6:19 PM · Data Pipelines (sprint 10), Data-Engineering
Milimetric assigned T330234: Differential privacy airflow-dags merge request to Htriedman.
Feb 22 2023, 6:18 PM · Data Pipelines (sprint 10), Data-Engineering

Feb 15 2023

Milimetric moved T329646: [Migration] Oozie Migration jobs for Pageviews dumps from In Progress to In Review on the Data Pipelines (Sprint 08) board.
Feb 15 2023, 5:00 PM · Data Pipelines (sprint 10), Patch-For-Review

Feb 14 2023

Milimetric moved T329646: [Migration] Oozie Migration jobs for Pageviews dumps from Ready to In Progress on the Data Pipelines (Sprint 08) board.
Feb 14 2023, 5:10 PM · Data Pipelines (sprint 10), Patch-For-Review
Milimetric created T329646: [Migration] Oozie Migration jobs for Pageviews dumps.
Feb 14 2023, 5:10 PM · Data Pipelines (sprint 10), Patch-For-Review
Milimetric updated the task description for T324482: [Migration] Oozie Migration jobs for Pageviews.
Feb 14 2023, 5:10 PM · Data Pipelines (sprint 10), Patch-For-Review

Feb 13 2023

Milimetric created T329550: User-centric documentation links.
Feb 13 2023, 5:17 PM · Data-Engineering-Planning

Feb 10 2023

Milimetric added a project to T329398: Puppetize Skein certificate generation: Data-Engineering.
Feb 10 2023, 11:09 PM · Data Pipelines, Data-Engineering-Planning
Milimetric created T329398: Puppetize Skein certificate generation.
Feb 10 2023, 11:08 PM · Data Pipelines, Data-Engineering-Planning

Feb 9 2023

Milimetric moved T327687: Fix broken image on front page of analytics.wikimedia.org from Incoming to Visualize on the Data-Engineering board.

Thanks for catching that! Merged - it will auto-deploy

Feb 9 2023, 2:40 PM · Data-Engineering, Analytics
Milimetric moved T324482: [Migration] Oozie Migration jobs for Pageviews from In Progress to In Review on the Data Pipelines (Sprint 08) board.

I'm putting this in review, but there are three jobs being migrated so I'll send them in separate patches.

Feb 9 2023, 1:17 AM · Data Pipelines (sprint 10), Patch-For-Review

Feb 8 2023

Milimetric added a comment to T324482: [Migration] Oozie Migration jobs for Pageviews.

I'm working on a merge request for this, testing the jobs (it's going slow 'cause I'm on ops week)

Feb 8 2023, 5:17 PM · Data Pipelines (sprint 10), Patch-For-Review

Jan 31 2023

Milimetric moved T324482: [Migration] Oozie Migration jobs for Pageviews from Ready to In Progress on the Data Pipelines (Sprint 07) board.
Jan 31 2023, 5:12 PM · Data Pipelines (sprint 10), Patch-For-Review
Milimetric added a comment to T257893: Request User-Agent Client-Hints on all of MediaWiki's Responses.

+1 to Timo's suggestion. The change required is fairly contained right now.

Jan 31 2023, 3:29 PM · Patch-For-Review, Anti-Harassment, Performance-Team (Radar), Platform Engineering, MediaWiki-General

Jan 30 2023

Milimetric claimed T324482: [Migration] Oozie Migration jobs for Pageviews.

dibs! Yaay :)

Jan 30 2023, 6:16 PM · Data Pipelines (sprint 10), Patch-For-Review
Milimetric added a comment to T321854: Move Spark JsonSchemaConverter out of analytics/refinery/source and into wikimedia-event-utilities.

Oooh, I'd love to work on this

Jan 30 2023, 3:46 PM · Data-Engineering-Planning, Event-Platform Value Stream
Milimetric added a comment to T309769: Expanding External Referrer Tracking.

I ran the UDF on a day's data and extracted the top 1000 referers for that day to show the impact of the GetRefererDataUDF on referers. You can check the spreadsheet and a little doc on it.

Jan 30 2023, 3:18 PM · Data Pipelines (Sprint 08), Metrics-Platform-Planning, Foundational Technology Requests

Jan 26 2023

Milimetric raised the priority of T328049: Investigate the effects of IP Masking on Data Eng systems from High to Needs Triage.
Jan 26 2023, 5:29 PM · Data Pipelines (Sprint 11)
Milimetric added a project to T327982: Add cawiki to clickstream dataset: Data-Engineering.

@EChetty: The old Analytics tag should auto-tag Data-Engineering or be archived/deleted so folks can't use it. I've heard a lot of confusion around the team name lately, and I think the phab tags may be a primary source for that.

Jan 26 2023, 3:37 PM · Data Pipelines, Data-Engineering-Planning, Analytics

Jan 25 2023

Milimetric added a comment to T307883: Editors dataset in Turnilo / Superset.

Note: Per @EChetty, this task is currently deprioritized since there is a plan to reduce tech debt and use the new mediawiki.page-change stream instead of mediawiki_history for building data pipelines. See more at T311129.

Jan 25 2023, 10:25 PM · Product-Analytics, Foundational Technology Requests
Milimetric awarded T58628: Non-mobile UAs on mobile (2g/gprs, etc) IP-blocks a Stroopwafel token.
Jan 25 2023, 10:16 PM · Data Pipelines, Data-Engineering-Planning, Data-Engineering-Wikistats
Milimetric awarded T327431: Request membership (+2) in ContactPage extension group for Dreamy Jazz a Like token.
Jan 25 2023, 10:12 PM · MediaWiki-extensions-ContactPage, Gerrit-Privilege-Requests
Milimetric added a comment to T312566: Emit lineage information about Airflow jobs to DataHub.

lmkwyt

Jan 25 2023, 10:00 PM · Data-Engineering-Planning

Jan 20 2023

Milimetric added a comment to T325256: Document known data quality issues on Wikistats.

just to have this on record: the Wikistats annotation display system was thrown together quickly. It can be easily modified to include ranges, and it probably should be in this case and a few other cases. Annotations are great UX and I still think putting them on wiki is a great way to get the community involved.

Jan 20 2023, 10:07 PM · Data-Engineering-Planning, Data Pipelines

Jan 19 2023

Milimetric created T327447: FYI: Other changes to the CheckUser tables.
Jan 19 2023, 9:07 PM · Data-Engineering-Planning
Milimetric added a comment to T324907: Create separate tables for log events in CheckUser.

@Milimetric: Keeping data engineering in the loop, as with T233004. This task will likely affect your analysis more than T233004 does, as it will split the data stored in cu_changes into three tables (cu_changes, cu_private_event and cu_log_event).

Jan 19 2023, 8:58 PM · MW-1.40-notes (1.40.0-wmf.22; 2023-02-06), Patch-For-Review, Anti-Harassment, Schema-change, CheckUser

Jan 9 2023

Milimetric added a comment to T320966: Prototype Flink job for content Dumps.

note for myself: https://github.com/apache/iceberg/pull/6182/files is recent activity about supporting deletes in future Flink / Iceberg APIs

Jan 9 2023, 9:56 PM · Event-Platform Value Stream (Sprint 04), Data-Engineering-Planning

Jan 6 2023

Milimetric updated the task description for T323642: Spark Streaming Dumps POC: Backfill metadata table.
Jan 6 2023, 8:37 PM · Data Pipelines (Sprint 11), Data-Engineering-Planning
Milimetric updated the task description for T322326: Prototype Spark Streaming Job for Content Dumps.
Jan 6 2023, 8:35 PM · Data Pipelines (Sprint 11), Patch-For-Review

Jan 5 2023

Milimetric added a comment to T95582: it would be useful to run the same Quarry query conveniently in several database.

March: we are generating a proof of concept XML dump, similar to the current one, via a Kafka -> Spark -> Iceberg -> XML pipeline. This depends on Event Platform's content-enriched Kafka topic (page-content-change). It also depends on me getting better at all this, some big moving pieces and lots to learn but I'm getting there. If the Kafka -> Iceberg jump is too challenging, the nice thing here is we can fall back on hourly batches, and that's still ok for lots of use cases, right @bd808?

I am sure that there are folks who would be ok with things stopping at a big pile of data inside of the restricted access production Hadoop cluster. That does not actually move anything forward for Quarry and the general public however.

Jan 5 2023, 5:19 PM · Quarry
Milimetric removed projects from T326330: Update sqoop for CheckUser table: MW-1.40-notes (1.40.0-wmf.18; 2023-01-09), MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), Platform Team Workboards (Clinic Duty Team), Schema-change, CheckUser.
Jan 5 2023, 3:05 PM · Data Pipelines (Sprint 07), Data-Engineering-Planning, Patch-For-Review
Milimetric created T326330: Update sqoop for CheckUser table.
Jan 5 2023, 3:05 PM · Data Pipelines (Sprint 07), Data-Engineering-Planning, Patch-For-Review
Milimetric added a comment to T95582: it would be useful to run the same Quarry query conveniently in several database.

At this point, we are making significant progress on near-real-time dumps generation. If that project continues to work out, we could have an alternate view available for querying, one that would include data from all wikis, identified by perhaps a wiki_db column or similar. Just putting it out there as a possible solution to this very old problem (I haven't forgotten about this, progress here is just... slow)

You are going to end up with a new real-time replicated, redacted RDBMS as a side effect of dumps generation? That sounds awesome, but also not fully believable.

Jan 5 2023, 2:50 PM · Quarry

Jan 4 2023

Milimetric awarded T300164: Pageview Data loss due to wrong version of package installed on some varnishkafka instances a Barnstar token.
Jan 4 2023, 10:16 PM · Data-Engineering-Kanban, Data-Engineering
Milimetric added a comment to T95582: it would be useful to run the same Quarry query conveniently in several database.

At this point, we are making significant progress on near-real-time dumps generation. If that project continues to work out, we could have an alternate view available for querying, one that would include data from all wikis, identified by perhaps a wiki_db column or similar. Just putting it out there as a possible solution to this very old problem (I haven't forgotten about this, progress here is just... slow)

Jan 4 2023, 10:11 PM · Quarry
Milimetric awarded T321169: Create a dashboard from the fsImage Dataset extracted from the HDFS FsImage a Stroopwafel token.
Jan 4 2023, 4:50 PM · Data Pipelines (Sprint 05-06), Patch-For-Review, Technical-Debt
Milimetric added a comment to T308017: Design Schema for page state and page state with content (enriched) streams.

@Ottomata I don't have any preference here, it just occurred to me that you could also work around the $ref problem like this:

Jan 4 2023, 4:46 PM · MW-1.40-notes (1.40.0-wmf.23; 2023-02-13), Event-Platform Value Stream (Sprint 08), Data-Engineering, Patch-For-Review

Jan 3 2023

Milimetric added a comment to T288301: AQS 2.0:Wikistats 2 service.

+1 on T288301#8487410, @BPirkle

Jan 3 2023, 4:22 PM · Epic, AQS 2.0 Roadmap, API Platform (API Platform Roadmap), Data-Engineering, User-Eevans, Platform Engineering Roadmap
Milimetric added a comment to T233004: Update CheckUser for actor and comment table.

[...] We (data engineering) would like to be kept in the loop as closely as possible here.

@Milimetric cuc_actor has been fully populated by now and we can now start reading from cuc_actor instead of cuc_user and cuc_user_text.

Jan 3 2023, 3:44 PM · MW-1.40-notes (1.40.0-wmf.24; 2023-02-20), Patch-For-Review, MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), Data-Engineering, Platform Team Workboards (Clinic Duty Team), Schema-change, CheckUser

Dec 20 2022

Milimetric updated the task description for T323641: Spark Streaming Dumps POC: Backfill content table.
Dec 20 2022, 2:12 PM · Event-Platform Value Stream, Data-Engineering-Planning

Dec 16 2022

Milimetric closed T311315: [Wikistats] Add newly translated languages as Resolved.
Dec 16 2022, 4:53 AM · Data Pipelines, Data-Engineering-Wikistats, Data-Engineering-Planning

Dec 13 2022

Milimetric added a comment to T325065: Quota-requests - Increase glamwikidashboard disk sizes.

@taavi: for a little more context, the dashboard basically keeps track of statistics about access to content provided by GLAMs. The long term plan is to make better APIs that can directly serve their needs, but for now they need to parse mediarequest dumps and store the results. This would be easier in more ways than one on our cloud infrastructure.

Dec 13 2022, 3:51 PM · Cloud-VPS (Quota-requests), API Platform, Foundational Technology Requests
Milimetric created T325072: Grant ssh access to analytics-admins to mnz.
Dec 13 2022, 3:28 PM · SRE, SRE-Access-Requests

Dec 1 2022

Milimetric updated the task description for T323642: Spark Streaming Dumps POC: Backfill metadata table.
Dec 1 2022, 5:35 PM · Data Pipelines (Sprint 11), Data-Engineering-Planning
Milimetric claimed T323642: Spark Streaming Dumps POC: Backfill metadata table.
Dec 1 2022, 5:35 PM · Data Pipelines (Sprint 11), Data-Engineering-Planning
Milimetric moved T323641: Spark Streaming Dumps POC: Backfill content table from Next Up to In Progress on the Event-Platform Value Stream (Sprint 05) board.
Dec 1 2022, 5:29 PM · Event-Platform Value Stream, Data-Engineering-Planning
Milimetric assigned T323641: Spark Streaming Dumps POC: Backfill content table to MunizaA.
Dec 1 2022, 5:29 PM · Event-Platform Value Stream, Data-Engineering-Planning

Nov 30 2022

Milimetric added a comment to T321231: Bug: User History has mismatching order of fields in Parquet vs. Hive.

Works! Not an issue in hive either, @BTullis

Nov 30 2022, 5:55 PM · Patch-For-Review, Data Pipelines (Sprint 05-06), Data-Engineering-Planning

Nov 23 2022

Milimetric added a comment to T314981: Add a webrequest sampled topic and ingest into druid/turnilo.

@Volans asked me, basically, how come count(distinct ip) gives slightly inaccurate results in superset -> druid queries. I didn't know, but found out that Druid by default has: useApproximateCountDistinct: true. See more at: https://support.imply.io/hc/en-us/articles/360056362993-Getting-exact-count-distinct-results-using-druid-SQL. Here's an example, and how to go about getting exact answers without tuning that setting.
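
The example referenced in the comment is not included here. As a rough sketch of the two usual options: rewrite the query so that it groups first and then counts rows, or turn off approximation for that query via the `useApproximateCountDistinct` context flag. The table name `webrequest_sampled` and the helper names below are assumptions for illustration only.

```python
import json


def exact_distinct_sql(table: str, column: str) -> str:
    """Build a Druid SQL query that counts distinct values exactly by
    grouping on the column first and then counting the grouped rows."""
    return (
        f"SELECT COUNT(*) AS exact_count "
        f"FROM (SELECT {column} FROM {table} GROUP BY {column})"
    )


def approx_off_payload(table: str, column: str) -> dict:
    """Build a Druid SQL API request body that keeps COUNT(DISTINCT ...)
    but disables approximation through the query context."""
    return {
        "query": f"SELECT COUNT(DISTINCT {column}) AS exact_count FROM {table}",
        "context": {"useApproximateCountDistinct": False},
    }


if __name__ == "__main__":
    # Hypothetical table and column names.
    print(exact_distinct_sql("webrequest_sampled", "ip"))
    print(json.dumps(approx_off_payload("webrequest_sampled", "ip")))
```

The first form leaves the cluster setting alone, which matches the "without tuning that setting" approach mentioned in the comment. Exact counts can be much more expensive than the approximate default on high-cardinality columns.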

Nov 23 2022, 3:34 PM · Patch-For-Review, Traffic, Data Pipelines, User-fgiunchedi, Data-Engineering-Planning, Foundational Technology Requests
Milimetric added a comment to T323662: NEW FEATURE REQUEST: Dataset with active and non-active Wikis.

Just for anyone that grabs this, we already define "active wikis" and use it in datasets like public geoeditors, the query for a dataset would be something like:

Nov 23 2022, 1:00 AM · Data Pipelines, Data-Engineering-Planning

Nov 22 2022

Milimetric placed T323642: Spark Streaming Dumps POC: Backfill metadata table up for grabs.
Nov 22 2022, 8:15 PM · Data Pipelines (Sprint 11), Data-Engineering-Planning
Milimetric placed T323641: Spark Streaming Dumps POC: Backfill content table up for grabs.
Nov 22 2022, 8:15 PM · Event-Platform Value Stream, Data-Engineering-Planning
Milimetric placed T323645: Spark Streaming Dumps POC: Update iceberg tables up for grabs.
Nov 22 2022, 8:15 PM · Event-Platform Value Stream, Data-Engineering-Planning
Milimetric created T323645: Spark Streaming Dumps POC: Update iceberg tables.
Nov 22 2022, 8:15 PM · Event-Platform Value Stream, Data-Engineering-Planning
Milimetric updated the task description for T323641: Spark Streaming Dumps POC: Backfill content table.
Nov 22 2022, 7:41 PM · Event-Platform Value Stream, Data-Engineering-Planning
Milimetric updated the task description for T323642: Spark Streaming Dumps POC: Backfill metadata table.
Nov 22 2022, 7:40 PM · Data Pipelines (Sprint 11), Data-Engineering-Planning
Milimetric updated the task description for T323641: Spark Streaming Dumps POC: Backfill content table.
Nov 22 2022, 7:40 PM · Event-Platform Value Stream, Data-Engineering-Planning
Milimetric updated the task description for T323641: Spark Streaming Dumps POC: Backfill content table.
Nov 22 2022, 7:40 PM · Event-Platform Value Stream, Data-Engineering-Planning
Milimetric created T323642: Spark Streaming Dumps POC: Backfill metadata table.
Nov 22 2022, 7:40 PM · Data Pipelines (Sprint 11), Data-Engineering-Planning
Milimetric created T323641: Spark Streaming Dumps POC: Backfill content table.
Nov 22 2022, 7:35 PM · Event-Platform Value Stream, Data-Engineering-Planning
Milimetric moved T322326: Prototype Spark Streaming Job for Content Dumps from Next Up to In Progress on the Event-Platform Value Stream (Sprint 04) board.
Nov 22 2022, 7:18 PM · Data Pipelines (Sprint 11), Patch-For-Review

Nov 21 2022

Milimetric closed T320966: Prototype Flink job for content Dumps as Declined.

Deciding against Flink, at least for now. Documenting as a decision record here.

Nov 21 2022, 2:46 PM · Event-Platform Value Stream (Sprint 04), Data-Engineering-Planning
Milimetric updated the task description for T320966: Prototype Flink job for content Dumps.
Nov 21 2022, 2:12 PM · Event-Platform Value Stream (Sprint 04), Data-Engineering-Planning

Nov 15 2022

Milimetric updated the task description for T320966: Prototype Flink job for content Dumps.
Nov 15 2022, 2:10 PM · Event-Platform Value Stream (Sprint 04), Data-Engineering-Planning

Nov 9 2022

Milimetric updated the task description for T320966: Prototype Flink job for content Dumps.
Nov 9 2022, 3:19 PM · Event-Platform Value Stream (Sprint 04), Data-Engineering-Planning
Milimetric updated the task description for T320966: Prototype Flink job for content Dumps.
Nov 9 2022, 3:16 PM · Event-Platform Value Stream (Sprint 04), Data-Engineering-Planning

Nov 7 2022

Milimetric added a comment to T322525: 503 on Superset (reproducible).

@Michael: to add a little more detail on what Joseph said, querying 5 days of webrequest (only text) means moving 5 * 1.3T = 6.5T over the network. So there are two important points here.

Nov 7 2022, 4:41 PM · Data-Engineering-Planning

Nov 1 2022

Milimetric added a comment to T320966: Prototype Flink job for content Dumps.

I was wrong to think I'd finish this by the end of the week. It's just been a series of errors with no docs to help. Current state is Iceberg is having trouble reading metadata, seems like somehow it doesn't know how to use HDFS?

Nov 1 2022, 10:45 PM · Event-Platform Value Stream (Sprint 04), Data-Engineering-Planning

Oct 26 2022

Milimetric added a comment to T321707: Bot Detection.

@Milimetric: See Pageviews-Anomaly and especially T263908. What is the "stats team" exactly?

Oct 26 2022, 5:47 PM · Data-Engineering
Milimetric moved T321707: Bot Detection from Incoming to Datasets on the Data-Engineering board.
Oct 26 2022, 3:49 PM · Data-Engineering
Milimetric added a subtask for T138207: [Open question] Improve bot identification at scale: T321707: Bot Detection.
Oct 26 2022, 3:49 PM · Data-Engineering, Research-Backlog
Milimetric added a parent task for T321707: Bot Detection: T138207: [Open question] Improve bot identification at scale.
Oct 26 2022, 3:48 PM · Data-Engineering
Milimetric created T321707: Bot Detection.
Oct 26 2022, 3:46 PM · Data-Engineering
Milimetric added a comment to T320966: Prototype Flink job for content Dumps.

Got the basics set up in the Flink SQL client. Updating my code from before. I think I'm going to leave Flink SQL here. The problem is that it has pretty bad actual SQL support (no built-in timestamp functions, for example), so to use it for the kinds of transformations we need, we'd have to build timestamp-parsing UDFs and the like. I feel that if you're writing Java/Scala anyway, you might as well stay in Java and write the whole job there. That way at least all the logic is in one place, and understanding the code doesn't require understanding multiple environments. Maybe if we do more work to make the Flink SQL environment painless, we can come back to this. For now, a Scala or Python Flink job seems to me the best way forward.

Oct 26 2022, 3:32 PM · Event-Platform Value Stream (Sprint 04), Data-Engineering-Planning

Oct 21 2022

Milimetric added a comment to T311315: [Wikistats] Add newly translated languages.

Ok, looks good, please check and let me know. If any other language is ready, just file a task and let us know. We'd have no way of knowing that on our own, because even if all the messages are done, they may not be ready for use.

Oct 21 2022, 7:26 PM · Data Pipelines, Data-Engineering-Wikistats, Data-Engineering-Planning
Milimetric added a comment to T311315: [Wikistats] Add newly translated languages.

Thanks @Aftabuzzaman, I didn't know about Bengali. Releasing a new language is a manual process at the moment. I'm building and deploying now. It should be available in the next half hour or so.

Oct 21 2022, 6:54 PM · Data Pipelines, Data-Engineering-Wikistats, Data-Engineering-Planning

Oct 19 2022

Milimetric added a comment to T221482: Identify imported revisions in mediawiki_history.

I think we should do this. We can limit the pages we look at with the import log, as Neil says, and then just mark all the revisions that have much larger revision ids than their parent (via rev_parent_id) as revision_is_probably_imported.
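
A minimal sketch of that heuristic, assuming revisions arrive as dictionaries with `page_id`, `rev_id` and `rev_parent_id` keys. The `min_gap` threshold and the function name are hypothetical; the comment does not say how much larger counts as "much larger".

```python
from typing import Dict, Iterable, List, Set


def flag_probably_imported(
    revisions: Iterable[Dict[str, int]],
    imported_page_ids: Set[int],
    min_gap: int = 1_000_000,  # hypothetical threshold, would need tuning
) -> List[Dict[str, object]]:
    """Return copies of the revisions with a revision_is_probably_imported flag.

    A revision is flagged only if its page appears in the import log and its
    rev_id exceeds its parent's id (rev_parent_id) by more than min_gap.
    """
    flagged = []
    for rev in revisions:
        parent = rev.get("rev_parent_id") or 0  # 0 or missing: no parent
        is_imported = (
            rev["page_id"] in imported_page_ids
            and parent > 0
            and rev["rev_id"] - parent > min_gap
        )
        flagged.append({**rev, "revision_is_probably_imported": is_imported})
    return flagged
```

Restricting the check to pages found in the import log keeps the id-gap test from flagging ordinary pages that simply went unedited for a long time.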

Oct 19 2022, 8:51 PM · Data-Engineering, Product-Analytics
Milimetric created T321231: Bug: User History has mismatching order of fields in Parquet vs. Hive.
Oct 19 2022, 7:09 PM · Patch-For-Review, Data Pipelines (Sprint 05-06), Data-Engineering-Planning

Oct 17 2022

Milimetric created T320966: Prototype Flink job for content Dumps.
Oct 17 2022, 2:31 PM · Event-Platform Value Stream (Sprint 04), Data-Engineering-Planning