
gmodena (GModena (WMF))
User

User Details

User Since
Nov 2 2020, 1:15 PM (108 w, 4 d)
Availability
Available
LDAP User
Gmodena
MediaWiki User
GModena (WMF)

Recent Activity

Thu, Dec 1

gmodena moved T323217: [SPIKE] Evaluate a pyflink version of Mediawiki Stream Enrichment from In Progress to In Review on the Event-Platform Value Stream (Sprint 05) board.
Thu, Dec 1, 8:31 PM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05)
gmodena added a comment to T323217: [SPIKE] Evaluate a pyflink version of Mediawiki Stream Enrichment.

A pyflink implementation of Mediawiki Stream Enrichment has been developed and deployed on YARN. While this implementation did not write to a Kafka topic directly, all enriched messages (48 hours' worth of data) passed jsonschema validation. The Python implementation has feature parity with the Scala one. In particular:

Thu, Dec 1, 8:31 PM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05)

Wed, Nov 30

gmodena created T324144: Flink Tables should have a default ROWTIME column..
Wed, Nov 30, 7:56 PM · Data-Engineering-Planning, Event-Platform Value Stream

Tue, Nov 29

gmodena moved T323217: [SPIKE] Evaluate a pyflink version of Mediawiki Stream Enrichment from Next Up to In Progress on the Event-Platform Value Stream (Sprint 05) board.
Tue, Nov 29, 12:35 PM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05)
gmodena added a comment to T323914: Deploy Mediawiki Stream Enrichment on an-launcher1002..

The cluster and mediawiki stream enrichment job are running at https://yarn.wikimedia.org/proxy/application_1663082229270_434209/#/task-manager/container_e50_1663082229270_434209_01_000002/logs

Tue, Nov 29, 10:50 AM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05)
gmodena moved T323914: Deploy Mediawiki Stream Enrichment on an-launcher1002. from In Progress to In Review on the Event-Platform Value Stream (Sprint 05) board.
Tue, Nov 29, 10:42 AM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05)

Mon, Nov 28

gmodena renamed T323217: [SPIKE] Evaluate a pyflink version of Mediawiki Stream Enrichment from [NEEDS GROOMING][SPIKE} Evaluate a pyflink version of Mediawiki Stream Enrichment to [SPIKE] Evaluate a pyflink version of Mediawiki Stream Enrichment.
Mon, Nov 28, 3:01 PM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05)
gmodena claimed T323914: Deploy Mediawiki Stream Enrichment on an-launcher1002..
Mon, Nov 28, 2:09 PM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05)
gmodena created T323914: Deploy Mediawiki Stream Enrichment on an-launcher1002..
Mon, Nov 28, 2:09 PM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05)
gmodena moved T323217: [SPIKE] Evaluate a pyflink version of Mediawiki Stream Enrichment from In Progress to Next Up on the Event-Platform Value Stream (Sprint 05) board.
Mon, Nov 28, 2:07 PM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05)
gmodena moved T323217: [SPIKE] Evaluate a pyflink version of Mediawiki Stream Enrichment from Next Up to In Progress on the Event-Platform Value Stream (Sprint 05) board.
Mon, Nov 28, 2:04 PM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05)

Wed, Nov 23

gmodena added a comment to T322022: Flink SQL queries should access Kafka topics from a Catalog.

However, once you try to insert something, it gets a bit messy. The Kafka connector only allows you to sink to one topic, so for topics that exist in both eqiad and codfw, there has to be a way to select between them.
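
One way to make the selection explicit is to resolve the concrete sink topic before the table is declared. The sketch below is illustrative only: it assumes topics are named `<datacenter>.<stream>`, and `sink_topic`, `DATACENTERS` and the example stream name are hypothetical names that do not come from the task.

```python
# Hypothetical helper: pick the single concrete Kafka topic to sink to
# when a stream exists once per datacenter.
# Assumption: topics follow the "<datacenter>.<stream>" naming pattern.
DATACENTERS = ("eqiad", "codfw")


def sink_topic(stream: str, datacenter: str) -> str:
    """Return the topic name for `stream` in the given datacenter."""
    if datacenter not in DATACENTERS:
        raise ValueError(f"unknown datacenter: {datacenter!r}")
    return f"{datacenter}.{stream}"


# Example: resolve the topic for a made-up stream name.
topic = sink_topic("mediawiki.page_change", "eqiad")
```

The resolved name could then be passed to the sink table definition, so that each query still writes to exactly one topic.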

Wed, Nov 23, 11:28 AM · Event-Platform Value Stream (Sprint 05), Data-Engineering-Planning
gmodena updated the task description for T322022: Flink SQL queries should access Kafka topics from a Catalog.
Wed, Nov 23, 11:22 AM · Event-Platform Value Stream (Sprint 05), Data-Engineering-Planning
gmodena renamed T322022: Flink SQL queries should access Kafka topics from a Catalog from [NEEDS GROOMING] Flink SQL queries should access Kafka topics from a Catalog to Flink SQL queries should access Kafka topics from a Catalog.
Wed, Nov 23, 11:20 AM · Event-Platform Value Stream (Sprint 05), Data-Engineering-Planning

Thu, Nov 17

gmodena added a comment to T322843: Deploy Mediawiki Stream enrichment on YARN.

Work covered in this phab:

Thu, Nov 17, 11:03 AM · Event-Platform Value Stream (Sprint 04)
gmodena claimed T323217: [SPIKE] Evaluate a pyflink version of Mediawiki Stream Enrichment.
Thu, Nov 17, 8:48 AM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05)

Wed, Nov 16

gmodena updated the task description for T322843: Deploy Mediawiki Stream enrichment on YARN.
Wed, Nov 16, 2:35 PM · Event-Platform Value Stream (Sprint 04)
gmodena moved T322843: Deploy Mediawiki Stream enrichment on YARN from In Progress to In Review on the Event-Platform Value Stream (Sprint 04) board.
Wed, Nov 16, 2:34 PM · Event-Platform Value Stream (Sprint 04)
gmodena created T323217: [SPIKE] Evaluate a pyflink version of Mediawiki Stream Enrichment.
Wed, Nov 16, 1:24 PM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05)
gmodena updated the task description for T320812: [SPIKE] Deploy event driven stateless Flink service to DSE cluster.
Wed, Nov 16, 1:17 PM · Event-Platform Value Stream, Shared-Data-Infrastructure, Data-Engineering-Planning
gmodena updated the task description for T322843: Deploy Mediawiki Stream enrichment on YARN.
Wed, Nov 16, 12:17 PM · Event-Platform Value Stream (Sprint 04)
gmodena updated the task description for T322843: Deploy Mediawiki Stream enrichment on YARN.
Wed, Nov 16, 11:03 AM · Event-Platform Value Stream (Sprint 04)

Tue, Nov 15

gmodena updated the task description for T311084: [Shared Event Platform] Mediawiki Stream Enrichment should consume the consolidated page-change stream..
Tue, Nov 15, 2:10 PM · Event-Platform Value Stream (Sprint 04), Data-Engineering-Planning

Mon, Nov 14

gmodena added a comment to T320812: [SPIKE] Deploy event driven stateless Flink service to DSE cluster.

Hi @gmodena - I believe that since the flink-rdf-streaming-updater service uses the Deployment Pipeline, the decision about whether or not you can publish a one-off docker image based on that ultimately lies with the owners of that repository. I don't have any strong feelings against it, but care is needed. Personally, I would seriously consider trying to rationalize the names of things first and de-couple this requirement from the rdf-streaming-updater somehow.

Mon, Nov 14, 1:44 PM · Event-Platform Value Stream, Shared-Data-Infrastructure, Data-Engineering-Planning

Thu, Nov 10

gmodena claimed T322843: Deploy Mediawiki Stream enrichment on YARN.
Thu, Nov 10, 12:34 PM · Event-Platform Value Stream (Sprint 04)
gmodena created T322843: Deploy Mediawiki Stream enrichment on YARN.
Thu, Nov 10, 12:34 PM · Event-Platform Value Stream (Sprint 04)
gmodena updated subscribers of T320812: [SPIKE] Deploy event driven stateless Flink service to DSE cluster.

Now that a namespace has been created, I'm picking this back up and wanted to sync.

Thu, Nov 10, 12:05 PM · Event-Platform Value Stream, Shared-Data-Infrastructure, Data-Engineering-Planning
gmodena moved T311084: [Shared Event Platform] Mediawiki Stream Enrichment should consume the consolidated page-change stream. from In Progress to In Review on the Event-Platform Value Stream (Sprint 04) board.
Thu, Nov 10, 11:56 AM · Event-Platform Value Stream (Sprint 04), Data-Engineering-Planning

Tue, Nov 8

gmodena added a comment to T321682: Create kubernetes namespace and user for the stream_enrichment PoC project.

[...]

Although I can see the appeal of using a working branch, the safer option is definitely to get it reviewed and approved first.

Tue, Nov 8, 2:54 PM · Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)), Event-Platform Value Stream (Sprint 04)
gmodena added a comment to T321682: Create kubernetes namespace and user for the stream_enrichment PoC project.
Tue, Nov 8, 2:43 PM · Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)), Event-Platform Value Stream (Sprint 04)

Nov 2 2022

gmodena moved T311084: [Shared Event Platform] Mediawiki Stream Enrichment should consume the consolidated page-change stream. from Next Up to In Progress on the Event-Platform Value Stream (Sprint 04) board.
Nov 2 2022, 2:01 PM · Event-Platform Value Stream (Sprint 04), Data-Engineering-Planning

Nov 1 2022

gmodena updated subscribers of T322125: [NEEDS GROOMING] Improve reliability of simple stateless services.
Nov 1 2022, 11:45 AM · Data-Engineering-Planning, Event-Platform Value Stream
gmodena created T322125: [NEEDS GROOMING] Improve reliability of simple stateless services.
Nov 1 2022, 11:43 AM · Data-Engineering-Planning, Event-Platform Value Stream
gmodena moved T320812: [SPIKE] Deploy event driven stateless Flink service to DSE cluster from In Progress to Blocked/Paussed on the Event-Platform Value Stream (Sprint 03) board.
Nov 1 2022, 11:17 AM · Event-Platform Value Stream, Shared-Data-Infrastructure, Data-Engineering-Planning
gmodena added a comment to T320968: Easy Flink Python UDF + SQL enrichment.

Hm! Is this true? I would assume that Flink would have to unzip the virtualenv for every new taskmanager, but not every time you execute the UDF?

I should rephrase that, since I just assumed a single UDF is being called per query. If, let's say, you have 2 UDFs get_images and get_backlinks and did SELECT get_images(1221227) as images, get_backlinks(1221227) as backlinks;, then yes, from what I can tell it creates a single pyflink gateway and executes both UDFs in the same Python environment. However, it doesn't seem like a UDF can be 'warmed up' in advance, so calling SELECT get_images(1221227) as images; twice in a row will bootstrap the Python environment twice, making it annoyingly time-consuming when messing around in the SQL Client CLI.

Nov 1 2022, 9:27 AM · Event-Platform Value Stream (Sprint 04), Spike, Data-Engineering-Planning
gmodena added a comment to T320968: Easy Flink Python UDF + SQL enrichment.

Can you elaborate on that? I thought the executable is venv/bin/python3

Uh hm. I just checked and I also see the executable in a virtualenv. I must be remembering incorrectly! I thought this was one of the whole reasons we embarked on the conda support instead...but I guess not? Perhaps it has more to do with dynamically linked binary dependencies? Or maybe the virtualenv's un-relocatability?

Okay well scratch that reason then.

Nov 1 2022, 9:18 AM · Event-Platform Value Stream (Sprint 04), Spike, Data-Engineering-Planning

Oct 31 2022

gmodena added a comment to T321682: Create kubernetes namespace and user for the stream_enrichment PoC project.

Folks, one suggestion - we should aim to use the Deployment Pipeline (https://wikitech.wikimedia.org/wiki/Deployment_pipeline) and GitOps as much as possible, because it is really easy to kubectl apply things and forget to remove them. I am fully aware that we need to test first, and I support it 100%, but let's be organized and have only a few people at a time operating via kubectl. I'd also suggest having clear rollback procedures in place, and a way to track what was applied and what wasn't.

Oct 31 2022, 3:21 PM · Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)), Event-Platform Value Stream (Sprint 04)
gmodena updated the task description for T322022: Flink SQL queries should access Kafka topics from a Catalog.
Oct 31 2022, 1:51 PM · Event-Platform Value Stream (Sprint 05), Data-Engineering-Planning
gmodena created T322022: Flink SQL queries should access Kafka topics from a Catalog.
Oct 31 2022, 12:32 PM · Event-Platform Value Stream (Sprint 05), Data-Engineering-Planning
gmodena added a comment to T321682: Create kubernetes namespace and user for the stream_enrichment PoC project.
Oct 31 2022, 11:37 AM · Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)), Event-Platform Value Stream (Sprint 04)
gmodena added a comment to T320812: [SPIKE] Deploy event driven stateless Flink service to DSE cluster.

[...]

  1. A chart for a service that submits a job to a session cluster. The pod itself simply acts as a listener to a heartbeat (e.g. same thing we do on stat machines within a tmux session).
  2. A chart with an application deployment cluster that bundles the stateless service jar as well as Flink itself.
Oct 31 2022, 11:23 AM · Event-Platform Value Stream, Shared-Data-Infrastructure, Data-Engineering-Planning

Oct 27 2022

bking awarded T315428: [SPIKE] Assess what is required for the enrichment pipeline to run on k8s a Love token.
Oct 27 2022, 4:45 PM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 01), Spike
gmodena added a comment to T320812: [SPIKE] Deploy event driven stateless Flink service to DSE cluster.

Here's a summary of discussions I had with folks currently involved with Flink and k8s.

Oct 27 2022, 12:51 PM · Event-Platform Value Stream, Shared-Data-Infrastructure, Data-Engineering-Planning

Oct 26 2022

gmodena added a comment to T320812: [SPIKE] Deploy event driven stateless Flink service to DSE cluster.

Depends on https://phabricator.wikimedia.org/T321682.

Oct 26 2022, 12:56 PM · Event-Platform Value Stream, Shared-Data-Infrastructure, Data-Engineering-Planning
gmodena added a comment to T320968: Easy Flink Python UDF + SQL enrichment.

To be fair, the actual idea is easy enough to implement for simple mappings

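from pyflink.table import DataTypes
from pyflink.table.types import DataType

def python_to_flink_datatype(val: type) -> DataType:
    # Map a plain Python type to the corresponding Flink SQL data type.
    if val is str:
        return DataTypes.STRING()
    elif val is int:
        return DataTypes.INT()
    elif val is bool:
        return DataTypes.BOOLEAN()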
Oct 26 2022, 12:44 PM · Event-Platform Value Stream (Sprint 04), Spike, Data-Engineering-Planning
gmodena added a comment to T320968: Easy Flink Python UDF + SQL enrichment.

Mapping python to SQL will be tricky, since as you point out there is no 1:1 relationship (floating-point and decimal will be funky too). The db-api doc could give some pointers on how database drivers approach this; but AFAIK there is not one standard way to do this mapping. E.g. SQL Server vs PostgreSQL.
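
To make the "no 1:1 relationship" point concrete, here is a minimal sketch that derives SQL type names from a function's type annotations, using only the standard library. Everything in it is an assumption for illustration: the `PYTHON_TO_SQL` table, the helper names, and in particular the choices for `float` and `Decimal` (including the precision and scale) are one possible convention, not a standard mapping.

```python
from decimal import Decimal
from typing import Callable, Dict, get_type_hints

# Hypothetical mapping from Python annotations to SQL type names.
# The float and Decimal entries are assumptions; other choices are valid.
PYTHON_TO_SQL = {
    str: "STRING",
    int: "INT",
    bool: "BOOLEAN",
    float: "DOUBLE",
    Decimal: "DECIMAL(38, 18)",
}


def sql_type_for(py_type: type) -> str:
    """Look up the SQL type name, failing loudly on unmapped types."""
    try:
        return PYTHON_TO_SQL[py_type]
    except KeyError:
        raise TypeError(f"no SQL mapping for {py_type!r}") from None


def sql_signature(func: Callable) -> Dict[str, str]:
    """Map each annotated parameter (and the return value) to a SQL type."""
    return {name: sql_type_for(tp) for name, tp in get_type_hints(func).items()}


# Made-up function used only to exercise the mapping.
def get_page_title(page_id: int, lang: str) -> str:
    return f"{lang}:{page_id}"


signature = sql_signature(get_page_title)
```

Failing on unmapped types, rather than guessing, keeps the ambiguous cases visible to whoever writes the annotation.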

Oct 26 2022, 12:17 PM · Event-Platform Value Stream (Sprint 04), Spike, Data-Engineering-Planning
gmodena added a comment to T320968: Easy Flink Python UDF + SQL enrichment.

Thanks for this write up @Ottomata!

Oct 26 2022, 9:44 AM · Event-Platform Value Stream (Sprint 04), Spike, Data-Engineering-Planning

Oct 24 2022

gmodena added a comment to T320812: [SPIKE] Deploy event driven stateless Flink service to DSE cluster.

Related https://phabricator.wikimedia.org/T321491

Oct 24 2022, 2:46 PM · Event-Platform Value Stream, Shared-Data-Infrastructure, Data-Engineering-Planning
gmodena added a comment to T320812: [SPIKE] Deploy event driven stateless Flink service to DSE cluster.

Related https://phabricator.wikimedia.org/T318535 https://phabricator.wikimedia.org/T318712

Oct 24 2022, 12:48 PM · Event-Platform Value Stream, Shared-Data-Infrastructure, Data-Engineering-Planning

Oct 20 2022

gmodena moved T320812: [SPIKE] Deploy event driven stateless Flink service to DSE cluster from Next Up to In Progress on the Event-Platform Value Stream (Sprint 03) board.
Oct 20 2022, 10:37 AM · Event-Platform Value Stream, Shared-Data-Infrastructure, Data-Engineering-Planning

Oct 14 2022

gmodena added a comment to T318856: [SPIKE] Build simple stateless service using Flink SQL.
Oct 14 2022, 9:22 AM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 03), Spike
gmodena added a comment to T318856: [SPIKE] Build simple stateless service using Flink SQL.

A summary of this spike, and evaluation of the approach, can be found at https://www.mediawiki.org/wiki/Platform_Engineering_Team/Event_Platform_Value_Stream/Build_simple_stateless_service_using_Flink_SQL

Oct 14 2022, 6:23 AM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 03), Spike
gmodena moved T318856: [SPIKE] Build simple stateless service using Flink SQL from In Progress to In Review on the Event-Platform Value Stream (Sprint 02) board.
Oct 14 2022, 6:22 AM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 03), Spike

Oct 13 2022

gmodena updated subscribers of T303543: eventgate chart should use common_templates.

The patch has been reviewed and merged.

Oct 13 2022, 2:03 PM · Event-Platform Value Stream (Sprint 03), Data-Engineering-Planning, Data-Engineering-Kanban, Patch-For-Review, SRE, serviceops
gmodena moved T303543: eventgate chart should use common_templates from In Review to Ready to Deploy on the Event-Platform Value Stream (Sprint 02) board.
Oct 13 2022, 1:59 PM · Event-Platform Value Stream (Sprint 03), Data-Engineering-Planning, Data-Engineering-Kanban, Patch-For-Review, SRE, serviceops
gmodena added a comment to T318856: [SPIKE] Build simple stateless service using Flink SQL.

Flink has an interface that implements Lookup Join semantics. I managed to implement it in a connector to asynchronously query HTTP endpoints.

Oct 13 2022, 12:38 PM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 03), Spike

Oct 3 2022

gmodena added a comment to T319214: Evaluate Benthos as stream processor.

Thanks for the ping, @JAllemandou.

Oct 3 2022, 5:54 PM · Patch-For-Review, Event-Platform Value Stream, Data-Engineering-Planning, Observability-Logging, Machine-Learning-Team, observability
gmodena moved T318856: [SPIKE] Build simple stateless service using Flink SQL from Next Up to In Progress on the Event-Platform Value Stream (Sprint 02) board.
Oct 3 2022, 8:12 AM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 03), Spike

Sep 22 2022

gmodena moved T303543: eventgate chart should use common_templates from Blocked/Paussed to In Review on the Event-Platform Value Stream (Sprint 01) board.
Sep 22 2022, 6:51 PM · Event-Platform Value Stream (Sprint 03), Data-Engineering-Planning, Data-Engineering-Kanban, Patch-For-Review, SRE, serviceops
gmodena moved T303543: eventgate chart should use common_templates from In Progress to Blocked/Paussed on the Event-Platform Value Stream (Sprint 01) board.
Sep 22 2022, 6:51 PM · Event-Platform Value Stream (Sprint 03), Data-Engineering-Planning, Data-Engineering-Kanban, Patch-For-Review, SRE, serviceops

Sep 21 2022

gmodena moved T303543: eventgate chart should use common_templates from Blocked/Paussed to In Progress on the Event-Platform Value Stream (Sprint 01) board.
Sep 21 2022, 10:37 AM · Event-Platform Value Stream (Sprint 03), Data-Engineering-Planning, Data-Engineering-Kanban, Patch-For-Review, SRE, serviceops

Sep 12 2022

gmodena moved T303543: eventgate chart should use common_templates from In Progress to Blocked/Paussed on the Event-Platform Value Stream (Sprint 01) board.
Sep 12 2022, 7:46 PM · Event-Platform Value Stream (Sprint 03), Data-Engineering-Planning, Data-Engineering-Kanban, Patch-For-Review, SRE, serviceops
gmodena moved T310721: eventstreams chart should use latest common_templates from Next Up to In Progress on the Event-Platform Value Stream (Sprint 01) board.
Sep 12 2022, 7:46 PM · Event-Platform Value Stream (Sprint 02), Patch-For-Review, Data-Engineering, SRE, serviceops
gmodena added a comment to T303543: eventgate chart should use common_templates.

Hi - what is the status of the linked CR?

Sep 12 2022, 7:19 PM · Event-Platform Value Stream (Sprint 03), Data-Engineering-Planning, Data-Engineering-Kanban, Patch-For-Review, SRE, serviceops
gmodena updated the task description for T311084: [Shared Event Platform] Mediawiki Stream Enrichment should consume the consolidated page-change stream..
Sep 12 2022, 12:39 PM · Event-Platform Value Stream (Sprint 04), Data-Engineering-Planning
gmodena updated the task description for T311084: [Shared Event Platform] Mediawiki Stream Enrichment should consume the consolidated page-change stream..
Sep 12 2022, 12:37 PM · Event-Platform Value Stream (Sprint 04), Data-Engineering-Planning

Sep 8 2022

gmodena added a comment to T315674: Remove materialized .json files from event schema repositories.

I generally lean towards @Tgr's opinion of yaml as a format, and @DLynch's opinion of yaml not being all that readable. In our case here, yaml is being de-referenced and re-generated by a yaml dumper, so it seems unlikely that we'd hit problems with the format itself. It's kind of a subset of yaml that seems safe (this is hand-wavy; we should maybe look at it more closely).

My main point here is that I think we need a UI anyway.

Sep 8 2022, 2:57 PM · Event-Platform Value Stream (Sprint 02), Data-Engineering-Planning

Sep 7 2022

gmodena moved T303543: eventgate chart should use common_templates from Next Up to In Progress on the Event-Platform Value Stream (Sprint 01) board.
Sep 7 2022, 8:26 AM · Event-Platform Value Stream (Sprint 03), Data-Engineering-Planning, Data-Engineering-Kanban, Patch-For-Review, SRE, serviceops

Sep 6 2022

gmodena added a comment to T315674: Remove materialized .json files from event schema repositories.

I'll echo my comment in https://phabricator.wikimedia.org/T308450#8214099.

Sep 6 2022, 2:26 PM · Event-Platform Value Stream (Sprint 02), Data-Engineering-Planning
gmodena added a comment to T308450: [BUG] jsonschema-tools materializes fields in yaml in a different order than in json files.

@JAllemandou @Milimetric @phuedx ...what do you think about removing the .json files from the schema repositories altogether? I don't think we really use them, and maintaining both .json and .yaml files might be a little confusing. @gmodena has told me he's for removing the .json files.

Sep 6 2022, 1:20 PM · Event-Platform Value Stream (Sprint 02), Patch-For-Review, Data-Engineering-Kanban
gmodena added a comment to T308017: Design Schema for page state and page state with content (enriched) streams.

Adding idea discussed with @Ottomata earlier on. It's probably interesting to separate streams by project, to allow optimal reading for both all-projects readers and single-project readers.

Sep 6 2022, 10:33 AM · Event-Platform Value Stream, Data-Engineering, Patch-For-Review

Sep 1 2022

gmodena updated subscribers of T316557: [SPIKE] Investigate building out automated data pipeline job for Similarusers service.

Thanks for the overview, @xcollazo. To the best of my knowledge, this captures the project status well.

Sep 1 2022, 9:43 AM · Data Pipelines (Sprint 01)

Aug 30 2022

gmodena updated the task description for T311084: [Shared Event Platform] Mediawiki Stream Enrichment should consume the consolidated page-change stream..
Aug 30 2022, 8:21 PM · Event-Platform Value Stream (Sprint 04), Data-Engineering-Planning
gmodena added a comment to T316555: Use RowTypeInfo to ensure better validation of the event data within the Mediawiki Stream Enrichment pipeline.

These changes have been merged into https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-stream-enrichment/-/merge_requests/12, which is part of https://phabricator.wikimedia.org/T311084.

Aug 30 2022, 8:19 PM · Event-Platform Value Stream (Sprint 00), Data-Engineering
gmodena moved T316555: Use RowTypeInfo to ensure better validation of the event data within the Mediawiki Stream Enrichment pipeline from In Progress to In Review on the Event-Platform Value Stream (Sprint 00) board.
Aug 30 2022, 8:17 PM · Event-Platform Value Stream (Sprint 00), Data-Engineering
gmodena updated the task description for T311084: [Shared Event Platform] Mediawiki Stream Enrichment should consume the consolidated page-change stream..
Aug 30 2022, 10:04 AM · Event-Platform Value Stream (Sprint 04), Data-Engineering-Planning
gmodena updated the task description for T311084: [Shared Event Platform] Mediawiki Stream Enrichment should consume the consolidated page-change stream..
Aug 30 2022, 10:03 AM · Event-Platform Value Stream (Sprint 04), Data-Engineering-Planning
gmodena moved T316555: Use RowTypeInfo to ensure better validation of the event data within the Mediawiki Stream Enrichment pipeline from Next Up to In Progress on the Event-Platform Value Stream (Sprint 00) board.
Aug 30 2022, 10:01 AM · Event-Platform Value Stream (Sprint 00), Data-Engineering
gmodena updated the task description for T315428: [SPIKE] Assess what is required for the enrichment pipeline to run on k8s.
Aug 30 2022, 9:13 AM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 01), Spike
gmodena updated the task description for T315428: [SPIKE] Assess what is required for the enrichment pipeline to run on k8s.
Aug 30 2022, 9:09 AM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 01), Spike
gmodena added a comment to T315428: [SPIKE] Assess what is required for the enrichment pipeline to run on k8s.

For reference, some resources on how Google and Spotify are operating Flink on k8s:

Aug 30 2022, 9:07 AM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 01), Spike
gmodena added a comment to T314389: [SPIKE] Decide on technical solution for page state stream backfill process.

For now, I'll start experimenting with Spark + Iceberg unless there's some major objection. I also read in some Slack threads about Spark 3.0 being on the horizon for us, which would definitely make Iceberg more appealing.

Aug 30 2022, 9:00 AM · Data-Engineering, Event-Platform Value Stream (Sprint 00), Spike
gmodena moved T316555: Use RowTypeInfo to ensure better validation of the event data within the Mediawiki Stream Enrichment pipeline from Backlog to Sprint 00 on the Event-Platform Value Stream board.
Aug 30 2022, 8:30 AM · Event-Platform Value Stream (Sprint 00), Data-Engineering

Aug 29 2022

gmodena updated the task description for T315428: [SPIKE] Assess what is required for the enrichment pipeline to run on k8s.
Aug 29 2022, 3:16 PM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 01), Spike
gmodena added a comment to T315409: Access request to analytics system(s) for TThoabala.

Hey @Jelto - it's a notebook like the one described in https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter#Access.

Aug 29 2022, 3:04 PM · SRE, SRE-Access-Requests, Data-Engineering-Operations, Data-Engineering
gmodena created T316555: Use RowTypeInfo to ensure better validation of the event data within the Mediawiki Stream Enrichment pipeline.
Aug 29 2022, 2:54 PM · Event-Platform Value Stream (Sprint 00), Data-Engineering
gmodena renamed T315428: [SPIKE] Assess what is required for the enrichment pipeline to run on k8s from [SPIKE] Assess what is required for the enrichment pipline to run on k8 to [SPIKE] Assess what is required for the enrichment pipeline to run on k8.
Aug 29 2022, 2:27 PM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 01), Spike
gmodena moved T315428: [SPIKE] Assess what is required for the enrichment pipeline to run on k8s from In Progress to In Review on the Event-Platform Value Stream (Sprint 00) board.
Aug 29 2022, 2:22 PM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 01), Spike
gmodena added a comment to T315428: [SPIKE] Assess what is required for the enrichment pipeline to run on k8s.

I explored adapting the k8s workshop to Apache Flink. It boils down to running Flink on minikube. This can be done locally, without the need for a Cloud VPS VM.

Aug 29 2022, 12:35 PM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 01), Spike
gmodena renamed T315428: [SPIKE] Assess what is required for the enrichment pipeline to run on k8s from [SPIKE][NEEDS GROOMING] Assess what is required for the enrichment pipline to run on k8 to [SPIKE] Assess what is required for the enrichment pipline to run on k8.
Aug 29 2022, 12:06 PM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 01), Spike
gmodena updated the task description for T311084: [Shared Event Platform] Mediawiki Stream Enrichment should consume the consolidated page-change stream..
Aug 29 2022, 11:09 AM · Event-Platform Value Stream (Sprint 04), Data-Engineering-Planning
gmodena added a comment to T298105: ImageMatching algo implementation does not support concurrent writes.

This code base was part of a completed PoC, and is not in use or development anymore.

Aug 29 2022, 9:20 AM · Generated Data Platform

Aug 24 2022

gmodena added a comment to T275551: Using docker in WMF production network outside of kubernetes.

will it be possible to consume e.g. events from kafka infra, or read/write to swift?

Nopers :/

Is this the recommended way for running containers for non-local development? cc @Ottomata

cc @gmodena but I don't think so. Maybe?

Aug 24 2022, 7:34 PM · serviceops, Analytics-Radar, Machine-Learning-Team
gmodena added a comment to T315409: Access request to analytics system(s) for TThoabala.

Is this monthly data dump script something that runs in Hadoop or perhaps on the stat boxes? If so, analytics-privatedata-users is correct, and I approve! :)

I believe it is yes - I think @gmodena could confirm this.

Aug 24 2022, 6:32 PM · SRE, SRE-Access-Requests, Data-Engineering-Operations, Data-Engineering

Aug 23 2022

gmodena added a comment to T308017: Design Schema for page state and page state with content (enriched) streams.
Aug 23 2022, 10:18 AM · Event-Platform Value Stream, Data-Engineering, Patch-For-Review

Aug 18 2022

gmodena updated the task description for T315428: [SPIKE] Assess what is required for the enrichment pipeline to run on k8s.
Aug 18 2022, 10:48 AM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 01), Spike

Aug 17 2022

gmodena added a comment to T314389: [SPIKE] Decide on technical solution for page state stream backfill process.

Would we? stream "generic" code I can think of would mostly be HTTP callbacks to the Action API. If possible, I think we should avoid them at backfill time, and hit HDFS instead.

Perhaps we can make the function used to get the content pluggable and parameterized? Then we could provide 3 parameters to the Flink job:

Aug 17 2022, 6:32 PM · Data-Engineering, Event-Platform Value Stream (Sprint 00), Spike
gmodena added a comment to T314389: [SPIKE] Decide on technical solution for page state stream backfill process.
Aug 17 2022, 3:31 PM · Data-Engineering, Event-Platform Value Stream (Sprint 00), Spike