- User Since
- Jun 21 2021, 2:34 PM (101 w, 1 d)
- LDAP User
- MediaWiki User
- TChin (WMF) [ Global Accounts ]
Thu, May 25
Wed, May 24
Mon, May 8
Sun, May 7
Oof, was looking at how to potentially mock the http session and response object, but turns out mocks don't work when pickled/multiprocessed. I guess the only option is to spin up a web server during testing and hit that instead
Fri, May 5
Do we know what's turning them into ecs format in the first place?
Thu, May 4
Tue, May 2
Apr 19 2023
Apr 18 2023
Apr 17 2023
Don't know how to connect gitlab merge requests to phab but here's the link for posterity's sake:
Bundle Java jars when building wheel
Apr 10 2023
A less opaque place for inspiration is that the py4j library bundles the jar with its python wheel as well. They leave adding it to the classpath to the user though. I actually don't see how we'd include the jars in the classpath without injecting them at runtime. Does something in pyflink automatically find it?
Apr 3 2023
Mar 29 2023
Mar 22 2023
Mar 16 2023
Mar 15 2023
I don't think I'm fully understand what the options are for. You can set watermarks for tables in the catalog by doing something like
CREATE TABLE with_watermark ( event_time AS meta['dt'], WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND ) LIKE some_table;
or with kafka metadata
CREATE TABLE with_kafka_timestamp ( event_time TIMESTAMP(3) METADATA FROM 'timestamp' VIRTUAL, WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND ) LIKE some_table;
Mar 14 2023
Mar 7 2023
Feb 13 2023
Looking back at the code, it seems like the only thing that needs to be moved (moved is a generous term) is the cassandra sink. Should probably just think about implementing a cassandra sink builder like the kafka builder in event utilities. The code to make the source is also now covered by event utilities eventDataStreamFactory.kafkaSourceBuilder so that can be tossed completely. The rest of the stuff is specific to the pipeline.
Feb 6 2023
This is so hard to describe through text I just made a miro board to try to logic out all the behavior. Take a look at it if you want and add comments as you see fit
Jan 31 2023
After testing more, it seems better to explicitly define options event-stream-name and event-stream-prefix instead of trying to derive the stream name and prefix from the table name. We need the unprefixed stream name to look up the schema, and the prefix is only needed if the user is trying to insert into kafka.
Jan 29 2023
Jan 21 2023
Some updates after some discussion
Dec 20 2022
Perhaps the end goal would have a user experience like:
Dec 14 2022
After experimenting a bit, I was able to get the catalog to override the schema option of specific tables using
ALTER TABLE `mediawiki.api-request` SET ('schema'='0.0.1');
Dec 13 2022
Are these so bad? Could we default to the latest version for both sources and sinks, but allow SQL hints to override?
They're mostly just wordy. If we don't really care, then it's actually not bad at all.
INSERT INTO `eventgate-main.test.event` /*+ OPTIONS('schema'='1.0.0') */ (`test`, `test_map`) VALUES ('test_from_catalog', MAP['test_key', 'test_val']);
Dec 12 2022
Trying to create different functionalities when sourcing/sinking is like trying to fit a square peg into a round hole, due to the fact that you can only really specify global options for all tables created by the catalog.
Dec 6 2022
Nov 30 2022
The catalog is now able to sink to a specific prefixed topic by overriding the ResolvedCatalogTable before passing it to the Kafka connector
Nov 29 2022
Here's the working code so far, sans the stuff I talk about below
Nov 23 2022
I was able to implement a flink catalog that acts as an options passthrough to the built-in kafka connector and uses eventutilities to dynamically create the tables with schemas and with sensible defaults. So something like
CREATE CATALOG wmfeventcatalog WITH ( 'type' = 'wmfeventcatalog', 'properties.group.id' = 'catalog-test' );
Will let you use a table like eventgate-main.test.event as if you made it like
CREATE TABLE `eventgate-main.test.event` ( $schema STRING, meta ROW<...>, test STRING, test_map MAP<STRING, STRING> ) WITH ( 'connector' = 'kafka', 'format' = 'json', 'topic' = 'eqiad.eventgate-main.test.event;codfw.eventgate-main.test.event', 'properties.group.id' = 'catalog-test', 'properties.bootstrap.servers' = 'kafka-jumbo1001.eqiad.wmnet:9092', 'scan.startup.mode' = 'latest-offset', 'json.timestamp-format.standard' = 'ISO-8601' );
Nov 15 2022
Nov 9 2022
(When it comes to this task of making an example reading/writing with Flink SQL and a UDF, With Andrew's example and also a more simplified example in the example repo, this can be marked as done; although there are still good conversations here)
I think the custom type inference in Java/Scala is really powerful, but if someone already is at a point where they're writing UDFs in Java then they probably already having working knowledge of DataTypes, When it comes to UX for people who only want to write python it might be worth just reimplementing JsonSchemaFlinkConverter, although I don't know how difficult that would be. Having something like
looks very sleek. (although it feels a bit wrong to have 2 codebases that do the same thing)
Nov 8 2022
So I was using Kafka Client 3.2.3, but I noticed you were using 2.4.1. Switched to that and it solves the cluster authorization issue. Gonna have to note that somewhere
Nov 7 2022
The exact error I get is org.apache.kafka.common.errors.ClusterAuthorizationException: Cluster authorization failed when trying to produce to a topic. Tried the test topic and the platform-wiki-image-links topic from Gabriele's first event POC. Consuming from the topic works, so I wonder if there's different default permissions for producing/consuming.
Nov 4 2022
Huh we've been doing everything on Yarn so far so I guess I overlooked this, but if I wanted to produce to Kafka or Hadoop using the SQL Cli I would need to generate a Kerberos keytab. Yarn is the only deployment method where you can forego the keytab and use the ticket cache. Looking at wikitech, it seems like generating a keytab is a non-trivial task? Or at least it's something I don't have permission to do. I feel like this is starting to get into the 'figuring out deployment' issue. @Ottomata
Nov 1 2022
The UDFs appear to be being executing inside of a process.
Oct 31 2022
A problem with virtualenvs is that they don't include the python executable.
Can you elaborate on that? I thought the executable is venv/bin/python3
Oct 29 2022
I wrote out the steps on how to package a python virtual environment so that people can use external dependencies in the UDFs
Oct 26 2022
To be fair, the actual idea is easy enough to implement for simple mappings
def python_to_flink_datatype(val: type) -> DataType: if val is str: return DataTypes.STRING() elif val is int: return DataTypes.INT() elif val is bool: return DataTypes.BOOLEAN()
I definitely feel like the biggest issue here is how we'd map from python types to pyflink DataTypes. Would a python int turn into a DataTypes.INT or perhaps a DataTypes.BIGINT? It's hard to say how well abstracting away the types would go considering the DataTypes are supposed to represent the columns that are being sunk to.
Oct 24 2022
Oct 14 2022
Here's the repo with example datastream and table equivalent. It reads from the mediawiki.page-create and then uses its page_id to fetch the list of images on the page from the action api. I'm also working on a summary writeup of what I've experienced
Oct 4 2022
Sep 20 2022
Sep 8 2022
Sep 1 2022
Aug 29 2022
Ok so to summarize:
Aug 17 2022
Aug 16 2022
- Are we backfilling both the page state change stream and/or the one with content?
- Do we want both the full history and/or a compacted one with only the most recent revision?
- The two obvious options for backfill is either Spark or Flink
- Upside of Flink is that we can potentially reuse code written for the stream
- Upside of Spark is that it's more established and there are more people in the foundation who know how to use and support it
- Also, depending on what the page state schema will contain, we might have to join the wikitext history table (avro) with the mediawiki history table (parquet)
- Flink 1.15 does not have full parquet support (can't read complex data types)
- Flink + iceberg is a thing, but I haven't tried it yet. It does seem to have a library for parquet support
- Tangent: I've heard iceberg mentioned before, but how does it factor into the rest of the foundation's tech stack? A superficial search on Wikitech brought up nothing
Jul 22 2022
Jun 30 2022
I made it work with this:
ParameterTool will let you have external config, but, it is not a Configuration object, so you'd need to convert it by doing something like:
val env = StreamExecutionEnvironment.getExecutionEnvironment val parameters = ParameterTool.fromPropertiesFile("config.properties") val config = Configuration.fromMap(parameters.toMap)
Jun 13 2022
Is the user column under the feedback table supposed to be text? The feedback event schema currently outputs a user_id instead so I'm wondering if it's supposed to be transformed into a username or if the Cassandra table needs to be updated
Mar 31 2022
Mar 4 2022
I can sort of get around this issue if I go to the the project group page (in my case repos/api-platform) > New project > Import project > import by URL or by GitLab exports. That seems like the only time the project group shows up as an option in the dropdown for me. It doesn't show up if I use any other import options, or if I get to the import page through the '+' button on the website header instead of the 'New project' button on the project group page.
Feb 22 2022
Gitlab group request ticket -> T301164
Feb 16 2022
Feb 7 2022
Jan 25 2022
Jan 12 2022
Forgot to link the patch that replaces most of the easily replaceable ones. T297797 lists places where it wasn't easily removable.
Jan 7 2022
I basically just converted every global $wgGlobal into MediaWikiServices::getInstance()->getMainConfig()->get( 'Global' ) to see what blows up and then backtracked from there.