Reason for access: need to query search usage via Jupyter for Structured Data pipelines.
I'm not sure if analytics-platform-eng-admins is the correct group for this. I think you want analytics-privatedata-users with Kerberos access.
BTW, there may be other, better ways to do this than a custom serialization format. Please comment / update the description with findings.
Writing down some ideas and thoughts from today's talk with @gmodena:
Thank you both!
It might be easier to either leave these files in place and revisit this in February, or to have them moved under your ownership. When we archive, we zip everything up and put it in HDFS. We can of course get it back, but maybe it is easier to be able to browse them as usual?
+1, sounds good. Let's decline this task then. We can reopen or make a new one if/when bscarone leaves again :)
@Dendelele, please approve the removal of the following files:
There are no leftover data files owned by nikafor in the analytics cluster. nikafor's hdfs and regular homedirs have already been removed.
@Miriam, please approve for removal of the following files and Hive tables.
Alright! Leaving this task open for now then.
@mark, please approve for removal of the following files:
There are no leftover data files owned by ejoseph in the analytics cluster.
There are no leftover data files owned by dpifke in the analytics cluster.
Hi @jrobell1, the following files are leftover in eyener's home directories on the stat boxes. Do you approve their removal? We can archive things that need to be kept, but we'd prefer to remove.
Mon, Nov 28
I also merged stream config changes to configure message_key_fields for the rc0.mediawiki.page_change stream, and in beta, tested that keys were produced to consistent topic partitions.
Hm, yes, but I guess I mean at least this hardcoded producer code in MW core wouldn't have a hardcoded external dependency?
needs to happen during installation, we can't rely on extensions
We also can't rely on them, as perhaps we want stats on a MediaWiki with no extensions installed?
Can you please install the latest conda deb package on an-test-client1001?
Nice, @pfischer, please keep your eye on T308017: Design Schema for page state and page state with content (enriched) streams, there are some structural changes we may make to the schema (flattening?) in the next RC.
Oof, I didn't know there was a hardcoded use of EventLogging inside of MediaWiki core. This seems pretty fragile. This migration makes sense, but are we sure we want to continue doing this in the long term?
Wenjun's access is ssh-less access to the analytics-privatedata-users group, right? If so, their public key should be removed from the task description.
Hi, this sounds like an issue with your ssh config and your ssh key. If your key is configured correctly, ssh should not prompt you for a password:
Very cool! Code? :)
I'm fine either way. I think I prefer two packages if we want to keep the worker's installed size smaller; if we don't care, then let's just remove the debconf variable.
build increasingly complex code to avoid falling out of sync with MediaWiki (akin to the heroic scale of what Joseph put together for mediawiki-history)
@EChetty I don't think this task belongs in Event Platform. Removing tag.
In stream-beta they should show up automatically. I do see https://stream-beta.wmflabs.org/v2/ui/#/?streams=mediawiki.wikistories_contribution_event. Can we close this task?
Mon, Nov 21
Thanks @Tgr! At this point it is easy enough to remove, and we can always add it back in later if/when we need it. I'd prefer to solve this problem by making the event model simpler for now anyway.
o/ I am working on the Flink and Flink operator images now:
Done, I removed irrelevant parts, if that is okay.
@jcrespo I can make this change once the other approvals have been given.
Thu, Nov 17
Although I don't love putting 'kafka' in the name here, who knows, maybe one day we won't be using Kafka for this.
Are you open to bikeshedding on the name key_fields?
I'm not following the aspect about page properties not being persisted through edits
I don't know if I totally follow either, but there is more context in the initial collab design doc; see "Do we want page properties?" and the comment there.
OR! You could get fancier and make an HDFS puppet file provider :)
Yes, that's what I was thinking. Make it so that the exec in the defined type pulls the file from HDFS and diffs the content with the secret as part of the unless condition.
Interesting! I see there are some checks in the older EventBusHooks that guard against this. Will add the same ones in PageChangeHooks now.
2 more questions to answer:
@Isaac, for sake of continuity, let's have this discussion over on T308017: Design Schema for page state and page state with content (enriched) streams.
We can, but it isn't done with the Puppet file resources, so there isn't any detection of file changes, only an exec to put the file in place if it doesn't exist.
Wed, Nov 16
Hi, checking in, any updates here?
Best SQL example here. It will be much better with a catalog.
Tue, Nov 15
@phuedx, I want to check in with you about this and see if you have any thoughts.
Oo, I see. Right, empty, I remember now.
Mon, Nov 14
Interesting. Ideally PHP would just do the right thing and distinguish between integer-indexed arrays and associative arrays (objects).
I've already demoed creating an Event Platform based Spark Structured Streaming DataFrame here. (If we choose to invest in Spark streaming, we'd abstract more of that, like we have for Flink with DataFrame factory functions and/or a Catalog implementation). Defining the UDF is pretty much the same as in Flink, except you don't always have to specify the return type. I believe you do if the type is a complex/nested one, and in that case you'd use Spark's own DataType system, which is similar to Flink's.
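Roughly, the Spark side looks something like this (a sketch with made-up function and field names, not the actual demo code): a plain string-returning UDF can rely on PySpark's default return type, while a nested one needs an explicit StructType.

```
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType, StringType, StructField, StructType

# Simple case: no explicit return type needed (PySpark defaults to StringType).
@udf
def normalize_title(title):
    return title.replace("_", " ") if title else None

# Complex/nested case: the struct has to be declared with Spark's DataType
# system, much like DataTypes.ROW on the Flink side.
page_type = StructType([
    StructField("page_id", LongType()),
    StructField("page_title", StringType()),
])

@udf(returnType=page_type)
def make_page(page_id, title):
    return (page_id, title.replace("_", " ") if title else None)
```

The StructType here plays the same role as DataTypes.ROW does in the Flink version.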
Thu, Nov 10
The idea is to proceed for a first iteration (namely a new Kafka topic + Druid indexation on a separate datasource, say webrequest_128_live) without having a proper event and schema in place, so that we can validate whether the whole workflow works and whether it is valuable for SRE. Then we can definitely add one; what do you think?
Ya sounds good. When you are ready, the minimal requirement of adding the schema and event stream config won't be hard.
overridden by the DE batch jobs
QQ: do the existing DE batch jobs already produce all the same info you are trying to produce here with Benthos?
Approved from DE.
Wed, Nov 9
@EChetty why Event Platform here?
Re-opening to discuss a schema change.
if we're deriving it from schemas then the user would have to go to the schema repo and figure out what they have to return anyways
Yeah, you are right, and it is kind of weird to associate the return value of a UDF with an event JSONSchema. It makes total sense for the inputs and outputs of the streaming pipelines, but not so much for intermediate steps, like function calls.
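For illustration, an intermediate-step UDF like this (hypothetical field names, just a sketch) has its result type declared inline with Flink's DataTypes rather than derived from an event JSONSchema in the schema repo:

```
from pyflink.common import Row
from pyflink.table import DataTypes
from pyflink.table.udf import udf

# The result type is written by hand for this intermediate step; only the
# pipeline's input and output streams map onto event JSONSchemas.
@udf(result_type=DataTypes.ROW([
    DataTypes.FIELD("page_id", DataTypes.BIGINT()),
    DataTypes.FIELD("normalized_title", DataTypes.STRING()),
]))
def normalize(page_id, title):
    return Row(page_id, title.replace("_", " ") if title else None)
```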
Some thoughts and trials of implementing a Flink Event Platform catalog here and in some comments below.
Mon, Nov 7
I got sidetracked into getting a content enrichment pyflink UDF working from SQL. Finally got it!
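For the record, the shape of it is roughly this (placeholder names and a stubbed enrichment call, not the actual code): register the Python UDF with the TableEnvironment and call it from SQL.

```
from pyflink.table import DataTypes, EnvironmentSettings, TableEnvironment
from pyflink.table.udf import udf

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

@udf(result_type=DataTypes.STRING())
def fetch_content(title):
    # Stub for the real enrichment (e.g. an HTTP call to the MediaWiki API).
    return "content for " + title

# Register the Python UDF so it can be referenced from SQL.
t_env.create_temporary_function("fetch_content", fetch_content)

# 'page_change' stands in for however the source table actually gets defined
# (Kafka connector DDL, or an Event Platform catalog once we have one).
enriched = t_env.sql_query(
    "SELECT page_title, fetch_content(page_title) AS content FROM page_change"
)
```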
Hm, maybe you can do a different topic? It might be better to do a temp topic with your name in it, so it is clear that is just you testing things. 'tchin.test0'?
If you want to run Flink in k8s and write to HDFS, then this will be a problem: this is the k8s "kerbarrier".
Since this is ops/sre/root(?) access, is there any approval that needs to happen from SRE?
Fri, Nov 4
FYI, I am working on a more specific requirements spec for Event Platform producers:
Thu, Nov 3
No clue if this is the right approach, but perhaps we could use ingestion transforms to augment the existing Kafka ingestion with Event Platform event schemas? From 5 minutes of reading docs, I think we'd do this by implementing a transformer that can transform the schemaMetadata aspect of a dataset entity?
FYI I am working on making a more specific list of requirements spec for event platform producers:
BTW we are live on group0 wikis now.
To ease the creation of simple DAGs, we could implement a wizard
Instead of a wizard, perhaps we could just create an abstraction (task groups?) around simple input/output jobs? Parameterize input and output frequency and locations (hive table / hdfs path), and the job the user wants to run?
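Something like this hypothetical factory is the kind of abstraction I mean (names and task bodies are placeholders, not an existing API); it would be called from inside a normal DAG definition, with the schedule coming from the DAG itself.

```
from airflow.decorators import task
from airflow.utils.task_group import TaskGroup


def simple_io_job(group_id, input_path, output_table, run_job):
    """Wrap a 'wait for input, run job, write output' pattern in one TaskGroup."""
    with TaskGroup(group_id=group_id) as group:

        @task(task_id="check_input")
        def check_input():
            # Stand-in for an HDFS path / Hive partition sensor.
            return input_path

        @task(task_id="run_job")
        def run(path):
            return run_job(path)

        @task(task_id="write_output")
        def write_output(result):
            # Stand-in for loading the result into the output Hive table.
            print("writing", result, "to", output_table)

        write_output(run(check_input()))

    return group
```

The user would then only supply the input location, the output table, and the job to run, with frequency handled by the DAG's schedule.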