User Details
- User Since: Oct 9 2014, 4:50 PM (526 w, 4 d)
- Availability: Available
- IRC Nick: ottomata
- LDAP User: Ottomata
- MediaWiki User: Ottomata [ Global Accounts ]
Yesterday
Is this what you were hoping to see?
Fri, Nov 8
Reading
Thu, Nov 7
Interesting!
Run as different users by passing the spark.kerberos.principal parameter to Spark.
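For illustration, a submit-time sketch assuming a kerberized YARN cluster; the principal, keytab path, and application file below are hypothetical:

```python
import subprocess

# Hypothetical principal and keytab path; assumes spark-submit is on PATH
# and the job runs against a kerberized YARN + HDFS cluster.
subprocess.run(
    [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--conf", "spark.kerberos.principal=analytics-product@EXAMPLE.REALM",
        "--conf", "spark.kerberos.keytab=/etc/security/keytabs/analytics-product.keytab",
        "my_job.py",
    ],
    check=True,
)
```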
Thanks for this write up Antoine! This will make it much easier to remember this stuff in the future.
As far as I can tell, mediawiki-content-dump is the only production case where we are currently doing this.
Wed, Nov 6
The downside of deploy-mode cluster is that driver logs are not available to the process that submits the Spark job.
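One way to get at them after the fact, assuming YARN log aggregation is enabled, is the yarn logs CLI; a rough sketch (the application ID is a placeholder):

```python
import subprocess

# With YARN log aggregation enabled, the driver's container logs can be
# pulled after the application finishes. The application ID is made up.
app_id = "application_1700000000000_12345"
logs = subprocess.run(
    ["yarn", "logs", "-applicationId", app_id],
    capture_output=True,
    text=True,
    check=True,
).stdout
print(logs[:2000])  # e.g. inspect the driver's stderr section
```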
In T379187: Create new GitLab project: repos/wmf-packages, repos/wmf-packages was created as a WMF global package registry. We can publish shareable Python packages there.
Volume mounts based on either managed PVs or images.
Wow TIL k8s images and OCI 'artifacts'. Interesting!
Does that sound right to you @Ottomata?
Can we decline / resolve this task?
Just so I'm clear, we don't have the driver logs in Airflow at the moment though, do we?
Could we not just add a spark.jars default configuration option, pointing to an HDFS location of the Iceberg jar?
It isn't?
This should be fixed as part of the work done for T356762: [Refine refactoring] Refine jobs should be scheduled by Airflow: implementation. We can resolve this after we are done with T369845: [Refine Refactoring] Refine jobs should be scheduled by Airflow: deployment.
Hm, are we sure we need this? IIRC the YARN client handles picking the hostname to talk to directly via the YARN ResourceManager HA mechanism.
Iceberg jar:
Perhaps it would be possible to retrieve this jar from HDFS, rather than bake it into the Airflow image. Not sure what's best here.
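A sketch of that option, assuming the jar is published somewhere in HDFS (the path and version below are made up); the same spark.jars value could equally live in spark-defaults.conf:

```python
from pyspark.sql import SparkSession

# Hypothetical HDFS location for the Iceberg runtime jar; spark.jars
# accepts comma-separated local, hdfs://, or http(s):// paths.
spark = (
    SparkSession.builder
    .appName("iceberg-jar-from-hdfs")
    .config("spark.jars", "hdfs:///wmf/artifacts/iceberg-spark-runtime-3.3_2.12-1.2.1.jar")
    .getOrCreate()
)
```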
@CDanis I believe people currently want this kind of work to be planned more strategically, and to prioritize it appropriately. I agree this would be very useful!
Tue, Nov 5
The data in cu_private_event is needed to do proper analysis for T372702: editors are repeatedly getting logged out (August 2024), as so far it seems that the queries have been limited to just enwiki.
My MR that adds a jsonschema-tools diff subcommand is nice. We could configure GitLab CI to run it in the schema repos using dyff, and get reviewable CI pipeline output like this:
The problem will be that jsfh renders the schema file but also renders JS and CSS.
schema.wikimedia.org is currently hosted using nginx.
Hm, yes I think both could be done!
Mon, Nov 4
I expect the access token to the central repo would be the only real stumbling block.
I think we should have one WMF-wide global package registry in GitLab.
I think one registry would help simplify configuration. Publishing and dependency configuration could then all point at the same package registry location.
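A rough sketch of what publishing to that registry could look like from CI with twine; the project ID is a placeholder, and the gitlab-ci-token / CI_JOB_TOKEN pair follows GitLab's documented pattern for its PyPI package registry:

```python
import glob
import os
import subprocess

project_id = "1234"  # placeholder project ID for repos/wmf-packages
repository_url = f"https://gitlab.wikimedia.org/api/v4/projects/{project_id}/packages/pypi"

# Upload previously built wheels/sdists from dist/ to the GitLab registry.
subprocess.run(
    ["twine", "upload", "--repository-url", repository_url, *glob.glob("dist/*")],
    env={
        **os.environ,
        "TWINE_USERNAME": "gitlab-ci-token",
        "TWINE_PASSWORD": os.environ["CI_JOB_TOKEN"],
    },
    check=True,
)
```

Consumers would then point pip (or their dependency resolver) at the registry's /simple index via --index-url or --extra-index-url.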
Reposting a use case from https://phabricator.wikimedia.org/T367322#9884996
Okay, so it looks like the answer to my question
Agree.
Yeah, I was thinking of trying that next too! So:
MariaDB -> Debezium -> Kafka
-> Flink CDC Iceberg
or
-> Paimon Sync Database action.
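Whichever sink ends up being used, the Debezium topics carry a change-event envelope (before/after row images plus an op code); a minimal consumer sketch, with made-up broker and topic names:

```python
import json

from kafka import KafkaConsumer  # kafka-python

# Placeholder broker list and topic name for a Debezium MariaDB connector.
consumer = KafkaConsumer(
    "mariadb.mediawiki.revision",
    bootstrap_servers=["kafka-broker:9092"],
    value_deserializer=lambda v: json.loads(v) if v else None,
)

for message in consumer:
    if message.value is None:  # Debezium tombstone after a delete
        continue
    # With the default JSON converter (schemas enabled) the envelope sits
    # under "payload"; otherwise it is the top-level object.
    envelope = message.value.get("payload", message.value)
    # op is one of: c (create), u (update), d (delete), r (snapshot read)
    print(envelope["op"], envelope["after"] or envelope["before"])
```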
Re 4/ Refined table that won't Refine with new process:
There was never an answer to some questions there.
But keeping those 171 custom blocks statically in the MediaWiki-config repo is OK.
Still not understanding. Why are there 171 static custom blocks? Won't the defaults just add them automatically?
Fri, Nov 1
Is there a reason why this can't be added to the default block, like we do for Gobblin?
allowing metric owners to choose which mechanism and data store best suits them?
FWIW, with the event platform consolidation proposal, metrics owners do not have to choose. All of these metrics will go both to Prometheus and to the Data Lake automatically.
Thu, Oct 31
Or adding the 171 blocks into the PHP file manually.
You know what else would be cool!?
Tue, Oct 29
Merged!
Approved
<3 thank you!
Skein support in Kubernetes might not be required
Indeed!
We need to have the spark3-submit binary be a symlink to spark-submit, as we use it extensively in airflow-dags.
Mon, Oct 28
Incident report has been moved to Wikitech.
Just sent this email to the Flink and Paimon user email groups.
How do we help the revision-level MERGE INTO? The suggested partitioning scheme doesn't, because we have ~150k events at the hour level and ~3.6M at the day level.
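For reference, the operation being discussed is an Iceberg MERGE INTO keyed on revision, roughly like the sketch below (the source view and column names are illustrative, not the real schema):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-into-sketch").getOrCreate()

# Illustrative only: the source view and the merge key column are placeholders.
# The concern above is that without a layout that clusters by the merge key,
# this rewrites far more data than the ~150k hourly changed rows it touches.
spark.sql("""
    MERGE INTO wmf_dumps.wikitext_raw_rc2 AS target
    USING hourly_revision_updates AS source
      ON target.revision_id = source.revision_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```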
@xcollazo maybe stupid idea:
Sat, Oct 26
I think MW will write various logs to files in ./cache, which is a mounted volume, so you should be able to tail them from your host machine. Is there anything tricky in there?
Fri, Oct 25
So how would one go about "enriching" wmf_dumps.wikitext_raw_rc2 with a diff column? The job could filter the full history for only the pages changed in that hour (broadcast join) and then do the self join, but that would still require a full pass over the data, which seems expensive. This certainly is solvable, e.g. one could decrease the update interval, but it is tempting to instead implement the diff as a streaming "enrichment" pipeline.
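A sketch of the batch variant described above, just to make the shape of the two joins concrete; the table names, columns, and the diff function are placeholders, not the real schema:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("diff-enrichment-sketch").getOrCreate()


@F.udf("string")
def compute_diff(parent_text, text):
    # Placeholder for a real text-diff implementation.
    if parent_text is None or text is None:
        return None
    return f"+{max(len(text) - len(parent_text), 0)}/-{max(len(parent_text) - len(text), 0)} chars"


history = spark.table("wmf_dumps.wikitext_raw_rc2")
changed_pages = (
    spark.table("hourly_changed_revisions")  # hypothetical per-hour change feed
    .select("page_id")
    .distinct()
)

# 1. Broadcast join: keep only revisions of pages touched in this hour.
touched = history.join(F.broadcast(changed_pages), "page_id")

# 2. Self join: pair each revision with its parent to compute the diff column.
revs = touched.alias("rev")
parents = touched.alias("parent")
with_diff = (
    revs.join(
        parents,
        F.col("rev.parent_revision_id") == F.col("parent.revision_id"),
        "left",
    )
    .select(
        "rev.*",
        compute_diff(F.col("parent.wikitext"), F.col("rev.wikitext")).alias("diff"),
    )
)
```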
A quick codesearch (https://codesearch.wmcloud.org/search/?q=kafka-php&files=&excludeFiles=&repos=) and a local grep yield no results, so this knot might have neatly tied itself.
Other use cases: