Tue, Jul 5
Thank you for the heads-up @MoritzMuehlenhoff. The migration should be done in a matter of weeks, so I'm confident we'll be finished by September.
We have a duplicate of this task currently used to track the migration. @BTullis, do you mind if I merge this one into the other?
Mon, Jul 4
Thu, Jun 30
No, Superset staging doesn't use presto-test - there is almost no data or computation power behind that one. I don't have a better solution than deploying and testing, and possibly rolling back if too many problems show up :S
One thing I forgot to add:
Yes! We have examples of spec as well as data for you.
The Hadoop-ingestion spec (templated) for the mediawiki_history_reduced dataset is here: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mediawiki/history/reduced/load_mediawiki_history_reduced.json.template
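For a rough idea of the shape of such a spec without opening the template, here is a minimal sketch of a Druid Hadoop-ingestion spec (field values and the HDFS path are hypothetical; the linked template is the authoritative version):

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "mediawiki_history_reduced",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "month",
        "queryGranularity": "day"
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "hdfs:///wmf/data/example/path"
      }
    }
  }
}
```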
I confirm it works for me! Let's maybe give it a try on the prod cluster and ask our end-users to check their queries/dashboards?
Wed, Jun 29
This has been tested successfully - from now on we'll only need simple HQL jobs to load Cassandra :)
Tue, Jun 28
Open question: the latest Presto is 0.273.3 - would we bump further than the minimal version needed for Iceberg?
Hi @BPirkle - I'll gladly spend some time with you (and anyone interested) to explain more about Druid if needed :)
Fri, Jun 24
Thanks a lot @Ottomata for the backfill.
Thu, Jun 23
While this is feasible, it doesn't feel that simple: there is more in the shaded jar than just the Hadoop dependencies, notably the Scala libs. Using the shaded jar should be a lot simpler. I don't know how it is done with the skein operator, but skein allows fetching files from HDFS (see the code example and its comments in https://jcristharif.com/skein/distributing-files.html#specifying-files-for-a-service)
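To illustrate what that looks like, here is a sketch of a skein application spec in the YAML form the linked docs use (service name, jar path, and resource sizes are hypothetical): skein localizes each HDFS file into the container's working directory before the script runs.

```yaml
services:
  spark-driver:
    resources:
      memory: 4 GiB
      vcores: 2
    files:
      # destination name in the container -> HDFS source
      refinery-job.jar: hdfs:///wmf/refinery/artifacts/refinery-job-shaded.jar
    script: |
      spark2-submit --class org.example.ExampleJob refinery-job.jar
```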
Adding perspective on this: using a map instead of a structured, defined schema allows for more flexibility and versatility, but prevents deeper validation. The decision to use one versus the other should not be made lightly :)
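To make the tradeoff concrete, a sketch in JSONSchema terms (field names are made up for illustration): a map-typed field accepts arbitrary keys but only the value type can be checked, while a struct-typed field declares every key and can be validated field by field.

```yaml
# Map-style field: flexible, arbitrary keys, shallow validation only.
attributes_map:
  type: object
  additionalProperties:
    type: string

# Struct-style field: every key is declared, so each value is validated deeply.
attributes_struct:
  type: object
  properties:
    source:
      type: string
    priority:
      type: integer
  additionalProperties: false
```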
Tue, Jun 21
I faced the same issue; in my case the problem was a failed install of a previous package, caused by a missing dependency on the host (see https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow/Developer_guide#Setting_up_the_environment). I'd be interested to know if your issue was the same.
Thu, Jun 16
Question about AQS: should we wait for AQS 2.0 and do this there, instead of changing the old Node code?
Wed, Jun 15
Tue, Jun 14
The Gobblin problem is a known issue - we have set up alerts (that worked!) that cover us against this.
Mon, Jun 13
Thu, Jun 9
I think we can close this indeed. Thanks @BTullis .
Wed, Jun 8
Jun 6 2022
Jun 2 2022
May 25 2022
It is! Closing the task
May 24 2022
May 23 2022
Decisions for Spark3:
- We're gonna merge and release the refinery-source patch bumping Spark and Scala as is, changing the refinery-source version to 0.2.0 (not all jobs have been tested; the list is documented in the commit message)
- We're gonna use this new refinery-source release to migrate existing Airflow jobs to Spark3, using the SparkNoCLIDriver in cluster mode instead of the skein-in-client-mode deploy strategy. Some Airflow hacking might be needed here.
- The merge of the refinery-source code doesn't impact already-running jobs, as we reference jars by version. However, it means that any new change to Scala code needs to be done in Scala 2.12, and the related jobs need to be migrated to Spark3 (and therefore Airflow). This shall push us to migrate to Airflow faster :)
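For illustration, the deploy-strategy change in the second bullet roughly amounts to a submit command like the following (binary name, class, and jar path are hypothetical): in cluster mode the driver runs inside YARN, so no client-side CLI driver process is needed.

```shell
spark3-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.wikimedia.analytics.refinery.job.ExampleJob \
  hdfs:///wmf/refinery/artifacts/refinery-job-0.2.0-shaded.jar
```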
May 20 2022
There is work in Hive on that front, but AFAICS it's for version 3+: https://github.com/apache/hive/blob/master/kafka-handler/README.md
May 19 2022
May 13 2022
May 12 2022
From what I have seen, we can specify neither a load-balancing policy nor the local datacenter. BUT, from the docs: "Connections are never made to data centers other than the data center of spark.cassandra.connection.host." https://github.com/datastax/spark-cassandra-connector/blob/master/doc/1_connecting.md
The spark-cassandra-connector indeed supports setting the consistency. It defaults to LOCAL_QUORUM for writes and LOCAL_ONE for reads.
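For reference, these are the connector properties involved (the host value here is a made-up example); the levels shown are the defaults the connector uses when the properties are left unset:

```properties
spark.cassandra.connection.host=aqs-cassandra.example.wmnet
spark.cassandra.output.consistency.level=LOCAL_QUORUM
spark.cassandra.input.consistency.level=LOCAL_ONE
```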
May 11 2022
May 10 2022
Actually there would be some difference: using Spark3 would put the related HQL queries in the form:
INSERT INTO aqs.local_group_default_T_pageviews_per_project_v2.data SELECT ...
instead of just the SELECT part. But the SELECT part itself would be the same...
The concern with waiting (I forgot to mention above) is that it prevents us from decommissioning the old AQS nodes.
If we decide to wait for spark3, we should update the oozie jobs.
After a great talk with @Antoine_Quhen a wider discussion needs to happen: Spark3 offers the possibility to write to cassandra through SQL-like queries (see https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md).
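For context, the SQL-like write path works by registering the connector as a Spark SQL catalog; a sketch of the configuration (the catalog name and host are hypothetical, the class name is the connector's documented catalog implementation):

```properties
spark.sql.catalog.aqs=com.datastax.spark.connector.datasource.CassandraCatalog
spark.sql.catalog.aqs.spark.cassandra.connection.host=aqs-cassandra.example.wmnet
```

Once registered, Cassandra tables are addressable as aqs.&lt;keyspace&gt;.&lt;table&gt; in plain INSERT INTO ... SELECT statements.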
@Eevans : The AQS loader is not datacenter-aware. It takes base hosts as a parameter and discovers the Cassandra cluster topology by asking the known host(s). However, I think it'd be good to restrict data loading from Hadoop to just eqiad, to prevent pushing too much data at once across DCs. In my mind we would send data only to eqiad hosts and let Cassandra replicate it (possibly with throughput limitation). Is that even feasible? Is it a good idea? Thoughts welcome :)
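The mechanism being relied on here is Cassandra's own cross-DC replication: if a keyspace replicates to both datacenters, a write accepted by an eqiad coordinator is shipped to the other DC's replicas by Cassandra itself. A CQL sketch (keyspace name and replica counts are illustrative only):

```sql
-- With NetworkTopologyStrategy, a write coordinated in eqiad is
-- replicated to the codfw replicas by Cassandra, not by the loader.
CREATE KEYSPACE example_keyspace
  WITH replication = {'class': 'NetworkTopologyStrategy', 'eqiad': 3, 'codfw': 3};
```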