I wonder whether a deploy token would make more sense, see https://gitlab.wikimedia.org/help/user/project/deploy_keys/index and https://gitlab.wikimedia.org/help/user/project/deploy_tokens/index#gitlab-deploy-token
Looks like deploy tokens can only read, and there is no option for them to commit, which is what we use the ssh keypair for.
All right! Just verified that the image_suggestions job is running smoothly on the new an-airflow1004.eqiad.wmnet Airflow instance.
Applied (1) via https://gitlab.wikimedia.org/repos/generated-data-platform/image-suggestions/-/merge_requests/1. Closing.
I have updated the Wikitech documentation for that.
Thanks @KCVelaga_WMF !
Wed, Aug 17
Verified that the table content and airflow job are working as designed. Closing.
If we can do this with Flink, we should, since then we don't have to maintain 2 codebases that do the same thing. But, it also might prove too difficult, and in that case we'd use Spark.
Tue, Aug 16
Successfully deployed the migration script and Airflow job today.
Updated migration notes to use Spark3.
Copying some notes I sent through email here for completeness:
Mon, Aug 15
Just wanted to confirm whether there's any specific requirement for the CI user credentials that will commit the release.
Fri, Aug 5
Thu, Aug 4
(After a conversation with @KCVelaga, we decided to keep things simple for now, thus the monthly table unique_editors_per_country_monthly will not be using GROUPING SETS.)
Merge request for the new pipeline https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/114
Notes for doing the migration are contained here:
Wed, Aug 3
Synced up with Ben over chat, copying here:
Mon, Aug 1
Thank you for looking into this @mforns. As we briefly discussed over Slack, regardless of us upgrading Airflow to pick up the fix, we should also implement a log rotation mechanism, since many other issues could make the logs balloon. We discussed this in today's standup, and I believe @EChetty is going to open a task for it.
Fri, Jul 29
Confirmed sudo access:
Thank you all for taking care of this.
Thu, Jul 28
Blocked until T311176 is resolved (lack of privileges).
Wed, Jul 27
Fixed a sensor correctness issue via https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/108.
@EChetty can you add to the description, or point me to the right person to ask for details? Thanks!
(not confident about the patch above, but still wanted to have something for review.)
Tue, Jul 26
While following the instructions to make the puppet changes for https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow#Create_a_scap_deployment_source, I hit a wall. It seems @Ottomata had set it up so that converting the current platform_eng Airflow instance would be a simple config change, as seen here: https://github.com/wikimedia/puppet/blob/production/hieradata/role/common/analytics_cluster/airflow/platform_eng.yaml#L53-L57. However, since the prod run of the image_suggestions DAG is on the original server, I believe going forward with this would nuke it.
Put together what I think the correct scap configuration is at https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags-scap-platform_eng.
Synced up with @mforns on this task. We will attempt to move it forward as much as we can until we get an SRE to help.
Mon, Jul 25
@mfossati, here are a couple pointers:
After today's architectural discussion, we decided that we will not be pursuing making the data available in MariaDB for now. We can still do the diffs in the future with the approach above. For now, closing.
Following today's architecture discussion, we will not be pursuing making the data available in MariaDB to be consumed publicly for now. But if we want to keep that option open, let's make sure that we identify a primary key for all tables in Hive (that is, a column or set of columns that make the row unique).
Some more context:
Thought about this a bit last Friday. I think it will be straightforward as long as we can identify a primary key for all the tables that we want to sync up.
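To illustrate why the primary key matters for the diffs: a minimal sketch, assuming each table snapshot can be read as a list of row dicts. The function name `diff_by_key` and the column names in the usage note are hypothetical, not from our codebase.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Row = Dict[str, object]


def diff_by_key(
    old_rows: Iterable[Row],
    new_rows: Iterable[Row],
    key_cols: Sequence[str],
) -> Tuple[List[Row], List[Row], List[Row]]:
    """Compare two snapshots and return (inserts, deletes, updates).

    key_cols is the assumed primary key: the column or set of columns
    that makes a row unique. Without it, rows cannot be matched up.
    """
    def key(row: Row) -> tuple:
        return tuple(row[c] for c in key_cols)

    old = {key(r): r for r in old_rows}
    new = {key(r): r for r in new_rows}

    inserts = [new[k] for k in new if k not in old]   # only in the new snapshot
    deletes = [old[k] for k in old if k not in new]   # only in the old snapshot
    updates = [new[k] for k in new if k in old and new[k] != old[k]]  # changed
    return inserts, deletes, updates
```

For example, calling `diff_by_key(old, new, ["wiki", "page_id"])` would treat the pair (wiki, page_id) as the key. At Hive scale the same idea would more likely be expressed as a full outer join on the key columns.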
Wed, Jul 20
Jul 19 2022
We're thinking of doing deltas for section topics, so we'd need to be able to do those too.
@Cparle If you really wanted to do that step in MariaDB, you could use HiveToMariaDB to import the new data into a temp table, and then use MySqlOperator to run your INSERT/DELETE transformations against the old data.
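A rough sketch of that temp-table pattern, using Python's built-in sqlite3 purely as a stand-in for MariaDB so it can be run locally. The table and column names (`suggestions`, `suggestions_staging`, `page_id`, `image`) are made up for illustration, and in Airflow the two statements would be handed to MySqlOperator rather than executed like this.

```python
import sqlite3


def apply_delta(conn: sqlite3.Connection) -> None:
    """Make `suggestions` match `suggestions_staging`, keyed on page_id."""
    with conn:  # run both statements in a single transaction
        # Remove rows that no longer exist in the freshly imported data.
        conn.execute(
            "DELETE FROM suggestions WHERE page_id NOT IN "
            "(SELECT page_id FROM suggestions_staging)"
        )
        # Upsert new and changed rows. INSERT OR REPLACE is SQLite syntax;
        # on MariaDB this would be REPLACE INTO or
        # INSERT ... ON DUPLICATE KEY UPDATE.
        conn.execute(
            "INSERT OR REPLACE INTO suggestions (page_id, image) "
            "SELECT page_id, image FROM suggestions_staging"
        )


def demo() -> list:
    """Build a tiny in-memory example and return the synced table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE suggestions (page_id INTEGER PRIMARY KEY, image TEXT);
        CREATE TABLE suggestions_staging (page_id INTEGER PRIMARY KEY, image TEXT);
        INSERT INTO suggestions VALUES (1, 'a.jpg'), (2, 'b.jpg');
        INSERT INTO suggestions_staging VALUES (2, 'b2.jpg'), (3, 'c.jpg');
        """
    )
    apply_delta(conn)
    return conn.execute(
        "SELECT page_id, image FROM suggestions ORDER BY page_id"
    ).fetchall()
```

Note that the upsert relies on `page_id` being a primary key on the target table, which is the same requirement as for computing the diffs.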
I'd like to understand your use case, where can I go read about it?
Jul 18 2022
- We need a non-test use case for the MariaDB DBA folks to let us connect to their servers.
- The prerequisites for making MariaDB connections to the misc-clusters both from the Airflow cluster as well as from the Hadoop cluster are there.
- We need to provide tooling (the aforementioned HiveToMariaDB refine helper class) for moving data from Hive tables into MariaDB.
As a complement to the notes above, I also tried to make sure that connections are possible from within the Hadoop cluster. I manually ran the following, which runs a simple test in one of the Spark worker nodes:
Jul 15 2022
Some notes regarding this POC:
Connection is successful:
Jul 14 2022
If that's all you'd need you can just try a telnet (or netcat) to one of the proxies to the port 3306.
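For reference, the same kind of check can be scripted. A minimal sketch using only the Python standard library; the function name `can_connect` is made up, and this only confirms that a TCP connection can be opened, not that MariaDB authentication works.

```python
import socket


def can_connect(host: str, port: int = 3306, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and applies the timeout
        # to the connection attempt.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

Running it from both an Airflow host and a Hadoop worker against one of the proxies would cover the firewall question for both clusters.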
Sorry, I misspoke. I want my workflow system (Airflow) to connect to this MariaDB instance, as there may be helper functions we'd need to create to make this sort of connection easy for our customers, so it's not just making sure the connection is possible.
What do you need to test the connection apart from that?
We want to make sure there are no connectivity/firewall issues between our Airflow and Analytics clusters and the MariaDB instances.
Jul 13 2022
Jul 12 2022
Jul 11 2022
Closing this ticket for now since there has been no reply or activity on my request upstream, and there is the yarn logs ... workaround.
Jul 1 2022
In general, the skein project seems to be... not dead, but certainly dormant. This particular log bug PR has been open since 2020!
Making good progress here.
There is a discussion of this issue in a skein PR from 2020 (https://github.com/jcrist/skein/pull/212). Unfortunately, it was never merged. The stopgap change is quite simple: https://github.com/jcrist/skein/pull/212/commits/fe906f746e0a3b8b3cb89ce61140271e02601699.
Thank you @Aklapper!
Jun 30 2022
Jun 29 2022
(For the GitHub repo containing the business logic, I am following the git instructions at https://stackoverflow.com/questions/1365541/how-to-move-some-files-from-one-git-repo-to-another-not-a-clone-preserving-hi to avoid losing the commit history.)
Jun 28 2022
I'm worried that the Presto Iceberg connector might not have Kerberos support.
Typically, you can pass these details down with a Hadoop Configuration object. Is that not the case with Presto?
Jun 27 2022
Created T311417 to track the implementation work.
I had meetings with the stakeholders for this effort.
Jun 22 2022
@lbowmaker shared with me the following Slack thread with @JAllemandou's rationale: https://wikimedia.slack.com/archives/C02BB8L2S5R/p1654174524991399?thread_ts=1654106678.906859&cid=C02BB8L2S5R
The page content will be too large to efficiently store and query as Parquet, so it needs to be special-cased and stored as Avro.