
[Developer Experience] [SPIKE] Investigate process to automate deployment of folders and artifacts to HDFS
Closed, Resolved · Public · 8 Estimated Story Points · Spike

Description

As a Data Engineer, I need to define a process for automatically deploying HDFS artifacts, so that things like HQL files are deployed automatically on merge.

Design doc:
https://docs.google.com/document/u/1/d/1gytt1rzO5wO1IWmmrGKdfuwSlVvFyQ9pFK87wePUYFE/edit

Done is:

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". Mar 26 2024, 12:23 AM
Restricted Application added a subscriber: Aklapper.
JAllemandou renamed this task from [Developer Experience] [SPIKE] Investigate process to automate deployment of hdfs artifacts to [Developer Experience] [SPIKE] Investigate process to automate deployment of folders and artifacts to HDFS. Mar 26 2024, 5:50 PM

Q: Have we discussed these ideas with Release Engineering folks? They are currently working on a similar CD project, but it might be MediaWiki focused only.

@BTullis is familiar with the design document and these ideas, I'm not sure which CD project in particular you're referring to though.

@thcipriani I think I recall you or RelEng mentioning a GitLab CD project. Got any links?

Tagging Release Engineering for consultation.

@lbowmaker can we clarify the user story / requirement here? As written it makes sense, but we might be missing something.

Is the intention to just have CD of artifacts/HQL/repos to HDFS? Or is the intention to also support CD for airflow-dags that use those artifacts?

As written, CD will work for job artifacts, but there will be no CD for schedulers to use those new artifacts. I think this is fine and good, but I just want to make sure this is the intention.

(It will make the implementation simpler if we don't need CD for real use of artifacts.)

@Ottomata the intention was to avoid the weekly deployments for simple things.

Is this possible with the proposed process?

  • Analyst/Eng makes change to existing hql file
  • Changes are reviewed and merged
  • New hql file is synced to HDFS
  • On next run of Airflow job it picks up the change
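The sync step in the flow above could be sketched as a small post-merge script invoked by CI. Everything here is illustrative: the local checkout path, the HDFS destination, and the decision to sync all HQL files rather than only changed ones are assumptions, not a settled design.

```shell
# Hypothetical post-merge sync: copy HQL files from a local checkout to HDFS.
# The destination path and the exact hdfs CLI invocation are assumptions.
sync_hql_to_hdfs() {
  repo_dir="$1"       # local checkout, updated by CI on merge
  hdfs_target="$2"    # e.g. /wmf/hql/current (hypothetical path)
  dry_run="${3:-1}"   # default: only print the commands

  # A real implementation might diff against the previous commit and
  # upload only changed files; this sketch syncs every .hql file.
  find "$repo_dir" -name '*.hql' | while read -r f; do
    rel="${f#"$repo_dir"/}"
    if [ "$dry_run" = "1" ]; then
      echo "would run: hdfs dfs -put -f $f $hdfs_target/$rel"
    else
      hdfs dfs -put -f "$f" "$hdfs_target/$rel"
    fi
  done
}
```

The dry-run default makes the script safe to exercise outside the cluster; CI would invoke it with dry-run disabled on a runner that has HDFS access.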

Discussed in meeting:

It is mostly up to us. So, Aleks will proceed as if we want to support e.g. current deployment and implement it, unless something comes up where it is too complex to deal with.

In DAGs, folks can choose to use current where appropriate. My opinion is that we would use explicit versions for critical pipelines, e.g. pageviews and unique devices.

cc @Milimetric @JAllemandou @mforns

I agree. I think we should support current as well as versioned. Both have their use cases.
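Supporting both styles could mean publishing each revision to an immutable path keyed by git ref and then refreshing a mutable current copy. A minimal sketch, assuming hypothetical paths and commands (this is not the actual refinery layout):

```shell
# Sketch: publish one git revision to an immutable versioned path, then
# refresh a mutable 'current' copy. Paths and commands are assumptions.
publish_revision() {
  src="$1"            # local directory holding the artifacts to publish
  base="$2"           # e.g. /wmf/artifacts/my-repo (hypothetical)
  ref="$3"            # git tag or commit sha
  dry_run="${4:-1}"   # default: only print the commands

  for cmd in \
    "hdfs dfs -mkdir -p $base/$ref" \
    "hdfs dfs -put -f $src/* $base/$ref/" \
    "hdfs dfs -rm -r -f $base/current" \
    "hdfs dfs -cp $base/$ref $base/current"
  do
    if [ "$dry_run" = "1" ]; then
      echo "would run: $cmd"
    else
      eval "$cmd"
    fi
  done
}
```

Critical pipelines would then reference `$base/$ref` explicitly, while less sensitive DAGs could point at `$base/current` and pick up each merge automatically.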

Has the lower-tech option of pulling from the git origin into the destination HDFS using a systemd timer been considered? This is basically how /srv/deployment-charts is managed on the production deployment servers, for example.

Good q, I'm not certain. @amastilovic ?

The tricky bit is that HDFS isn't a normal filesystem, so we can't really just git pull there. We'd have to git pull on some box somewhere, and then sync from that box to HDFS. Currently, we manually sync to new locations every time we deploy (see Data_Platform/Systems/Refine/Deploy_Refinery and refinery-deploy-to-hdfs).

I think we do want to keep deploying versioned artifacts (kind of like how scap manages different checkouts of different commits under the hood), so we need to keep support for deploying multiple copies of the files (probably versioned by git tag or sha1).

I'm not sure if we want to deploy every merge to main? Do we? I suppose if we wanted the systemd timer + pull approach, we could just do it for every git tag instead?

I've considered the option of pulling from the git origin into the destination HDFS, albeit not using a systemd timer. I've actually done something similar before in previous jobs/roles, by mounting HDFS onto a local file system, but I don't think this is a viable solution for a number of reasons:

  1. Using a timer is not an option - we want to be integrated into the GitLab CI pipeline and utilize all the information these pipelines provide to users. HDFS deployment should be triggered on merge to the main branch, and its status should be visible from the GitLab UI itself.
  2. Mounting HDFS to a local filesystem is (was?) finicky and buggy, plus it would require new development and support from the SRE team.
  3. Considering that we want old versions kept in HDFS, we would still need that functionality implemented somehow, which increases the complexity of what started as a simpler solution - i.e. it's no longer just "systemd + git pull" but something more complicated than that.
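Point 1 above could translate into a pipeline job along these lines. This is only a sketch: the job name, stage, deploy script path, and runner tag are hypothetical, while the `CI_*` variables are standard GitLab predefined variables.

```yaml
# Hypothetical .gitlab-ci.yml fragment: deploy to HDFS on merge to main.
deploy-to-hdfs:
  stage: deploy
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'   # run only on the main branch
  tags:
    - hdfs-deploy                          # hypothetical runner with HDFS access
  script:
    - ./bin/deploy-to-hdfs.sh "$CI_PROJECT_DIR" "$CI_COMMIT_SHA"
```

Because the job runs inside the pipeline, its pass/fail status and logs show up directly in the GitLab UI, which is the visibility point 1 calls for.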