
Set up the section topics data pipeline Spark code base
Closed, ResolvedPublic

Description

Data Platform has changed the data pipelines scaffolding.

| Before | Now |
| ----- | ----- |
| both the Spark jobs and the Airflow DAG were living in the same repository | separated: Spark jobs are in a standalone repo, while DAGs live here |
| a mix of build tools handled packaging and deployment of both Spark jobs and the Airflow DAG | GitLab CI and conda distribution are used to package, release, and deploy Spark jobs |

Use the example job to set up the section topics data pipeline repo.

Event Timeline

mfossati changed the task status from Open to In Progress.Jul 1 2022, 3:59 PM
mfossati claimed this task.
mfossati moved this task from Incoming to Doing on the Structured-Data-Backlog (Current Work) board.

The code is currently located at https://gitlab.wikimedia.org/mfossati/section-topics. Waiting for T312037: Create new GitLab project group: structured-data to move it to its final location. This should also enable continuous integration, as runners are available in /repos.
Moving this ticket to code review. @xcollazo, maybe you can have a quick look and comment here?

@mfossati, here are a couple pointers:

  1. To separate business code from project setup code, I suggest you create a folder section_topics, and keep all your pyspark code there. Your setup.py should pick it up automatically.
  2. I know you just started, but once you have it, keep your conda env definition in the root folder under a file named conda-environment.yaml. This will make it easy for the GitLab CI to pick it up. (Example.)
  3. Looks like your test build phase is failing; you may need a tox.ini?
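
The layout suggested in these pointers can be checked with a small script. This is only a sketch: the names `section_topics`, `setup.py`, `conda-environment.yaml`, and `tox.ini` come from the pointers above, while the `check_layout` helper and the assumption that the package folder contains an `__init__.py` are illustrative and not taken from the example job.

```python
from pathlib import Path
import tempfile

# Paths taken from the pointers above. The __init__.py entry is an assumption:
# it is what makes section_topics a regular, discoverable Python package.
EXPECTED_PATHS = (
    "setup.py",
    "conda-environment.yaml",
    "tox.ini",
    "section_topics/__init__.py",
)


def check_layout(repo_root):
    """Return the expected paths that are missing under repo_root."""
    root = Path(repo_root)
    return [p for p in EXPECTED_PATHS if not (root / p).is_file()]


def demo():
    """Build a throwaway repo skeleton and check it before and after setup."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "setup.py").touch()
        before = check_layout(root)
        (root / "section_topics").mkdir()
        (root / "section_topics" / "__init__.py").touch()
        (root / "conda-environment.yaml").touch()
        (root / "tox.ini").touch()
        after = check_layout(root)
    return before, after


if __name__ == "__main__":
    print(demo())
```

Running `check_layout(".")` from the repository root lists whatever is still missing; an empty list means the layout matches the pointers.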

Please do look at all my mistakes in the history of image-suggestions, so that you don't spend as much time setting up the GitLab CI.

Also, once you have progressed more on the pyspark code, happy to take a further look.

Hope this helps.

@xcollazo, thanks for your feedback. CI is all set up except the trigger_release action, which is crying due to missing CI variables; see https://gitlab.wikimedia.org/repos/structured-data/section-topics/-/jobs/22971#L514
Just wanted to confirm if there's any specific requirement for the CI user credentials that will commit the release? Talking about user name, email, and SSH key.

As a side note, can you please grant me access to https://gitlab.wikimedia.org/repos/generated-data-platform/image-suggestions, so that I can autonomously dig into such details?
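
A failure caused by missing CI variables, like the one reported above, can be caught early with a pre-flight check. The sketch below is an assumption-laden illustration, not the actual trigger_release logic: only `CI_GIT_USER_SSH_PRIVATE_KEY` is named in this thread, while `CI_GIT_USER_NAME` and `CI_GIT_USER_EMAIL` are hypothetical placeholders for the user name and email.

```python
import os

# Only CI_GIT_USER_SSH_PRIVATE_KEY appears in this thread; the other two
# names are hypothetical stand-ins for the CI user's name and email.
REQUIRED_VARIABLES = (
    "CI_GIT_USER_NAME",   # hypothetical
    "CI_GIT_USER_EMAIL",  # hypothetical
    "CI_GIT_USER_SSH_PRIVATE_KEY",
)


def missing_variables(env, required=REQUIRED_VARIABLES):
    """Return the required variable names that are unset or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_variables(os.environ)
    print("Missing CI variables:", ", ".join(missing) or "none")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try locally before touching the project's CI/CD settings.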

> Just wanted to confirm if there's any specific requirement for the CI user credentials that will commit the release

I do suggest you create a unique keypair that you use exclusively for this. As an example, check out how I set it up for image-suggestions: https://gitlab.wikimedia.org/repos/generated-data-platform/image-suggestions/-/settings/ci_cd now that you have access.

There is one limitation for CI_GIT_USER_SSH_PRIVATE_KEY: we can't use GitLab's masked option for variables, since that feature only supports single-line values.
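
To illustrate the single-line limitation, here is a minimal sketch with a placeholder value (not a real key). It also shows one possible way to collapse a multi-line value into a single line with Base64. That encoding step is an assumption for illustration only: it is not what was done for image-suggestions, and GitLab's other masking requirements are not checked here. All function names are hypothetical.

```python
import base64

# Placeholder text only, not a real key; real private keys span several lines.
FAKE_PRIVATE_KEY = (
    "-----BEGIN PLACEHOLDER KEY-----\n"
    "bm90IGEgcmVhbCBrZXk=\n"
    "-----END PLACEHOLDER KEY-----\n"
)


def is_single_line(value):
    """True if value has no line breaks apart from surrounding whitespace."""
    return len(value.strip().splitlines()) <= 1


def to_single_line(value):
    """Base64-encode value so that it fits on one line."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


def from_single_line(encoded):
    """Reverse to_single_line and recover the original multi-line value."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    print(is_single_line(FAKE_PRIVATE_KEY))
    print(is_single_line(to_single_line(FAKE_PRIVATE_KEY)))
```

With this approach, the CI job would have to decode the variable before handing it to SSH, which adds a step compared to storing the key unmasked.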

> I do suggest you create a unique keypair that you use exclusively for this. As an example, check out how I set it up for image-suggestions: https://gitlab.wikimedia.org/repos/generated-data-platform/image-suggestions/-/settings/ci_cd now that you have access.
>
> There is one limitation for CI_GIT_USER_SSH_PRIVATE_KEY: we can't use GitLab's masked option for variables, since that feature only supports single-line values.

@xcollazo, I wonder whether a deploy token would make more sense, see https://gitlab.wikimedia.org/help/user/project/deploy_keys/index and https://gitlab.wikimedia.org/help/user/project/deploy_tokens/index#gitlab-deploy-token

> I wonder whether a deploy token would make more sense, see https://gitlab.wikimedia.org/help/user/project/deploy_keys/index and https://gitlab.wikimedia.org/help/user/project/deploy_tokens/index#gitlab-deploy-token

Looks like deploy tokens can only read, and there is no option for them to commit, which is what we use the SSH keypair for.

@xcollazo: sounds good, variables added. I'll keep you updated as soon as I trigger a meaningful release. Closing this ticket, thanks again for your support!