Page MenuHomePhabricator

[NEEDS GROOMING] We should improve and automate python linting
Open, Needs TriagePublic

Description

This is a placeholder meant to collect feedback / discussion I had with @mfossati on Slack and Gitlab.

For context; we use a cookiecutter template to scaffold python based datapipelines. Among other CI checks, it configures a set of flake8 rules to test for:
https://gitlab.wikimedia.org/repos/generated-data-platform/datapipelines/-/blob/main/datapipeline-scaffold/%7B%7Bcookiecutter.pipeline_directory%7D%7D/pyspark/setup.cfg

This is a lax set of rules @Clarakosi and I arbitrarily picked as a baseline.

There's two areas of improvement.

Add linting rules

For new projects we should be more strict and include (at least):

  • E501: line too long
  • E202: un-ncecessary whitespaces
  • F401: unused imports
  • W292: no newline at end of file

Code formatting and style enforcement

@mfossati suggests integrating isort and autoflake in CI. +1 as far as I'm concerned.
We also discussed having formatting (and maybe other checks) as a git pre-commit hook. @Marco suggested evaluating https://pre-commit.com/.

Event Timeline

@gmodena I was bold and added pre-commit to the next data pipeline, see initial repo here: https://gitlab.wikimedia.org/repos/structured-data/section-topics

@xcollazo @Antoine_Quhen Does this apply to the airflow ci/cd that DE manages? Are there any improvements that we wish to adopt or expand?

If not, is this ticket still valid?

@xcollazo @Antoine_Quhen Does this apply to the airflow ci/cd that DE manages? Are there any improvements that we wish to adopt or expand?

If not, is this ticket still valid?

@odimitrijevic seems like this ticket was referring to standards for Python 'Business Logic' repos like image_suggestions.

We should however pursue some coding standards for airflow-dags for sure, and that conversation was started recently on Slack. Happy to move that to a Phab ticket.