Page MenuHomePhabricator

Add linter and formatter to wmfdata-python (and link check)
Closed, ResolvedPublic

Description

Description

New

Adding linting and a format checker to wmfdata-python would be a great way to assure code consistency and quality (see issues in original task text below). This task would add a Python linter like Ruff (alternatively flake8) as well as a library to help with code formatting. For the latter we could make use of Ruff itself, black or autopep8. The Product Analytics Styleguide suggests PEP 8, but maybe we want to consider black in that it's now Python Software Foundation maintained and can integrate with Ruff.

Assuring that all the issues detailed in the original text below are warned for (and potentially autocorrected) would be the final step in this issue.

Original

Many of the Python files for wmfdata-python are using inconsistent formatting - specifically indentation of two and sometimes even three spaces. This task would go through and fix the various indentation problems such that all .py files would have indentation of four spaces. Along with this would also be included:

  • Removing white space at the end of lines
  • Making sure that all files end on a blank line (unsure if this is a problem, but a VS Code extension I have will do this)
  • Fixing some links in the readmes
    • Including this as I don't think that it's enough for its own task
    • For example the "pull requests here on GitHub" hyper link in the README.md takes the visitor to /pulls as a directory in the package rather than /pulls the URL subdomain

Contribution

I'd be happy to work on this, and other suggestions would be welcome!

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

I think the most bang for the buck would be to do these fixes automatically by adopting a tool like flake8. See https://github.com/PyCQA/flake8.

And while we are it, we should probably add a proper pytest testing setup instead of the manual test harness that we have today.

If you're interested in pursuing this, let me know and I can fetch a couple examples from other projects.

I'd be happy to look into this, @xcollazo! We were discussing linting and checks like this in an open-source project I work on and decided to go with Ruff. Would that be worth adopting instead of flake8? I could also look into a GitHub action to check new PRs?

As far as testing, we're likely talking two different tasks with implementing pytest being a new one to make sure that the scopes are not too broad?

Also just FYI, generally what I've been talking about with folks over here is that it'd be a good use of my learning budget to help you all maintain projects like wmfdata-python 😊 Starting with this makes sense, but also feel free to reach out if there are other tasks that you all would like support with :)

We were discussing linting and checks like this in an open-source project I work on and decided to go with Ruff. Would that be worth adopting instead of flake8?

Fine with me!

I could also look into a GitHub action to check new PRs?

Here I'm not sure its worth investing since I believe the mid term plan is to move this project to Gitlab. Pinging @nshahquinn-wmf for confirmation.

two different tasks with implementing pytest being a new one to make sure that the scopes are not too broad?

Agreed. Better to split.

it'd be a good use of my learning budget to help you all maintain projects like wmfdata-python 😊 Starting with this makes sense, but also feel free to reach out if there are other tasks that you all would like support with :)

Happy to hear. Since you seem like an active user of wmfdata-python, let's focus on that one for now, as I think the dogfooding will keep you motivated to contribute.

Ok, great! I'll check out some ideas for Ruff and maybe get started on this next week :) I'll hold off on the GitHub action as well as I'm remembering now that @nshahquinn-wmf mentioned the move to GitLab in T341589: Improve df_to_remarkup formatting for wmfdata-python.

Agreed. Better to split.

Should I write the new task for this?

Happy to hear. Since you seem like an active user of wmfdata-python, let's focus on that one for now, as I think the dogfooding will keep you motivated to contribute.

Sounds good to me! 🐶🍪😊

Should I write the new task for this?

Yes please.

AndrewTavis_WMDE renamed this task from wmfdata-python formatting and link check to Add linter and formatter to wmfdata-python (and link check).Oct 23 2023, 3:52 PM
AndrewTavis_WMDE claimed this task.
AndrewTavis_WMDE updated the task description. (Show Details)

Updated the task and also mentioned maybe adding an auto-formatter that could help to make sure that things are running smoothly before we have workflows set up to test PRs/MRs :)

The Product Analytics Styleguide suggests PEP 8, but maybe we want to consider black in that it's now Python Software Foundation maintained and can integrate with Ruff.

I think it's well worth considering! The only thing I'd want to check first is whether this is significantly different from the MediaWiki Python style and what other Wikimedia Python projects do. It may be that there are already existing rulesets for another formatting package.

Ruff also just brought out its own formatter that's fully compatible with Black. As far as formatting adoption at WMF, product analytics states here in their styleguide that they follow the MediaWiki styleguide that's Pep 8 based, however for Airflow the formatting is supposed to be Black (see their docs). I'll let you two decide what that means for wmfdata-python, but if we go with Black then going with the Ruff implementation to just have one dependency/framework makes sense to me 🙃

We could further make use of pre-commit to check code before commits, which is also something that's suggested in the WMF Airflow docs.

what other Wikimedia Python projects do

FWIW, in Data Engineering land, we have (arbitrarily) chosen black as de facto for recent projects.

Having said that, I think the value of these tools resides in the automation, rather than the specifics. So whatever keeps its maintainers happy should prevail. If you do have a preference, @nshahquinn-wmf, then let's go with that.

@nshahquinn-wmf, @xcollazo: checking in on this one again. I would have some time in the next two weeks or so to implement a PR workflow check of linting and code formatting. If folks are fine with Ruff that'd be easiest on my end, but also happy to consider others! I'd also suggest adding in a .vscode/extensions.json file that would allow us to suggest VS Code extensions like the Ruff extension so people are getting the appropriate warnings during editing. Included would of course also be some documentation on how to run the checks locally before a PR 😊

Let me know if this would be of interest on your all's end!

@AndrewTavis_WMDE that sounds great to me! I personally have no preference about the package we choose, so I'm on board with the idea of Ruff with the Black ruleset (since that's what Data Engineering uses).

Looking forward to it! 😁

Exciting! I'll start playing around with it towards the end of next week and send along a PR with the workflow, docs and changes given the local run warnings 😊 Will let you know if anything comes up before then, @nshahquinn-wmf. Have a nice weekend when it comes along!

Manuel moved this task from Incoming to To-Do on the Wikidata Analytics (Kanban) board.
nshahquinn-wmf added a subscriber: fkaelin.

@fkaelin has started working on this and is pretty far along. Hopefully we can get it merged this week.

Awesome to hear! Thanks for working on this, @fkaelin! 😊

While the referenced MR added support for linting/formatting, shouldn't this task include the work required to make the static analysis stage of the pipeline pass/green? Otherwise that work should be tracked in another task. For example, this is the output with typing issues to fix to make mypy happy. Similarly, if your MR is fixing formatting issues these changes will show up on the code quality section of the MR.

@fkaelin good point that we should track the work needed to make the pipeline pass!

My inclination is to keep tasks smaller in scope, so I've opened T381656 for the type-checking and T381657 for the linting.