
[backend][jobs] Refactor populate_db job
Closed, Resolved · Public

Description

The main goal is to make sure that this script can run as a stand-alone cron job. We'll also refine it a little bit, details TBD.

To start, this script should contain the whole ETL pipeline (Extract, Transform, Load). We should also make sure it's idempotent, i.e. running it twice should produce the exact same outcome as running it once, provided that the data fetched from Toolhub and the state of our db haven't changed between the two runs.

The script being stand-alone also means it shouldn't depend on anything that requires having an up-and-running Flask app instance, including app.config and the db instance.

Event Timeline

Slst2020 triaged this task as High priority · Mar 2 2023, 2:00 PM

I am going to need a little guidance on this one, I'm afraid.

So, what I'm gathering is that I would need to create a db connection within populate_db.py itself rather than rely on one that I've imported. I would also need to look to the environment for variable definitions rather than to app.config. (I need to read up on how environment variables will be set and accessed within Toolforge, as I don't want to modify the file in such a way that I can no longer run it in a test environment.)

I'm not sure what you mean by the ETL pipeline, though my rough assumption is that, in this case, it means the file contains everything necessary to get the data, transform it, and insert it into the database. To hit the first requirement I'd need to create an instance of the ToolhubClient within the file; the second is already fulfilled; and for the third I'd need to create a db connection, as mentioned above. Am I still on the right track?

It isn't strictly idempotent in the sense that the first run (applied to an empty db) will produce a very different outcome, but each subsequent run will produce the same results (i.e., no change) so long as nothing's changed in Toolhub or our db.

You'd mentioned refactoring some of the functions; I will definitely need advice on that.

There's also logging; as it stands I have print() statements sprinkled liberally throughout the file, primarily so I could keep track of things while loading the test data. In a "real" scenario those statements produce an excessive amount of feedback. What information do you think is worth logging?

> So, what I'm gathering is that I would need to create a db connection within populate_db.py itself rather than rely on one that I've imported. I would also need to look to the environment for variable definitions rather than to app.config. (I need to read up on how environment variables will be set and accessed within Toolforge, as I don't want to modify the file in such a way that I can no longer run it in a test environment.)

You'd be able to access env variables using os.environ.get()
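
For instance (the variable names below are placeholders, not the project's actual config keys), passing a default as the second argument keeps the script runnable in a test environment where the variables aren't set:

```
import os

# Placeholder names -- the real keys depend on how the Toolforge
# deployment is configured.
DB_URI = os.environ.get("DATABASE_URI", "sqlite:///test.db")
TOOLHUB_API_URL = os.environ.get(
    "TOOLHUB_API_URL", "https://toolhub.wikimedia.org/api/tools/"
)
```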

> I'm not sure what you mean by the ETL pipeline, though my rough assumption is that, in this case, it means the file contains everything necessary to get the data, transform it, and insert it into the database. To hit the first requirement I'd need to create an instance of the ToolhubClient within the file; the second is already fulfilled; and for the third I'd need to create a db connection, as mentioned above. Am I still on the right track?

That's right. This also means that manage.py would have to be refactored, as the "Extract" step (a fancy word for fetching the tool data from Toolhub) would be moved from there to this script.
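
To make the shape concrete, here's a rough sketch of what the stand-alone script could look like. This isn't the actual implementation: the env variable names are placeholders, and the pagination handling assumes Toolhub's API returns standard paginated responses with "results" and "next" fields.

```
import os

import requests
from sqlalchemy import create_engine


def extract(api_url):
    """Fetch all tool records, following the API's paginated 'next' links."""
    tools, url = [], api_url
    while url:
        page = requests.get(url, timeout=30).json()
        tools.extend(page["results"])
        url = page.get("next")  # None on the last page
    return tools


def transform(raw_tools):
    """Reshape raw Toolhub records into rows for our tables (fields elided)."""
    return [{"name": t["name"], "title": t["title"]} for t in raw_tools]


def load(engine, rows):
    """Sync the rows into the db so it mirrors Toolhub."""
    with engine.begin() as conn:
        ...  # inserts/updates/deletes go here


def run_pipeline():
    # No Flask app instance needed: config comes from the environment
    # and the engine is created directly.
    engine = create_engine(os.environ["DATABASE_URI"])
    rows = transform(extract(os.environ["TOOLHUB_API_URL"]))
    load(engine, rows)


if __name__ == "__main__":
    run_pipeline()
```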

> It isn't strictly idempotent in the sense that the first run (applied to an empty db) will produce a very different outcome, but each subsequent run will produce the same results (i.e., no change) so long as nothing's changed in Toolhub or our db.

No, you got that right. My wording was poor: if running the script a second time produces no changes, provided that nothing changed in either Toolhub or our db between the runs, then the script is idempotent. It's just like submitting the same PUT request several times: the first one results in a change, but subsequent requests do nothing.
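
To illustrate with the Load step, here's a minimal sketch of an idempotent sync (the tool table and its name column are assumed for illustration, and fetched_names is a set). On a second run both set differences come out empty, so nothing happens:

```
from sqlalchemy import text


def sync_tools(conn, fetched_names):
    """Make the 'tool' table match the set of names fetched from Toolhub."""
    existing = {row[0] for row in conn.execute(text("SELECT name FROM tool"))}
    # First run on an empty db: everything gets inserted.
    for name in fetched_names - existing:
        conn.execute(text("INSERT INTO tool (name) VALUES (:n)"), {"n": name})
    # Tools that disappeared from Toolhub get removed.
    for name in existing - fetched_names:
        conn.execute(text("DELETE FROM tool WHERE name = :n"), {"n": name})
```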

> You'd mentioned refactoring some of the functions; I will definitely need advice on that.

I intentionally omitted the details – this would be another task for later. :)

> There's also logging; as it stands I have print() statements sprinkled liberally throughout the file, primarily so I could keep track of things while loading the test data. In a "real" scenario those statements produce an excessive amount of feedback. What information do you think is worth logging?

  • Anything that would help with debugging
  • Status updates to know where you are in the script
  • Results

Example:

Inserting field info into the database...
Done!
Loading data from Toolhub... Hold on, this will take a moment
Done! Fetched 2876 tool records in 1m13s
Inserting tasks into the database...
...
Removing tasks from the database...
...
All done! 
{Summary of changes}
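
One way to get this with the standard logging module instead of print() (a sketch; the logger name, env variable, and messages are illustrative): put debugging detail at DEBUG and status updates/results at INFO, so the cron job's verbosity can be dialed up or down without touching the code.

```
import logging
import os

logging.basicConfig(
    level=os.environ.get("LOG_LEVEL", "INFO"),
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("populate_db")

logger.info("Loading data from Toolhub...")
tools = extract(os.environ["TOOLHUB_API_URL"])  # from the sketch above
logger.info("Done! Fetched %d tool records", len(tools))
logger.debug("First record: %r", tools[0])  # only visible at DEBUG level
```
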
NicoleLBee changed the task status from Open to In Progress · Mar 6 2023, 7:51 PM
NicoleLBee moved this task from Backlog to In Progress on the Toolhunt board.

I've opened a draft PR to address this task: https://github.com/wikimedia/toolhunt/pull/50

It would be good to have this done before we switch the deployment over to the "real Toolhub", together with a slight refactor of our db tables, including whatever code changes that would entail. There's no proper task for this yet, but some ideas can be seen in this PoC notebook: https://github.com/blancadesal/jupyter-notebooks/blob/main/toolhub-etl-example.md

We had several tasks that amounted to the same thing. Merged them.

@Slst2020 and I reviewed the PoC notebook linked above, and I'm going to revamp populate_db.py entirely using its contents.

https://github.com/wikimedia/toolhunt/pull/56

This is an unfortunately hefty PR, in which I've

  • added a new table, completed_task, to the db.
  • completely overhauled the contents of populate_db.py (using code provided by @Slst2020)
  • updated task-related schemas and all the routes that dealt with tasks and task-related information

I've also opened a related PR for the frontend:
https://github.com/wikimedia/toolhunt-ui/pull/51

The backend PR has been approved; waiting on approval for the frontend component before moving forward.

Front- and back-end PRs have been merged.