Add Link engineering: Consolidate dedcode/addlink and mgerlach/mwaddlink-query into single repository
Closed, Resolved, Public

Description

From T261403#6541529:

@kostajh

  1. take https://github.com/martingerlach/mwaddlink-query and move utility methods from https://github.com/dedcode/mwaddlink into it
    1. maybe that involves making a small shared library between the two repos, depending on whether the model trainer also needs access to these methods? The overall goal would be to remove https://github.com/martingerlach/mwaddlink-query/blob/main/addlink-query_links.py#L8-L11

This could work as a temporary solution. The better option would probably be to have a shared library for both the training and the query part, to make sure the parsing is consistent across both. This will probably become more important later as we tweak the training of the model (once we see what needs improvement when applying it to different languages). Maybe we can deal with the more general solution later.
I also want to incorporate some of the suggestions mentioned in T258978#6532612.

@MGerlach, maybe it would be easier if we just have a single repository, and then use multiple requirements.txt files so that the code we ship in the production query service doesn't have all of the heavier libraries used for training the model, and the code used for training the model doesn't have the HTTP API libraries, etc? The advantage would be reduced overhead in making updates to code shared across training / querying (e.g. you wouldn't have to update a library, commit and push, then update the training and query repos to use the updated library).
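One consequence of the single-repository approach is that modules shared by training and querying should not import training-only libraries at module level, otherwise the slim query install fails on import. A minimal sketch of one way to handle that, assuming hypothetical names (`require_training_dependency`, `normalize_title`, and `train` are illustrations, not functions from the mwaddlink code):

```python
import importlib


def require_training_dependency(module_name):
    """Import a training-only module on demand, with a clear error if it is absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"'{module_name}' is not installed; it is only expected in the "
            "training environment, not in the query service."
        ) from exc


def normalize_title(title):
    """Shared helper (hypothetical) that needs no heavy dependencies."""
    return title.strip().replace(" ", "_")


def train(heavy_module_name):
    """Training entry point (hypothetical): the heavy library is loaded only here."""
    heavy = require_training_dependency(heavy_module_name)
    return heavy.__name__
```

With this layout the query service can import `normalize_title` without the training libraries installed, and a missing library only surfaces when `train` is actually called.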

Event Timeline

kostajh created this task.

Assigning to you, but if you'd like help with this (reviewing or implementing) let Growth-Team know.

Looks like the code is in a single repo (and will soon be imported to gerrit, where we should push patches), but leaving this open to implement the multiple requirements.txt approach.

mwaddlink-query is deprecated and all its functionality has been merged into the main mwaddlink repo, which is the one repo to be maintained and moved to gerrit (via T261403).

@kostajh are there any naming/structuring conventions for virtual environments in production that I should follow? For example, in the solution described above there will be several requirements files in a requirements folder, with the requirements.txt in the main folder mirroring the production environment.

I think it is flexible. Maybe requirements-training.txt and requirements.txt, where the latter is the slimmed down version used for the production query service? They could both be in the root of the repository. Also, AIUI, the production environment wouldn't use a virtual environment; we'd install the libraries during the process of building the Docker image. But we can talk to Release Engineering about it in T265893: Add Link engineering: Deployment Pipeline setup.

@kostajh at the moment the gerrit-repo contains two requirements-files:

  • requirements.txt (the full environment required for training and querying)
  • requirements_query.txt (the lighter environment only for querying)

We could easily switch the names according to your suggestion above, but if it works either way I would just leave them as is. I also wanted to check whether any of the setups depend on these file names.
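Since requirements.txt is meant to be the full environment and requirements_query.txt the lighter one, it may help to check that the two files do not drift apart. Here is a rough sketch of such a check, assuming the query file should list only packages that also appear in the full file. The function names are made up for illustration, and the parsing is deliberately simple (it ignores pip option lines such as `-r` and does not handle URL requirements).

```python
import re
from pathlib import Path


def requirement_names(path):
    """Return the normalized package names listed in a pip requirements file."""
    names = set()
    for raw in Path(path).read_text().splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):  # skip blanks and pip options (-r, -e, ...)
            continue
        # The package name ends at the first specifier, extra, marker, or space.
        name = re.split(r"[<>=!~;\[\s@]", line, maxsplit=1)[0]
        # Normalize so that "Some.Pkg", "some_pkg", and "some-pkg" compare equal.
        names.add(re.sub(r"[-_.]+", "-", name).lower())
    return names


def query_only_packages(full_path="requirements.txt",
                        query_path="requirements_query.txt"):
    """Packages listed in the query file but missing from the full file."""
    return requirement_names(query_path) - requirement_names(full_path)
```

An empty result from `query_only_packages()` would mean the query environment is a subset of the full one. This only compares package names, not pinned versions.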