
Add Link engineering: Deployment Pipeline setup
Closed, Resolved · Public

Description

Following https://wikitech.wikimedia.org/wiki/Deployment_pipeline/Migration/Tutorial we should add deployment pipeline configuration to the research/mwaddlink repository.

Event Timeline

kostajh added a subscriber: MGerlach.

Hi Release-Engineering-Team, @MGerlach and the Growth team will be working on implementing the deployment pipeline for https://gerrit.wikimedia.org/r/plugins/gitiles/research/mwaddlink/+/refs/heads/main. It's a Python application, and there will be two requirements.txt files: one for building the application with the capability to train an ML model, and one for a slimmed-down application that handles web requests and queries data dictionaries.

Do you have any preferences on naming for the requirements.txt files, placement in the repo, whether we should use a virtual env in the Docker image building process, or other concerns?

Change 635275 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] Add config for linting code with flake8

https://gerrit.wikimedia.org/r/635275

Change 635277 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[integration/config@master] layout: Add deployment pipeline config for research/mwaddlink

https://gerrit.wikimedia.org/r/635277

Change 635383 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] Add deployment pipeline configuration

https://gerrit.wikimedia.org/r/635383

Change 635277 merged by jenkins-bot:
[integration/config@master] layout: Add deployment pipeline config for research/mwaddlink

https://gerrit.wikimedia.org/r/635277

Change 635275 merged by jenkins-bot:
[research/mwaddlink@main] Add config for linting code with flake8

https://gerrit.wikimedia.org/r/635275

Change 635383 merged by jenkins-bot:
[research/mwaddlink@main] Enable tox entrypoint, and fix issues

https://gerrit.wikimedia.org/r/635383

Change 636900 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] Create a single production image

https://gerrit.wikimedia.org/r/636900

Change 636916 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[operations/deployment-charts@master] linkrecommendation: Add deployment chart

https://gerrit.wikimedia.org/r/636916

Change 636900 merged by jenkins-bot:
[research/mwaddlink@main] Create a single production image

https://gerrit.wikimedia.org/r/636900

Change 636916 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] linkrecommendation: Add deployment chart

https://gerrit.wikimedia.org/r/636916

Change 645064 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Add tokens and users for 3 new k8s services

https://gerrit.wikimedia.org/r/645064

Change 645323 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[labs/private@master] Add tokens and users for 3 new k8s services

https://gerrit.wikimedia.org/r/645323

Change 645323 merged by Alexandros Kosiaris:
[labs/private@master] Add tokens and users for 3 new k8s services

https://gerrit.wikimedia.org/r/645323

Change 645064 merged by Alexandros Kosiaris:
[operations/puppet@production] Add tokens and users for 3 new k8s services

https://gerrit.wikimedia.org/r/645064

Change 645376 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] linkrecommendation: Create the namespace

https://gerrit.wikimedia.org/r/645376

Change 645376 merged by jenkins-bot:
[operations/deployment-charts@master] linkrecommendation: Create the namespace

https://gerrit.wikimedia.org/r/645376

@kostajh, @MGerlach I've gone ahead and created tokens and namespaces. You should now be able to deploy the service. Docs are at https://wikitech.wikimedia.org/wiki/Deployments_on_kubernetes#Deploying_with_helmfile. The TL;DR is that you'll need to create a change similar to https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/645076, but tailored to your service, and then follow the helmfile -e <env> steps mentioned in the docs. Also, you should have +2 access to the repo; let me know if you don't so I can add you.

Btw, if the service is going to reach out via the network to any other resource than MySQL (which has already been taken care of), now is the time to say so :-)

Thanks @akosiaris! I will work on getting a patch for you this week.

I don't have +2 to that repo, could you please add me, @MGerlach and @Tgr there?

I think we will be downloading database dumps via Swift from within the container. Does that require additional setup on your end?

Change 646649 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[operations/deployment-charts@master] linkrecommendation: Add helmfile.d config

https://gerrit.wikimedia.org/r/646649

I also don't have access to deployment servers so would need help with the helmfile -e step, or @Tgr could do it, and in the meantime I'll see about access.

Change 646658 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[operations/deployment-charts@master] linkrecommendation: Add private config for DB write user

https://gerrit.wikimedia.org/r/646658

I also filed T269581: Add Link engineering: Allow external traffic to linkrecommendation service for discussion.

Change 646658 merged by jenkins-bot:
[operations/deployment-charts@master] linkrecommendation: Add private config for DB admin user

https://gerrit.wikimedia.org/r/646658

Change 646649 merged by jenkins-bot:
[operations/deployment-charts@master] linkrecommendation: Add helmfile.d config

https://gerrit.wikimedia.org/r/646649

Change 647216 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] linkrecommendation: Allow MySQL egress and set public_port

https://gerrit.wikimedia.org/r/647216

Change 647216 merged by jenkins-bot:
[operations/deployment-charts@master] linkrecommendation: Allow MySQL egress and set public_port

https://gerrit.wikimedia.org/r/647216

> I don't have +2 to that repo, could you please add me, @MGerlach and @Tgr there?

@Tgr already had it, and I've added you. I've left out @MGerlach per our IRC conversation; let me know if there is something wrong.

> I think we will be downloading database dumps via Swift from within the container. Does that require additional setup on your end?

Yes, it will require opening a rule. Let me know if that's indeed the way you are going.

@akosiaris my preference is to use the existing workflow on stats machines to automatically publish a dataset. Then, a request from stats machine to the kubernetes service would instruct the app to download new datasets from the public endpoint via curl / HTTP.

Would this workflow be acceptable? If not I will look into using Swift.

Maybe the service could just regularly query the datasets and compare them to local ones. The stats webserver can probably be configured to publish checksums; worst case, they could be published as a separate file. Or maybe the stats machine could expose a readonly rsync daemon.

That way, we could use the same update mechanism in production and for local development / Wikimedia Cloud / etc, which makes maintenance and debugging easier.

To clarify what I'm proposing:

> Then, a request from stats machine to the kubernetes service would instruct the app to download new datasets from the public endpoint via curl / HTTP.

We have a script, run-pipeline.sh, which trains a model for a wiki and generates datasets in SQLite and MySQL table format. We would add a step to that script which copies the datasets to something like /srv/published/datasets/research/mwaddlink.

So, we run ./run-pipeline.sh on stats1008. Once the files have been published to datasets.wikimedia.org (~15 minutes), we run something like curl {kubernetesService}/refreshdata?wiki_id={wiki_id}. Since the service isn't exposed to external traffic, it seems OK to do this without an additional check for a shared secret.

Similarly, it should also be safe to use in your local environment.

We would want to add support for using a shared secret for the Toolforge deployment of the app, or a feature flag to disable this functionality there. There could be a Python script that triggers the same workflow, just like how we have query.py (CLI) and api.py (Flask) that end up using the same code paths.
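A shared-secret check for the Toolforge case could be a constant-time token comparison. A minimal sketch, assuming a hypothetical REFRESH_SECRET environment variable and X-Refresh-Token request header (neither exists in mwaddlink today):

```python
import hmac
import os

# Hypothetical shared secret, e.g. injected via the environment on Toolforge.
REFRESH_SECRET = os.environ.get("REFRESH_SECRET", "change-me")

def is_authorized(presented_token: str) -> bool:
    """Compare the token from the request against the shared secret.

    hmac.compare_digest runs in constant time, which avoids leaking
    information about the secret through timing differences.
    """
    return hmac.compare_digest(presented_token, REFRESH_SECRET)

# In the Flask handler this might look like:
#   if not is_authorized(request.headers.get("X-Refresh-Token", "")):
#       abort(403)
```

The feature flag variant would simply skip registering the /refreshdata route when the flag is unset.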

Another consideration is how to pause the cron job for refreshLinkRecommendations.php for a wiki while we refresh the datasets. Is it straightforward to temporarily stop a cron job in our setup while the database tables are re-imported?

Thank you @akosiaris. I've run helmfile -e for staging (verified with curl -L https://staging.svc.eqiad.wmnet:4005/apidocs + service-checker-swagger staging.svc.eqiad.wmnet https://staging.svc.eqiad.wmnet:4005 -t 2 -s /apispec_1.json), eqiad and codfw. AIUI it's back to your team now to implement the LVS / networking setup for eqiad/codfw.

\o/ Nice! We'll proceed with that.

> To clarify what I'm proposing:
>
> Then, a request from stats machine to the kubernetes service would instruct the app to download new datasets from the public endpoint via curl / HTTP.

That's a bad pattern IMHO. The main reason is that we would be using part of the service's capacity for something that has nothing to do with its actual task, which is serving requests (i.e. network, CPU and memory would be consumed by a batch job instead of request serving). Furthermore, we would be adding an API endpoint that can be called by anyone internally; an attacker who somehow gained access to WMF IPs could exploit it to cause an outage.

This functionality has all the aspects of a batch maintenance job and should be treated as such: launched from a different part of the infrastructure, regardless of whether it shares code with the service. Whether that is from the mwmaint* servers, from some analytics bastion, or from a dedicated kubernetes pod (e.g. as part of https://kubernetes.io/docs/concepts/workloads/controllers/job/ run as a helm hook) is something we should figure out and discuss, but having the service do batch work should not happen.

> We have a script, run-pipeline.sh, which trains a model for a wiki and generates datasets in SQLite and MySQL table format. We would add a step to that script which copies the datasets to something like /srv/published/datasets/research/mwaddlink.
>
> So, we run ./run-pipeline.sh on stats1008. Once the files have been published to datasets.wikimedia.org (~15 minutes)

+1 up to here.

> we run something like curl {kubernetesService}/refreshdata?wiki_id={wiki_id}. Since the service isn't exposed to external traffic, it seems OK to do this without an additional check for a shared secret.

Instead, let's just create a batch job script that runs on e.g. mwmaint* hosts, fetches that file, parses it, and populates MySQL.

> Similarly, it should also be safe to use in your local environment.

Same batch job script. In fact if you want it to share code with the service, it's fine to ship it with the service. We could reuse the image to spawn some kubernetes job.

> We would want to add support for using a shared secret for the Toolforge deployment of the app, or a feature flag to disable this functionality there. There could be a Python script that triggers the same workflow, just like how we have query.py (CLI) and api.py (Flask) that end up using the same code paths.
>
> Another consideration is how to pause the cron job for refreshLinkRecommendations.php for a wiki while we refresh the datasets. Is it straightforward to temporarily stop a cron job in our setup while the database tables are re-imported?

No, not really (it's in fact one of the mild pain points of a datacenter switchover that some jobs keep on running). But your observation has me asking whether it would make sense for refreshLinkRecommendations.php to encapsulate this batch job. It's going to be a maintenance job anyway, so it seems a more logical home for it. Or, at least, a calling script that runs the batch updater (gated on some trigger, e.g. the checksums @Tgr mentioned above) and then refreshLinkRecommendations.php in sequence.
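The checksum-gated calling script could look roughly like the sketch below. sha256_of, needs_import, and the published-checksum URL layout are all assumptions for illustration, not the actual mwaddlink implementation:

```python
import hashlib
from typing import Optional

def sha256_of(data: bytes) -> str:
    """Checksum of a downloaded dataset blob."""
    return hashlib.sha256(data).hexdigest()

def needs_import(published_checksum: str, stored_checksum: Optional[str]) -> bool:
    """Import only when the published checksum differs from what the wiki
    DB says was last imported (None means nothing has been imported yet)."""
    return stored_checksum is None or published_checksum != stored_checksum

# Per wiki, the calling script would then (URL layout is hypothetical):
#   1. fetch the published checksum from datasets.wikimedia.org
#   2. if needs_import(published, stored): download the dataset, import it,
#      and store the new checksum in the wiki DB
#   3. run refreshLinkRecommendations.php
```

Keeping the gate on the stored-checksum comparison means the expensive download and import only happen when the training pipeline has actually published new data.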

> That's a bad pattern IMHO. The main reason is that we would be using part of the service's capacity for something that has nothing to do with its actual task, which is serving requests (i.e. network, CPU and memory would be consumed by a batch job instead of request serving). Furthermore, we would be adding an API endpoint that can be called by anyone internally; an attacker who somehow gained access to WMF IPs could exploit it to cause an outage.

The mwmaint option would be open to this vulnerability too, right? You could reset the checksum stored locally and re-run refreshLinkRecommendations.php.

> Instead, let's just create a batch job script that runs on e.g. mwmaint* hosts, fetches that file, parses it, and populates MySQL.

That would definitely be easier, but the database is restricted to be accessed from a kubernetes pod (T267214#6662762), so we would need to change those access rules if we wanted to run the import from mwmaint, right?

> Same batch job script. In fact, if you want it to share code with the service, it's fine to ship it with the service. We could reuse the image to spawn some kubernetes job.

I think it would make sense to share code with the service for ad hoc deployments (e.g. the Toolforge instance) but per your comment about cron, for production we should probably have a separate implementation in refreshLinkRecommendations.php (or a different maintenance script that can be invoked separately from refreshLinkRecommendations.php)

> No, not really (it's in fact one of the mild pain points of a datacenter switchover that some jobs keep on running). But your observation has me asking whether it would make sense for refreshLinkRecommendations.php to encapsulate this batch job. It's going to be a maintenance job anyway, so it seems a more logical home for it. Or, at least, a calling script that runs the batch updater (gated on some trigger, e.g. the checksums @Tgr mentioned above) and then refreshLinkRecommendations.php in sequence.

Right, I think this could work. I assumed we weren't allowed to read/write to this DB from mwmaint, which is why I had not pursued this before. So refreshLinkRecommendations.php:

  • checks the checksum on datasets.wikimedia.org against what we have stored in the current wiki DB
  • initiates the download and import process
  • continues with the link refresh process

I don't think we need a lock file to indicate that the download/import process is running, as the import should complete well within the hourly cron cadence, but I'm bringing it up in case it's something to consider.

Also to clarify -- the checksum comparison will be useful to import new data as new datasets are generated, but it's anticipated that this is something that might happen a few times a year at most. The most common use-case is going to be adding a new wiki to the list of wikis that the service supports, and doing the initial data import.

> The mwmaint option would be open to this vulnerability too, right? You could reset the checksum stored locally and re-run refreshLinkRecommendations.php.

It's a different attack path though. One would first need to obtain access to the mwmaint server, not just ANY server in the infrastructure; that already limits the scope a lot. Furthermore, they would need to obtain the privileges to alter and run the tool (at which point they would have access to the database, which is an issue, but a different one altogether). And finally, they would at most be overwhelming the mwmaint server, which runs asynchronous jobs, not the infrastructure that powers the service.

> That would definitely be easier, but the database is restricted to be accessed from a kubernetes pod (T267214#6662762), so we would need to change those access rules if we wanted to run the import from mwmaint, right?

Yes, that would be easy enough, but it's allowed already anyway:

mysql -h m2-master mwaddlink -u linkrecommendation -p
Enter password: 
Welcome to the MariaDB monitor.  Commands end with ; or \g.

> I think it would make sense to share code with the service for ad hoc deployments (e.g. the Toolforge instance), but per your comment about cron, for production we should probably have a separate implementation in refreshLinkRecommendations.php (or a different maintenance script that can be invoked separately from refreshLinkRecommendations.php).

Sure, fine by me. We can do it in lockstep with refreshLinkRecommendations.php. That being said, since we are talking about populating a MySQL db, all it takes is running the update in a single transaction, right? Which means there is no reason to implement a locking mechanism on our side, nor to recreate the logic differently for production vs local dev envs; we can rely on the DB to perform the locking.
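The single-transaction idea can be illustrated with stdlib sqlite3 standing in for MySQL (the table and column names here are made up, not the real schema): either the whole import commits, or on error the old rows survive untouched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lr_model (anchor TEXT, target TEXT)")  # made-up schema
conn.execute("INSERT INTO lr_model VALUES ('old', 'Old_Article')")
conn.commit()

new_rows = [("alpha", "Alpha"), ("beta", "Beta")]

# Replace the whole dataset inside one transaction: the `with` block is
# BEGIN ... COMMIT on success, ROLLBACK on exception, so a failed import
# leaves the old rows in place instead of a half-loaded table.
with conn:
    conn.execute("DELETE FROM lr_model")
    conn.executemany("INSERT INTO lr_model VALUES (?, ?)", new_rows)

rows = conn.execute("SELECT anchor FROM lr_model ORDER BY anchor").fetchall()
```

With MySQL/InnoDB the same pattern has the added benefit that concurrent readers keep seeing the old rows until the COMMIT, which is what makes the explicit table locking unnecessary.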

> Right, I think this could work. I assumed we weren't allowed to read/write to this DB from mwmaint, which is why I had not pursued this before. So refreshLinkRecommendations.php:
>
>   • checks the checksum on datasets.wikimedia.org against what we have stored in the current wiki DB
>   • initiates the download and import process
>   • continues with the link refresh process

+1

> I don't think we need a lock file to indicate that the download/import process is running, as the import should complete well within the hourly cron cadence, but I'm bringing it up in case it's something to consider.

If we rely on a single transaction for the import, I don't think we will need it at all.

> Also to clarify -- the checksum comparison will be useful to import new data as new datasets are generated, but it's anticipated that this might happen a few times a year at most. The most common use-case is going to be adding a new wiki to the list of wikis that the service supports, and doing the initial data import.

Cool, thanks for clarifying.

Talked this over with @kostajh and the options we saw for the batch job were:

  • Use a MediaWiki maintenance script to download the dataset over the web and import it into the production table. This is the well-trodden path, since there's plenty of scaffolding for MediaWiki batch jobs, but conceptually wrong: the script belongs to the mwaddlink repository, not MediaWiki.
  • Make the script part of the mwaddlink repo, download the dataset over the web, and set up a cronjob on the same host group as the link recommendation service. Not quite sure how that looks in detail; I imagine one of the Kubernetes pods would somehow be designated the job runner. Stateless service pods seem not ideal for this kind of thing.
  • Make the script part of the mwaddlink repo but run it on stats1008. No need to transfer large datasets over the web; OTOH the DB table credentials will have to be duplicated to the stats machine.

The third approach seemed the simplest to us. Unlike an MW maintenance script that's part of the same cron entry as refreshLinkRecommendations.php, traffic to the service would not be suspended while the datasets are being imported, but that's not a big deal - we just lock the tables during import, let the service error out, and make refreshLinkRecommendations.php wait and retry on error. @akosiaris does that sound reasonable?

(The task for this is T272419: Add Link engineering: Script for updating datasets used by the mwaddlink service btw. Currently it is written based on the first option, but that would change.)

Re-opening while we discuss T265893#6760579

> Talked this over with @kostajh and the options we saw for the batch job were:
>
> • Use a MediaWiki maintenance script to download the dataset over the web and import it into the production table. This is the well-trodden path, since there's plenty of scaffolding for MediaWiki batch jobs, but conceptually wrong: the script belongs to the mwaddlink repository, not MediaWiki.

Agreed on the conceptually wrong part. Let's keep this as a backup plan, but +1 on NOT following that route.

> • Make the script part of the mwaddlink repo, download the dataset over the web, and set up a cronjob on the same host group as the link recommendation service. Not quite sure how that looks in detail; I imagine one of the Kubernetes pods would somehow be designated the job runner.

Just to clarify, since it seems a bit unclear per your comment: there isn't a "host group" for services in kubernetes. We don't treat the underlying machines the way we treat the rest of the fleet (where e.g. we would set up a cronjob for something service-related, relying on the code being on the filesystem), because they don't really have the service's software "installed" (contrast this with the mwmaint paradigm, where the code is clearly populated in an easy-to-find place on the host). Rather, we ask the cluster to spawn a number of pods using a specific container image and configuration, and the cluster tries to do so. Which hosts the pods end up running on, where the code is populated, etc., we don't really interfere with, aside from providing the cluster with hints (e.g. "I want 10 pods/instances; try to spread them out") to avoid issues like creating SPOFs.

However, there is functionality for running recurrent, scheduled jobs on kubernetes. The docs are at https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/ but conceptually the premise is that one or more new pods (so nothing to do with the ones currently serving traffic) are started (on which machines is again unimportant), do whatever we instruct them to do, and then terminate. That could happen by just bundling the script in the mwaddlink repo and amending the helm chart a bit.

> Seems like stateless service pods are not ideal for this kind of thing.

They most definitely aren't. That's why the cronjob resource linked above exists.

> • Make the script part of the mwaddlink repo but run it on stats1008. No need to transfer large datasets over the web; OTOH the DB table credentials will have to be duplicated to the stats machine.

That would work, but it would also require that the mwaddlink repo is checked out and kept up to date on stat1008 with whatever is deployed in production. If that's already happening, it's quite probably fine; if not, it's another moving part to think about. It also requires that we open some firewall holes, which is yet another moving part, and one that has a tendency to get out of sync over time as hosts get replaced/refreshed (it is becoming more automated, though, so the friction is felt less) and causes some friction between teams.

I am guessing it would also make the loading of a new dataset more streamlined for you as it would become part of the releasing process (is that correct?)

The third approach seemed the simplest to us. Unlike an MW maintenance script that's part of the same cron entry as refreshLinkRecommendations.php, traffic to the service would not be suspended while the datasets are being imported, but that's not a big deal - we just lock the tables during import, let the service error out, and make refreshLinkRecommendations.php wait and retry on error. @akosiaris does that sound reasonable?

Both 2 and 3 sound fine to me, with the added note that for option 3, if/when we get a streamlined ETL process for populating datasets from analytics to production (instead of the adhoc direct access to database hosts) we should aim to migrate to it, to avoid some of the issues above (e.g. the firewall hole punching).

In all cases, I'd urge you to consider, instead of locking the tables, just using a database transaction and doing the entirety of the import in a single transaction. That would allow the service to continue working while the tables are being loaded with new data (albeit obviously serving results from the old dataset) and to switch atomically to the new dataset once the transaction is committed, thus avoiding the deliberate erroring out.

Talked this over with @kostajh and the options we saw for the batch job were:

  • Use a MediaWiki maintenance script to download the dataset over the web and import it into the production table. This is the well-trodden path since there's plenty of scaffolding for MediaWiki batch jobs, but conceptually wrong: the script belongs to the mwaddlink repository, not MediaWiki.

Agreed on the conceptually wrong part. Let's keep this as a backup plan, but +1 on NOT following that route.

+1s all around then. Let's not do that.

  • Make the script part of the mwaddlink repo, download the dataset over the web, and set up a cronjob on the same group host as the link recommendation service. Not quite sure how that looks in detail; I imagine one of the Kubernetes pods would be designated somehow to be the job runner.

Just to clarify, since it seems to be a bit unclear per your comment: there isn't a "host group" for services in kubernetes. That is, we don't treat the underlying machines the way we treat the rest of the fleet (where e.g. we would set up a cronjob to do something service-related, relying on the fact that the code is on the filesystem), as they don't really have the software for the service "installed" (contrast this with the mwmaint paradigm, where the code is clearly populated in an easy-to-find place on the host). Rather, we ask the cluster to spawn a number of pods using a specific container image and configuration, and the cluster tries to do so. Which hosts the pods end up running on, where the code is populated, etc., we don't really interfere with much, aside from providing the cluster with hints (e.g. "I want 10 pods/instances; try to spread them out") to avoid issues like creating SPOFs.

However, there does exist functionality to run recurring, scheduled jobs on kubernetes. The docs are at https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/ but conceptually the premise is that 1 or more new pods (nothing to do with the ones currently serving traffic) are started (on which machines is, again, completely unimportant), do whatever we instruct them to do, and then terminate. This could happen by just bundling the script in the mwaddlink repo and amending the helm chart a bit.

Ok, so if we go this route then in the mwaddlink repo we would need:

  • load-datasets.py, set to execute hourly (?)
  • that script looks at the latest checksums published on datasets.wikimedia.org and compares them with the checksum values stored in the link recommendation service database
    • if they differ, the script downloads the datasets and imports them using the single transaction @akosiaris suggested

Seems reasonable, and the load-datasets.py script would also be usable in a local environment, on toolforge, etc.
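The checksum-comparison step described in the list above could look something like this. This is a hypothetical sketch: the function name `datasets_to_refresh` and the dataset/checksum names are illustrative, not the actual mwaddlink API, and fetching the published checksums and the stored ones is left out.

```python
# Hypothetical sketch of the comparison step in load-datasets.py:
# given the checksums published on datasets.wikimedia.org and the
# checksums recorded in the link recommendation database, decide
# which datasets need to be downloaded and re-imported.


def datasets_to_refresh(remote: dict, stored: dict) -> list:
    """Return dataset names whose published checksum differs from the
    stored one (including datasets that were never imported at all)."""
    return sorted(
        name
        for name, checksum in remote.items()
        if stored.get(name) != checksum
    )


remote = {"anchors": "abc123", "model": "def456", "pageids": "aaa111"}
stored = {"anchors": "abc123", "model": "000000"}  # "pageids" never imported

print(datasets_to_refresh(remote, stored))  # → ['model', 'pageids']
```

Datasets missing from the stored table are treated the same as stale ones, so a fresh database gets fully populated on the first cron run.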

Seems like stateless service pods are not ideal for this kind of thing.

They most definitely aren't. That's why the cronjob resource linked above exists.

  • Make the script part of the mwaddlink repo but run it on stat1008. No need to transfer large datasets over the web; OTOH the DB table credentials will have to be duplicated to the stats machine.

That would work, but it would also require that the mwaddlink repo is checked out and kept up to date on stat1008 with whatever is deployed in production. If that's already happening, it's quite probably fine; if not, it's another moving part to think about. It also requires that we open some firewall holes, which is yet another moving part, and one that has a tendency to get out of sync over time as hosts get replaced/refreshed (it is becoming more automated, though, so the friction is felt less) and causes some friction between teams.

I am guessing it would also make the loading of a new dataset more streamlined for you as it would become part of the releasing process (is that correct?)

Yeah, could be nice to be able to import it immediately but it's not so terrible to have to wait for the scheduled cron job to pick up that there's a dataset to import.

The third approach seemed the simplest to us. Unlike an MW maintenance script that's part of the same cron entry as refreshLinkRecommendations.php, traffic to the service would not be suspended while the datasets are being imported, but that's not a big deal - we just lock the tables during import, let the service error out, and make refreshLinkRecommendations.php wait and retry on error. @akosiaris does that sound reasonable?

Both 2 and 3 sound fine to me, with the added note that for option 3, if/when we get a streamlined ETL process for populating datasets from analytics to production (instead of the adhoc direct access to database hosts) we should aim to migrate to it, to avoid some of the issues above (e.g. the firewall hole punching).

In all cases, I'd urge you to consider, instead of locking the tables, just using a database transaction and doing the entirety of the import in a single transaction. That would allow the service to continue working while the tables are being loaded with new data (albeit obviously serving results from the old dataset) and to switch atomically to the new dataset once the transaction is committed, thus avoiding the deliberate erroring out.

That makes sense, thanks for the suggestion.

That would work, but it would also require that the mwaddlink repo is checked out and kept up to date on stat1008 with whatever is deployed in production.

The stat1008 repo is used for producing the dataset (since that requires access to stats data, and maybe GPUs as well, I'm not sure about that), and that logic is much more likely to change over time than the very simple "take a bunch of tables and copy them verbatim to another database", so I don't think this would require keeping things in sync to any larger extent than is already the case.

I am guessing it would also make the loading of a new dataset more streamlined for you as it would become part of the releasing process (is that correct?)

Yes, or at the very least we could easily tell exactly when the dataset has been updated. With a cronjob, deployments would be more erratic. I'm not sure we expect changes to the dataset format, though; this process would mostly serve to retrain the model to keep up with wiki content changes, and I don't think we'd care much when that happens.

In all cases, I'd urge you to consider, instead of locking the tables, just using a database transaction and doing the entirety of the import in a single transaction.

That's a good point. In scenario 3, it would be simple if we are fine with the extra storage need (how large are these datasets? also, do we have to change them all at the same time, or is one at a time OK?). In scenario 2, if I understand the docs correctly, Kubernetes can't guarantee that it won't run two instances of the job simultaneously; in that case I think we'd need table-level locking. Although we could still use a transaction to make the process smooth and just use a table-level write lock to prevent a second cronjob running in parallel.

All in all, the third option looks best to me:

  • fewer moving parts, as we don't need to transfer the data over the web and deal with resuming, corruption, etc. (we'd probably want that anyway, to support local development setups, but that code would be a lot less critical)
  • more control over timing (easy to kick off the process manually), probably a bit easier to debug / test
  • maybe most importantly, it would not require making the dumps public, so we could include private data. We don't plan to (there was some but it just got removed recently) but it's nice to keep that option open.

I can't judge the ops burden of firewalls, credentials, etc., but it sounds like you aren't concerned about it, @akosiaris, and it will get lifted eventually due to infrastructure changes; if that's the case, I'd vote for #3.

That would work, but it would also require that the mwaddlink repo is checked out and kept up to date on stat1008 with whatever is deployed in production.

The stat1008 repo is used for producing the dataset (since that requires access to stats data, and maybe GPUs as well, I'm not sure about that), and that logic is much more likely to change over time than the very simple "take a bunch of tables and copy them verbatim to another database", so I don't think this would require keeping things in sync to any larger extent than is already the case.

ACK.

I am guessing it would also make the loading of a new dataset more streamlined for you as it would become part of the releasing process (is that correct?)

Yes, or at the very least we could easily tell exactly when the dataset has been updated. With a cronjob, deployments would be more erratic. I'm not sure we expect changes to the dataset format, though; this process would mostly serve to retrain the model to keep up with wiki content changes, and I don't think we'd care much when that happens.

ACK.

In all cases, I'd urge you to consider, instead of locking the tables, just using a database transaction and doing the entirety of the import in a single transaction.

That's a good point. In scenario 3, it would be simple if we are fine with the extra storage need (how large are these datasets? also, do we have to change them all at the same time, or is one at a time OK?). In scenario 2, if I understand the docs correctly, Kubernetes can't guarantee that it won't run two instances of the job simultaneously; in that case I think we'd need table-level locking. Although we could still use a transaction to make the process smooth and just use a table-level write lock to prevent a second cronjob running in parallel.

As long as those 2 (or N, for that matter) jobs are NOT doing DDL statements (e.g. ALTER/TRUNCATE TABLE), each one will be in its own transaction and thus they won't interfere with each other (I am assuming here the MySQL default transaction isolation level, that is, REPEATABLE READ). If they are going to do DDL (e.g. a TRUNCATE TABLE instead of DELETE FROM), then yes, it's problematic and a lock will be required.
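The DELETE FROM vs. TRUNCATE TABLE distinction above can be sketched as follows. This uses Python's stdlib sqlite3 as a stand-in for MySQL (isolation semantics differ: with MySQL/InnoDB under REPEATABLE READ, concurrent readers would keep seeing the old rows until COMMIT, whereas TRUNCATE is DDL and triggers an implicit commit). The table name and rows are illustrative, not the actual mwaddlink schema.

```python
# Sketch of the single-transaction dataset reload: replace all rows
# with DML (DELETE + INSERT) inside one transaction, so the swap to
# the new dataset happens atomically at commit time.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lr_anchors (anchor TEXT, target TEXT)")
conn.execute("INSERT INTO lr_anchors VALUES ('old', 'Old_Page')")
conn.commit()

new_rows = [("new", "New_Page"), ("newer", "Newer_Page")]

# DELETE FROM is transactional; TRUNCATE TABLE (DDL) would not be in
# MySQL, which is why the comment above rules it out.
with conn:  # commits on success, rolls back on any exception
    conn.execute("DELETE FROM lr_anchors")
    conn.executemany("INSERT INTO lr_anchors VALUES (?, ?)", new_rows)

print(conn.execute(
    "SELECT anchor FROM lr_anchors ORDER BY anchor"
).fetchall())  # → [('new',), ('newer',)]
```

If the import fails partway through, the rollback leaves the old dataset fully intact, so the service never observes a half-loaded table.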

All in all, the third option looks best to me:

  • fewer moving parts, as we don't need to transfer the data over the web and deal with resuming, corruption, etc. (we'd probably want that anyway, to support local development setups, but that code would be a lot less critical)
  • more control over timing (easy to kick off the process manually), probably a bit easier to debug / test
  • maybe most importantly, it would not require making the dumps public, so we could include private data. We don't plan to (there was some but it just got removed recently) but it's nice to keep that option open.

I can't judge the ops burden of firewalls, credentials, etc., but it sounds like you aren't concerned about it, @akosiaris, and it will get lifted eventually due to infrastructure changes; if that's the case, I'd vote for #3.

It's not unmanageable for sure, but we definitely need to consolidate later on.

@akosiaris do we have examples in deployment-charts of a https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/ implementation? I see one kind: Job definition for setup-db-job.yaml, but I don't see where setup-db-job is referenced elsewhere, and I don't see kind: CronJob anywhere. Once https://gerrit.wikimedia.org/r/c/research/mwaddlink/+/660334 is merged, the script we'd want to run is python3 load-datasets.py --download=1 --path={???}. Not sure if you have preferences about the path; the data downloaded there needs to persist for the duration of the command execution. Maybe "/tmp/datasets" is fine?

I think we don't have any CronJob resources as of now. Feel free to take the example from the docs and extend it with the default set of labels (like the rest of the objects in your chart have):

labels:
  app: {{ template "wmf.chartname" . }}
  chart: {{ template "wmf.chartid" . }}
  release: {{ .Release.Name }}
  heritage: {{ .Release.Service }}

As for the "setup-db-job": there is no need for further references, as helm will render all files in the templates/ directory and apply them to the cluster.
So I suggest you simply create a new file in your chart's templates directory (like templates/cronjob.yaml).
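Putting the pieces above together, a minimal templates/cronjob.yaml could look roughly like this. This is a sketch, not the actual merged chart: the schedule, image reference, value names, and the `wmf.releasename` helper are assumptions (only the labels block is taken from the snippet above). `concurrencyPolicy: Forbid` addresses the earlier concern about Kubernetes possibly running two instances of the job at once. The `batch/v1beta1` apiVersion matches Kubernetes versions of that era; newer clusters use `batch/v1`.

```yaml
# templates/cronjob.yaml -- illustrative sketch only; schedule, image,
# and .Values names are assumptions, not the real chart values.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: {{ template "wmf.releasename" . }}-load-datasets
  labels:
    app: {{ template "wmf.chartname" . }}
    chart: {{ template "wmf.chartid" . }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
spec:
  schedule: "0 * * * *"       # hourly, per the proposal above
  concurrencyPolicy: Forbid   # don't start a new import while one is running
  jobTemplate:
    spec:
      backoffLimit: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: load-datasets
              image: "{{ .Values.docker.registry }}/{{ .Values.main_app.image }}:{{ .Values.main_app.version }}"
              command: ["python3", "load-datasets.py", "--download=1", "--path=/tmp/datasets"]
```

With `concurrencyPolicy: Forbid`, the table-level locking discussed earlier becomes a second line of defense rather than the primary guard against overlapping imports.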

Change 660394 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[operations/deployment-charts@master] [WIP] linkrecommendation: Cron job to load datasets

https://gerrit.wikimedia.org/r/660394

Change 660436 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] Use admin credentials when performing write operations

https://gerrit.wikimedia.org/r/660436

Change 660436 abandoned by Kosta Harlan:
[research/mwaddlink@main] Use admin credentials when performing write operations

Reason:

https://gerrit.wikimedia.org/r/660436

Change 660394 merged by jenkins-bot:
[operations/deployment-charts@master] linkrecommendation: Cron job to load datasets

https://gerrit.wikimedia.org/r/660394

Change 664277 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] linkrecommendation: Read DB_USER from public config

https://gerrit.wikimedia.org/r/664277

Change 664277 merged by jenkins-bot:
[operations/deployment-charts@master] linkrecommendation: Read DB_USER from public config

https://gerrit.wikimedia.org/r/664277

Change 664310 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[operations/deployment-charts@master] linkrecommendation: Set backoffLimit to 1

https://gerrit.wikimedia.org/r/664310

Change 664310 merged by jenkins-bot:
[operations/deployment-charts@master] linkrecommendation: Set backoffLimit to 1

https://gerrit.wikimedia.org/r/664310

Change 664529 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[operations/deployment-charts@master] linkrecommendation: Disable cronjob for external release

https://gerrit.wikimedia.org/r/664529

Change 664529 merged by jenkins-bot:
[operations/deployment-charts@master] linkrecommendation: Disable cronjob for external release

https://gerrit.wikimedia.org/r/664529

kostajh moved this task from In Progress to QA on the Growth-Team (Current Sprint) board.

The scope of this task expanded somewhat. Anyway, the pipeline infrastructure is set up, the service is running in kubernetes, and the cronjob is set up for importing datasets, so I'm marking this as resolved.

Change 664550 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[operations/deployment-charts@master] linkrecommendation: Disable cron job on codfw

https://gerrit.wikimedia.org/r/664550

Change 664550 abandoned by Kosta Harlan:
[operations/deployment-charts@master] linkrecommendation: Disable cron job on codfw

Reason:

https://gerrit.wikimedia.org/r/664550