Spinning this off from questions @jijiki had on {T325635}, which seemed independent enough that we should split this discussion from the other discussion happening around ongoing imports, and which perhaps shouldn't be happening on a stale-ish/redundant-ish ticket.

Copy-pasting:
>>! In T325635#8931805, @jijiki wrote:
> Since #serviceops is done with T336163, we must consider how we are going to do the initial import in production. This is our suggestion:
>
> * Download the latest dump on deploy1002
> * Introduce flags to the application (or have a separate application/script) to instruct it to make the data import and exit
> * Introduce a flag to specify which file to read from
> * Run the import as a standalone kubernetes Job (one-off)
> * Have the ability to restart the job, or continue from where it left off in case of an error (eg the node it was running on died)
>
> Given that we have the ability to provide as many resources as the pod needs, we can make it possible to load the whole dump in memory, if that would help with our current challenges.

>>! In T325635#8934016, @STran wrote:
>> Download the latest dump on deploy1002
> Can someone do this manually? Or do you want a programmatic way of doing it? For the latter, {T325630} is not done yet.
>> Introduce flags to the application (or have a separate application/script) to instruct it to make the data import and exit
> `import-data.js` does this already, looking for the file from a source specified as an environment variable.
>> Introduce a flag to specify which file to read from
> Is the environment variable alright?
>> Run the import as a standalone kubernetes Job (one-off)
> I don't know what a job is in kubernetes' understanding of the word, but this could be done manually by running `node ./import-data.js` with the feed where the script expects it to be.
>> Have the ability to restart the job
> This is not a feature atm. Is it a blocker?

In order to perform the initial data import, I suggest we run a one-off [[ https://kubernetes.io/docs/concepts/workloads/controllers/job/ | kubernetes Job ]], which will download the latest dump, extract it, and import the data into the database.

**What will this job do?**
* Create the db schema
* Download the latest dump from the provider and extract it
* Import the data into the database
* After the import is complete, the job will exit.

Job example (taken from mw-in-k8s):
```
apiVersion: batch/v1
kind: Job
metadata:
  name: setup-db-{{ template "base.name.release" . }}-{{ .Release.Revision }}
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: setup-db
          command: [ /var/config/setup.sh ]
          image: "{{ .Values.docker.registry }}/{{ .Values.main_app.image }}:{{ .Values.main_app.version }}"
          imagePullPolicy: {{ .Values.docker.pull_policy }}
{{ include "mediawiki-env" . | indent 10 }}
{{ include "mediawiki-volumeMounts" . | indent 10 }}
{{ include "mediawiki-volumes" . | indent 6 }}
```
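For this service, the job would presumably run the import script directly instead of a `setup.sh`. A minimal sketch of what that could look like, adapted from the example above; the job name, the `IMPORT_FILE` variable, and the dump path are placeholders, since the discussion above only says the script reads its input from an environment variable:
```
apiVersion: batch/v1
kind: Job
metadata:
  name: initial-import-{{ template "base.name.release" . }}-{{ .Release.Revision }}
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: initial-import
          image: "{{ .Values.docker.registry }}/{{ .Values.main_app.image }}:{{ .Values.main_app.version }}"
          imagePullPolicy: {{ .Values.docker.pull_policy }}
          # Run the existing import script instead of a setup shell script
          command: [ "node", "./import-data.js" ]
          env:
            # Placeholder name: whatever variable import-data.js actually reads
            - name: IMPORT_FILE
              value: /srv/dump/latest-feed
```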
**Requirements:**
* A job config similar to the above
* The job should be able to be restarted in case of failure, and continue where it left off
Due to the fact that:
* pods are ephemeral, and they could be killed at any given time for a myriad of reasons
* this is a large dataset (~23 million rows), and thus heavy on MySQL operations
We must have the ability to resume this data import, or continue from where it left off, in case it is interrupted (eg by an error, or because the node it was running on died).
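One caveat on the Kubernetes side: a Job can retry a failed pod (via `backoffLimit`, or `restartPolicy: OnFailure`), but every retry starts the container from scratch. Continuing from where the previous attempt left off therefore has to be implemented in the import script itself, eg by recording progress in the database and skipping rows that were already imported. A sketch of the Job fields involved, with illustrative values:
```
spec:
  backoffLimit: 4                # how many retries before the Job is marked failed
  template:
    spec:
      restartPolicy: OnFailure   # re-run the container on failure instead of Never
      # Note: each retry re-runs the command from the beginning; resuming
      # mid-import is up to the application, not the Job controller.
```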
**Notes:**
We have the ability to provide resources to match the pod's needs, so in this case we could make it possible to load the whole dump in memory, if that would help with our current challenges (this needs to be discussed within #serviceops).
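If loading the whole dump in memory does turn out to help, the Job's container can request that memory explicitly; the numbers below are placeholders and the actual sizing would be part of the #serviceops discussion:
```
          resources:
            requests:
              cpu: "1"
              memory: 8Gi    # placeholder: roughly the size of the extracted dump
            limits:
              cpu: "2"
              memory: 10Gi
```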