In order to perform the initial data import, I suggest we run a one-off Kubernetes Job, which will download the latest dump, extract it, and import the data into the database.
What will this job do?
- Create the db schema
- Download the latest dump from the provider and extract it
- Import the data; once the import is complete, the job will exit.
Job example (taken from mw-in-k8s):
apiVersion: batch/v1
kind: Job
metadata:
  name: setup-db-{{ template "base.name.release" . }}-{{ .Release.Revision }}
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: setup-db
          command: [ /var/config/setup.sh ]
          image: "{{ .Values.docker.registry }}/{{ .Values.main_app.image }}:{{ .Values.main_app.version }}"
          imagePullPolicy: {{ .Values.docker.pull_policy }}
{{ include "mediawiki-env" . | indent 10 }}
{{ include "mediawiki-volumeMounts" . | indent 10 }}
{{ include "mediawiki-volumes" . | indent 6 }}
Requirements:
- A job config similar to the above
- The job must be restartable in case of failure and able to continue where it left off
Because:
- pods are ephemeral and can be killed at any given time for a myriad of reasons
- this is a large dataset (~23 million rows), and therefore heavy on MySQL operations
we must be able to resume this data import if it is interrupted (see the sketch below).
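To make that requirement concrete, here is a minimal sketch of how the Job spec could support both retries and resumable state. This is not taken from the mw-in-k8s chart: the PersistentVolumeClaim name, mount path, and image are placeholders, and the actual resume logic (tracking the last imported chunk) would live in the import script.

apiVersion: batch/v1
kind: Job
metadata:
  name: setup-db
spec:
  # Let Kubernetes recreate the pod a few times if the import fails or the node dies.
  backoffLimit: 5
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: setup-db
          command: [ /var/config/setup.sh ]
          image: "example-registry/data-import:latest"  # placeholder image
          volumeMounts:
            # Persistent scratch space: the downloaded dump and a progress marker
            # (e.g. last imported chunk) survive pod restarts, so the script can
            # pick up where it left off instead of starting over.
            - name: import-state
              mountPath: /srv/import
      volumes:
        - name: import-state
          persistentVolumeClaim:
            claimName: data-import-state  # assumed to exist; name is hypothetical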
Notes:
We have the ability to provision resources to match the pod's needs, so in this case we could make it possible to load the whole dump into memory, if that would help with our current challenges (needs to be discussed with serviceops).
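As a rough illustration only, the fragment below shows what that could look like inside the Job's pod template spec: a memory-backed emptyDir holding the extracted dump, with resource requests sized to cover it. The numbers and names are placeholders to be sized together with serviceops, not actual values.

containers:
  - name: setup-db
    resources:
      requests:
        memory: "8Gi"
        cpu: "2"
      limits:
        # The tmpfs contents count toward the pod's memory limit, so the limit
        # must cover the dump size plus the import process itself.
        memory: "8Gi"
    volumeMounts:
      - name: dump-in-memory
        mountPath: /srv/dump
volumes:
  # medium: Memory backs the volume with tmpfs, keeping the whole dump in RAM.
  - name: dump-in-memory
    emptyDir:
      medium: Memory
      sizeLimit: "8Gi"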
TODO:
T341122 (Implement daily data update routine) uses the env variable SPUR_API_KEY, and presumably this won't match production; we'll need to update it to match.
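One possible way to keep SPUR_API_KEY consistent across environments (an assumption, not what T341122 currently does) is to source it from a per-environment Secret rather than a hard-coded value; the Secret name and key below are hypothetical.

env:
  - name: SPUR_API_KEY
    valueFrom:
      secretKeyRef:
        name: spur-credentials  # hypothetical per-environment Secret
        key: api-key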