
Prepare for initial data import on production servers
Closed, Resolved · Public

Description

In order to perform the initial data import, I suggest we run a one-off Kubernetes Job, which will download the latest dump, extract it, and import the data into the database.

What will this job do?

  • Create the DB schema
  • Download the latest dump from the provider and extract it
  • Import the data into the database
  • Exit once the import is complete

Job example (taken from mw-in-k8s):

apiVersion: batch/v1
kind: Job
metadata:
  name: setup-db-{{ template "base.name.release" . }}-{{ .Release.Revision }}
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: setup-db
          command: [ /var/config/setup.sh ]
          image: "{{ .Values.docker.registry }}/{{ .Values.main_app.image }}:{{ .Values.main_app.version }}"
          imagePullPolicy: {{ .Values.docker.pull_policy }}
{{ include "mediawiki-env" . | indent 10 }}
{{ include "mediawiki-volumeMounts" . | indent 10 }}
{{ include "mediawiki-volumes" . | indent 6 }}

Requirements:

  • A job config similar to the above
  • The job should be able to be restarted in case of failure, and continue where it left off

Due to the fact that:

  • pods are ephemeral, and they could be killed at any given time for a myriad of reasons
  • this is a large dataset (~23 million rows), which makes it heavy on MySQL operations

We must have the ability to resume this data import, in case it is interrupted.
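A minimal sketch of what the Kubernetes side of that could look like, assuming we handle retries with backoffLimit and restartPolicy: OnFailure (the Job name and the backoffLimit value are placeholders, and the actual resume logic would have to live in the import script, e.g. by recording the last batch that was written):

apiVersion: batch/v1
kind: Job
metadata:
  name: initial-import-{{ template "base.name.release" . }}-{{ .Release.Revision }}
spec:
  # Retry the pod a few times before the Job is marked failed; the value is a placeholder.
  backoffLimit: 3
  template:
    spec:
      # Restart failed containers instead of giving up immediately, so an interrupted
      # import is retried; the script must be able to skip data it already imported.
      restartPolicy: OnFailure
      containers:
        - name: initial-import
          command: [ /var/config/setup.sh ]
          image: "{{ .Values.docker.registry }}/{{ .Values.main_app.image }}:{{ .Values.main_app.version }}"
          imagePullPolicy: {{ .Values.docker.pull_policy }}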

Notes:
We have the ability to provide resources to match the pod's needs, so in this case we could make it possible to load the whole dump in memory, if that would help with our current challenges (this needs to be discussed with serviceops).
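For illustration only, loading the whole dump in memory would roughly translate into a memory request/limit sized above the uncompressed dump, plus a tmpfs-backed volume to hold the extracted file; the sizes below are placeholders, not agreed-on values, and the snippet slots into the container/pod spec of the Job above:

          resources:
            requests:
              cpu: "2"
              memory: 8Gi        # placeholder; must exceed the uncompressed dump size
            limits:
              memory: 8Gi
          volumeMounts:
            - name: dump
              mountPath: /srv/dump
      volumes:
        - name: dump
          emptyDir:
            medium: Memory       # tmpfs; usage counts against the pod's memory limit
            sizeLimit: 8Gi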

TODO:
T341122: Implement daily data update routine uses the env variable SPUR_API_KEY, and presumably this won't match production. We'll need to update it to match.

Event Timeline

Given that we have the ability to provide as many resources as the pod needs, we can make it possible to load the whole dump in memory, if that would help with our current challenges.

@jijiki Does this only apply to the initial import, or could we do something similar for the daily updates - i.e. have the two dumps uncompressed in memory? I'm wondering about faster ways to compare the two dumps without reading from the database. We have a diffing script that works quickly, but uses a lot of storage space, and I'm wondering if it's at all worth pursuing.

@STran, answering your questions from the previous thread:

Download the latest dump on deploy1002

Can someone do this manually? Or do you want a programmatic way of doing it? For the latter, T325630: Implement call to data vendor is not done yet.

If I understand correctly, the same requirement applies for T341122: Implement daily data update routine, so I reckon it is a general requirement?

Introduce flags to the application (or have a separate application/script) to instruct it to make the data import and exit

import-data.js does this already, looking for the file from a source specified as an environment variable

Introduce a flag to specify from which file to read from

Is the environment variable alright?

It depends on the implementation; we are using env variables already, so that is sorted. Regarding the actual location of the dump file, I believe that this depends on solving T325630: Implement call to data vendor too.
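If the dump location does end up being passed through the environment, it would be something like the snippet below in the Job's container spec; DUMP_FILE and the path are hypothetical, and the real variable name is whatever import-data.js already reads:

          env:
            - name: DUMP_FILE              # hypothetical name; use what import-data.js expects
              value: /srv/dump/feed.json   # hypothetical path to the extracted dump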

Run the import as a standalone kubernetes Job (one-off)

I don't know what a Job is in Kubernetes' understanding of the word, but this could be done manually by running node ./import-data.js with the feed where the script expects it to be.

A Kubernetes Job will create a pod which will run the data import script from the iPoid image (or however you wish to package it). To my knowledge, we do not have a way to run a Node application manually in production for a one-time job.
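As a rough sketch (reusing the Helm values from the example in the description; the Job name and the direct node invocation are assumptions, not the agreed-on setup), the Job could invoke the script directly:

apiVersion: batch/v1
kind: Job
metadata:
  name: ipoid-initial-import-{{ .Release.Revision }}
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: import-data
          # Runs the existing import script once; assumes the image's working
          # directory contains import-data.js. The pod exits when it finishes.
          command: [ "node", "./import-data.js" ]
          image: "{{ .Values.docker.registry }}/{{ .Values.main_app.image }}:{{ .Values.main_app.version }}"
          imagePullPolicy: {{ .Values.docker.pull_policy }}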

Have the ability to restart the job, or continue from where it left off in case of an error (eg the node it was running died)

This is not a feature atm. Is it a blocker?

I updated the task description to note that this is potentially a blocker.

Given that we have the ability to provide as many resources as the pod needs, we can make it possible to load the whole dump in memory, if that would help with our current challenges.

@jijiki Does this only apply to the initial import, or could we do something similar for the daily updates - i.e. have the two dumps uncompressed in memory? I'm wondering about faster ways to compare the two dumps without reading from the database. We have a diffing script that works quickly, but uses a lot of storage space, and I'm wondering if it's at all worth pursuing.

Please copy/paste your question under T341122: Implement daily data update routine, and move this part of the discussion there.

STran added a subscriber: Dreamy_Jazz.

Create the db schema

Does this mean the tables don't exist yet? I know the service exists in some state on staging as of T341326: Deploy ipoid to staging on Kubernetes. Asking because @Dreamy_Jazz pointed out that we haven't been keeping up with writing update SQL files for the schema as we've been making changes. I assumed that since nothing had been imported yet, we were free to make these changes, and when we were ready to import, we'd drop and recreate. Tagging DBAs as well.

The tables don't exist yet because we (DBAs) were only asked to create the database, which was done at T305114 - the tables were discussed but we never got asked to create them.
The ipoid_rw user, though, has privileges to create tables, so you can just self-serve as needed whenever you believe they are ready to be created.

For the import itself: how much data will be inserted? How would it be rate limited to avoid affecting the rest of the services that live on that same host?

kostajh claimed this task.

We've done the initial data import. Marking this as resolved.