
Prepare for initial data import on production servers
Closed, Resolved · Public

Description

In order to perform the initial data import, I suggest we run a one-off Kubernetes Job, which will download the latest dump, extract it, and import the data into the database.

What will this job do?

  • Create the DB schema
  • Download the latest dump from the provider and extract it
  • Import the data into the database
  • Exit once the import is complete

Job example (taken from mw-in-k8s):

apiVersion: batch/v1
kind: Job
metadata:
  name: setup-db-{{ template "base.name.release" . }}-{{ .Release.Revision }}
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: setup-db
          command: [ /var/config/setup.sh ]
          image: "{{ .Values.docker.registry }}/{{ .Values.main_app.image }}:{{ .Values.main_app.version }}"
          imagePullPolicy: {{ .Values.docker.pull_policy }}
{{ include "mediawiki-env" . | indent 10 }}
{{ include "mediawiki-volumeMounts" . | indent 10 }}
{{ include "mediawiki-volumes" . | indent 6 }}

Requirements:

  • A job config similar to the above
  • The job should be able to be restarted in case of failure, and continue where it left off

Due to the fact that:

  • pods are ephemeral, and they could be killed at any given time for a myriad of reasons
  • this is a large dataset (~23 million rows), which makes it heavy on MySQL operations

We must have the ability to resume this data import, in case it is interrupted.
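A minimal sketch of what the Kubernetes side of that could look like, assuming we handle retries with backoffLimit and restartPolicy: OnFailure (the Job name and the backoffLimit value are placeholders, and the actual resume logic would have to live in the import script, e.g. by recording the last batch that was written):

apiVersion: batch/v1
kind: Job
metadata:
  name: initial-import-{{ template "base.name.release" . }}-{{ .Release.Revision }}
spec:
  # Retry the pod a few times before the Job is marked failed; the value is a placeholder.
  backoffLimit: 3
  template:
    spec:
      # Restart failed containers instead of giving up immediately, so an interrupted
      # import is retried; the script must be able to skip data it already imported.
      restartPolicy: OnFailure
      containers:
        - name: initial-import
          command: [ /var/config/setup.sh ]
          image: "{{ .Values.docker.registry }}/{{ .Values.main_app.image }}:{{ .Values.main_app.version }}"
          imagePullPolicy: {{ .Values.docker.pull_policy }}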

Notes:
We have the ability to provide resources to match the pod's needs, so in this case we could make it possible to load the whole dump in memory, if that would help with our current challenges (this needs to be discussed with serviceops).
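For illustration only, loading the whole dump in memory would roughly translate into a memory request/limit sized above the uncompressed dump, plus a tmpfs-backed volume to hold the extracted file; the sizes below are placeholders, not agreed-on values, and the snippet slots into the container/pod spec of the Job above:

          resources:
            requests:
              cpu: "2"
              memory: 8Gi        # placeholder; must exceed the uncompressed dump size
            limits:
              memory: 8Gi
          volumeMounts:
            - name: dump
              mountPath: /srv/dump
      volumes:
        - name: dump
          emptyDir:
            medium: Memory       # tmpfs; usage counts against the pod's memory limit
            sizeLimit: 8Gi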

TODO:
T341122: Implement daily data update routine uses the env variable SPUR_API_KEY, and presumably this won't match production. We'll need to update it to match.

Event Timeline

Given that we have the ability to provide as many resources as the pod needs, we can make it possible to load the whole dump in memory, if that would help with our current challenges.

@jijiki Does this only apply to the initial import, or could we do something similar for the daily updates - i.e. have the two dumps uncompressed in memory? I'm wondering about faster ways to compare the two dumps without reading from the database. We have a diffing script that works quickly, but uses a lot of storage space, and I'm wondering if it's at all worth pursuing.

@STran, answering your questions from the previous thread:

Download the latest dump on deploy1002

Can someone do this manually? Or do you want a programmatic way of doing it? For the latter, T325630: Implement call to data vendor is not done yet.

If I understand correctly, the same requirement applies for T341122: Implement daily data update routine, so I reckon it is a general requirement?

Introduce flags to the application (or have a separate application/script) to instruct it to make the data import and exit

import-data.js does this already, looking for the file from a source specified as an environment variable

Introduce a flag to specify from which file to read from

Is the environment variable alright?

It depends on the implementation; we are using env variables already, so that is sorted. Regarding the actual location of the dump file, I believe that this depends on solving T325630: Implement call to data vendor too.
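If the dump location does end up being passed through the environment, it would be something like the snippet below in the Job's container spec; DUMP_FILE and the path are hypothetical, and the real variable name is whatever import-data.js already reads:

          env:
            - name: DUMP_FILE              # hypothetical name; use what import-data.js expects
              value: /srv/dump/feed.json   # hypothetical path to the extracted dump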

Run the import as a standalone kubernetes Job (one-off)

I don't know what a Job is in Kubernetes' understanding of the word, but this could be done manually by running node ./import-data.js with the feed where the script expects it to be.

A Kubernetes Job will create a pod which will run the data import script from the iPoid image (or however you wish to package it). To my knowledge, we do not have a way to run a Node application manually in production for a one-time job.
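As a rough sketch (reusing the Helm values from the example in the description; the Job name and the direct node invocation are assumptions, not the agreed-on setup), the Job could invoke the script directly:

apiVersion: batch/v1
kind: Job
metadata:
  name: ipoid-initial-import-{{ .Release.Revision }}
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: import-data
          # Runs the existing import script once; assumes the image's working
          # directory contains import-data.js. The pod exits when it finishes.
          command: [ "node", "./import-data.js" ]
          image: "{{ .Values.docker.registry }}/{{ .Values.main_app.image }}:{{ .Values.main_app.version }}"
          imagePullPolicy: {{ .Values.docker.pull_policy }}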

Have the ability to restart the job, or continue from where it left off in case of an error (eg the node it was running died)

This is not a feature atm. Is it a blocker?

I updated the task description to note that this is potentially a blocker.

Given that we have the ability to provide as many resources as the pod needs, we can make it possible to load the whole dump in memory, if that would help with our current challenges.

@jijiki Does this only apply to the initial import, or could we do something similar for the daily updates - i.e. have the two dumps uncompressed in memory? I'm wondering about faster ways to compare the two dumps without reading from the database. We have a diffing script that works quickly, but uses a lot of storage space, and I'm wondering if it's at all worth pursuing.

Please copy/paste your question under T341122: Implement daily data update routine, and move this part of the discussion there.

STran added a subscriber: Dreamy_Jazz.

Create the db schema

Does this mean the tables don't exist yet? I know the service exists in some state on staging as of T341326: Deploy ipoid to staging on Kubernetes. Asking because @Dreamy_Jazz pointed out that we haven't been keeping up with writing update SQL files for the schema as we've been making changes. I assumed that since nothing had been imported yet, we were free to make these changes, and when we were ready to import, we'd drop and recreate. Tagging DBAs as well.

The tables don't exist yet because we (DBAs) were only asked to create the database, which was done at T305114 - the tables were discussed but we never got asked to create them.
The ipoid_rw user, though, has privileges to create tables, so you can just self-serve as needed whenever you believe they are ready to be created.

For the import itself: how much data will be inserted? How would it be rate limited to avoid affecting the rest of the services that live on that same host?

kostajh claimed this task.

We've done the initial data import. Marking this as resolved.