
Add Link engineering: Pipeline for moving MySQL database(s) from stats1008 to production MySQL server
Closed, ResolvedPublic

Description

In our work for the Add Link structured task project (https://wikitech.wikimedia.org/wiki/Add_Link), we will be moving DB tables from the staging database on the stats1008 server to a production MySQL instance on the misc cluster.

We need help with what process to use in moving the data from the stats1008 machine to that production instance.

  1. The https://gerrit.wikimedia.org/r/plugins/gitiles/research/mwaddlink/ repository contains a script for generating datasets (run-pipeline.sh). The script creates SQLite files, imports them as MySQL tables into the staging database on stats1008, and generates checksums for both sets of files. The files are then copied to a location on stats1008 from which they are published to datasets.wikimedia.org.
  2. The refreshLinkRecommendations.php script, which runs hourly on mwmaint, should query datasets.wikimedia.org and compare the latest checksums there with what is stored in the local wiki database; if they do not match, it should download the MySQL tables and import them into the addlink database on the misc cluster (sketched below). It should then proceed with the rest of the refreshLinkRecommendations.php work of refreshing the data in the local wiki cache tables. This is tracked in T272419.
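
For illustration, the check in step 2 could work roughly like the sketch below. This is a sketch only: the real script is a PHP MediaWiki maintenance script, and the URL layout, file names and database name here are assumptions.

  import gzip
  import subprocess

  import requests

  DATASETS_BASE = "https://datasets.wikimedia.org/research-mwaddlink"  # hypothetical layout

  def import_dump(dump_bytes, db_name="addlink"):
      # Pipe the decompressed SQL into the mysql client; connection options are omitted.
      subprocess.run(["mysql", db_name], input=gzip.decompress(dump_bytes), check=True)

  def refresh_if_stale(wiki_id, stored_checksum):
      # Only download and re-import the tables when the published checksum changed.
      published = requests.get(f"{DATASETS_BASE}/{wiki_id}/checksums.txt", timeout=30)
      published.raise_for_status()
      latest_checksum = published.text.strip()
      if latest_checksum == stored_checksum:
          return stored_checksum  # nothing new to import
      dump = requests.get(f"{DATASETS_BASE}/{wiki_id}/tables.sql.gz", timeout=300)
      dump.raise_for_status()
      import_dump(dump.content)
      return latest_checksum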

Event Timeline

kostajh added a project: DBA.

@Ottomata, @MMiller_WMF was telling me you might have some ideas on how to solve this problem. Thanks in advance for any help you can provide!

We need a database on stats1008 that we can write data to. (@Marostegui, would you prefer one database per language wiki, or would you rather have the datasets for each language wiki contained in a single database, using a naming convention like {language}_anchors and {language}_nav2vec, etc.?)

Not sure if this is what you meant, but the database we discussed in our meeting won't be on stats1008; it will be on one of our misc clusters (yet to be defined), which we call m1, m2, m3 and m5. Unless you need a database on stats1008 to do all your processing there and then insert the data into the production databases.
Regarding the naming convention, let's use one database and put everything in there.
I am going to remove the DBA tag for now, as this is mostly a discussion. I will stay on the task in case I am needed.
Once you are ready to request the database, please add the DBA tag again and include the following information, as mentioned in T265610#6588138:

QPS:
Size: (An estimation is fine)
DB Name:
User:
Accessed from server (s):
Backup Policy: Needed? Frequency?
Grants needed:

Thank you!

Not sure if this is what you meant, but the database we discussed in our meeting won't be on stats1008; it will be on one of our misc clusters (yet to be defined), which we call m1, m2, m3 and m5. Unless you need a database on stats1008 to do all your processing there and then insert the data into the production databases.

Yeah, that's what I was thinking of @Marostegui, sorry for not wording it clearly. I was thinking of the following setup:

  1. A database on stats1008 that we can use for testing out the research/mwaddlink tool. That's where the models are trained, and we can run queries against the generated datasets there. (@razzi @Ottomata could you please help us with setting up a database on stats1008 or point us to documentation for doing it ourselves? I don't see info on that in https://wikitech.wikimedia.org/wiki/Analytics/Systems/Clients)
  2. A database on one of the misc clusters that is read-only for the link recommendation service running in kubernetes
  3. When we are happy with the generated datasets on the database in stats1008, we would trigger a process to export the data from the stats1008 database and import it into one of the misc database clusters (this is what we need help with from @razzi and @Ottomata)

In the context of the above, I wanted to clarify whether we should have a single database for all wikis or one database per wiki, to ease the export/import process; and to confirm that we will have a single database on stats1008 for testing/development and a single database on one of the misc clusters.
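
To make the export/import in (3) concrete: at its simplest it could be a mysqldump from the staging database piped into the misc-cluster database, roughly as sketched below (the option files, database names and table names are placeholders).

  import subprocess

  # Dump the addlink tables from the staging DB on stats1008 and load them into the
  # database on the misc cluster. Option files, DB names and table names are placeholders.
  tables = ["addlink_en_anchors", "addlink_en_nav2vec"]

  dump = subprocess.run(
      ["mysqldump", "--defaults-file=staging.cnf", "staging", *tables],
      check=True, capture_output=True,
  )
  subprocess.run(
      ["mysql", "--defaults-file=misc.cnf", "addlink"],
      input=dump.stdout, check=True,
  )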

Thanks for the clarification!
Just one note:

  1. A database on one of the misc clusters that is read-only for the link recommendation service running in kubernetes

This database is writable, as it is shared with other projects.

Hiya!

I have a lot of questions; perhaps we should set up a meeting. We don't usually just 'set up a database' in the analytics cluster for individual projects or for deploying production features, but I do recognize that there aren't a lot of options here because of the GPUs.

So ya, can you set up a meeting with me and @razzi sometime next week? TTYT! :)

CCing @elukey as well.

Ah, Kosta is in the EU, which makes a meeting with both me and Razzi difficult to schedule! We can just do me and Kosta, and I'll sync up here with Razzi separately.

@Marostegui @Joe, @Ottomata and I met yesterday (notes). It doesn't sound like there is an existing workflow for pushing database contents from stats machines to the misc cluster.

As far as I understand the infrastructure we have, this is what I think we need to do:

  • The research/mwaddlink tool is going to write its datasets to tables on the staging database on stats1008. Those table names will be in the format addlink_{language}_{dataset_name}, e.g. addlink_en_anchors for the "anchors" dataset for enwiki.
  • When we are happy with the dataset quality for a particular language, we generate a MySQL dump of the tables (5 tables)
  • Then we will have a script that pushes those files to Swift
  • Then (this part I am unclear on) we either have some service that subscribes to SwiftUploadComplete which imports those tables into the production database on the misc cluster, or we (Growth team engineers) have access to manually import the tables.

How does this sound to you, @Marostegui & @Joe?
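
For reference, the "push those files to Swift" step could look roughly like this; a sketch only, where the auth endpoint, credentials, container name and file layout are all placeholders and python-swiftclient is just one possible client.

  from pathlib import Path

  from swiftclient.client import Connection

  def upload_dumps(language, dump_dir):
      # Upload the compressed per-table dumps for one language to a Swift container.
      # Auth endpoint, credentials and container name are placeholders.
      conn = Connection(authurl="https://swift.example/auth/v1.0",
                        user="addlink", key="SECRET")
      conn.put_container("addlink-datasets")
      for dump_file in sorted(Path(dump_dir).glob(f"addlink_{language}_*.sql.gz")):
          with dump_file.open("rb") as fh:
              conn.put_object("addlink-datasets",
                              f"{language}/{dump_file.name}",
                              contents=fh,
                              content_type="application/gzip")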

I am not familiar with those steps; those are probably more for @Ottomata or @Joe.
I can advise on the MySQL production side; for that we need the information listed in T266826#6590618.
Also, as discussed during the meeting, let's make sure we throttle the writes to the production databases once we are ready to schedule that process.

@Marostegui @Joe, @Ottomata and I met yesterday (notes). It doesn't sound like there is an existing workflow for pushing database contents from stats machines to the misc cluster.

As far as I understand the infrastructure we have, this is what I think we need to do:

  • The research/mwaddlink tool is going to write its datasets to tables on the staging database on stats1008. Those table names will be in the format addlink_{language}_{dataset_name}, e.g. addlink_en_anchors for the "anchors" dataset for enwiki.
  • When we are happy with the dataset quality for a particular language, we generate a MySQL dump of the tables (5 tables)
  • Then we will have a script that pushes those files to Swift
  • Then (this part I am unclear on) we either have some service that subscribes to SwiftUploadComplete which imports those tables into the production database on the misc cluster, or we (Growth team engineers) have access to manually import the tables.

How does this sound to you, @Marostegui & @Joe?

I think the fundamental model is sound. We can think of implementing it in increasing levels of effort/good engineering. You might want to skip some steps ofc.

I think you should add to your repository an export script that can take the content from a db, prepare a compressed sql dump, upload it to swift (linkctl --export <source_dsn> <swift-url>)

Once the data is exported, this script should be able to signal somehow that the data is available (using kafka I guess)?

Now for actually loading the data in production we can follow two paths:

1 - (quicker) you include import functionality in the linkctl script above that can be run manually, and we'll set you up to be able to run it manually
2 - (better) you implement a small daemon that listens for the signal, and does the import for you. If it's in the same repository as the rest of the code, we can run it in k8s with the same docker image as the main link service.

OFC on the long term I'd prefer option 2 :)
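
A minimal sketch of what the proposed linkctl interface could look like; the argument handling below is an assumption and the export/import bodies are left as stubs.

  import argparse

  def main():
      parser = argparse.ArgumentParser(prog="linkctl")
      parser.add_argument("--export", nargs=2, metavar=("SOURCE_DSN", "SWIFT_URL"),
                          help="dump the tables from SOURCE_DSN, compress and upload to SWIFT_URL")
      parser.add_argument("--import", dest="do_import", nargs=2,
                          metavar=("SWIFT_URL", "TARGET_DSN"),
                          help="fetch the dump from SWIFT_URL and load it into TARGET_DSN")
      args = parser.parse_args()

      if args.export:
          source_dsn, swift_url = args.export
          ...  # dump, compress and upload (see the Swift sketch earlier in the task)
      elif args.do_import:
          swift_url, target_dsn = args.do_import
          ...  # download, decompress and load into the target database

  if __name__ == "__main__":
      main()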

I think the fundamental model is sound. We can think of implementing it in increasing levels of effort/good engineering. You might want to skip some steps ofc.

I think you should add to your repository an export script that can take the content from a db, prepare a compressed sql dump, upload it to swift (linkctl --export <source_dsn> <swift-url>)

Got it, that sounds OK. There is already this script that @Ottomata pointed me to, so that is a good starting point.

Once the data is exported, this script should be able to signal somehow that the data is available (using kafka I guess)?

Would this be different than the SwiftUploadComplete event that @Ottomata pointed me to? (@Ottomata as an aside, I can't find any documentation on this in Wikitech or via codesearch, is there something you could point me to please?)

Now for actually loading the data in production we can follow two paths:

1 - (quicker) you include import functionality in the linkctl script above that can be run manually, and we'll set you up to be able to run it manually
2 - (better) you implement a small daemon that listens for the signal, and does the import for you. If it's in the same repository as the rest of the code, we can run it in k8s with the same docker image as the main link service.

Option 2 sounds fine; is there any prior art to reference?

As a slight variation on option 2, what about:

  1. add the linkctl --export command as you proposed (dump tables, compress them, use the script in analytics/refinery repo for uploading the directory to swift)
  2. issue a request to the link recommendation service using a secret key that tells the service data is ready to be imported
  3. the kubernetes app responds to the request and looks in the relevant directories (are these mounted in the container?), decompresses the files and imports the data to the MySQL instance
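
A minimal sketch of steps 2 and 3, assuming the secret is passed as a request header; the route, header name, environment variable and import_datasets() helper are all made up for illustration and are not taken from the actual service.

  import hmac
  import os

  from flask import Flask, abort, request

  app = Flask(__name__)

  def import_datasets(wiki_id):
      """Placeholder: fetch the published dumps for wiki_id and load them into MySQL."""

  @app.route("/import-datasets", methods=["POST"])
  def import_datasets_endpoint():
      # Step 2: only act when the caller supplies the shared secret key.
      supplied = request.headers.get("X-Import-Key", "")
      expected = os.environ.get("IMPORT_SECRET_KEY", "")
      if not expected or not hmac.compare_digest(supplied, expected):
          abort(403)
      payload = request.get_json(silent=True) or {}
      wiki_id = payload.get("wiki_id", "")
      import_datasets(wiki_id)  # step 3: decompress and import into MySQL
      return {"status": "import triggered", "wiki_id": wiki_id}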

Would this be different than the SwiftUploadComplete event that @Ottomata pointed me to? (@Ottomata as an aside, I can't find any documentation on this in Wikitech or via codesearch, is there something you could point me to please?)

Here's the schema: https://schema.wikimedia.org/repositories//secondary/jsonschema/swift/upload/complete/latest
I don't have any specific documentation about this schema on wikitech, but there are lots of more generic Event Platform docs. The instrumentation how-to is focused on frontend browser clients via EventLogging, but the rest of the details about schemas and stream config apply here.

issue a request to the link recommendation service using a secret key that tells the service data is ready to be imported

This is probably simpler for your use case than using an event for notification.

Would this be different than the SwiftUploadComplete event that @Ottomata pointed me to? (@Ottomata as an aside, I can't find any documentation on this in Wikitech or via codesearch, is there something you could point me to please?)

No, not at all. The idea was you'd submit a swift/upload/complete message on eventgate (on a specific topic) and that the application would just constantly listen for it.
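
A rough sketch of such a listener, assuming the swift/upload/complete events end up on a Kafka topic the daemon can consume; the topic name, broker and filtering logic are placeholders, and the actual field names should be checked against the schema linked above.

  import json

  from kafka import KafkaConsumer

  # Topic name and broker are placeholders for wherever the swift/upload/complete
  # events produced via EventGate end up.
  consumer = KafkaConsumer(
      "eqiad.swift.upload-complete",
      bootstrap_servers=["kafka.example:9092"],
      value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
  )

  for message in consumer:
      event = message.value
      # Event Platform events carry a $schema field; other fields should be checked
      # against the schema linked above before relying on them.
      if event.get("$schema", "").startswith("/swift/upload/complete"):
          print("Dataset upload complete, triggering import:", event)
          # ...fetch the dump from Swift and import it into the misc-cluster database...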

As a slight variation on option 2, what about:

  1. add the linkctl --export command as you proposed (dump tables, compress them, use the script in analytics/refinery repo for uploading the directory to swift)
  2. issue a request to the link recommendation service using a secret key that tells the service data is ready to be imported
  3. the kubernetes app responds to the request and looks in the relevant directories (are these mounted in the container?), decompresses the files and imports the data to the MySQL instance

The reason I wanted to keep the two systems separate is that this way, importing the data cannot affect the service while it is running. As for the files, you will need to send the Swift URL to your service, which would then need to fetch them from Swift to perform the import.

@Tgr pointed out that making the artifacts (datasets + model) from the training pipeline publicly available would be nice so that people without access to WMF-only systems could run the link recommendation service themselves for development / testing. It would also simplify the process of downloading the artifacts from within Vagrant provisioning.

Even so, I guess we still want to use Swift and its upload/complete event so that we can implement the model @Joe has proposed with a daemon listening for that event.

2 - (better) you implement a small daemon that listens for the signal, and does the import for you. If it's in the same repository as the rest of the code, we can run it in k8s with the same docker image as the main link service.

@Joe I could use a bit of guidance on this. I understand what you are proposing but not how that would get implemented with our infrastructure. Would this be a second container that is run in response to the swift/upload/complete event with an entrypoint that uses e.g. python import-data.py instead of the existing Flask app entrypoint?

making the artifacts (datasets + model) from the training pipeline publicly available would be nice
...
Even so, I guess we still want to use Swift and its upload/complete event

Hm, if you are making this public, I suppose you could use either https://analytics.wikimedia.org/published (https://wikitech.wikimedia.org/wiki/Analytics/Web_publication) or https://dumps.wikimedia.org/other/ (https://wikitech.wikimedia.org/wiki/Dumps) and then just download via HTTP. Or, if Swift is still the better option, there is probably a way to make the object URLs in Swift public; IIUC that is how image content is served.

making the artifacts (datasets + model) from the training pipeline publicly available would be nice
...
Even so, I guess we still want to use Swift and its upload/complete event

Hm, if you are making this public, I suppose you could use either https://analytics.wikimedia.org/published (https://wikitech.wikimedia.org/wiki/Analytics/Web_publication) or https://dumps.wikimedia.org/other/ (https://wikitech.wikimedia.org/wiki/Dumps) and then just download via HTTP. Or, if Swift is still the better option, there is probably a way to make the object URLs in Swift public; IIUC that is how image content is served.

Before we proceed with this, @MGerlach is going to confirm that the datasets are OK to make public.

Before we proceed with this, @MGerlach is going to confirm that the datasets are OK to make public.

One table contains features generated by the algorithm from the navigation vectors, which use reading sessions (sensitive data) as input; however, that output was published previously. We are currently reviewing whether we can publish this data (and if so, in what form). An alternative would be to remove this feature from the model. Hopefully I will have updates on this next week (it is a slow week due to the US holidays).

Before we proceed with this, @MGerlach is going to confirm that the datasets are OK to make public.

One table contains features generated by the algorithm from the navigation vectors, which use reading sessions (sensitive data) as input; however, that output was published previously. We are currently reviewing whether we can publish this data (and if so, in what form). An alternative would be to remove this feature from the model. Hopefully I will have updates on this next week (it is a slow week due to the US holidays).

OK, cool. Another use case would be to load SQLite files for a smaller wiki (probably simplewiki) for running integration tests in CI.
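
For example, a CI integration test along these lines could work; the file name and table name below are placeholders for whatever the pipeline actually publishes.

  import sqlite3

  def test_simplewiki_anchors_dataset():
      # Sanity-check a published SQLite dataset for a small wiki.
      conn = sqlite3.connect("simplewiki.anchors.sqlite")
      try:
          (count,) = conn.execute("SELECT COUNT(*) FROM anchors").fetchone()
          assert count > 0, "anchors dataset should not be empty"
      finally:
          conn.close()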

Change 649452 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] Set env variables with fallbacks

https://gerrit.wikimedia.org/r/649452

Change 649452 merged by jenkins-bot:
[research/mwaddlink@main] Set env variables with fallbacks

https://gerrit.wikimedia.org/r/649452

Change 654930 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] training pipeline: Export MySQL dumps and publish on web

https://gerrit.wikimedia.org/r/654930

Change 655056 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] training-pipeline: Generate and publish checksums

https://gerrit.wikimedia.org/r/655056

Change 655084 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] [WIP] Add load-tables.py script

https://gerrit.wikimedia.org/r/655084

There are notes in T265893 (specifically T265893#6687983) about the import process.

Change 655084 abandoned by Kosta Harlan:
[research/mwaddlink@main] [WIP] Add load-tables.py script

Reason:
Might come back to this later, but the main mechanism for importing should happen via refreshLinkRecommendations.php (T265893)

https://gerrit.wikimedia.org/r/655084

kostajh updated the task description. (Show Details)

Change 655056 abandoned by Kosta Harlan:
[research/mwaddlink@main] training-pipeline: Generate and publish checksums

Reason:
folded into I90ca18eb22d2f5e22d238de96681c93fb0e17d70

https://gerrit.wikimedia.org/r/655056

Change 657901 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] [WIP] Add load-datasets script

https://gerrit.wikimedia.org/r/657901

Change 654930 merged by jenkins-bot:
[research/mwaddlink@main] training pipeline: Export MySQL dumps and publish on web

https://gerrit.wikimedia.org/r/654930

Change 657901 merged by jenkins-bot:
[research/mwaddlink@main] Add load-datasets script

https://gerrit.wikimedia.org/r/657901

Change 659979 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] [WIP] Allow downloading datasets from datasets.wikimedia.org

https://gerrit.wikimedia.org/r/659979

Change 660334 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] load-datasets: Allow download/import of all published datasets

https://gerrit.wikimedia.org/r/660334

Change 662091 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/vagrant@master] [WIP] addlink: Download mwaddlink dataset from the web

https://gerrit.wikimedia.org/r/662091

Change 659979 merged by jenkins-bot:
[research/mwaddlink@main] Provide ability to download datasets from remote store

https://gerrit.wikimedia.org/r/659979

Change 663259 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[research/mwaddlink@main] Make sure the standard tables exist

https://gerrit.wikimedia.org/r/663259

Change 663259 abandoned by Gergő Tisza:
[research/mwaddlink@main] Make sure the standard tables exist

Reason:
I forgot this is done as part of I6868267c31fabcea.

https://gerrit.wikimedia.org/r/663259

Change 660394 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[operations/deployment-charts@master] linkrecommendation: Cron job to load datasets

https://gerrit.wikimedia.org/r/660394

Change 660334 merged by jenkins-bot:
[research/mwaddlink@main] load-datasets: Allow download/import of all published datasets

https://gerrit.wikimedia.org/r/660334

Change 660394 merged by jenkins-bot:
[operations/deployment-charts@master] linkrecommendation: Cron job to load datasets

https://gerrit.wikimedia.org/r/660394

Change 664277 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] linkrecommendation: Read DB_USER from public config

https://gerrit.wikimedia.org/r/664277

Change 664277 merged by jenkins-bot:
[operations/deployment-charts@master] linkrecommendation: Read DB_USER from public config

https://gerrit.wikimedia.org/r/664277

Change 662091 merged by jenkins-bot:
[mediawiki/vagrant@master] addlink: Download mwaddlink dataset from the web

https://gerrit.wikimedia.org/r/662091

kostajh moved this task from In Progress to QA on the Growth-Team (Sprint 0 (Growth Team)) board.

The pipeline is set up. On stat1008, you can run WIKI_ID=arwiki ./run-pipeline.sh to generate the datasets; WIKI_ID=arwiki ./publish-datasets.sh then publishes them to https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/arwiki. The cron job running in the kubernetes deployment for linkrecommendation runs a script that checks whether the published datasets are new or updated compared to what the service has stored, and if so, downloads and imports them.

Change 665057 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] Use random query string to bypass cache

https://gerrit.wikimedia.org/r/665057

Change 665057 merged by jenkins-bot:
[research/mwaddlink@main] Use random query string to bypass cache

https://gerrit.wikimedia.org/r/665057