
'morelike' recommendation API: Bulk import data to MySQL in chunks
Closed, ResolvedPublic

Description

We have an article recommendation API that suggests articles for creation based on a seed article. For example, here you can see that articles similar to 'Book' and identified by Wikidata item IDs are missing from enwiki and thus are being suggested for creation.

The API pulls data from various places, and one of those places is MySQL. Data gets into MySQL via the article-recommender/deploy repository. Since we run the import script on a shared host and import data into a shared database, we'd like to avoid blocking other processes while importing large quantities of data. For this reason we'd like to import the data in chunks.
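
As a rough illustration of what "importing in chunks" could look like, here is a minimal sketch, not the actual deploy.py: it assumes the TSV has already been split into chunk files, and the table name, chunk directory, and connection details are placeholders borrowed from examples later in this task.

```
# Minimal sketch: load one chunk at a time and commit after each, so
# other users of the shared database get a chance to run between chunks.
# Table name, paths, and credentials are placeholders.
import glob

import mysql.connector


def import_chunks(chunk_dir, table, db_config):
    conn = mysql.connector.connect(allow_local_infile=True, **db_config)
    cursor = conn.cursor()
    sql = ("LOAD DATA LOCAL INFILE '{file}' INTO TABLE {table} "
           "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'")
    for chunk in sorted(glob.glob('{}/chunk-*.tsv'.format(chunk_dir))):
        cursor.execute(sql.format(file=chunk, table=table))
        conn.commit()  # keep each transaction, and its locks, small
    cursor.close()
    conn.close()


if __name__ == '__main__':
    import_chunks('temp/20181130-en-es', 'normalized_rank_20181130', {
        'host': 'localhost', 'port': 3306, 'user': 'usman',
        'password': 'secret', 'database': 'recommendationapi',
    })
```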

Mentors

Skills required

  • MySQL, Python

Acceptance Criteria

Event Timeline

@bmansurov Thank you for listing this project! Could you maybe add more details to the project description, also add the skills required, and then add it to our ideas page https://www.mediawiki.org/wiki/Google_Summer_of_Code/2019?

As a side note, I'm not sure what "A/C" in the task description means :)

@srishakatux thanks for the feedback. Please let me know if anything else is needed for this task to be considered ready for work.

Hey, I'm Muhammad Usman. I'm interested in this project. I have good experience working with Python and MySQL, and quite a few contributions to open source projects.

  • Are there any micro-tasks for this project?
  • Could you please fix the link to the repository, as it's currently broken?

Thanks.

@Usmanmuhd, welcome! The link's been fixed. There are no micro-tasks for this project, unless you want to split the work into meaningful parts and work on them separately. But I think the project is self-contained.

In the README file it says "Data is in stat1007:/home/bmansurov/tp9/article-recommender-deploy/". Where can I get this file?
Is there some kind of project board for this and also any recommended issues to work on?

> In the README file it says "Data is in stat1007:/home/bmansurov/tp9/article-recommender-deploy/". Where can I get this file?

I've updated the readme. Here's the file: https://analytics.wikimedia.org/datasets/one-off/article-recommender/20181130.tar.gz

> Is there some kind of project board for this and also any recommended issues to work on?

Yes, we have https://phabricator.wikimedia.org/project/view/1351/, but this is the only task ready for work. I'll be creating more tasks before the program starts.

This comment was removed by Usmanmuhd.

@Usmanmuhd on IRC you mentioned that you'd submit a patch for this task. If you started working on the task, feel free to assign it to yourself. Also feel free to ask questions here, on IRC, or via email.

I would first prefer to work on https://phabricator.wikimedia.org/T216721 as it looks simpler at first glance.

@bmansurov How do I get the chunks that need to be imported? If I need to generate the chunks myself, what's the procedure for generating them?

@Usmanmuhd I was thinking that we should split up the file into multiple files and import them one by one. But I'm curious to know if you can find a better way. Can you do some research on this? Thanks.

@bmansurov As we are using LOAD DATA, there is no way around splitting the file into chunks and then importing each file separately.

So basically the flow for import_normalized_ranks will be:

  1. Split the TSV file into chunks of 'k' rows (see the sketch after this list).
  2. Enter the number of chunks that need to be imported. Example command: python3 deploy.py import_normalized_ranks 20181130 ******** --normalized_ranks_file_chunks 20181130/en-es/ --source_language en --target_language es --number_of_chunks 10 --offset_of_chunks 2
  3. Have a clean-up command to clear all the generated chunks.
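
One possible shape for step 1 is sketched below; the file layout (chunk-<i>.tsv files in a chunk directory) and the default chunk size are assumptions for illustration, not part of the existing script.

```
# Hypothetical sketch of step 1: split a large TSV into chunk files of
# at most k rows each.  Paths and the default chunk size are made up.
import os


def split_tsv(tsv_path, out_dir, k=100000):
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths, chunk, index = [], [], 0
    with open(tsv_path) as tsv:
        for line in tsv:
            chunk.append(line)
            if len(chunk) == k:
                chunk_paths.append(write_chunk(out_dir, index, chunk))
                chunk, index = [], index + 1
        if chunk:  # leftover rows that didn't fill a whole chunk
            chunk_paths.append(write_chunk(out_dir, index, chunk))
    return chunk_paths


def write_chunk(out_dir, index, lines):
    path = os.path.join(out_dir, 'chunk-{}.tsv'.format(index))
    with open(path, 'w') as out:
        out.writelines(lines)
    return path
```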

How about we pass the combined filename to the script and have the chunking and importing done automatically behind the scenes? That way we don't have to worry about calculating the number of chunks ourselves. We should, of course, add logging and the ability to continue from where we left off in case of an error.

Yeah, that's a great idea. In case of a failure while importing a chunk (which can arise from invalid data in the chunk), how do we make sure that the values already inserted from that chunk are not inserted again?
One approach would be to validate each chunk before importing it.

Also maybe add a UNIQUE constraint?

Yeah, I think the UNIQUE constraint would be (wikidata_id, normalized_rank, source_id, target_id).
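
If we go that route, adding the key could look roughly like the sketch below; the table and key names are only examples based on the versioned naming that appears later in this task.

```
# Hypothetical sketch: add the UNIQUE key discussed above so that
# re-importing a chunk cannot insert duplicate rows.  The table and
# key names are illustrative only.
def add_unique_constraint(cursor, table='normalized_rank_20181130'):
    cursor.execute(
        "ALTER TABLE {} ADD UNIQUE KEY uniq_rank "
        "(wikidata_id, normalized_rank, source_id, target_id)".format(table))
```

With a key like that in place, one option for retried chunks would be LOAD DATA ... IGNORE INTO TABLE, which skips rows that would violate the unique key instead of failing.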

Also, do we need a feature to stop after importing k chunks and then continue from there later?

No, you don't have to stop. I think the system will schedule other processes between the steps of your loop.

@bmansurov I switched to a different system, and now when I run python3 deploy.py import_languages 20181130 localhost 3306 recommendationapi usman db_password.txt --language_file 20181130/language.tsv I get the error below. I tried various settings after searching online, but none seem to work. Basically the error is on the LOAD DATA LOCAL INFILE commands.

usman@swift~/deploy$ python3 deploy.py import_languages 20181130 localhost 3306 recommendationapi usman db_password.txt --language_file 20181130/language.tsv
Starting ...
Traceback (most recent call last):
  File "/home/usman/.local/lib/python3.6/site-packages/mysql/connector/connection_cext.py", line 472, in cmd_query
    raw_as_string=raw_as_string)
_mysql_connector.MySQLInterfaceError: The used command is not allowed with this MySQL version

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "deploy.py", line 240, in <module>
    main()
  File "deploy.py", line 216, in main
    options.version, options.language_file)
  File "deploy.py", line 143, in import_languages
    insert_languages_to_table(cursor, version, tsv)
  File "deploy.py", line 109, in insert_languages_to_table
    cursor.execute(sql.format(**data))
  File "/home/usman/.local/lib/python3.6/site-packages/mysql/connector/cursor_cext.py", line 266, in execute
    raw_as_string=self._raw_as_string)
  File "/home/usman/.local/lib/python3.6/site-packages/mysql/connector/connection_cext.py", line 475, in cmd_query
    sqlstate=exc.sqlstate)
mysql.connector.errors.ProgrammingError: 1148 (42000): The used command is not allowed with this MySQL version

@Usmanmuhd what is your OS and version? What are your MySQL and MySQL connector versions? README.org says you need to install python-mysql.connector on Debian. Try figuring out the package version in Debian and installing the same one on your system.

@bmansurov Thanks, I figured out the error. I was using Python 3 along with python-mysql.connector, which is for Python 2.

Change 527571 had a related patch set uploaded (by Usmanmuhd; owner: Usmanmuhd):
[research/article-recommender/deploy@master] Bulk import data to MySQL in chunks

https://gerrit.wikimedia.org/r/527571

@bmansurov The current workflow is:

  1. Chunks the file and stores the chunks in temp/<dir>/chunk-<i>.tsv.
  2. Imports each chunk: executes the SQL command and commits the transaction at the end of each chunk.
  3. Deletes the directory.

TODO:

  1. In case of an error that stops the import at the nth chunk, provide the ability to continue from that point after sorting out the error.

@bmansurov Added the ability to continue when the import stops at a given chunk.
Example:

python deploy/deploy.py import_normalized_ranks 20181130 localhost 3306 rapi usman db_password.txt --normalized_ranks_file 20181130/de-en.tsv --source_language de --target_language en

When this exits or is stopped with a KeyboardInterrupt, we can continue by adding the --continue_chunks flag to the same command:

python deploy/deploy.py import_normalized_ranks 20181130 localhost 3306 rapi usman db_password.txt --normalized_ranks_file 20181130/de-en.tsv --source_language de --target_language en --continue_chunks

If the flag is not added, the import starts from scratch.

@bmansurov I'm deleting the chunks once they're imported in order to keep track of which chunks have already been imported and which are yet to be imported. I don't really see the advantage of placing them in /tmp if I'm going to delete the files anyway.
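
For what it's worth, that bookkeeping could look something like the sketch below; the helper names are hypothetical, and the real implementation is in the Gerrit change above.

```
# Sketch of the bookkeeping described above: each chunk file is removed
# as soon as its transaction commits, so the files left in the chunk
# directory are exactly the chunks that still need importing, and
# --continue_chunks can simply walk whatever is left.
import glob
import os


def remaining_chunks(chunk_dir):
    return sorted(glob.glob(os.path.join(chunk_dir, 'chunk-*.tsv')))


def mark_chunk_done(connection, chunk_path):
    connection.commit()    # the chunk is safely in the database ...
    os.remove(chunk_path)  # ... so delete the file to record progress
```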

@bmansurov Made the changes as required. Please take a look.

@bmansurov Updated the patch. Please take a look.

@bmansurov How would I move from the temp table to the actual table?

Using INSERT INTO normalized_rank_20181130 SELECT * FROM temp_normalized_rank_20181130; takes quite a bit of time (35.90 sec). Is there another approach to this?

@bmansurov Updated the patch. Instead of committing after each chunk, it now commits all the chunks at the end. So if it fails at any point, rerunning the same command will import the TSV file without any problems.
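
A rough sketch of that single-transaction approach, assuming mysql.connector's default of autocommit being off; the names are illustrative, not the actual patch.

```
# Sketch of the single-transaction variant described above: every chunk
# is loaded first and one commit happens at the very end, so a failure
# part-way leaves nothing behind and rerunning the same command simply
# imports the whole TSV again.
def import_in_one_transaction(cursor, connection, chunk_paths, table):
    sql = ("LOAD DATA LOCAL INFILE '{file}' INTO TABLE {table} "
           "FIELDS TERMINATED BY '\\t'")
    try:
        for chunk in chunk_paths:
            cursor.execute(sql.format(file=chunk, table=table))
        connection.commit()    # single commit for all chunks
    except Exception:
        connection.rollback()  # no partial import is left behind
        raise
```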

Change 527571 merged by Bmansurov:
[research/article-recommender/deploy@master] Bulk import data to MySQL in chunks

https://gerrit.wikimedia.org/r/527571