
'morelike' recommendation API: Bulk import data to MySQL in chunks
Closed, ResolvedPublic

Description

We have an article recommendation API that suggests articles for creation based on a seed article. For example, here you can see that articles similar to 'Book' and identified by Wikidata item IDs are missing from enwiki and thus are being suggested for creation.

The API pulls data from various places, and one of those places is MySQL. Data gets into MySQL via the article-recommender/deploy repository. Since we run the import script on a shared host and import data into a shared database, we'd like to avoid blocking other processes while importing large quantities of data. For this reason we'd like to import the data in chunks.
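
As a rough illustration of what "importing in chunks" could look like, here is a minimal sketch, not the actual deploy.py: it assumes the TSV has already been split into chunk files, and the table name, chunk directory, and connection details are placeholders borrowed from examples later in this task.

```
# Minimal sketch: load one chunk at a time and commit after each, so
# other users of the shared database get a chance to run between chunks.
# Table name, paths, and credentials are placeholders.
import glob

import mysql.connector


def import_chunks(chunk_dir, table, db_config):
    conn = mysql.connector.connect(allow_local_infile=True, **db_config)
    cursor = conn.cursor()
    sql = ("LOAD DATA LOCAL INFILE '{file}' INTO TABLE {table} "
           "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'")
    for chunk in sorted(glob.glob('{}/chunk-*.tsv'.format(chunk_dir))):
        cursor.execute(sql.format(file=chunk, table=table))
        conn.commit()  # keep each transaction, and its locks, small
    cursor.close()
    conn.close()


if __name__ == '__main__':
    import_chunks('temp/20181130-en-es', 'normalized_rank_20181130', {
        'host': 'localhost', 'port': 3306, 'user': 'usman',
        'password': 'secret', 'database': 'recommendationapi',
    })
```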

Mentors

Skills required

  • MySQL, Python

Acceptance Criteria

Event Timeline

@bmansurov Thank you for listing this project! Could you maybe add more details to the project description, also add the skills required, and then add it to our ideas page https://www.mediawiki.org/wiki/Google_Summer_of_Code/2019?

As a side note, I'm not sure what "A/C" in the task description means :)

@srishakatux thanks for the feedback. Please let me know if anything else is needed for this task to be considered ready for work.

Hey, I'm Muhammad Usman. I'm interested in this project. I have good experience working with Python and MySQL, and quite a few contributions to open source projects.

  • Are there any micro-tasks for this project?
  • Could you please fix the link to the repository, as it's currently broken?

Thanks.

@Usmanmuhd, welcome! The link's been fixed. There are no micro-tasks for this project, unless you want to split the work into meaningful parts and work on them separately. But I think the project is self-contained.

In the README file it says "Data is in stat1007:/home/bmansurov/tp9/article-recommender-deploy/". Where can I get this file?
Is there some kind of project board for this and also any recommended issues to work on?

> In the README file it says "Data is in stat1007:/home/bmansurov/tp9/article-recommender-deploy/". Where can I get this file?

I've updated the readme. Here's the file: https://analytics.wikimedia.org/datasets/one-off/article-recommender/20181130.tar.gz

> Is there some kind of project board for this and also any recommended issues to work on?

Yes, we have https://phabricator.wikimedia.org/project/view/1351/, but this is the only task ready for work. I'll be creating more tasks before the program starts.

This comment was removed by Usmanmuhd.

@Usmanmuhd on IRC you mentioned that you'd submit a patch for this task. If you started working on the task, feel free to assign it to yourself. Also feel free to ask questions here, on IRC, or via email.

I would first prefer to work on https://phabricator.wikimedia.org/T216721 as it looks simpler at first glance.

@bmansurov How do I get the chunks that need to be imported? If I need to generate the chunks myself, what's the procedure for generating them?

@Usmanmuhd I was thinking that we should split up the file into multiple files and import them one by one. But I'm curious to know if you can find a better way. Can you do some research on this? Thanks.

@bmansurov As we are using LOAD DATA, there is no way around splitting the file into chunks and then importing each file separately.

So basically the flow for import_normalized_ranks will be:

  1. Split the TSV file into chunks of 'k' rows (see the sketch after this list).
  2. Enter the number of chunks that need to be imported. Example command: python3 deploy.py import_normalized_ranks 20181130 ******** --normalized_ranks_file_chunks 20181130/en-es/ --source_language en --target_language es --number_of_chunks 10 --offset_of_chunks 2
  3. Have a clean-up command to clear all the generated chunks.
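
One possible shape for step 1 is sketched below; the file layout (chunk-<i>.tsv files in a chunk directory) and the default chunk size are assumptions for illustration, not part of the existing script.

```
# Hypothetical sketch of step 1: split a large TSV into chunk files of
# at most k rows each.  Paths and the default chunk size are made up.
import os


def split_tsv(tsv_path, out_dir, k=100000):
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths, chunk, index = [], [], 0
    with open(tsv_path) as tsv:
        for line in tsv:
            chunk.append(line)
            if len(chunk) == k:
                chunk_paths.append(write_chunk(out_dir, index, chunk))
                chunk, index = [], index + 1
        if chunk:  # leftover rows that didn't fill a whole chunk
            chunk_paths.append(write_chunk(out_dir, index, chunk))
    return chunk_paths


def write_chunk(out_dir, index, lines):
    path = os.path.join(out_dir, 'chunk-{}.tsv'.format(index))
    with open(path, 'w') as out:
        out.writelines(lines)
    return path
```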

How about we pass the combined filename to the script and have the chunking and importing done automatically behind the scenes? That way we don't have to worry about calculating the number of chunks ourselves. We should, of course, add logging and the ability to continue from where we left off in case of an error.

Yeah, that's a great idea. In case of a failure while importing a chunk (which can arise from invalid data in the chunk), how do we make sure that the values already inserted from that chunk are not inserted again?
One approach would be to validate each chunk before importing it.

Also maybe add a UNIQUE constraint?

Yeah, I think the UNIQUE constraint would be (wikidata_id, normalized_rank, source_id, target_id).
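
If we go that route, adding the key could look roughly like the sketch below; the table and key names are only examples based on the versioned naming that appears later in this task.

```
# Hypothetical sketch: add the UNIQUE key discussed above so that
# re-importing a chunk cannot insert duplicate rows.  The table and
# key names are illustrative only.
def add_unique_constraint(cursor, table='normalized_rank_20181130'):
    cursor.execute(
        "ALTER TABLE {} ADD UNIQUE KEY uniq_rank "
        "(wikidata_id, normalized_rank, source_id, target_id)".format(table))
```

With a key like that in place, one option for retried chunks would be LOAD DATA ... IGNORE INTO TABLE, which skips rows that would violate the unique key instead of failing.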

Also, do we need a feature to stop after importing k chunks and then continue from there later?

No, you don't have to stop. I think the system will schedule other processes between the steps of your loop.

@bmansurov I switched to a different system, and now when I run python3 deploy.py import_languages 20181130 localhost 3306 recommendationapi usman db_password.txt --language_file 20181130/language.tsv I get the error below. I tried various settings after searching online, but none seem to work. Basically the error is on the LOAD DATA LOCAL INFILE commands.

usman@swift~/deploy$ python3 deploy.py import_languages 20181130 localhost 3306 recommendationapi usman db_password.txt --language_file 20181130/language.tsv
Starting ...
Traceback (most recent call last):
  File "/home/usman/.local/lib/python3.6/site-packages/mysql/connector/connection_cext.py", line 472, in cmd_query
    raw_as_string=raw_as_string)
_mysql_connector.MySQLInterfaceError: The used command is not allowed with this MySQL version

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "deploy.py", line 240, in <module>
    main()
  File "deploy.py", line 216, in main
    options.version, options.language_file)
  File "deploy.py", line 143, in import_languages
    insert_languages_to_table(cursor, version, tsv)
  File "deploy.py", line 109, in insert_languages_to_table
    cursor.execute(sql.format(**data))
  File "/home/usman/.local/lib/python3.6/site-packages/mysql/connector/cursor_cext.py", line 266, in execute
    raw_as_string=self._raw_as_string)
  File "/home/usman/.local/lib/python3.6/site-packages/mysql/connector/connection_cext.py", line 475, in cmd_query
    sqlstate=exc.sqlstate)
mysql.connector.errors.ProgrammingError: 1148 (42000): The used command is not allowed with this MySQL version

@Usmanmuhd what is your OS and version? What are your MySQL and MySQL connector versions? README.org says you need to install python-mysql.connector on Debian. Try figuring out the package version in Debian and installing the same one on your system.

@bmansurov Thanks, I figured out the error. I was using Python 3 along with python-mysql.connector, which is for Python 2.

Change 527571 had a related patch set uploaded (by Usmanmuhd; owner: Usmanmuhd):
[research/article-recommender/deploy@master] Bulk import data to MySQL in chunks

https://gerrit.wikimedia.org/r/527571

@bmansurov The current workflow is:

  1. Chunks the file and stores the chunks in temp/<dir>/chunk-<i>.tsv.
  2. Imports each chunk: executes the SQL command and commits the transaction at the end of each chunk.
  3. Deletes the directory.

TODO:

  1. In case of an error that stops the import at the nth chunk, provide the ability to continue from that point after sorting out the error.

@bmansurov Added the ability to continue when the import stops at a given chunk.
Example:

python deploy/deploy.py import_normalized_ranks 20181130 localhost 3306 rapi usman db_password.txt --normalized_ranks_file 20181130/de-en.tsv --source_language de --target_language en

When this exits or is stopped with a KeyboardInterrupt, we can continue by adding the --continue_chunks flag to the same command:

python deploy/deploy.py import_normalized_ranks 20181130 localhost 3306 rapi usman db_password.txt --normalized_ranks_file 20181130/de-en.tsv --source_language de --target_language en --continue_chunks

If the flag is not added, the import starts from scratch.

@bmansurov I'm deleting the chunks once they're imported in order to keep track of which chunks have already been imported and which are yet to be imported. I don't really see the advantage of placing them in /tmp if I'm going to delete the files anyway.
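
For what it's worth, that bookkeeping could look something like the sketch below; the helper names are hypothetical, and the real implementation is in the Gerrit change above.

```
# Sketch of the bookkeeping described above: each chunk file is removed
# as soon as its transaction commits, so the files left in the chunk
# directory are exactly the chunks that still need importing, and
# --continue_chunks can simply walk whatever is left.
import glob
import os


def remaining_chunks(chunk_dir):
    return sorted(glob.glob(os.path.join(chunk_dir, 'chunk-*.tsv')))


def mark_chunk_done(connection, chunk_path):
    connection.commit()    # the chunk is safely in the database ...
    os.remove(chunk_path)  # ... so delete the file to record progress
```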

@bmansurov Made the changes as required. Please take a look.

@bmansurov Updated the patch. Please take a look.

@bmansurov How would I move from the temp table to the actual table?

Using INSERT INTO normalized_rank_20181130 SELECT * FROM temp_normalized_rank_20181130; takes quite a bit of time (35.90 sec). Is there another approach to this?

@bmansurov Updated the patch. Instead of committing after each chunk, it now commits all the chunks at the end. So if it fails at any point, rerunning the same command will import the TSV file without any problems.
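
A rough sketch of that single-transaction approach, assuming mysql.connector's default of autocommit being off; the names are illustrative, not the actual patch.

```
# Sketch of the single-transaction variant described above: every chunk
# is loaded first and one commit happens at the very end, so a failure
# part-way leaves nothing behind and rerunning the same command simply
# imports the whole TSV again.
def import_in_one_transaction(cursor, connection, chunk_paths, table):
    sql = ("LOAD DATA LOCAL INFILE '{file}' INTO TABLE {table} "
           "FIELDS TERMINATED BY '\\t'")
    try:
        for chunk in chunk_paths:
            cursor.execute(sql.format(file=chunk, table=table))
        connection.commit()    # single commit for all chunks
    except Exception:
        connection.rollback()  # no partial import is left behind
        raise
```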

Change 527571 merged by Bmansurov:
[research/article-recommender/deploy@master] Bulk import data to MySQL in chunks

https://gerrit.wikimedia.org/r/527571