We have an article recommendation API that suggests articles for creation based on a seed article. For example, articles similar to 'Book', identified by their Wikidata item IDs, may be missing from enwiki and are thus suggested for creation.
The API pulls data from various places, one of which is MySQL. Data gets into MySQL via the import script in the article-recommender/deploy repository. Since the import script runs on a shared host and writes to a shared database, we'd like to avoid blocking other processes while importing large quantities of data. For this reason we'd like to import the data in chunks.
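The core of a chunked import is splitting the row stream into fixed-size batches and committing each batch separately, so the database lock is held only briefly per batch. A minimal sketch of the batching helper (the `chunked` name and the surrounding comments are illustrative, not from the repository):

```python
from itertools import islice

def chunked(rows, size):
    """Yield successive lists of at most `size` rows from an iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# In production each chunk would be passed to cursor.executemany()
# followed by a commit, so other clients of the shared database are
# not locked out for the duration of the whole import.
rows = range(10)
sizes = [len(c) for c in chunked(rows, 4)]
print(sizes)  # → [4, 4, 2]
```

Committing per chunk trades a small amount of import throughput for much shorter lock hold times on the shared host.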
- @bmansurov (IRC channel: #wikimedia-research)
- MySQL, Python
- The TSV files generated by T210844: Generate article recommendations in Hadoop for use in production are split up into multiple chunks of, say, 50,000 rows each.
- An import script that imports all chunks.
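The import script could iterate over the chunk files, loading each in its own transaction and optionally pausing between chunks. The sketch below uses sqlite3 as a stand-in for MySQL so it is self-contained; the table name `recommendation`, its columns, and the chunk-file pattern are assumptions for illustration. A production version would connect to MySQL instead (e.g. with pymysql, which follows the same DB-API but uses `%s` placeholders rather than `?`):

```python
import csv
import glob
import os
import sqlite3
import tempfile
import time

PAUSE = 0.0  # seconds to sleep between chunks; tune on the shared host

def import_chunks(conn, pattern):
    """Import every TSV chunk file matching `pattern`, one transaction per chunk."""
    total = 0
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            rows = list(csv.reader(f, delimiter="\t"))
        with conn:  # one commit per chunk keeps lock hold times short
            conn.executemany(
                # table/columns are hypothetical; MySQL would use %s placeholders
                "INSERT INTO recommendation (wikidata_id, score) VALUES (?, ?)",
                rows,
            )
        total += len(rows)
        time.sleep(PAUSE)  # yield to other clients between chunks
    return total

# Demo with an in-memory database and two tiny chunk files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recommendation (wikidata_id TEXT, score REAL)")
with tempfile.TemporaryDirectory() as d:
    for i, chunk in enumerate([[("Q571", 0.9), ("Q42", 0.8)], [("Q1", 0.7)]]):
        with open(os.path.join(d, f"chunk_{i:04d}.tsv"), "w", newline="") as f:
            csv.writer(f, delimiter="\t").writerows(chunk)
    n = import_chunks(conn, os.path.join(d, "chunk_*.tsv"))
print(n)  # → 3
```

Sorting the matched file names makes the import order deterministic, and a per-chunk commit means a failed run can be resumed from the first chunk that did not complete.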