We have an article recommendation API that suggests articles for creation based on a seed article. For example, [[ https://en.wikipedia.org/api/rest_v1/data/recommendation/article/creation/morelike/Book | here ]] you can see articles that are similar to 'Book', identified by their Wikidata item IDs, that are missing from enwiki and are therefore being suggested for creation.
The API pulls data from various places, one of which is MySQL. Data gets into MySQL via the [[ https://gerrit.wikimedia.org/r/research/article-recommender/deploy | article-recommender/deploy ]] repository. Since we run the import script on a shared host and import data into a shared database, we'd like to avoid blocking other processes while importing large quantities of data. For this reason we'd like to import the data in chunks.
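The chunked-import idea can be sketched roughly as follows: group rows into fixed-size batches and commit each batch separately, pausing between batches so other clients of the shared database get a chance to run. The `execute_batch` callback and the batch size are illustrative assumptions, not taken from the actual deploy repo.

```python
import itertools
import time


def batched(rows, size):
    """Yield successive lists of at most `size` rows from an iterable."""
    it = iter(rows)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch


def import_in_chunks(rows, execute_batch, size=50000, pause=1.0):
    """Import rows batch by batch instead of in one long transaction.

    `execute_batch` is a caller-supplied function (hypothetical, for
    illustration) that would typically wrap cursor.executemany(...)
    followed by connection.commit().
    """
    for batch in batched(rows, size):
        execute_batch(batch)
        time.sleep(pause)  # let other clients of the shared DB run
```

With a MySQL driver such as PyMySQL, `execute_batch` could be a small wrapper around `cursor.executemany("INSERT INTO ... VALUES (%s, %s)", batch)` plus a commit, so each chunk is a separate, short-lived transaction.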
= Mentors
- @bmansurov (IRC channel: `#wikimedia-research`)
= Skills required
- MySQL, Python
= Acceptance Criteria
- [ ] The TSV files generated by {T210844} are split into multiple chunks of, say, 50,000 rows each.
- [ ] The import script imports all chunks.