
Performance with Wikidata parsing and storing into sqlite and toolsdb (MySQL)
Closed, Resolved · Public

Description

To give some context: I am creating datasets for the WCDO project (https://meta.wikimedia.org/wiki/Grants:Project/Wikipedia_Cultural_Diversity_Observatory_(WCDO)). I am using both the Wikipedia replicas and Wikidata. At the moment I am having problems with the Wikidata dumps on Toolforge.

I need to parse a Wikidata dump to select specific triplets (qitem-property-qitem), and I initially loaded them into an sqlite3 database (a .db file). Running through the whole dump and parsing the relevant data takes 5 hours on my laptop, which is quite fast.
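
Roughly, the parsing step looks like this (a simplified sketch, not my exact code; the dump path, database file, and table/column names are placeholders):

```python
# Sketch: stream a Wikidata JSON dump and store item-property-item triples in sqlite3.
# File and table names below are illustrative only.
import gzip
import json
import sqlite3

DUMP_PATH = 'latest-all.json.gz'     # placeholder path to the dump
DB_PATH = 'wikidata_triples.db'      # placeholder output database

conn = sqlite3.connect(DB_PATH)
conn.execute('CREATE TABLE IF NOT EXISTS triples '
             '(qitem TEXT, property TEXT, target_qitem TEXT)')

batch = []
with gzip.open(DUMP_PATH, 'rt', encoding='utf-8') as dump:
    for line in dump:
        line = line.rstrip(',\n')
        if line in ('[', ']', ''):   # skip the surrounding JSON array brackets
            continue
        entity = json.loads(line)
        for prop, claims in entity.get('claims', {}).items():
            for claim in claims:
                snak = claim.get('mainsnak', {})
                if snak.get('datatype') != 'wikibase-item':
                    continue
                value = snak.get('datavalue', {}).get('value', {})
                if 'id' in value:
                    batch.append((entity['id'], prop, value['id']))
        if len(batch) >= 10000:      # insert and commit in batches to keep it fast
            conn.executemany('INSERT INTO triples VALUES (?, ?, ?)', batch)
            conn.commit()
            batch = []

if batch:
    conn.executemany('INSERT INTO triples VALUES (?, ?, ?)', batch)
    conn.commit()
conn.close()
```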

It seems the same code takes around 24 hours on the grid, since NFS access is not as fast.

It was suggested that I use the MySQL ToolsDB databases. I already use those databases as the final place to put the data (to generate some visualizations on a website) before extracting it into .csv datasets.

The reason I did not use MySQL for this was that I suspected retrieving a large number of rows (millions) would be problematic. Sometimes in the past I retrieved a large number of rows and put them in memory to perform some operations, and later realized they were not really in memory but still held by the MySQL connection, which eventually dropped with a timeout.
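
For illustration, the access pattern I would need for millions of rows is something like reading in chunks with a server-side cursor instead of assuming everything is already in memory (a rough sketch only; the host, database, and table names are placeholders, not my real configuration):

```python
# Sketch: stream a large result set from ToolsDB in chunks using pymysql's
# server-side cursor, so millions of rows are never held at once.
# Host, database, and table names are placeholders.
import pymysql
import pymysql.cursors

conn = pymysql.connect(host='tools.db.svc.eqiad.wmflabs',      # ToolsDB host (assumption)
                       db='sXXXXX__wcdo',                      # placeholder database name
                       read_default_file='~/replica.my.cnf',   # Toolforge credentials file
                       cursorclass=pymysql.cursors.SSCursor)   # unbuffered, server-side cursor

with conn.cursor() as cur:
    cur.execute('SELECT qitem, property, target_qitem FROM triples')  # placeholder table
    while True:
        rows = cur.fetchmany(50000)   # pull a manageable chunk at a time
        if not rows:
            break
        for qitem, prop, target in rows:
            pass                      # ...process each row here...
conn.close()
```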

In any case, I changed the code from sqlite3 to MySQL to load the parsed dump directly into new ToolsDB tables. At the same time I started using the Wikidata dump from '/public/dumps/public/wikidatawiki/entities/latest-all.json.gz' instead of the copy I had downloaded into my own directory.
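
For concreteness, the shape of that loading step is roughly the following (again a simplified sketch with placeholder names, not my exact code; the real code may differ in batch and commit sizes):

```python
# Sketch: load parsed triples into ToolsDB with batched executemany() calls
# and one commit per batch rather than per row.
# Host, database, and table names are placeholders.
import pymysql

def load_triples(triples, batch_size=10000):
    """triples: any iterable of (qitem, property, target_qitem) tuples from the dump parser."""
    conn = pymysql.connect(host='tools.db.svc.eqiad.wmflabs',    # ToolsDB host (assumption)
                           db='sXXXXX__wcdo',                    # placeholder database name
                           read_default_file='~/replica.my.cnf',
                           autocommit=False)
    sql = ('INSERT INTO triples (qitem, property, target_qitem) '
           'VALUES (%s, %s, %s)')
    batch = []
    with conn.cursor() as cur:
        for triple in triples:
            batch.append(triple)
            if len(batch) >= batch_size:
                cur.executemany(sql, batch)   # one round trip for the whole batch
                conn.commit()                 # one commit per batch, not per row
                batch = []
        if batch:
            cur.executemany(sql, batch)
            conn.commit()
    conn.close()
```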

But now the code seems much slower... In 6 hours the job could only parse 10% of the Wikidata dump.

I do not know what to do, because this is not a one-time thing. It needs to be automated weekly or every couple of weeks. If these operations are so slow, I do not want to think about how long it will take to query the generated MySQL databases hundreds of times with data from the 300 languages.

I am surprised the code runs faster on my laptop (a four-year-old MacBook). I would like to find a solution. Perhaps it would be a good idea to have a dedicated server, as some similar projects like WIGI (wigi.wmflabs.org) have. It would also make it possible to have a website subdomain, wcdo.wmflabs.org (currently I have https://tools.wmflabs.org/wcdo/), although that is a separate matter. Right now I am concerned about the data and how the whole thing will work.

Thank you very much for your help!! :)

Event Timeline

> Perhaps it would be a good idea to have a dedicated server, as some similar projects like WIGI (wigi.wmflabs.org) have. It would also make it possible to have a website subdomain, wcdo.wmflabs.org (currently I have https://tools.wmflabs.org/wcdo/), although that is a separate matter. Right now I am concerned about the data and how the whole thing will work.

This part involves creating a Cloud VPS project and managing your own virtual server; see https://wikitech.wikimedia.org/wiki/Help:Getting_Started#Getting_started_with_VPS_Projects

I'll leave the rest of my analysis of why it is slow for later.

sqlite on NFS is generally slow: SQLite relies on file locking and many small synchronous writes, both of which NFS handles poorly.

> I am surprised the code runs faster on my laptop (a four-year-old MacBook).

Speaking anecdotally, I've found that some computational code runs faster on my laptop than in the virtualized Tools environment. But network requests to the API are extremely fast since they stay within the same data center, to the point where it more than makes up for any slowness.

This has been resolved by T189165.