I am using this server (wcdo) to create some datasets and run some data visualizations. To do this, I read several dumps (pagelinks, categorylinks, Wikidata, among others).
I simply read them line by line, parse the content, and store it in a SQLite database. After a few months of not updating the data (I was focused on some analyses and visualizations), I realized that reading is now much slower than it was before (last October).
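To illustrate, the scripts essentially follow this pattern (a minimal sketch only; the dump path, table schema, and parsing below are placeholders, not the real code):

```
# Minimal sketch of the pipeline: stream a compressed dump line by line
# and batch-insert into SQLite. Path, schema and parsing are placeholders.
import bz2
import sqlite3

DUMP_PATH = '/public/dumps/public/wikidatawiki/entities/latest-all.json.bz2'

conn = sqlite3.connect('wcdo.db')
conn.execute('CREATE TABLE IF NOT EXISTS items (qitem TEXT, content TEXT)')

batch = []
with bz2.open(DUMP_PATH, 'rt', encoding='utf-8') as dump:
    for line in dump:
        # The real parsing happens here; this is just a placeholder.
        batch.append((line[:12], line))
        if len(batch) >= 10000:
            conn.executemany('INSERT INTO items VALUES (?, ?)', batch)
            conn.commit()
            batch = []
if batch:
    conn.executemany('INSERT INTO items VALUES (?, ?)', batch)
conn.commit()
conn.close()
```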
Originally, I used to download the dumps from the website (dumps.wikimedia.org) to my disk, read them, and delete them. However, I cannot do this now because I don’t have the space to hold both the dump and the database being created. Besides, once /public/dumps became available to me, reading from there was faster: I didn’t have to download the dumps first, and reading was as fast as if the files were on the same server.
To give an idea: the last time I read the Wikidata dump directly from /public/dumps/, it took 12 hours (counting the parsing and storing; without those, around 8). When I last tried to run this script on Friday, in 10 hours it had read only 3% of the dump. I haven’t tried reading all the other dumps (pagelinks, categorylinks, etc.), but I assume the speed will be similar.
Considering that I am generating a database/datasets for all 307 languages, I prefer reading from the dumps: I already had to re-code many scripts that used the Replicas because they got stuck on some queries or were very slow.
Steps to Reproduce:
Here is my code to read the dump directly from /public/dumps and, alternatively, to download it and read it locally.
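In simplified form, the two read paths look like this (the dump file name is just an example, and the parsing/storing is stubbed out; the real script does the full work):

```
# Sketch of the two read paths: direct read from the NFS mount vs.
# download-then-read. The dump name below is an example.
import gzip
import os
import urllib.request

DUMP = 'cawiki/latest/cawiki-latest-pagelinks.sql.gz'  # example dump

def read_from_public_dumps():
    """Read directly from the NFS mount, without a local copy."""
    path = '/public/dumps/public/' + DUMP
    with gzip.open(path, 'rt', encoding='utf-8', errors='replace') as f:
        for line in f:
            pass  # parse and store each line here

def download_and_read():
    """Download the dump to local disk first, read it, then delete it."""
    url = 'https://dumps.wikimedia.org/' + DUMP
    local = os.path.basename(DUMP)
    urllib.request.urlretrieve(url, local)
    with gzip.open(local, 'rt', encoding='utf-8', errors='replace') as f:
        for line in f:
            pass  # parse and store each line here
    os.remove(local)
```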
One way to reproduce it would be to run this script from wcdo (or a server in the same exact place) and then from a server closer to /public/ (I assume they are on different networks), and compare the speeds.
It now takes 10 hours to read 3% of the current dump, whereas a few months ago it took 12 hours to read the entire dump. I cannot give more detailed results, but in any case it is substantially slower, to the point that running the script is impossible.
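If it helps to get more detailed numbers, the raw sequential read throughput of the mount can be measured independently of my parsing code, for example like this (the path is an example):

```
# Measure raw sequential read speed from /public/dumps, without any
# decompression or parsing, to isolate NFS throughput from script overhead.
import time

PATH = '/public/dumps/public/wikidatawiki/entities/latest-all.json.bz2'  # example
CHUNK = 64 * 1024 * 1024  # 64 MiB per read

read_total = 0
start = time.time()
with open(PATH, 'rb') as f:
    while read_total < 10 * CHUNK:  # sample roughly 640 MiB
        data = f.read(CHUNK)
        if not data:
            break
        read_total += len(data)
elapsed = time.time() - start
print('%.1f MiB/s' % (read_total / elapsed / 2**20))
```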
I would expect the same speed it had in July-October 2019.
Sorry for any inconvenience, but this is quite critical, as it prevents me from updating the data needed for the visualizations that are online. I don’t know the best solution, though. Let me know if I can provide any more information or run any specific test.
Thank you very much.