Page MenuHomePhabricator

Reading Wikidatadump at NFS share instance from the wcdo Cloud VPS project is too slow
Closed, ResolvedPublicBUG REPORT

Description

I am using this server (wcdo) to create some datasets and run some data visualizations. To do it, I read the dumps (pagelinks, categorylinks, wikidata, among others).

I simply read them line by line and parse the content and store them into a SQLite database. After a few months of not updating data (I was focused on some analyses and visualizations), I realized that the reading is now much slower than it was before (last October).

Originally, I used to download the dump from the website (dumps.wikimedia.org) to my disk, read them and delete them. However, now I cannot do this now because I don’t have the space to have both the dump and create the database. I used to do it but reading them from /public/dumps when it became available to me was faster as I didn’t have to download the dumps and the reading was at the same speed as if they were on the same server.

To have an idea, the last time I read the Wikidata dump from /public/dumps/ directly, it used to take 12 hours (counting the parsing and storing, without that, around 8). When I last tried to run this script on Friday, in 10 hours it only read the 3% of the dump. I haven’t tried reading all the other dumps (pagelinks, categorylinks, etc.) but I assume the speed will be similar.

Considering that I am generating a database/datasets for the 307 languages, I prefer reading from the dumps, as I already had to re-code many scripts that were using the Replicas because they got stuck on some queries or they used to be very slow.

Steps to Reproduce:
Here is my code to read the dump directly from /public/dumps and to download and read it locally.
https://pastebin.com/fxUBmCV2

One way to reproduce it would be trying this script from wcdo or a server at the same exact place and try it at a server closer (I assume they are in different networks) to /public/.

Actual Results:
It now takes 10 hours to read the 3% of the current dump, it took 12 hours to read the entire dump a few months ago. I cannot give more detailed results, but in any case, it is substantially slower and it makes it impossible to run the script.

Expected Results:
I would expect the same speed it had in July-October 2019.

Sorry if it brings any inconvenience, but it is quite critical as it does not allow me to update the data that is needed for the visualizations that are online. I don’t know the best solution though. Let me know if I can bring any more information or do any test in specific.

Thank you very much.

Event Timeline

Restricted Application edited projects, added cloud-services-team (Kanban); removed cloud-services-team. · View Herald TranscriptApr 26 2020, 8:43 PM
marcmiquel renamed this task from NFS share on instances in the wcdo Cloud VPS project is SLOWER to Reading Wikidatadump at NFS share instance from the wcdo Cloud VPS project is too slow.Apr 26 2020, 8:44 PM
Reedy removed Bstorm as the assignee of this task.Apr 26 2020, 10:16 PM
JHedden triaged this task as Medium priority.May 5 2020, 4:13 PM
bd808 added a comment.May 5 2020, 4:36 PM

I think it might be possible that the slowness is actually being caused by the sqlite writes. This instance seems to be currently hosted on cloudvirt1007 which is in our oldest generation of hypervisors. It looks to me like the instance was probably moved to this hypervisor on 2020-01-23. We had two different hypervisors crash on that date and moved a bunch of instances around to keep them running. Unfortunately I am not sure if we have a better hypervisor to relocate the instance to right now.

Thanks for looking at it. I remember it worked very well last July. Even in
November, I am not sure it was working as fast as in July.
I'll do one more test and I'll get back to you. Thanks.

Missatge de bd808 <no-reply@phabricator.wikimedia.org> del dia dt., 5 de
maig 2020 a les 18:36:

bd808 added a comment. View Task
https://phabricator.wikimedia.org/T251065

I think it might be possible that the slowness is actually being caused by
the sqlite writes. This instance seems to be currently hosted on
cloudvirt1007 which is in our oldest generation of hypervisors. It looks to
me like the instance was probably moved to this hypervisor on 2020-01-23.
We had two different hypervisors crash on that date and moved a bunch of
instances around to keep them running. Unfortunately I am not sure if we
have a better hypervisor to relocate the instance to right now.

*TASK DETAIL*
https://phabricator.wikimedia.org/T251065

*EMAIL PREFERENCES*
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

*To: *bd808
*Cc: *Aklapper, Bstorm, bd808, marcmiquel, Phamhi, Legado_Shulgin,
94rain, Killertrap, sietec, Jony, Zylc, Giuliamocci, 1978Gage2001,
Chicocvenancio, Allthingsgo, Tbscho, JJMC89, mys_721tx, Jitrixis, Gryllida,
scfc, Krenair

marcmiquel added a comment.EditedMay 11 2020, 11:20 AM

I did a couple of tests to understand better how it is working. The test consisted in running an estimation of the 1% of the Wikidata dump (900,000 lines), assuming that it has 85,696,352 qitems. (If I remember correctly, last October it was around 60M, the increase is notable but not comparable to the increase in time reading the dump).

The first test was including the SQLite3 write batch-writing operations. The second one was without them.

This is my code (https://pastebin.com/L91tmqfM). The code presents the lines in which I write to SQLite3 as commented for the second test.

This is what the code printed with SQLite3 writing operations:

Created the Wikidata sqlite3 file and tables.
0:00:00.592773
Iterating the dump.
900000
1.050219733974207 % completed
current time: 2415.8241255283356
number of line per second: 372.54367194436776

With this, simply multiplying by 100, we have 67 hours of estimated time to read the entire dump. (Few months ago, as I reported, it was around 12 hours).

This is what the code printed without the writing operations:

0:00:00.610148
Iterating the dump.
900000
1.050219733974207
current time: 1526.5433514118195
number of line per second: 589.5672663012574

With this, we would get 42 hours of estimated time to read the entire dump. It is still 3,5x as it used to be (however, writing to the SQlite3 is essential…. so I cannot not save the data). Since few months ago, the number of writing operations might have increased 20%.

The problem is that it is not only the Wikidata dump that I am reading from /public/dumps but the pagelinks, categorylinks, pages, and a long etcetera, including the new mediawiki history dump. I initially used the replicas but I had many bottlenecks and the time it took to run for the 300 languages was impossible (I had created issues here and discussed this in the wikimedia cloud channel and I rewrote the scripts to use the dumps). I am afraid that if there is such a increase in time reading the wikidata dump, it means that it will increase for all the processes using them.

What could we do? Thanks in advance for your time! :)

What could we do? Thanks in advance for your time! :)

This instance seems to be currently hosted on cloudvirt1007 which is in our oldest generation of hypervisors. [...] Unfortunately I am not sure if we have a better hypervisor to relocate the instance to right now.

Moving your instance to a different hypervisor would be the obvious answer. This would almost certainly increase the speed of disk writes and likely would increase the speed of raw processing operations as well. I spent some time staring at our capacity dashboard, and I can not find a good place to move your instance to today.

cloudvirt1023 is the only hypervisor server we currently have that does not have known hardware issues, is not dedicated to a particular workload (like a database cluster), and is not oversubscribed already. This hypervisor however is also our only current "spare" to take load should some other hypervisor run into hardware/software problems. As you may be able to see on the dashboard, this "spare" currently has 17 instances running on it as well.

I will raise this for discussion in the cloud-services-team meeting on 2020-05-13 to see if we can find any better response for you other than "wait for new hardware to be in service."

bd808 moved this task from Backlog to Shared Storage on the Data-Services board.May 11 2020, 11:26 PM
bd808 moved this task from Shared Storage to Dumps on the Data-Services board.

Hi @bd808,
Thanks for your answer and for dedicating this time to look at the capacity dashboard.

I understand that it is not easy to move the instances and that there might be different factors to take into account. I will continue doing other tasks meanwhile. Please, let me know if there is any news coming out of the meeting or there are has any other possibility.
Thanks.

bd808 claimed this task.May 13 2020, 11:29 PM
bd808 moved this task from Needs discussion to Soon! on the cloud-services-team (Kanban) board.

@marcmiquel we decided that we can put this instance on our cloudvirt1023 hypervisor. To do that migration we need to shutdown the instance while the virtual disk image that it runs from is copied from one hypervisor to another. The downtime is basically a function of how long it takes to copy the disk across the network. Your instance's disk is relatively large so it could take up to an hour (although hopefully much shorter).

What kind of notice do you need before we can do that and is there any time of day or week that would be less disruptive for your work on the instance?

@bd808 thank you for the news. the downtime is not a problem at all. if the migration could be done on Sunday or Monday it would be great.
Thank you very much for taking care of it and finding a solution.
Best,

bd808 added a comment.May 19 2020, 6:57 PM

Ugh. I meant to do this on Monday 2020-05-18 and even set a calendar reminder for myself. Attention to detail fail. I will check the instance and migrate today if it looks like it is reasonably inactive.

Sure! No problem. Please, let me know when it is done. :) Thanks.

Missatge de bd808 <no-reply@phabricator.wikimedia.org> del dia dt., 19 de
maig 2020 a les 20:57:

bd808 added a comment. View Task
https://phabricator.wikimedia.org/T251065

Ugh. I meant to do this on Monday 2020-05-18 and even set a calendar
reminder for myself. Attention to detail fail. I will check the instance
and migrate today if it looks like it is reasonably inactive.

*TASK DETAIL*
https://phabricator.wikimedia.org/T251065

*EMAIL PREFERENCES*
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

*To: *bd808
*Cc: *Aklapper, Bstorm, bd808, marcmiquel, Phamhi, Legado_Shulgin,
94rain, Killertrap, sietec, Jony, Zylc, Giuliamocci, 1978Gage2001,
Chicocvenancio, Allthingsgo, Tbscho, JJMC89, mys_721tx, Jitrixis, Gryllida,
scfc, Krenair

Mentioned in SAL (#wikimedia-cloud) [2020-05-19T21:12:52Z] <bd808> Migrating wcdo.wcdo.eqiad.wmflabs to cloudvirt1023 (T251065)

Mentioned in SAL (#wikimedia-cloud) [2020-05-19T23:06:39Z] <bd808> Migrated wcdo.wcdo.eqiad.wmflabs to cloudvirt1023 (T251065)

@marcmiquel Your instance has been migrated and is now running on the cloudvirt1023 hypervisor. Hopefully this helps with your performance issues.

Thanks!

Missatge de bd808 <no-reply@phabricator.wikimedia.org> del dia dc., 20 de
maig 2020 a les 1:09:

bd808 added a comment. View Task
https://phabricator.wikimedia.org/T251065

@marcmiquel https://phabricator.wikimedia.org/p/marcmiquel/ Your
instance has been migrated and is now running on the cloudvirt1023
hypervisor. Hopefully this helps with your performance issues.

*TASK DETAIL*
https://phabricator.wikimedia.org/T251065

*EMAIL PREFERENCES*
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

*To: *bd808
*Cc: *Aklapper, Bstorm, bd808, marcmiquel, Phamhi, Legado_Shulgin,
94rain, Killertrap, sietec, Jony, Zylc, Giuliamocci, 1978Gage2001,
Chicocvenancio, Allthingsgo, Tbscho, JJMC89, mys_721tx, Jitrixis, Gryllida,
scfc, Krenair

Oh, wow, it is almost 10x faster! Thanks.

Missatge de Marc Miquel <marcmiquel@gmail.com> del dia dc., 20 de maig 2020
a les 9:30:

Thanks!

Missatge de bd808 <no-reply@phabricator.wikimedia.org> del dia dc., 20 de
maig 2020 a les 1:09:

bd808 added a comment. View Task
https://phabricator.wikimedia.org/T251065

@marcmiquel https://phabricator.wikimedia.org/p/marcmiquel/ Your
instance has been migrated and is now running on the cloudvirt1023
hypervisor. Hopefully this helps with your performance issues.

*TASK DETAIL*
https://phabricator.wikimedia.org/T251065

*EMAIL PREFERENCES*
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

*To: *bd808
*Cc: *Aklapper, Bstorm, bd808, marcmiquel, Phamhi, Legado_Shulgin,
94rain, Killertrap, sietec, Jony, Zylc, Giuliamocci, 1978Gage2001,
Chicocvenancio, Allthingsgo, Tbscho, JJMC89, mys_721tx, Jitrixis, Gryllida,
scfc, Krenair

bd808 closed this task as Resolved.Jun 3 2020, 4:58 PM