Page MenuHomePhabricator

Request creation of WCDO VPS project
Closed, ResolvedPublic


**Project Name: WCDO <PROJECT-NAME - This will be used as the URL subdomain, and must be alphanumeric - no spaces or special characters other than - (hyphen).>

Wikipedia Cultural Diversity Observatory

Wikitech Usernames of requestors: <USERNAMES - 2 or more is best!>

To parse wikidata dumps data and retrieve data from the 288 wikipedias and create some specific datasets (every 15 days) aimed at fostering interlanguage and intercultural collaborations to bridge the language gaps.

<REASON for project request - a 1 sentence overview of what it will be used for>

Brief description: <more DETAILS and links about expected project administration needs, including links to software you intend to install (+ licence-pages) and if there are expected needs for higher disk-quota>

The project uses several strategies to retrieve data and process them to create the datasets. At the same time, it will generate a website on the same.

  • It will run a python script with a collection packages (reverse-geocoder, numpy, json, mysqldb, sqlite3) and it will create temporary sqlite3 databases to work the data.
  • It will run a python script to parse the wikidata dump and store the necessary info. This was attempted with mysql and it gave a very bad performance:
  • It will generate a python static website using Nikola framework.

In terms of storage:
It needs to have for 5-7 hours: 29 GB of the wikidata dump (downloaded, parsed and deleted).
It needs to have for a couple of days: 6GB of the parsed wikidata dump.
It stores approx. 7-8GB of generated data for the 288 languages in a sqlite3 file.
It will generate approx. 7 GB of datasets copies every two weeks.

The history of datasets copies can be deleted every few months. So, 80GB would be enough for the project.

How soon you are hoping this can be fulfilled: <this week/month/quarter - "as soon as possible" is an acceptable response but keep in mind resources are finite, and we may ask you to wait depending on availability and the size of the request.>
this week or the next one

Event Timeline

Chicocvenancio triaged this task as Medium priority.

This seems needed to be able to complete the tasks set forth in the grant, running it in toolforge does not seem sustainable and will impact other users.

However, there are a few things that could be worked on to lessen the resource use and make the analysis faster. Specifically, parsing the dump (especially from Wikidata) is a very resource intensive task, no matter how it is done.

@Legoktm in T189058#4033467 wrote:
Speaking anecdotally, I've found some computational code to run faster on my laptop than on the virtualized tools environment. But network requests to the API are extremely fast since they're in the same data center to the point where it more than makes up for any slowness.

This is a big issue, the big advantage of WMCS data services (Toolforge and Cloud VPS alike) is the proximity to the data sources, be wikireplicas or the API. Parsing the dump does not take much advantage of that (besides being there already in the NFS share and possibly being quickly copied).

@marcmiquel expressed to me in IRC that there is data that is not available in the wikireplicas for WIkidata, it seems we should investigate moving the analysis to API calls, if possible, and if it is possible/desired add this data to the wikireplicas (or how it can be derived from existing data today, ideally).

Another big source of improvements would be to move out from a serial workflow into one that allows for some parallelization. At the moment the script does one thing at a time and any waiting will slow it down, even if other tasks have all that is need to continue.

E.G. the script reads the dump one line at a time and processes it to insert data into a database. The script could be reading, and processing, the dump, independently from the data insertion in the database. At the moment it waits for the db to insert the data before reading the next line of the dump.

Finally, there are improvements to be made to the way data is read/inserted to the databases (even if using sqlite in cloudVPS, though that may become moot by the filesystem read/write speeds).

None of these issues seem urgent to me, but it is a good idea to keep them in mind in further developing this, and remembering to mitigate possible resource overuse in a shared environment. In the end, this project seems to have a good plan to develop the needed analyses.

OK. Sorry if it becomes boring, but let me reframe it again and explain bit better. I develop the current situation with some new experiments with MySQL in Toolforge.

I create the datasets named CCC from the WCDO project. I use:
a) Data from the Wikipedia mysql mediawiki databases replicas from each language edition.
b) Data from Wikidata.

In regards to a) there is no specific problem other than the general performance issues that are possible to handle. In regards to b) we are looking for an optimized solution because the datasets cannot be generated.

So, in b) I need to revise around 7 groups of properties for Wikidata items. I select the 'good data' to create the CCC datasets according to the properties they have, their relationships and some other criteria. So I need this data to be able to be read and processed several times.

I first tried with the API (SPARQLE) to obtain the Wikidata qitems and it was not viable. Too much info. Too many queries. Too many properties.

I later tried with the Wikidata dump because it needed to go through all the items. So I created a 'cleaned' version of the dump and went from 30 GB compressed to 6 GB in a sqlite3 database file.
Sqlite3 has the advantages of being a database (so I can select info), it is fast but it is no good at concurrency (I don't care about that).

I created a sqlite3 file in my laptop with the wikidata data in. One table named sitelinks (page_title, langcode, qitems) and the rest of tables with the qitems triplets (qitem1, property, qitem2). This way I can filter by languagecode with an inner join. This allowed me to do all the operations.

When moving from my laptop to tool labs grid I find some bottlenecks.

  1. If I want to create the sqlite3 file from the wikidata dump (whether I download it to the tool directory or I retrieved from somewhere else), it goes from 4h (time of parsing and storing in my laptop to 2 days).

This happens because the file is in NFS and the job is somewhere else and read-write speed is not good in this scenario.

@Chicocvenancio suggested me not to use the sqlite3 file but to introduce the data into the mysql user-created tables.
I use these tables in a), with the data selected from the replicas, for the final dataset creation (when data is verified as good) and I also use them for final 'results' that are going to be depicted in the website/wp.
I create new mysql tables to store the parsed dump as I did in the sqlite3.

  1. At this point, I come across the mysql bottlenecks. Parsing the wikidata dump and storing it to the new mysql tables takes 4 days.

Chicocvenancio points out that I could optimize the code and insert by batches. This is something I do with the verified data comiong from a) but I didn't think of doing it for b) since I kept the code I used for sqlite3 the same, and in that case writing to disk did not matter. This could be done.

The problem appears when I want to retrieve the data recently stored in the mysql tables. A simple query that uses two tables (sitelinks) and geolocation property takes 15-20 min. 'SELECT * FROM s53619wcdo.wd_geolocated_property INNER JOIN s53619wcdo.wd_sitelinks ON wd_sitelinks.qitem=wd_geolocated_property.qitem WHERE wd_sitelinks.langcode="'+languagecode+'wiki";'
This is the kind of query I mentioned that I would need to run for every WP of the 288. It uses the languagecode to filter by language with the sitelinks table.

In a sqlite3 file (on my laptop) the query execution is direct, it has no response time. But the mysql takes a lot, even having the indexes in both sitelinks and wd_geolocated_property tables. A lot means around 50 min for just one language (I tried with cawiki, which has roughly 600,000 articles).
I executed the query 'optimize table sitelinks' and it took 1h and 23 min and I am not sure it was positive, but now it takes 20 min for the cawiki. It might have helped.
However, I am concerned with the rest of properties tables. Obtaining the geolocations is direct. Same for the language-based properties. But in some cases there is recursion, and I need to compare qitems that are set as origin with others that are in destination and I may need to read the table several times. This is something I need to test well, but it may not be viable with mysql.

In the end of the day, I think that having a small VPS server would really help. It would allow the project to be independent from the mysql databases and only use them for the website displayed data.

@Chicocvenacio was suggesting that perhaps in the future the wikidata data from every item is available and possible to query from a replica. This would mean that the triplets would be in a table in the same way that links or langlinks from mediawiki database schema are in tables in the wp replicas. If this was done, it would simply the process for b).

In any other case, processing the dump and having another space to work the data is unavoidable. This means, every 15 days, to download the dump (30GB), to create the parsed version of it in few sqlite3 db tables (6GB) and generate the datasets for all languages (I estimated 7-9 GB for a collection).
This would mean that having a VPS with a disk of around 80 GB would be safe. Any new dataset could be moved to another place with all the history of copies, so the disk would be only a working space for the dataset generation and the website.

All in all, I think that VPS is the way to go. I shared all the process as best as I can with all the details from my experiments. This is my humble perspective. If anyone knows a better solution, I'd like to hear it.

+1 for trying a Cloud VPS project for this.

This seems needed to be able to complete the tasks set forth in the grant, running it in toolforge does not seem sustainable and will impact other users.

If I wasn't clear on that comment, I support the project request as well.

Project has been created.

Public read only info:
To create instances and manage the project:

Current quotas are the default ones:

root@labcontrol1001:~# nova --os-project-id wcdo quota-show 
| Quota                       | Limit |
| instances                   | 8     |
| cores                       | 8     |
| ram                         | 16384 |
| floating_ips                | 0     |
| fixed_ips                   | 200   |
| metadata_items              | 128   |
| injected_files              | 5     |
| injected_file_content_bytes | 10240 |
| injected_file_path_bytes    | 255   |
| key_pairs                   | 100   |
| security_groups             | 10    |
| security_group_rules        | 20    |
| server_groups               | 10    |
| server_group_members        | 10    |
bd808 assigned this task to aborrero.
bd808 moved this task from Inbox to Approved on the Cloud-VPS (Project-requests) board.