Project Name: WCDO <PROJECT-NAME - This will be used as the URL subdomain, and must be alphanumeric - no spaces or special characters other than - (hyphen).>
Wikipedia Cultural Diversity Observatory
https://meta.wikimedia.org/wiki/Grants:Project/Wikipedia_Cultural_Diversity_Observatory_(WCDO)
Wikitech Usernames of requestors: <USERNAMES - 2 or more is best!>
marcmiquel
Purpose:
To parse the Wikidata dumps, retrieve data from the 288 Wikipedia language editions, and generate datasets every 15 days aimed at fostering interlanguage and intercultural collaboration to bridge language gaps.
<REASON for project request - a 1 sentence overview of what it will be used for>
Brief description: <more DETAILS and links about expected project administration needs, including links to software you intend to install (+ licence-pages) and if there are expected needs for higher disk-quota>
The project uses several strategies to retrieve and process data in order to create the datasets. It will also generate a website presenting them.
- It will run a Python script that uses a collection of packages (reverse-geocoder, numpy, json, mysqldb, sqlite3) and creates temporary sqlite3 databases to process the data.
- It will run a Python script to parse the Wikidata dump and store the necessary information. An earlier attempt with MySQL performed very poorly: https://phabricator.wikimedia.org/T189058
- It will generate a static website using Nikola, a Python static site generator.
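As an illustration of the dump-parsing step, here is a minimal sketch that streams a Wikidata JSON dump line by line and writes rows to sqlite3. It relies on the dump being a single JSON array with one entity per line. The function names, the `sitelinks` table, and the choice to store sitelinks are hypothetical, chosen only for the example; the project's actual script may extract different fields.

```python
import io
import json
import sqlite3


def iter_entities(lines):
    """Yield one entity dict per line of a Wikidata JSON dump.

    The dump is one JSON array with one entity per line, so each line can be
    decoded on its own once the trailing comma is removed.
    """
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


def store_sitelinks(lines, conn):
    """Store (qid, wiki, title) rows for every sitelink; return rows written."""
    # Hypothetical schema, used only for this sketch.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sitelinks (qid TEXT, wiki TEXT, title TEXT)"
    )
    count = 0
    for entity in iter_entities(lines):
        rows = [
            (entity["id"], wiki, link["title"])
            for wiki, link in entity.get("sitelinks", {}).items()
        ]
        conn.executemany("INSERT INTO sitelinks VALUES (?, ?, ?)", rows)
        count += len(rows)
    conn.commit()
    return count


# Tiny in-memory sample standing in for the real dump file.
sample = io.StringIO(
    "[\n"
    '{"id": "Q1", "sitelinks": {"cawiki": {"site": "cawiki", "title": "Univers"}}},\n'
    '{"id": "Q2", "sitelinks": {}}\n'
    "]\n"
)
conn = sqlite3.connect(":memory:")
written = store_sitelinks(sample, conn)
```

For a real dump, the `lines` argument could be a file opened with `gzip.open(path, "rt", encoding="utf-8")` or `bz2.open(path, "rt", encoding="utf-8")`, so the compressed file is read as a stream and never fully loaded into memory.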
In terms of storage:
For 5-7 hours, it needs 29 GB for the Wikidata dump (downloaded, parsed, then deleted).
For a couple of days, it needs 6 GB for the parsed Wikidata dump.
It stores approx. 7-8 GB of generated data for the 288 languages in a sqlite3 file.
It will generate approx. 7 GB of dataset copies every two weeks.
The history of dataset copies can be deleted every few months, so 80 GB would be enough for the project.
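The storage estimate can be checked with a short calculation. This is a rough sketch using the figures listed above (upper bounds where a range is given). It assumes the worst case, where the raw dump, the parsed dump, and the sqlite3 file are all on disk at once; the constant and function names are illustrative only.

```python
# Figures in GB, taken from the estimates above.
DUMP_GB = 29     # raw Wikidata dump, held for 5-7 hours
PARSED_GB = 6    # parsed dump, held for a couple of days
SQLITE_GB = 8    # generated data for the 288 languages (upper bound)
COPY_GB = 7      # one dataset copy, produced every two weeks
QUOTA_GB = 80    # requested quota


def copies_that_fit(quota_gb=QUOTA_GB):
    """Return how many dataset copies fit while a dump is being processed."""
    # Assumption: all three working files coexist at peak usage.
    peak_working_set = DUMP_GB + PARSED_GB + SQLITE_GB
    return (quota_gb - peak_working_set) // COPY_GB


retained = copies_that_fit()
weeks_of_history = retained * 2  # one copy every two weeks
```

Under these assumptions the peak working set is 43 GB, which leaves room for five dataset copies, or about ten weeks of history, before older copies need to be deleted.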
How soon you are hoping this can be fulfilled: <this week/month/quarter - "as soon as possible" is an acceptable response but keep in mind resources are finite, and we may ask you to wait depending on availability and the size of the request.>
This week or next week.