
Wikimedia Cloud Services support for contribution from the IGN to Structured Data on Commons
Closed, ResolvedPublic

Description

Wikimedia España and the Instituto Geográfico Nacional (IGN), the Spanish National Geographic Institute, want the most valuable content from the IGN to be widely shared under a CC-BY 4.0 license as one of the first contributions to Structured Data on Commons. The selected content includes:

  • a series of raster topographic maps at a scale of 1:25,000, in TIFF and ECW formats (~785 GB),
  • a series of scanned maps (some topographic maps at a scale of 1:25,000, some at a scale of 1:50,000, and some maps of Spanish provinces and autonomous communities) from the first to the last edition (T203010), and
  • a series of orthophotos, just the most recent versions, excluding historical mosaics (~1.22 TB).

Although the IGN offers us more material, the described files are the most valuable and usable ones for the Wikimedia projects, so none of them should be excluded. However, preparing and transferring them all is no small feat. We should:

  • transfer all the files from the servers of the IGN to any other server under the control of the Wikimedia movement (for example, Wikimedia Cloud Services),
  • transform all the media files to the corresponding accepted formats for Wikimedia Commons (see the conversion sketch below),
  • organize and format the available metadata for Structured Data on Commons, and
  • upload the files.

As a first step, we would like someone in charge of Wikimedia Cloud Services to tell us the best way to proceed and whether Wikimedia Cloud Services could be used for this purpose.
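As an illustration of the conversion step, here is a minimal sketch assuming GDAL's Python bindings are available (note that reading ECW usually requires GDAL's proprietary ERDAS plugin); the file names are hypothetical:

```python
# Minimal sketch of the format-conversion step, assuming GDAL's Python
# bindings; ECW input usually needs the proprietary ERDAS plugin.
from osgeo import gdal

gdal.UseExceptions()  # raise Python exceptions instead of returning None

def to_commons_tiff(src_path: str, dst_path: str) -> None:
    """Convert a source raster (e.g. ECW) into a compressed, tiled
    GeoTIFF, a format Wikimedia Commons accepts."""
    gdal.Translate(
        dst_path,
        src_path,
        format="GTiff",
        creationOptions=["COMPRESS=LZW", "TILED=YES"],
    )

# Hypothetical sheet name, for illustration only:
# to_commons_tiff("MTN25_0559.ecw", "MTN25_0559.tif")
```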

Event Timeline

Restricted Application added a subscriber: Aklapper.

Hey, this is great news, since the IGN is indeed a powerful source of data on geography, topography and mountains (which I love).

From the Wikimedia Cloud Services point of view, one thing we could offer is a temporary Cloud VPS project, where you can create one or two virtual machines to do all this work. We can delete the project when you are done.

If you plan to hold all the data at the same time, it seems you need about 3TB just to store what the IGN offers. Then I guess you need at least the same again for processing the data and the intermediate step before uploading to Wikimedia Commons.
That is about 6TB of storage, which I think is not trivial to allocate.
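As a quick sanity check on those numbers (a sketch; the scanned-maps series from T203010 has no published size yet, so it is left out):

```python
# Back-of-the-envelope storage estimate from the sizes in the description;
# the scanned-maps series (T203010) has no published size yet.
topo_tb = 0.785    # raster topographic maps, TIFF + ECW (~785 GB)
ortho_tb = 1.22    # most recent orthophotos (~1.22 TB)

raw_tb = topo_tb + ortho_tb    # source data alone
total_tb = 2 * raw_tb          # plus the same again for intermediate output

print(f"raw: {raw_tb:.1f} TB, raw + working copy: {total_tb:.1f} TB")
# raw: 2.0 TB, raw + working copy: 4.0 TB -- approaching the ~3 TB / ~6 TB
# figures above once the scanned-maps series is added.
```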

Some questions:

  • Do you have an estimate of how much time all the operations would take? i.e., for how long should we provide the storage facilities? 1 week, 1 month, 6 months, 1 year...
  • Is it possible to do the processing by type (your first 3 points) instead of all at the same time? That way you could reuse small chunks of storage instead of having to allocate one big block.
  • How do they offer the datasets? Do they have an API you can query for more data? Do they offer a single big file for download with all the data? Depending on this, our approach could be very different. We could build a "pipeline" instead of working with large batches.

Thanks for notifying me of this project! Looking forward to this upload :-)

It will be possible to upload files with full structured data to Commons from (approximately) early 2019, as I heard recently from the development team. If that is too late for your planning, you can upload the files in the 'old style' (with wikitext templates and categories) in a way that makes them easily convertible to structured data as soon as the technology becomes available.

I plan to publish some documentation about that soon - hopefully before Wikimania. See T190493: [Epic] Document how to prepare 'old-style' (non-structured) Wikimedia Commons uploads for conversion to Structured Data on Commons.
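For reference, a minimal sketch of such an 'old-style but convertible' upload with Pywikibot; the template values, category and file names are illustrative, not a fixed convention:

```python
# A sketch of an 'old-style' upload with Pywikibot: a structured
# {{Information}} template plus a category, so the fields can later be
# mapped onto Structured Data statements. All values are illustrative.
import pywikibot
from pywikibot.specialbots import UploadRobot

description = """=={{int:filedesc}}==
{{Information
|description={{es|1=Mapa Topográfico Nacional 1:25.000, hoja de ejemplo}}
|date=2018
|source=Instituto Geográfico Nacional (IGN)
|author=Instituto Geográfico Nacional (IGN)
|permission={{Cc-by-4.0}}
}}

[[Category:Maps of the Instituto Geográfico Nacional (España)]]"""

site = pywikibot.Site("commons", "commons")
bot = UploadRobot(
    "MTN25_example.tif",                     # hypothetical local file
    description=description,
    use_filename="MTN25 example sheet.tif",  # hypothetical target name
    keep_filename=True,
    target_site=site,
)
bot.run()
```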

Vvjjkkii renamed this task from Contribution from the IGN to Structured Data on Commons to ancaaaaaaa. Jul 1 2018, 1:09 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from ancaaaaaaa to Contribution from the IGN to Structured Data on Commons. Jul 1 2018, 5:36 PM
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot added a subscriber: Aklapper.

Thanks for your support, @aborrero and @SandraF_WMF! We hope to upload some of these series in 2018, so we'll have to choose the old but easily convertible style.

  • Is it possible to do the processing by type (your first 3 points) instead of all at the same time? That way you could reuse small chunks of storage instead of having to allocate one big block.

We'll definitely do the processing by type to reuse small chunks of storage, yes.

  • How do they offer the datasets? Do they have an API you can query for more data? Do they offer a single big file for download with all the data? Depending on this, our approach could be very different. We could build a "pipeline" instead of working with large batches.

The data are spread across several downloadable files. I think I'll be able to structure these data somehow.

  • Do you have an estimate of how much time all the operations would take? i.e., for how long should we provide the storage facilities? 1 week, 1 month, 6 months, 1 year...

No, for now we can't guess how much time the operations would take, since we still don't know the best way to transfer the files from a hard drive to Commons. The points are:

  • We have to buy a hard drive so that the IGN can store all the files on it. Do we need an SSD to be able to transfer/upload the files within a reasonable time, at the expense of having a limited capacity? Would an HDD be enough? Will the difference in time actually be noticeable?
  • How do we transfer the files from the SSD/HDD to the Cloud VPS? Can we physically send the drive to the location of the server? Or should we transfer the files via the Internet? Considering the bandwidth and the size of the files, is this last option possible?

@aborrero, maybe you can help us with some of these points?
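For context on the Internet option, a rough transfer-time estimate for the largest series, assuming sustained throughput (which real links rarely deliver):

```python
# Rough transfer-time estimate for the ~1.22 TB orthophoto series at a few
# sustained throughputs; real transfers will be slower due to overhead.
SIZE_TB = 1.22
size_bits = SIZE_TB * 1e12 * 8  # decimal terabytes to bits

for mbps in (20, 100, 1000):
    hours = size_bits / (mbps * 1e6) / 3600
    print(f"{mbps:>4} Mbit/s -> {hours:.1f} h ({hours / 24:.1f} days)")

#   20 Mbit/s -> 135.6 h (5.6 days)
#  100 Mbit/s -> 27.1 h (1.1 days)
# 1000 Mbit/s -> 2.7 h (0.1 days)
```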

  • Is it possible to do the processing by type (your first 3 points) instead of all at the same time? That way you could reuse small chunks of storage instead of having to allocate one big block.

We'll definitely do the processing by type to reuse small chunks of storage, yes.

ACK

  • How do they offer the datasets? Do they have an API you can query for more data? Do they offer a single big file for download with all the data? Depending on this, our approach could be very different. We could build a "pipeline" instead of working with large batches.

The data are spread across several downloadable files. I think I'll be able to structure these data somehow.

Do you have an estimate of how big each downloadable file is? You mentioned "a series of raster topographic maps at a scale of 1:25,000, in TIFF and ECW formats (~785 GB)", but I guess this is composed of several smaller files.
Let's assume each is 10% of that, i.e. ~79GB. We could allocate a ~200GB VM for you to download from there, do the processing, upload to Commons, clean up and continue with the next block.

  • Do you have an estimate of how much time all the operations would take? i.e., for how long should we provide the storage facilities? 1 week, 1 month, 6 months, 1 year...

No, for now we can't guess how much time the operations would take, since we still don't know the best way to transfer the files from a hard drive to Commons. The points are:

  • We have to buy a hard drive so that the IGN can store all the files on it. Do we need an SSD to be able to transfer/upload the files within a reasonable time, at the expense of having a limited capacity? Would an HDD be enough? Will the difference in time actually be noticeable?
  • How do we transfer the files from the SSD/HDD to the Cloud VPS? Can we physically send the drive to the location of the server? Or should we transfer the files via the Internet? Considering the bandwidth and the size of the files, is this last option possible?

@aborrero, maybe you can help us with some of these points?

Networking (especially international) will surely be a bottleneck. Depending on the file size (and protocol), data transfer might be very painful. We could use rsync or the BitTorrent protocol to ensure that, even with long transfer times, we are able to actually move the data from one point to the other.
I don't think moving around a physical drive would help us in this case (and I don't know what policy we have for plugging a random drive into one of our servers).
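If the files end up reachable over SSH, a resumable pull could be scripted like this (host and paths hypothetical); rsync's --partial keeps interrupted files so a dropped connection doesn't restart the whole batch:

```python
# A sketch of a resumable rsync pull, assuming the IGN side (or an
# intermediate host) is reachable over SSH. Host and paths are hypothetical.
import subprocess

def rsync_pull(remote: str, local_dir: str) -> None:
    """Pull files with resume support, so a dropped connection only
    costs the partially transferred file, not the whole batch."""
    subprocess.run(
        [
            "rsync",
            "-av",        # archive mode, verbose
            "--partial",  # keep partial files to resume after a drop
            "--progress",
            remote,
            local_dir,
        ],
        check=True,
    )

# rsync_pull("user@example-ign-mirror:/data/MTN25/", "/srv/ign/raw/")
```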

So, this is my proposal, assuming ~79GB data files:

  • let's allocate a VM with a 200GB extra disk (to be discussed with @bd808 and the rest of the WMCS team)
  • let's give you access to this VM
  • start testing our pipeline
    • downloading a single ~79GB data file from the IGN, using something like rsync, torrent, etc.
    • processing
    • uploading to Commons
    • cleaning the VM to leave room for the next block
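In rough Python, the loop over chunks could look like this; every helper below is a placeholder stub for one of the steps, and all paths and URLs are hypothetical:

```python
# A sketch of the chunked pipeline proposed above; each helper is a stub
# standing in for one real step, and the chunk list is hypothetical.
import shutil
from pathlib import Path

WORK = Path("/srv/ign/work")  # hypothetical mount of the 200GB extra disk

def download(url: str, dest: Path) -> None:
    """Placeholder: fetch one ~79GB chunk (rsync, torrent, HTTP...)."""

def convert(raw: Path, out: Path) -> None:
    """Placeholder: ECW/TIFF -> Commons-accepted formats, plus metadata."""

def upload_to_commons(out: Path) -> None:
    """Placeholder: old-style upload, convertible to SDC later."""

chunk_urls = []  # hypothetical list of IGN download URLs

for url in chunk_urls:
    raw, out = WORK / "raw", WORK / "out"
    raw.mkdir(parents=True, exist_ok=True)
    out.mkdir(parents=True, exist_ok=True)
    download(url, raw)
    convert(raw, out)
    upload_to_commons(out)
    shutil.rmtree(raw)  # clean the VM to leave room for the next chunk
    shutil.rmtree(out)
```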

Adding @Reedy as a subscriber because I know that he has been involved in large media uploads to Commons before and may be able to offer a perspective on the question of using a disk and international post vs. Internet transfers. (For example, how big the dataset needs to be before it is worth the effort of involving someone in data center operations and someone else with production shell rights.)

"It depends!"

On many things... Where is the dataset? How big is it? Does someone need to download it first and then reupload it (and if so, what is their upload speed)? Or can they get the files straight onto the disk at the source?

If we can download it straight to WMCS or a production host, chances are that's going to be faster if there is no way to get the disk to the source host. If the person doing it has a decent upload speed, the cost of bringing the files local first may be negligible (as there will be other work to do too).

Chase has added @fgiunchedi, because we definitely need to check on ingesting hundreds of GB (even TB in this case).

As @aborrero suggests, a 79GB block shouldn't be much of an issue, and it would be useful for speed testing too.

Are all the files currently well named? Or do they need renaming/organisation?

The thing that is going to take a while, if it's not already done, is writing description text files for each file.

These last two questions probably have the biggest impact on which way is best to do it.

I just had a videoconference with @abian and Ruben Ojeda from Wikimedia Spain.

Some conclusions:

  • The IGN offers a lot of data, in many different formats. @abian or someone else should get an idea of how to post-process these files into a format Commons understands.
  • We agreed on trying a 200GB VM for data processing before uploading to Commons, working in small chunks of data. Of these 200GB, 100GB is for the raw download and 100GB for the post-processing output before uploading to Commons. After a chunk is processed, the storage is cleaned to leave space for the next chunk.
  • Apparently the IGN doesn't have an API or other structured web URL for us to download the data using a script. They use some custom POST parameters, and we would need some information on them before we can script those (a sketch of such a script follows below).
  • If we can't automate the download, there is the option to go to the IGN datacenter, plug in a hard disk and fetch all the data without using the network. Once we have this hard disk, we could either send it to a WMF datacenter or @abian can upload its contents from his home to our VM.

So, there are 2 different issues here:

  • How to fetch the data from the IGN (web API, HTTP POST, hard disk, etc.)
  • How to process the data we fetched from IGN

If we discover the IGN has an API (or @abian can script the HTTP POST easily), we could even think about building this pipeline on Toolforge in our Grid Engine (download small chunk -> process -> upload to Commons -> start again).
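If the IGN shares those parameter details, the fetch could be scripted along these lines; the endpoint and parameter names below are entirely hypothetical and would have to come from the IGN:

```python
# A sketch of scripting the IGN download, assuming they document their
# POST parameters. The endpoint and parameter names are entirely
# hypothetical; they must come from the IGN before this can work.
import requests

def fetch_chunk(sheet_id: str, dest: str) -> None:
    resp = requests.post(
        "https://example.ign.es/download",           # hypothetical endpoint
        data={"sheet": sheet_id, "format": "tiff"},  # hypothetical parameters
        stream=True,
        timeout=60,
    )
    resp.raise_for_status()
    with open(dest, "wb") as f:
        for block in resp.iter_content(chunk_size=1 << 20):  # 1 MiB blocks
            f.write(block)

# fetch_chunk("MTN25-0559", "/srv/ign/raw/MTN25-0559.tif")
```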

Sounds like a nice project! With my swift maintainer hat on, testing a single 200-300 GB chunk of data sounds good to me. Let's coordinate though before uploading the full data set because swift is pending its annual expansion (T201937) and I'd like to have that completed to not push swift disk usage too much with substantial uploads.

Sounds like a nice project! With my swift maintainer hat on, testing a single 200-300 GB chunk of data sounds good to me. Let's coordinate though before uploading the full data set because swift is pending its annual expansion (T201937) and I'd like to have that completed to not push swift disk usage too much with substantial uploads.

Perhaps I didn't use the right words. Also, I don't know in depth what the data looks like, but I believe the files are small, like map tiles and other images.

In this case I was using "data chunk" to refer to the downloadable files that the IGN offers, each of which seems to be composed of many of these map tiles or other images, plus metadata.

Sounds like a nice project! With my swift maintainer hat on, testing a single 200-300 GB chunk of data sounds good to me. Let's coordinate though before uploading the full data set because swift is pending its annual expansion (T201937) and I'd like to have that completed to not push swift disk usage too much with substantial uploads.

Perhaps I didn't use the right words. Also, I don't know in depth what the data looks like, but I believe the files are small, like map tiles and other images.

In this case I was using "data chunk" to refer to the downloadable files that the IGN offers, each of which seems to be composed of many of these map tiles or other images, plus metadata.

Indeed, I was referring to those downloadable chunks you mentioned. Also, as a data point, MediaWiki isn't going to allow uploads of single files greater than 4-5GB (!)
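Given that limit, anything larger would have to be split or downsampled before upload. For files under the limit, MediaWiki's chunked upload API makes big transfers more robust; with Pywikibot that's just a chunk_size argument (file names and wikitext illustrative):

```python
# A sketch of a chunked upload with Pywikibot, which uses MediaWiki's
# chunked-upload API under the hood; useful for multi-GB files below the
# single-file limit mentioned above. File names are illustrative.
import pywikibot

site = pywikibot.Site("commons", "commons")
page = pywikibot.FilePage(site, "File:MTN25 example sheet.tif")

site.upload(
    page,
    source_filename="/srv/ign/out/MTN25_example.tif",
    comment="IGN MTN25 sheet (CC BY 4.0)",
    text="{{Information|...}}",    # full wikitext description page
    chunk_size=50 * 1024 * 1024,   # upload in 50 MiB chunks
)
```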

Thanks for your comments! I have just created T203010 with some information about the first series we will address.

I just formally requested a project with a single VM for this. See T203072. Please expect a week-long wait until resources are available.

aborrero claimed this task.

Since we found a way to do this operation, I'm marking the task as resolved now. Feel free to reopen if required.

SandraF_WMF renamed this task from Contribution from the IGN to Structured Data on Commons to Wikimedia Cloud Services support for contribution from the IGN to Structured Data on Commons. Dec 19 2018, 6:25 PM

The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific project tag on this task. Thanks!