
Assemble various sets of interesting Commons files for testing SDC features in OpenRefine
Closed, Resolved (Public)

Description

We are now at the point where various SDC features in OpenRefine can be tested: Reconciliation with data extension, and first edits to (beta) Wikimedia Commons.

In order to test these features thoroughly and to catch both typical and atypical scenarios, it's good to have a set of 'interesting' files to test with:

  • Files with various SDC properties pointing to all supported datatypes
    • Existing files on Wikimedia Commons
    • Partly newly uploaded files on Beta Commons
  • Files with all supported file types
    • Existing files on Wikimedia Commons
    • Partly newly uploaded files on Beta Commons

Event Timeline

Spinster moved this task from Backlog to SDC-support Doing on the OpenRefine board.
Spinster added a project: Reconciliation.

I have collected examples of files pointing to all relevant datatypes in their SDC.

Data types which are not used on Commons, or still very exotic, seem to be the following ones. I found no Commons files pointing to...

So I'm ignoring these (also not doing sandbox experiments with them). As these are basically strings in the backend too, that shouldn't really pose problems, as briefly discussed with @Pintoch.

I will now proceed to collect files that cover all supported file types on Commons. These should give us examples to work with, for instance for the preview function of the Commons reconciliation service (cf. T292526: Add preview service to reconciliation API).

I have added a selection of files that should cover all supported file types on Wikimedia Commons. In addition, I chose files that will be interesting for testing the preview function (per T292526: Add preview service to reconciliation API), and I included a variety of formats for the file path itself so that we can stress-test T290088: The Structured Data on Commons reconciliation service recognizes the most widely used Commons file name formats.
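To illustrate what "a variety of formats of the file path" can mean in practice, here is a minimal Python sketch that reduces a few common ways of writing a Commons file reference to one bare title. It is not the reconciliation service's actual code: the function name `normalize_commons_title` and the set of formats handled (bare name, `File:`/`Image:` prefix, wiki page URL, underscores instead of spaces) are assumptions chosen for illustration, and the formats really covered by T290088 may differ.

```python
from urllib.parse import unquote, urlparse


def normalize_commons_title(raw: str) -> str:
    """Reduce a Commons file reference to a bare 'Name.ext' title.

    Hypothetical helper for illustration only; not part of OpenRefine
    or of the Commons reconciliation service.
    """
    text = raw.strip()

    # Wiki page URL: keep only the last path segment and percent-decode it.
    if text.startswith(("http://", "https://")):
        text = unquote(urlparse(text).path.rsplit("/", 1)[-1])

    # Drop a leading namespace prefix, matched case-insensitively.
    for prefix in ("file:", "image:"):
        if text.lower().startswith(prefix):
            text = text[len(prefix):]
            break

    # MediaWiki treats underscores and spaces in titles as equivalent.
    text = text.replace("_", " ").strip()

    # Assumption: the wiki capitalizes the first letter of page titles.
    return text[:1].upper() + text[1:]


if __name__ == "__main__":
    samples = [
        "Example image.jpg",
        "File:Example_image.jpg",
        "https://commons.wikimedia.org/wiki/File:Example_image.jpg",
    ]
    for sample in samples:
        print(repr(sample), "->", repr(normalize_commons_title(sample)))
```

All three sample inputs reduce to the same title, which is the behaviour a test set with mixed file path formats is meant to exercise. Thumbnail and `upload.wikimedia.org` URLs are deliberately not handled in this sketch.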

The dataset is here. You can simply import it into OpenRefine and test reconciliation and data extension with this set.

Not done yet: make sure that a similar set of interesting testing files is available on Beta Commons.

Spinster updated the task description.

As discussed in our last team meeting (Jan 25): we will upload new files to Beta Commons while testing OpenRefine's upload functionality, so uploading a batch of files there now is not needed.