
Set up OpenRefine on Cloud VPS
Open, Medium, Public


We would like to set up a public instance of OpenRefine. It would be useful for users who may not have the expertise or time to run OpenRefine locally, or for groups who want to use OpenRefine collaboratively.

We must be clear about the restrictions of this instance: OpenRefine has no access control mechanism as far as I’m aware, so the instance would be completely open to vandalism. We must advise users to back up their projects regularly. We should also set up automatic backups (similar to this), but since I don’t expect we’ll be able to provide an automatic restore mechanism, restoring from those backups should be a last resort that requires manual assistance from the instance administrators.

(For a minimal degree of vandalism protection, perhaps we could at least disable the project listing, so that you need to know the link to a project before you can vandalize it.)

Due to the memory requirements of OpenRefine, as well as the desire to set up automatic btrfs snapshots (see “automatic backups” link above), I think this should be done as a Cloud VPS project, not a Toolforge tool. (A formal project request will be filed as a subtask later.)

Who is “we”?

Event Timeline

Note to self: for this we would need to rethink Wikidata authentication in OpenRefine, migrating it to OAuth. This would include adding OAuth support in Wikidata-Toolkit. This has not been done yet because OAuth is not suited for open source software that is run directly by the user on their own machine.

Yeah, that’s going to be tricky… for a first version it might be easiest to completely disable Wikidata authentication, so that users have to use QuickStatements instead :/

To clarify – the problem is not that the server needs to do the edits (which should be possible, AFAIU, although usually the edits are done client-side), but that software running on localhost can’t provide a useful redirect URL to the OAuth registration?

Edit: We also need to restrict each OAuth access token and secret to one browser session, even though the API requests will actually be made by the server. (Right?)

When running software on localhost, the client needs to have OAuth consumer credentials, which are supposed to be private. If I apply for an OAuth consumer for OpenRefine, I cannot put the credentials in OpenRefine's source code, because it would allow anyone to reuse them for any other application. So every user would need to go through the OAuth registration themselves (and then OAuth login).

For hosted versions of OpenRefine the problem disappears, but indeed we need to be more careful with tying OAuth tokens to sessions.
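A minimal sketch of what tying OAuth tokens to sessions could look like on a hosted instance. This is not OpenRefine or Wikidata-Toolkit code; `AccessToken` and `SessionTokenStore` are hypothetical names, and a real deployment would also need token expiry and persistent storage.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessToken:
    """Hypothetical container for one user's OAuth access token and secret."""
    key: str
    secret: str


class SessionTokenStore:
    """Keeps each access token under the browser session that obtained it."""

    def __init__(self):
        self._tokens = {}

    def new_session(self) -> str:
        # Unguessable session ID, meant to be sent to the browser as a cookie.
        return secrets.token_urlsafe(32)

    def store(self, session_id: str, token: AccessToken) -> None:
        self._tokens[session_id] = token

    def token_for(self, session_id: str):
        # Unknown sessions get None, so the server never falls back to
        # another user's token when it makes API requests.
        return self._tokens.get(session_id)

    def logout(self, session_id: str) -> None:
        # Forget the token as soon as the session ends.
        self._tokens.pop(session_id, None)
```

The server would look up `token_for(session_id)` on every edit request and refuse to edit when it returns `None`.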

Perhaps you could use an owner-only consumer for default installations? Those are tied to a single account and don’t need confirmation, so I think it might be possible to request them automatically (but I’m not sure if that’s a good idea).

Okay, I started setting up the server and OpenRefine is running. I haven’t set up any proxy yet, so for now you can only test it via SSH proxy:

ssh -L 3333:localhost:3333 openrefine01.eqiad.wmflabs

@Pintoch can you see if you’re able to access the server? Then we can figure out the next steps.
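A quick way to check that the SSH tunnel above is up before pointing a browser at port 3333 on localhost. This is only a sketch: `port_is_open` is a hypothetical helper, and it only tests that something accepts TCP connections on the port, not that OpenRefine itself is answering.

```python
import socket


def port_is_open(host: str = "localhost", port: int = 3333, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing is listening on the forwarded port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("tunnel reachable" if port_is_open() else "nothing listening on localhost:3333")
```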

1997kB renamed this task from 4wcaaaaaaa to Set up OpenRefine on Cloud VPS. (Jul 1 2018, 2:30 AM)
1997kB lowered the priority of this task from High to Medium.
1997kB updated the task description.
1997kB added a subscriber: Aklapper.

Users need to authenticate with their SUL accounts to log in to PAWS, and this actually generates credentials that can be accessed from inside the apps that run inside it! The following environment variables are set:


You should be able to make edits as the logged-in user with these credentials, without needing an additional login step. There is a pywikibot config that automatically authenticates this way, and we can probably do something similar for OpenRefine as well.

There's a 3G memory limit per user on PAWS, with a guarantee of 1G. I hope that's enough for most OpenRefine projects, although it's possible we need to explicitly tell OpenRefine how much RAM it can use. JVM apps can be a bit picky like that.
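A rough sketch of how a heap size could be derived from the 3G limit and 1G guarantee mentioned above. The 25% headroom for non-heap JVM memory is an assumption, not a measured value, and `heap_size_mb` and `jvm_heap_flag` are hypothetical helpers. `-Xmx` is the standard JVM maximum heap option; as far as I know, OpenRefine's `refine` launcher script also accepts a `-m` option for the same purpose.

```python
def heap_size_mb(memory_limit_mb: int, headroom_fraction: float = 0.25) -> int:
    """Heap size in MB that leaves some of the limit free for non-heap memory.

    The default headroom of 25% is an assumed value, not a recommendation.
    """
    if not 0 <= headroom_fraction < 1:
        raise ValueError("headroom_fraction must be in [0, 1)")
    return int(memory_limit_mb * (1 - headroom_fraction))


def jvm_heap_flag(memory_limit_mb: int) -> str:
    """Build the standard JVM maximum heap flag for the given memory limit."""
    return f"-Xmx{heap_size_mb(memory_limit_mb)}m"


if __name__ == "__main__":
    print(jvm_heap_flag(3072))  # sized against the 3G per-user limit
    print(jvm_heap_flag(1024))  # sized against the 1G guarantee
```

Sizing against the 1G guarantee is the conservative choice; sizing against the 3G limit gives larger projects more room but relies on memory beyond the guarantee being available.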

Not sure what phab etiquette is anymore - should this have a PAWS tag now? Or?