As part of my master’s thesis, I am working on a tool that will let authenticated users automatically infer Shape Expressions from a set of example Wikidata items. For example: given a list of 50 items about humans, it may infer that humans have a human father and mother and optionally siblings, have a place of birth and perhaps of death as well, that a place of birth or death is usually located within an administrative territorial entity and associated with a country, and so on.
I run this automatic inference process on the Grid Engine, and it’s fairly expensive – it requires 8 GB of memory, and can take between 15 and 60 minutes of wall-clock time. Therefore, I want to make sure that I’m not violating Toolforge Rule 6 – Do not provide direct access to Cloud Services resources to unauthenticated users – before lifting the current restriction that only I can submit jobs (i.e., create a new inference process).
I don’t expect this tool to see too much use – Shape Expressions are still a fairly niche topic in the Wikidata community. The tool is currently restricted to running only one job at a time for technical reasons, but even when those technical restrictions are lifted, I don’t intend to allow running more than, say, 5 jobs at a time (though this limit could also be lowered further if you prefer, depending on the resources available on the Grid Engine).
The tool only allows authenticated users (via OAuth) who have a verified email address and are not blocked on Wikidata to submit jobs, and each job is attributed to the user who started it. It also has a “kill switch” which anyone may activate: if any file other than README exists in the directory /data/scratch/wd-shex-infer/, the tool rejects the creation of new jobs. Anyone can therefore disable the tool if necessary with commands like the following:
touch /data/scratch/wd-shex-infer/T123456
cat > /data/scratch/wd-shex-infer/T123456 << 'EOF'
We are experiencing high NFS error rates at the moment, see T123456.
EOF
The directory is world-writable with the sticky bit set, so anyone can activate this block, but only the tool maintainers (who own the directory) or the person who activated a block (who owns the file) can disable it again.
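To make the kill switch concrete, here is a minimal sketch of how such a check could look in shell. It is not the tool's actual code: the function name kill_switch_active is hypothetical, the demo uses a temporary directory rather than the real one, and it assumes GNU find (for -mindepth, -maxdepth and -quit).

```shell
# Hypothetical helper, not taken from the tool's source.
# Succeeds (exit status 0) if the directory contains any entry other than
# README, i.e. if the kill switch is active.
kill_switch_active() {
    # -mindepth 1 skips the directory itself, -maxdepth 1 avoids recursion,
    # ! -name README ignores the README, and -quit stops at the first hit.
    # Assumes GNU find.
    [ -n "$(find "$1" -mindepth 1 -maxdepth 1 ! -name README -print -quit)" ]
}

# Demo in a throwaway directory that mimics the setup described above.
demo_dir=$(mktemp -d)
chmod 1777 "$demo_dir"   # world-writable (777) plus sticky bit (1)
touch "$demo_dir/README"

if kill_switch_active "$demo_dir"; then echo "blocked"; else echo "accepting jobs"; fi
# prints "accepting jobs"

touch "$demo_dir/T123456"
if kill_switch_active "$demo_dir"; then echo "blocked"; else echo "accepting jobs"; fi
# prints "blocked"

rm -r "$demo_dir"
```

The sticky bit is what restricts removal: in a directory with mode 1777, anyone can create a file, but only the file's owner, the directory's owner, or root can delete it.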
The base URL for the tool is https://tools.wmflabs.org/wd-shex-infer/, though I haven’t written a proper index page yet. The page to submit a new job is /job/new (currently accessible only to me, but let me know if you want access before I remove that check entirely), and some examples of existing jobs are /job/1 and /job/5. The source code is available on Phabricator.
Is it okay if I release this tool and make it available to all users, and start advertising it at Wikidata:WikiProject ShEx? Do you have any questions for me? Would you like to see additional access control mechanisms or resource restrictions?