
Review Wikidata Shape Expressions Inference tool regarding Toolforge Rule 6
Closed, Resolved (Public)

Description

As part of my master’s thesis, I am working on a tool that will let authenticated users automatically infer Shape Expressions from a set of exemplary Wikidata items. For example: given a list of 50 items about humans, it may infer that humans have a human father and mother and optionally siblings, have a place of birth and perhaps of death as well, that a place of birth or death is usually located within an administrative territorial entity and associated with a country, and so on.

I run this automatic inference process on the Grid Engine, and it’s fairly expensive: it requires 8 GB of memory and can take between 15 and 60 minutes of wall-clock time. Therefore, I want to make sure that I’m not violating Toolforge Rule 6 (“Do not provide direct access to Cloud Services resources to unauthenticated users”) before lifting the current restriction that only I can submit jobs (i.e., create a new inference process).

I don’t expect this tool to see too much use, since Shape Expressions are still a fairly niche topic in the Wikidata community. The tool is currently restricted to running only one job at a time for technical reasons, but even when those technical restrictions are lifted, I don’t intend to allow running more than, say, 5 jobs at a time (though this limit could also be lowered further if you prefer, according to the resources in the Grid Engine).

The tool only accepts jobs from users who are authenticated (via OAuth), have a verified email address, and are not blocked on Wikidata; each job is attributed to the user who started it. It also has a “kill switch” which anyone may activate: if any file other than README exists in the directory /data/scratch/wd-shex-infer/, the tool rejects the creation of new jobs. Anyone can therefore disable the tool if necessary with commands like the following:

touch /data/scratch/wd-shex-infer/T123456
cat > /data/scratch/wd-shex-infer/T123456 << 'EOF'
We are experiencing high NFS error rates at the moment, see T123456.
EOF

The directory is world-writable with the sticky bit set, so that anyone can activate this block, but only the tool maintainers (who own the directory) or the person who activated a block (who owns the file) can disable it again.
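The directory setup and the check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `setup_block_dir` and `kill_switch_active` are hypothetical, and the actual tool may implement the check differently (for example, inside its web application code rather than in shell).

```shell
#!/bin/sh
# Hypothetical helper: create a block directory that is world-writable
# with the sticky bit set (mode 1777), as described in the task.
setup_block_dir() {
    mkdir -p "$1" && chmod 1777 "$1"
}

# Hypothetical helper: succeeds (exit status 0) if the directory contains
# any entry other than README, i.e. if someone has activated a block.
kill_switch_active() {
    dir=$1
    for entry in "$dir"/* "$dir"/.[!.]*; do
        # Unmatched glob patterns stay literal, so skip names that do not exist.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        if [ "${entry##*/}" = "README" ]; then
            continue
        fi
        return 0
    done
    return 1
}
```

With these helpers, a job submission script might run `kill_switch_active /data/scratch/wd-shex-infer` and refuse to create a new job whenever it succeeds.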

The base URL for the tool is https://tools.wmflabs.org/wd-shex-infer/, though I haven’t written a proper index page yet. The page to submit a new job is /job/new (currently accessible only to me, but let me know if you want access before I remove that check entirely), and some examples of existing jobs are /job/1 and /job/5. The source code is available on Phabricator.


Is it okay if I release this tool and make it available to all users, and start advertising it at Wikidata:WikiProject ShEx? Do you have any questions for me? Would you like to see additional access control mechanisms or resource restrictions?

Event Timeline

This does sound like a resource-intensive tool, but I don't think it violates the "direct access" rule. That rule is intended more to prevent people from making tools similar to Quarry and PAWS, which allow arbitrary users to access the wiki replicas and other computing resources with very few restrictions. I do have a small amount of concern that running too many parallel jobs which take 8G of RAM and have long runtimes could cause contention on the job grid. It would be nice if you started out with only allowing one or two concurrent jobs. This sounds like the kind of project where the important part is that the data is computed rather than that the data is computed very quickly. If the jobs can be spread out more evenly across time this will give other grid engine users a fair chance at also running their jobs.

Thanks for your response!

> It would be nice if you started out with only allowing one or two concurrent jobs.

Okay, I’ll keep that in mind.

> This sounds like the kind of project where the important part is that the data is computed rather than that the data is computed very quickly. If the jobs can be spread out more evenly across time this will give other grid engine users a fair chance at also running their jobs.

Yes, that’s a good characterization. I’ve just noticed that qsub has a -p priority option; I’ll use that to set the jobs to a lower priority. I could also implement an in-tool job queue, I guess, but I’ll see how many users the tool gets (i.e., how often the hard limit of “you can’t submit another job now” is even reached) before I start working on that. For now, as I said, it’s limited to one job at a time anyway for other reasons.
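For illustration, a lower-priority submission with `qsub -p` might be assembled as in the sketch below. The helper name `low_priority_qsub_cmd`, the default of -100, and the script name in the usage note are assumptions, not details from this task. In Grid Engine, `-p` accepts values from -1023 to 1024, and ordinary users can only lower the priority of their jobs, so the helper accepts only -1023 through 0. It prints the command instead of running it, so the sketch also works on a machine without a grid.

```shell
#!/bin/sh
# Hypothetical helper: print the qsub command line for a low-priority job.
# $1 = job script, $2 = priority (assumed default: -100).
low_priority_qsub_cmd() {
    script=$1
    priority=${2:--100}
    # Ordinary users may only lower the priority, so accept -1023..0 only.
    if [ "$priority" -gt 0 ] || [ "$priority" -lt -1023 ]; then
        echo "priority must be between -1023 and 0" >&2
        return 1
    fi
    printf 'qsub -p %s %s\n' "$priority" "$script"
}
```

Calling `low_priority_qsub_cmd run-inference.sh -500` prints `qsub -p -500 run-inference.sh`; the output could be reviewed first and then executed on a grid submission host.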

FYI, since it seems there are no open issues with the tool, I’m going to announce it to the community today. Let me know if there are any problems or other questions.

LucasWerkmeister claimed this task.

I recently fixed the technical limitation that prevented concurrent jobs, so now the tool is limited artificially to no more than two concurrent jobs, as requested. (It’s also not being used very much right now anyways.)
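The task does not say how the two-job limit is enforced. One possible approach, shown here purely as a sketch with hypothetical names (`acquire_job_slot`, `release_job_slot`, and the `slot-N` directory layout), is to claim a numbered slot directory before starting a job. Because `mkdir` either creates the directory or fails, two simultaneous requests cannot claim the same slot.

```shell
#!/bin/sh
# Hypothetical helper: try to claim one of MAX job slots under SLOT_DIR.
# $1 = slot directory, $2 = maximum concurrent jobs (assumed default: 2).
# Prints the claimed slot path on success; fails if all slots are taken.
acquire_job_slot() {
    slot_dir=$1
    max=${2:-2}
    i=1
    while [ "$i" -le "$max" ]; do
        # mkdir fails if the slot already exists, so each slot has one owner.
        if mkdir "$slot_dir/slot-$i" 2>/dev/null; then
            printf '%s\n' "$slot_dir/slot-$i"
            return 0
        fi
        i=$((i + 1))
    done
    return 1
}

# Hypothetical helper: free a slot once its job has finished.
release_job_slot() {
    rmdir "$1"
}
```

A submission handler would call `acquire_job_slot` first, reject the request with a “you can’t submit another job now” message if it fails, and call `release_job_slot` when the job ends.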

I think we can close this task now. Thanks again for your feedback!