
Toolforge bot approval request for SQL report generating bot
Closed, ResolvedPublic

Description

Some time ago, I wrote a bot that allows users to provide an SQL query and have the output neatly formatted on a wiki page. It was discontinued after a few days of activity when it was pointed out that it was contrary to Toolforge rule #6, "Do not provide direct access to Cloud Services resources to unauthenticated users".

I am requesting an exception from this rule, as suggested in this discussion (cc @Legoktm @zhuyifei1999). Thanks for your consideration.

While this tool indeed allows anyone to trigger arbitrary queries on the enwiki replica db, the following measures are in place:

  • It uses a maximum of 5 db connections at any time. Due to its asynchronous nature, more connections aren't necessary: if there are more than 5 concurrent users, the bot can still output all reports, albeit taking more time.
  • A timeout of 10 minutes is applied to every query, using MariaDB's max_statement_time directive.

No db connections are kept idle for more than 5 seconds (per policy). I believe these measures rule out any harm to the Toolforge infrastructure.

Full function details:

  • User places a template on a userspace/project page containing the SQL query and output formatting options. The bot detects new transclusions and writes the output. Periodic updates (with min interval of 1 day) can also be configured.
  • An "Update the table now" link is provided in the template which triggers a manual update (intended for testing purpose, say after making changes in the formatting options).
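To make the workflow concrete, a page using the bot might look roughly like this (the template and parameter names below are illustrative only; the actual interface is documented in the enwiki approval request linked further down):

```wikitext
{{Database report
| sql =
  SELECT page_title
  FROM page
  WHERE page_namespace = 0
    AND page_is_redirect = 1
  LIMIT 100
| interval = 7  <!-- refresh at most once every 7 days; minimum interval is 1 day -->
}}
```

The bot detects the new transclusion, runs the query against the replica, and writes the formatted result back to the page.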

What are the advantages of having this in addition to Quarry / existing database reports?

  • Quarry doesn't provide clickable links to pages. Using something like CONCAT('[[', page_title, ']]'), then exporting the Quarry result as a wikitable and pasting it on-wiki, is tedious.
  • Even that only works if all pages are in the same namespace. Quarry doesn't provide any way to map namespace numbers to namespace names, so users resort to ugly hacks like https://quarry.wmflabs.org/query/55915 and https://www.mediawiki.org/wiki/Topic:Ulxr3i1tc8uzgtgh to get readable results.
  • For reports requiring periodic updates, currently different botops set up custom bot code for each. Having different code and job run setups for each report makes maintenance tedious and less than ideal.
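The namespace problem from the second point is easy to solve in bot code rather than in SQL: map the numeric page_namespace from the replica to its prefix in post-processing. A hypothetical sketch (only a few common namespaces shown; a real bot would load the full mapping from the site's API, e.g. meta=siteinfo&siprop=namespaces):

```python
# Map common MediaWiki namespace numbers to their English Wikipedia prefixes.
NAMESPACE_PREFIXES = {
    0: "",          # article (main) namespace has no prefix
    1: "Talk:",
    2: "User:",
    4: "Wikipedia:",
    10: "Template:",
    14: "Category:",
}

def page_link(ns: int, title: str) -> str:
    """Turn a (page_namespace, page_title) replica row into a wikilink."""
    # Replica titles use underscores; display them with spaces instead.
    prefix = NAMESPACE_PREFIXES.get(ns, f"NS{ns}:")
    return f"[[{prefix}{title.replace('_', ' ')}]]"
```

For example, the row (1, "Main_Page") would render as [[Talk:Main Page]], with no CONCAT tricks needed in the query itself.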

Source code: https://github.com/siddharthvp/SDZeroBot/tree/master/db-tabulator
Enwiki approval request: https://en.wikipedia.org/wiki/Wikipedia:Bots/Requests_for_approval/SDZeroBot_10

Event Timeline

Philosophically if this works out, this would be a nice next step to expand access to wiki replicas.

From a Toolforge admin perspective I think it would be fine if:

  • Each query has a comment associating it with the page that triggered it and maybe the last editor?
  • A log of queries run, when and why (could just be a file).
  • Maybe some defensive measure that people can't time a malicious edit right before the bot's scheduled daily run? E.g. only run the query if it's sat for 30 minutes.

sidenote: we should really improve Quarry so people can generate real links...

I'd be fine at least testing this as long as, in addition to what's in the task description, there is a simple way to stop the bot in case of emergency and that way is clearly documented on the user page.

bd808 renamed this task from Toolforge bot approval request to Toolforge bot approval request for SQL report generating bot.Oct 19 2022, 3:12 PM

Approved, with minor conditions

This request was discussed in the 2022-10-19 Cloud Services team meeting, with @Legoktm attending to present the case for approval and help answer questions.

The consensus of the group was that this bot will provide a benefit for the movement that is not currently fulfilled by Quarry or similar systems, and thus should be allowed.

We discussed specifically the implications and rationale for the "Do not provide direct access to Cloud Services resources to unauthenticated users" rule and how it relates to this bot. The existing protections of limiting the number of parallel queries and enforcing a session-based max_statement_time timeout should protect the wiki replicas from overload from this single source. The workflow of the bot is wiki-based rather than web-based, which mitigates some concerns about being open to abuse from literally anyone on the internet. It would be a nice additional protection if @SD0001 could add some sort of "edit ageing" requirement similar to the 30 minute time proposed by @Legoktm in T320657#8313237. This is not strictly necessary for operation, however.

The documentation requested by @taavi in T320657#8317367 should be created before setting the bot to work running these queries. It would also be helpful for there to be a "Tool" namespace page on wikitech with documentation or pointers to documentation elsewhere that explain how to stop and start the bot so that Toolforge admins can confidently disable/enable the bot as needed to debug infrastructure issues.

Thanks for the approval @bd808, and @Legoktm for presenting it. Bot is functional now. I have not implemented the "edit ageing" requirement, but emergency-shutoff can now be activated from User:SDZeroBot/Shutoff, which is linked from the bot user page. Also created wikitech:Tool:SDZeroBot.

The log of all queries run, suggested in T320657#8313237, was also implemented; it can be seen at /data/project/sdzerobot/SDZeroBot/db-tabulator/queries.log.

> Philosophically if this works out, this would be a nice next step to expand access to wiki replicas.

Related? T318191: Evaluate opening the readonly Wiki Replicas to the WAN (since we already have user authentication)

Don't think there's anything left to do here, closing.