
Investigation: MediaWiki extension for database reports
Closed, Declined · Public

Description

Most of Reports-bot works by querying the database, generating a list of page titles (with the occasional annotation), and then posting them on a wiki page in a standard manner. See examples:

In principle, many of these functions could be handled by an extension that generated database reports on demand that could be transcluded on pages.

As I understand it, MediaWiki used to let you run SQL queries from within the wiki itself; this was removed for performance reasons. If we were to revisit this, we would need strategies to prevent overload, including caching and query timeouts.
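To make those strategies concrete, here is a minimal sketch in Python (the language Reports-bot is written in today) of both ideas at once: a hard per-query timeout via MariaDB's max_statement_time (available in MariaDB 10.1+) plus a simple TTL cache. All names and numbers are illustrative assumptions, not a proposed implementation.

```python
import time

_cache = {}           # (sql, params) -> (expiry timestamp, rows)
CACHE_TTL = 3600      # serve cached results for up to an hour
QUERY_TIMEOUT = 10    # seconds, enforced server-side by MariaDB

def run_report_query(conn, sql, params=()):
    """Run a report query with a hard timeout; conn is e.g. a pymysql
    connection to a replica. Fresh cached results are returned as-is."""
    key = (sql, params)
    now = time.time()
    cached = _cache.get(key)
    if cached and cached[0] > now:
        return cached[1]
    with conn.cursor() as cur:
        # SET STATEMENT applies max_statement_time to this one statement,
        # so a runaway report query gets killed instead of piling up.
        cur.execute("SET STATEMENT max_statement_time=%d FOR %s"
                    % (QUERY_TIMEOUT, sql), params)
        rows = cur.fetchall()
    _cache[key] = (now + CACHE_TTL, rows)
    return rows
```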

From a user experience perspective, it would not necessarily work like a raw SQL query interface such as Quarry. Rather, the extension would produce reports with clickable links and other affordances that make them more useful to editors than the plaintext output of typical SQL queries. If we are feeling especially ambitious, we could build in a query builder so that people without knowledge of SQL or the MediaWiki database schema could put together some basic reports.
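For illustration only, such a builder might accept a small declarative spec and refuse anything outside a whitelist, so report authors never write SQL and cannot reach private columns. A rough sketch (the page-table column names are real MediaWiki schema; everything else is a made-up example):

```python
# Hypothetical query-builder sketch: reports are small declarative specs,
# and only whitelisted tables, columns, and operators ever become SQL.
ALLOWED = {
    "page": {"page_title", "page_namespace", "page_is_redirect", "page_len"},
}
OPERATORS = {"eq": "=", "lt": "<", "gt": ">"}

def build_query(spec):
    table, columns = spec["table"], spec["columns"]
    if table not in ALLOWED or not set(columns) <= ALLOWED[table]:
        raise ValueError("table or column not whitelisted")
    where, params = [], []
    for column, op, value in spec.get("filters", []):
        if column not in ALLOWED[table] or op not in OPERATORS:
            raise ValueError("filter not whitelisted")
        where.append("%s %s %%s" % (column, OPERATORS[op]))
        params.append(value)
    sql = "SELECT %s FROM %s" % (", ".join(columns), table)
    if where:
        sql += " WHERE " + " AND ".join(where)
    sql += " LIMIT %d" % spec.get("limit", 500)  # cap every report
    return sql, tuple(params)

# e.g. "long non-redirect pages in the main namespace":
sql, params = build_query({
    "table": "page",
    "columns": ["page_title", "page_len"],
    "filters": [("page_namespace", "eq", 0),
                ("page_is_redirect", "eq", 0),
                ("page_len", "gt", 100000)],
})
```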

Event Timeline

There's been a lot of brainstorming in this area. Some people think "well let's just make every database report a Special page!" but there are open questions about how important certain functionalities are. For example, how important is having a page history so that a reader can see older versions of a report? How important is having watchlist notifications when a report is updated?

The last time I gave re-implementing database reports some thought, I concluded that the next step was probably doing a request for comments on mediawiki.org that basically surveyed existing report readers/users to figure out which features were and were not important to them. Once there are clearer hard and soft requirements (e.g., internationalization/localization support is a hard requirement in my mind), it should be easier to figure out whether to continue in Python or switch to PHP or some other language.

You presumably work at the moment by querying the Labs replicas... From a security perspective, I don't think we'd want MediaWiki user accounts to have permission to run arbitrary queries against the uncensored databases and then have the results immediately published.

Do the Labs replicas do any censorship of the tables beyond not replicating certain tables and eliding certain columns? Could we create a restricted view within the MediaWiki software where it would be impossible to get access to private information? How was this problem managed back when there was a query interface in MediaWiki, or was it a problem then too?

Some people think "well let's just make every database report a Special page!"

What I have in mind specifically is one special page with various subpages for each report, rather than a Special:RecentDeaths, Special:IndefinitelyBlockedIPs, etc. (So that we're on the same page.)

For example, how important is having a page history so that a reader can see older versions of a report?

This part is actually really important to me—I want to do a historical analysis to see to what extent a page appearing as part of a report resulted in edits to that page. That can only be done with a revision log. Would it be worth re-implementing?

How important is having watchlist notifications when a report is updated?

With an extension we could generate notifications for updated reports through Echo.

The last time I gave re-implementing database reports some thought, I concluded that the next step was probably doing a request for comments on mediawiki.org that basically surveyed existing report readers/users to figure out which features were and were not important to them.

Is mediawiki.org the best place for this survey? I would want to survey users of the existing Database Reports page as well as other parts of Wikipedia that rely on automated reports such as WikiProjects.

I'd be strongly in support of a community-editable list of queries that could be built and scheduled, with their output going to a versioned report, perhaps along the lines of the AbuseFilter-style interface. These should be restricted to administrators and a new group that could edit the reports.
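For what it's worth, scheduling the queries and writing each result to a regular wiki page gets you versioning for free via the page history, which is essentially what the current bot does. A rough sketch using the mwclient library (the account, page title, and query are hypothetical):

```python
import mwclient
import pymysql

def publish_report(target_title, sql):
    # Query the Tool Labs replicas (~/replica.my.cnf holds the credentials).
    conn = pymysql.connect(host="enwiki.labsdb", db="enwiki_p",
                           read_default_file="~/replica.my.cnf")
    with conn.cursor() as cur:
        cur.execute(sql)
        rows = cur.fetchall()
    # Save the result as a new revision of the report page; the ordinary
    # page history then serves as the report's version log.
    text = "\n".join("# [[%s]]" % r[0].decode("utf-8") for r in rows)
    site = mwclient.Site("en.wikipedia.org")
    site.login("ReportsBot", "password")  # hypothetical bot account
    site.pages[target_title].save(text, summary="Updating database report")

publish_report("Wikipedia:Database reports/Example report",
               "SELECT page_title FROM page"
               " WHERE page_namespace = 0 LIMIT 10")
```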

These should be restricted to administrators and a new group that could edit the reports.

Any particular reason to limit to administrators?

What I have in mind specifically is one special page with various subpages for each report, rather than a Special:RecentDeaths, Special:IndefinitelyBlockedIPs, etc. (So that we're on the same page.)

That part doesn't really matter. You could easily add an additional pseudo-namespace (e.g., Special:Reports/) or several (e.g., Special:Reports/Daily/).

This part is actually really important to me—I want to do a historical analysis to see to what extent a page appearing as part of a report resulted in edits to that page. That can only be done with a revision log. Would it be worth re-implementing?

Special pages in MediaWiki don't have a persistent page history. That's kind of what makes them special. If you wanted versioning, you'd put the reports in a real, regular namespace, as far as I understand it.

With an extension we could generate notifications for updated reports through Echo.

Sure.

Is mediawiki.org the best place for this survey? I would want to survey users of the existing Database Reports page as well as other parts of Wikipedia that rely on automated reports such as WikiProjects.

Yes. You could also do it on Meta-Wiki or even here in Phabricator Maniphest, but mediawiki.org seems like the most natural fit to me. You'd want to come up with a list of current features (watchlist notifications, page history, enumerated results) and a list of future features (internationalization/localization, on-wiki configuration, Echo notifications) and then ask users to tell you which features are important, which are nice-to-have, which are irrelevant, etc. Plus users will invariably have feature requests (on-demand report updates, result post-processing, etc.) of their own to propose.

Related: T109708: Broken Database reports on enwiki (tracking).

Do the Labs replicas do any censorship of the tables beyond not replicating certain tables and eliding certain columns?

Not really. There's now a two-layer sanitization and exposure system so that we don't accidentally leak private data again (cf. T56847).

Could we create a restricted view within the MediaWiki software where it would be impossible to get access to private information? How was this problem managed back when there was a query interface in MediaWiki, or was it a problem then too?

I think you're referring to Special:AskSql; performance and security were problems then. (There's a Tim quote about being grateful for being granted shell access so that he could kill his own runaway queries, but I'm going to get lunch rather than dig it up.)

More to the point, there's a fundamental architecture question to answer here: do we want (and can we have) reports generated against the production SQL databases, or is it necessary to use replicas? Some of these reports are fairly expensive to generate. For example, the way you get page counts by namespace is by doing a full scan of the page table and tallying the value of every row's page_namespace column (and page_is_redirect column).
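For illustration, that report reduces to roughly the following aggregate, shown here running against the Labs replicas rather than production (connection details as in the earlier sketch):

```python
import pymysql

# The page table on enwiki has tens of millions of rows, so even this
# trivial-looking aggregate has to visit all of them.
conn = pymysql.connect(host="enwiki.labsdb", db="enwiki_p",
                       read_default_file="~/replica.my.cnf")
with conn.cursor() as cur:
    cur.execute("SELECT page_namespace, page_is_redirect, COUNT(*)"
                " FROM page GROUP BY page_namespace, page_is_redirect")
    for namespace, is_redirect, count in cur.fetchall():
        print(namespace, is_redirect, count)
```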

@scfc did a lot of work on the Python database reports. He may have thoughts and insights here about building a better system.

Do the Labs replicas do any censorship of the tables beyond not replicating certain tables and eliding certain columns?

Yes. Revision deletion is applied to the data, for example.

There's a Tim quote about being grateful for being granted shell access so that he could kill his own runaway queries, but I'm going to get lunch rather than dig it up.

Sep 17 04:45:02 <TimStarling> you know, one of the things I most liked about getting shell access in 2003 was that I was able to log in and kill my runaway queries when I accidentally took down wikipedia with Speical:Asksql

@Harej: I'd start with admins and a user group that can design/run these, in case there are concerns about jobs causing delays or performance issues (TBD).

Do the Labs replicas do any censorship of the tables beyond not replicating certain tables and eliding certain columns?

Yes, complex filters are maintained.

Could we create a restricted view within the MediaWiki software where it would be impossible to get access to private information?

I do not think that is possible. It is already very difficult on labs.

Please do not add derived data (beyond what is essential, such as the *link tables and other small caches) to the same databases as MediaWiki: you would be duplicating your data 40 times, which is a huge waste of resources.

I do not have anything against having first-party reports, but please consider having an external server to host them. I do not see why Labs cannot be used for that, in the same way that pageviews and other links from MediaWiki to Labs already work. Why not ask for these reports to be created on Labs automatically instead, as in T59617#2384976?

Could we create a restricted view within the MediaWiki software where it would be impossible to get access to private information?

I do not think that is possible. It is already very difficult on labs.

I agree with Jcrespo. This seems like a super scary idea.

The only way I could see arbitrary SQL working is if MediaWiki made a DB connection to the Labs DB replica (or equivalent) and basically worked the exact same way as Quarry. In which case, what's the point? There might be some UI niceties to having it "in" MediaWiki, but there are also real benefits to keeping Quarry as isolated as possible.

I think first-party/pre-created reports are interesting. We just need to find a way to set them up, keep them updated, and propose them, so that when a report is found useful by more than one person it is built once instead of n times. What we cannot do is send queries from Labs to production or vice versa.

Another option is to send labs data to a specialized analytics store, where creating reports on the fly would be much easier and faster. I am very open to proposals, and I will help set up any of these.

Another option is to send labs data to a specialized analytics store, where creating reports on the fly would be much easier and faster.

Is that all it would take to overcome the current architectural issue? Using the production DBs would not work, nor can we have an extension depend on Labs, but from what you're saying, a specialized analytics service built on the Labs data might work. Obviously it would be a significant undertaking, but it's at least a workable solution.

We currently have the MariaDB database replicas on Wikimedia Tool Labs and we have MediaWiki's api.php. Why are these two interfaces insufficient?

Is that all it would take to overcome the current architectural issue?

What is the current architectural issue? I'm not following.

Using the production DBs would not work, nor can we have an extension depend on Labs, [...]

Why is using the production databases not possible? We already have "heavy" Special reports that get cached and updated infrequently (e.g., https://en.wikipedia.org/wiki/Special:LonelyPages). On the Labs side, database reports don't necessarily need to be implemented as a MediaWiki extension. It depends on which features/requirements are particularly important.

Are there any examples of use cases which this would cater to?

Between Quarry, Labs shell access, and the API, I don't think there are many gaps that need filling, at least for users with SQL knowledge. The idea of building a nice interface for users who don't have SQL knowledge is interesting, but taking into account the need to shield the user from both performance issues and the complexity of the underlying schema, I think it would actually be a heck of a lot of work to build (and maybe not ultimately successful).