Page MenuHomePhabricator

SMW Parser is missing grammar validation
Closed, ResolvedPublic

Description

Author: avinoamr

Description:
The SMW Parser seems to ignore grammar/syntax errors altogether. Instead of issuing an error message, it's generating an impossibly-heavy DB query:

SELECT /* SMW::getQueryResult Avinoamr */ DISTINCT t0.smw_id AS id,t0.smw_title AS t,t0.smw_namespace AS ns,t0.smw_iw AS iw,t0.smw_sortkey FROM smw_ids AS t0 INNER JOIN smw_ids AS t2 ON t0.smw_id=t2.smw_id WHERE t2.smw_iw!=':smw' AND t2.smw_iw!=':smw-redi' AND t2.smw_iw!=':smw-border' AND t2.smw_iw!=':smw-intprop' ORDER BY t0.smw_sortkey ASC LIMIT 21

This query is obviously malformed, it doesn't try to match anything useful, just filters out certain rows. On our current MediaWiki installation this query takes more than 1 hour to complete.

I've noticed it's being generated by (1) using a bad syntax (try [Field::->Value]]), or (2) by using the {#ask} parser function with an empty query string (try {#ask:}).

Obviously, the user should be aware of what he's doing - but a couple of these queries can really slow down the DBs for no reason.

I think it's a critical performance/stability bug. If I'm not mistaken, a single user can pretty much crash several of the DB servers simply by running these queries repeatedly.

I'm not certain what kind of parsing algorithm is being used (LL(1) ?), but I'm sure grammar validation is part of it. If you wish, I volunteer to help with developing this support - but I'd like to have a comprehensive understanding and guidance before I begin.

By the way - another approach would be to check the Query object before executing it. And if we realize that this meaningless SQL error will be executed, we can just throw a non-informative Parsing exception. That's a more quick & dirty solution, in my opinion.


Version: unspecified
Severity: critical

Details

Reference
bz21868

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 10:54 PM
bzimport set Reference to bz21868.

The query parser is pretty much ad hoc, but it works as expected in the cases you mention. The parser generally tries to ignore errors and interpret whatever remains of the query. In your case, there are no valid conditions in the malformed query, so that the query ends up selecting all pages. The same happens for the empty query. Note that you can use "format=debug" if you want to have more information about how the query was understood without actually executing it.

One could (and probably should) disallow queries for *all* pages, but the main issue here is that such a simple query actually freezes your database. Normally, the query without conditions should be easiest to answer (given that it imposes a small limit as in your case). I think the problem here is that the SQL generation adds another instance of the smw_id table (SMW's "page table") to the query, thus enforcing a redundant inner join of potentially large tables. I have now improved the query generation process to avoid this, so the performance of the empty query should be much improved.

I have also added a new parameter "$smwgIgnoreQueryErrors" that can be used to specify if queries should be executed at all if errors occured during parsing or preprocessing. The default for this setting is "true." Together with the above, this should fix this specific bug.

There is still much room for optimising the SQL query generation. Ultimately, it might be worth to investigate the use of more specialised storage backends than MySQL (relational DBs are not particularly great for SMW's data model), or to adopt a more sophisticated table layout in SQL (especially large inner joins are an issue with the "single table" approach of SMW). I am also somewhat disappointed about the bad performance that you indicate, given that the inner join is done over the primary key of the tables: retrieving these 21 results should really be no problem in spite of the large table as such!