Page MenuHomePhabricator

[Task] Ex:WikibaseQualityExternalValidation - performance review of Special:CrossCheck
Closed, DeclinedPublic


@aaron, can you take a look at this special page?

It eventually calls CrossChecker::crossCheckStatements, which if you follow the execution path has 6 foreach'es nested inside each other.

Besides throttling (T103905) is there anything they need to do to make sure this isn't a DoS vector?

Event Timeline

csteipp assigned this task to aaron.
csteipp raised the priority of this task from to Needs Triage.
csteipp updated the task description. (Show Details)

It would be interesting to dumps of all the queries for a request done for a typical case and a worse-type case.

In both cases 3 database queries are executed:
Query 1:

SELECT dump_id FROM wbqev_identifier_properties
WHERE identifier_pid IN ( ... );

Query 2:

SELECT * FROM wbqev_dump_information
LEFT JOIN wbqev_identifier_properties
ON wbqev_dump_information.dumpId = wbqev_identifier_properties.dump_id
WHERE wbqev_dump_information.dumpId IN ( ... );

Query 3:

SELECT ( dump_id, external_id, pid, external_value ) 
FROM wbqev_external_data
WHERE dump_id IN ( ... )
AND external_id IN ( ... )
AND pid IN ( ... );

Number of rows depends on how many external datasets are imported into the database. For now, we only use the GND as a counterpart database, which results in the following numbers:

Typical Case
Query 1 returns 1 row.
Query 2 returns 3 rows.
Query 3 returns 3 rows.

Worst Case
Query 1 returns 1 row.
Query 2 returns 3 rows.
Query 3 returns 22 rows.

Does this answers your question, or do you need more information?

Lydia_Pintscher renamed this task from Ex:WikibaseQualityExternalValidation - performance review of Special:CrossCheck to [Task] Ex:WikibaseQualityExternalValidation - performance review of Special:CrossCheck.Aug 17 2015, 4:05 PM
Lydia_Pintscher set Security to None.

@aaron Any update on this? This looks like it is the last remaining blocker for deploying the extension.

The JOIN for query #2 does not seem to have an index. wbqev_identifier_properties has:

PRIMARY KEY (identifier_pid, dump_id)

...but nothing like (dump_id). Is the wbqev_identifier_properties table going to pruned of older dumps or will it just keep growing?

Query #3 would require a few clever index dives to run well. I'm not sure how smart MariaDB is here. Has anyone tried "EXPLAIN format=json" with a good dataset (doesn't have to be full size, but a few 100k items). My fear would be having just the (dump_id) index part used and scanning happening for the other two IN()s. On the revision table, two IN()s work fine:

EXPLAIN EXTENDED SELECT * FROM revision FORCE INDEX(rev_page_id) WHERE rev_page IN (5043734,3535,6234) AND rev_id IN (343535,23255,33626);
stdClass Object
    [id] => 1
    [select_type] => SIMPLE
    [table] => revision
    [type] => range
    [possible_keys] => rev_page_id
    [key] => rev_page_id
    [key_len] => 8
    [ref] => 
    [rows] => 9
    [filtered] => 100.00
    [Extra] => Using index condition

For wbqev_external_data, the only index with pid has it as the third part of the index.

CREATE INDEX /*i*/dump_id_external_id_pid ON /*_*/wbqev_external_data (dump_id, external_id, pid); mariadb will need to handle 3 levels of IN nesting to avoid scanning extra rows. It may work just fine, though it would be nice to know for sure.

I'm having trouble finding documentation about wbqev_external_data. The PK is just an opaque rowid field. How many rows might there be for any given (dump_id, external_id, pid) combination? What decides that number?

aaron removed aaron as the assignee of this task.Dec 17 2015, 9:09 PM
aaron changed the task status from Open to Stalled.Dec 17 2015, 9:10 PM

Was this deployed already? Perhaps this can just be closed.