Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Lydia_Pintscher | T99351 Review and deploy Wikibase-Quality on wikidata.org
Resolved | | Lydia_Pintscher | T99354 Review and deploy Wikibase-Quality-Constraints on wikidata.org
Invalid | | None | T103817 [Epic] version 2 of the quality extensions
Declined | | None | T99357 [Story] Review and deploy Wikibase-Quality-External-Validation on wikidata.org
Declined | | None | T99358 [Task] Security review of Wikibase-Quality-External-Validation branch master
Declined | | None | T103912 [Task] Ex:WikibaseQualityExternalValidation - performance review of Special:CrossCheck
Event Timeline
It would be interesting to see dumps of all the queries executed for a request, both for a typical case and for a worst case.
In both cases, three database queries are executed:
Query 1:
SELECT dump_id FROM wbqev_identifier_properties WHERE identifier_pid IN ( ... );
Query 2:
SELECT * FROM wbqev_dump_information LEFT JOIN wbqev_identifier_properties ON wbqev_dump_information.dumpId = wbqev_identifier_properties.dump_id WHERE wbqev_dump_information.dumpId IN ( ... );
Query 3:
SELECT dump_id, external_id, pid, external_value FROM wbqev_external_data WHERE dump_id IN ( ... ) AND external_id IN ( ... ) AND pid IN ( ... );
The number of rows depends on how many external datasets are imported into the database. For now, we only use the GND as a counterpart database, which results in the following numbers:
Typical Case
Query 1 returns 1 row.
Query 2 returns 3 rows.
Query 3 returns 3 rows.
Worst Case
Query 1 returns 1 row.
Query 2 returns 3 rows.
Query 3 returns 22 rows.
Does this answer your question, or do you need more information?
@aaron Any update on this? This looks like it is the last remaining blocker for deploying the extension.
The JOIN for query #2 does not seem to have an index. wbqev_identifier_properties has:
PRIMARY KEY (identifier_pid, dump_id)
...but nothing like (dump_id). Is the wbqev_identifier_properties table going to be pruned of older dumps, or will it just keep growing?
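If the table is meant to live across multiple dumps, a secondary index would let the JOIN in query #2 seek on dump_id instead of scanning. A minimal sketch, following the /*i*/ and /*_*/ conventions the schema already uses (the index name here is made up, not part of the current schema):
CREATE INDEX /*i*/dump_id ON /*_*/wbqev_identifier_properties (dump_id);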
Query #3 would require a few clever index dives to run well. I'm not sure how smart MariaDB is here. Has anyone tried "EXPLAIN FORMAT=JSON" with a good dataset (doesn't have to be full size, but a few 100k items)? My fear would be that only the (dump_id) index part gets used, with scanning happening for the other two IN()s. On the revision table, two IN()s work fine:
EXPLAIN EXTENDED SELECT * FROM revision FORCE INDEX(rev_page_id) WHERE rev_page IN (5043734,3535,6234) AND rev_id IN (343535,23255,33626);

stdClass Object
(
    [id] => 1
    [select_type] => SIMPLE
    [table] => revision
    [type] => range
    [possible_keys] => rev_page_id
    [key] => rev_page_id
    [key_len] => 8
    [ref] =>
    [rows] => 9
    [filtered] => 100.00
    [Extra] => Using index condition
)
For wbqev_external_data, the only index with pid has it as the third part of the index.
CREATE INDEX /*i*/dump_id_external_id_pid ON /*_*/wbqev_external_data (dump_id, external_id, pid);
...so MariaDB will need to handle three levels of IN() nesting to avoid scanning extra rows. It may work just fine, though it would be nice to know for sure.
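For reference, that check might look like the following; the IN() values are made-up placeholders, not real dump IDs or external IDs (P227, the GND identifier property, is used only as an example):
EXPLAIN FORMAT=JSON SELECT dump_id, external_id, pid, external_value FROM wbqev_external_data WHERE dump_id IN (1, 2) AND external_id IN ('118540238', '118571249') AND pid IN (227);
If the resulting plan shows only the dump_id prefix of dump_id_external_id_pid being used (a short key length, with the other two IN()s applied as a post-filter), the clever index dives are not happening.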
I'm having trouble finding documentation about wbqev_external_data. The PK is just an opaque rowid field. How many rows might there be for any given (dump_id, external_id, pid) combination? What decides that number?
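Pieced together from the queries and index above, the table presumably looks roughly like this; the column names come from the queries, but the types are guesses, not the actual schema:
CREATE TABLE /*_*/wbqev_external_data (
    row_id          BIGINT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT, -- the opaque rowid PK
    dump_id         VARBINARY(25)   NOT NULL, -- which imported dump the row belongs to
    external_id     VARBINARY(100)  NOT NULL, -- identifier in the external database (e.g. a GND ID)
    pid             INT UNSIGNED    NOT NULL, -- numeric Wikidata property ID the value maps to
    external_value  VARBINARY(255)  NOT NULL  -- the external database's value for that property
);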
We are archiving the extension: T204490: Archive the WikibaseQualityExternalValidation extension.