Page MenuHomePhabricator

Use individual select queries to determin which Entities are in the wbc_entity_usage table
Closed, ResolvedPublic

Description

In EntityUsageTable::getUsedEntityIdStrings, we currently use a bulk query to see which Entity ID from a given set are actually present in the wbc_entity_usage table:

SELECT DISTINCT eu_entity_id  FROM `wbc_entity_usage` 
WHERE eu_entity_id IN ('Q148475', 'Q54919', 'Q423048', 'Q2494649', 'Q13219454', 'Q131454', 'Q36578', 'Q1798125', 'Q2597810', 'Q19938912', 'Q193563', 'Q2553334', 'Q477675', 'Q623578', 'Q19675', 'Q13481868', 'Q384602')

This can lead to extremely slow queries and high resource consumption in some cases, as described in T116404#2370849.

We should instead use individual selects with LIMIT 1 for each ID:

SELECT eu_entity_id FROM wbc_entity_usage WHERE eu_entity_id = 'Q623578' LIMIT 1;

@jcrespo suggest that this is the better approach in T116404#2371094.

Event Timeline

daniel created this task.Jun 10 2016, 11:09 AM
Restricted Application added a subscriber: Zppix. · View Herald TranscriptJun 10 2016, 11:09 AM

A slight comment here: I know the second query is way better than the the first one- there could be an even better solution; but just doing this change would be a huge improvement (at least 100x faster).

Aklapper renamed this task from Use inidividual select queries to determin which Entities are in the wbc_entity_usage table to Use individual select queries to determin which Entities are in the wbc_entity_usage table.Jun 10 2016, 11:56 AM
hoo claimed this task.Jun 10 2016, 2:36 PM

Change 293736 had a related patch set uploaded (by Hoo man):
WIP: Use one (sub)query per entity id in getUsedEntityIdStrings

https://gerrit.wikimedia.org/r/293736

hoo added a comment.EditedJun 10 2016, 3:03 PM

Short test:

hoo@terbium:~$ mwscript eval.php --wiki=zhwiki                                                                                                                                                                                                                         
> $idStrings = [];                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                               
> for ( $i = 1; $i < 1001; $i++ ) { $idStrings[] = "Q$i"; }                                                                                                                                                                                                                    
                                                                                                                                                                                                                                                                               
> $subQueries = [];                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                               
> $connection = wfGetDB( DB_SLAVE );                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                               
> foreach ( $idStrings as $idString ) { $subQueries[] = $connection->selectSQLText( 'wbc_entity_usage', 'eu_entity_id', [ 'eu_entity_id' => $idString ], '', [ 'LIMIT' => 1 ] ); }                                                                                             
                                                                                                                                                                                                                                                                               
> $sql = $connection->unionQueries( $subQueries, false );                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                               
> $t0 = microtime( true ); $res = $connection->query( $sql ); echo microtime( true ) - $t0;
0.10685896873474



hoo@terbium:~$ mwscript eval.php --wiki=zhwiki
> $idStrings = [];

> for ( $i = 1001; $i < 2001; $i++ ) { $idStrings[] = "Q$i"; }

> $connection = wfGetDB( DB_SLAVE );

> $t0 = microtime( true ); foreach ( $idStrings as $idString ) { $connection->select( 'wbc_entity_usage', 'eu_entity_id', [ 'eu_entity_id' => $idString ], '', [ 'LIMIT' => 1 ] ); }; echo microtime( true ) - $t0;
1.093230009079

Change 293745 had a related patch set uploaded (by Hoo man):
EntityUsageTable: Run all subqueries at once with UNION on MySQL

https://gerrit.wikimedia.org/r/293745

Change 293736 merged by jenkins-bot:
Use one query per entity id in getUsedEntityIdStrings

https://gerrit.wikimedia.org/r/293736

Note: the naive "one query per ID" patch (293736) is merged, and should already fix the performance issue. The "use a UNION" patch (293745) is still pending comments and testing.

The second patch isn't strictly necessary, but provides another 10x speedup (for large sets of IDs), according to Marius' tests. I'd leave this task open until the second patch is either merged or abandoned.

I like your approach @daniel, starting simple, then optimize with a more complex option. Thank you!

Change 293745 merged by jenkins-bot:
EntityUsageTable: Run all subqueries at once with UNION on MySQL

https://gerrit.wikimedia.org/r/293745

adrianheine closed this task as Resolved.Jun 14 2016, 7:18 AM
adrianheine moved this task from Proposed to Done on the Wikidata-Sprint-2016-05-24 board.