Page MenuHomePhabricator

How to join terms in new schema with other tables that have full entity ids
Closed, ResolvedPublic

Description

In the new schema, top-level tables do not contain full entity ids but only numeric ids, as the tables are seprated by type and the prefix was dropped to optimize space usage.

The old wb_terms containeed full entity ids, allowing it to be joined with other tables like page table.

How do I join terms in the new schema with other tables that store full entity ids?

example query:

SELECT term_text AS painter, COUNT(term_text) AS paintercount FROM page
JOIN pagelinks AS colllink ON page_id=colllink.pl_from AND
colllink.pl_from_namespace=0 AND
colllink.pl_namespace=0 AND colllink.pl_title='Q18600731'
LEFT JOIN pagelinks AS creatorlink ON page_id=creatorlink.pl_from AND
creatorlink.pl_from_namespace=0 AND
creatorlink.pl_namespace=120 AND
creatorlink.pl_title='P170'
JOIN wb_terms ON page_title=term_full_entity_id AND
term_entity_type='item' AND
term_language='en' AND
term_type='description' AND term_text LIKE 'painting by %'
WHERE
page_namespace=0 AND
page_is_redirect=0 AND
creatorlink.pl_from IS NULL
GROUP BY term_text
ORDER BY COUNT(term_text) DESC, painter
LIMIT 300;

Suggested solutions

Build full entity ids within the query (part of join condition or by doing a sub-query).

  • find a way to write the query in a simpler and performant way
  • test give solution on test node to get a sense on performance

Event Timeline

Attempt 1

drop-in replacement query that joins with new schema:

SELECT
	wbx_text AS painter,
	COUNT(wbx_text) AS paintercount
FROM page
JOIN pagelinks AS colllink ON page_id=colllink.pl_from AND
	colllink.pl_from_namespace=0 AND
	colllink.pl_namespace=0 AND
	colllink.pl_title='Q18600731'
LEFT JOIN pagelinks AS creatorlink ON page_id=creatorlink.pl_from AND
	creatorlink.pl_from_namespace=0 AND
	creatorlink.pl_namespace=120 AND
	creatorlink.pl_title='P170'
INNER JOIN wbt_item_terms ON page_title=CONCAT('Q', wbit_item_id)
INNER JOIN wbt_term_in_lang ON wbit_term_in_lang_id = wbtl_id
INNER JOIN wbt_type ON wbtl_type_id = wby_id
INNER JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id
INNER JOIN wbt_text ON wbxl_text_id = wbx_id
WHERE
	wbxl_language='en' AND
	wby_name='description' AND
	wbx_text LIKE 'painting by %' AND
	page_namespace=0 AND
	page_is_redirect=0 AND
	creatorlink.pl_from IS NULL
GROUP BY wbx_text
ORDER BY COUNT(wbx_text) DESC, painter
LIMIT 300;

Possible issue is reduced performance due to using CONACT in join condition .. will be checked in test node once it is ready for such check.

Another possible idea is to introduce a virtual column that calculates the full entity ids and add a secondary index on it, in wbt_item_terms and wbt_proeprty_terms tables.

Will follow up on those possible solutions on T223554

(Resetting assignee as that account is not active anymore)

Ladsgroup reopened this task as Open.
Ladsgroup subscribed.

My bad

Addshore claimed this task.
Addshore subscribed.

Docs and example queries are at https://doc.wikimedia.org/Wikibase/master/php/md_docs_storage_terms.html
Closing this now as a cleanup as wb_terms removal work is coming to a close.