
Migrate wb_items_per_site to using prefixed entity IDs instead of numeric IDs
Closed, Declined · Public

Description

The implementation should be done using the same steps as for the wb_terms table, see T114903

Event Timeline

daniel created this task. Oct 7 2015, 4:35 PM
daniel raised the priority of this task to Needs Triage.
daniel updated the task description.
daniel added projects: Wikidata, DBA.
daniel added subscribers: Aklapper, daniel.
daniel updated the task description. Oct 7 2015, 4:38 PM
daniel set Security to None.
daniel added a subscriber: hoo. Oct 7 2015, 4:49 PM
Multichill triaged this task as High priority. Nov 1 2017, 7:26 PM

So wb_entity_per_page was killed:

MariaDB [wikidatawiki_p]> SELECT * FROM wb_entity_per_page LIMIT 1;
ERROR 1356 (HY000): View 'wikidatawiki_p.wb_entity_per_page' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them

The wb_terms table offers a good alternative to migrate queries to:

MariaDB [wikidatawiki_p]> SELECT * FROM wb_terms LIMIT 1;
+-------------+----------------+---------------------+------------------+---------------+-----------+------------+-----------------+-------------+
| term_row_id | term_entity_id | term_full_entity_id | term_entity_type | term_language | term_type | term_text  | term_search_key | term_weight |
+-------------+----------------+---------------------+------------------+---------------+-----------+------------+-----------------+-------------+
|      363474 |          50445 | Q50445              | item             | en-gb         | label     | Jungwon-gu | jungwon-gu      |           0 |
+-------------+----------------+---------------------+------------------+---------------+-----------+------------+-----------------+-------------+
1 row in set (0.00 sec)

The wb_items_per_site table offers no such option:

MariaDB [wikidatawiki_p]> SELECT * FROM wb_items_per_site LIMIT 1;
+------------+-------------+-------------+------------------+
| ips_row_id | ips_item_id | ips_site_id | ips_site_page    |
+------------+-------------+-------------+------------------+
|      55359 |        6065 | abwiki      | Џьгьарда         |
+------------+-------------+-------------+------------------+
1 row in set (0.01 sec)

The killing of wb_entity_per_page broke all sorts of tools with no alternative available.

Marostegui added a subscriber: Ladsgroup.

@Multichill wb_items_per_site is *always* items. So to get the full item ID, just use concat('Q', ips_item_id).

That's not very nice, but viable I think. We could even make a virtual column ips_full_entity_id on the labs view...

I'm dropping the prio back to normal, since the issue seems easy enough to resolve. Please let me know if there are big performance issues with using concat('Q', ips_item_id) in joins, though.
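Sketched in SQL, the two suggestions above might look roughly like this (the generated-column name and type, and the exact join shape, are illustrative assumptions, not anything tested against production):

```sql
-- Build the prefixed ID on the fly when joining (MariaDB syntax):
SELECT t.term_text, ips.ips_site_id, ips.ips_site_page
FROM wb_items_per_site ips
JOIN wb_terms t
  ON t.term_full_entity_id = CONCAT('Q', ips.ips_item_id)
WHERE t.term_type = 'label' AND t.term_language = 'en';

-- Or expose a generated (virtual) column, e.g. on the labs replica:
ALTER TABLE wb_items_per_site
  ADD COLUMN ips_full_entity_id VARBINARY(32)
  AS (CONCAT('Q', ips_item_id)) VIRTUAL;
```

A virtual column by itself only saves typing; it does not help join performance unless it is also indexed.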

daniel lowered the priority of this task from High to Normal. Nov 3 2017, 12:26 PM

> @Multichill wb_items_per_site is *always* items. So to get the full item ID, just use concat('Q', ips_item_id).
> That's not very nice, but viable I think. We could even make a virtual column ips_full_entity_id on the labs view...
> I'm dropping the prio back to normal, since the issue seems easy enough to resolve. Please let me know if there are big performance issues with using concat('Q', ips_item_id) in joins, though.

I guess you missed T178459? It's easy to overlook. So yes, we currently experience performance problems: queries using the concat hack don't complete in time, or at all.

daniel raised the priority of this task from Normal to High. Nov 5 2017, 12:18 AM

Sigh. I see. Bumping prio. Pinging @Lydia_Pintscher and @hoo

hoo added a comment. Nov 5 2017, 7:31 PM

Given the size of the table, changing this shouldn't be overly horrible. It's a fair bit of migration work… but I assume doing this for maintenance queries and consistency is worth it.

hoo added a comment. Nov 5 2017, 7:39 PM

Note: I just also found T179793: Consider dropping the "wb_items_per_site.wb_ips_site_page" index while looking at this… maybe this can be done at once?!

daniel added a comment. Nov 7 2017, 1:57 PM

We should also consider a solution that only puts the full IDs on labs. We don't actually need them in production. It could be a separate table, kept up to date by a trigger.
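That labs-only idea might look something like the following sketch; the shadow table name, the column types, and the trigger names are purely illustrative assumptions:

```sql
-- Hypothetical labs-side shadow table carrying the prefixed ID:
CREATE TABLE wb_items_per_site_full (
  ips_row_id         BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  ips_full_entity_id VARBINARY(32)   NOT NULL,
  ips_site_id        VARBINARY(32)   NOT NULL,
  ips_site_page      VARBINARY(310)  NOT NULL,
  KEY (ips_full_entity_id)
);

-- Keep it in sync with the base table via triggers:
CREATE TRIGGER ips_full_ins AFTER INSERT ON wb_items_per_site
FOR EACH ROW
  INSERT INTO wb_items_per_site_full
  VALUES (NEW.ips_row_id, CONCAT('Q', NEW.ips_item_id),
          NEW.ips_site_id, NEW.ips_site_page);

CREATE TRIGGER ips_full_del AFTER DELETE ON wb_items_per_site
FOR EACH ROW
  DELETE FROM wb_items_per_site_full
  WHERE ips_row_id = OLD.ips_row_id;
```

With the prefixed column indexed, tools could join on ips_full_entity_id directly instead of wrapping ips_item_id in CONCAT.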

The query @Multichill is running is complex on its own, and even if we resolve this task it will still be a very slow query. What I would recommend is to make a temporary table in labs for exactly this tool (which happens all the time; Magnus does it too, for example) and query directly from there.

Addshore lowered the priority of this task from High to Normal. Mar 25 2019, 4:30 PM
Addshore added a subscriber: Addshore.

Dropping this back to Normal; it hasn't seen activity in over a year.

Ghuron added a subscriber: Ghuron. (Edited) Apr 10 2019, 6:59 PM

Look at the query used to get missing articles for "List of articles every Wikipedia should have": https://quarry.wmflabs.org/query/26700. There are two joins:

  • For the iwlinks/wb_items_per_site join, I have to do either CONCAT('Q', ips_item_id) or TRIM('Q' FROM iwl_title)
  • For the wb_items_per_site/page join, I have to do REPLACE(ips_site_page, ' ', '_') (or vice versa) and potentially take care of non-default namespaces

Isn't it ironic that wikidatawiki_p.wb_items_per_site has two columns that are potentially joinable with Wikipedia tables, and both of them require a format transformation?
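For illustration, the two workarounds combined in one query might look like this sketch (not the actual Quarry query; in particular, iwl_prefix = 'd' is an assumption about how the interwiki links to Wikidata are stored):

```sql
SELECT p.page_title
FROM iwlinks iwl
JOIN wb_items_per_site ips
  -- strip the 'Q' prefix so the numeric column can stay indexable:
  ON ips.ips_item_id = CAST(SUBSTRING(iwl.iwl_title, 2) AS UNSIGNED)
JOIN page p
  -- sitelink titles use spaces, page titles use underscores:
  ON p.page_title = REPLACE(ips.ips_site_page, ' ', '_')
 AND p.page_namespace = 0  -- only correct for mainspace sitelinks
WHERE iwl.iwl_prefix = 'd';
```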

Addshore closed this task as Declined. Jun 30 2019, 6:28 PM

Per what I said in T114902#5294626, we are not going to do this now.

daniel removed a subscriber: daniel. Jun 30 2019, 9:15 PM