Page MenuHomePhabricator

Create revision_commentindex on Wiki Replicas in Cloud VPS
Open, LowPublic

Description

While trying to query revision table based on comments in toolforge environment, I noticed awful performance for simple query like:
select rev_id, comment_text from revision inner join comment on rev_comment_id=comment_id where comment_text like '%CSD#G11%' limit 10;

Following the good work in T166733 and related tasks, I think it may be cool to have revision_commentindex similar to revision_userindex in Toolforge for use by tools that query based on comments.

Event Timeline

I note there are likely three sources of slowness here:

  1. Scanning comment for comment_text like '%CSD#G11%' can't use any indexes.
  2. Accessing comment in general is slow thanks to having to do many subqueries per row to check visibility.
  3. The query can't make use of the revcomment_comment_id index, because the revision view has that field aliased.

This request would help with #3. #2 is being discussed as T215445: comment and actor view challenges for Cloud Services. I doubt there's anything we can do about #1, that part will always be slow.

Krinkle subscribed.

Improving #2 and #3 would be a big win I think. On the bigger wikis, these queries currently time out for me after an hour or so.

Can we get anywhere by using the comment_revision view instead of comment? https://wikitech.wikimedia.org/wiki/News/Actor_storage_changes_on_the_Wiki_Replicas#The_actor_table_seems_really_slow--so_does_comment

Revision is a weird table (because of the temp table), so I'm curious if that will help @Anomie
That table only needs to check revision instead of 7 other subqueries.

Using comment_revision would help with #2 by reducing the number of subqueries needed.

I'd rather not make a slippery slope argument, but I must admit it's there in my mind. Revision is the table where I'd want this kind of thing vs. probably anywhere else.

We are proliferating views rather quickly lately, and the setup is becoming steadily more complex to maintain and use. The value of a commentindex view will drop off entirely once the temp table is no longer there, so we'd want to leave a comment to that effect, I think, so that when the aliasing is dropped, the commentindex view is deprecated (and dropped with warning) as well.

Can we get anywhere by using the comment_revision view instead of comment?

Barely. The simplest WHERE I can imagine still takes 15 min:

SELECT comment_id
FROM comment_revision
WHERE comment_id > 284004931
AND comment_text LIKE "Bluelink%"

If we cannot even scan 30M rows out of 300M and counting, how is one ever supposed to be able to query the comment table, let alone join it with something else?

Krinkle renamed this task from create revision_commentindex to Create revision_commentindex on Wiki Replicas in Cloud VPS.Mar 9 2020, 4:41 PM
Krinkle unsubscribed.