Redesign querycache* tables
Open, Needs Triage · Public

Description

I have been looking at the design of the three tables querycache, querycachetwo (!), and querycache_info. These are the issues I have found so far:

  • None of them has a PK
  • The names are really badly chosen: these are not caches of database queries or caches of API queries; they are caches of the QueryPage class (some special pages and some API modules). The name of the class is also bad, but I consider that out of scope for this ticket.
  • querycachetwo? seriously?
  • qc_type and qcc_type could be normalized if querycache_info had a PK
  • qc_value is the most polymorphic column I have ever seen in my life. Sometimes it points to a user id, sometimes a page id, sometimes the number of pages linked to a certain page. This means we can't use qc_value to normalize against the page or user table :/
  • The tables are fairly big on small wikis (a wiki that was created last week and has 7k articles has 57k rows in querycache, making it the 8th biggest table there, while it's the 70th largest table on enwiki), taking up a lot of storage on s3 but nothing bad for large wikis (@Marostegui or @jcrespo can check better, of course)
  • The 'activeusers' type in querycachetwo doesn't actually take advantage of the extra columns there and could simply be moved to the querycache table, as far as I checked. Maybe I missed something.

I will try to come up with a new schema for them in the next couple of days. The only open question for me is the migration: should we define new tables and move everything there (notably, they are updated through jobs and not via user requests), or slowly fix things in the existing tables? I'm inclined toward the first option, as adding a PK or renaming tables can be tricky.

The other open question is whether anyone is willing to help review my patches. If anyone can help, it would be amazing.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Apr 19 2019, 1:45 PM

The first two tables, querycache and querycachetwo, do not even have a UNIQUE key that we could convert to a PK, so adding a PK there with the existing data would indeed be very tricky, not to mention the fact that we might even have rows that are exactly the same.

I am more inclined to go for the option of creating the new tables and migrating stuff there, but we'd need to check how much space that would require.
For all the sections but s3 it should be no problem; I just checked the table sizes there.
For s3, given that we have more than 900 wikis, we have to do better math to make sure we can support the size of the old tables plus the new tables at the same time, until we are good to drop the old ones.

See also T146571: Add a primary key to querycache for some of the duplicates and crap that’s been left behind...

I'd imagine a large chunk of the data, at least on older wikis, probably wouldn't be migrated.

Reedy added a comment. · Apr 22 2019, 3:09 PM

The tables are fairly big on small wikis (a wiki that was created last week and has 7k articles has 57k rows in querycache, making it the 8th biggest table there, while it's the 70th largest table on enwiki), taking up a lot of storage on s3

But it's only 10 MB or so.

Compare it to, say, enwikibooks: it's only 20 MB or so. Granted, for the 900 wikis or so on s3, that starts to add up... It's still only 20GB (per host) for the old storage methods...

Do remember they're only caches, for performance reasons. The data can be regenerated.
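A quick back-of-envelope check of the figures above (all numbers are ballpark estimates quoted in this thread, not measurements):

```python
# Rough cross-check of the s3 storage estimate: ~900 wikis, each with
# enwikibooks-sized querycache tables of roughly 20 MB.
wikis_on_s3 = 900
mb_per_wiki = 20
total_gb = wikis_on_s3 * mb_per_wiki / 1024
print(round(total_gb, 1))  # ~17.6, consistent with "only 20GB (per host)"
```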

Reedy added a comment. · Apr 22 2019, 3:11 PM

And for amusement, the boardvote tables on enwikibooks use 100MB+

Anomie added a comment (edited). · Mon, Apr 29, 4:23 PM

I have been looking at the design of the three tables querycache, querycachetwo (!), and querycache_info.

I think that you may not be fully clear on how these tables are intended to be used. In the interests of avoiding us talking past each other, here's how I see it:

  • querycache is caching the results of an expensive database query of the form SELECT namespace, title, value FROM ... ORDER BY value LIMIT 5000. In other words, a list of titles sorted by a value. Typically the "value" there is actually a COUNT() of some sort.
  • querycachetwo is similar, in this case for a query of the form SELECT namespace1, title1, namespace2, title2, value FROM ... ORDER BY value LIMIT 5000. In other words, pairs of titles sorted by a value. There don't seem to be very many uses of this, although I see MediaWiki-extensions-Disambiguator does make use of it.
  • querycache_info is completely different. It records the timestamps of the last time each expensive query was cached.
  • None of them has a PK

I'm not sure the usual auto-generated index would be very valuable for querycache and querycachetwo; unless I'm mistaken, the write usage pattern is to DELETE everything with qc_type = 'foo' and then insert all-new rows, with no updates. Similarly, selects are pretty much always targeting the existing indexes to fetch bulk data.

querycache_info does have a PK, that being qci_type. True, that's not the traditional auto-generated integer PK.
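The DELETE-then-reinsert refresh cycle described above can be sketched in miniature (a hypothetical SQLite mock-up; column names follow the real schema, but the data and the `refresh` helper are invented for illustration):

```python
import sqlite3

# Miniature model of the querycache refresh cycle: each query type's
# cached result set is replaced wholesale, never updated row by row.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE querycache (
        qc_type      TEXT    NOT NULL,
        qc_value     INTEGER NOT NULL DEFAULT 0,
        qc_namespace INTEGER NOT NULL DEFAULT 0,
        qc_title     TEXT    NOT NULL
    )
""")
db.execute("CREATE INDEX qc_type_value ON querycache (qc_type, qc_value)")

def refresh(qc_type, rows):
    """Delete everything with this type, then insert the fresh rows."""
    with db:  # one transaction per refresh
        db.execute("DELETE FROM querycache WHERE qc_type = ?", (qc_type,))
        db.executemany(
            "INSERT INTO querycache VALUES (?, ?, ?, ?)",
            [(qc_type, value, ns, title) for (value, ns, title) in rows],
        )

refresh("wantedpages", [(12, 0, "Foo"), (7, 0, "Bar")])
refresh("wantedpages", [(15, 0, "Baz")])  # second run wipes the first

# Reads hit the (qc_type, qc_value) index to fetch bulk, sorted data.
cached = db.execute(
    "SELECT qc_title, qc_value FROM querycache "
    "WHERE qc_type = ? ORDER BY qc_value DESC", ("wantedpages",)
).fetchall()
print(cached)  # [('Baz', 15)] - only the latest refresh survives
```

Since every write path replaces a whole type's rows inside one transaction, no row identity is ever needed by the application itself, which is why the PK question here is about replication rather than code.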

  • The names are really badly chosen: these are not caches of database queries or caches of API queries; they are caches of the QueryPage class (some special pages and some API modules). The name of the class is also bad, but I consider that out of scope for this ticket.
  • querycachetwo? seriously?

The name querycache isn't that bad; it actually is caching expensive database queries. QueryPage is rather similar: they're supposed to be pages built around displaying the results of a database query.

I can't argue about querycachetwo though. Naming is hard. :(

I also agree that there are a good number of QueryPage subclasses that are either not able to use querycache/querycachetwo because they're selecting more than "namespace, value, title" or that seem to be using the tables in a somewhat strange way. Plus the whole structure of the class doesn't seem very conducive to good use. But, as you said, that's out of scope for this task about the tables themselves.

  • qc_type and qcc_type could be normalized if querycache_info had a PK

That's probably not a bad idea: if we were to add an integer PK to querycache_info, then we could use it for querycache and querycachetwo instead of the current varchars.
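A sketch of what that normalization could look like (SQLite used for brevity; qci_id is a hypothetical name for the new surrogate key, not an existing column):

```python
import sqlite3

# Proposed shape: querycache_info gains an integer PK, and querycache
# stores that integer instead of repeating the varchar type per row.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE querycache_info (
        qci_id        INTEGER PRIMARY KEY,   -- hypothetical new PK
        qci_type      TEXT    NOT NULL UNIQUE,
        qci_timestamp TEXT    NOT NULL
    );
    CREATE TABLE querycache (
        qc_type      INTEGER NOT NULL REFERENCES querycache_info (qci_id),
        qc_value     INTEGER NOT NULL DEFAULT 0,
        qc_namespace INTEGER NOT NULL DEFAULT 0,
        qc_title     TEXT    NOT NULL
    );
""")
db.execute("INSERT INTO querycache_info (qci_type, qci_timestamp) "
           "VALUES ('deadendpages', '20190419134500')")
(type_id,) = db.execute(
    "SELECT qci_id FROM querycache_info WHERE qci_type = 'deadendpages'"
).fetchone()
db.execute("INSERT INTO querycache VALUES (?, 0, 0, 'Lonely_page')",
           (type_id,))

# Reads join back to recover the human-readable type name.
row = db.execute("""
    SELECT qci_type, qc_title FROM querycache
    JOIN querycache_info ON qc_type = qci_id
""").fetchone()
print(row)  # ('deadendpages', 'Lonely_page')
```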

  • qc_value is the most polymorphic column I have ever seen in my life. Sometimes it points to a user id, sometimes a page id, sometimes the number of pages linked to a certain page. This means we can't use qc_value to normalize against the page or user table :/

You should look at log_search.ls_value ;)

As for qc_value, I don't see any case in which it points to a user ID or page ID. All the uses I see seem to be either (1) a count, (2) a Unix timestamp, or (3) a constant 0, which is typically produced rather stupidly in code by selecting page_title and letting it get cast to 0.

  • The tables are fairly big on small wikis (a wiki that was created last week and has 7k articles has 57k rows in querycache, making it the 8th biggest table there, while it's the 70th largest table on enwiki), taking up a lot of storage on s3 but nothing bad for large wikis (@Marostegui or @jcrespo can check better, of course)

I'm not sure that really matters, nor how you might fix it while still maintaining the actual use case.

  • The 'activeusers' type in querycachetwo doesn't actually take advantage of the extra columns there and could simply be moved to the querycache table, as far as I checked. Maybe I missed something.

It looks like 'activeusers' is using querycachetwo because the usage patterns for it often want to use the qcc_title index. We'd have to add a similar index to querycache in order to move 'activeusers' there.

The other open question is whether anyone is willing to help review my patches. If anyone can help, it would be amazing.

I'll review patches as circumstances allow me to. Unfortunately at the moment circumstances are severely limiting what code review I can do outside of certain teams.

I'm not sure the usual auto-generated index would be very valuable for querycache and querycachetwo; unless I'm mistaken, the write usage pattern is to DELETE everything with qc_type = 'foo' and then insert all-new rows, with no updates. Similarly, selects are pretty much always targeting the existing indexes to fetch bulk data.

querycache_info does have a PK, that being qci_type. True, that's not the traditional auto-generated integer PK.

All tables that are used in a master-replica database need to have a PK (preferably an auto_increment integer PK). That makes replication faster and avoids data corruption (see this bug for example). This holds regardless of whether the PK column is used in the code or has any conceptual meaning whatsoever.

I also agree that there are a good number of QueryPage subclasses that are either not able to use querycache/querycachetwo because they're selecting more than "namespace, value, title" or that seem to be using the tables in a somewhat strange way. Plus the whole structure of the class doesn't seem very conducive to good use. But, as you said, that's out of scope for this task about the tables themselves.

Yeah, maybe we can come up with some design that accommodates other use cases too. Do you have some examples?

You should look at log_search.ls_value ;)

nooooooooooooo :(((

As for qc_value, I don't see any case in which it points to a user ID or page ID. All the uses I see seem to be either (1) a count, (2) a Unix timestamp, or (3) a constant 0, which is typically produced rather stupidly in code by selecting page_title and letting it get cast to 0.

Hmm, maybe I missed something when I read the code, but we could take advantage of page_id by joining it against qc_value and storing an empty string in qc_title to save some space.

I'm not sure that really matters, nor how you might fix it while still maintaining the actual use case.

After the migration is done, the tables will be smaller on s3, reducing the storage overhead.

It looks like 'activeusers' is using querycachetwo because the usage patterns for it often want to use the qcc_title index. We'd have to add a similar index to querycache in order to move 'activeusers' there.

Thanks for the note.

I'll review patches as circumstances allow me to. Unfortunately at the moment circumstances are severely limiting what code review I can do outside of certain teams.

Thanks.

All tables that are used in a master-replica database need to have a PK (preferably an auto_increment integer PK)

That is not true, and I don't know where that myth started. All tables need to have a PK, but auto_increment is not preferred; it is suggested only if no other natural key is possible, since otherwise it is a waste of space and a slowdown due to not being the main access method. If other options exist, one of them should be used instead (unless it is impractical or undesired for other reasons). This was clarified on mediawiki.org, but sadly @Krinkle deleted the wording.

That is not true, and I don't know where that myth started. All tables need to have a PK, but auto_increment is not preferred; it is suggested only if no other natural key is possible, since otherwise it is a waste of space and a slowdown due to not being the main access method. If other options exist, one of them should be used instead (unless it is impractical or undesired for other reasons). This was clarified on mediawiki.org, but sadly @Krinkle deleted the wording.

Thanks for the correction. I learned something today.

Anomie added a comment. · Wed, May 1, 4:33 PM

Yeah, maybe we can come up with some design that accommodates other use cases too. Do you have some examples?

I don't want to get into too much detail here since it's not really on-topic, but in brief it seems to me that Pager is designed around the assumption that you can specify the query without the ORDER BY and LIMIT and tack those on later, which I've found tends not to hold true in many cases. It would likely be better to have a more flexible "implement this method to return a set of rows" interface and leave the details of query-building to be done internally.

We might also look at the separation between Pager and QueryPage to see whether the line is drawn in the right place (QueryPage should have formatting code but no row-fetching or query-building logic; I'm not sure whether that's currently the case). Then, ideally, we could use the same replacement Pager to back many action API query modules as well.

Hmm, maybe I missed something when I read the code, but we could take advantage of page_id by joining it against qc_value and storing an empty string in qc_title to save some space.

It would be much better to use the table you proposed in T222224: Normalizing *links tables to replace qc_namespace and qc_title rather than trying to repurpose qc_value.

After the migration is done, the tables will be smaller on s3, reducing the storage overhead.

With the proposals here I think you'll only really be able to convert the 33-byte varchar (average size probably a bit less) to a 4-byte integer. That's some savings, but I'm not sure how significant it will really be.

It would be much better to use the table you proposed in T222224: Normalizing *links tables to replace qc_namespace and qc_title rather than trying to repurpose qc_value.

With the proposals here I think you'll only really be able to convert the 33-byte varchar (average size probably a bit less) to a 4-byte integer. That's some savings, but I'm not sure how significant it will really be.

Yes, indeed: basically, let's normalize qc_title and qc_namespace in favor of the new table (for normalizing the links tables), and qc_type in favor of a PK in querycache_info. That would save us some space on s3, but not that much, I'm afraid.
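For a rough sense of how modest that saving is, applied to the 57k-row wiki from the description (byte sizes here are the assumed bounds from the comment above, not measured averages):

```python
# Estimated per-wiki saving from replacing the qc_type varchar with a
# 4-byte integer key, ignoring index and row-overhead effects.
rows = 57_000            # querycache rows on the small wiki cited above
varchar_bytes = 33       # upper bound quoted for the qc_type varchar
int_bytes = 4
saving_mb = rows * (varchar_bytes - int_bytes) / 1024 / 1024
print(round(saving_mb, 2))  # ~1.58 MB for such a wiki - some, but modest
```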