Page MenuHomePhabricator

Allow page_props to be queried by value.
Closed, ResolvedPublic

Description

Page_props associates key/value pairs with page ids:

`pp_page` int(11) NOT NULL,
`pp_propname` varbinary(60) NOT NULL,
`pp_value` blob NOT NULL,

Currently, this does not however allow us to efficiently find pages by a given key/value pair. That however would be quite useful, e.g. for marking pages as "stub" upon save, and then listing all stubs.

There are two indexes defined at the moment:

UNIQUE KEY `pp_page_propname` (`pp_page`,`pp_propname`),
UNIQUE KEY `pp_propname_page` (`pp_propname`,`pp_page`)

So, we can efficiently get all properties (or one specific property) for a page. And we can efficiently get all pages that *have* a specific property. We can however not get all pages with a specific *value* for a given property. For that, we would need an index over at least pp_propname and pp_value, but this doesn't work, since pp_value is a blob. So allow this, we'd need an id column. I propose to add a column:

`pp_value_id` varbinary(255) NOT NULL

This may contain a hash of the value blob, or it may contain the actual value (with pp_value being
the same, or empty, or null). This would allow us to create an index for looking up pages by value:

UNIQUE KEY `pp_propname_value` (`pp_propname`,`pp_value_id`, `pp_page`)

Perhaps the pp_propname_page could then be dropped, I don't really see a use case for it.


Version: 1.23.0
Severity: normal
Whiteboard: u=dev c=backend p=13 s=2014-04-23
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=40157
https://bugzilla.wikimedia.org/show_bug.cgi?id=15441

Details

Reference
bz58032

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 2:20 AM
bzimport added a project: Wikimedia-Rdbms.
bzimport set Reference to bz58032.
bzimport added a subscriber: Unknown Object (MLST).
daniel created this task.Dec 5 2013, 11:48 AM

Oh, while we are messing with the table, that would probably be a good time to introduce a surrogate primary key.

We should also consider that the order of values may be relevant, not only equality. For numeric values, this would mean we'd need a padded version of the number in pp_value_id, so the alphanumeric order of the index can be used.

Alternatively, we need separate columns or even tables for different types of values.

After some discussion with Roan and Tim, it's seems the easiest approach is to add a new column to page_props, containing a float value to be used as a sort key. This would allow for exact matches as well as the typical "top k" queries often used on special pages. The column should be nullable so no value would be provided for page props that do not represent a quantity (such as a display title or redirect target).

Querying for strings would not be possible this way, a partial index on the value blob could be used for that.

Change 122349 had a related patch set uploaded by Daniel Kinzler:
(bug 58032) introducing pp_sortkey.

https://gerrit.wikimedia.org/r/122349

Change I217c4265 introduces the pp_sortkey field, and automatically uses it to index any property values that are int, float or bool.

Pending:

  • a maintenance script for populating pp_sortkey (this will be sloppy, because when reading pp_value from the database, we don't know the original type - the value will always be a string).
  • an API module for querying pages by property value.

Change 122349 had a related patch set uploaded by Aaron Schulz:
Introducing pp_sortkey.

https://gerrit.wikimedia.org/r/122349

Change 122349 merged by jenkins-bot:
Introducing pp_sortkey.

https://gerrit.wikimedia.org/r/122349

Created bug 64949 as a follow-up.

This comment was removed by Aklapper.