Page MenuHomePhabricator

Make growthexperiments_link_recommendations.gelr_data nullable in GrowthExperiments
Closed, ResolvedPublic

Description

As part of T382270: Store the fact that Add Link did not generate any recommendation for a page, don't try again, we want to store information about Add Link being unable to generate any recommendation for a particular article:

We decided that the easiest way to implement this is to alter the existing growthexperiments_link_recommendations table and to make the gelr_data field nullable. That field being null would then be interpreted as that we tried to get a recommendation for a page from the service but did not succeed. Such a row would usually be removed by the existing process when a page has been edited or through revalidation (currently implemented via a manual maintenance script).

Task scope: Make the schema change in GrowthExperiments. Once the change is merged, a follow-up task for DBA will be filled to do the change in Wikimedia production.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Change #1111673 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[mediawiki/extensions/GrowthExperiments@master] sql: Make gelr_data nullable

https://gerrit.wikimedia.org/r/1111673

Urbanecm_WMF moved this task from Unsorted to Change on the Schema-change board.
Urbanecm_WMF moved this task from Incoming to Code Review on the Growth-Team (Current Sprint) board.

Change #1111936 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[mediawiki/extensions/GrowthExperiments@master] sql: Update sql files to match output from script

https://gerrit.wikimedia.org/r/1111936

Before making the field nullable please consider these two notes:

If it doesn't affect your case, feel free to go ahead but otherwise, maybe use an empty string, or something else instead.

Before making the field nullable please consider these two notes:

This seems to be about creating an index on a nullable column or making joins on it. As I understand it, gelr_data is a pure data-storage column where we do neither have an index nor do any joins based on it.

The only thing relevant that I could see is when we revalidate the data in that table. What would we do with gelr_data=null rows then? One option is to drop them all in one swoop and then let the refreshLinkRecommendations script refill them again over time. Not sure if nullable vs "empty string" makes a difference for that scenario.

I'm not fully sure what this is about, but for us the primary key is the first column, the second column (pageid, with an index) is non-null as well. The nullable column will be the last one.

If it doesn't affect your case, feel free to go ahead but otherwise, maybe use an empty string, or something else instead.

Though, on the other hand, I did not yet take a deep dive into GrowthExperiments SQL queries, so maybe I might be missing something.

Basically the official recommendation is this (https://dev.mysql.com/doc/refman/8.4/en/data-size.html):

Declare columns to be NOT NULL if possible. It makes SQL operations faster, by enabling better use of indexes and eliminating overhead for testing whether each value is NULL. You also save some storage space, one bit per column. If you really need NULL values in your tables, use them. Just avoid the default setting that allows NULL values in every column.

If you need to have it nullable, I honestly think that's fine but there are some penalties attached to the change.

Before making the field nullable please consider these two notes:

As far as I can see, this doesn't apply to us, because this answer talks only about columns with an index. In the first point, it talks about effectiveness of b-tree index structure when the data variety is small. As far as I can see, this is not specific to NULL values: if 50% of a non-nullable indexed column had the same value, b-tree index lookups would be fairly inefficient as well, even with no null involved. In short, this points affects the RDBMS's ability to effectively use the index. The second point is about the DB engine's ability to use the index (if an indexed column is transformed first, the index cannot be used, as it includes pre-transformation values).

Since gelr_data is an unindexed column (designed to contain a blob later to be worked with at the application level), neither of those points can possibly apply to us, regardless of the queries and data variety we would end up with. Any query on gelr_data would have to result to a full table scan anyway.

This seems to talk about using COUNT. I don't really see any reason why we would do COUNTing queries. The only possible use-case that comes to mind is Special:NewcomerTasksInfo, but that is computed via a search query (as it needs to involve article topics anyway, which are not visible in the database).

However, I admit I don't understand this one much. I thought that doing COUNT(*) returns the actual number of rows (regardless of any nulls in any column), so I don't really get the relevance. I could see how COUNT(gelr_data) could be a very inefficient query (it would have to check if gelr_data is null or not, which cannot be done effectively, as there is no index, so it would likely resort to a full table scan), but that doesn't seem to be what the SO answer is about.

If it doesn't affect your case, feel free to go ahead but otherwise, maybe use an empty string, or something else instead.

Given what I said above, I think we're fine in both directions. However, if you disagree with that justification, I am happy to be convinced otherwise.

For context, we plan on using the NULL value to represent "no data available", which seems to be an appropriate use-case.

Change #1111936 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] sql: Update sql files to match output from script

https://gerrit.wikimedia.org/r/1111936

Change #1111673 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] [sql] growthexperiments_link_recommendations: Make gelr_data nullable

https://gerrit.wikimedia.org/r/1111673

QA note: This is currently only deployed to beta cluster (production deployment is tracked in the subtask). There should be no change based on this change; implementation of the actual change is tracked in T382270.