
page_assessments table uses an inadequate VARBINARY length for pa_class/pa_importance
Open, Needs Triage · Public · 8 Estimated Story Points

Description

As discovered during T328224: Deploy PageAssessments to Nepali Wikipedia (with thanks to @TheDJ for spotting it), a VARBINARY(20) is inadequate for storing localised article classes/importances in languages with multibyte scripts.

describe page_assessments;
+------------------+------------------+------+-----+---------+-------+
| Field            | Type             | Null | Key | Default | Extra |
+------------------+------------------+------+-----+---------+-------+
| pa_page_id       | int(10) unsigned | NO   | PRI | NULL    |       |
| pa_project_id    | int(10) unsigned | NO   | PRI | NULL    |       |
| pa_class         | varbinary(20)    | YES  |     | NULL    |       |
| pa_importance    | varbinary(20)    | YES  |     | NULL    |       |
| pa_page_revision | int(10) unsigned | NO   |     | NULL    |       |
+------------------+------------------+------+-----+---------+-------+
5 rows in set (0.001 sec)
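For context, a quick illustration of how little 20 bytes holds for multibyte text. The word below is just an example Devanagari string (roughly "class"); assuming a utf8mb4 connection, LENGTH() counts bytes and CHAR_LENGTH() counts characters:

SELECT LENGTH('श्रेणी') AS bytes, CHAR_LENGTH('श्रेणी') AS chars;
+-------+-------+
| bytes | chars |
+-------+-------+
|    18 |     6 |
+-------+-------+

A six-character class name already consumes 18 of the 20 available bytes, so longer localised names get truncated or rejected.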

It is proposed that the type of pa_class and pa_importance be changed to VARBINARY(80).
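In raw SQL the proposed change would look roughly like the following; this is only a sketch, as the actual patch would express it through the extension's schema definitions:

ALTER TABLE page_assessments
  MODIFY pa_class VARBINARY(80) DEFAULT NULL,
  MODIFY pa_importance VARBINARY(80) DEFAULT NULL;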

Event Timeline

Restricted Application added a subscriber: Aklapper.

Tagging DBA for advice on the proposed schema change.

Change 890869 had a related patch set uploaded (by Samtar; author: Samtar):

[mediawiki/extensions/PageAssessments@master] Schema: Increase VARBINARY size for pa_class and pa_importance

https://gerrit.wikimedia.org/r/890869

Hi, DB architect from data persistence here. The solution is feasible, but generally this is not good DB schema design. I recommend normalizing pa_importance and pa_class into a dedicated table like page_assessment_types and using NameTableStore to read/write the data.
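A minimal sketch of the kind of lookup table being suggested; the pat_* column names are illustrative, not taken from any patch. NameTableStore would then cache the name-to-id mapping, so page_assessments could store a small integer instead of a localised string:

CREATE TABLE page_assessment_types (
  pat_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  pat_name VARBINARY(255) NOT NULL,
  PRIMARY KEY (pat_id),
  UNIQUE KEY pat_name (pat_name)
);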

Change 890869 abandoned by Samtar:

[mediawiki/extensions/PageAssessments@master] Schema: Increase VARBINARY size for pa_class and pa_importance

Reason:

Will be implementing the suggestion at T330173#8633687 instead — patch not needed :)

https://gerrit.wikimedia.org/r/890869

MusikAnimal subscribed.

I started on this, but it's more of a "free time" project so I'm not putting it in the sprint just yet.

@Ladsgroup for when you return -- or anyone else who can answer this question (@TheDJ perhaps), I see this as a multi-step process similar to other schema changes done in the past:

  1. Add a new table, which I'm calling page_assessments_values. It contains the unique values for assessment classes and importance levels (a rough SQL sketch follows this comment).
  2. Add new columns to page_assessments, pa_class_id and pa_importance_id, which reference the primary key of the values table, pav_id.
  3. Add code so that it writes to both the new and old columns, as well as the new table, but still reads from the old columns.
  4. Deploy the schema change along with the new code, in that order.
  5. Run a migration script to backfill data in page_assessments_values.
  6. Remove the old string-based pa_class and pa_importance columns from page_assessments.
  7. Change the code to only read/write the new columns/table.
  8. Deploy the final schema change, then the new code, and drink a beer.

Does that sound about right? Should I coordinate with DBAs/deployers for each individual step, or is it okay to deal with 1-4 together, and 5-8 together?
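For concreteness, a rough SQL sketch of what steps 1, 2 and 5 might look like. Names other than pav_id are taken from the plan above or are purely illustrative (e.g. pav_value), the real DDL would live in the extension's schema files, and the real backfill would be a batched maintenance script rather than single statements:

-- Step 1: the values lookup table
CREATE TABLE page_assessments_values (
  pav_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  pav_value VARBINARY(255) NOT NULL,
  PRIMARY KEY (pav_id),
  UNIQUE KEY pav_value (pav_value)
);

-- Step 2: id columns alongside the existing string columns
ALTER TABLE page_assessments
  ADD COLUMN pa_class_id INT UNSIGNED DEFAULT NULL,
  ADD COLUMN pa_importance_id INT UNSIGNED DEFAULT NULL;

-- Step 5: backfill the lookup table, then point the new id columns at it
INSERT IGNORE INTO page_assessments_values (pav_value)
  SELECT DISTINCT pa_class FROM page_assessments WHERE pa_class IS NOT NULL;
INSERT IGNORE INTO page_assessments_values (pav_value)
  SELECT DISTINCT pa_importance FROM page_assessments WHERE pa_importance IS NOT NULL;
UPDATE page_assessments p
  JOIN page_assessments_values v ON v.pav_value = p.pa_class
  SET p.pa_class_id = v.pav_id;
UPDATE page_assessments p
  JOIN page_assessments_values v ON v.pav_value = p.pa_importance
  SET p.pa_importance_id = v.pav_id;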

This is fantastic, thank you! I also just found https://wikitech.wikimedia.org/wiki/Schema_changes. Not sure if any info should be copied over there.

I'm planning to overhaul the documentation regarding this soon. Stay tuned.

Marostegui subscribed.

Removing Schema-change-in-production as this is not ready for DBAs to execute.