Page MenuHomePhabricator

MediaWiki database policy and/or guidelines (2019)
Open, Needs TriagePublic

Description

Objective

  • The schemas and queries part of MediaWiki core and WMF-maintained and -deployed extensions/skins should adhere to a set of common standards.
  • The DBA review process from WMF for schema changes should be mandatory for these schema changes.
  • Contributors should be able to find the workflow for proposing and implementing a schema change.

Previous policy (from 2015)

From revision 3113832 on mediawiki.org:
Database policy

Per the RFC "Increase the strictness of MediaWiki SQL code" (T112637):

  • WMF will be enabling MariaDB/MySQL's strict mode (T108255), which will be the default anyway in MySQL 5.7. Prior to this, all new code must not generate any warning.
  • Code that touches the database must be compatible with the following MySQL SQL modes:
  • TRADITIONAL (equivalent to STRICT_TRANS_TABLES, STRICT_ALL_TABLES, NO_ZERO_IN_DATE, NO_ZERO_DATE, ERROR_FOR_DIVISION_BY_ZERO, NO_AUTO_CREATE_USER)
  • ONLY_FULL_GROUP_BY
  • Database code must be compatible with older versions of databases as listed in the MediaWiki installation requirements for database server (e.g. the stable MediaWiki release requires MySQL 5.5.8 or later). However, performance improvements that would only apply to the newest, supported versions (or its default or widely recommended defaults) should be favoured over those that apply only to unsupported releases (5.1 and lower).
  • All tables must have a primary key. When a candidate for primary key could not be created (for example, if all columns can be repeated), an auto_increment or another arbitrary value, depending on the case, has to be added.
  • Non-deterministic queries and unsafe statements for binlog should be avoided as they would return/write different results in a replication environment. The latter can be detected as warnings with the text "[Warning] Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT". Those include INSERT ... SELECT when using an auto_increment key, UPDATE ... LIMIT without an ORDER BY, and using non-deterministic functions like SYSDATE(). MySQL Reference Manual has more info.
  • Primary keys, and fields that reference them, should be unsigned, in order to increase the maximum values. For more details, see WEBINAR-MySQL-Indexing-Best-Practices.pdf (404 Error). Database optimization (Draft) practices should also be followed.
    1. Database patches

If you change the database schema, follow these rules:

  • Update the installer – Please update includes/installer and add an appropriate SQL update file to maintenance/archives/. The naming convention, if you're adding a field, is patch-[table]-[field].sql. If you're removing a field, it's patch-drop-[table]-[field].sql. If you're adding a table, it's patch-[table].sql. Look at the commit history of includes/installer/ to find examples of how it's done. If you're adding a bunch of fields to the same table, make all those changes in one query in one patch file.
  • Make your schema change optional – All schema changes must go through a period of being optional. Some examples:
    • Instead of changing the format of a column, create a new column, make all writes happen to the old and new column (if it exists) and deprecate use of the old column. Check if the new column exists before blindly assuming that it does. Only eliminate support for the old column after it's clear the schema migration has completed and there's no chance that we'll need to roll back to the old version of the software. If this doesn't seem feasible, send mail to Wikitech-l asking for advice.
    • You could set your new feature to only work if a config option is set to true, and set the option to false by default. Then the commit can be safely deployed before the schema change is made. To deploy your feature to the Wikimedia cluster, file a ticket in Phabricator in the relevant project with the #schema-change tag. Once you've confirmed the change has been made, you can remove the config option to enable your feature.
    • Note that this means your schema change should be optional in code - for Wikimedia deployments, it is expected that every wiki with the relevant database table(s) will have the schema change applied to them. If you need different schema for different wikis, then apply the change using an extension and creating new tables dependent on that extension. There might be cases where the "make your schema change optional" rule will be prohibitive from a performance or logistics perspective. However, schema changes like that should be rare to begin with, and should have prominent discussion on the Wikitech-l mailing list. In the case where it's impossible to make your schema change optional, it is still critical to write scripts to roll back to the pre-change state.
  • Search for input from a WMF Database Administrator – MediaWiki is deployed to Wikimedia websites every week, and it takes considerable planning to apply schema changes to MySQL-based sites the size of Wikipedia. As of September 2017, Jaime Crespo (jcrespo on LDAP, jynus on irc and Manuel Arostegui, marostegui) are the best people to add to database reviews. In most cases, input is just needed on the logistics of the change.
  • Test your changes on Beta - in particular, it is a common mistake to change indexes and column definitions that would result in different query plans. Try to test the generated queries' plan with tools such as EXPLAIN; not doing so could mean that, when scaled to production, queries that only take 1 second locally, they pileup on production when they receive much more traffic and have larger tables.

Proposal 1

Event Timeline

Krinkle created this task.Apr 3 2019, 11:54 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptApr 3 2019, 11:54 PM

Ref T190379#5073911.

I think the first two points in the objective should be addressed by reincorporating them into the development policy, similar to before.

The third point (about Phab workflow, coding format for installer patches, Beta testing, contact persons, etc.) seems better fitted as a guideline alongside other coding conventions and tutorials we have.

Jdforrester-WMF's edits are fair, and I think we should separate changes to mediawiki data architecture, which should be discussed with any or every mediawiki stakeholder, and changes to WMF infrastructure which should be discussed with (at least) the DBAs. Changes to "this_extension_table" are probably only interesting to the extension developers and operations people if the extension is deployed, but changes to the "revision" table should probably discussed at a technical meeting or somehow wider audience (even security, performance or legal may want to be involved depending on the philosophy of the changes proposed, and in the past, cloud and analytics also were affected by large refactorings)? I don't know how to organize that.

On the other side, developers proposing changes should be able to propose patches without too much paperwork, and they should have a voice but probably no vote on how those other teams organize internally for the reviews (e.g. a wmf security review should be organized however wmf security team thinks best for them, as long as the workflow is clear, fair, simple and does not create an excesive burden to others).

Not sure how to take all of that into consideration and what are the specific actionables to this, and which people should be involved on those.

I propose the following workflow:

  • DBAs write (rewrite/review) a proposed wmf schema changes workflow and presents it to other stakeholders, then request comments (this would be similar to the performance review or security reviews). That, being a WMF technology issue will live on Wikitech, but should be heavily referenced on mediawiki.org, specially substituting the outdated references
  • People involved on Mediawiki policy, with the help of DBAs -or course, review the 2015 approved policy (and the bits that were not approved) and update any issues or problems with the original text, then we search consensus again with all mw stakeholders for it
Anomie added a subscriber: Anomie.Apr 4 2019, 1:54 PM
Marostegui moved this task from Triage to Meta/Epic on the DBA board.Apr 5 2019, 10:02 AM
Joe added a subscriber: Joe.Apr 8 2019, 10:28 AM
daniel moved this task from Inbox to Backlog on the TechCom-RFC board.Apr 10 2019, 7:44 PM
daniel added a subscriber: daniel.

Moving this to the RFC blacklog, per @jcrespo's propsed workflow.