
Build a machine-readable catalogue of mariadb tables in production
Open, Medium, Public

Description

Problem: Currently, information about the tables we have in production is scattered (e.g. the list of private tables) or non-existent (try asking "what is the pif_edits table and what is it doing in our production?"). This also hinders our ability to automate more work (between-replica comparison checks, over-arching drift tracking, and so on).

Proposal: Introduce a file in puppet containing information on tables. Something like this:

version: 1
tables:
  - name: abuse_filter
    pk: af_id
    source: abusefilter
    canonicality: canonical
    size: small
    visibility: partially public
  - name: urlshortcodes
    pk: usc_id
    sections:
      - x1
    dbs:
      - wikishared
    source: urlshortener
    canonicality: canonical
    size: small
    visibility: public
  - name: flaggedtemplates
    dblist: flaggedrevs.dblist
    status: dropped
    canonicality: partially canonical
    size: large
    visibility: public
sources:
  abusefilter:
    - path: extensions/AbuseFilter/db_patches/tables.json
      gerrit: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/AbuseFilter/+/refs/heads/master/db_patches/tables.json

This could then become the foundation for many use cases and replace ad-hoc table listings.

It could also easily be pulled in to the doc.wikimedia.org hosts and served in a more user-friendly way (search, filtering, all that jazz).
Notes:

  • This will grow way too large; to avoid that:
    • MediaWiki and non-MW tables should have their own catalogues. That way we can also start with MediaWiki and then move on to non-MW tables.
    • We need a lot of defaults: an omitted field means "the default", e.g. dblist = all, sections: s1-s8, db: the wiki's own db, status: active (see the sketch after this list).
    • We could probably drop some options, like size (it's also subjective).
  • Part of reviewing and approving new tables should be adding the new table to the catalogue
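
To make the defaults idea concrete, here is a rough sketch of what reading the catalogue could look like (the tables-catalog.yaml file name, the default values, and the use of PyYAML are assumptions, not part of the proposal):

# Minimal sketch: apply catalogue-wide defaults to entries that omit fields.
# File name and default values are assumptions for illustration only.
import yaml

DEFAULTS = {
    "dblist": "all.dblist",  # assumed default: the table exists on all wikis
    "sections": ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"],
    "dbs": None,             # None meaning "the wiki's own database"
    "status": "active",
}

def load_catalogue(path="tables-catalog.yaml"):
    with open(path) as f:
        data = yaml.safe_load(f)
    # Explicit fields in an entry override the defaults.
    return [{**DEFAULTS, **entry} for entry in data["tables"]]

for table in load_catalogue():
    print(table["name"], table["sections"], table["status"])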

Comments?

Event Timeline

Ladsgroup triaged this task as Medium priority. Fri, Apr 26, 2:16 PM
Ladsgroup moved this task from Triage to In progress on the DBA board.
Ladsgroup renamed this task from Build a machine-readable catalogue of tables in production to Build a machine-readable catalogue of mariadb tables in production. Fri, Apr 26, 2:18 PM

I like the idea! A few questions

  • Do you intend to have some monitoring checks or a dashboard to compare the catalog to actual live prod state?
  • What's the actual meaning for the size field? And for canonicality?

I like the idea! A few questions

  • Do you intend to have some monitoring checks or a dashboard to compare the catalog to actual live prod state?

We have had something like this for many years now, but it's manual: e.g. https://drift-tracker.toolforge.org/report/checkuser/
I had to manually add each tables.json file to the script. This catalogue removes that barrier and makes it easier to run.
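
A periodic check along these lines could be fairly small. A rough sketch, not the actual drift-tracker code (the host, credentials and tables-catalog.yaml path are made up; assumes PyYAML and PyMySQL):

# Compare catalogued (active) table names against what actually exists on a replica.
import pymysql
import yaml

with open("tables-catalog.yaml") as f:
    catalogued = {t["name"] for t in yaml.safe_load(f)["tables"]
                  if t.get("status", "active") == "active"}

# Hypothetical replica host and credentials.
conn = pymysql.connect(host="db-replica.example.wmnet", user="check",
                       password="...", database="enwiki")
with conn.cursor() as cur:
    cur.execute("SHOW TABLES")
    live = {row[0] for row in cur.fetchall()}

print("In production but not in the catalogue:", sorted(live - catalogued))
print("In the catalogue but missing in production:", sorted(catalogued - live))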

  • What's the actual meaning for the size field? And for canonicality?

Right after filing the ticket I realized we will need a README file explaining all the fields.
size means how big the table is; the options are: empty, small, medium, large (similar to how we split wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/large.dblist).

canonicality is a bit more complicated. Many MediaWiki tables store derivative data that can be repaired by a reparse (like pagelinks); these tables are super big and take a lot of space. Other tables store information whose integrity is of high importance to protect (e.g. user, revision, etc.). We are trying to build automation that at least compares checksums of these tables across different replicas to improve integrity (T207253: Automatically compare a few tables per section between hosts and DC).
With this catalogue we can feed this file to that automation. The available options for canonicality are: derivative, partially canonical, and canonical.
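
To make the link to T207253 concrete, a minimal sketch of how the catalogue could feed a cross-replica comparison (hosts, credentials and the file path are assumptions; CHECKSUM TABLE is just one possible comparison primitive):

# Feed canonical tables from the catalogue into a naive cross-replica checksum check.
import pymysql
import yaml

def table_checksum(host, db, table):
    conn = pymysql.connect(host=host, user="check", password="...", database=db)
    with conn.cursor() as cur:
        cur.execute(f"CHECKSUM TABLE `{table}`")
        return cur.fetchone()[1]  # row is (table_name, checksum)

with open("tables-catalog.yaml") as f:
    catalogue = yaml.safe_load(f)["tables"]

canonical = [t["name"] for t in catalogue if t.get("canonicality") == "canonical"]

replicas = ["db1111.example.wmnet", "db1112.example.wmnet"]  # hypothetical hosts
for table in canonical:
    sums = {host: table_checksum(host, "enwiki", table) for host in replicas}
    if len(set(sums.values())) > 1:
        print(f"checksum mismatch for {table}: {sums}")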

How will this be kept in sync with tables.json in MW core and extensions? Would it be possible to periodically scan for discrepancies?

How will this be kept in sync with tables.json in MW core and extensions?

I tried not to include any information that's already in tables.json (as you can see, it doesn't even have a "comment" field for tables). The idea is that you get the table name and then follow the source to the abstract schema to collect further information. The only possible drift from what I'm seeing is missing or extra tables.

Would it be possible to periodically scan for discrepancies?

That'd be quite possible and easy to do.
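
For example, a scan of that kind could look roughly like this (it assumes local checkouts of core/extensions under an illustrative path and the usual abstract-schema tables.json layout):

# Detect missing/extra tables by comparing catalogue names against the names
# declared in each source's abstract-schema tables.json.
import json
import yaml

with open("tables-catalog.yaml") as f:
    catalogue = yaml.safe_load(f)

catalogued = {t["name"] for t in catalogue["tables"]}

declared = set()
for entries in catalogue.get("sources", {}).values():
    for entry in entries:
        # Assumed checkout location; the "path" field comes from the catalogue itself.
        with open("/srv/mediawiki/" + entry["path"]) as f:
            declared |= {t["name"] for t in json.load(f)}

print("In tables.json but not in the catalogue:", sorted(declared - catalogued))
print("In the catalogue but not in any tables.json:", sorted(catalogued - declared))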

The idea seems great!

How will this be kept in sync with tables.json in MW core and extensions?

I tried not to include any information that's already in tables.json (as you can see, it doesn't even have a "comment" field for tables).

I'm wondering what should be the conceptual difference between information that'd be in tables.json and information that'd be in the proposed system. Should the distinction be "information that we need to create the tables is in tables.json, everything else is in the new system"? Or should it be "information that is MW-specific is in tables.json, whereas information that is WMF-specific is in the new system"? In the latter case, it might make sense to move certain fields to tables.json instead (such as canonicality).

The only possible drift from what I'm seeing is missing or extra tables.

AFAICS, the pk field is redundant to tables.json. Both the example in the description and https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/AbuseFilter/+/refs/heads/master/db_patches/tables.json define that the primary key of abuse_filter is af_id. Is that intentional?

Part of reviewing and approving new tables should be adding the new table to the catalogue

How would we do this in practice? Should we restrict createExtensionTables.php to only work for tables that are in the catalog? Or should we rely on monitoring alerting when someone creates a table without updating the catalog?

How will this be kept in sync with tables.json in MW core and extensions? Would it be possible to periodically scan for discrepancies?

Another source of discrepancy is that some tables are supposed to be public but are never mirrored to the cloud replicas. For example, T345089: expose entityschema_id_counter table to cloud replica
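
A scan of the same kind could catch that case too: anything marked public (or partially public) in the catalogue should have a corresponding view on the wiki replicas. A rough sketch (the replica host, the _p database naming and the credentials are assumptions):

# Flag tables marked public in the catalogue that have no view on the cloud replica.
import pymysql
import yaml

with open("tables-catalog.yaml") as f:
    tables = yaml.safe_load(f)["tables"]

public = {t["name"] for t in tables if "public" in t.get("visibility", "")}

conn = pymysql.connect(host="enwiki.replica.example", user="s12345",
                       password="...", database="enwiki_p")
with conn.cursor() as cur:
    cur.execute("SHOW FULL TABLES WHERE Table_type = 'VIEW'")
    views = {row[0] for row in cur.fetchall()}

print("Marked public but not mirrored to the cloud replicas:", sorted(public - views))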

I'm wondering what should be the conceptual difference between information that'd be in tables.json and information that'd be in the proposed system. Should the distinction be "information that we need to create the tables is in tables.json, everything else is in the new system"? Or should it be "information that is MW-specific is in tables.json, whereas information that is WMF-specific is in the new system"? In the latter case, it might make sense to move certain fields to tables.json instead (such as canonicality).

I initially wanted to do the latter, but that would mean MediaWiki needs to support "informational tags" in tables.json, and that can cause all sorts of mayhem. I'd say for now, let's go with the former distinction.

AFAICS, the pk field is redundant to tables.json. Both the example in the description and https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/AbuseFilter/+/refs/heads/master/db_patches/tables.json define that the primary key of abuse_filter is af_id. Is that intentional?

I should have removed pk; in that patch I wasn't planning to create a generalized catalogue (it's only for automatic db-compare runs between replicas). I think I will still need a separate yaml file for db-compare runs, but it will need to hold way less information. Maybe I'll add pk there, since resolving it through the source files might be complicated on the cumin hosts, but it still doesn't make sense to hold that information in the generalized catalogue.

How would we do this in practice? Should we restrict createExtensionTables.php to only work for tables that are in the catalog? Or should we rely on monitoring alerting when someone creates a table without updating the catalog?

I was focusing on the social part: we do have a policy that needs to be followed before creating tables and we can add it there: https://wikitech.wikimedia.org/wiki/Creating_new_tables. We can eventually make createExtensionTables.php refuse to create tables that are not in the catalogue, but that's not my highest concern right now.
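
Purely as an illustration of what such a gate could look like (createExtensionTables.php itself is a PHP maintenance script; this is a hypothetical standalone pre-check with an assumed catalogue path):

# Refuse to proceed when a table is not listed in the catalogue.
import sys
import yaml

def assert_in_catalogue(table_name, path="tables-catalog.yaml"):
    with open(path) as f:
        names = {t["name"] for t in yaml.safe_load(f)["tables"]}
    if table_name not in names:
        sys.exit(f"Refusing to create {table_name}: not in the table catalogue. "
                 "See https://wikitech.wikimedia.org/wiki/Creating_new_tables")

assert_in_catalogue("urlshortcodes")  # example table from the description above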