
Add systemd timer to run `maintain-meta_p` daily on all Wiki Replica servers
Open, MediumPublic

Description

Handling T246056: Remove any references to fixcopyrightwiki from the meta-index in Wikimedia Cloud Services reminded me that we really should be running sudo maintain-meta_p --all-databases --purge more often. I feel like running it automatically has been discussed in the past, but I do not remember why we decided not to do it.

Tagging with DBA only to get their input on if this is a really bad idea. WMCS folks would take care of the Puppet automation and babysitting.

Event Timeline

I know that some tools (or at least one, because I was pinged about it last time) will start trying to poll a database as soon as maintain-meta_p has run, even if maintain-views hasn't run yet. I also know that, at the moment, a database can end up visible before it is ready, though maintain-meta_p won't expose it.

I'm not sure if that is entirely a reason to avoid it, though.

Announcement of a non-ready database is a problem that can happen even under manual control. I think we would want to add some new logic to maintain-meta_p (even if we don't cron it) to check for the bare database, the views in the database, and the expected DNS records before it adds a new row. When I ran the script for T246056 it added the row for ngwikimedia (T240772), which had the bare db but not yet the views or DNS records.
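The pre-insert validation proposed above could be sketched roughly as follows. This is only an illustration: the function name, its arguments, and the view list are hypothetical, and the real check would run queries against the replica and resolve DNS rather than take booleans and lists as inputs.

```python
# Hypothetical sketch of the proposed validation step; the real check would
# inspect the wiki's database on the replica and its DNS records directly.
EXPECTED_VIEWS = {"page", "revision", "user"}  # illustrative subset only

def ready_to_announce(bare_db_exists, visible_views, dns_records):
    """A wiki earns a meta_p.wiki row only when the bare database,
    the sanitized view layer, and its DNS records are all in place."""
    return (
        bare_db_exists
        and EXPECTED_VIEWS <= set(visible_views)
        and bool(dns_records)
    )

# The ngwikimedia case above: bare db only, no views or DNS yet.
print(ready_to_announce(True, [], []))  # False
```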

Marostegui added a subscriber: Marostegui.

My worry is the same as @Bstorm's: that this could expose a new, non-sanitized database. For now, due to an existing MariaDB bug in 10.1 (fixed in 10.4, https://jira.mariadb.org/browse/MDEV-14732), we need to manually add the labsdbuser grant to each new wiki, so in some cases that will prevent the views from being created.

Anyway, if maintain-meta_p won't expose new, non-sanitized databases, that's good.

My second concern is: will this face metadata locking issues? We have often seen that, when recreating views, we run into metadata locking because there may be processes querying the existing databases. I am not entirely sure what maintain-meta_p really does, so I'm asking just in case.

maintain-meta_p is a Python script that populates a local database named "meta_p", deployed on each of the Wiki Replica servers outside of the normal replication process. This database contains three tables: meta_p.wiki, meta_p.legacy, and meta_p.properties_anon_whitelist.

A maintain-meta_p --all-databases --purge run does:

  • Ensure that the 3 tables are defined (seed_schema(ops))
  • START TRANSACTION;
  • TRUNCATE meta_p.wiki;
  • Read dblist files and other files in wmf-config to figure out the list of wikis which are in theory exposed on the Wiki Replica servers
  • For each wiki:
    • Call the action=query&meta=siteinfo API endpoint for that wiki
    • Collect the query.general.sitename and query.general.lang values from the API call to add to other information gathered from the config files
    • INSERT INTO meta_p.wiki (...) VALUES (...)
  • COMMIT;
  • START TRANSACTION;
  • DELETE FROM meta_p.properties_anon_whitelist;
  • INSERT INTO meta_p.properties_anon_whitelist VALUES ('gadget-%'), ('language'), ('skin'), ('variant');
  • COMMIT;
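The purge-and-repopulate pattern in the steps above can be sketched like this. SQLite stands in for MariaDB here (so TRUNCATE becomes DELETE), and the hard-coded wiki tuple stands in for the dblist files plus the siteinfo API lookups; none of the names below are from the actual script.

```python
# Sketch of the transactional purge-and-repopulate that maintain-meta_p
# performs, using SQLite as a stand-in for MariaDB.
import sqlite3

def repopulate_wiki_table(conn, wikis):
    """Replace the contents of the wiki table atomically.

    `wikis` is an iterable of (dbname, lang, name, family, url) tuples; in
    the real script these values come from wmf-config dblists plus each
    wiki's action=query&meta=siteinfo API response.
    """
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("DELETE FROM wiki")  # stands in for TRUNCATE meta_p.wiki
        conn.executemany(
            "INSERT INTO wiki (dbname, lang, name, family, url)"
            " VALUES (?, ?, ?, ?, ?)",
            wikis,
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wiki"
    " (dbname TEXT PRIMARY KEY, lang TEXT, name TEXT, family TEXT, url TEXT)"
)
repopulate_wiki_table(
    conn,
    [("enwiki", "en", "Wikipedia", "wikipedia", "https://en.wikipedia.org")],
)
print(conn.execute("SELECT dbname, family FROM wiki").fetchall())
```

Because the delete and the inserts happen inside one transaction, readers never observe a half-empty table, which is the property the COMMIT-bounded steps above rely on.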

A major use case for meta_p.wiki is using it as a lookup table to decide which slice to connect to when querying a particular wiki. A related use case is finding all wikis in a "family" (wikipedia, wikibooks, wiktionary, ...) to perform some other operation on. A third use case is looking up the base URL of a given wiki.

$ sql meta_p
(u3518@s7.analytics.db.svc.eqiad.wmflabs) [meta_p]> select * from wiki where dbname='enwiki'\G
*************************** 1. row ***************************
          dbname: enwiki
            lang: en
            name: Wikipedia
          family: wikipedia
             url: https://en.wikipedia.org
            size: 3
           slice: s1.labsdb
       is_closed: 0
        has_echo: 1
 has_flaggedrevs: 1
has_visualeditor: 1
    has_wikidata: 1
    is_sensitive: 0
1 row in set (0.00 sec)
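The slice-lookup use case from that row might look like the following sketch in a tool. The helper and the hard-coded mapping are illustrative only; a real tool would read the slice from a live meta_p.wiki query. The hostname pattern mirrors the analytics replica service name shown in the session above.

```python
# Hypothetical helper for the "which slice do I connect to?" use case.
# A real tool would run: SELECT slice FROM meta_p.wiki WHERE dbname = ?
SLICE_BY_DBNAME = {"enwiki": "s1.labsdb"}  # stand-in for a meta_p.wiki query

def analytics_host_for(dbname):
    slice_name = SLICE_BY_DBNAME[dbname]  # e.g. "s1.labsdb"
    shard = slice_name.split(".", 1)[0]   # "s1"
    return f"{shard}.analytics.db.svc.eqiad.wmflabs"

print(analytics_host_for("enwiki"))  # s1.analytics.db.svc.eqiad.wmflabs
```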

I do not think metadata locking is likely to be a problem with this script, as it functionally never changes the table structure. It does purge and repopulate the table, but it does so inside a transaction, which should keep the changes relatively isolated from active queries.

Thanks for the detailed explanation @bd808! Indeed, it doesn't look like this will suffer from the metadata locking issue we have when re-creating views.

Discussed in the 2020-03-11 WMCS team meeting. After some discussion of pros and cons we feel that this would be a good way to proceed:

  • Review and update the maintain-meta_p python script to include a validation check of each database to ensure that the view layer exists
  • Verify that script works as intended and that its logging is useful
  • Introduce the systemd timer to run it
  • profit! :)
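For the timer step, a minimal shape for the unit pair might look like the sketch below. The unit names, paths, and schedule are assumptions for illustration; the real units would be generated by the Puppet automation mentioned in the description.

```ini
# /etc/systemd/system/maintain-meta-p.service  (illustrative name and path)
[Unit]
Description=Refresh the meta_p database on this Wiki Replica

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/maintain-meta_p --all-databases --purge

# /etc/systemd/system/maintain-meta-p.timer
[Unit]
Description=Run maintain-meta_p daily

[Timer]
OnCalendar=daily
RandomizedDelaySec=1h
Persistent=true

[Install]
WantedBy=timers.target
```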

  • Review and update the maintain-meta_p python script to include a validation check of each database to ensure that the view layer exists

What would happen if there is a database with no view associated?
Thank you!

bd808 triaged this task as Medium priority.Mar 11 2020, 11:32 PM
bd808 moved this task from Needs discussion to Soon! on the cloud-services-team (Kanban) board.
  • Review and update the maintain-meta_p python script to include a validation check of each database to ensure that the view layer exists

What would happen if there is a database with no view associated?

No row would be created in the meta_p.wiki table if the view layer was not in place.

This would keep tools that use meta_p.wiki to iterate over the wikis in a family from attempting to connect to a wiki that is not ready. The check would be easiest if it were local to the Wiki Replica instance where maintain-meta_p is being run, so there could still be a gap for users who query the meta_p database on one replica server and then connect to a different replica server where the view layer has not yet been created, but this should be a really narrow edge case. In theory it should only happen if the script fired in the middle of someone creating the views across the cluster, or if a hard reload of the underlying db tables had been done to fix instance corruption and somehow the views were not recreated before the instance was repooled.

This validation process will need to be revisited in the future if we split the slices up across multiple instances, but I expect that a whole lot of things will need to be reexamined if and when that breaking change happens.

Thanks for the clarification. My only worries were the ones described at T246948#5943918