Wikimedia uses over 200 MariaDB instances to store content and metadata for Wikipedia and other free knowledge projects. It also uses Bacula, mydumper and xtrabackup to perform backups if its hundreds of terabytes of data in its infrastructure. While standard open source tools for both monitoring and automation are used when possible, there are some tasks that require custom development. There are several options for a project to choose here, of which you should do your own research about which is the right one (only one) to complete within the 175 hours of work:
== MariaDB instances inventory (zarcillo web interface) ==
There is an [[ https://phabricator.wikimedia.org/P14338 | existing database "zarcillo" ]] that inventories the existing mariadb instances. However, edits and check to it are using SQL directly. The aim is to change that into a web-based CRUD to inventory mariadb instances and its basic properties (servers, sections, etc.) and also develop a web api for easy querying from external services.
* Relevant script [[ https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/files/prometheus/mysqld_exporter_config.py;blame=off | using the zarcillo db to generate prometheus targets ]]
== Database object inventory ==
We would like to keep track of existing objects (databases, tables, columns, indexes) in our over 1000 databases throughout over 200 mariadb instances, and then detect (alert) on differences with the latest version of the schema deployed, in order to:
* Facilitate and schedule schema changes
* Detect backups errors (e.g. missing objects on backup)
* Provision more instances/more disks due to growth
For that, we need to 1) scrape information keep a metadata database up to data with reality 2) provide reports of interest (number of objects, size) 3) regularly check for inconsistencies over the same objects on different servers (e.g. a column has a different type on 2 different servers).
There is already some shy initiatives working on this direction:
* [[ https://phabricator.wikimedia.org/P14338 | Zarcillo ]] has a tables database, but it is not used a lot
* [[https://wikitech.wikimedia.org/wiki/MariaDB/Backups#Metadata | Backups metadata database]] has an objects table, but it is also not in use
* Ladsgroup has developed [[ https://phabricator.wikimedia.org/T104459 | some scripts ]] to compare production and [[ https://phabricator.wikimedia.org/source/mediawiki/browse/master/maintenance/tables.sql | tables.sql ]]/[[ https://gerrit.wikimedia.org/g/mediawiki/core/+/HEAD/maintenance/tables.json | tables.json ]] (current mediawiki schema)
We would like to merge all those efforts into a canonical metadata solution to monitor db objects.
== Database backup inventory improvements ==
Database backups provide a [[ https://wikitech.wikimedia.org/wiki/MariaDB/Backups#Metadata | database (mariadb) structured log of backup output ]], including timing, success, and backed up objects, so we can track that the dozens of backups produced everyday finish correctly. While there is [[ https://phabricator.wikimedia.org/diffusion/OSWB/browse/master/wmfbackups/check/check_mariadb_backups.py | a basic check ]] to ensure backups are fresh, we would like to expand the usefulness of this database by providing more fine grained info:
* A postprocesed information of the size of each backup object (filling in the backup_objects table)
* A web dashboard to the metadata, that easily reports the status of backups or its errors
* A web dashboard that tracks the status of ongoing backups (next scheduled run, running, postprocessing, finishing ETA, etc.)
* Relevant repository: [[ https://phabricator.wikimedia.org/diffusion/OSWB/ | operations/software/wmfbackups ]]
== Improve WMF Bacula monitoring ==
While there is already a basic [[ https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/bacula/files/check_bacula.py;961ccde2ff8ef7b193df437ed154bcf69b688fd0?as=source&blame=off | check_bacula.py script ]], that makes sure backups are being correctly taken, as well as exporting to prometheus the necessary metrics, and basic cli information, we would like to expand existing functionality.
Some of the ways in which the script could be complemented would be:
* Provide additional metrics regarding retention time, media storage, available disk space, etc.
* Provide additional information on command line that makes simpler to query backup status by engineers
* Improve the [[ https://grafana.wikimedia.org/d/413r2vbWk/bacula | Grafana bacula dashboard ]] or create additional dashboards with those more detailed statistics
* If Grafana is not the right technology for that, improve the visibility of bacula statistics by providing a nice web dashboard summarizing the status of bacula, that helps engineers understand backup status
* Relevant repository: [[ https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/bacula/manifests/director.pp?blame=off | WMF Bacula module on puppet ]] and [[ https://www.bacula.org/git/cgit.cgi/bacula/tree/bacula | Bacula source code ]]
== MySQL account metadata inventory ==
Currently, most of account management for MySQL happens on Puppet configuration management tool. However, puppet may be far from ideal for deployment of certain changes that have to be done synchronously throughout the cluster, as well as monitoring anomalies.
That is why we want to develop a tool that help track and monitor account configuration among different distributed instances. Such tool should be able to:
* Connect and read the account information as well as permissions (grants) for each instances
* Track what should be the account information for a given instances (including different roles, usages and peculiarities of different group of servers)
* Alert on anomalies found, such as accounts or grants missing from a Mariadb server, passwordless accounts or grants too open for a given account.
This will require maintaining a database inventory of accounts and instances, and a way to emit alerts (web interface or through a provided monitoring tool api).
* Relevant script: [[ https://phabricator.wikimedia.org/diffusion/OSOF/browse/master/dbtools/report_users.sh?blame=off | Reporting users with no password ]]
NOTE: * **Skills required:** Python (preferred, probably with Django or Flask), PHP, basic databases and SQL/file management knowledge
NOTE: * **Mentors**: [[https://meta.wikimedia.org/wiki/User:JCrespo (WMF)|Jaime Crespo]] aka #jynus [@jcrespo], [[https://meta.wikimedia.org/wiki/User:MArostegui_(WMF)|Manuel Arostegui]] [@marostegui].
NOTE: General documentation about our current setup for context: https://wikitech.wikimedia.org/wiki/MariaDB | https://wikitech.wikimedia.org/wiki/MariaDB/Backups | https://wikitech.wikimedia.org/wiki/Bacula