Page MenuHomePhabricator

Gather statistics about the backups on a database
Closed, ResolvedPublic

Description

We need to gather statistics (metrics) about the backups generated and store them.

This includes research about which ones will be useful. Some previous discussion:

** Create separate MySQL instance to store data (tendril host)
*** Have canonical dblists on this DB (shard/misc, hostname, port)
*** Store data from files (size, name, number of compressed files)
*** What to store
**** db_name,table_name (not small), compressed size, compressed file date, source of the backup (dbstore2001|dbstore2002)

** How to monitor/what to check?
*** Monitor script output? (so far no access to exit code because it runs in a cron)
*** mydumper log?
*** file sizes?
*** number of files?
*** check dblists against backups files list?
*** canonical source for the tables?

Event Timeline

jcrespo triaged this task as Medium priority.Jul 6 2018, 6:23 PM
jcrespo created this task.
root@neodymium:~$ ./section s1
db1052.eqiad.wmnet      3306
db1067.eqiad.wmnet      3306
db1080.eqiad.wmnet      3306
db1083.eqiad.wmnet      3306
db1089.eqiad.wmnet      3306
db1099.eqiad.wmnet      3311
db1105.eqiad.wmnet      3311
db1106.eqiad.wmnet      3306
db1114.eqiad.wmnet      3306
db1118.eqiad.wmnet      3306
db1119.eqiad.wmnet      3306
db1124.eqiad.wmnet      3311
db2048.codfw.wmnet      3306
db2055.codfw.wmnet      3306
db2062.codfw.wmnet      3306
db2070.codfw.wmnet      3306
db2071.codfw.wmnet      3306
db2072.codfw.wmnet      3306
db2085.codfw.wmnet      3311
db2088.codfw.wmnet      3311
db2092.codfw.wmnet      3306
db2094.codfw.wmnet      3311
dbstore1001.eqiad.wmnet 3311
dbstore1002.eqiad.wmnet 3306
dbstore2002.codfw.wmnet 3311
labsdb1009.eqiad.wmnet  3306
labsdb1010.eqiad.wmnet  3306
labsdb1011.eqiad.wmnet  3306
root@neodymium:~$ ./section es1
es1012.eqiad.wmnet      3306
es1016.eqiad.wmnet      3306
es1018.eqiad.wmnet      3306
es2011.codfw.wmnet      3306
es2012.codfw.wmnet      3306
es2013.codfw.wmnet      3306
--
-- Table structure for table `instances`
--

DROP TABLE IF EXISTS `instances`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `instances` (
  `name` varchar(300) CHARACTER SET latin1 NOT NULL,
  `server` varchar(300) CHARACTER SET latin1 NOT NULL,
  `port` int(10) unsigned NOT NULL DEFAULT '3306',
  PRIMARY KEY (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
/*!40101 SET character_set_client = @saved_cs_client */;

--
-- Table structure for table `masters`
--

DROP TABLE IF EXISTS `masters`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `masters` (
  `section` varchar(16) CHARACTER SET latin1 NOT NULL,
  `dc` varchar(100) CHARACTER SET latin1 NOT NULL,
  `instance` varchar(100) CHARACTER SET latin1 NOT NULL,
  PRIMARY KEY (`section`,`dc`),
  KEY `instance` (`instance`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
/*!40101 SET character_set_client = @saved_cs_client */;

--
-- Table structure for table `section_instances`
--

DROP TABLE IF EXISTS `section_instances`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `section_instances` (
  `instance` varchar(100) CHARACTER SET latin1 NOT NULL,
  `section` varchar(16) CHARACTER SET latin1 NOT NULL,
  PRIMARY KEY (`instance`,`section`),
  KEY `section` (`section`,`instance`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
/*!40101 SET character_set_client = @saved_cs_client */;

--
-- Table structure for table `sections`
--

DROP TABLE IF EXISTS `sections`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `sections` (
  `name` varchar(16) NOT NULL,
  `mysql_group` varchar(16) DEFAULT NULL,
  PRIMARY KEY (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
/*!40101 SET character_set_client = @saved_cs_client */;

--
-- Table structure for table `servers`
--

DROP TABLE IF EXISTS `servers`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `servers` (
  `fqdn` varchar(300) CHARACTER SET latin1 NOT NULL,
  `hostname` varchar(100) CHARACTER SET latin1 DEFAULT NULL,
  `dc` varchar(100) DEFAULT NULL,
  `ipv4` int(10) unsigned DEFAULT NULL,
  `ipv6` binary(16) DEFAULT NULL,
  `last_boot` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`fqdn`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 ROW_FORMAT=DYNAMIC;

Change 444800 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Setup backups for zarcillo database on tendril

https://gerrit.wikimedia.org/r/444800

Change 444800 merged by Jcrespo:
[operations/puppet@production] mariadb: Setup backups for zarcillo database on tendril

https://gerrit.wikimedia.org/r/444800

So I have the first backup-specific statistics:

root@db1115.eqiad.wmnet[zarcillo]> select * FROM backup_files LIMIT 20;
+------+---------+---------------------+-----------------------------------------------+------------+---------------------+
| type | section | backup_date         | file_name                                     | size       | file_date           |
+------+---------+---------------------+-----------------------------------------------+------------+---------------------+
| dump | s1      | 2018-07-03 17:57:43 | enwiki-schema-create.sql.gz                   |         87 | 2018-07-03 17:57:43 |
| dump | s1      | 2018-07-03 17:57:43 | enwiki.abuse_filter-schema.sql.gz             |        409 | 2018-07-03 19:49:03 |
| dump | s1      | 2018-07-03 17:57:43 | enwiki.abuse_filter.sql.gz                    |     364358 | 2018-07-03 17:57:43 |
| dump | s1      | 2018-07-03 17:57:43 | enwiki.abuse_filter_action-schema.sql.gz      |        275 | 2018-07-03 19:49:03 |
| dump | s1      | 2018-07-03 17:57:43 | enwiki.abuse_filter_action.sql.gz             |       3899 | 2018-07-03 17:57:43 |
| dump | s1      | 2018-07-03 17:57:43 | enwiki.abuse_filter_history-schema.sql.gz     |        408 | 2018-07-03 19:49:03 |
| dump | s1      | 2018-07-03 17:57:43 | enwiki.abuse_filter_history.sql.gz            |   12847974 | 2018-07-03 17:57:46 |
| dump | s1      | 2018-07-03 17:57:43 | enwiki.abuse_filter_log-schema.sql.gz         |        453 | 2018-07-03 19:49:03 |
| dump | s1      | 2018-07-03 17:57:43 | enwiki.abuse_filter_log.00000.sql.gz          |  524914140 | 2018-07-03 18:00:16 |
| dump | s1      | 2018-07-03 17:57:43 | enwiki.actor-schema.sql.gz                    |        248 | 2018-07-03 19:49:03 |
| dump | s1      | 2018-07-03 17:57:43 | enwiki.arbcom1_vote-schema.sql.gz             |        351 | 2018-07-03 19:49:03 |
| dump | s1      | 2018-07-03 17:57:43 | enwiki.arbcom1_vote.sql.gz                    |      97006 | 2018-07-03 17:57:43 |
| dump | s1      | 2018-07-03 17:57:43 | enwiki.archive-schema.sql.gz                  |        469 | 2018-07-03 19:49:03 |
| dump | s1      | 2018-07-03 17:57:43 | enwiki.archive.00000.sql.gz                   | 3136213284 | 2018-07-03 18:04:18 |
| dump | s1      | 2018-07-03 17:57:43 | enwiki.archive.00001.sql.gz                   | 2428550099 | 2018-07-03 18:03:18 |
| dump | s1      | 2018-07-03 17:57:43 | enwiki.babel-schema.sql.gz                    |        239 | 2018-07-03 19:49:03 |
| dump | s1      | 2018-07-03 17:57:43 | enwiki.babel.sql.gz                           |      30322 | 2018-07-03 17:57:43 |
| dump | s1      | 2018-07-03 17:57:43 | enwiki.betafeatures_user_counts-schema.sql.gz |        229 | 2018-07-03 19:49:03 |
| dump | s1      | 2018-07-03 17:57:43 | enwiki.betafeatures_user_counts.sql.gz        |        561 | 2018-07-03 17:57:43 |
| dump | s1      | 2018-07-03 17:57:43 | enwiki.blob_orphans-schema.sql.gz             |        230 | 2018-07-03 19:49:03 |
+------+---------+---------------------+-----------------------------------------------+------------+---------------------+
20 rows in set (0.00 sec)

Interesting:

root@db1115.eqiad.wmnet[zarcillo]> select source, count(*) FROM backup_files GROUP BY source;
+------------------+----------+
| source           | count(*) |
+------------------+----------+
| dbstore1001:3311 |      468 |
| dbstore2002:3311 |      449 |
+------------------+----------+
2 rows in set (0.00 sec)
This comment was removed by Marostegui.

False alarm- empty tables do not get a data dump- we have to compare with schema dumps only, and they match:

mysql.py -BN -h db1115 zarcillo -e "SELECT DISTINCT SUBSTRING_INDEX(SUBSTRING_INDEX(file_name, '-schema.', 1), '.', -1) as tables FROM backup_files WHERE file_name like '%-schema.sql.gz' and source = 'dbstore1001:3311' and type = 'dump' and section = 's1' and backup_date = '2018-07-03 17:57:43' ORDER BY tables" > dbstore1001\:3311.backup.txt

mysql.py -BN -h db1115 zarcillo -e "select DISTINCT table_name FROM tables WHERE instance = 'dbstore1001:3311' and db = 'enwiki' ORDER BY table_name" > dbstore1001\:3311.tables.txt

diff dbstore1001\:3311.backup.txt dbstore1001\:3311.tables.txt

more useful stats (size is after compression):

root@db1115.eqiad.wmnet[zarcillo]> select source, section, backup_date, sum(size) from backup_files GROUP BY source, section, backup_date;
+------------------+---------+---------------------+--------------+
| source           | section | backup_date         | sum(size)    |
+------------------+---------+---------------------+--------------+
| dbstore1001:3311 | s1      | 2018-07-03 17:57:43 | 108262361080 |
| dbstore2001:3315 | s5      | 2018-07-03 23:40:19 |  50813370411 |
| dbstore2001:3316 | s6      | 2018-07-03 21:26:44 |  65586588489 |
| dbstore2001:3317 | s7      | 2018-07-03 17:00:01 |  82638220983 |
| dbstore2001:3318 | s8      | 2018-07-04 00:23:27 |  72221593133 |
| dbstore2002:3311 | s1      | 2018-07-04 01:24:18 | 108267561324 |
| dbstore2002:3312 | s2      | 2018-07-04 04:01:34 |  91698946356 |
| dbstore2002:3313 | s3      | 2018-07-03 17:00:01 | 102135306565 |
| dbstore2002:3314 | s4      | 2018-07-03 20:33:00 |  95039731345 |
| dbstore2002:3320 | x1      | 2018-07-04 06:50:21 |  20686018930 |
+------------------+---------+---------------------+--------------+
10 rows in set (0.01 sec)

What files do we have related to the potential recovery of enwiki.categorylinks?

root@db1115.eqiad.wmnet[zarcillo]> select backups.id, backups.source, backup_files.file_name, backup_files.size, backup_files.file_date, backups.creation_date as backup_date FROM backup_objects JOIN backup_files ON backup_objects.id = backup_files.backup_object_id JOIN backups ON backup_objects.backup_id = backups.id WHERE backup_objects.db = 'enwiki' and backup_objects.name='categorylinks';
+----+------------------+------------------------------------+-----------+---------------------+---------------------+
| id | source           | file_name                          | size      | file_date           | backup_date         |
+----+------------------+------------------------------------+-----------+---------------------+---------------------+
|  1 | dbstore1001:3311 | enwiki.categorylinks-schema.sql.gz |       377 | 2018-07-03 19:49:03 | 2018-07-03 17:57:43 |
|  1 | dbstore1001:3311 | enwiki.categorylinks.00000.sql.gz  | 557821984 | 2018-07-03 18:22:12 | 2018-07-03 17:57:43 |
|  1 | dbstore1001:3311 | enwiki.categorylinks.00001.sql.gz  | 409172922 | 2018-07-03 18:15:06 | 2018-07-03 17:57:43 |
|  1 | dbstore1001:3311 | enwiki.categorylinks.00002.sql.gz  | 383738072 | 2018-07-03 18:14:13 | 2018-07-03 17:57:43 |
|  1 | dbstore1001:3311 | enwiki.categorylinks.00003.sql.gz  | 375765813 | 2018-07-03 18:14:51 | 2018-07-03 17:57:43 |
|  1 | dbstore1001:3311 | enwiki.categorylinks.00004.sql.gz  | 352323341 | 2018-07-03 18:15:00 | 2018-07-03 17:57:43 |
|  1 | dbstore1001:3311 | enwiki.categorylinks.00005.sql.gz  | 336266280 | 2018-07-03 18:17:21 | 2018-07-03 17:57:43 |
|  2 | dbstore2002:3311 | enwiki.categorylinks-schema.sql.gz |       389 | 2018-07-04 03:52:37 | 2018-07-04 01:24:18 |
|  2 | dbstore2002:3311 | enwiki.categorylinks.00000.sql.gz  | 647009518 | 2018-07-04 02:03:46 | 2018-07-04 01:24:18 |
|  2 | dbstore2002:3311 | enwiki.categorylinks.00001.sql.gz  | 474168689 | 2018-07-04 01:53:37 | 2018-07-04 01:24:18 |
|  2 | dbstore2002:3311 | enwiki.categorylinks.00002.sql.gz  | 455466328 | 2018-07-04 01:52:35 | 2018-07-04 01:24:18 |
|  2 | dbstore2002:3311 | enwiki.categorylinks.00003.sql.gz  | 429877847 | 2018-07-04 01:52:20 | 2018-07-04 01:24:18 |
|  2 | dbstore2002:3311 | enwiki.categorylinks.00004.sql.gz  | 408667555 | 2018-07-04 01:51:54 | 2018-07-04 01:24:18 |
+----+------------------+------------------------------------+-----------+---------------------+---------------------+
13 rows in set (0.01 sec)

What's the file_date vs backup_date? backup_date is when the backup started and file_date when the file was last modified on the filesystem?

What's the file_date vs backup_date? backup_date is when the backup started and file_date when the file was last modified on the filesystem?

Yes, I will document all those with exact definitions when I have something stable, but not yet, as I am changing my mind all the time.

Here it is another example, original size and backup size:

root@db1115.eqiad.wmnet[zarcillo]> select * FROM tables where format is not null;
+------------------+--------+------------+------------+-------------------+--------------+------------+----------+-------------------
| instance         | db     | table_name | definition | format            | logical_size | file_size  | rows     | last_updated      
+------------------+--------+------------+------------+-------------------+--------------+------------+----------+-------------------
| dbstore1001:3311 | enwiki | comment    | NULL       | InnoDB/COMPRESSED |   1482299162 | 1258291200 | 14097136 | 2018-07-17 13:02:4
| dbstore1001:3311 | enwiki | slots      | NULL       | InnoDB/Compressed |            0 |      65536 |        0 | 2018-07-17 13:06:2
+------------------+--------+------------+------------+-------------------+--------------+------------+----------+-------------------
2 rows in set (0.10 sec)

root@db1115.eqiad.wmnet[zarcillo]> select * FROM backup_objects where backup_id = 1 and db = 'enwiki' and name in ('comment', 'slots');
+------+-----------+--------+---------+-----------+
| id   | backup_id | db     | name    | size      |
+------+-----------+--------+---------+-----------+
| 2131 |         1 | enwiki | comment | 416180697 |
| 2224 |         1 | enwiki | slots   |       261 |
+------+-----------+--------+---------+-----------+
2 rows in set (0.00 sec)

Change 449469 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] db backup statistics: Initial implementation of the backup stats

https://gerrit.wikimedia.org/r/449469

While backup is running:

| 19 | dump.test-s1.2018-08-01--08-11-18 | ongoing  | dbstore1001.eqiad.wmnet:3311 | dbstore1001.eqiad.wmnet | dump | test-s1 | 2018-08-01 08:11:18 |         NULL |

After it finishes correctly:

| 19 | dump.test-s1.2018-08-01--08-11-18 | finished | dbstore1001.eqiad.wmnet:3311 | dbstore1001.eqiad.wmnet | dump | test-s1 | 2018-08-01 08:11:18 |         NULL |

Why is the time the same? 2018-08-01 08:11:18 ? Is that when the backup was started?

I am literally changing this right now :-), this is what we will end up having:

`start_date` timestamp NOT NULL DEFAULT '1970-01-01 00:00:01',
`finish_date` timestamp NULL DEFAULT NULL,

After I run it:

root@db1115[zarcillo]> select * FROM backups WHERE id = 20\G
*************************** 1. row ***************************
         id: 20
       name: dump.test-s1.2018-08-01--08-29-25
     status: ongoing
     source: dbstore1001.eqiad.wmnet:3311
       host: dbstore1001.eqiad.wmnet
       type: dump
    section: test-s1
 start_date: 2018-08-01 08:29:25
finish_date: NULL
 total_size: NULL
1 row in set (0.00 sec)

After it finishes correctly:

root@db1115[zarcillo]> select * FROM backups WHERE id = 20\G
*************************** 1. row ***************************
         id: 20
       name: dump.test-s1.2018-08-01--08-29-25
     status: finished
     source: dbstore1001.eqiad.wmnet:3311
       host: dbstore1001.eqiad.wmnet
       type: dump
    section: test-s1
 start_date: 2018-08-01 08:29:25
finish_date: 2018-08-01 08:30:39
 total_size: NULL
1 row in set (0.00 sec)

If there is an immediate detectable error (mydumper critical, etc.) it will show the status as 'failed'.

If there is an immediate detectable error (mydumper critical, etc.) it will show the status as 'failed'.

That is awesome! "Easy" way to generate basic alerts from it!

That is awesome! "Easy" way to generate basic alerts from it!

Yes, although it does not yet generate heuristics checks or gather file stats- it only checks error codes at the moment.

That is awesome! "Easy" way to generate basic alerts from it!

Yes, although it does not yet generate heuristics checks or gather file stats- it only checks error codes at the moment.

That's why I mentioned basics :-) so we can at least stop checking them manually and get an email or whatever with a single MySQL query!
Step by step! :)

Change 449681 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] db backup statistics: Initial implementation of the backup stats

https://gerrit.wikimedia.org/r/449681

Change 449681 merged by Jcrespo:
[operations/puppet@production] db backup statistics: Initial implementation of the backup stats

https://gerrit.wikimedia.org/r/449681

Change 454504 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mydumper: Require python3-pymysql to store backup statistics

https://gerrit.wikimedia.org/r/454504

Change 454504 merged by Jcrespo:
[operations/puppet@production] mydumper: Require python3-pymysql to store backup statistics

https://gerrit.wikimedia.org/r/454504

Change 454509 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Capture connection error exceptions

https://gerrit.wikimedia.org/r/454509

This is how many minutes it takes for several backups to finish (it may be missleading, as because concurrent backups, it can vary a lot):

root@db1115.eqiad.wmnet[zarcillo]> select section, source, TIMESTAMPDIFF(MINUTE, start_date, end_date) as minutes FROM backups where start_date > '2018-08-21';
+---------+------------------------------+---------+
| section | source                       | minutes |
+---------+------------------------------+---------+
| s2      | db1095.eqiad.wmnet:3312      |      61 |
| s1      | dbstore1001.eqiad.wmnet:3311 |     150 |
| m1      | db1117.eqiad.wmnet:3321      |      24 |
| s4      | db1102.eqiad.wmnet:3314      |     164 |
| s5      | db1102.eqiad.wmnet:3315      |      37 |
| m5      | db1117.eqiad.wmnet:3325      |      13 |
| s3      | db1095.eqiad.wmnet:3313      |     145 |
| tendril | db1115.eqiad.wmnet:3306      |       0 |
| m3      | db1117.eqiad.wmnet:3323      |      81 |
| m2      | db1117.eqiad.wmnet:3322      |     211 |
| s3      | dbstore2002.codfw.wmnet:3313 |     203 |
| m2      | db2044.codfw.wmnet:3306      |     299 |
| s8      | dbstore2001.codfw.wmnet:3318 |     410 |
| s7      | dbstore2001.codfw.wmnet:3317 |     400 |
| s1      | dbstore2002.codfw.wmnet:3311 |     177 |
| m3      | db2042.codfw.wmnet:3306      |      55 |
| m1      | db2078.codfw.wmnet:3321      |      31 |
| m5      | db2037.codfw.wmnet:3306      |      16 |
| s4      | dbstore2002.codfw.wmnet:3314 |     308 |
| x1      | dbstore2002.codfw.wmnet:3320 |     102 |
| s2      | dbstore2001.codfw.wmnet:3312 |     150 |
| s5      | dbstore2001.codfw.wmnet:3315 |      57 |
| s6      | dbstore2001.codfw.wmnet:3316 |      94 |
+---------+------------------------------+---------+
23 rows in set (0.00 sec)

That is very useful! I know this is too early to discuss it but maybe, a good alert can be based on minutes, like if a backups takes more than 24h something might be going on.

Change 454509 merged by Jcrespo:
[operations/puppet@production] mariadb backups: Capture connection error exceptions

https://gerrit.wikimedia.org/r/454509

Change 449469 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] db backup statistics: Initial implementation of the backup stats

https://gerrit.wikimedia.org/r/449469

Backups last week https://phabricator.wikimedia.org/T198987#4525868 vs this week (not all have finished yet):

root@db1115.eqiad.wmnet[zarcillo]> SELECT name, section, TIMESTAMPDIFF(MINUTE, start_date, end_date) as time  FROM backups where start_date BETWEEN now() -INTERVAL 10 DAY AND now() - INTERVAL 24 HOUR ORDER BY section;
+-----------------------------------+---------+------+
| name                              | section | time |
+-----------------------------------+---------+------+
| dump.m1.2018-08-21--18-01-08      | m1      |   24 |
| dump.m1.2018-08-22--22-44-14      | m1      |   31 |
| dump.m2.2018-08-21--22-32-16      | m2      |  211 |
| dump.m2.2018-08-22--10-09-53      | m2      |  299 |
| dump.m3.2018-08-21--21-10-34      | m3      |   81 |
| dump.m3.2018-08-22--21-49-09      | m3      |   55 |
| dump.m5.2018-08-21--20-08-03      | m5      |   13 |
| dump.m5.2018-08-22--23-15-51      | m5      |   16 |
| dump.s1.2018-08-21--17-00-01      | s1      |  150 |
| dump.s1.2018-08-22--20-23-26      | s1      |  177 |
| dump.s2.2018-08-21--17-00-01      | s2      |   61 |
| dump.s2.2018-08-23--01-13-57      | s2      |  150 |
| dump.s3.2018-08-21--20-21-22      | s3      |  145 |
| dump.s3.2018-08-22--10-09-53      | s3      |  203 |
| dump.s4.2018-08-21--18-25-46      | s4      |  164 |
| dump.s4.2018-08-22--23-21-23      | s4      |  308 |
| dump.s5.2018-08-21--19-30-39      | s5      |   37 |
| dump.s5.2018-08-23--03-44-07      | s5      |   57 |
| dump.s6.2018-08-23--04-29-59      | s6      |   94 |
| dump.s7.2018-08-22--15-08-58      | s7      |  400 |
| dump.s8.2018-08-22--13-33-01      | s8      |  410 |
| dump.tendril.2018-08-21--21-10-25 | tendril |    0 |
| dump.x1.2018-08-22--23-31-53      | x1      |  102 |
+-----------------------------------+---------+------+
23 rows in set (0.00 sec)

root@db1115.eqiad.wmnet[zarcillo]> SELECT name, section, TIMESTAMPDIFF(MINUTE, start_date, end_date) as time  FROM backups where start_date > now() - INTERVAL 24 HOUR ORDER BY section;                
+-----------------------------------+---------+------+
| name                              | section | time |
+-----------------------------------+---------+------+
| dump.m1.2018-08-28--19-28-33      | m1      |   24 |
| dump.m1.2018-08-29--06-01-07      | m1      |   28 |
| dump.m2.2018-08-28--17-00-01      | m2      |  316 |
| dump.m2.2018-08-28--18-00-58      | m2      |  291 |
| dump.m3.2018-08-28--17-00-01      | m3      |   64 |
| dump.m3.2018-08-28--23-21-34      | m3      |   30 |
| dump.m5.2018-08-28--18-04-44      | m5      |   16 |
| dump.m5.2018-08-28--22-21-30      | m5      |   35 |
| dump.s1.2018-08-28--22-57-21      | s1      |  156 |
| dump.s1.2018-08-29--06-14-04      | s1      | NULL |
| dump.s2.2018-08-28--22-16-32      | s2      |   65 |
| dump.s2.2018-08-28--22-38-06      | s2      |  207 |
| dump.s3.2018-08-28--17-00-01      | s3      |  148 |
| dump.s3.2018-08-29--08-16-06      | s3      | NULL |
| dump.s4.2018-08-28--19-53-11      | s4      |  148 |
| dump.s4.2018-08-29--01-16-49      | s4      |  284 |
| dump.s5.2018-08-28--17-00-01      | s5      |   60 |
| dump.s5.2018-08-28--23-51-34      | s5      |   28 |
| dump.s6.2018-08-28--22-52-14      | s6      |  144 |
| dump.s7.2018-08-29--02-05-23      | s7      |  248 |
| dump.s8.2018-08-28--18-21-09      | s8      |  256 |
| dump.tendril.2018-08-28--19-28-23 | tendril |    0 |
| dump.x1.2018-08-29--06-30-02      | x1      |  106 |
+-----------------------------------+---------+------+
23 rows in set (0.00 sec)

Change 456608 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Provide backup file metadata information

https://gerrit.wikimedia.org/r/456608

Example output, so far:

root@db1115.eqiad.wmnet[zarcillo]> select * FROM backups WHERE id=77\G
*************************** 1. row ***************************
        id: 77
      name: dump.test-s1.2018-08-31--12-05-37
    status: finished
    source: dbstore1001.eqiad.wmnet:3311
      host: dbstore1001.eqiad.wmnet
      type: dump
   section: test-s1
start_date: 2018-08-31 12:05:37
  end_date: 2018-08-31 12:07:02
total_size: NULL
1 row in set (0.00 sec)

root@db1115.eqiad.wmnet[zarcillo]> select * FROM backup_files WHERE backup_id=77\G
*************************** 1. row ***************************
       backup_id: 77
       file_name: enwiki.recentchanges-schema.sql.gz
            size: 597
       file_date: 2018-08-31 12:05:37
backup_object_id: NULL
*************************** 2. row ***************************
       backup_id: 77
       file_name: enwiki.recentchanges.sql.gz
            size: 577390163
       file_date: 2018-08-31 12:07:02
backup_object_id: NULL
*************************** 3. row ***************************
       backup_id: 77
       file_name: metadata
            size: 321
       file_date: 2018-08-31 12:07:02
backup_object_id: NULL
3 rows in set (0.00 sec)

root@db1115.eqiad.wmnet[zarcillo]> select * FROM backup_objects WHERE backup_id=77\G
Empty set (0.03 sec)

Change 456613 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Calculate total backup size

https://gerrit.wikimedia.org/r/456613

The size is in bytes:

root@db1115.eqiad.wmnet[zarcillo]> select * FROM backups WHERE id=79\G
*************************** 1. row ***************************
        id: 79
      name: dump.test-s1.2018-08-31--12-25-31
    status: finished
    source: dbstore1001.eqiad.wmnet:3311
      host: dbstore1001.eqiad.wmnet
      type: dump
   section: test-s1
start_date: 2018-08-31 12:25:31
  end_date: 2018-08-31 12:26:53
total_size: 577387964
1 row in set (0.00 sec)

Change 456608 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Provide backup file metadata information

https://gerrit.wikimedia.org/r/456608

Change 456613 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Calculate total backup size

https://gerrit.wikimedia.org/r/456613

This is done, I just need to explain and document what was done.

The zarcillo database (temporary code name for what will be the tendril replacement) now has 3 extra tables:

  • backups
  • backup_files
  • backup_objects (not in use)

backups contain an id and properties about backups (status -ongoing, finished (correctly), failed and deleted- dir name, source, section, start_time, end_time, total_size, etc.

  • backup files contain a backup id and a list of files for that backup and its properties (date, size, name)
  • backup objects will link the backup files to specific objects (tables, databases, triggers, etc.) for further checking (at the moment, this is not filled in). This will be useful when we maintain an inventory of database objects for all servers so we can make sure no objects are left uncopied/have appropiate size, etc.

Metadata is collected during generation, by the same script, dump_section.py It connects to the metadata database (db1115:zarcillo at the moment) and logs the information. If the logging fails, the backup should continue, but it cannot log a successful backup without the backups being successful, too.

The metadata database is controlled on the backup parameters, on the /etc/mysql/backup.yaml, the same place that controls the backups to be done.

Checks are done at icinga, and are controlled by a hiera setting: role/common/mariadb/misc/tendril_and_zarcillo.yaml