
Bring an-mariadb100[12] into service
Closed, ResolvedPublic

Description

Update October 2023

We have delayed completing this ticket for some time, but it would now be beneficial to move forward with it.

  • an-coord100[1-2] are ready to be refreshed with an-coord100[3-4]
  • The new servers for the analytics-meta database are: an-mariadb100[1-2]

We still don't have a planned method for managed failover/failback of the MariaDB servers.

Prior status
  • an-coord1001 runs the 'analytics meta' MariaDB master instance. This instance has several databases for Analytics Cluster operations.
  • an-coord1002 runs a standby replica of this instance, but in the case of a failure, switching to an-coord1002 is an error-prone and manual process.
  • matomo1002 runs a MariaDB instance for the 'piwik' database.
  • db1208 runs backup replicas of the analytics-meta and matomo MariaDB instances, and Bacula is used to keep historical backups.
  • Relevant MariaDB configs do not necessarily match between masters and replicas.
Desired status
  • All existing analytics_meta databases running from an-mariadb100[12] instead of an-coord100[12]
  • We have confidence in the veracity of both the failover replica (an-mariadb1002) and the backup replica (db1208)
  • Regular and comprehensive backups are running from db1208
  • The failover method from an-mariadb1001 to an-mariadb1002 has been well-defined and tested
  • The restore method from db1208 has been well defined
Implementation steps
  • Dedicated DB hardware to be ordered in Q1 FY2021-2022 to replace an-coord100[12]: an-db100[12].
  • an-coord1002 fully in sync with an-coord1001 and ready for failover.
  • db1208 fully recreated from snapshot of an-coord1001 and performing regular backups.
  • an-mariadb1001 instantiated as a replica of an-coord1001
  • an-mariadb1002 instantiated as a replica of an-mariadb1001
  • an-mariadb1002 switched to replicate from an-mariadb1001
  • db1208 switched to replicate from an-mariadb1001

Switch-over time

  • an-coord1001 switched to read-only
  • an-mariadb1001 promoted to master
  • All applications switched to use an-mariadb1001 instead of an-coord1001

Post Switch-over time

  • Ensure backups are running on the right host(s) and with the latest data (e.g. attempt a test recovery)
  • MariaDB instances removed from an-coord100[12]

Notes and Migration Plan here:
https://etherpad.wikimedia.org/p/analytics-meta


Originally, this ticket was about setting up multi-master instances and being able to fail over individual MariaDB database instances. However, it was discovered that Data Persistence does not really support multi-instance MariaDB master setups, and our reasons for wanting one aren't that compelling. Most of the time, failovers will be manual and done for hardware reasons, meaning all DBs would have to be failed over anyway. Having many master setups means more replicas and binlogs to manage, which makes maintenance like that harder, not easier. Ideally, each app's DB would be totally isolated from the others, but we will have to wait until perhaps one day we get persistent volumes in k8s to do this really properly.

For now we are going with a single analytics-meta instance for all databases.

TBD - The matomo server is in the private1 vlan. Do we want to move its database to an-mariadb100[12] and require a new hardware firewall rule for this?

Details

Related Changes in Gerrit:

Event Timeline


Ah, fabulous, thanks @jcrespo.

I had found some grants for the dump user defined here: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/templates/mariadb/grants/dumps-eqiad-analytics_meta.sql.erb but I couldn't work out if they were applied anywhere automatically, or if not, why not.

The question I was just about to ask was this...

  • Is it our normal practice to a) create this user directly in the mysql table on a backup replica, or b) keep the mysql.user table fully in sync across all replicas?

...but I see from your answer that it's a) because you said:

we normally don't add the dump user to the primary to prevent accidentally taking backups from it in the first place.

That makes perfect sense now.

Grant management and checking is a pending task that we have to solve, but doing so in a safe and reliable way for production is not easy for all use cases. In the short term we were satisfied by knowing that the grants are documented in puppet, until a more automatic setup is figured out.

Pausing this task since the database migration has been de-prioritized in favour of other, more pressing tasks.
We still want to do it and hardware is now available, so we will return to it as soon as practicable.

Change 736019 abandoned by Ottomata:

[operations/puppet@production] Add role::analytics_cluster::database::meta on an-db100[12]

Reason:

Done differently

https://gerrit.wikimedia.org/r/736019

BTullis renamed this task from Refactor analytics-meta MariaDB layout to use an-db100[12] to Refactor analytics-meta MariaDB layout to use an-mariadb100[12].Mar 2 2023, 12:52 PM
JArguello-WMF raised the priority of this task from Medium to High.Mar 14 2023, 5:53 PM
JArguello-WMF set the point value for this task to 3.
BTullis renamed this task from Refactor analytics-meta MariaDB layout to use an-mariadb100[12] to Bring an-mariadb100[12] into service.Aug 7 2023, 9:54 PM
BTullis removed the point value 3 for this task.
BTullis edited subscribers, added: brouberol, Stevemunene; removed: jbond, jcrespo, Kormat and 5 others.

Change 965756 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Create a new role for analytics_cluster::mariadb and assign it

https://gerrit.wikimedia.org/r/965756

Change 965761 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove the need for the analytics-meta database to require java

https://gerrit.wikimedia.org/r/965761

I've been bold and added an extra step about testing that the backup is working after maintenance; let me know if that is reasonable and adapt it to your needs.


Good thinking, thanks @jcrespo - Will do.

Change 965761 merged by Btullis:

[operations/puppet@production] Remove the need for the analytics-meta database to require java

https://gerrit.wikimedia.org/r/965761

Change 965756 merged by Btullis:

[operations/puppet@production] Create a new role for analytics_cluster::mariadb and assign it

https://gerrit.wikimedia.org/r/965756

I am running the following to create a binary backup of the mariadb instance on an-coord1002.

btullis@cumin1001:~$ sudo transfer.py --type=xtrabackup an-coord1002.eqiad.wmnet:/run/mysqld/mysqld.sock an-mariadb1001.eqiad.wmnet:/srv/sqldata

I would have used our dedicated backup host db1208, but it seems that transfer.py isn't able to accept /run/mysqld/mysqld.analytics_meta.sock as a valid socket name to connect to.

btullis@cumin1001:~$ sudo transfer.py --type=xtrabackup db1208.eqiad.wmnet:/run/mysqld/mysqld.analytics_meta.sock an-mariadb1001.eqiad.wmnet:/srv/sqldata
Traceback (most recent call last):
  File "/usr/bin/transfer.py", line 33, in <module>
    sys.exit(load_entry_point('transferpy==1.1', 'console_scripts', 'transfer.py')())
  File "/usr/lib/python3/dist-packages/transferpy/transfer.py", line 265, in main
    result = t.run()
  File "/usr/lib/python3/dist-packages/transferpy/Transferer.py", line 556, in run
    self.sanity_checks()
  File "/usr/lib/python3/dist-packages/transferpy/Transferer.py", line 374, in sanity_checks
    self.original_size = self.disk_usage(self.source_host, self.source_path,
  File "/usr/lib/python3/dist-packages/transferpy/Transferer.py", line 146, in disk_usage
    path = self.get_datadir_from_socket(path)
  File "/usr/lib/python3/dist-packages/transferpy/Transferer.py", line 247, in get_datadir_from_socket
    raise Exception('the given socket does not have a known format')
Exception: the given socket does not have a known format

Rather than patch it now, I thought it best to use our standby replica an-coord1002 as the source of the mariabackup operation.

The output of that command was as follows:

2023-11-06 11:57:30  INFO: About to transfer /run/mysqld/mysqld.sock from an-coord1002.eqiad.wmnet to ['an-mariadb1001.eqiad.wmnet']:['/srv/sqldata'] (41213809718 bytes)
2023-11-06 12:03:09  WARNING: Original size is 41213809718 but transferred size is 37001877480 for copy to an-mariadb1001.eqiad.wmnet
2023-11-06 12:03:10  INFO: Parallel checksum of source on an-coord1002.eqiad.wmnet and the transmitted ones on an-mariadb1001.eqiad.wmnet match.
2023-11-06 12:03:11  INFO: 37001877480 bytes correctly transferred from an-coord1002.eqiad.wmnet to an-mariadb1001.eqiad.wmnet
2023-11-06 12:03:12  INFO: Cleaning up....

I subsequently prepared the backup with:

btullis@an-mariadb1001:/srv/sqldata$ sudo /opt/wmf-mariadb104/bin/mariabackup --prepare --use-memory=100GB --target-dir=/srv/sqldata
/opt/wmf-mariadb104/bin/mariabackup based on MariaDB server 10.4.28-MariaDB Linux (x86_64)
[00] 2023-11-06 12:15:15 cd to /srv/sqldata/
[00] 2023-11-06 12:15:15 open files limit requested 0, set to 1024
[00] 2023-11-06 12:15:15 This target seems to be not prepared yet.
[00] 2023-11-06 12:15:15 mariabackup: using the following InnoDB configuration for recovery:
[00] 2023-11-06 12:15:15 innodb_data_home_dir = .
[00] 2023-11-06 12:15:15 innodb_data_file_path = ibdata1:12M:autoextend
[00] 2023-11-06 12:15:15 innodb_log_group_home_dir = .
[00] 2023-11-06 12:15:15 InnoDB: Using Linux native AIO
[00] 2023-11-06 12:15:15 Starting InnoDB instance for recovery.
[00] 2023-11-06 12:15:15 mariabackup: Using 107374182400 bytes for buffer pool (set by --use-memory parameter)
2023-11-06 12:15:15 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2023-11-06 12:15:15 0 [Note] InnoDB: Uses event mutexes
2023-11-06 12:15:15 0 [Note] InnoDB: Compressed tables use zlib 1.2.12
2023-11-06 12:15:15 0 [Note] InnoDB: Number of pools: 1
2023-11-06 12:15:15 0 [Note] InnoDB: Using SSE2 crc32 instructions
2023-11-06 12:15:15 0 [Note] InnoDB: Initializing buffer pool, total size = 100G, instances = 1, chunk size = 100G
2023-11-06 12:15:17 0 [Note] InnoDB: Completed initialization of buffer pool
2023-11-06 12:15:17 0 [Note] InnoDB: page_cleaner coordinator priority: -20
2023-11-06 12:15:17 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1084679501860
2023-11-06 12:15:17 0 [Note] InnoDB: Starting final batch to recover 218 pages from redo log.
2023-11-06 12:15:19 0 [Note] InnoDB: Last binlog file './analytics-meta-bin.002056', position 546239513
[00] 2023-11-06 12:15:19 Last binlog file ./analytics-meta-bin.002056, position 546239513
[00] 2023-11-06 12:15:21 completed OK!
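As a side note on the --use-memory=100GB flag above: the mariabackup log reports "Using 107374182400 bytes for buffer pool", i.e. it interprets the value in binary (GiB) units. A small, illustrative sketch of that conversion (this helper is hypothetical, not part of mariabackup):

```python
# Convert human-readable size strings such as '100GB', '100G' or '12M'
# into bytes, using binary units as mariabackup does for --use-memory.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def parse_size(value):
    """Parse a size string into a byte count (binary units assumed)."""
    text = value.strip().upper().removesuffix("B")
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)

print(parse_size("100GB"))  # 107374182400, matching the log line above
```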

I configured the ownership of /srv/sqldata with sudo chown -R mysql:mysql /srv/sqldata

I started the mariadb service with: sudo systemctl start mariadb

I configured the replication threads with:

CHANGE MASTER TO MASTER_HOST='10.64.21.104', MASTER_USER='repl', MASTER_PASSWORD='<redacted>',MASTER_LOG_FILE='analytics-meta-bin.019776', MASTER_LOG_POS=438746524, MASTER_SSL=1;
START SLAVE;

We can now see that this slave thread is up to date with an-coord1001 from here: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=an-mariadb1001&var-port=9104

Cleaned up and re-enabled puppet on an-mariadb1001. Icinga is green.

Repeating the above steps with an-mariadb1002.

Change 971942 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Configure the new mariadb servers to be replicas

https://gerrit.wikimedia.org/r/971942

the given socket does not have a known format

I think it is because it doesn't know how to transform that into a datadir, as it assumes all section names are documented on

cumin1001
root@cumin1001:~$ grep analytics_meta /etc/wmfmariadbpy/section_ports.csv 
analytics_meta, 3352

I can see analytics_meta there, so there must be some discrepancy between what's configured and what's on the host.

Ah, I see the issue:

https://github.com/wikimedia/operations-software-transferpy/blob/cd9027a9beee2cf2ae51b2b6f1be216637775bf9/transferpy/Transferer.py#L244

The port file is not used, even though it should be, so the check is too restrictive: it only allows sections starting with s, x or m. I will change it to use the section_ports file instead, where analytics_meta is a known section.
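The proposed fix can be sketched roughly like this: derive the section name from the socket path and validate it against the sections listed in section_ports.csv, rather than against hard-coded name prefixes. This is an illustration of the idea, not the actual transferpy patch, and the socket-path parsing is an assumption:

```python
# Look up a multi-instance section from a MariaDB socket path, validating
# it against the sections known in /etc/wmfmariadbpy/section_ports.csv.
import csv
import io
import re

# Stand-in for the real section_ports.csv contents seen on cumin1001.
SECTION_PORTS_CSV = "analytics_meta, 3352\n"

def known_sections(csv_text):
    """Return the set of section names from a section_ports-style CSV."""
    reader = csv.reader(io.StringIO(csv_text))
    return {row[0].strip() for row in reader if row}

def section_from_socket(socket_path, sections):
    # Sockets look like /run/mysqld/mysqld.sock (single instance)
    # or /run/mysqld/mysqld.<section>.sock (multi-instance).
    m = re.fullmatch(r".*/mysqld(?:\.([^.]+))?\.sock", socket_path)
    if m is None:
        raise ValueError("the given socket does not have a known format")
    section = m.group(1)
    if section is not None and section not in sections:
        raise ValueError(f"unknown section: {section}")
    return section  # None means the default single-instance socket

sections = known_sections(SECTION_PORTS_CSV)
print(section_from_socket("/run/mysqld/mysqld.analytics_meta.sock", sections))
```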


Thanks @jcrespo - I think that would be very useful in future.
For now, I have worked around it by taking my backup from an-coord1002, which wasn't under load at the time.

I have also now switched an-mariadb1002 to replicate from an-mariadb1001.

I'll merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/971942/ which will configure these two new servers as replicas with puppet and enable monitoring. Then I will also switch the replication of the backup server (db1208) to an-mariadb1001, as this can be done ahead of time.

Change 971942 merged by Btullis:

[operations/puppet@production] Configure the new mariadb servers to be replicas

https://gerrit.wikimedia.org/r/971942

Change 972424 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Use new mariadb server for analytics_meta

https://gerrit.wikimedia.org/r/972424

Change 972433 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/software/transferpy@master] Tranferrer: Enable transfers other than misc, core or x1 sections

https://gerrit.wikimedia.org/r/972433

I switched both an-mariadb100[12] servers to use GTID based replication, rather than a simple binlog position.

I have also switched db1208 to replicate from an-mariadb1001 instead of an-coord1001.
I'll remove downtime and enable monitoring on an-mariadb100[12] now, so we will know if replication stops for any reason.
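For reference, a MariaDB GTID position such as the 0-171971944-1090937511 value that appears later in SELECT @@global.gtid_binlog_pos is a domain_id-server_id-sequence_number triple; within a single domain, a replica is caught up once its sequence number reaches the primary's. A minimal sketch of parsing and comparing such positions (illustrative only, not a general multi-domain comparator):

```python
# Parse and compare single-domain MariaDB GTID positions
# of the form "domain_id-server_id-sequence_number".

def parse_gtid(gtid):
    """Split a GTID string into its three integer components."""
    domain, server, seq = (int(part) for part in gtid.split("-"))
    return domain, server, seq

def caught_up(primary_pos, replica_pos):
    """True if the replica's sequence number has reached the primary's."""
    p_domain, _, p_seq = parse_gtid(primary_pos)
    r_domain, _, r_seq = parse_gtid(replica_pos)
    if p_domain != r_domain:
        raise ValueError("positions are from different replication domains")
    return r_seq >= p_seq

print(parse_gtid("0-171971944-1090937511"))  # (0, 171971944, 1090937511)
print(caught_up("0-171971944-1090937511", "0-171971944-1090937511"))  # True
```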

Change 972823 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Switch datahub to use the new an-mariadb servers instead of an-coord

https://gerrit.wikimedia.org/r/972823

Change 972433 merged by Jcrespo:

[operations/software/transferpy@master] Tranferrer: Enable transfers other than misc, core or x1 sections

https://gerrit.wikimedia.org/r/972433

Change 974164 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Increase the size of the innodb pool on analytics_meta

https://gerrit.wikimedia.org/r/974164

Change 974165 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Enable notifications for new analytics_meta hosts

https://gerrit.wikimedia.org/r/974165

Change 974167 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Promote an-mariadb1001 to be the new primary for analytics_meta

https://gerrit.wikimedia.org/r/974167

Change 974172 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] WIP - Temporarily disable the production jobs that write to HDFS

https://gerrit.wikimedia.org/r/974172

Change 974173 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] WIP Re-enable the production pipelines that write to HDFS

https://gerrit.wikimedia.org/r/974173

I have announced a maintenance window for tomorrow, November 15th at 11:00 UTC.

The implementation plan will be as follows:

  • 10:30 - Merge and deploy https://gerrit.wikimedia.org/r/974172 to disable gobblin and refine jobs - execute run-puppet-agent on an-launcher1002
  • 11:00
    • Enter HDFS safe mode with the command: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter on an-master1001
    • Execute SET @@global.read_only=1; in mysql on an-coord1001
    • Execute FLUSH TABLES WITH READ LOCK; on an-coord1001 - this will prevent writes going to the database
    • Execute SHOW MASTER STATUS; and SELECT @@global.gtid_binlog_pos; on an-coord1001 - noting down the results
    • Execute SHOW SLAVE STATUS\G; on an-mariadb1001 and verify that the values for Master_Log_File, Exec_Master_Log_Pos, and Gtid_IO_Pos match those noted in the previous step. This ensures that replication is fully caught up before stopping the slave threads. We are not expecting to have to wait for replication to catch up, but it is a useful failsafe to check these positions.
    • Execute STOP ALL SLAVES; and RESET SLAVE ALL; on an-mariadb1001
    • Execute SET @@global.read_only=0; in mysql on an-mariadb1001
    • Merge and deploy https://gerrit.wikimedia.org/r/974167 which will also execute set global read_only = 0 on an-mariadb1001 (duplicate command, but harmless to run twice).
    • Merge and deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/972424 which will update Hive, Hue, Druid, and Superset
    • Force a puppet run on all affected servers with sudo cumin '(P{C:bigtop::hue} or P{C:bigtop::hive} or P{C:druid} or P{C:superset})' run-puppet-agent
    • Restart the hive-server2 and hive-metastore services on an-coord100[12]
    • Restart the Druid clusters with: sudo cookbook sre.druid.roll-restart-workers analytics and sudo cookbook sre.druid.roll-restart-workers public
    • Restart Superset with systemctl restart superset on an-tool1005 and an-tool1010
    • Restart Hue with systemctl restart hue on an-tool1009
    • Merge https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/972823 and deploy datahub to pick up the new database settings
    • Exit HDFS safe mode with the command: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave on an-master1001
    • Merge and deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/974173 which will re-enable the gobblin jobs - execute run-puppet-agent on an-launcher1002
  • Monitor pipeline progress
  • Add downtime for mariadb services on an-coord100[12]
  • Shut down mariadb services on an-coord100[12]
  • Begin removal of the profile::analytics::database::meta classes from the analytics_cluster::coordinator and analytics_cluster::coordinator::replica roles in puppet.
  • Check that all replication threads on downstream servers are working
  • Check that backups from db1208 are working as expected
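The pre-promotion verification above, comparing SHOW MASTER STATUS on an-coord1001 with SHOW SLAVE STATUS on an-mariadb1001, amounts to a simple field-by-field check. A sketch of that check, assuming the status dicts have been populated from the two queries (the field names are MariaDB's; the helper itself is hypothetical):

```python
# Verify that a replica has fully caught up with the primary before
# promotion, by comparing SHOW MASTER STATUS output from the primary
# against SHOW SLAVE STATUS output from the replica.

def replica_caught_up(master_status, slave_status):
    """True only if the replica has executed everything the primary
    has written to its binlog (file, position, and GTID all match)."""
    return (
        slave_status["Master_Log_File"] == master_status["File"]
        and slave_status["Exec_Master_Log_Pos"] == master_status["Position"]
        and slave_status["Gtid_IO_Pos"] == master_status["Gtid_Binlog_Pos"]
    )

# Example with positions of the kind recorded during the maintenance window:
master = {
    "File": "analytics-meta-bin.019777",
    "Position": 809663593,
    "Gtid_Binlog_Pos": "0-171971944-1090937511",
}
replica = {
    "Master_Log_File": "analytics-meta-bin.019777",
    "Exec_Master_Log_Pos": 809663593,
    "Gtid_IO_Pos": "0-171971944-1090937511",
}
print(replica_caught_up(master, replica))  # True: safe to promote
```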

The roll-back plan, should it be required will involve:

  • Reconfiguring applications to continue using an-coord1001 as the primary server
  • Executing UNLOCK TABLES; on an-coord1001 to release the global read lock
  • Executing SET @@global.read_only=0; on an-coord1001.

I'd be glad if anyone could sanity-check the plan. The intention is to minimise the chance of any data discrepancies, whilst also minimising errors from attempted writes to Hive/HDFS while they are in a read-only state during the change-over.

Change 974172 merged by Btullis:

[operations/puppet@production] Temporarily disable the production jobs that write to HDFS

https://gerrit.wikimedia.org/r/974172

MariaDB [(none)]> SET @@global.read_only=1;
Query OK, 0 rows affected (0.000 sec)

MariaDB [(none)]> FLUSH TABLES WITH READ LOCK;
Query OK, 0 rows affected (0.040 sec)

MariaDB [(none)]> SHOW MASTER STATUS;
+---------------------------+-----------+--------------+------------------+
| File                      | Position  | Binlog_Do_DB | Binlog_Ignore_DB |
+---------------------------+-----------+--------------+------------------+
| analytics-meta-bin.019777 | 809663593 |              |                  |
+---------------------------+-----------+--------------+------------------+
1 row in set (0.001 sec)

MariaDB [(none)]> SELECT @@global.gtid_binlog_pos;
+--------------------------+
| @@global.gtid_binlog_pos |
+--------------------------+
| 0-171971944-1090937511   |
+--------------------------+
1 row in set (0.000 sec)

MariaDB [(none)]>

Mentioned in SAL (#wikimedia-analytics) [2023-11-15T11:04:56Z] <btullis> position confirmed, resetting all slaves on an-mariadb1001 for T284150

Change 972424 merged by Btullis:

[operations/puppet@production] Use new mariadb server for analytics_meta

https://gerrit.wikimedia.org/r/972424

Change 972823 merged by jenkins-bot:

[operations/deployment-charts@master] Switch datahub to use the new an-mariadb servers instead of an-coord

https://gerrit.wikimedia.org/r/972823

Change 974173 merged by Btullis:

[operations/puppet@production] Re-enable the production pipelines that write to HDFS

https://gerrit.wikimedia.org/r/974173

I have issued the following on an-coord1001:

MariaDB [(none)]> SHUTDOWN;
Query OK, 0 rows affected (0.002 sec)

MariaDB [(none)]>

There was a missing grant in the permissions table for superset.
I had to add this:

GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, REFERENCES, INDEX, ALTER, CREATE TEMPORARY TABLES, LOCK TABLES, EXECUTE, CREATE VIEW, SHOW VIEW, CREATE ROUTINE, ALTER ROUTINE, EVENT, TRIGGER ON `superset`.* TO `superset`@`%`

Change 974165 merged by Btullis:

[operations/puppet@production] Enable notifications for new analytics_meta hosts

https://gerrit.wikimedia.org/r/974165

Change 974167 merged by Btullis:

[operations/puppet@production] Promote an-mariadb1001 to be the new primary for analytics_meta

https://gerrit.wikimedia.org/r/974167

Change 974512 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update the database host for superset-next

https://gerrit.wikimedia.org/r/974512

Change 974512 merged by Btullis:

[operations/puppet@production] Update the database host for superset-next

https://gerrit.wikimedia.org/r/974512

Now monitoring for any stray traffic being sent to the mariadb service on an-coord1001 with the following:

btullis@an-coord1001:~$ sudo tcpdump -i any dst port 3306 and dst host an-coord1001.eqiad.wmnet
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes

Change 974516 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Clean up hadoop coordinator roles by removing analytics_meta DB

https://gerrit.wikimedia.org/r/974516

Change 974516 merged by Btullis:

[operations/puppet@production] Clean up hadoop coordinator roles by removing analytics_meta DB

https://gerrit.wikimedia.org/r/974516

We got an error from puppet running on an-coord1001 because I hadn't changed the location of the oozie database server.

Error: '/usr/lib/oozie/bin/ooziedb.sh create -run' returned 1 instead of one of [0]
Error: /Stage[main]/Bigtop::Oozie::Server/Exec[oozie_mysql_create_schema]/returns: change from 'notrun' to ['0'] failed: '/usr/lib/oozie/bin/ooziedb.sh create -run' returned 1 instead of one of [0] (corrective)

I'm tempted not to bother fixing this, but to skip straight ahead to: T341893: [Data Platform] Stop and remove oozie services

BTullis moved this task from In Progress to Done on the Data-Platform-SRE board.

Change 974164 merged by Btullis:

[operations/puppet@production] Increase the size of the innodb pool on analytics_meta

https://gerrit.wikimedia.org/r/974164

BTullis closed this task as Resolved.

What about?

  • Ensure backups are running on the right host(s) and with the latest data (e.g. attempt a test recovery)


Oh you're right. Sorry, I failed to confirm this.

I have checked to my own satisfaction that the latest database dump is using up-to-date data.

As an example, here is a record in datahub that relates to metadata about our Kafka topics. This was updated shortly before the database dump was created.

root@dbprov1002:/srv/backups/dumps/latest/dump.analytics_meta.2024-01-02--04-00-48# zgrep 2024 datahub.metadata_aspect_v2.sql.gz|tail -n 1
grep: datahub.metadata_aspect_v2.sql.gz: binary file matches
("urn:li:dataset:(urn:li:dataPlatform:kafka,VirtualPageView,PROD)","datasetProperties",261,"{\"name\":\"virtualpageview\",\"customProperties\":{\"Partitions\":\"1\",\"Replication Factor\":\"3\",\"cleanup.policy\":\"delete\",\"max.message.bytes\":\"10485760\",\"min.insync.replicas\":\"1\",\"retention.ms\":\"604800000\",\"unclean.leader.election.enable\":\"false\",\"retention.bytes\":\"-1\"},\"tags\":[]}","{\"lastObserved\":1704153802303,\"runId\":\"kafka-2024_01_02-00_03_00\"}","2024-01-02 00:03:22.000000","urn:li:corpuser:__datahub_system",NULL),

I don't think that I need to go as far as a test restore in this case. All the files look as I would expect them to. Thanks again for the reminder @jcrespo.
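The freshness check described above can also be partly automated by parsing the timestamp embedded in dump directory names such as dump.analytics_meta.2024-01-02--04-00-48. The naming convention parse below is an assumption based on the directory seen on dbprov1002; the helper is illustrative:

```python
# Check that the latest analytics_meta dump directory is recent, based on
# the timestamp embedded in its name (e.g. "dump.<section>.<YYYY-MM-DD--HH-MM-SS>").
from datetime import datetime, timedelta

def dump_timestamp(dirname):
    """Extract the creation timestamp from a dump directory name."""
    stamp = dirname.split(".")[-1]  # e.g. '2024-01-02--04-00-48'
    return datetime.strptime(stamp, "%Y-%m-%d--%H-%M-%S")

def dump_is_fresh(dirname, now, max_age=timedelta(days=2)):
    """True if the dump is no older than max_age relative to 'now'."""
    return now - dump_timestamp(dirname) <= max_age

name = "dump.analytics_meta.2024-01-02--04-00-48"
print(dump_timestamp(name))  # 2024-01-02 04:00:48
print(dump_is_fresh(name, now=datetime(2024, 1, 2, 12, 0, 0)))  # True
```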