Technic Blog

Technik BlogDisaster recovery strategies for MySQL database systems

13. November 2017 Max Fruth

Decision parameters for the best strategy

Backup and recovery processes are a critical part of any IT application and therefore an important part of business continuity planning. A well-considered and proven backup and recovery system can make the difference between a small failure and a serious threat to an organization.

When planning your backup and disaster recovery strategy, you need to consider three things: Recovery Time Objective (RTO), Recovery Point Objective (RPO) and Risk Mitigation.

Recovery Time Objective (RTO)

The RTO describes the time schedule for the recovery of a database system after a crash. What is the potential loss of sales due to a breakdown? What was agreed in the SLAs with customers? The answers to these questions are the basis for determining the RTO value.

Recovery Point Objective (RPO)

The RPO is the point in time at which you want to restore. Restoring a database from a backup always results in data loss. If an RPO of more than one day is sufficient, this can be covered by daily backup routines. However, if stricter RPOs have been defined, additional measures such as binary streaming or incremental backup are required.

Risk Mitigation

When defining a strategy, it is also important to identify the type of risk to be mitigated. Are bugs from programming or errors in the administration of database applications to be fixed? Is there an increased risk of data integrity due to hacker attacks or user errors? How can the risk of a long downtime of the DB servers or the failure of a data center be reduced?

Backup strategies for MySQL databases

DB backups are divided into logical and physical backups. Logical backups protect against the loss of individual data points, while physical backups protect against total data loss, e. g. after server hardware failure.

The challenge when creating hot backups with active database systems is that availability and performance of the service should not be affected. The service must not be stopped or the access locked. Simply copying the data files of an active database usually results in an inconsistent copy of the database. Such a backup cannot be used because all or some transactions performed during the backup process are missing. On the other hand, stopping the database for cold backup causes database applications to become unavailable.

What software tools are available to create MySQL backups?

mysqldump creates logical backups of the data and is mainly used for smaller database systems and in development. Due to the disadvantages of mysqldump with regard to scalability (duration of backup and restore) and the resulting loss of performance, this tool is rarely used in large and high availability applications.

XtraBackup creates physical backups of the database files, i. e. copies of data files. It then applies the transaction log (also known as redo log) to the physical backups in order to back up all active transactions that were not completed during the creation of the backups to ensure data consistency with a running database.

Decision Flow Chart for defining a backup strategy

Decision Flow Chart

Restoring database systems

The recovery time can be significantly reduced by a DB Staging Area. Restoring the data from a central backup software (e. g. VEEAM) requires considerably more time, since the data must first be made available from a large, usually slow storage medium. Active backup processes can also delay access to the backup media.

Keeping a complete backup cycle including full, incremental and binlog backups in a staging area is particularly important for larger IT infrastructures in order to achieve the goals defined in the RTOs. The DB Staging Area should meet the following requirements:

  • The storage medium should be available if the DB server hardware fails.
  • The system processes of the recovery software should have direct access to the storage medium.
  • The access latency (disk I/O, network I/O) to stored data of the staging area should be as short as possible.

GlusterFS a highly available, distributed file system meets the requirements for a staging area. We at netplace have been using GlusterFS for DB backups for years.

The decision diagram for the backup strategy shows that the RPO has a large influence on the definition of the backup schema. Especially if the recovery time is short, i. e. only a small loss of data can be tolerated, data backups are required at shorter intervals. Since these additional backups fall within the time periods of high productivity, this must not affect the DB system. Backup procedures with locks on read or write operations would reduce the availability or performance of applications and should therefore not be used.

With backup of the MySQL binary protocol (BINLOG), it is possible to extend the backup scheme for highly productive systems. Normally, storing Write operations in binary logs is already active when asynchronous replication (Master/Slave, Master/Master) is used, so that backing up these files has no negative effect on the performance of the DB system.

Galera Cluster Systems use a synchronous replication protocol. Therefore, BINLOG must be explicitly activated on a node in the cluster. Since the performance of the cluster system is the slowest node, the activation of BINLOG has a negative effect on all write operations. This must be considered to avoid problems due to poor performance.

In further articles I will go into detail about the most popular MySQL backup tools xtrabackup and mysqldump. I will also provide scripts for download, which show and implement a comprehensive backup scheme for MySQL databases.