Some believe that backups are a routine that can be set up and then forgotten. Such people also believe that ransomware attacks, downtime caused by hardware failures, and human mistakes that lead to data loss are things that happen to people in the news or to topic starters on Reddit – in other words, to someone else.
We all know that these kinds of beliefs are a recipe for disaster: lost data, lost business opportunities, lost productivity and, in the end, lost money. Backups can break down at many stages, the recovery may not go as fast as expected or, quite simply, your infrastructure might not be ready for the recovery operation.
To avoid all that, you need a comprehensive plan for testing your backups. This guide will help you build a robust action plan to make sure that your backed-up data is recoverable.
Create a List for Your Backups
First and foremost, you should make sure that all backup routines are documented. Why do you need to create a list if you already have a system administrator who knows all the routines that are in place (or if you know them yourself)? There are two main reasons:
- The person with direct knowledge of your backups may be absent or may leave your company, thus leaving you without a full understanding of your backup infrastructure. You, or your new system administrator, will spend ages just trying to understand how things work. Needless to say, in the event of any disaster, the lack of such knowledge will lead to downtime.
- Even the best professionals may mix up or forget specific technical details once things have got out of control and the situation is stressful.
So, you should create a list that includes all the backups that are run, their types, their retention settings, and the hardware that you need for the backup and recovery processes. Don't forget to include your recovery time objective (RTO) and recovery point objective (RPO) calculations. These will help you to test your backups later and evaluate whether your backup plans are sufficient.
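One way to keep such a list consistent is to store it as structured data rather than free text. The following is a minimal sketch in Python, not a prescribed format: the `BackupJob` class, its field names, and the `missing_details` helper are all illustrative assumptions, and you should adapt them to whatever your backup software actually reports.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class BackupJob:
    """One entry in the backup list (all field names are illustrative)."""
    name: str
    source: str            # what is being backed up
    backup_type: str       # for example "full", "incremental" or "image"
    schedule: str          # human-readable schedule
    retention_days: int    # how long backups are kept
    destination: str       # storage or hardware needed for backup and recovery
    rto: timedelta         # recovery time objective
    rpo: timedelta         # recovery point objective


def missing_details(job: BackupJob) -> list:
    """Return the names of fields that are empty or not yet estimated."""
    problems = []
    for attr in ("name", "source", "backup_type", "schedule", "destination"):
        if not getattr(job, attr).strip():
            problems.append(attr)
    if job.retention_days <= 0:
        problems.append("retention_days")
    if job.rto <= timedelta(0):
        problems.append("rto")
    if job.rpo <= timedelta(0):
        problems.append("rpo")
    return problems
```

Running `missing_details` over every entry gives you a quick way to spot backup plans that are documented only partially, for example those with no RTO or RPO estimate yet.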
Backup and Recovery Tests
Backup and recovery are two different processes, and both should be tested so that nothing goes wrong when you need to get your data back from storage.
When testing backups, you should create a map of every piece of infrastructure and data that you need to back up. Here are the basic checks that you should perform regularly:
- Check your backup infrastructure. If you have a local backup infrastructure, check the SMART health status of your drives and the state of your NAS devices. If you back up to the cloud, check that all files in the storage are consistent.
- Check the consistency of your data. Some backup solutions have a feature to check the consistency of your data on the machine and in the storage, in order to ensure data integrity.
- Check that all parts of your infrastructure are covered. You have previously listed all the backup plans you run, but what if you missed something vital? Audit your infrastructure and make sure that everything critical for the company is being backed up.
- Check security settings. Have you enabled data encryption in transit and at rest? Do you need to encrypt filenames? Lastly, who has access to your backup storage? As a rule of thumb, you should apply the principle of least privilege to your access policies.
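If your backup solution does not offer a built-in consistency check, the idea can be sketched by comparing checksums. The example below is only an illustration: the function names `sha256_of` and `compare_trees` are hypothetical, and it assumes that you have a source folder and a copy of it (for example, a test restore) that are both reachable as local paths. Files that changed after the backup was taken will naturally show up as mismatches.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so that large files do not fill up memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(source_root: Path, copy_root: Path) -> dict:
    """Compare each file under source_root with the same relative path under copy_root."""
    report = {"ok": [], "mismatched": [], "missing": []}
    for src in sorted(source_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_root)
        dst = copy_root / rel
        if not dst.is_file():
            report["missing"].append(rel.as_posix())
        elif sha256_of(src) != sha256_of(dst):
            report["mismatched"].append(rel.as_posix())
        else:
            report["ok"].append(rel.as_posix())
    return report
```

An empty `missing` list and an empty `mismatched` list mean that every source file was found in the copy with identical content. Note that the sketch only walks the source tree, so it does not report extra files that exist only in the copy.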
Further reading: Key Technical Backup Challenges and How to Solve Them
When it comes to recovery, here are the most common rules for tests and checks that you should adhere to in order to be sure that you can recover anything, at any time:
- Test in accordance with your recovery time and recovery point estimations. These show you how fast you need to recover and how much data you can afford to lose in the event of downtime. If you haven't yet defined these parameters, here's an article on how you can estimate RTO and RPO.
- Define the scope of your tests. You should break down your recovery testing from the simplest case to the most demanding, and make sure that you regularly test each of them: single file recovery, recovery of a single machine or server (both including and excluding the infrastructure), recovery of the interconnected parts of your network and infrastructure and, lastly, disaster recovery in various scenarios.
- Define the schedule for your tests. You should schedule your tests for two reasons. First, you should run them regularly, so that they keep up with all the changes in your infrastructure. Second, you should make sure that your tests won't affect your business operations, which means scheduling them outside of business hours.
- Document everything. Every single part of your tests should be documented, including the schedule, the scope, the exact tests and their results, your RTO and RPO estimation, people authorized to perform the tests, and other team members that you might need to notify regarding the tests.
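To make the first and last rules more concrete, here is a rough sketch of how a single recovery test could be timed and compared against your estimations. Everything in it is an assumption made for illustration: `run_recovery_test` is a hypothetical helper, and `restore` stands for whatever callable or script wrapper triggers a restore in your environment and returns a truthy value on success.

```python
import time
from datetime import datetime, timedelta, timezone


def run_recovery_test(restore, backup_taken_at, rto, rpo, now=None):
    """Time a restore and compare the outcome with the RTO and RPO estimations.

    restore          hypothetical callable that performs the restore
                     and returns a truthy value on success
    backup_taken_at  timezone-aware datetime of the recovery point used
    rto, rpo         timedelta values taken from your backup list
    now              moment of the simulated failure (defaults to the current time)
    """
    now = now or datetime.now(timezone.utc)
    started = time.monotonic()
    succeeded = bool(restore())
    recovery_time = timedelta(seconds=time.monotonic() - started)
    # Everything written after the recovery point would be lost.
    data_loss_window = now - backup_taken_at
    return {
        "succeeded": succeeded,
        "recovery_time": recovery_time,
        "data_loss_window": data_loss_window,
        "rto_met": succeeded and recovery_time <= rto,
        "rpo_met": data_loss_window <= rpo,
    }
```

The returned dictionary can be stored alongside the test date and scope, which gives you a simple written record of each test and its result.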
Backup and recovery tests are not merely dull, routine exercises. Although they may not sound like the most enjoyable activities for an IT professional, they are designed to make sure that you can bring back every piece of your infrastructure in the event of any disaster, human error, or failure. And if you take a look at your complex, interconnected, partly on-prem and partly cloud-based infrastructure and network, you will immediately see how fragile all this complexity is.
Create a great testing environment and make sure that you have covered everything that is vital to your business, thus ensuring you a good night's sleep.