Disaster Recovery Planning Checklist for MSPs
Being prepared for a disaster starts with a plan. MSPs can offer their clients the creation of a disaster recovery plan (DRP) as a service. A DRP is a documented set of procedures for recovering a business’s IT infrastructure in the event of a disaster.
In this guide, we’ll explain how a disaster recovery plan differs from a business continuity plan, discuss its key elements and provide a checklist of what to keep in mind when creating such a plan.
It seems that not a week goes by without news of another organization that has had all of their files encrypted by ransomware, a data center hit by a natural disaster, or a cloud application that becomes suddenly unavailable. These businesses range from one-person shops to city governments, to one of the largest shipping concerns in the world.
The amounts of money involved grow dramatically with the size of the company, but the chances of surviving the loss of data are worse for smaller businesses, which may lack the funds to go 30 days or more without revenue if a disaster occurs.
Unless the organizations that use your managed IT services can do without their files, apps and computing capacity, every one of them, whatever its size and wherever its files are located, needs a plan for how to recover after a disaster and how to keep the business running during and after the event.
Disaster Recovery Plan vs Business Continuity Plan
Recovering from a disaster will have two parts – finding a place to put files back, and then actually putting them back. If a disaster such as a fire, hurricane or flood destroys your data center, you’ll need new servers to restore files to.
Even if there is a backup that’s effective and complete, where will the files go?
If the problem is ransomware, or a service provider losing your files, it may be possible to restore them to their original locations, but the servers themselves may have to be re-configured with different network addresses, or your client workstations may have also been encrypted and will need to be restored before users can resume their work.
Both parts are critical: the plan to ensure users can continue to work, and the plan to get files back. This is why the BCP (business continuity plan) and DRP (disaster recovery plan) are often merged into a single plan.
Learn the difference between disaster recovery and business continuity in more detail and check how to ensure that the disaster recovery solution meets the requirements of a given business continuity plan:
Further reading: Business Continuity vs Disaster Recovery vs BCDR
Disaster Recovery Plan Checklist
1. Prepare your customers
You can start the process of creating a plan by educating your customers. A quick search on the web will uncover a wide variety of disasters that have recently befallen companies like theirs. Share these with your clients and ask for their help in figuring out ways to get around the problems.
2. Identify business processes
The first step in drafting the BCDR plan itself is to identify all of your customer’s business processes. This covers not only what data is stored where, but also individuals with critical knowledge that is not written down anywhere.
For example, a company recently found that the only person with access to $190 million in company funds stored in a cryptocurrency account had died or disappeared. Another company found that the only user with a key to their account had forgotten the code, which cost them $30,000. What if there is only one user who knows how to create an invoice through the accounting system, and they are hit by a bus on the way into work?
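The “only one person knows” risk described above can be checked mechanically once processes are listed. The following is a minimal sketch, assuming you keep a simple mapping from each business process to the people able to perform it; the process names and people are hypothetical.

```python
def single_points_of_knowledge(process_owners):
    """Return processes that at most one person knows how to perform."""
    return sorted(
        process
        for process, people in process_owners.items()
        if len(set(people)) <= 1
    )

# Hypothetical inventory: process name -> people who can perform it.
process_owners = {
    "create invoices": ["Dana"],
    "run payroll": ["Dana", "Lee"],
    "access crypto wallet": ["Sam"],
}

at_risk = single_points_of_knowledge(process_owners)
print(at_risk)  # ['access crypto wallet', 'create invoices']
```

Every process in the result is a candidate for documentation or cross-training before the plan is finalized.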
3. Prepare for coming disasters if you are warned
If you have sufficient warning of an oncoming disaster, and your client’s employees haven’t left to take care of their homes, there are steps you can take to protect your site before disaster arrives.
If you know an extended lightning storm is coming, shut down and unplug major systems. Make sure you can recover them after the storm. If a flood is coming, shut down systems and get equipment up above expected flood levels. If a hurricane is coming, nail plywood over the windows.
4. Make use of available tools
Quality management standards such as ISO 9001 can help to identify and document all business processes. You don’t have to adopt ISO 9001 to make use of its techniques. What matters is that you look at every step in your customer’s business processes and make sure each one can survive a disaster.
You may find that some employees are reluctant to document their day-to-day activities, whether because of a sense of turf or so that they remain irreplaceable. One simple way that some organizations have used to motivate employees to cover their bases is to point out that vacations can’t be taken if they have information the company can’t do without.
Again, this doesn’t have to be complex and include every possible step – it’s only necessary to identify potential sticking points and document them.
5. Locate data and create a backup plan
After identifying all the processes and what information is stored where, it’s time to begin creating a plan to find and back up all that data. At one time, this would have involved backing up the mainframe. Today, data lives on individual users’ PCs, on servers in the data center, in one or more public clouds and in private clouds as well, so just finding all the data your customer’s organization might be using could take a while. You might also find that some users have added to the pile after your plan was completed. This is why testing and regularly revisiting the plan is also a necessity.
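One way to keep track of where data lives is a small inventory that flags locations with no backup, or that have not been reviewed recently. The sketch below is illustrative only. The `DataLocation` fields, the 90-day review window and the sample entries are assumptions, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DataLocation:
    name: str            # e.g. "file server", "sales laptops"
    kind: str            # "pc", "server", "public cloud", "private cloud"
    backed_up: bool
    last_reviewed: date  # when this entry was last confirmed accurate

def needs_attention(locations, today, max_age_days=90):
    """Return names of locations that lack a backup or a recent review."""
    flagged = []
    for loc in locations:
        stale = (today - loc.last_reviewed).days > max_age_days
        if not loc.backed_up or stale:
            flagged.append(loc.name)
    return flagged

# Hypothetical inventory entries.
inventory = [
    DataLocation("file server", "server", True, date(2024, 5, 1)),
    DataLocation("sales laptops", "pc", False, date(2024, 5, 1)),
    DataLocation("shared cloud drive", "public cloud", True, date(2023, 11, 1)),
]

flagged = needs_attention(inventory, today=date(2024, 6, 1))
print(flagged)  # ['sales laptops', 'shared cloud drive']
```

Re-running the check each time the plan is revisited helps surface data that users have added since the last review.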
6. Test, test, test
The plan needs to be tested. Regularly tested. No, you should test a lot. The annals of IT are full of stories of backups that completed without a hitch for years, until someone tried to restore, at which point it was found that the backups weren’t complete or that they’d been failing silently for months or years, and that there was no way to recover lost files. So test. You don’t have to test the whole system every day, or week, or even month.
A useful schedule might be to test the whole system once it’s first built, then a different directory every week, a full server once a quarter, and the whole data center once a year.
It’s perfectly feasible to make most tests non-disruptive – restore files to an alternate location, and make sure they’re all up to date, then delete the copies once they’ve been verified.
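A non-disruptive test of this kind can be sketched as follows: restore to an alternate directory, compare each restored file with its source by checksum, and delete the copies only if everything matches. The function names are illustrative, and the sketch assumes the source files have not changed since the backup was taken. Otherwise, compare against checksums recorded at backup time instead.

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_restore(source_dir, restored_dir):
    """Return relative paths that are missing from, or differ in, the restore."""
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        copy = restored_dir / rel
        if not copy.is_file() or sha256_of(src) != sha256_of(copy):
            problems.append(rel.as_posix())
    return problems

def verify_and_clean_up(source_dir, restored_dir):
    """Verify the restore, then delete the copies only if nothing is wrong."""
    problems = verify_restore(source_dir, restored_dir)
    if not problems:
        shutil.rmtree(restored_dir)
    return problems
```

An empty result means the restore matched and the temporary copies were removed. A non-empty result leaves the copies in place for investigation.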
However, some tests may need to be disruptive. For instance, if the recovery involves moving servers from one location to another, you also have to test the steps that allow users to reach the new servers in their different locations. This could be done over a long weekend while most users are offline and out of the office.
Some backups can serve two purposes. For instance, a backup copy of the company’s customer database can be used by application developers to test against. You only need to ensure that private data, such as users’ Social Security numbers (SSNs) or other personal information, is properly encrypted or redacted.
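As a rough illustration of redaction before a backup copy is handed to developers, the sketch below masks US Social Security numbers written in the common dashed form. It is deliberately narrow: the pattern and the sample record are assumptions, and it does not catch undashed SSNs or other personal data.

```python
import re

# Matches SSNs written as 123-45-6789. Other formats are not covered.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact_ssns(text):
    """Replace every dashed SSN in `text` with a fixed mask."""
    return SSN_PATTERN.sub("XXX-XX-XXXX", text)

record = "Jane Doe, SSN 123-45-6789, joined 2019-04-01"  # made-up sample
redacted = redact_ssns(record)
print(redacted)  # Jane Doe, SSN XXX-XX-XXXX, joined 2019-04-01
```

Note that the date in the sample is left untouched, because it does not fit the three-two-four digit grouping.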
Further reading: Disaster Recovery Testing: Scenarios, Best Practices, Methods
How much is too much?
It’s possible to have a plan that gets vital operations back up and running in almost no time, regardless of the level of disaster. For instance, some Navy operations resumed the day after the Pentagon was hit on 9/11. All files had been mirrored off-site, and users were able to deploy to an alternate location and continue work almost immediately, using PCs and other systems that had been deployed in advance. Most businesses can’t afford this level of preparation, but it certainly is possible.
The key here is to identify a recovery time objective (RTO), the length of time your client’s business can afford to be without its data or equipment, and to set the response based on that. For instance, a system that allows for 24 hours of downtime will be much less expensive than one that only allows 15 minutes.
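The RTO trade-off can be made concrete by picking the cheapest recovery option that still meets the client’s RTO. The sketch below is purely illustrative: the option names, recovery times and monthly costs are made-up placeholders, not real pricing.

```python
from dataclasses import dataclass

@dataclass
class RecoveryOption:
    name: str
    recovery_hours: float  # how long recovery is expected to take
    monthly_cost: float    # placeholder figure, not real pricing

def cheapest_option_meeting_rto(options, rto_hours):
    """Return the lowest-cost option that recovers within the RTO, or None."""
    candidates = [o for o in options if o.recovery_hours <= rto_hours]
    if not candidates:
        return None
    return min(candidates, key=lambda o: o.monthly_cost)

# Hypothetical options with made-up numbers.
options = [
    RecoveryOption("hot standby site", 0.25, 5000.0),
    RecoveryOption("warm cloud replicas", 4.0, 1200.0),
    RecoveryOption("restore from cloud backup", 24.0, 200.0),
]

relaxed = cheapest_option_meeting_rto(options, rto_hours=24)
strict = cheapest_option_meeting_rto(options, rto_hours=0.25)
print(relaxed.name)  # restore from cloud backup
print(strict.name)   # hot standby site
```

A result of `None` signals that no listed option meets the RTO, so either the RTO or the budget has to be revisited with the client.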
Key Elements of IT Disaster Recovery Response Plan
Each type of disaster has checklist items that may not apply to other types of disasters. For instance, the 2018 Camp Fire in Paradise, California not only destroyed thousands of homes and businesses, but people lost their lives, access to the area was restricted for many days afterward, and the area may take years to rebuild. This makes a server farm shutdown seem trivial by comparison. Even this level of disaster can have a measured response, though.
- Make sure responsibilities for disaster responses are assigned
One such responsibility might be to pick a nearby city and identify workspaces for rent that can be used to set up an alternate location.
- Implement your response to the disaster
This might mean recovering servers to a location in the cloud and restoring files to them, or renting new office space in another city. The primary driver here is cost. Most small businesses can’t afford to rent empty office space with computers and furniture against the off chance that they might be needed, but they can know where the nearest workspace rental is and whom to call to get started.
- Restore data from backups to the new systems
If every business process is documented and backups have been tested, then moving should be relatively painless – everyone should have an understanding of where information can be found and how data can be accessed. Everything can and should be protected – even the file cabinets full of old documents can be scanned and stored online for relatively little. How much would you miss the paper files if a fire swept through?
- Reverse the process when it’s safe to return your users to their home office
The backup plan should include the process to resume normal operations. Part of the testing process is to identify potential problems in the reversal process and fix them before the process is needed for real.
Floods, fires and bad weather are increasing in frequency, intensity and the amount of damage caused. Add human-caused and technical disasters such as ransomware, server malfunctions and storage problems, and the future looks like a minefield for businesses of all sizes. The degree of protection your clients’ businesses need will vary, but options are widely available and much cheaper than trying to identify what has been lost and how it might be recovered after the fact.