Recently, we conducted disaster recovery (DR) testing on one of our crucial applications. The server was running Windows 2008 on an HP physical box. We performed bare metal restore (BMR) using Symantec Netbackup 7.1. However, after Symantec BMR completed the restore, the server will not boot up. We troubleshoot the problem and tried several configurations. It took a couple of days before we figured out the issue. The issue, by the way, was that the boot sector got misaligned after the restore and we have to use Windows installation disk to repair it.
What if it was a real server disaster? The business cannot wait for a couple of days to restore the server. We defined an RTO (Recovery Time Objective) for that server to be 8 hours. And we did not meet it during our testing. This is the reason why DR testing is very important.
During DR testing, we have to test the restore technology and the restore procedures. In addition, we need to test if we can restore it on time (RTO) and if we can restore the data at a point in time (or RPO – Recovery Point Objective) (e.g. from a day before, or from a week ago).
With a lot of companies outsourcing their DR to third parties or to the cloud, DR testing becomes even more important. How do you know if the restore works? How do you know if their DR solution meets your RPO and RTO? Companies assume that because backups are being done, then restore will automatically work.
We perform DR testing once a year. But for crucial applications and data, I recommend DR testing twice a year. Also, perform a test every time you make significant changes on your backup infrastructure, such as software updates.