HADR for Crying Out Loud

HADR for Crying Out Loud

The need for High Availability and Disaster Recovery

I was on my morning walk recently when I noticed this laying on the walking path. Some call it a binky. Others a pacifier or fooler. Whatever you call it, it’s an essential piece of equipment for some infants. Without it, their world falls apart. And the calamity is contagious. They, and everyone around them, are miserable until an adequate replacement can be found and peace is restored. That’s where HADR comes in.

HADR for binkies? What? And how does this relate to SQL Server?

Sometimes the difference between the HA (High Availability) and DR (Disaster Recovery) can be difficult to grasp. There a nuances and the approaches can overlap. So, let’s consider an analogy – binkies and HADR in SQL Server.

As an infant, one of our daughters was very partial to her binky. She was seldom without one. Day or night, she “needed” her binky.

The Disaster Recovery analogy

We knew that life was better when she had a binky. Every time we left home, we strapped her into her infant car carrier, also known as a her “cry bucket”, and ensured she had a binky in her mouth.

If it was lost during our trip, life would be unpleasant. So we were prepared. We carried extra. At least one, sometimes more than one, just in case the unthinkable happened.

If she somehow managed to lose her binky, if it fell out along the way unnoticed, we would open the diaper bag and retrieve the spare.

That was binky DR. We would have to manually step in to solve the problem. She was disrupted until the fix was in place. But it was better than long periods of screaming about it. We could handle a little disruption and still avoid an all out disaster.

How this applies to SQL Server

The same is true for SQL Server. DR is a way to recover from some kind of disaster. It could be at the data center level, or at a SQL Server instance level. In either case, we need to recover from some kind of failure as quickly as possible.

In SQL Server, that could mean:

  • Restoring from a backup. Assuming your backup process is robust, meets your RTOs and RPOs, and is regularly tested, you can apply the various backup files to get you back to whole again. Typically, you’ll restore a full backup, a differential backup, and then a serious of transaction log backups.
  • Restore a virtual machine. If the failure affects the instance and not just a database, you may have to restore a virtual machine image. You may be done after that. Or you may have to restore a the backup files as described before to meet your RPO requirement.
  • Switch to a stand-by SQL Server. Log-shipping can be a cost effective way to implement a level of DR. If you’re leveraging it as part of DR strategy, to recover you’ll need to make the standby SQL Server the new production SQL Server. You’ll stop log-shipping, recover database on the standby server, and point everything to the new production instance.
  • Site-wide DR. Of course, some companies have a site-wide DR plan. If the entire data center is out of commission, they can failover to the DR site in another city or region. Properly configured, that can work very well. Typically, however, you don’t want to do that for a single instance of SQL Server. It’s reserved for disasters on a larger scale.

Disaster Recovery Summary

In all of these cases, manual intervention is required. Someone had to do something to get things back to normal. In our analogy, someone had to dig through the diaper bag to find the spare and give it to our daughter. Until that point, everyone was unhappy.

It’s also worth noting that we checked and double checked that we had a spare binky before leaving home. For SQL Server, you want to regularly test your backups to make sure you can indeed recover from them when needed.

High Availability

Our approach to binky DR when we left home tended to work pretty well. There were times when we weren’t able to recover quickly enough and drew awkward glares from those around us. However, for the most part it was a success.

At night it was a different story.

We would go through the bedtime process – reading a story, rocking for a little while, maybe even humming a lullaby, and gently placing her in her crib. Then we’d tip toe out of the room.

Unfortunately, at some point during the night, her binky would slip from her mouth. And before long, she’d notice. And when she noticed, we all were quickly made aware. One of us would have to get up, search through the blanket and put the binky back into her mouth.

Fortunately, we discovered a better way.

We began putting her in her crib with a binky in her mouth, and a spare one in each hand. That way, when the binky slipped from her mouth, she could quickly and easily replace it with one from her hand. This allowed her to sleep for longer periods and time. And us too!

Binky high availability at its finest.

How this applies to SQL Server

Many mission critical systems depend on SQL Server. For example, Practice Management Systems or Electronic Medical Records in healthcare, Project Management Systems or Job Costing Systems in construction, and core systems in the banking. They all rely on SQL Server. If SQL Server is down, the business is down. And that can be costly.

For those situations, we can leverage High Availability in SQL Server. Two technologies are commonly used.

  • Failover Cluster Instance (FCI). With an FCI, one SQL Server instance is effectively spread across multiple Windows Server Failover Cluster nodes. SQL Server runs on the primary node. In the event of a planned or unplanned downtime, a secondary node will start its services, take ownership of the shared resources, and begin servicing application requests. This can happen automatically, often with minimal disruption to users.
  • Availability Groups (AG). With an AG, SQL Server is installed on two or more nodes of a Windows Server Failover Cluster. (Strictly speaking the nodes don’t have to be WSFC. Linux clusters can be used.) As with the FCI, if a disruption occurs on the primary node, a secondary node will be promoted to primary and all application requests will be automatically directed to the new primary node. There are a lot of nuances to configuring an AG including synchronous / asynchronous transactions, automatic or manual failover, etc. The point here is that it can be configured to happen automatically and will little noticeable disruption to the end users.

High Availability Summary

One really nice distinction with High Availability is that it can happen automatically, without someone having to do something to make it happen. For businesses running critical applications on SQL Server, this can prevent costly unplanned and planned downtime.

For parents with kids dependent on binkies, this could mean hours of additional uninterrupted sleep each night.

In either case, HA can be really worth looking into.

Want to learn more?

Hopefully this analogy has helped. Here are a few addition resources that may likewise be helpful.

If you’d like to talk though how improve your SQL Server’s reliability, give us a call. We’re happy to help.

 

Leave a Reply

Your email address will not be published. Required fields are marked *