High Availability

Today’s Web-based applications usually depend on multiple servers, perhaps hundreds or thousands of client workstations, internal and external network communications, database services, operational processes, and a host of other infrastructure services that must all work together reliably. Where the business ideal is a continuous flow of information, and an interruption costs your company money, creating high-availability applications becomes an important business strategy.

Downtime causes immediate problems and long-term losses that extend far beyond the local computing environment, such as annoyed customers and broken supply chains.

About 40% of application downtime is caused by inadequate testing, weak change management, and a lack of ongoing failure monitoring. Another 40% derives from operational errors, rooted in the absence of rigorous procedures and in backup/restoration mistakes. Hardware reliability has improved so much over the past few years that typically less than 10% of downtime is due to hardware problems; the remaining 10% is caused by environmental and other miscellaneous problems.

In general terms, availability is a measure of how often the application is available for use. More precisely, availability is the percentage of time the application is actually able to handle service requests, measured against its total planned runtime. The formal calculation of availability includes repair time, because an application that is being repaired is not available for use.

The calculation for availability uses several measurements:

Name                         Acronym   Calculation                    Definition
Mean Time Between Failures   MTBF      Hours / Failure Count          Average length of time the application runs before failing.
Mean Time To Recovery        MTTR      Repair Hours / Failure Count   Average length of time needed to repair and restore service after a failure.

The availability formula looks like this:

Availability = (MTBF / (MTBF + MTTR)) × 100

For example, consider an application that is intended to run perpetually. Using 1,000 continuous hours as a checkpoint, two 1-hour failures during that time give MTBF = 1,000 / 2 = 500 hours and MTTR = 2 / 2 = 1 hour, so availability is (500 / (500 + 1)) × 100 = (500 / 501) × 100 = 0.998 × 100 = 99.8%.
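
To make the arithmetic concrete, here is the same calculation as a small Python sketch (the function name and arguments are ours, purely for illustration):

    def availability(operating_hours, failure_count, repair_hours):
        """Availability percentage from MTBF and MTTR, per the formula above."""
        mtbf = operating_hours / failure_count   # Mean Time Between Failures
        mttr = repair_hours / failure_count      # Mean Time To Recovery
        return mtbf / (mtbf + mttr) * 100

    # Two 1-hour failures across a 1,000-hour window:
    print(availability(1000, 2, 2))   # 99.8003992015968, i.e. about 99.8%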

One popular way to describe availability is by the “nines,” such as three nines for 99.9% availability. However, the implication of measuring by nines is sometimes misunderstood. You need to do the arithmetic to discover that three nines (99.9% availability) represents about 8.8 hours of service outage in a single year. The next level up, four nines (99.99%), represents about 53 minutes of service outage in a single year. Five nines (99.999%) represents only about 5 minutes of outage per year.
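
A quick way to check those figures is to multiply the allowed downtime fraction by the 8,760 hours in a year:

    # Downtime per year implied by each availability level (8,760-hour year).
    for nines in (0.999, 0.9999, 0.99999):
        downtime_minutes = 8760 * 60 * (1 - nines)
        print(f"{nines:.3%} availability -> {downtime_minutes:.0f} minutes of outage/year")

    # 99.900% availability -> 526 minutes of outage/year (about 8.8 hours)
    # 99.990% availability -> 53 minutes of outage/year
    # 99.999% availability -> 5 minutes of outage/year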

Availability targets are usually set according to application criticality:

Non-Commercial Applications (99%)
Commercial Applications (99.5%)
Business-Critical Applications (99.9%)
Mission-Critical Applications (99.999%)

1. Designing for High Availability
1.1. Clustering – A cluster consists of multiple computers that are physically networked together and logically connected by cluster software. Clustering allows two or more independent servers to behave as a single system. In the event of a failure (whether in a CPU, motherboard, storage adapter, network card, or application component), the workload is automatically moved to another server, current client processes are switched over, and the failed application service is restarted, all without apparent downtime. When a hardware or software resource fails, clients connected to that server cluster may experience a slight delay, but their requests are still completed. Cluster software can provide failover support for applications, file and print services, databases, and messaging systems.
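
From the client’s point of view, failover can be as simple as retrying the next node in the cluster. The sketch below illustrates the idea only; the hostnames, port, and send_request helper are hypothetical, and real cluster software handles this transparently at a lower level:

    import socket

    CLUSTER_NODES = ["node1.example.com", "node2.example.com"]   # hypothetical hosts

    def send_request(payload: bytes) -> bytes:
        """Try each cluster node in turn, failing over on connection errors."""
        for host in CLUSTER_NODES:
            try:
                with socket.create_connection((host, 8080), timeout=5) as conn:
                    conn.sendall(payload)
                    return conn.recv(4096)
            except OSError:
                continue   # this node is down; fail over to the next one
        raise ConnectionError("all cluster nodes are unavailable")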

1.2. Network Load Balancing – Network Load Balancing (NLB) distributes Web traffic evenly across the available servers. NLB also helps with availability: should a server fail, the cluster can be redefined so that traffic is directed to the remaining servers. NLB is especially beneficial for e-commerce applications that link external clients with transactions to back-end data stores: as client traffic increases, additional servers can be added (up to 32 servers in a single cluster).
NLB automatically detects a server failure and redirects client traffic to the remaining servers within ten seconds, all while maintaining continuous, unbroken client service. NLB is preferable to round-robin DNS, which keeps handing out the addresses of failed servers until its records are updated and cached entries expire.
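
The distribution logic can be pictured as a round-robin dispatcher that skips failed servers. This toy Python sketch shows the concept only; real NLB operates at the network layer rather than in application code, and the server names are invented:

    import itertools

    class LoadBalancer:
        def __init__(self, servers):
            self.healthy = set(servers)
            self._cycle = itertools.cycle(list(servers))

        def mark_down(self, server):
            self.healthy.discard(server)   # failure detected: stop routing here

        def next_server(self):
            if not self.healthy:
                raise RuntimeError("no healthy servers remain")
            for candidate in self._cycle:  # skip over servers marked down
                if candidate in self.healthy:
                    return candidate

    lb = LoadBalancer(["web1", "web2", "web3"])
    lb.mark_down("web2")                           # simulate a detected failure
    print([lb.next_server() for _ in range(4)])    # ['web1', 'web3', 'web1', 'web3']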

1.3. RAID for Data Stores – RAID stands for Redundant Array of Independent Disks. RAID uses multiple hard disks so that data is stored redundantly in more than one place. The benefit is that when a disk fails, the system automatically switches to a mirrored copy or reconstructs the data from the remaining disks, the application continues running, and the failed disk can be replaced with no interruption to the running application.
RAID can be configured in several ways (the RAID levels), but the outcome is the same: if one drive is lost, a replacement can be inserted and its data rebuilt from the information on the other drives.
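
The reconstruction idea behind parity-based RAID levels (such as RAID 5) can be demonstrated with the XOR operation. This is a simplified byte-level sketch, not a disk driver:

    def xor_blocks(*blocks: bytes) -> bytes:
        """XOR equal-length data blocks together, byte by byte."""
        result = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                result[i] ^= byte
        return bytes(result)

    disk1, disk2 = b"DATA", b"MORE"
    parity = xor_blocks(disk1, disk2)     # parity block, stored on a third disk

    # disk1 fails: rebuild its contents from the surviving disk and the parity.
    rebuilt = xor_blocks(disk2, parity)
    assert rebuilt == disk1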

1.4. Distributed File System (DFS) – DFS is a logical file structure applied across multiple servers and file shares. Its most visible feature is that it resolves logical names, such as mapped drive letters, to physical UNC paths, giving users consistent file names while keeping the physical location of the data hidden. This yields a subtle benefit for availability engineering: a DFS administrator can point a logical name at redundant file copies, increasing the likelihood that the needed file can be reached even when the primary data store is down.
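
Conceptually, the resolution step looks like the sketch below; the namespace mapping and share paths are invented for illustration:

    import os

    # Hypothetical logical-to-physical mapping with redundant replicas.
    NAMESPACE = {
        r"\\corp\docs": [r"\\server1\docs", r"\\server2\docs"],
    }

    def resolve(logical_path: str) -> str:
        """Return the first reachable physical replica of a logical path."""
        for physical in NAMESPACE.get(logical_path, []):
            if os.path.exists(physical):   # crude reachability check
                return physical
        raise FileNotFoundError(f"no reachable replica of {logical_path}")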

1.5. Reduce Planned Downtime
1.6. Isolate Mission-Critical Applications


2. Testing for Availability

2.1. Test the Change Control Process
2.2. Test the Failover Technologies
2.3. Test the Monitoring Technology
