Friday, March 7, 2014

Zero Downtime: an opportunity cost is an opportunity lost!



What is fault tolerance in a virtual world? Sometimes I wonder, what do 'fault tolerance' or 'zero downtime' really mean to people at large? How do they make a difference in our lives? Do we really need to know all the technology jargon, or how things function? Probably not. Folks! We don't need to know all of it. But we can definitely cherish the benefits that come to us as if by the magic word 'Abracadabra': glitch-free access to the everyday applications we use. Whether it is Facebook or Twitter we are working on, our email, a mobile application running on a server hosted far off in a data center, or a banking application across the globe, it doesn't matter. It doesn't even matter if we are digging deep into 'Big Data', performing data mining or running analytics for businesses, social media marketing, political campaigns, philanthropic activities, or astronomical number-crunching to explore the universe on a high-powered computing setup, extracting information from silos of data or from databases backed by virtual machines with no downtime. Or it could be any other task we can think of that requires uninterrupted access.

It might astonish some of us how much jugglery happens behind the scenes that we never notice or come to know about. A huge amount of swapping of VMs (virtual machines) occurs in fractions of a second, within the network of a data center, across interconnected data centers, in a cloud, or in the cloud of clouds, 'The Cloud', a larger cloud formed from multiple integrated clouds, all working constantly in real time to provide us an environment with zero downtime.

Well, everything in a virtualized environment is automated and linked together with pre-coded instructions derived from any number of algorithms, designed to run through more permutations and combinations than a human ever could, to make this happen. Friends! You know the irony: we build technology to eliminate ourselves. This is the price we pay to build and design an automated system that takes care of itself with minimal intervention from us. Sometimes I think the downtime we have all experienced in the past can be blamed on a person who did not do his or her job carefully; otherwise we would never have thought of a foolproof system that takes care of itself, in fact an automated system that can fix itself before we even come to know. The amount of energy, resources, effort, and technology used, the algorithms running, the information documented and processed, the security checks, crossovers, swapping of machines and much more, all happen just to provide us an environment where not a single keystroke goes to waste, and where a frustrated devil doesn't come out of us to hit the monitor or keyboard over the smallest glitch that could cost us a million dollars' worth of loss at work. It could be a trigger to 'an opportunity cost is an opportunity……?'

Well, coming back to the concept of virtualization, which has been floating around for a while thanks to a leading technology organization. True! We cannot miss it, and the very first name that pops up is 'VMware'. Am I right? To be honest, I think these folks have really pushed their brand name hard on us. It reminds me of the word 'Windows' in the '80s and how it evolved into our everyday lives. VMware may not have reached that magnitude as such, but we can somewhat relate to it that way. Alright! If not everyone, at least some of us for sure. The question is, how did they come up with this idea and what was the objective behind it? My guess is that our industry was struggling with the traditional methods of handling hardware and software failures, and these guys turned out to be the lucky ones!! Alright!! The intelligent ones :). Today, I believe, VMware has made our lives easier by developing this 'fault tolerance' component, which is now widely used in enterprise businesses to prevent application disruption due to hardware failures. They designed it for mission-critical enterprise applications, where disruption can be very expensive for businesses, whereas traditional solutions that address this problem through hardware redundancy or clustering are complex and costly. If we compare it with VMware High Availability (HA), which addresses server failures by automatically restarting virtual machines on alternate servers, we will find that VMware's fault tolerance takes the entire ball game of high availability to the next level. What it does is completely wipe out downtime due to hardware failures, with simplicity, across all applications, regardless of operating system. Isn't it amazing!!

Basically, it provides operational continuity and high levels of uptime to an information technology infrastructure, with simplicity and at a low cost. Let's try to understand how it works, so all of us can get the hang of it. It works with existing VMware High Availability (HA) or Distributed Resource Scheduler (DRS) clusters and can simply be turned on or off per virtual machine. When applications require operational continuity during critical periods, such as month-end or quarter-end for financial applications, the fault tolerance feature can be turned on with the click of a button to provide extra assurance. The operational simplicity of this fault tolerance component, embedded right in vSphere, is a big life saver, and a cost saver at times.
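For the curious, here is roughly what that 'click of a button' looks like when driven from a script instead of the vSphere client. This is only a minimal sketch using the open-source pyVmomi Python SDK and the vSphere API's CreateSecondaryVM_Task / TurnOffFaultToleranceForVM_Task calls; the vCenter address, credentials, and VM name are placeholders of mine, and it assumes the fault tolerance prerequisites (an HA-enabled cluster, an FT logging network, shared storage) are already in place.

```python
# Minimal sketch: toggling VMware Fault Tolerance on a VM via pyVmomi.
# All names below (vCenter host, credentials, VM name) are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()            # lab only; verify certs in production
si = SmartConnect(host="vcenter.example.com",     # placeholder vCenter address
                  user="administrator@vsphere.local",
                  pwd="secret",
                  sslContext=ctx)
content = si.RetrieveContent()

# Find the virtual machine we want to protect (name is a placeholder).
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "finance-app-01")

# Turn fault tolerance ON: vCenter creates and maintains a secondary copy
# of this VM on another host in the cluster.
vm.CreateSecondaryVM_Task()

# ...later, e.g. once the quarter-end close is over, turn it back OFF
# to free up the resources the secondary copy was consuming:
# vm.TurnOffFaultToleranceForVM_Task()

Disconnect(si)
```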

High availability is commonly understood as a method to ensure a resource is always available. But the fact of the matter is that the resource may still suffer a few minor downtimes. For instance, Hyper-V also has a high availability feature, but in the event a host fails, the guest operating system simply stops; there is no time to migrate the running state to another host, so the result is a brief downtime while the VM restarts elsewhere. Irrespective of the technology owner, it is the same scenario with VMware High Availability (HA). Despite VMware's vMotion capabilities, vMotion cannot be used here because the host stops right at that moment, leaving no live memory from which to move the guest OS. Thus, we lose the in-memory application state with high availability.

On the other hand, 'fault tolerance' means we don't lose the in-memory application state in the event of a failure such as a host crash. Seen this way, fault tolerance is much stronger than high availability in a virtual environment, but it forces us to maintain two copies of a virtual machine, each on a separate host. Every change in memory state and device status on the primary is automatically recorded and replayed simultaneously on the secondary copy of the VM, so the secondary is always ready to take over.
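To make that contrast concrete, here is a tiny, purely illustrative Python sketch of the idea (my own toy model, not VMware's implementation): with high availability the VM is rebooted fresh on another host and its memory is gone, whereas with fault tolerance every change is replayed on a secondary copy that takes over with its state intact.

```python
# Toy model of HA vs. fault tolerance. Invented names, for illustration only.

class VirtualMachine:
    """A pretend VM whose 'memory' is just a dict of application state."""
    def __init__(self, name):
        self.name = name
        self.memory = {}                    # in-memory application state

    def apply(self, key, value):
        """An application writes something into memory."""
        self.memory[key] = value


def ha_failover(failed_vm):
    """HA: restart the VM from scratch on another host; memory starts empty."""
    return VirtualMachine(failed_vm.name)


class FaultTolerantPair:
    """FT: every change on the primary is recorded and replayed on a
    secondary copy kept on a different host."""
    def __init__(self, name):
        self.primary = VirtualMachine(name)
        self.secondary = VirtualMachine(name + "-secondary")

    def apply(self, key, value):
        self.primary.apply(key, value)      # change happens on the primary...
        self.secondary.apply(key, value)    # ...and is replayed on the secondary

    def failover(self):
        """Primary's host crashes: the secondary simply takes over."""
        return self.secondary


if __name__ == "__main__":
    # High availability: the in-flight state does not survive the crash.
    ha_vm = VirtualMachine("finance-app")
    ha_vm.apply("pending_transaction", 1000000)
    ha_vm = ha_failover(ha_vm)
    print("After HA restart:", ha_vm.memory)        # {}  (state is gone)

    # Fault tolerance: the secondary already holds the same state.
    ft = FaultTolerantPair("finance-app")
    ft.apply("pending_transaction", 1000000)
    survivor = ft.failover()
    print("After FT failover:", survivor.memory)    # state is preserved
```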

Currently, only VMware vSphere has this fault tolerance capability, and it supports only a single virtual processor per fault-tolerant VM. Fault tolerance also has very demanding network requirements, but it delivers a solution that results in no downtime, even if a host fails.

