AVAIL – Availability Calculator
The “AVAIL” Availability Calculator will help you to analyse the availability performance of networks made up of components with occasional random failures followed by restorations with random durations. Within that sort of environment, would you believe that the following situations are all possible, even if unusual:
Using the AVAIL Workbook
The Calculator Sheet
A Note About Accuracy
What is a Component
Two Systems with the Same Availability
A Fault That Gets “Younger” the Longer it “Lives”
One More for the Road
To begin, download the Zip file of the Excel Workbook “Availability Calculator” for an interactive tutorial on Availability Calculations. The SHA256 Checksums for the zip file and the unzipped workbook are:
Security is hard – really hard. SHA256 is generally taken to be very secure; however, checking SHA sums relies on the assumption that a miscreant has not hacked my web site, replaced the zip file with a zipped spreadsheet containing malware, re-computed both hashes, and then edited this page to show the new hashes. To the best of my knowledge that is no more likely to have happened to this web site than to many others you probably trust. In any case, you also need to trust that I have not included malware in the spreadsheet myself – maliciously or inadvertently. I assure you I have not done so maliciously, and I am confident I have not done so inadvertently because my students at the University of Melbourne have used it for many years with no complaints. If you have any remaining doubts, check my LinkedIn page. I look really honest. Even without a neck-tie. And many people, including many students, have joined my network.
Using the AVAIL Workbook
Download then uncompress the Zip file, then open the Excel Workbook. You will see the dialog box:
The Availability Calculator relies on Excel Macros, so click on “Enable Macros”, since I am now a fully trusted source.
The workbook opens on the first sheet “Availability and Reliability”, which is considered “Home”. Altogether there are 16 sheets:
The “Units” sheets for Availability, Failures, and Restorations allow conversions between different ways of expressing those quantities. For example, a motor car may be unavailable for an average of six hours per year; a bus service may have three “nines” availability – which has the better availability? Of course this calculation is trivial, but having a servant to perform the arithmetic will be useful for more complicated problems. To answer this particular question, click your way to “Availability Units” (or just select the tab of that sheet at the bottom of the workbook). You will see this screen:
Following the instructions in red, enter the value 6 in the Data column next to the label “hours of downtime per year”
(the cell presently contains the value 87.66).
Press Enter, then click on the Calc button right next to the value you just entered.
All the other entries will be updated. The car has availability, as a real number, of 0.999315537
(don't worry that the precision is a bit excessive!).
The Unavailability (defined as 1-Availability) is a bit less than 0.001 or 10^-3.
Hence the number of “Nines” (defined as -log10(Unavailability)) is a bit greater than 3,
so the car beats the bus, in this example.
Both of the other “Units” sheets work in much the same way.
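The conversions on the Availability Units sheet can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not the workbook's own macro code: the function names are made up for this example, the logarithm is taken to base 10, and one year is taken as 8766 hours.

```python
import math

HOURS_PER_YEAR = 8766.0  # 365.25 days x 24 hours (convention used in this tutorial)


def availability_from_downtime(hours_down_per_year: float) -> float:
    """Availability as a real number between 0 and 1."""
    return 1.0 - hours_down_per_year / HOURS_PER_YEAR


def nines_from_availability(availability: float) -> float:
    """Number of "nines", defined as -log10(Unavailability)."""
    return -math.log10(1.0 - availability)


def downtime_from_nines(nines: float) -> float:
    """Average hours of downtime per year for a given number of nines."""
    return HOURS_PER_YEAR * 10.0 ** (-nines)


car = availability_from_downtime(6.0)        # the motor car: 6 hours down per year
print(f"car availability = {car:.9f}")       # 0.999315537
print(f"car nines        = {nines_from_availability(car):.4f}")  # a bit greater than 3
print(f"bus downtime     = {downtime_from_nines(3.0):.3f} hours per year")
```

The car's number of nines comes out a little above 3, which is why the car beats the three-nines bus in this example.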
Technically speaking, the terms “year,” “month,” “week,” “day,” “hour,” and “minute” are not Scientific units of time. The Scientific time unit is the second, the definition of which you may read about in Wikipedia. Those other units, which we use every day, are defined legally rather than scientifically. Every now and again, standards authorities around the world agree to add a leap-second into our lives, making the minute, hour, and so on vary by plus one second. From time to time some administrations add or subtract an hour so we can enjoy the sunshine better. Hence “one day” may be 23, 24, or 25 hours. Of course the number of days per month is not constant, and even the number of days in a year varies. For engineering purposes, when I refer to one minute, I mean 60 seconds, not 60 or 61. One hour is 60 minutes and a day is 24 hours. Now comes the tricky bit. What is a month? Well I just shut my eyes and call it 30.5 days on average. For one year, I use 8766 hours which is 365.25 days. You may consider this a bit sloppy, especially if you are a Mathematician or Scientist, in which case I invite you to consider the best achievable accuracy of measurement of an Average Time to do Anything in Particular (see A Note About Accuracy).
The Calculator Sheet
The main sheet in the workbook is the Calculator, so before leaving the Availability Units sheet, press the button labelled “Transfer to Calculator”. You should see this:
The Calculator is triangular in shape, based on a simple mnemonic. The three quantities, Average Downtime per year, or DT; Failure Rate in failures per year, or FR; and Mean Service Restoration Time (MSRT), in hours, are related by the simple equation:
DT = FR × MSRT
If you are given any two quantities, the third is easily calculated. For example, if you cover the top of the triangle you will see the equation for DT as a function of the other two, like this:
But what if it is MSRT that we consider to be “unknown”? Just cover the unknown and you will be reminded of the right equation. In this case
MSRT = DT / FR
Which can be seen here:
Likewise, covering the “FR” corner of the triangle, we see the pattern:
FR = DT / MSRT
From the Availability Units page we transferred in the value of DT as 6 hours per year. However, the three values are out of kilter. To fix that, we need to press the button corresponding to the “unknown quantity”. Let's say it is MSRT that is unknown, so click on the button “MSRT =”. You should be expecting the value 3 hours to appear – hopefully it does. Now the three quantities are “in kilter,” and the red warning has disappeared.
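The “cover the unknown” rule can be written as a small helper that fills in whichever quantity is missing. This is a sketch only; `solve_triangle` is a hypothetical name, and the Failure Rate of 2 per year is an assumption inferred from the expected answer of 3 hours (DT = 6 hours per year divided by MSRT = 3 hours).

```python
def solve_triangle(dt=None, fr=None, msrt=None):
    """Given any two of DT (hours/year), FR (failures/year) and MSRT (hours),
    return all three as a tuple (dt, fr, msrt) using DT = FR x MSRT."""
    missing = [name for name, value in (("dt", dt), ("fr", fr), ("msrt", msrt))
               if value is None]
    if len(missing) != 1:
        raise ValueError("supply exactly two of dt, fr, msrt")
    if dt is None:
        dt = fr * msrt      # cover the top of the triangle
    elif fr is None:
        fr = dt / msrt      # cover the FR corner
    else:
        msrt = dt / fr      # cover the MSRT corner
    return dt, fr, msrt


# Assumed values: DT = 6 hours/year (the car), FR = 2 failures/year.
print(solve_triangle(dt=6.0, fr=2.0))   # (6.0, 2.0, 3.0)
```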
What if we have two similar motor cars in parallel? Many households have two or more cars. Of course, in the case of scheduled maintenance we could make sure one car was always available. However, that is not the situation analysed here. It is essential to note that we assume all failures are random, and all restoration times are random also. In fact we assume that both are (Negative) Exponential Random Variables with some known Mean Value. Furthermore we will assume that all components are statistically independent. If the first car is down that tells us nothing about the other car – it may be up or down at random, just as it is when the first car is up instead of down.
To estimate how two cars would work for us, on the Calculator sheet, move down to the Parallel Work Area. It looks like this:
To load it up with two copies of our car, click on “From Calculator” next to “Component 1”, then click on “From Calculator” again next to “Component 2”. Don't go crazy and click again or you could have three cars in the Parallel work area – leave Component 3 blank. Now click on “To Calculator” to return the two-car-combination as a single component in the Calculator. We can see the MSRT is now 1.5 hours – a nice convenient figure – but why is it precisely half the value for a single car? Also, it turns out the Failure Rate is now 0.002737851 (again – don't worry about the excessive precision), but is that good or bad? Clearly it is an improvement, but by how much? Click on “Transfer to Units” above the Failure Rate to convert into more meaningful units. The answer now appears in several different units – the one I can relate to is a mean time to failure of 365.25 years. So with two random, independent, and identical cars we have a situation of no car available for an average of one and a half hours every 365 years or so. Of course, cars don't last 365 years, but if there were 365 identical instances of the same situation, we would expect about one outage per year across all 365 instances.
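The two-car figures can be reproduced with the usual textbook approximations for independent components with Negative Exponential failure and restoration times: unavailabilities multiply, and restoration rates (1/MSRT) add. The sketch below is an assumption about what the workbook computes, since its macros are not shown here; the `Component` class and `parallel` function are names invented for this example, and the single-car Failure Rate of 2 per year is inferred from DT = 6 and MSRT = 3.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 8766.0


@dataclass
class Component:
    fr: float    # failure rate, failures per year
    msrt: float  # mean service restoration time, hours

    @property
    def dt(self) -> float:
        """Average downtime in hours per year (DT = FR x MSRT)."""
        return self.fr * self.msrt


def parallel(a: Component, b: Component) -> Component:
    """Combine two independent components in parallel (assumed formulas)."""
    # The network is down only while both components are down.
    unavailability = (a.dt / HOURS_PER_YEAR) * (b.dt / HOURS_PER_YEAR)
    dt = unavailability * HOURS_PER_YEAR
    # The joint outage ends as soon as either restoration completes.
    msrt = 1.0 / (1.0 / a.msrt + 1.0 / b.msrt)
    return Component(fr=dt / msrt, msrt=msrt)


car = Component(fr=2.0, msrt=3.0)   # assumed single-car values
two_cars = parallel(car, car)
print(f"MSRT = {two_cars.msrt} hours")
print(f"FR   = {two_cars.fr:.9f} failures per year")
print(f"mean time to failure = {1.0 / two_cars.fr:.2f} years")
```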
But what if you need to drive to the bus-stop and then catch a bus (with three nines availability, and assume two hours MSRT)? You have two cars to choose from, but now the bus service is in series with the two-car-combo. Go back to the calculator, and transfer the two-car-combo into the Series Work Area to the right of the Calculator (on the same sheet). Click on the first “From Calculator” button. It should look like this – showing the numbers that refer to the two-car-combo:
Now return to the main part of the Calculator and enter the parameters of the bus service: 3 nines availability (I recommend using the Availability Units Sheet to convert 3 nines to average Downtime per year of 8.766 hours), and 2 hours MSRT (assumption). Press the button to calculate the unknown Failure Rate. I get 4.383 failures per year (one every eighty three and a third days, on average – pretty awful). Transfer the answer into the Series Work Area as the second component, then transfer the result back to the main part of the Calculator by pressing “To Calculator” (the process is analogous to the one for the Parallel Work Area). The final result is that by using the two-car-combo to get to the bus stop, then relying on the bus to get the rest of the way, the failure rate is one failure per 83.28 days, and the MSRT is 119.98 minutes or one hour fifty-nine minutes and fifty-nine seconds approximately – just about exactly the same as the bus service on its own! You experience almost exactly the same availability service as someone living right at the bus stop – why is it so?
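The series result can be checked with the common approximation that failure rates add and downtimes add (overlapping outages are ignored because they are so rare). As before, this is a sketch under stated assumptions rather than the workbook's actual macro; `series` is a hypothetical name, and the inputs are the figures quoted above.

```python
DAYS_PER_YEAR = 365.25


def series(fr1: float, msrt1: float, fr2: float, msrt2: float):
    """Combine two independent components in series.

    Returns (FR in failures per year, MSRT in hours). Assumed approximation:
    an outage of either component is an outage of the network, so failure
    rates add and average downtimes add.
    """
    fr = fr1 + fr2
    dt = fr1 * msrt1 + fr2 * msrt2
    return fr, dt / fr


# Two-car combination: FR = 0.002737851 per year, MSRT = 1.5 hours.
# Bus service:         FR = 4.383 per year,       MSRT = 2 hours.
fr, msrt = series(0.002737851, 1.5, 4.383, 2.0)
print(f"one failure per {DAYS_PER_YEAR / fr:.2f} days")
print(f"MSRT = {msrt * 60:.2f} minutes")
```

The bus dominates both sums, which is why the combined figures are almost indistinguishable from those of the bus service on its own.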
Of course the calculations performed are quite simple – you can do them by hand if you prefer, but I find the spreadsheet quicker than manual calculation after a little practice. The formulas used depend on the assumptions of:
Understanding those limitations, following are the formulas used for series and parallel combinations.
Series Combinations
If two components are in series, then an outage of either one implies an outage of the network. We assume that:
Parallel Combinations
If two components are in parallel, then an outage of the network occurs only when both components are down concurrently. We assume that:
Quick Quiz to see if you have mastered the concept: If you start with a single component, with known FR, DT, and MSRT, and then add another component in parallel, two of the three parameters must get better, but the third one may get better, or worse, or stay the same. Which parameters are the ones that must get better?
A Note About Accuracy
“Precision” refers to the number of decimal digits in our answers. “Accuracy” refers to how far the answers lie from the truth. Of course, showing a precision of six or seven significant figures in the answers is not meant as a claim of that degree of accuracy. The precision is used to allow follow-on calculations, such as subtracting two quantities that are very close.
What is a Component
What is a component? Furthermore, what does it mean to consider series or parallel combinations of components? It is a mistake to limit our thinking to one component corresponding to one physical entity in a network. Rather, one component represents one failure mode of our system. To illustrate the point, consider the example of a radio tower on a remote mountaintop. Let's assume we have arranged for a commercial power supply, which is not perfectly reliable. We may choose to have battery backup power. Then we may consider all the equipment as “electronics”. Finally, let's agree that the radio tower is at risk due to natural disturbances. For the sake of illustration we will say that it may fail due to “lightning”.
| Accident | 0.0010 per year |
| Cancer | 0.0020 per year |
| Heart Disease | 0.0025 per year |
| Other Diseases | 0.0040 per year |
| All Other Causes | 0.0100 per year |
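If each row of the table is treated as one independent failure mode, and the failure modes are taken to act in series (an assumption made here for illustration), then the overall failure rate is simply the sum of the individual rates. A minimal sketch, with variable names chosen for this example:

```python
# Failure rates per year for each failure mode, copied from the table above.
failure_modes = {
    "Accident": 0.0010,
    "Cancer": 0.0020,
    "Heart Disease": 0.0025,
    "Other Diseases": 0.0040,
    "All Other Causes": 0.0100,
}

# Assumption: independent failure modes in series, so the rates add.
total_rate = sum(failure_modes.values())
print(f"combined failure rate = {total_rate:.4f} per year")
print(f"mean time to failure  = {1.0 / total_rate:.1f} years")
```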
Paradoxes Explained
If you work through the problems you will find the answers to two of the paradoxes above.
Two Systems with the Same Availability
Two different systems with the same numerical value of Availability may have such radically different properties that some customers would reject the first and accept the second, whilst other customers would reject the second and insist on the first.
Of course the explanation lies in the fact that the two systems have different Failure Rates and different Mean Service Restoration Times. For example, assume System 1 has a Failure Rate of one per year and an MSRT of 30 seconds. System 2 has a Failure Rate of one every 2,880 years and an MSRT of 24 hours. Use the Calculator to check their Availabilities (they should be equal). So, which system is better? Of course, it depends.
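If you would rather check the arithmetic outside the workbook, the short sketch below computes the unavailability of both systems from DT = FR × MSRT, using the 8766-hour year. The function name is invented for this example.

```python
import math

SECONDS_PER_YEAR = 8766 * 3600  # 365.25 days, as elsewhere in this tutorial


def unavailability(failures_per_year: float, msrt_seconds: float) -> float:
    """Fraction of time the system is down: FR x MSRT expressed per year."""
    return failures_per_year * msrt_seconds / SECONDS_PER_YEAR


system_1 = unavailability(1.0, 30.0)                  # one failure a year, 30 seconds each
system_2 = unavailability(1.0 / 2880.0, 24 * 3600.0)  # one failure per 2,880 years, 24 hours each

print(f"System 1 unavailability = {system_1:.3e}")
print(f"System 2 unavailability = {system_2:.3e}")
print("equal:", math.isclose(system_1, system_2))
```

Both systems average 30 seconds of downtime per year, yet they distribute that downtime in completely different ways.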
Suppose you are operating a commercial airline and all of your planes behave like System 1. Once per year every plane fails in mid air for 30 seconds. The passengers experience a full half minute of virtual weightlessness. If you have a few hundred planes in your fleet, you can expect several such incidents per week. Wouldn't you prefer a plane that lasts 2,880 years, on average?
On the other hand, suppose your company operates 2,880 elevators nationwide. Suppose that once per year each elevator experiences a software glitch, and stays locked in place for 30 seconds – irritating, indeed. Compare that with a set of elevators similar to System 2. About once per year one of your 2,880 elevators remains locked in place for 24 hours. Your passengers in the elevator have to spend a full day in a prison of your making. TV cameras and news reporters arrive at the scene to record the drama. A hole has to be made to pass in food and water. You are famous, but not in a good way. Surely the frequent irritation of 30 second outages would be better?
A Fault That Gets “Younger” the Longer it “Lives”
The longer we live, the older we get (of course), meaning that our expected remaining time-to-live reduces. But it is possible that the longer a fault has been resisting attempts to fix it, the “younger” it gets, in the sense that its expected remaining time-to-repair increases.
The Negative Exponential random variable has the very special property that its mean remaining time to persist is constant – it is the only continuous distribution with such a property. In fact, no matter how long it has been going, it still has the exact same distribution of remaining time to go as it had when it commenced at time zero. Think of a single radioactive atom with a half-life of one day. It has probability one half of decaying spontaneously within one day. But if it happens to be a “lucky” atom which survives one day, then it has a probability of one half of decaying during the following day. And so it goes on. No matter how long it survives, it has the same probability one half of decaying in the following day, and hence the same mean-time-remaining-to-live that it started with. It has no memory – it does not age – it is neither “lucky” nor “unlucky” – suddenly and unpredictably it simply passes to another state. That is the way with Negative Exponential Random Variables. I wouldn't like to have a Negative Exponential lifetime, even though it means never ageing.
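The radioactive atom makes a convenient numerical check of the “no memory” property. The sketch below (function names are illustrative only) computes the probability that the atom decays during the next day, given that it has already survived for some number of days.

```python
def survival(t_days: float, half_life_days: float = 1.0) -> float:
    """Probability that the atom is still undecayed after t_days."""
    return 0.5 ** (t_days / half_life_days)


def prob_decay_next_day(age_days: float) -> float:
    """P(decays within the next day | it has survived age_days so far)."""
    return 1.0 - survival(age_days + 1.0) / survival(age_days)


for age in (0, 1, 10, 100):
    # The answer is one half at every age: the atom does not "remember" its past.
    print(age, prob_decay_next_day(age))
```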
A simple example where a random variable has increasing expected remaining time to live would be a “Mixture Distribution”. For example, suppose you are in charge of a communication link between Earth and Mars. Let's assume all faults have Negative Exponential lifetimes, but there are two types of fault: software faults with a one-hour MSRT, and hardware faults with a one-day MSRT – both random with Negative Exponential Distributions. Suddenly the channel goes silent. Everything is good to go here on Earth, so the fault must be on Mars. But what sort is it?
Of course, the channel you operate is the only channel between the two planets (let's assume), so there seems to be no way of knowing. Let's assume the different types of fault are equally likely, so the time to restore service has an overall average of 0.5 times one hour plus 0.5 times one day, or 12.5 hours, without knowing which type it is.
However, we do get some information as time goes by. After three hours, it is more likely that the fault is hardware rather than software. Only a very small fraction of software faults persist longer than three hours (even taking into account the propagation time). But the majority of hardware faults last longer than three hours. If it is still not fixed after 24 hours we can just about rule out software problems.
Hence after 24 hours we are almost certain of a hardware fault, so the expected remaining time to restore service is 24 hours. Whether we get to 36, or 48, or 72 hours without any success, the expected remaining time to restore service is still 24 hours in every case! In fact, the expected remaining time to restore service transitions smoothly from about 12.5 hours initially up to a limit of 24 hours, the longer the fault persists. If our lifetimes worked in similar fashion we might say: “the more we age, the younger we get”, in the sense that the more years we have behind us, the more we can expect in front of us.
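The smooth transition from 12.5 hours to 24 hours can be computed directly. After the fault has persisted for some elapsed time, each fault type is re-weighted by its probability of lasting that long, and each type still has its original mean remaining time because it is Negative Exponential. The sketch below follows that reasoning; the function name and default arguments are chosen for this example.

```python
import math


def expected_remaining_hours(elapsed: float,
                             means=(1.0, 24.0),     # software, hardware MSRT in hours
                             weights=(0.5, 0.5)) -> float:
    """Expected remaining time to restore service, given the fault has
    already lasted `elapsed` hours, for a mixture of exponential faults."""
    # Weight of each fault type after `elapsed` hours is proportional to
    # (prior probability) x (probability that such a fault lasts this long).
    surviving = [w * math.exp(-elapsed / m) for w, m in zip(weights, means)]
    total = sum(surviving)
    # Each exponential fault is memoryless, so its mean remaining time is unchanged.
    return sum(s * m for s, m in zip(surviving, means)) / total


for hours in (0, 1, 3, 12, 24, 48, 72):
    print(f"after {hours:2d} hours: {expected_remaining_hours(hours):.2f} hours remaining")
```

At time zero the function returns 12.5 hours, and it climbs towards 24 hours as the software explanation becomes less and less plausible.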
One More for the Road
Here is a final puzzle for you. Suppose you have not merely two cars, but three cars. If you go back to the Calculator and add a third car in parallel with the other two you should find a failure rate of one outage, with an average duration of one hour, once every 3,118,535,181 years, approximately. But do you really expect your three cars to last three billion years? Scientists assure us that within that time the Earth's magnetic field will surely fail (slight exaggeration for emphasis). Surely you don't believe your three cars will outlast the Earth's magnetic field? Even if your cars are still functioning, will there still be a bus service to convey you to your office?
It makes barely more sense to talk about 3.1 billion identical instances of three-car-combos. Yes, if the cars can survive one year, then we have the equivalent of 9.3 billion “car-years” in just twelve months. Surely the majority of the cars (and the Earth's magnetic field) can be relied on for that relatively trivial time. Yes, in theory, if 3.1 billion families all were faced with the same situation we would expect, in theory, about one outage per year.
In reality, and I think this is the clue to the answer, other factors would arise and swamp the purely random failures we have assumed in our lovely mathematical model. In the case of a single instance of a three-car-combo considered over 3 billion years, the failure rate of the components will change with time – presumably increasing asymptotically to 1 as the centuries go by. Even without considering the aging of components, riots, political upheavals, earthquakes, extreme weather events, and many other factors could come into play. By analysing our three-car-combo in isolation we have tacitly assumed that “nothing else can go wrong”. In general, Mother Nature (or is it “Ms Nature”) rarely has the grace to follow any of my models – no matter how elegant they are.
So what is a reasonable interpretation of your extreme numerical result? I invite you to ponder that, as I will.
A related question: is the improvement due to the third car sufficient to justify, economically, the cost of ownership of another car (for three billion years)? For that matter, was the upgrade from a one-car family to a two-car family worth it in terms of improved car availability? Of course, availability is not the main reason why a family would invest in a second car. But what if you are operating a satellite service? Is the second hundred-million-dollar satellite good value for money as a backup for the first? This thought leads on to the topic of “Avail-o-nomics”.
I hope you enjoy exploring the Availability Calculator, now that you know how to use it.
Avail-o-nomics
Having mastered the basic calculations of Availability and Reliability, the time has come to consider whether it is all worth it. Click on the link below to proceed to the Avail-o-nomics section of the ARC web site.
To Avail-o-nomics – Availability for Fun and Profit