- Forward


Reliability
An Introduction


Prof. David Bernstein
James Madison University

Computer Science Department
bernstdh@jmu.edu

Print

Definitions
Back SMYC Forward
  • Rugged:
    • Reliable - Low probability of failure under normal operating conditions
    • Robust - Able to operate under a wide variety of condtions
    • Safe - Able to minmize the damage resulting from failure
  • (Probabilistic) Reliability:
    • The probability that a system will perform its required functions (without failure) in a particular environment for a particular amount of time
Different Kinds of Systems
Back SMYC Forward
  • Non-Repairable:
    • The system can't be repaired after it fails
    • Our concern is with measures like the probability of a failure and the remaining life
  • Repairable:
    • The system can be repaired after it fails
    • Our concern is with measures like the probability that it won't fail, the time between failures, and the availability
Failure Patterns
Back SMYC Forward
  • Increasing Failure Rate:
    • Usually a result of deterioration
  • Decreasing Failure Rate:
    • Usually a result of "hardening" or "burn-in"
    • Usually temporary
  • Constant Failure Rate:
    • Usually a result of "stress"/"overload" (i.e., beyond normal operating conditions)
Differences between Hardware and Software
Back SMYC Forward
Hardware Software
Components can "burn in" and "wear out" Components do not "burn in" or "wear out"
Variation across "copies" No variation across "copies"
Mathematical Preliminaries
Back SMYC Forward
  • The Setting:
    • A system that we are considering over a time interval \([0,t]\)
  • Our Concern:
    • \(R\) - the reliability of the system over the interval
Some Notation
Back SMYC Forward
  • Component Performance:
    • \(X_i\) is 1 if the component performs adequately over the interval and 0 otherwise
  • System Performance:
    • \(\phi(X_1, X_2, \ldots, X_n)\) is 1 if the system performs adequately over the interval and 0 otherwise
Archetypical Systems
Back SMYC Forward
  • Series Systems:
    • The system fails if any of its components fails
    • \(\phi(X_1, X_2, \ldots, X_n) = \min\{X_1, X_2, \ldots, X_n\}\)
  • Parallel Systems:
    • The system fails if all of its components fails
    • \(\phi(X_1, X_2, \ldots, X_n) = \max\{X_1, X_2, \ldots, X_n\}\)
Reliability of Archetypical Systems
Back SMYC Forward
  • Series System:
    • \(R = P\{\min(X_1, X_2, \ldots, X_n) = 1\} = P\{X_1=1, X_2=1, \ldots, X_n=1\}\)
    • If the components are independent, \(R = P\{X_1=1\} \cdot P\{X_2=1\} \cdots \cdot P\{X_n=1\}\)
  • Parallel Systems:
    • \(R = P\{ \max\{X_1, X_2, \ldots, X_n\} = 1\} = 1 - P\{X_1=0, X_2=0, \ldots, X_n=0\}\)
    • If the components are independent, \(R = 1 - (1-P\{X_1=1\}) \cdot (1-P\{X_2=1\}) \cdots \cdot (1-P\{X_n=1\})\)
Examples Involving Archetypical Systems
Back SMYC Forward
  • Series System:
    • Suppose we have two networks that are connected by a single path consisting of four routers
    • Suppose further that the routers have probabilities of working of 0.9, 0.8, 0.9, and 0.7.
    • The reliability of the path working is \(R = 0.9 \cdot 0.8 \cdot 0.9 \cdot 0.7 = 0.45\)
  • Parallel System:
    • Suppose we have an electronic commerce site that has four servers all providing the same services
    • Suppose further that the servers have probabilities of working of 0.9, 0.8, 0.9, and 0.7.
    • The reliability of the site working is \(R = 1 - (1-0.9) \cdot (1-0.8) \cdot (1-0.9) \cdot (1-0.7) = 1 - (.1)(.2)(.1)(.3) = 1 - 0.0006 = 0.9994\)
\(k\) Out of \(n\) Systems
Back SMYC Forward
  • Defined:
    • The system operates adequately if any \(k\) out of \(n\) components operate adequately
  • An Implication:
    • \(\phi(X_1,X_2,\ldots,X_n) = \left\{ \begin{array}{l l} \text{1 if }\sum_{i=1}^{n}X_i \geq k \\ \text{0 if }\sum_{i=1}^{n}X_i \lt k \end{array} \right. \)
Reliability of \(k\) Out of \(n\) Systems
Back SMYC Forward
  • In General:
    • Finding the reliability for such systems is very complicated
  • An Important Special Case:
    • All of the components are independent and have the same probability of failure, \(p\). In this case, \(\sum_{i=1}^{k}X_i\) has a binomial distribution so:
    • \(R = \sum_{i=k}^{n} \left( \begin{array}{c}n \\ i\end{array} \right) p^i (1 - p)^{n-i} \)
  • Recall:
    • \(\left( \begin{array}{c}n \\ i\end{array} \right) = \frac{n!}{i!(n-i)!}\)
An Example of a \(k\) Out of \(n\) System
Back SMYC Forward
  • The System:
    • Suppose we have a disk array that can continue to work adequately if any 4 out of 8 disks is working adequately
    • Suppose further that each disk has a probability of 0.95 of working adequately
  • The Reliability:
    • \(R = \sum_{i=4}^{8} \left( \begin{array}{c}8 \\ i\end{array} \right) (0.95)^i (0.05)^{8-i} = 0.9999 \)
Reliability of Networks
Back SMYC Forward
  • More General Systems:
    • We would like to consider the reliability of systems that can be represented as graphs
  • An Example:
    • network_4node5arc
    • Traffic can move from A to D if 1 and 4 are operating, or if 2 and 5 are operating, or if 1 and 3 and 5 are operating
Reliability of Networks (cont.)
Back SMYC Forward
  • An Observation:
    • Since there are 5 edges and each can be working or not, there are \(2^5 = 32\) realizations to consider
  • The System Reliability:
    • Determine the probability of each realization
    • Add the probabilities of the realizations that contain a path
Reliability of Networks - A Better Way
Back SMYC Forward
  • Minimal Paths:
    • A minimal set of components that by functioning ensure the function of the system
  • Minimal Cuts:
    • A minimal set of components that by failing ensure the failure of the system
  • Finding the Reliability using Minmal Paths:
    • The system will operate if all of the components in at least one of the minimal paths operate
  • Finding the Reliability using Minmal Cuts:
    • The system will fail if and only if all the components in at least one of the minmal cuts fail
Dynamic Software Reliability Models
Back SMYC Forward
  • Objective:
    • Predict the failure rate as a function of the number of defects/faults in the system
    • Determine the amount of testing time (and subsequent debugging) required to reduce the failure rate to a given threshold
  • The Process being Modeled:
    • The software initially has an unknown number of defects/faults
    • It is tested for some amount of time during which failures occur
    • When a failure occurs the defect/fault is fixed (which is assumed to require no time for simplicity) and no new defects/faults are introduced
Dynamic Software Reliability Models (cont.)
Back SMYC Forward
  • Notation:
    • \(N(t)\) denotes the number of defects/faults at time \(t\)
    • \(\lambda(t)\) denotes the failure rate at time \(t\)
    • \(c\) denotes a constant
    • \(M(t) = N(0) - N(t)\) denotes the number of defects/faults found (and corrected) in the interval \([0, t]\)
  • An Important Caveat:
    • The accuracy of these models is not known
The Jelinski-Moranda Model
Back SMYC Forward
  • Assumptions:
    • The failure rate is described by a Poisson process that varies with time, specifically, \(\lambda(t) = c N(t)\)
    • The testing time between failures is exponentially distributed
  • The Model:
    • \(R(t) = e^{-\lambda(0) t} = e^{-cN(0) t}\)
    • Given a failure at time \(\tau\), the probability that the interval \([\tau, \tau+t]\) will be failure free is \(R(t|\tau) = e^{-cN(\tau) t}\)
  • Criticism:
    • It assumes "all faults are created equal" (i.e., all faults contribute equally to the failure rate)
The Jelinski-Moranda Model (cont.)
Back SMYC Forward
  • An Example:
    • \(N(0) = 40\)
    • \(c = 0.025\)
  • What We Want to Know:
    • The reliability (i.e., the probability that the system will not have a failure) in the first day of testing
    • The reliability (i.e., the probability that the system will not have a failure) in the first three days of testing
  • Applying the Model:
    • \(R(1) = e^{-cN(0) \cdot 1} = e^{-0.025 \cdot 40 \cdot 1} = 0.368\)
    • \(R(3) = e^{-cN(0) \cdot 3} = e^{-0.025 \cdot 40 \cdot 3} = 0.050\)
The Musa-Okumoto Model
Back SMYC Forward
  • Assumptions:
    • \(N(0) = \infty\)
    • \(\lambda(0)\) is known and finite
    • It is exponentially more difficult to find/cause failures over time, specifically \(\lambda(t) = \lambda(0) e^{-c E[M(t)]}\)
  • Some Math (You Don't Need to Know):
    • The Differential Equation:
    • \(\frac{dE[M(t)]}{dt} = \lambda(0) e^{-c E[M(t)]}\)
    • The Solution:
    • \(E[M(t)] = \frac{\ln(\lambda(0)ct + 1)}{c}\)
    • \(\lambda(t) = \frac{\lambda(0)}{\lambda(0)ct + 1}\)
    • So:
    • \(R(t) = e^{-\int_0^t \lambda(x)dx} = e^{-E[M(t)]}\)
    • \(R(t|\tau) = e^{-\int_{\tau}^{\tau+t} \lambda(x)dx} = e^{-E[M(\tau+t)]+E[M(\tau)]}\)
    Expand
  • The Result:
    • \(R(t) = (1 + \lambda(0) c t)^{-\frac{1}{c}}\)
    • \(R(t|\tau) = \left( 1 + \frac{\lambda(0)ct}{1+\lambda(0)c\tau} \right)^{-\frac{1}{c}}\)
The Musa-Okumoto Model (cont.)
Back SMYC Forward
  • An Example:
    • \(\lambda(0) = 1\)
    • \(c = 0.025\)
  • What We Want to Know:
    • The reliability (i.e., the probability that the system will not have a failure) in the first day of testing
    • The reliability (i.e., the probability that the system will not have a failure) in the first three days of testing
  • Applying the Model:
    • \(R(1) = (1 + \lambda(0) c \cdot 1)^{-\frac{1}{c}} = (1 + 1 \cdot 0.025 \cdot 1)^{-\frac{1}{0.025}} = 0.372\)
    • \(R(3) = (1 + \lambda(0) c \cdot 3)^{-\frac{1}{c}} = (1 + 1 \cdot 0.025 \cdot 3)^{-\frac{1}{0.025}} = 0.055\)
There's Always More to Learn
Back -