For a complex software solution, you usually have to meet the customer's reliability and availability requirements as defined in the SLA. For a monolithic appliance, these figures can be determined trivially, but most real-world applications require multiple physical nodes, VMs, or machines.
Extrapolating the reliability and availability figures for a complex multi-tier software system can be a challenge for an IT practitioner who is not familiar with reliability engineering. So, let's dive right into it.
Let's first define some key terms:
- MTTF: Mean Time To Failure, aka 'average time until the failure of a non-repairable component'.
- MTBF: Mean Time Between Failures, aka 'average time between two failures of a repairable component'.
- MTTR: Mean Time To Repair, aka 'average time to repair a component'.
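The failure rate $\lambda$ is the reciprocal of the MTBF (or of the MTTF for a non-repairable component):
$\lambda = \frac{1}{MTBF}$ or $\lambda = \frac{1}{MTTF}$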
The reliability function is defined as,
$R(t) = e^{-\lambda t}$
and the (steady-state) availability as,
$A(t) = \frac{MTBF}{MTBF + MTTR}$
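To make the definitions concrete, here is a minimal Python sketch of the three formulas above. The function names and the sample figures (an MTBF of 1,000 hours and an MTTR of 10 hours) are hypothetical and chosen only for illustration.

```python
import math


def failure_rate(mtbf_hours: float) -> float:
    """Failure rate (lambda) as the reciprocal of MTBF (or MTTF)."""
    return 1.0 / mtbf_hours


def reliability(rate_per_hour: float, t_hours: float) -> float:
    """R(t) = exp(-lambda * t): probability of running t hours without failure."""
    return math.exp(-rate_per_hour * t_hours)


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A = MTBF / (MTBF + MTTR): long-run fraction of time the component is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical component: fails every 1,000 h on average, takes 10 h to repair.
lam = failure_rate(1000.0)
r_100h = reliability(lam, 100.0)
a = availability(1000.0, 10.0)

print(f"failure rate      : {lam:.6f} per hour")
print(f"R(100 h)          : {r_100h:.4f}")
print(f"availability      : {a:.4%}")
```

Note that reliability depends on the mission time `t_hours`, while availability only depends on the ratio between uptime and repair time.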
Now, for a multi-component system with components $C_1$, $C_2$, $C_3$, $C_4$ having failure rates $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ respectively, the effective failure rate is calculated as follows:
- For series-connected components, the effective failure rate is the sum of the failure rates of the individual components:
$\lambda = \sum_{i=1}^{n} \lambda_i$
- For parallel-connected components, the effective MTTF ($\frac{1}{\lambda}$) is the sum of the reciprocals of the component failure rates:
$\frac{1}{\lambda} = \sum_{i=1}^{n} \frac{1}{\lambda_i}$

To calculate reliability and availability metrics:
- For series-connected components,
$R(t) = \prod_{i=1}^{n} R_i(t)$
$A(t) = \prod_{i=1}^{n} A_i(t)$
- For parallel-connected components,
$R(t) = 1 - \prod_{i=1}^{n} (1 - R_i(t))$
$A(t) = 1 - \prod_{i=1}^{n} (1 - A_i(t))$
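The series and parallel formulas above can be sketched as two small helpers that work for either reliability or availability values. The three-tier layout below (one load balancer, two redundant app servers, one database) and its availability figures are assumptions made up for this example, not measurements of any real system.

```python
import math
from typing import Iterable


def series(values: Iterable[float]) -> float:
    """Series model: every component must be up, so the values multiply."""
    return math.prod(values)


def parallel(values: Iterable[float]) -> float:
    """Parallel model: the group is down only if every component is down."""
    return 1.0 - math.prod(1.0 - v for v in values)


# Hypothetical three-tier system (figures are illustrative assumptions).
load_balancer = 0.999
app_tier = parallel([0.99, 0.99])   # two redundant app servers
database = 0.995

# The tiers depend on each other, so they combine in series.
system = series([load_balancer, app_tier, database])

print(f"app tier availability: {app_tier:.6f}")
print(f"system availability  : {system:.6f}")
```

A multi-tier system is evaluated from the inside out: collapse each redundant group with `parallel` first, then chain the resulting tiers with `series`.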
Thus, the reliability and availability of a series-connected network of components is lower than that of any individual component. For example, two components with 99% availability connected in series yield 98.01% ($= 0.99 \times 0.99$) availability. The converse is true for the parallel model. If one component has 99% availability, then two components combined in parallel yield 99.99% ($= 1-(1-0.99) \times (1-0.99)$) availability, and three components in parallel yield 99.9999% ($= 1-(1-0.99)^3$) availability. Adding redundant components to the network further increases reliability and availability.
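As a quick check of how redundancy adds "nines", the following sketch evaluates the parallel formula for one to four identical components. The helper name `redundant_availability` is hypothetical, and the 99% per-component figure is the one used in the example above.

```python
def redundant_availability(a: float, n: int) -> float:
    """Availability of n identical components in parallel: 1 - (1 - a)^n."""
    return 1.0 - (1.0 - a) ** n


# Each extra redundant 99% component multiplies the unavailability by 0.01.
for n in range(1, 5):
    pct = redundant_availability(0.99, n) * 100
    print(f"{n} component(s) in parallel: {pct:.6f}%")
```

Each additional identical component multiplies the remaining unavailability by $(1 - a)$, so with 99% components every extra node adds two nines.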