Reliability and Availability Metrics and Calculations

For a complex software solution, you usually have to meet the customer's reliability and availability requirements as defined in the SLA. For a monolithic appliance, these figures are straightforward to determine, but most real-world applications span multiple physical nodes, VMs, or machines.

Extrapolating the reliability and availability figures for a complex multi-tier software system can be a challenge for an IT practitioner who is not familiar with reliability engineering. So, let's dive right into it.

Let's first define some key terms:

MTTF: Mean Time To Failure aka 'Average time until a non-repairable component fails'.
MTBF: Mean Time Between Failures aka 'Average time between two failures of a repairable component'.
MTTR: Mean Time To Repair aka 'Average time to repair a component'.

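Now, let's establish the concept of failure rate (λ). Assuming the failure rate is constant over time, it is the reciprocal of MTBF for a repairable component,

$\lambda = \frac{1}{MTBF}$

and the reciprocal of MTTF for a non-repairable one,

$\lambda = \frac{1}{MTTF}$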

The reliability function is defined as,

$R(t) = e^{-\lambda t}$
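As a minimal sketch of the two formulas above, the following Python computes a failure rate from an MTBF and evaluates the reliability function. The function names and the 10,000-hour MTBF are hypothetical values chosen for illustration, and a constant failure rate is assumed.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days, ignoring leap years


def failure_rate(mtbf_hours: float) -> float:
    """Constant failure rate (failures per hour) from an MTBF or MTTF in hours."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return 1.0 / mtbf_hours


def reliability(rate_per_hour: float, t_hours: float) -> float:
    """R(t) = exp(-lambda * t): probability of running t hours without a failure."""
    return math.exp(-rate_per_hour * t_hours)


lam = failure_rate(10_000)               # hypothetical MTBF of 10,000 hours
r_year = reliability(lam, HOURS_PER_YEAR)
print(f"lambda = {lam:.6f} per hour, R(1 year) = {r_year:.4f}")
```

Note that at $t = MTBF$ the reliability is $e^{-1}$, roughly 37%, so a component is more likely than not to have failed by the time its MTBF has elapsed.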

And availability is defined as the ratio of uptime to total-time (uptime + time to repair).

$A(t) = \frac{MTBF}{(MTBF + MTTR)}$
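The availability ratio can be sketched in a few lines of Python. The MTBF and MTTR figures below are hypothetical, and the yearly downtime helper is an illustrative addition that simply multiplies the unavailable fraction by the hours in a year.

```python
HOURS_PER_YEAR = 8760  # 365 days, ignoring leap years


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A = MTBF / (MTBF + MTTR): fraction of total time the component is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


a = availability(mtbf_hours=1000, mttr_hours=1)  # hypothetical figures
print(f"availability = {a:.4%}")
print(f"downtime     = {downtime_hours_per_year(a):.2f} hours per year")
```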

Now, consider a multi-component system made up of $C_1$, $C_2$, $C_3$, $C_4$ with failure rates $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ respectively. The effective failure rate is calculated as follows.

For series-connected components, the effective failure rate is the sum of the failure rates of the individual components:

$\lambda = \sum_{i=1}^{n} \lambda_i $

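For parallel-connected components, the system MTTF is the sum of the component MTTFs, that is, the reciprocal of the effective failure rate is the sum of the reciprocals of the component failure rates. Strictly, this holds for standby redundancy, where a spare takes over only after the active component has failed; when all components run at the same time, the system MTTF is shorter than this sum.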

$\frac{1}{\lambda} = \sum_{i=1}^{n} \frac{1}{\lambda_i} $
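The two effective failure rate formulas above can be sketched as follows. The function names and the example rates are hypothetical, and the parallel case implements the reciprocal-sum formula exactly as stated in the text.

```python
from typing import Sequence


def series_failure_rate(rates: Sequence[float]) -> float:
    """Effective failure rate of series-connected components: sum of the rates."""
    return sum(rates)


def parallel_failure_rate(rates: Sequence[float]) -> float:
    """Effective failure rate per the reciprocal-sum formula in the text.

    The system MTTF is taken as the sum of the component MTTFs (1 / rate),
    and the effective rate is the reciprocal of that sum.
    """
    if any(rate <= 0 for rate in rates):
        raise ValueError("failure rates must be positive")
    system_mttf = sum(1.0 / rate for rate in rates)
    return 1.0 / system_mttf


rates = [0.001, 0.002]  # hypothetical failures per hour (MTTF 1000 h and 500 h)
print(f"series:   {series_failure_rate(rates):.6f} per hour")
print(f"parallel: {parallel_failure_rate(rates):.6f} per hour")
```

With these inputs the series rate is higher than either component's rate, while the parallel rate is lower than both, which matches the intuition that a chain is weaker than its links and redundancy is stronger.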

The reliability and availability of the combined system are calculated as follows.

For series connected components,

$ R(t) = \prod_ {i=1}^{n} R_i(t) $

$ A(t) = \prod_ {i=1}^{n} A_i(t) $

For parallel connected components,

$ R(t) = 1- \prod_ {i=1}^{n} (1- R_i(t)) $

$ A(t) = 1- \prod_ {i=1}^{n} (1- A_i(t)) $
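As a rough illustration of the series and parallel reliability formulas, the sketch below evaluates both at a given time from component failure rates. It assumes constant failure rates and independent failures; the function names, rates, and time horizon are hypothetical.

```python
import math
from typing import Sequence


def series_reliability(rates: Sequence[float], t_hours: float) -> float:
    """R(t) = product of R_i(t): every component must survive."""
    return math.prod(math.exp(-rate * t_hours) for rate in rates)


def parallel_reliability(rates: Sequence[float], t_hours: float) -> float:
    """R(t) = 1 - product of (1 - R_i(t)): at least one component must survive."""
    return 1.0 - math.prod(1.0 - math.exp(-rate * t_hours) for rate in rates)


rates = [1e-4, 2e-4]  # hypothetical failures per hour
t = 1000.0            # hypothetical mission time in hours

print(f"series   R({t:.0f} h) = {series_reliability(rates, t):.4f}")
print(f"parallel R({t:.0f} h) = {parallel_reliability(rates, t):.4f}")
```

For the series case the product of exponentials equals $e^{-(\lambda_1 + \lambda_2) t}$, which is consistent with the series failure rate being the sum of the component rates.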

Thus, the reliability and availability of a series-connected network of components are lower than those of the individual components. For example, two components with 99% availability connected in series yield 98.01% ($= 0.99 \times 0.99$) availability. The converse is true for the parallel model. If one component has 99% availability, then two such components in parallel yield 99.99% ($= 1-(1-0.99) \times (1-0.99)$) availability, and four in parallel yield 99.999999% ($= 1-(1-0.99)^4$) availability. Adding redundant components to the network further increases reliability and availability.
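To tie this back to a multi-tier system, here is a minimal sketch that combines the series and parallel availability formulas for a hypothetical three-tier deployment: one load balancer, two redundant application servers, and one database. The topology and all availability figures are assumptions made up for illustration, and failures are assumed to be independent.

```python
import math
from typing import Sequence


def series(avails: Sequence[float]) -> float:
    """Availability of components in series: all must be up."""
    return math.prod(avails)


def parallel(avails: Sequence[float]) -> float:
    """Availability of components in parallel: at least one must be up."""
    return 1.0 - math.prod(1.0 - a for a in avails)


# Hypothetical component availabilities
load_balancer = 0.999
app_tier = parallel([0.99, 0.99])  # two redundant application servers
database = 0.999

# The tiers depend on each other, so they combine in series
system = series([load_balancer, app_tier, database])

print(f"app tier: {app_tier:.4%}")
print(f"system:   {system:.4%}")
```

Even with a highly available application tier, the overall figure stays below that of the single load balancer and database, which suggests that redundancy is best spent on the least available tier in the series chain.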
