Reliability and Availability Metrics and Calculations

For a complex software solution, you usually have to meet the customer's reliability and availability requirements as defined in the SLA. For a monolithic appliance, these figures can be determined trivially, but most real-world applications require multiple physical nodes, VMs, or machines.

Extrapolating the reliability and availability figures for a complex multi-tier software system can be a challenge for an IT practitioner who is not familiar with reliability engineering. So, let's dive right into it.

Let's first define some key terms:

  1. MTTF: Mean Time To Failure aka 'average time until a non-repairable component fails'.
  2. MTBF: Mean Time Between Failures aka 'average time between two failures of a repairable component'.
  3. MTTR: Mean Time To Repair aka 'average time to repair a component'.

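Now, let's establish the concept of failure rate ($\lambda$). Assuming the failure rate is constant over time, for a repairable component,
$\lambda = \frac{1}{MTBF}$
or, for a non-repairable component,
$\lambda = \frac{1}{MTTF}$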

The reliability function, i.e. the probability that a component runs without failure up to time $t$, is defined as,

$R(t) = e^{-\lambda t}$
And availability is defined as the ratio of uptime to total time (uptime + time to repair),
$A(t) = \frac{MTBF}{(MTBF + MTTR)}$
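The three definitions above translate directly into a few lines of Python. This is a minimal sketch, assuming a constant failure rate and hours as the time unit; the function names and the sample figures (an MTBF of 1,000 hours and an MTTR of 2 hours) are made up for illustration.

```python
import math


def failure_rate(mtbf_hours: float) -> float:
    """Failure rate (lambda) as the reciprocal of MTBF (or MTTF)."""
    return 1.0 / mtbf_hours


def reliability(mtbf_hours: float, t_hours: float) -> float:
    """R(t) = exp(-lambda * t): probability of running t hours without a failure."""
    return math.exp(-failure_rate(mtbf_hours) * t_hours)


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability as uptime over total time: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    mtbf, mttr = 1000.0, 2.0  # hypothetical figures, in hours
    print(f"lambda       = {failure_rate(mtbf):.6f} failures/hour")
    print(f"R(100 h)     = {reliability(mtbf, 100.0):.4f}")
    print(f"availability = {availability(mtbf, mttr):.4%}")
```

One consequence worth noticing: at $t = MTBF$ the reliability is $e^{-1} \approx 0.37$, so a component has only about a 37% chance of reaching its own MTBF without a failure.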

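Now, for a multi-component system of $n$ components, e.g. $C_1$, $C_2$, $C_3$, $C_4$ having failure rates $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ respectively ($n = 4$),

The effective failure rate is calculated as follows,

  • For series-connected components, the effective failure rate is the sum of the failure rates of the components,
$\lambda = \sum_{i=1}^{n} \lambda_i $
  • For parallel-connected components, the MTTF is determined as the sum of the reciprocals of the failure rates of the components,
 $\frac{1}{\lambda} = \sum_{i=1}^{n} \frac{1}{\lambda_i} $
 Strictly speaking, this sum of the individual MTTFs holds for standby redundancy, where a spare takes over only after the active component has failed. When all components run at the same time, it overestimates the MTTF; for two components the exact value is $\frac{1}{\lambda_1} + \frac{1}{\lambda_2} - \frac{1}{\lambda_1 + \lambda_2}$.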
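The series rule can be checked numerically. The sketch below covers the series case only and assumes constant, independent failure rates; the four rates are hypothetical values in failures per hour. It shows that summing the rates gives the same reliability as multiplying the individual $e^{-\lambda_i t}$ terms.

```python
import math


def series_failure_rate(rates):
    """Effective failure rate of series-connected components: the sum of the rates."""
    return sum(rates)


def series_reliability(rates, t):
    """Product of the individual reliabilities R_i(t) = exp(-lambda_i * t)."""
    return math.prod(math.exp(-lam * t) for lam in rates)


if __name__ == "__main__":
    rates = [1e-4, 2e-4, 5e-5, 1.5e-4]  # hypothetical, failures/hour
    lam = series_failure_rate(rates)
    t = 1000.0  # mission time in hours
    print(f"effective lambda = {lam:.6f} failures/hour")
    print(f"effective MTTF   = {1.0 / lam:.1f} hours")
    # Both expressions describe the same series system and should agree.
    print(f"product of R_i(t) = {series_reliability(rates, t):.6f}")
    print(f"exp(-lambda * t)  = {math.exp(-lam * t):.6f}")
```

For these sample rates the effective failure rate is $5 \times 10^{-4}$ per hour, i.e. an MTTF of about 2,000 hours, which is shorter than the MTTF of even the weakest component (5,000 hours).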
To calculate reliability and availability metrics, assuming the components fail independently of each other,

  • For series connected components,
$ R(t) = \prod_ {i=1}^{n} R_i(t) $
$ A(t) = \prod_ {i=1}^{n} A_i(t) $ 
  • For parallel connected components, 
$ R(t) = 1- \prod_ {i=1}^{n} (1- R_i(t)) $
$ A(t) = 1- \prod_ {i=1}^{n} (1- A_i(t)) $ 
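These two rules compose, so a multi-tier system can be evaluated tier by tier. Here is a minimal sketch, assuming independent failures; the helper names `series` and `parallel` and the three-tier layout with its availability figures are hypothetical, chosen only to illustrate the arithmetic.

```python
import math


def series(values):
    """All components must work: multiply the individual figures."""
    return math.prod(values)


def parallel(values):
    """At least one component must work: 1 minus the product of the failure probabilities."""
    return 1.0 - math.prod(1.0 - v for v in values)


if __name__ == "__main__":
    # Hypothetical layout: one load balancer, then two redundant app nodes,
    # then two redundant database nodes. The tiers are connected in series.
    load_balancer = 0.999
    app_tier = parallel([0.99, 0.99])
    db_tier = parallel([0.995, 0.995])
    system = series([load_balancer, app_tier, db_tier])
    print(f"app tier = {app_tier:.6f}")
    print(f"db tier  = {db_tier:.6f}")
    print(f"system   = {system:.6f}")
```

In this made-up layout the system comes out at roughly 99.89%, which is dominated by the single, non-redundant load balancer. The same functions work for reliability values $R_i(t)$ evaluated at a common time $t$.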

Thus, the reliability and availability of a series-connected network of components are lower than those of the individual components. For example, two components with 99% availability each, connected in series, yield 98.01% ($= 0.99 \times 0.99$) availability. The converse is true for the parallel combination model. If one component has 99% availability, then two such components in parallel yield 99.99% ($= 1-(1-0.99) \times (1-0.99)$) availability, and four components in parallel yield 99.999999% ($= 1-(1-0.99)^4$) availability. Adding redundant components to the network further increases reliability and availability.
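Percentages like these are easier to compare against an SLA when converted into expected downtime. The following sketch does that for identical components with 99% availability each, assuming independent failures and a 365-day year; the function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def series_availability(a, n):
    """n identical components in series, each with availability a."""
    return a ** n


def parallel_availability(a, n):
    """n identical components in parallel, each with availability a."""
    return 1.0 - (1.0 - a) ** n


def downtime_hours_per_year(availability):
    """Expected unavailable hours in one year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    a = 0.99
    cases = {
        "single component": a,
        "2 in series": series_availability(a, 2),
        "2 in parallel": parallel_availability(a, 2),
        "4 in parallel": parallel_availability(a, 4),
    }
    for name, value in cases.items():
        hours = downtime_hours_per_year(value)
        print(f"{name:17s} availability = {value:.8f}, downtime = {hours:.6f} h/year")
```

A single 99% component is down about 87.6 hours per year and two in series about 174 hours, while two in parallel bring this down to under an hour and four in parallel to a fraction of a second.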
