Reliability and Availability Metrics and Calculations

For a complex software solution, you usually have to meet the customer's reliability and availability requirements as defined in the SLA. For a monolithic appliance, these figures can be determined trivially, but most real-world applications require multiple physical nodes, VMs, or machines.

Extrapolating the reliability and availability figures for a complex multi-tier software system can be a challenge for an IT practitioner who is not familiar with reliability engineering. So, let's dive right into it.

Let's first define some key terms:

  1. MTTF: Mean Time To Failure aka 'Average time to failure of a non-repairable component'.
  2. MTBF: Mean Time Between Failures aka 'Average time between two failures of a repairable component'.
  3. MTTR: Mean Time To Repair aka 'Average time to repair a component'.

Now, let's establish the concept of failure rate ($\lambda$). For a repairable component,
$\lambda = \frac{1}{MTBF}$
or, for a non-repairable component,
$\lambda = \frac{1}{MTTF}$

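The reliability function, i.e. the probability that a component with a constant failure rate runs without failure up to time $t$, is defined as,

$R(t) = e^{-\lambda t}$
And availability is defined as the ratio of uptime to total time (uptime + time to repair).
$A(t) = \frac{MTBF}{(MTBF + MTTR)}$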
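
As a minimal sketch, the three formulas above translate to Python as follows. The function names and the sample figures (1,000 hours MTBF, 2 hours MTTR) are illustrative assumptions, not values taken from any SLA.

```python
import math

def failure_rate(mtbf_hours: float) -> float:
    """Failure rate (lambda) as the reciprocal of MTBF (or MTTF)."""
    return 1.0 / mtbf_hours

def reliability(lam: float, t_hours: float) -> float:
    """R(t) = exp(-lambda * t): probability of running t hours without failure."""
    return math.exp(-lam * t_hours)

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability as MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical component: fails on average every 1,000 h and takes 2 h to repair.
lam = failure_rate(1000.0)
print(f"lambda       = {lam:.4f} failures/hour")
print(f"R(24 h)      = {reliability(lam, 24.0):.4f}")
print(f"availability = {availability(1000.0, 2.0):.4%}")
```

Note that reliability depends on the mission time $t$, while availability depends only on the ratio of MTBF to MTTR.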

Now, for a multi-component system with components $C_1$, $C_2$, ..., $C_n$ having failure rates $\lambda_1$, $\lambda_2$, ..., $\lambda_n$ respectively,

Effective failure rate is calculated as follows,

  • For series connected components, the effective failure rate is determined as the sum of failure rates of each component such that,
$\lambda = \sum_{i=1}^{n} \lambda_i $
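  • For parallel connected components, the MTTF is estimated as the sum of the reciprocals of the failure rates, i.e. the sum of the component MTTFs. This sum is exact for cold-standby redundancy with perfect switchover; when all components run at the same time (active redundancy), it is an upper bound on the MTTF,
 $\frac{1}{\lambda} = \sum_{i=1}^{n} \frac{1}{\lambda_i} $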
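
The two rules above can be sketched in Python as below. The failure rates are hypothetical, and the function names are made up for illustration. For comparison, the sketch also computes the MTTF of two components running in active parallel, using the closed form obtained by integrating $1-(1-e^{-\lambda_1 t})(1-e^{-\lambda_2 t})$ over time; that figure comes out lower than the simple sum of MTTFs.

```python
def series_failure_rate(rates):
    """Series system: effective failure rate is the sum of component rates."""
    return sum(rates)

def summed_mttf(rates):
    """Sum of the component MTTFs (1/lambda_i), as in the parallel rule above."""
    return sum(1.0 / lam for lam in rates)

def active_parallel_mttf_two(lam1, lam2):
    """MTTF of two exponential components running in active parallel.

    Integral from 0 to infinity of 1 - (1 - e^(-lam1*t)) * (1 - e^(-lam2*t)).
    """
    return 1.0 / lam1 + 1.0 / lam2 - 1.0 / (lam1 + lam2)

rates = [0.001, 0.002]  # hypothetical failure rates, in failures per hour

lam_series = series_failure_rate(rates)
print(f"series: lambda = {lam_series:.4f}/h, MTTF = {1.0 / lam_series:.1f} h")
print(f"parallel, sum of MTTFs: {summed_mttf(rates):.1f} h")
print(f"parallel, both active:  {active_parallel_mttf_two(*rates):.1f} h")
```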
To calculate reliability and availability metrics,

  • For series connected components,
$ R(t) = \prod_ {i=1}^{n} R_i(t) $
$ A(t) = \prod_ {i=1}^{n} A_i(t) $ 
  • For parallel connected components, 
$ R(t) = 1- \prod_ {i=1}^{n} (1- R_i(t)) $
$ A(t) = 1- \prod_ {i=1}^{n} (1- A_i(t)) $ 
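
The series and parallel formulas compose, so a multi-tier system can be evaluated by nesting them. The sketch below assumes Python 3.8+ (for `math.prod`) and a hypothetical three-tier layout with made-up availability figures; the same two helpers work for reliability values $R_i(t)$.

```python
from math import prod

def series(values):
    """Series combination: every component must be up."""
    return prod(values)

def parallel(values):
    """Parallel combination: the group is down only if all components are down."""
    return 1.0 - prod(1.0 - v for v in values)

# Hypothetical tiers: one load balancer, two redundant app servers, one database.
load_balancer = 0.999
app_server = 0.99
database = 0.995

system = series([
    load_balancer,
    parallel([app_server, app_server]),  # redundant pair treated as one block
    database,
])
print(f"system availability = {system:.4%}")
```

The single (non-redundant) tiers dominate the result here, which is why redundancy is usually added first to the least available series element.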

Thus, the reliability and availability of a series-connected network of components are lower than those of the individual components. For example, two components with 99% availability connected in series yield 98.01% ($= 0.99 \times 0.99$) availability. The converse is true for the parallel combination model. If one component has 99% availability, then two components combined in parallel yield 99.99% ($= 1-(1-0.99)(1-0.99)$) availability, and four components in parallel yield 99.999999% ($= 1-(1-0.99)^4$) availability. Adding redundant components to the network further increases reliability and availability.
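
To make the effect of redundancy concrete, here is a small sketch that tabulates availability against the number of identical parallel components and estimates how many replicas a target needs. The helper names, the 8,760-hour year, and the assumption that failures are independent are all illustrative choices, not part of any standard API.

```python
HOURS_PER_YEAR = 8760.0  # assumption: non-leap year

def parallel_identical(a: float, n: int) -> float:
    """Availability of n identical, independent components in parallel."""
    return 1.0 - (1.0 - a) ** n

def downtime_hours_per_year(a: float) -> float:
    """Expected yearly downtime implied by availability a."""
    return (1.0 - a) * HOURS_PER_YEAR

def replicas_needed(a: float, target: float) -> int:
    """Smallest n such that n parallel components reach the target availability."""
    if not (0.0 < a < 1.0 and 0.0 < target < 1.0):
        raise ValueError("availabilities must be strictly between 0 and 1")
    n = 1
    while parallel_identical(a, n) < target:
        n += 1
    return n

for n in range(1, 5):
    combined = parallel_identical(0.99, n)
    print(f"{n} component(s): {combined:.8%}, "
          f"about {downtime_hours_per_year(combined):.6f} h downtime/year")

print("replicas for a 99.999% target:", replicas_needed(0.99, 0.99999))
```

In practice, failures of redundant nodes are rarely fully independent (shared power, network, or software bugs), so figures like these are best read as an optimistic bound.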
