Reliability Engineering

Few enterprise data centres are designed according to the long-established principles of systematic Reliability Engineering, which have been applied for over half a century in mission-critical industries such as nuclear, aviation and the military.

A typical example of neglecting Reliability Engineering in the design of enterprise data centres is the large number of Ethernet LAN switch hops in the transmission channel. One can often find five switch hops where a single core switch would do. Compared with five switch hops, a single core switch:
  • is more than five times as reliable
  • uses less energy
  • is less complex
  • has lower Capex and Opex
  • has lower latency
  • provides for a lossless Ethernet fabric, which supports converged storage area network technologies such as FCoE (Fibre Channel over Ethernet)
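
The reliability bullet follows from the rule for items in series: a path works only if every hop works, so the failure rates of the hops add up. The sketch below illustrates this under stated assumptions: identical, independent switches with a constant failure rate (the exponential model) and a hypothetical MTBF of 200,000 hours per switch. That figure is chosen only for illustration and is not a vendor value.

```python
import math

def survival_probability(mtbf_hours: float, mission_hours: float) -> float:
    """Probability that one item survives the mission, assuming a constant failure rate."""
    return math.exp(-mission_hours / mtbf_hours)

def series_mtbf(mtbf_hours: float, count: int) -> float:
    """MTBF of `count` identical items in series: failure rates add, so MTBF divides."""
    return mtbf_hours / count

def series_survival(item_survival: float, count: int) -> float:
    """Survival probability of `count` identical items in series: all must survive."""
    return item_survival ** count

SWITCH_MTBF_HOURS = 200_000   # hypothetical figure, for illustration only
MISSION_HOURS = 8_760         # one year of continuous operation

single = survival_probability(SWITCH_MTBF_HOURS, MISSION_HOURS)
five_hops = series_survival(single, 5)

print(f"MTBF, single core switch: {series_mtbf(SWITCH_MTBF_HOURS, 1):,.0f} h")
print(f"MTBF, five switch hops:   {series_mtbf(SWITCH_MTBF_HOURS, 5):,.0f} h")
print(f"One-year survival, single core switch: {single:.4f}")
print(f"One-year survival, five switch hops:   {five_hops:.4f}")
```

With identical switches, the five-hop path has one fifth of the MTBF of the single switch. In a real channel the cables, transceivers and connectors between hops would count as further series items, which this sketch leaves out.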

This neglect is partly due to the historic definitions of data centre redundancy and availability (i.e. the legacy Uptime Tiers), which only consider the mechanical, electrical and plumbing (ME&P) subsystems of the data centre. A holistic approach, which considers all of the data centre subsystems plus external utilities and risks, can be found in the new BICSI Data Centre Standard, a sample of which can be downloaded from the Tier & Class Accreditation page of this web site. The appendix of the BICSI standard also includes an introduction to Reliability Engineering, including simple mathematical calculations for series and parallel items in a system.
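
As a rough illustration of what series and parallel calculations look like, the sketch below uses the standard textbook rules rather than anything taken from the BICSI text: items in series multiply their availabilities, while redundant items in parallel multiply their unavailabilities. It assumes independent failures, and every availability figure and subsystem name in it is hypothetical.

```python
from math import prod

def series(*availabilities: float) -> float:
    """All items must work: multiply the individual availabilities."""
    return prod(availabilities)

def parallel(*availabilities: float) -> float:
    """At least one item must work: 1 minus the product of the unavailabilities."""
    return 1 - prod(1 - a for a in availabilities)

# Hypothetical availability figures, for illustration only.
power = parallel(0.999, 0.999)        # two redundant power feeds
cooling = parallel(0.995, 0.995)      # two redundant cooling units
network = series(*[0.9995] * 5)       # five switch hops, no redundancy

# The site works only if power, cooling and network all work.
site = series(power, cooling, network)

for name, value in [("power", power), ("cooling", cooling),
                    ("network", network), ("site", site)]:
    print(f"{name:8s} {value:.6f}")
```

Under these assumptions the site figure can never exceed that of its weakest subsystem, which is why redundancy in the ME&P subsystems alone does not make up for a long unprotected chain elsewhere.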

Below is a mind map covering the wider topic of Reliability Engineering, which is available as a free download.

I would also recommend reading the classic book on the topic, "Reliability Theory and Practice" by Igor Bazovsky.

An enterprise data centre is only as good as its weakest mission-critical item, so a holistic approach is essential. If you need help with this, all you need to do is ask.

John Laban,
1 Feb 2015, 05:56