Case on DOM through Material Selection

The Issue: Frequent failure of a critical pipe disrupting Reliability, Availability, Performance and Safety

Solution: Design Innovation –> DOM (Design Out Maintenance)

Component: Outlet pipe from a Furnace

Function: Carries hot (1000 degrees Centigrade) oxidised metal (calcine) from the Furnace.

Dimension: The diameter of the pipe is 350 mm at the inlet and 250 mm at the outlet. The length of the pipe is around 6500 mm.

MOC (material of construction): SS310 or SS321.

Failure Mode: Leakage, Accelerated Wear, Cracking (SCC)

MTBF (Mean Time Between Failures): Low, unacceptable.

Difficulty in maintenance: High and involves high degree of safety risk and uncertainty.

Type of DOM/Design Innovation: Third Type (Increase resilience). Note: the First Type is Change System Structure and the Second Type is Change Interactions.

Goal of Problem Solving/Design Innovation: To determine a suitable material for the application that increases MTBF

Primary Analysis: Enveloped Constraint Analysis

The edges of the constraint envelope are the following:

1) Force/Momentum: High, because the pipe handles a heavy flow of oxidised material. The irregularly shaped particles also generate high frictional force, so resistance to the flow of material is high.

2) Reaction: Material structure changes under the influence of high temperature

3) Environment: Halide, which induces corrosion and stress corrosion cracking of stainless steel. Metals and alloys that are resistant to corrosion usually depend for their resistance on the formation and maintenance of thin films (commonly between 5 and 200 Å thick) of passivating oxide. The breakdown of such films by “aggressive” anions such as chloride and other halides (except fluoride), at sufficiently positive anode potential and at sufficiently high temperature, is often responsible for the failure of such alloys, because it usually leads to serious pitting of the bared metal.

4) Time: Due to friction, heat and halides, the frequency of the failure mode is quite high, resulting in high maintenance cost and effort and endangering the safety of equipment and personnel.

5) Temperature: very high — induces failures in presence of halides and abrasive wear

6) Lubrication/Wear: High wear rate due to depletion of the thin film of passivating oxide, accelerated by temperature and the abrasive nature of the material handled.

7) Surface/Shape: Hydraulic equivalent diameter of the pipe is important so that no choking or jamming is experienced. If jamming occurs, the degradation leading to the failure mode is accelerated.

8) Material: Present material is unsuitable for the application.

Pattern of Failure Mode: Predominantly Wear-out (time dependent) but also has the features of Early (frequent) and Random (degree of uncertainty is high).

Probability of Survival of the component working within the above conditions: Extremely low (<10%)
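The mixed failure pattern noted above can be pictured with a Weibull survival curve, a standard reliability model in which a shape parameter below 1 describes early failures, a shape of 1 random failures, and a shape above 1 wear-out. The sketch below is only an illustration: the choice of the Weibull model, the function name and every number in it are assumptions, not data from this case.

```python
import math


def weibull_survival(t_hours: float, scale_hours: float, shape: float) -> float:
    """Probability of surviving to t_hours: R(t) = exp(-(t / scale) ** shape)."""
    if t_hours < 0 or scale_hours <= 0 or shape <= 0:
        raise ValueError("t_hours must be >= 0; scale_hours and shape must be > 0")
    return math.exp(-((t_hours / scale_hours) ** shape))


# Hypothetical characteristic life and mission time, for illustration only.
SCALE_HOURS = 1000.0
MISSION_HOURS = 2000.0

for label, shape in [("early", 0.5), ("random", 1.0), ("wear-out", 3.0)]:
    r = weibull_survival(MISSION_HOURS, SCALE_HOURS, shape)
    print(f"{label:9s} shape={shape}: R({MISSION_HOURS:.0f} h) = {r:.4f}")
```

Past the characteristic life, the wear-out curve falls away much faster than the other two, which is one way to see why a predominantly wear-out pattern gives such a low probability of survival.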

Consequences: High (5) — Safety of human beings and equipment along with production losses

Warning effect: None (so can’t be subjected to condition monitoring)

PLS3D Analysis:

Points: Important points of this case: 2, 3, 5

Surfaces: 2 -> 1; 2 -> 5; 3 -> 4; 3 -> 6; 3 -> 5

3D (Type of Problem/Insight): Multiple Causes: Multiple Effects
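One way to see why this case is classed as "Multiple Causes: Multiple Effects" is to treat the surfaces as directed links between the numbered edges of the constraint envelope. This is a minimal sketch; reading "a -> b" as "a influences b" is my assumption here, and the helper name is hypothetical.

```python
from collections import defaultdict

# Edges of the constraint envelope, numbered as in the list above.
EDGES = {1: "Force/Momentum", 2: "Reaction", 3: "Environment",
         4: "Time", 5: "Temperature", 6: "Lubrication/Wear"}

# Surfaces as listed; "a -> b" is read as "a influences b" (an assumption).
SURFACES = [(2, 1), (2, 5), (3, 4), (3, 6), (3, 5)]


def link_maps(surfaces):
    """Return (effects_of, causes_of) dictionaries built from the links."""
    effects_of = defaultdict(list)
    causes_of = defaultdict(list)
    for src, dst in surfaces:
        effects_of[src].append(dst)
        causes_of[dst].append(src)
    return dict(effects_of), dict(causes_of)


effects_of, causes_of = link_maps(SURFACES)

for src, targets in effects_of.items():
    print(f"{EDGES[src]} -> {[EDGES[t] for t in targets]}")

# Edges that are driven by more than one source.
multi_cause = {dst: srcs for dst, srcs in causes_of.items() if len(srcs) > 1}
print("Driven by more than one source:",
      {EDGES[d]: [EDGES[s] for s in srcs] for d, srcs in multi_cause.items()})
```

Each source fans out to several edges, and Temperature is reached from both Reaction and Environment, so no single cause-and-effect chain describes the failure.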

DOM choices of MOC:

a) 321 SS (UNS S32100) is a titanium stabilized austenitic stainless steel that features improved resistance to intergranular corrosion. This grade is suitable for high-temperature applications up to 1500°F (815°C), where the addition of titanium stabilizes the material against chromium carbide formation.

b) SS310 is a highly alloyed austenitic stainless steel designed for elevated-temperature service. The high Cr and Ni contents enable this alloy to resist oxidation in continuous service at temperatures up to 1200°C provided reducing sulfur gases are not present.

c) INCONEL® nickel-chromium alloy 625 (UNS N06625/W.Nr. 2.4856) is used for its high strength, excellent fabricability (including joining), and outstanding corrosion resistance. Service temperatures range from cryogenic to 1800°F (982°C).

d) Hastelloy C-276 maintains oxidation resistance at temperatures up to 1100°C and resistance to pitting, corrosion, and cracking at temperatures up to 1040°C [16].

Final Solution (increase resilience): Hastelloy C-276. Reason: It fulfils all the constraints and conditions.
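The screening behind this choice can be sketched as a simple filter. The temperature limits are the ones quoted in the list of choices above; taking the lower of the two Hastelloy C-276 figures (1040°C) and rejecting the present MOC because it has already failed in service are my simplifying assumptions, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    max_service_temp_c: float  # temperature limit quoted in the text
    failed_in_service: bool    # True for the present MOC (SS310 or SS321)


SERVICE_TEMP_C = 1000.0  # temperature of the calcine carried by the pipe

CANDIDATES = [
    Candidate("SS321", 815.0, True),
    Candidate("SS310", 1200.0, True),
    Candidate("Inconel 625", 982.0, False),
    Candidate("Hastelloy C-276", 1040.0, False),  # lower of 1100 and 1040
]


def screen(candidates, service_temp_c):
    """Keep candidates that tolerate the service temperature and have not
    already failed in this application."""
    return [c.name for c in candidates
            if c.max_service_temp_c >= service_temp_c
            and not c.failed_in_service]


print(screen(CANDIDATES, SERVICE_TEMP_C))
```

Under these assumptions only Hastelloy C-276 passes both checks. A real selection would also weigh cost, fabricability and abrasion resistance, which the text does not quantify.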

Expected Enhanced Probability: 90% (from 10%)

Proving: Awaited

Future Work: a) Estimate the life of the new MOC b) Plan for replacement based on estimated life supported by assessment of condition (CBM methodology to be devised)

Strategy Selection against Failure Modes

In a certain mining company of South Africa the Maintenance Planners chose the following Maintenance Strategies against the failure modes experienced in a crusher.

| Failure Mode | Maintenance Strategy |
| --- | --- |
| High pressure before filter | PM |
| Pipe choked | Corrective |
| Flow Switch open | PM |
| Thermal Trip | DOM |
| Leaking Grease pipe | CBM |
| Low Pressure of oil before filter | PM |
| Oil Flow low | PM |
| Blocked Chute | Corrective |
| Lub Pump Tripped | Corrective |
| Low Bearing Temperature | PM |
| Communication not received | DOM |
| Bolts of Top Shell loose | Corrective |
| Motor Tripped | Corrective |
| Speed Switch Open | PM |
| Crusher Overload | Corrective |
| High Oil Temperature | PM |

Would you agree with the selection of maintenance strategy? What is the critical piece of information the maintenance planners were missing?

PM = Preventive Maintenance
Corrective = Repair or Restore
DOM = Design Out Maintenance
CBM = Condition Based Maintenance
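Before judging the selection, it can help to see how the sixteen choices are distributed across the four strategies. The tally below uses only the entries listed above; it is a quick sketch that summarises the planners' choices and does not by itself answer the question.

```python
from collections import Counter

# Failure mode -> strategy, copied from the list above.
STRATEGY_BY_FAILURE_MODE = {
    "High pressure before filter": "PM",
    "Pipe choked": "Corrective",
    "Flow Switch open": "PM",
    "Thermal Trip": "DOM",
    "Leaking Grease pipe": "CBM",
    "Low Pressure of oil before filter": "PM",
    "Oil Flow low": "PM",
    "Blocked Chute": "Corrective",
    "Lub Pump Tripped": "Corrective",
    "Low Bearing Temperature": "PM",
    "Communication not received": "DOM",
    "Bolts of Top Shell loose": "Corrective",
    "Motor Tripped": "Corrective",
    "Speed Switch Open": "PM",
    "Crusher Overload": "Corrective",
    "High Oil Temperature": "PM",
}

counts = Counter(STRATEGY_BY_FAILURE_MODE.values())
total = sum(counts.values())

for strategy, n in counts.most_common():
    print(f"{strategy:10s} {n:2d}  ({n / total:.0%})")
```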


History of FMEA


Failure Mode and Effect Analysis (FMEA) is the name given to a group of activities performed to ensure that everything that could possibly go wrong with a product has been identified, and that appropriate actions have been taken either to prevent undesirable failures or to prevent the consequences of both probabilistic and deterministic failures.


In the 1960s, engineers on the space programs (especially the program to land on the moon) were faced with the staggering consequences of potential failures that might happen to a spacecraft. Hence they created a method of forecasting all possible problems that might happen with the components used in a spacecraft. They thought beyond the normal design considerations and imagined all the bizarre situations that could possibly take place. They got together for extended and concentrated brainstorming sessions to identify all such situations, even if the probability of such incidents was ridiculously low. The result of this approach was a resounding success. Man landed on the moon in 1969, without a problem.

But after 1969, interest in the space program dwindled. As a result many NASA engineers moved to traditional industries for their living. They carried with them their now famous and celebrated failure forecasting technique. This technique eventually came to be known as FMEA. In 1972, NAAO, a Quality Assurance organization, developed the original reliability training program, which included a module on the execution of FMEA.


Although good engineers have always performed FMEA-type analysis on their designs, most of their efforts were documented only in the form of their final part and assembly drawings. As a result, past mistakes could easily be repeated. The most important reason seemed to be that people left the company and were then unavailable to check new designs or to provide on-going education and guidance to new recruits.

This created a big problem for most industries. This was mainly because of stringent liability insurance issues besieging the companies involved in product design, development and marketing. Companies then began to address the issue with all seriousness. For example, in the 1970s, automotive industries took up FMEA as a natural tool to reduce the occurrence of failures.


Since that time, the discipline of FMEA has been spreading among the multibillion dollar companies. In turn, these large companies have been pressing their suppliers to adopt FMEA to improve the reliability of their products during the design stage of product development.

Hence, it may be stated that FMEA is a design tool aimed at reducing failures to the minimum possible level.

Keywords: FMEA, Failures, Consequences, Tool, Reliability, Design

Rules of Thumb about Decision Making

  1. Consider what you want to achieve, avoid, sustain or improve. Focus on results you want to achieve. Don’t focus on actions. Create goals accordingly. In short a goal is the gap between the vision and current reality.
  2. Generate many options to achieve the goal. Don’t stick to a few options.
  3. Decisions are based on published, measurable criteria and never on an ad hoc basis.
  4. Criteria can usually be classified into “Must” and “Want”. The Must criteria must be satisfied for the option/recommended solution to be viable.
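The “Must” and “Want” rule can be sketched in a few lines: options that fail any Must are dropped, and the remaining options are ranked by weighted Want scores. The options, attributes and weights below are all hypothetical, and the weighted sum is only one of several ways to combine Want criteria.

```python
def choose(options, musts, want_weights):
    """Return (best_name, scores) for the options that satisfy every Must."""
    viable = {name: attrs for name, attrs in options.items()
              if all(check(attrs) for check in musts)}
    scores = {name: sum(weight * attrs[key]
                        for key, weight in want_weights.items())
              for name, attrs in viable.items()}
    best = max(scores, key=scores.get) if scores else None
    return best, scores


# Hypothetical options: cost in thousands, ratings on a 1 to 5 scale.
OPTIONS = {
    "A": {"cost_k": 40, "safe": True, "ease": 3, "life": 5},
    "B": {"cost_k": 25, "safe": True, "ease": 4, "life": 3},
    "C": {"cost_k": 15, "safe": False, "ease": 5, "life": 4},
}

# Published Must criteria: each must hold for an option to be viable.
MUSTS = [lambda o: o["safe"], lambda o: o["cost_k"] <= 50]

# Published Want criteria with their relative weights.
WANT_WEIGHTS = {"ease": 1.0, "life": 2.0}

best, scores = choose(OPTIONS, MUSTS, WANT_WEIGHTS)
print(best, scores)
```

Option C is the cheapest and easiest, yet it never reaches the scoring step because it fails a Must. That is the point of publishing the criteria before looking at the options.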

Rules of Thumb — Engineering Communication

One of the most important tasks of engineers, managers, facilitators, guides, mentors, consultants and trainers is to communicate.

Without right and effective communication, nothing seems to get done. One may be working very hard but he/she would fail to see results on the ground. That is indeed very frustrating. The secret is — unless people are involved, nothing worthwhile gets done. The goal of communication is to involve people.

In this article, I would like to focus on the Communication Process.

Hence some basic rules of thumb on the Process of Engineering Communication:

  1. Audience Based: Use an audience-based and not a writer-based approach. The content of the communication (whether written or verbal) must only address questions the audience wants answered. There is little or no point in dumping all the information the writer knows on the subject. In such cases, a knowledgeable writer or communicator is internally motivated to share whatever he/she knows about a subject. Hence the length of such presentations or written communication is directly proportional to the length of time a communicator has spent researching a subject. This becomes quite boring to an audience. It is worthwhile to remember that there is no need for a communicator to be perceived as a highly knowledgeable person. The audience would decide that anyway. If one is forced to communicate in a structured fashion, the structure should be hierarchical, i.e. arranged in order of importance to an audience, starting from what the audience “must know” about a question and going on to the “good to know” type of information, which in most cases can be avoided.

    In any case, the communication structure must not be historical or chronological, unless the audience is interested in the history of a subject. In my workshops, I start off by asking what the audience wants to know and then address their questions one by one. Once you adopt this approach you will see how time flies and how stress-free the environment becomes. The golden rule is: there is no universally accepted template that fits all communications; each communication must answer the questions of the audience.
  2. Be Effective: Effective communication is linked to two important things: a) Problem Solving skills b) Clear Thinking skills. However, if I were to choose between the two skills I would go for Problem Solving skill alone. This is because good and effective problem solvers develop clear thinking skills, without which it becomes difficult to solve any problem worth its salt. But why is effective communication linked to problem solving skills? This is because any type of audience loves to hear stories. And real life stories keep an audience glued to the communicator. When a communicator tells a story the tension and suspense created are palpable. This moves the audience to be attentive. Moreover, as I have seen, an audience learns most from stories. Needless to mention, helping others to learn through their self awareness and then seeing them act upon it is the fundamental effect of any good communication.
  3. Welcome Confusion: Confusion is always welcome since it forces the communicator to check back and rethink his/her thinking. A good way to bring ‘confusion’ out in the open is to trigger a feedback loop. Simply stated, it means asking the audience which parts of the communication weren’t well understood or appreciated. This not only helps the audience stick to the flow of communication but also generates life in the communication.
  4. Let ideas flow: Ideas must flow coherently from one to the next. In short, the ideas must be well knitted together and expressed cogently. This helps an audience see the whole picture and appreciate the depth of a topic. It generates interest on an on-going basis that propels the audience to think and act on their understanding. There is no need to stick to one idea throughout a presentation. One may, of course, dwell on an idea for some time before logically connecting it to the next idea. Connecting ideas makes a presentation trigger the thinking process of the audience, which is worth its weight in gold.
  5. Revise: Be willing to rethink, revise and rework the whole structure if the structure of the communication doesn’t meet the needs of an audience. Though grammar and style are important, there is no great need to polish those endlessly. And be more than willing to delete or discard entire sections already written instead of incorrectly hoping to deliver everything that is written.
  6. Coherent Written Plan: Develop a coherent written plan, especially when one is supposed to speak on a topic or present verbally. A simple cue card (5 x 7 inches) with a few bullet points may be sufficient to keep one pegged to the overall picture. This also helps to keep the mind of a presenter poised and calm. Any tension in the mind of a presenter would soon show up and get communicated. The audience would notice the internal tension of a presenter and they wouldn’t like it much.
  7. Rehearse: Don’t forget to rehearse. Communication like all performance arts needs deep rehearsing. There are many ways of rehearsing, some of which are — a) read aloud (it helps to uncover flaws very quickly) b) present it to your children or spouse or friends c) rehearse in front of a mirror d) mentally rehearse the topic including the possible gestures, pauses etc one is likely to incorporate in the presentation (this can be done even while taking a shower).

By Dibyendu De

Why Systems Fail to Achieve RAM requirements?

Following is a summary of the reasons why manufacturing plants fail to achieve the desired level of RAM (Reliability, Availability and Maintainability) requirements:

• Poorly defined or unrealistically high RAM requirements.

• Lack of priority on achieving Reliability & Maintainability

• Too little engineering for RAM

  • Failure to design-in reliability early in the development process
  • Inadequate lower level testing at component or subcomponent level
  • Reliance on prediction instead of conducting engineering design analysis.
  • Failure to perform engineering analysis of commercial off the shelf equipment
  • Lack of reliability improvement incentives
  • Inadequate planning for reliability
  • Ineffective implementation of Reliability Tasks in improving reliability
  • Failure to give adequate priority to the importance of Integrated Diagnostics (ID) design, which influences overall maintainability attributes, mission readiness, maintenance concept design and associated LCC support concepts.
  • Unanticipated complex software integration issues affecting all aspects of RAM
  • Lack of adequate Integrated Diagnostic (ID) maturation efforts during system integration.
  • Failure to anticipate design integration problems where incremental design approaches influence RAM performance.



Strange case of Semi-elliptical cracks

Assembly: Mill having a power input of more than 1000 kW

Sub-assembly: Gear box

Part: Casing

Location: Around the area of the torsion shaft on the output side

Failure Mode: Casing crack in semi-elliptical pattern (a probabilistic failure mode)

Description: semi-elliptical crack (the pink area on the photo)

Semi-elliptical crack on the gearbox casing – output side around torsion shaft area — Figure 1
Zoomed view of the crack as shown above — Figure 2


Phenomenon: Thermal Fatigue (such semi-elliptical cracks are distinguishing patterns for thermal fatigue; but not necessarily the only pattern of failure due to thermal fatigue).

Such a crack is a combination of many interactions taking place simultaneously.

However, the initiator of such a problem with casing is the likelihood of blowholes or casting defects in the casing. On examination this was confirmed.


1. Thermal fatigue then accelerates the development of such a crack, especially if the casing is thin.
2. The crack is induced by rotating bending of the torsion shaft. Hence the crack appeared around the middle of the casing around the torsion shaft area.
3. The temperature of the gearbox was higher on the output side, at around 70 degrees C.

Nature of Crack growth and development:

When the depth of the crack is around 0.025 to 0.25 mm the shape of the crack becomes semi-elliptical (check depth of crack). The threshold limit, at which the initial crack starts to propagate, is around 0.16 mm.

When such a crack starts to develop and propagate a cyclic noise is audible.

Other areas where Thermal Fatigue can cause failures:

1. Rolls of rolling mills — can be experienced anytime during operation
2. Gear Boxes — usually experienced in winters or relatively cold ambient temperature
3. Motor Windings — usually experienced in tropical summers
4. Anti-friction bearings — can be experienced anytime during operation

Method of Monitoring (varied). Some of the methods might be as follows:

1. Infra-red thermal imaging — especially for motors
2. Visual — powerful technique — both sight and sound
3. Vibration analysis — especially for rolls and anti-friction bearings
4. Temperature monitoring (differential or temperature distribution) — for gearbox casings and motor casings.
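The differential temperature check in the last item can be sketched as follows. The 70 degrees C output-side reading comes from the case above; the input-side reading, the alarm limit and the function names are hypothetical and would have to be set for the actual gearbox.

```python
def side_to_side_delta(input_side_c: float, output_side_c: float) -> float:
    """Temperature difference across the casing (output minus input side)."""
    return output_side_c - input_side_c


def needs_attention(input_side_c: float, output_side_c: float,
                    limit_c: float) -> bool:
    """True when the two sides of the casing differ by more than limit_c."""
    return abs(side_to_side_delta(input_side_c, output_side_c)) > limit_c


# Hypothetical input-side reading and alarm limit; 70 C is from the case.
INPUT_SIDE_C = 52.0
OUTPUT_SIDE_C = 70.0
LIMIT_C = 15.0

delta = side_to_side_delta(INPUT_SIDE_C, OUTPUT_SIDE_C)
flag = needs_attention(INPUT_SIDE_C, OUTPUT_SIDE_C, LIMIT_C)
print(f"delta = {delta:.1f} C, needs attention: {flag}")
```

Tracking the difference rather than a single absolute reading makes the check less sensitive to ambient temperature, which matters for a failure that tends to show up in winter.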

Precaution during purchase: (Design Review)

1. Casing thickness
2. Allowable temperature rise
3. Lubricant
4. Cooling system

Reliability Improvement of Physical Assets

What does reliability improvement of physical assets mean?

To answer this question we must first understand what we mean by reliability?

The word reliability conjures up different meanings for different people. To some it might mean quality. For others it might mean safety. While for some it might mean integrity. And for some it might mean a long and useful life of a machine.

That might seem very confusing.

Hence, in short, the word reliability means — “Whatever a user wants of his/her physical assets.”

If that is so, then reliability improvement would simply steer us towards problem solving activities that would help a user of physical assets achieve his/her desired intention.

However, solving problems to improve reliability or MTBF can be quite daunting. This is because the upper limit of reliability of machines is practically set during the design and manufacturing stage. So, to improve reliability during the operation stage is quite difficult though not impossible.

But solving problems to improve reliability during operation can be complex. Hence the entire process of problem solving has to be broken up into manageable chunks and executed in a step-by-step manner.

However, the end results of improving reliability of a machine would be:

  1. Relax: to continue in an operational mode as long as possible. This is achieved by ensuring that a machine loses minimum energy during running.
  2. Resilient: to make a machine work with minimum stress under varying conditions, interactions and contexts. This is done by modifying interactions and initial conditions to extend the MTBF to a level dictated by business goals.
  3. Rejuvenate: to make and keep a machine in a healthy state of performance without disturbing production and production cycles.

Behind all these activities the most important factor that is essential is the application of human awareness. Without it, success of a reliability improvement project is simply not possible, whatever the tools, techniques, process or methods might be.

How we recognize, develop and apply human awareness is the million dollar question.

Plant Reliability Improvement — RAPID

A) The Background

Engineers have tried to address the issue of Reliability of equipment and systems in various ways. Therefore, many methods and techniques are in existence. The usual approach pivots around the concept of “failure modes” and how best to guard against those so as to prevent the consequences of failures. However, the existing methods do not take into account, flow of energy as an important factor that determines reliability of plant and machinery.

Taking this into consideration, I am putting together some rules and principles that hopefully point to a path that might enable engineers to improve plant reliability with minimum effort, resources and time.

But before I do that I would like to put forward an easy understanding of the term — reliability.

B) Reliability of a Machine:

The period of time a machine would run (fulfil its function) without a problem or trouble.

The longer the period of trouble-free running, the better the reliability of the plant.

It therefore addresses the heart of reliability improvement — i.e. enhancement of useful operating life of a machine or a system of machines.

C) What does that mean in terms of energy flow?

We may say that a machine that can continue with the smooth flow of energy, with the minimum wastage of energy for a long period of time is more reliable than a machine where energy flow is disrupted frequently in some manner or the energy wastage is high or the energy is pushed out of equilibrium condition, which invariably stops the machine from functioning effectively.

D) So what might be the job of engineers?

It is evident that the fundamental job of engineers (both operation and maintenance) is to run and maintain a machine or system in such a way so that the energy wastage is minimized and smooth flow is ensured for a long or desired period of time.

E) How can this be done?

This may be done in three fundamental ways, which are as follows:

  1. Observe the dynamics of a machine or system to ascertain energy flow patterns and the degree of energy wastage and the reasons for such wastage. Also determine the degree of stability or instability of the system and what affects a system’s stability.
  2. Adjust or maintain the conditions within its operating context to ensure smoother flow of energy with minimum wastage. In the process, learn what changes in a machine/system would disrupt energy flow or push it out of equilibrium conditions and prevent such disruptions.
  3. Change, monitor, modify the system (made up of physical asset and components, process, information flow, analysis and decision making, teams) as necessary, for smoother flow of energy to continue over longer period of time with minimum energy wastage.

F) How to apply it in a real plant?

I have found the following method useful in various plants where I implemented Reliability Improvement Programs.

  1. Make a list of critical machines.
  2. Select a critical machine along with its sub-systems (machines that support its functioning)
  3. Apply the three steps as outlined in Section E (above) — within the operating context.
  4. Improve and stabilize performance; record the learning.
  5. Create a custom made monitoring system to spot changes in time along with custom made expert system to guide engineers to quickly decide the course of actions to be undertaken.
  6. Record the decisions, actions and changes in the form of equipment history.
  7. Move to the next critical machine and its sub-systems.
  8. As we go along, first check whether all the consequences of failures have been taken care of. Next check whether overall plant/area/section MTBF (Mean Time Between Failures) is going up and whether MTTR (Mean Time to Repair) is going down along with consequential lowering of maintenance and operation costs. Lastly check the accuracy of the custom made expert system (usually made up of multivariate parameters) in its ability to forewarn and guide maintenance decisions.
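The MTBF and MTTR check in the last step can be sketched from equipment history as below. The calculation uses the usual definitions (mean running time between failures, mean repair time, and availability as MTBF / (MTBF + MTTR)); the hours in the example are hypothetical, not plant data, and the function name is my own.

```python
def mtbf_mttr_availability(uptimes_h, repair_times_h):
    """Return (MTBF, MTTR, availability) from running and repair durations.

    uptimes_h: hours run before each failure.
    repair_times_h: hours taken by each repair.
    """
    if not uptimes_h or not repair_times_h:
        raise ValueError("need at least one uptime and one repair time")
    mtbf = sum(uptimes_h) / len(uptimes_h)
    mttr = sum(repair_times_h) / len(repair_times_h)
    availability = mtbf / (mtbf + mttr)
    return mtbf, mttr, availability


# Hypothetical equipment history before and after the improvement work.
before = mtbf_mttr_availability([100, 140, 120], [10, 6, 8])
after = mtbf_mttr_availability([300, 340, 320], [5, 3, 4])

for label, (mtbf, mttr, avail) in [("before", before), ("after", after)]:
    print(f"{label:6s} MTBF={mtbf:.0f} h  MTTR={mttr:.0f} h  "
          f"availability={avail:.2%}")

# The program is on track only if MTBF rises while MTTR falls.
improving = after[0] > before[0] and after[1] < before[1]
print("improving:", improving)
```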

G) General Rules:

Keeping the above in mind I formulated the following four rules that might help engineers and managers stay on the path of plant reliability improvement:

  1. For any machine, energy tries to move in sync through all elements of a machine through various interfaces against many contradictions and constraints but always choosing the path of least resistance.
  2. Changes in contradictions, constraints and interfaces change the quality of energy flow forcing energy flow to go out of equilibrium (instability) to either cause degradation of performance or cause failures that lead to unwarranted plant stoppages, affecting costs and productivity.
  3. Changes and the causes of such changes are reflected in the dynamics of a machine in terms of interdependent parameters like vibrations, heat, flow, wear, humidity, temperature, pressure etc.
  4. Reliability of any machine or system can be improved by either maintaining the contradictions, interfaces and constraints to “just right conditions” or changing those to enable smoother flow of energy for a longer period of time with minimum wastage.

H) Applications:

Having applied these basic rules some industrial plants were able to gain on-going benefits for years. Here are some examples of — Plant Wide Reliability Improvement

Fixing Organizational Problems

Every day, managers in different organizations face an array of problems. Usually, such problems keep repeating — either randomly or at regular intervals. After a while, it then becomes clear to the managers that such problems resist current ways of thinking and actions as practiced within the organization. 

The response to such problems is the question: “How can this problem be fixed permanently?”

A manager would then try to apply known theories, methods, and tools to solve the problem. And in this process, the managers can also increase their skills. But the problem is that the problems simply don’t vanish. They have a bad habit of sneaking back through the backdoor. 

Why is that?

The short answer is: “No problem can be fixed, at least permanently.”

This is because the nature of the problems keeps changing with time or the same problem comes back with different intensity or frequency. 

However, one can find and install new guiding ideas. And one can intently engage in redesigning an organization’s infrastructure, policies, rules, methods, and the tools presently used to find new ways of dealing with work and problems. 

The key is to closely observe what is going on in the present and then discover the organizational subconsciousness (mindset) that allows such events to happen with alarming regularity or randomly. Once that mindset is found, a manager can then find new ways of thinking and practices to replace the old governing mindset. 

If one keeps going in this way, one can gradually evolve a new type of organization that is responsive, agile and observant of the numerous interactions that go on within an organizational environment, and so becomes a better and fitter organization.

It would then be able to deal with the problems and opportunities of today and invest in its capacity with the right resources and efforts to embrace a better future. This happens because its members are focused on enhancing and expanding their collective consciousness  — where individual members are able to observe, learn and change together.  

In other words, they collectively create, support and sustain an organization that continually learns from its present situation.