Since the Industrial Revolution, equipment and manufacturing processes have grown dramatically in complexity. This led to a major evolution in maintenance engineering, which went from basic lubrication to the traditional approach of replacing worn-out parts before they fail. Ultimately, these practices evolved into Reliability Centered Maintenance (RCM).
RCM changed the paradigm and the very concept of maintenance. Thorough studies revealed that only a small percentage of parts actually wear out and, moreover, that they usually fail following a random distribution rather than at a predictable age. This means that we cannot predict when they are most likely to fail in order to replace them beforehand. In fact, in some cases, doing so just increases the probability of failure.
This article reviews the core concepts that are key to understanding the RCM methodology and that provide a basis for studying it in depth.
Every time we buy a piece of equipment it is because we want that equipment for a specific purpose. For example, manufacturers purchase a packaging machine to wrap up their products. In the same way, every component in our equipment has its specific purpose.
Equipment or components may have more than one function.
It is paramount to state the function clearly. Not only do we need to specify the purpose of the asset (e.g. to wrap our products), but we also need to quantify what we expect it to do (e.g. the number of packages per minute). This is crucial because this performance standard is later used to determine whether the asset is actually doing what we want it to do.
Although this may not sound complex, defining the function in the correct way is not an easy task. It requires experience and must be carried out carefully.
The function is always dependent on the operating context, especially at the time of defining and quantifying the expected performance.
Consider two identical pumps. One is installed in a nuclear reactor to pump cooling water to the core, and the other is used to transfer rainwater in a large facility. Clearly, the requirements and operating context are totally different in each case, and so is the maintenance approach.
When our asset doesn’t fulfill its function, it undergoes a functional failure. This functional failure is total if the asset does not perform its function at all, and partial if the asset is still working but not to the expected performance.
In our packaging machine example, if the machine is not working at all, this is a total functional failure. However, if it is working but cannot reach the required speed, this is a partial functional failure.
Notice that the operating context also plays a key role here, especially for partial functional failures. The wrapping machine may still be delivering what a small plant needs, whereas the same output would not be enough for a bigger plant with a higher speed requirement.
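The distinction between total and partial functional failure, and its dependence on the operating context, can be illustrated with a minimal sketch. The class names, the `classify` helper and the figures (80, 90 and 120 packages per minute) are assumptions made up for this example, not part of the RCM methodology itself.

```python
from dataclasses import dataclass
from enum import Enum


class FunctionalState(Enum):
    FULFILLED = "function fulfilled"
    PARTIAL_FAILURE = "partial functional failure"
    TOTAL_FAILURE = "total functional failure"


@dataclass(frozen=True)
class FunctionStatement:
    """A function with a quantified performance standard (hypothetical model)."""
    description: str
    required_rate: float  # packages per minute required by this operating context


def classify(function: FunctionStatement, measured_rate: float) -> FunctionalState:
    """Compare the measured output with what the operating context requires."""
    if measured_rate <= 0:
        return FunctionalState.TOTAL_FAILURE
    if measured_rate < function.required_rate:
        return FunctionalState.PARTIAL_FAILURE
    return FunctionalState.FULFILLED


# The same machine, delivering 90 packages per minute, in two operating contexts.
small_plant = FunctionStatement("wrap products", required_rate=80)
large_plant = FunctionStatement("wrap products", required_rate=120)

print(classify(small_plant, 90).value)
print(classify(large_plant, 90).value)
print(classify(large_plant, 0).value)
```

The same measured output is acceptable in one plant and a partial functional failure in the other, which is why the performance standard has to be written for a specific operating context.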
Assets can fail in many ways. Traditionally, when we had problems with an asset, we used to say that it just failed. On the other hand, in RCM, we clearly differentiate between the functional failure (the machine is not delivering what it should) and the failure modes, which are the events that actually produce the failure.
When we analyse failure modes, we need to consider three situations or categories:
1) When the asset’s performance drops below our desired value and the asset is no longer fulfilling its function. The most common reasons for this to happen are:
2) When the operating context starts demanding more from our asset and the desired performance increases: In this case, either the asset is unable to provide the results, or in order to deliver them it starts wearing out due to the increased stress.
3) When the asset was not capable of delivering the required performance in the first place: In this case, there might be a deficient design or a failure in determining the requirements at the time of acquiring the asset.
All in all, we can expect between 1 and 30 failure modes for every functional failure, so we need to choose the level of detail of our analysis carefully.
We need to tackle only the probable failure modes. This includes the ones in the failure history, the failure modes that are being managed by the current maintenance plans and any probable failures that haven’t happened yet, but are likely to happen in the future.
A failure mode is, of course, never an isolated event. Failure effects describe what happens in our asset and in the operating context when a failure mode occurs. They provide the information that we will later use to analyse the consequences of the failures.
We need to include any relevant information related to:
In the failure effects, we state the facts associated with our failure modes. But when we analyse the failure consequences, we perform a qualitative evaluation to determine the importance of that failure mode for our operations. This analysis will reveal how much this failure matters to us.
The way we proceed will depend on the type of function we are analysing, since functions can be evident or hidden.
In the case of evident functions, the asset’s functional failure can be detected by the operator and its consequences can have different degrees of importance.
We put people and environmental safety first. Then we look at the operational issues with higher costs related to productivity and/or quality. Lastly, we examine the non-operational consequences with the cost that is related only to the asset repair.
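As a rough illustration of this order of priority, the sketch below sorts a few failure modes of an evident function by consequence category. The category names follow the paragraph above; the `categorise` helper, its two yes/no questions and the example failure modes are hypothetical simplifications of what is, in practice, a qualitative judgement.

```python
from enum import IntEnum


class Consequence(IntEnum):
    # Lower value means the category is reviewed first.
    SAFETY_ENVIRONMENTAL = 1
    OPERATIONAL = 2
    NON_OPERATIONAL = 3


def categorise(affects_safety_or_environment: bool,
               affects_production_or_quality: bool) -> Consequence:
    """Assign an evident failure mode to a consequence category."""
    if affects_safety_or_environment:
        return Consequence.SAFETY_ENVIRONMENTAL
    if affects_production_or_quality:
        return Consequence.OPERATIONAL
    return Consequence.NON_OPERATIONAL


# (failure mode, affects safety/environment?, affects production/quality?)
# All three entries are invented for the packaging machine example.
failure_modes = [
    ("cabinet light burnt out", False, False),
    ("film feed roller worn", False, True),
    ("sealing bar cover detached", True, True),
]

ranked = sorted(failure_modes, key=lambda fm: categorise(fm[1], fm[2]))
for name, safety, production in ranked:
    print(f"{categorise(safety, production).name}: {name}")
```

Note that a failure mode affecting both safety and production is placed in the safety category, since safety is assessed first.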
Conversely, hidden functional failures cannot be detected by the operators. This category is closely related to protective and safety devices. Due to the increase in complexity, assets usually include protective devices in order to minimise the consequences of the different failures.
However, their inclusion adds complexity to the RCM analysis and the maintenance strategy in general.
When we can detect that the protective device is in a failed state, we consider it a fail-safe device and the analysis becomes less critical. In the opposite case, we trust that the device will act when something goes wrong, but it might have been in a failed state for a long time, and we won’t notice until we need it.
This leads to a complex analysis that involves using the probability of failure of the protective and protected devices and other information to determine how we can make the system safer.
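One common simplification of this analysis can be sketched as follows. It assumes that the two failures are independent, and that the probability of a multiple failure in a given period is the probability that the protected function fails in that period multiplied by the fraction of time the protective device is unknowingly in a failed state (its unavailability). The function names and the figures are illustrative assumptions; a real study would need site-specific data and possibly a more detailed model.

```python
def _check_probability(name: str, value: float) -> None:
    """Reject values that cannot be a probability or a fraction of time."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be between 0 and 1, got {value}")


def multiple_failure_probability(p_protected_failure: float,
                                 protective_unavailability: float) -> float:
    """Probability that the protected function fails while the protective
    device is itself in a failed state (independence assumed)."""
    _check_probability("p_protected_failure", p_protected_failure)
    _check_probability("protective_unavailability", protective_unavailability)
    return p_protected_failure * protective_unavailability


def max_tolerable_unavailability(p_protected_failure: float,
                                 tolerable_multiple_failure: float) -> float:
    """Largest unavailability of the protective device that keeps the
    multiple failure probability at or below the tolerable level."""
    _check_probability("p_protected_failure", p_protected_failure)
    _check_probability("tolerable_multiple_failure", tolerable_multiple_failure)
    if p_protected_failure == 0.0:
        return 1.0  # the protected function never fails, so any unavailability is tolerable
    return min(1.0, tolerable_multiple_failure / p_protected_failure)


# Invented figures: the protected function fails in a given year with
# probability 0.25, and the protective device is failed 2% of the time.
p_multiple = multiple_failure_probability(0.25, 0.02)
limit = max_tolerable_unavailability(0.25, 0.001)
print(f"multiple failure probability per year: {p_multiple:.4f}")
print(f"maximum tolerable unavailability: {limit:.4f}")
```

Read the other way round, the second function shows how a tolerable level of risk translates into a target for how often the hidden failure must be looked for.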
The process of analysing all these concepts in detail, in order to use them to design our maintenance strategy, is called Failure Mode and Effect Analysis (FMEA). It is frequently mentioned but not always fully understood.
As mentioned before, it is important to choose the right level of analysis. An FMEA can include between 3,000 and 10,000 failure modes with their corresponding effects and consequences.
This is an enormous amount of information and work, so it is important to select the right level of detail, as well as the assets that are important enough to justify the effort.
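To make the structure of this information more concrete, here is a minimal sketch of how the records could be organised: each function has functional failures, and each functional failure has failure modes with their effects. The class and field names are assumptions for illustration, not a standard FMEA schema, and the entries are invented for the packaging machine example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FailureMode:
    cause: str
    effects: str
    probable: bool = True  # only probable failure modes are analysed further


@dataclass
class FunctionalFailure:
    description: str
    failure_modes: List[FailureMode] = field(default_factory=list)


@dataclass
class AssetFunction:
    statement: str
    functional_failures: List[FunctionalFailure] = field(default_factory=list)


def count_failure_modes(functions: List[AssetFunction], probable_only: bool = False) -> int:
    """Count failure modes to gauge the size of the analysis."""
    return sum(
        1
        for fn in functions
        for ff in fn.functional_failures
        for fm in ff.failure_modes
        if fm.probable or not probable_only
    )


wrap = AssetFunction(
    "Wrap products at 120 packages per minute",
    [
        FunctionalFailure("Does not wrap at all", [
            FailureMode("drive motor burnt out", "line stops until motor is replaced"),
            FailureMode("machine struck by meteorite", "machine destroyed", probable=False),
        ]),
        FunctionalFailure("Wraps fewer than 120 packages per minute", [
            FailureMode("film feed roller worn", "film slips and speed must be reduced"),
        ]),
    ],
)

print(count_failure_modes([wrap]))
print(count_failure_modes([wrap], probable_only=True))
```

Counting records this way, per asset and per level of detail, gives an early idea of how much effort a given scope of analysis would demand.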
In RCM, we focus on avoiding the failure consequences, not the failure modes. That is why we may decide to leave a failure mode ‘unattended’ and repair it when it breaks down, while in other cases we conclude that the only option is to redesign our system to prevent serious consequences.
The process is not easy and requires much more than simply knowing what the concepts mean. However, being aware of these concepts helps us understand the difference between the RCM approach and the classical approach. With this awareness, we can start viewing our maintenance strategy from a different perspective.