Reliability & Risk Analysis
Context
An organisation chooses to operate in a business area because it sees an opportunity to make a profit. This choice, like all choices, has consequences, particularly if the undertaking involves hazards.
A Hazard is a potentially harmful situation.
Risk is a measure of:
- the likelihood of an event that makes a hazard active
- the harm that would follow if the event occurred.
Safety means freedom from unacceptable risk.
So assets must be operated safely in a way that generates the maximum amount of cash; they must also be maintained so that safety and cash-generating potential do not deteriorate. (Society demands safe operation, and legislation and regulation oblige companies to comply.)
To manage safety in the business an organisation must initiate a process that will identify the hazards and describe what things can happen to harm people, things or the environment. It must determine the likelihood that these things will happen and plan what must be done to reduce risk where required.
Risk vs Benefit
We undertake ventures containing hazards because of the benefits they bring to the organisation and to society; however, the benefits must outweigh the risks.
Where a risk is identified, a decision must be made about what, if anything, is required to reduce the risk or to prevent it rising, whichever is appropriate. Generally, risk needs to be reduced if it is above a tolerable level or if it is reasonable to reduce it to a level that is broadly acceptable.
In identifying tolerable risk, specific events are considered, such as one that has the potential to result in fatality. By considering how often such an event might occur and, if it did, what would be the probability of fatality, a tolerable frequency for that event can be established.
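The reasoning above can be sketched as a short calculation. This is a minimal illustration, not a prescribed method: the function name and the figures used (a tolerable fatality frequency of 1e-4 per year and a 10% chance of fatality given the event) are assumptions chosen only to show the arithmetic.

```python
def tolerable_event_frequency(tolerable_fatality_frequency, p_fatality_given_event):
    """Return the tolerable frequency (events per year) of a hazardous event.

    tolerable_fatality_frequency: tolerable fatalities per year (assumed target).
    p_fatality_given_event: probability that the event, if it occurs, is fatal.
    """
    if tolerable_fatality_frequency < 0:
        raise ValueError("tolerable fatality frequency must not be negative")
    if not 0 < p_fatality_given_event <= 1:
        raise ValueError("probability of fatality must be in (0, 1]")
    # Fatality frequency = event frequency x P(fatality | event),
    # so the tolerable event frequency is the target divided by that probability.
    return tolerable_fatality_frequency / p_fatality_given_event


# Illustrative figures only; real targets come from the organisation's risk criteria.
target = 1e-4   # tolerable fatalities per year (assumption)
p_fatal = 0.1   # probability of fatality if the event occurs (assumption)
freq = tolerable_event_frequency(target, p_fatal)
print(f"Tolerable event frequency: {freq:.1e} per year")
```

The less likely the event is to be fatal, the more often it can be tolerated; an event that is always fatal has a tolerable frequency equal to the fatality target itself.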

In the unacceptable region, risk cannot be justified and steps must be taken to reduce the risk so that it falls into the tolerable region.
In the tolerable region, risk can be tolerated provided that it is as low as is reasonably practicable. In this region, control measures must be taken to lower the risk towards the broadly acceptable region. Any residual risk that remains after this has been carried out is only tolerable if further risk reduction is impractical or the cost or the effort is grossly disproportionate to the reduction in risk achieved.
In the broadly acceptable region, risk is considered to be insignificant and adequately controlled. However, if further risk reduction can be simply achieved, it should be done.
(Note that in the diagram above, there are no solid boundaries between the Unacceptable, Tolerable and Broadly Acceptable regions.)
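The three regions can be sketched as a simple classification. The threshold values below are hypothetical defaults chosen for illustration, and treating them as sharp cut-offs is a simplification, since the regions have no solid boundaries.

```python
def classify_risk(frequency_per_year,
                  unacceptable_above=1e-3,
                  broadly_acceptable_below=1e-6):
    """Place an estimated risk (frequency of harm per year) in one of three regions.

    Both thresholds are illustrative assumptions, not values taken from the text.
    """
    if broadly_acceptable_below >= unacceptable_above:
        raise ValueError("thresholds must leave room for a tolerable region")
    if frequency_per_year > unacceptable_above:
        return "unacceptable"        # risk must be reduced
    if frequency_per_year < broadly_acceptable_below:
        return "broadly acceptable"  # reduce further only if simple to do
    return "tolerable if ALARP"      # reduce unless grossly disproportionate


for estimate in (1e-2, 1e-4, 1e-7):
    print(f"{estimate:.0e} per year -> {classify_risk(estimate)}")
```

In practice the result is a prompt for judgement rather than a verdict: a risk near either threshold still needs the cost-versus-benefit argument described above.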
Analysis
When considering ways to reduce risk, the problem can be approached from two directions:
- What things could happen to cause this event?
  E.g. What events could lead to a train derailment?
- If this thing happens, what will it lead to?
  E.g. What happens if this oil seal fails?
The two approaches are used in different ways.
The first is an analysis of the conditions that can combine to result in some undesired event, known as the 'top event'. It starts with the top event and then breaks down the sequence until a set of 'basic events' (usually some form of failure) is found. The analysis identifies only the basic events that lead to the top event. This analysis is particularly useful when determining the level of risk posed by a system. The probability of the top event can be calculated from the probability of each of the basic events. This can be done using Fault Tree Analysis. The technique may also be used to evaluate the integrity of a safety system that serves as an additional layer of protection to improve the overall integrity of a system.
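As a rough sketch of how a top event probability follows from basic event probabilities, the example below combines events through AND and OR gates. It assumes the basic events are statistically independent, and the tree itself (a derailment caused by an undetected track defect or by overspeed) and all its probabilities are hypothetical.

```python
from math import prod


def and_gate(probabilities):
    """Output occurs only if all inputs occur (independent events assumed)."""
    return prod(probabilities)


def or_gate(probabilities):
    """Output occurs if at least one input occurs (independent events assumed)."""
    return 1 - prod(1 - p for p in probabilities)


# Hypothetical basic event probabilities over some fixed period.
p_track_defect = 0.01      # a track defect develops
p_inspection_miss = 0.1    # inspection fails to find the defect
p_overspeed = 0.001        # train exceeds the speed limit

# An undetected defect needs both the defect AND the missed inspection.
p_undetected_defect = and_gate([p_track_defect, p_inspection_miss])

# Top event: derailment from an undetected defect OR from overspeed.
p_derailment = or_gate([p_undetected_defect, p_overspeed])
print(f"P(top event) = {p_derailment:.6f}")
```

The AND gate shows why an added safety layer improves integrity: each independent layer that must also fail multiplies the probability of that branch by a number less than one.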
The second is an analysis of what happens when some basic or initiating event occurs. A system is analysed and the effects of each event (again, usually some failure) are considered. This analysis allows us to determine the importance of each failure and whether something must be done to prevent or detect it. Reliability Centred Maintenance (RCM) analysis allows us to determine how things fail, what happens when they do fail, and what maintenance tasks are required to achieve the level of reliability required from assets.
Asset Information
Both Fault Tree Analysis and RCM analysis require data to enable calculation of reliability and risk. Determination of the likelihood of an event requires information about asset operating history. To predict the future requires knowledge of the past. The history consists of operating logs, maintenance records and records of faults & failures and allows prediction of which events can occur and how often. It also helps to predict the likely consequences of these events. The more history we have and the greater its accuracy, the more confidence there can be in the predictions.
Where history is not available, it may be possible to obtain external data based on operation of similar assets in other organisations.
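To illustrate how operating history feeds a likelihood estimate, the sketch below pools failure counts and operating hours to give a failure rate and a mean time between failures (MTBF). It assumes a constant failure rate, which is a simplification, and the record layout and figures are invented for the example.

```python
def failure_rate(n_failures, operating_hours):
    """Estimated failures per operating hour, assuming a constant failure rate."""
    if operating_hours <= 0:
        raise ValueError("operating hours must be positive")
    return n_failures / operating_hours


def mtbf(n_failures, operating_hours):
    """Mean time between failures in hours; undefined when no failures are recorded."""
    if n_failures == 0:
        raise ValueError("no failures recorded; MTBF cannot be estimated this way")
    return operating_hours / n_failures


# Hypothetical history for two similar assets.
history = [
    {"asset": "pump A", "operating_hours": 8000, "failures": 2},
    {"asset": "pump B", "operating_hours": 12000, "failures": 3},
]

total_hours = sum(record["operating_hours"] for record in history)
total_failures = sum(record["failures"] for record in history)

print(f"Failure rate: {failure_rate(total_failures, total_hours):.2e} per hour")
print(f"MTBF: {mtbf(total_failures, total_hours):.0f} hours")
```

With only a handful of recorded failures the estimate is weak, which is the practical meaning of the point that more history, and more accurate history, gives more confidence in the prediction.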
