Software Safety Analysis/ Software FMEA: How Is It to Be Done?
Do you have to create a Software Safety Analysis (SW FMEA) for your project and don't know how to go about it? Have you looked at the available literature and screened the entries on the web, yet you are none the wiser?
Have you searched the standards applicable to your project for instructions on how to perform a SW FMEA, but found only vague phrases? Do you aspire not just to check off such an analysis pro forma, but to actually increase the safety of your product with it?
The following description assumes that you are already familiar with the basic terms of a qualitative FMEA. If not, you will find corresponding information in "Effective FMEAs" by Carl S. Carlson or in IEC 60812.
Why a Software Safety Analysis at All?
Hardware components are becoming more and more reliable. Also, random hardware faults (permanent and transient) can be investigated well with established quantitative analysis methods (e.g. FMEDA according to ISO 26262).
However, systematic errors (software errors belong to this category) are more difficult to analyze and isolate. With increasing complexity and thus an increasing proportion of software, product safety is increasingly determined by systematic software errors. The importance of software safety analysis is therefore constantly growing.
How do I Best Approach such a Software FMEA?
The following is a step-by-step procedure that has proved its value at Solcept:
- Form a team and appoint a moderator.
- In contrast to other FMEAs, you do not need an interdisciplinary team for a SW FMEA, i.e. the team can consist of SW specialists only. However, it is advantageous if the team members have diverse project and experience backgrounds and are well acquainted with the system to be evaluated and the corresponding software design.
- Set up the documentation of a qualitative FMEA (e.g. according to IEC 60812).
- That is, a risk analysis evaluating the severity of a failure, its probability of occurrence and the probability of failure detection.
- This can be done in a spreadsheet or with a dedicated software tool.
- Compared to spreadsheets, such software tools have the main advantage of a simpler and integrated documentation of the results of the expert meetings and offer project management features for e.g. tracking of findings.
- Determine the safety objective, the evaluation catalogs and the risk matrix of the FMEA in the team (see below for clues).
- Also agree on the type of mediation/ decision making in case of disagreement within the team.
- In the team, reduce the analysis scope of the FMEA to a minimum.
- Discussions in the expert team should be limited to the main architectural topics of the software. All other sources of error must be eliminated through the architecture and coding guidelines.
- The differentiation between safety analysis and coding guidelines/ automatic static code analysis is important because it can have a significant impact on the overall project effort.
- Review the architecture and coding guidelines as a team.
- Depending on the maturity of these guidelines, the team has to list (e.g. in a brainstorming session) architectural elements (e.g. state machine, fixed-point arithmetic, ring buffers, arrays, control structures) and associated known sources of error and decide whether they should be dealt with in the guidelines or in the expert meeting.
- At this point, the rules for static code analysis (e.g. a subset of the MISRA rules) are also defined.
- As a team, create a list of prevention and detection measures.
- This list is usually given anyway by the software development process or the applicable safety standard (e.g. application of guidelines, code reviews, unit tests, robustness tests and other verification measures according to the V-model, review of test coverage etc.). The measures can then be handled in a simplified manner during the analysis (low Occurrence and Detection ratings).
- It is a common mistake to misuse the software safety analysis to specify measures that have to be performed during the normal development process anyway. The analysis then falsely appears as having identified many risks and having defined appropriate actions, but in reality it only reflects the normal development process.
- As a team, go through the individual software parts and the essential architectural elements they contain.
- On the basis of the possible errors that can occur in them (checklists!), assess the risks and define any necessary measures.
- In addition to the architecture, the analysis also includes an examination of the control and data flow. If no documentation of these flows is available, it must be prepared first.
- Have the development team adapt the software design.
- After the design adaptation, the team re-evaluates the actually implemented measures and, if necessary, demands improvements until the risks are considered sufficiently low.
The moderator documents the results of all the above steps, i.e. of the expert meetings, and ensures that all software parts, the architectural elements they contain and their error sources are covered.
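The worksheet rows that accumulate during these steps can be sketched, for illustration only, as a small data structure; the field names below are assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class FmeaEntry:
    """One row of a qualitative FMEA worksheet (field names are illustrative)."""
    software_part: str          # e.g. "communication stack"
    architectural_element: str  # e.g. "ring buffer"
    failure_mode: str           # e.g. "index wrap-around off-by-one"
    severity: int               # rating 1..10
    occurrence: int             # rating 1..10
    detection: int              # rating 1..10
    measures: list[str] = field(default_factory=list)

    @property
    def rpn(self) -> int:
        """Risk Priority Number: product of the three ratings."""
        return self.severity * self.occurrence * self.detection
```

A spreadsheet column per field achieves the same; a dedicated FMEA tool adds the meeting minutes and finding tracking on top.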
Which Evaluation Catalogs are to be Used?
The evaluation catalogs are based on the project needs. The following are some points of reference.
Important: If you define fewer than 10 levels for a factor, you should still spread them across the value range from 1 to 10 so that the influence on the risk figure is identical for all factors.
Severity of Failure ("Severity")
If you only want to assess safety, it is sufficient to distinguish between "no influence" (1) and "safety risk" (10). However, most often more is packed into the analysis, e.g. "loss of availability" or the loss of non-safety-relevant (primary or secondary) features of the product as intermediate levels.
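As a sketch, such a reduced severity catalog might be spread over the full range like this; the labels and the intermediate value are assumptions for illustration:

```python
# Hypothetical three-level severity catalog, spread across the full 1..10
# range so that Severity weighs into the risk figure as strongly as a
# ten-level factor would.
SEVERITY = {
    "no influence": 1,
    "loss of availability": 5,   # illustrative intermediate level
    "safety risk": 10,
}
```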
Probability of Occurrence ("Occurrence")
At the lower end of the scale ("almost never": 1 or "remote": 2) there are errors that are already eliminated by a preventive measure. Likewise, a low error probability is assumed if a piece of software can be shown to be "well trusted", i.e. if it can be proven that it has been running error-free for years in a comparable application. For new software parts, the error probability is rated higher with increasing complexity (simplicity pays off here as well).
The highest probability of error ("very high": 9, "almost certain": 10) exists if the requirements for a software part are (still) incomplete or missing completely. This does not mean that a requirements review is carried out during the software FMEA! Rather, points whose specifications are considered missing or unclear with regard to the safety objective are rated accordingly.
Probability of Detection ("Detection")
High detectability ("obvious": 1) exists if appropriate verification measures are in place, preferably already at unit test level, and then with decreasing probability at module test (with or without hardware) or system test level. Review or analysis measures result in lower detectability; the larger the piece of software that has to be examined in the review, the lower the detectability. The rating "almost impossible" (10) is given if the team has no idea how to detect a fault.
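A detection catalog along these lines might look as follows; the labels and numeric values are assumptions, not taken from any standard:

```python
# Hypothetical detection catalog: detection at unit-test level rates best,
# review of a large code scope and "no idea how to detect" rate worst.
DETECTION = {
    "verified by unit test": 1,            # "obvious"
    "verified by module test": 3,
    "verified by system test": 5,
    "review/analysis of a small scope": 7,
    "review/analysis of a large scope": 9,
    "no known way to detect": 10,          # "almost impossible"
}
```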
In principle, a simple Risk Priority Number (RPN) can be used, calculated as the product of the Severity (S), Occurrence (O) and Detection (D) ratings: RPN := S*O*D. For this purpose, an RPN threshold value is defined above which a measure must be taken. Depending on the project, however, more detailed decision matrices are also useful and can be determined by the team.
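As a minimal sketch, the threshold decision can be written down as follows; the threshold value of 100 is an assumption for illustration, each project sets its own:

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number RPN := S*O*D, each rating in 1..10."""
    for name, value in (("severity", severity),
                        ("occurrence", occurrence),
                        ("detection", detection)):
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be in 1..10, got {value}")
    return severity * occurrence * detection

RPN_THRESHOLD = 100  # project-specific assumption, not a normative value

def needs_measure(severity: int, occurrence: int, detection: int) -> bool:
    """True if the risk is above the agreed threshold and a measure is due."""
    return rpn(severity, occurrence, detection) > RPN_THRESHOLD
```

A decision matrix, as mentioned above, would replace the single threshold with a lookup over the individual ratings.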
What Must be included in the Analysis Report?
The following checklist can be used to ensure that the analysis report is complete:
- Purpose of the analysis
- Safety goal, possibly other goals investigated (e.g. availability, non-safety-related functional failures)
- Step by step description of the method
- Standards applied
- Team members with education and experience
- Scope covered by the analysis, level of detail (entire software/software part...)
- Reference to applicable software development processes and guidelines
- Referenced documents
- List of the examined architectural elements
- Used tools
- Evaluation catalogs and risk matrices used
- System description (coarse)
- Architecture description and identification of the analyzed software parts
- Results summary
- Results comments/ explanation, if necessary
- Appendix: list of implemented measures
- Appendix: reference to the minutes of the expert meetings
The described procedure has proven to be extremely effective at Solcept. The value for the project and the product lies mainly in the discussions that take place in the expert panel.
Our experience shows that the procedure improves not only safety. It also improves overall software quality, documentation and development processes, and it results in greater involvement of the engineers in the definition of processes and guidelines.
Moreover, it is astonishing (and very comparable to the effects of EMC measurements) what other changes in the project are triggered by the work of the SW FMEA team. You will catch yourself thinking about using this tool not only in functional safety projects.
We wish you much success with it!