What Does the Engineer Do?
The first of these analyses primarily concerns the developer of the overall system.
- Hazard and Risk Analysis: First, the safety level of the safety-critical system, subsystem or function must be determined. It is derived from the possible damage (severity) and the probability of its occurrence, in most cases based on some kind of flow diagram. Especially in aviation there are predefined safety levels for different systems. Here is a short glossary of the most common abbreviations for safety levels:
| Abbreviation | Range | Highest Risk Level | Industry |
|---|---|---|---|
| DAL: Design Assurance Level | E..A | A | Aviation: ARP4761, ARP4754A, DO-178C, DO-254... |
| SIL: Safety Integrity Level | 1..4 | 4 | Industry, IEC 61508; railway, EN 50128/9 |
| ASIL: Automotive Safety Integrity Level | A..D | D | Automotive, ISO 26262 |
| PL: Performance Level | a..e | e | Machinery, ISO 13849 |
| Class: Software Safety Class | A..C | C | Medical, IEC 62304 |
The other analyses then take place on each hierarchical level: system, subsystem, component, function, as both hardware and software safety analyses, sometimes with different characteristics. So these affect every developer. There are different variants; the most important are:
- Fault Tree Analysis (FTA): The FTA proceeds deductively, i.e. from the failure to the cause. The question is: What are the faults in my system that can lead to a certain failure? E.g. which components must fail so that a safety-relevant function is compromised?
This makes this method suitable for design, especially for the top-down system design. The FTA exists in two variants, a purely qualitative and a quantitative one, for which probabilities of occurrence are assigned to the fault events.
- Failure Modes and Effects Analysis (FMEA): In contrast, the FMEA proceeds inductively, from the cause to the failure. For each subsystem/component the question asked here is: What kind of safety-relevant failures can arise from a fault? E.g. if this component changes its value over time (i.e. it ages), how does this affect the function? If a state machine swallows a bit, how does this affect the function? The FMEA also exists in two variants, a purely qualitative and a quantitative one. For the latter, the analysis is based on fault probabilities for the different fault mechanisms (short circuit, open, drift, stuck-at...).
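The quantitative FTA variant mentioned above assigns occurrence probabilities to the basic fault events and combines them through the gates of the tree. A minimal sketch, assuming independent events; all event names and probabilities below are invented for illustration:

```python
# Minimal quantitative fault tree sketch: basic events combine through
# AND gates (all inputs must fail) and OR gates (any input suffices).
# All probabilities and event names are illustrative, not from a real system.

def and_gate(*probs):
    """All input events must occur; assumes independent events."""
    p = 1.0
    for q in probs:
        p *= q
    return p

def or_gate(*probs):
    """At least one input event occurs; assumes independent events."""
    p_none = 1.0
    for q in probs:
        p_none *= (1.0 - q)
    return 1.0 - p_none

# Hypothetical basic events (probability of failure per hour):
p_sensor = 1e-5
p_adc = 2e-6
p_supply = 1e-7

# Redundant channels: the top event needs both channels to fail,
# or a fault in the common power supply.
p_channel = or_gate(p_sensor, p_adc)      # one channel fails
p_both = and_gate(p_channel, p_channel)   # both redundant channels fail
p_top = or_gate(p_both, p_supply)         # top event
print(f"top event probability: {p_top:.3e}/h")
```

The sketch shows why redundancy helps in such a tree: the AND gate turns two moderate channel failure rates into a negligible combined rate, so the single-point power supply fault dominates the top event.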
In the industrial and automotive areas, an FMEDA (Failure Modes, Effects and Diagnostic Analysis) is usually performed for the electronics, in which a reduction of the failure rates due to diagnostic mechanisms (e.g. read-back of output signals) is taken into account.
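The failure-rate reduction credited in an FMEDA can be illustrated with a minimal calculation; the failure rate and the diagnostic coverage value below are invented, not taken from any real component data:

```python
# FMEDA-style bookkeeping sketch: diagnostic coverage (DC) reduces the
# dangerous failure rate that remains undetected. Numbers are illustrative.

def residual_dangerous_rate(lambda_dangerous, diagnostic_coverage):
    """Failure rate (per hour) of dangerous faults the diagnostics miss."""
    return lambda_dangerous * (1.0 - diagnostic_coverage)

lambda_d = 5e-7   # hypothetical dangerous failure rate of a component, per hour
dc = 0.9          # e.g. read-back of output signals catches 90 % of these faults
lambda_du = residual_dangerous_rate(lambda_d, dc)
print(f"undetected dangerous failure rate: {lambda_du:.1e}/h")
```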
Based on the safety analyses, safety measures have to be implemented to detect or prevent the following types of failures:
- random hardware failures
- systematic software failures
- systematic hardware failures
These measures may comprise: plausibility checks, redundancy (i.e. several systems checking each other), diverse redundancy (redundancy based on components that are designed and built in completely different ways), program flow monitoring, error correction for memories and many more.
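Two of the listed measures can be sketched in a few lines: a plausibility check on a sensor value and a simple program flow monitor that verifies checkpoints are passed in the expected order. All names, limits and the checkpoint sequence are invented for illustration:

```python
# Toy sketches of two safety measures: a range-based plausibility check
# and program flow monitoring via an expected checkpoint sequence.
# The sequence and sensor limits are illustrative assumptions.

EXPECTED_FLOW = ["read_input", "compute", "check", "write_output"]

class FlowMonitor:
    """Records the checkpoints a program run passes, in order."""
    def __init__(self):
        self.trace = []

    def checkpoint(self, name):
        self.trace.append(name)

    def flow_ok(self):
        """True only if the run hit exactly the expected checkpoints in order."""
        return self.trace == EXPECTED_FLOW

def plausible_temperature(celsius):
    """Reject values outside the physically sensible range of the sensor."""
    return -40.0 <= celsius <= 125.0

monitor = FlowMonitor()
for step in EXPECTED_FLOW:   # simulate a correct program run
    monitor.checkpoint(step)

print(monitor.flow_ok(), plausible_temperature(200.0))
```

In a real system the flow monitor would typically be serviced by an external watchdog and the plausibility limits would come from the sensor specification; the sketch only shows the principle.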
Errors in the requirements are the most prevalent cause of failure. This is why requirements are given so much weight in functional safety. Several aspects have to be considered:
- V-Model: The requirements must be managed according to the V-model in all industries. This means:
- There are successively more detailed requirements on each level (e.g. system, software, software unit). The extent of the requirements for each element (system, software, unit) should be such that a human can still grasp them; the details are moved to the next lower level.
- In principle, all requirements are tested on each level.
- Requirements Traceability: Requirements and tests must be traceable, among other reasons to make sure the overall product remains maintainable:
- Vertically: it must be clear which requirements on one hierarchical level are covering the more abstract requirements on the next higher level.
- Horizontally: it must be clear which requirements are tested by which tests.
- Bi-directional: it must be possible, starting from one level, to follow the relationships to all other levels.
- Traceability Coverage Analysis: Evidence must be provided that all requirements on each level exist as more detailed requirements down to the implementation and that all requirements are tested.
- "Derived" Requirements: If new requirements originate from the architecture or design, e.g. from the definition of interfaces between different subsystems, "derived" requirements are generated. In other words, "derived" requirements are those that cannot be traced to a higher level. Such requirements must undergo a separate analysis: it must be established that they do not jeopardize the function or the safety of the superordinate element.
- No Unintended Functionality: Another important aspect of the handling of (especially "derived") requirements and traceability is the prevention of unintended functionality inserted into the implementation, e.g. by the programmer or by unneeded "derived" requirements. These usually come from requirements that leave room for interpretation, i.e. are not precise enough, or from good intentions like defensive programming. Both can lead to unintended (mal-)functions.
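The traceability coverage analysis described above can be sketched as a toy check over a requirements database: every software requirement either traces to a system requirement or is flagged as "derived", and every software requirement must be covered by at least one test. All IDs and the data layout are invented:

```python
# Toy traceability coverage check. "Derived" requirements (no upward trace)
# and untested requirements are reported. All IDs are invented.

system_reqs = {"SYS-1", "SYS-2"}
sw_reqs = {
    "SW-1": {"traces_to": "SYS-1"},
    "SW-2": {"traces_to": "SYS-2"},
    "SW-3": {"traces_to": None},   # derived requirement: needs its own analysis
}
tests = {
    "TC-1": "SW-1",   # test case -> software requirement it verifies
    "TC-2": "SW-2",
}

# Coverage analysis: upward traces and test coverage.
derived = {r for r, v in sw_reqs.items() if v["traces_to"] not in system_reqs}
untested = set(sw_reqs) - set(tests.values())
print("derived:", sorted(derived), "untested:", sorted(untested))
```

Real requirements management tools perform exactly this kind of bookkeeping across all levels; the point of the sketch is that coverage is a mechanical set comparison once the trace links exist.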
With regard to the V-model, an important misunderstanding must be dispelled here: The V-model should not primarily be seen as a Gantt chart, but as a data management model. It maps the "divide and conquer" principle and the relationships between the artifacts. In practice this means that one cannot get by without iterations between the levels. Of course, those should be minimized as much as possible for the sake of efficiency. This results in a natural sequence, because one cannot specify and design anything at the lower levels of detail until everything is stable and approved at the upper level. Just as one cannot finish testing at the upper levels of integration if the tests at the lower levels have not been completed.
Verification is often equated with testing. For safety-critical systems this is not true: tests are just a small part of verification. Most of verification consists of reviews.
- Reviews: Before their release, all artifacts must be verified by a review, often even by a reviewer with precisely defined independence from the project team or even from the organization. For some artifacts several reviews take place, e.g. if a quality assurance review or a review against the standard is required.
- Checklists: Usually a checklist exists for each artifact. Without evidence of the performed reviews, the reviews are considered not done, so the filled-in checklists must be filed as review results.
- Tests: There are test specifications, test instructions, possibly test code, and here too evidence of all test results, i.e. all results must be documented. The tests must be requirements-based; among other things, there may be no test without a corresponding requirement.
- Code Coverage Analysis: For the software, the tests must be shown to cover all of the code, including that all branches are taken. Note that it says coverage analysis: coverage is not a test in itself, but rather an analysis showing that the tests satisfy certain minimal quality criteria. Coverage can be demonstrated using tools for dynamic code analysis.
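The analysis character of coverage becomes clear when reduced to its core: compare the branches the tests actually exercised against all branches in the code. Branch identifiers and the numbers below are invented; real tools (e.g. instrumenting compilers) collect the executed set automatically:

```python
# Coverage as an analysis, not a test: the executed branches are compared
# against the full set of branches in the code. Branch IDs are invented.

all_branches = {"f:if-true", "f:if-false", "g:loop-taken", "g:loop-skipped"}
executed = {"f:if-true", "f:if-false", "g:loop-taken"}

missed = all_branches - executed
branch_coverage = len(executed & all_branches) / len(all_branches)
print(f"{branch_coverage:.0%} branch coverage, missed: {sorted(missed)}")
```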
As a consequence of the required code coverage and of requirements-based testing, it is not allowed (explicitly so in avionics with DO-178C) to write tests merely to achieve code coverage when no requirement exists for them. So let's just generate a requirement? ...which, as a "derived" requirement, then needs a safety analysis. There must be no unintended functionality. This is why it is worthwhile to implement only what is really required.
To ensure homogeneous quality throughout the project, standards and rules are required for many artifacts. These can be developed internally, but using well-known standards, e.g. MISRA for C/C++ code, makes it easier to deal with external auditors.
- Requirement Standards: These describe how requirements must be formulated, down to the formatting.
- Design Standards: Clear guidelines for the design; they must cover all demands of the applicable standards, like no hidden data flow, hierarchical design...
- Coding Standards: For the software, only a safe, deterministic and well-readable subset of the programming language shall be used. Compliance with coding standards like MISRA can for the most part be checked automatically using tools for static code analysis.
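How such automated rule checking works can be shown with a deliberately naive example: scan source lines for constructs the house rules forbid. The rule set here is invented for illustration and is not actual MISRA text; real static analyzers parse the code properly instead of matching strings:

```python
# Toy illustration of automated coding-standard checking: flag lines of
# C source that use forbidden constructs. The rule set is an invented
# example; real MISRA checkers work on the parsed syntax tree.

BANNED = ("goto", "malloc(")  # illustrative house rules, not MISRA wording

def rule_violations(c_source):
    """Return (line number, banned token) pairs found in the source text."""
    hits = []
    for lineno, line in enumerate(c_source.splitlines(), start=1):
        for token in BANNED:
            if token in line:
                hits.append((lineno, token))
    return hits

sample = "int f(void) {\n    goto done;\ndone:\n    return 0;\n}\n"
print(rule_violations(sample))
```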
For the electronics, only high-quality components should be selected. When selecting them, long-term availability should be considered, so that the safety evidence does not have to be provided again and again after component changes. In addition, it is key to have good data for the calculation of the failure rates.
Apart from the AEC-Q qualified parts for automotive, there exist almost no "high reliability" parts anymore. The "standards" with numbers for failure rates (Siemens SN 29500, MIL-HDBK-217F...) have also fallen victim to the ravages of time and to technological advances. Still, these standards are used for quantitative analyses, as in most cases it is only about comparing different technical solutions against a target value for the overall system, not about a realistic statement of the probability of failure.
No modern electronics or software development happens without software tools. Software? Is the software of all the tools in the project free of errors? What happens if an error in a tool leads to an error in an artifact?
- Tool Classification: In a functional safety project this means that all tools have to be classified. It must be shown whether, and if so which, errors a tool can introduce into an artifact.
- Tool Qualification: Depending on the result of the above analysis, the tools must be qualified, i.e. it must be demonstrated that the tool, as it is used, does not generate such errors, or that the errors can be caught.
Psychology, Too: Giving and Accepting Feedback
Functional safety is pure logic, after all, a clear-cut thing. Or so you think at the beginning... But this is quite wrong. Psychological aspects play a significant role: for achieving the goals, for efficiency and, above all, for one's own satisfaction.
No engineer gets around feedback, at the latest during the review of his results. Accepting positive feedback is usually not a problem, but when something is wrong, emotions sometimes run high. Here one's own attitude towards errors is the issue: Can I accept my own errors and learn from them? Am I ready to look closely at others' errors and point them out? Am I ready to carry out such conflicts in a constructive manner? Only in a "conventional" project is "come on, it works" a reason not to correct bad code, and maybe not even there.
And because the goal should be to pass the reviews without findings, the lone warrior approach no longer works. If I do not coordinate my solution with others, if I do not work it out together with them and find a consensus, then I will go through so many rounds of reviews that I get dizzy.
In the end, I am only satisfied when I see each error, each criticism, not as an attack on me as a human being, but as an invitation to get even better, to develop myself.