What Does the Engineer Do?
The first of these analyses primarily concerns the developer of the overall system.
- Hazard and Risk Analysis: First, the safety level of the safety-critical system, subsystem or function must be determined. It is derived from the possible damage (severity) and the probability of its occurrence, in most cases based on some kind of flow diagram. Especially in aviation there are predefined safety levels for different systems. Here is a short glossary of the most common abbreviations for safety levels:
| Abbreviation | Range | Highest Risk Level | Industry |
|---|---|---|---|
| DAL: Design Assurance Level | E..A | A | Aviation: ARP4761, ARP4754A, DO-178C, DO-254... |
| SIL: Safety Integrity Level | 1..4 | 4 | Industry, IEC 61508; railway, EN 50128/9 |
| ASIL: Automotive Safety Integrity Level | A..D | D | Automotive, ISO 26262 |
| PL: Performance Level | a..e | e | Machinery, ISO 13849 |
| Class: Software Safety Class | A..C | C | Medical, IEC 62304 |
The other analyses then take place on each hierarchical level: system, subsystem, component, function, as both hardware and software safety analyses, sometimes with different characteristics. So these affect every developer. There are different variants; the most important are:
- Fault Tree Analysis (FTA): The FTA proceeds deductively, i.e. from the failure to the cause. The question is: What are the faults in my system that can lead to a certain failure? E.g. which components must fail so that a safety-relevant function is compromised?
This makes this method suitable for design, especially for the top-down system design. The FTA exists in two variants, a purely qualitative and a quantitative one, for which probabilities of occurrence are assigned to the fault events.
- Failure Modes and Effects Analysis (FMEA): In contrast, the FMEA proceeds inductively, from the cause to the failure. For each subsystem/component the question asked here is: What kind of safety-relevant failures can arise from a fault? E.g. if this component changes its value over time (i.e. it ages), how does this affect the function? If a state machine swallows a bit, how does this affect the function? The FMEA also exists in two variants, a purely qualitative and a quantitative one. For the latter, the analysis is based on fault probabilities for the different fault mechanisms (short circuit, open, drift, stuck-at...).
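The quantitative FTA variant mentioned above assigns occurrence probabilities to the basic fault events and combines them through the gates of the tree. A minimal sketch, assuming independent events; all event names and probabilities below are invented for illustration:

```python
# Minimal quantitative fault tree sketch: basic events combine through
# AND gates (all inputs must fail) and OR gates (any input suffices).
# All probabilities and event names are illustrative, not from a real system.

def and_gate(*probs):
    """All input events must occur; assumes independent events."""
    p = 1.0
    for q in probs:
        p *= q
    return p

def or_gate(*probs):
    """At least one input event occurs; assumes independent events."""
    p_none = 1.0
    for q in probs:
        p_none *= (1.0 - q)
    return 1.0 - p_none

# Hypothetical basic events (probability of failure per hour):
p_sensor = 1e-5
p_adc = 2e-6
p_supply = 1e-7

# Redundant channels: the top event needs both channels to fail,
# or a fault in the common power supply.
p_channel = or_gate(p_sensor, p_adc)      # one channel fails
p_both = and_gate(p_channel, p_channel)   # both redundant channels fail
p_top = or_gate(p_both, p_supply)         # top event
print(f"top event probability: {p_top:.3e}/h")
```

The sketch shows why redundancy helps in such a tree: the AND gate turns two moderate channel failure rates into a negligible combined rate, so the single-point power supply fault dominates the top event.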
In the industrial and automotive areas, an FMEDA (Failure Modes, Effects and Diagnostic Analysis) is usually performed for the electronics, in which a reduction of the failure rates due to diagnostic mechanisms (e.g. read-back of output signals) is taken into account.
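The failure-rate reduction credited in an FMEDA can be illustrated with a minimal calculation; the failure rate and the diagnostic coverage value below are invented, not taken from any real component data:

```python
# FMEDA-style bookkeeping sketch: diagnostic coverage (DC) reduces the
# dangerous failure rate that remains undetected. Numbers are illustrative.

def residual_dangerous_rate(lambda_dangerous, diagnostic_coverage):
    """Failure rate (per hour) of dangerous faults the diagnostics miss."""
    return lambda_dangerous * (1.0 - diagnostic_coverage)

lambda_d = 5e-7   # hypothetical dangerous failure rate of a component, per hour
dc = 0.9          # e.g. read-back of output signals catches 90 % of these faults
lambda_du = residual_dangerous_rate(lambda_d, dc)
print(f"undetected dangerous failure rate: {lambda_du:.1e}/h")
```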
Based on the safety analyses, safety measures have to be implemented to detect or prevent the following types of failures:
- random hardware failures
- systematic software failures
- systematic hardware failures
These measures may comprise: plausibility checks, redundancy (i.e. several systems checking each other), diverse redundancy (redundancy based on components that are designed and built in completely different ways), program flow monitoring, error correction for memories and many more.
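Two of the listed measures can be sketched in a few lines: a plausibility check on a sensor value and a simple program flow monitor that verifies checkpoints are passed in the expected order. All names, limits and the checkpoint sequence are invented for illustration:

```python
# Toy sketches of two safety measures: a range-based plausibility check
# and program flow monitoring via an expected checkpoint sequence.
# The sequence and sensor limits are illustrative assumptions.

EXPECTED_FLOW = ["read_input", "compute", "check", "write_output"]

class FlowMonitor:
    """Records the checkpoints a program run passes, in order."""
    def __init__(self):
        self.trace = []

    def checkpoint(self, name):
        self.trace.append(name)

    def flow_ok(self):
        """True only if the run hit exactly the expected checkpoints in order."""
        return self.trace == EXPECTED_FLOW

def plausible_temperature(celsius):
    """Reject values outside the physically sensible range of the sensor."""
    return -40.0 <= celsius <= 125.0

monitor = FlowMonitor()
for step in EXPECTED_FLOW:   # simulate a correct program run
    monitor.checkpoint(step)

print(monitor.flow_ok(), plausible_temperature(200.0))
```

In a real system the flow monitor would typically be serviced by an external watchdog and the plausibility limits would come from the sensor specification; the sketch only shows the principle.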
Errors in the requirements are the most prevalent cause of failure. This is why requirements are given so much weight in functional safety. Several aspects have to be considered:
- V-Model: The requirements must be managed according to the V-model in all industries. This means:
- There are successively more detailed requirements on each level (e.g. system, software, software unit). The extent of the requirements for each element (system, software, unit) should be such that a human can still grasp them; the details are moved to the next lower level.
- In principle, all requirements are tested on each level.
- Requirements Traceability: Requirements and tests must be traceable, among other reasons to make sure the overall product remains maintainable:
- Vertically: it must be clear which requirements on one hierarchical level are covering the more abstract requirements on the next higher level.
- Horizontally: it must be clear which requirements are tested by which tests.
- Bi-directional: it must be possible, starting from one level, to follow the relationships to all other levels.
- Traceability Coverage Analysis: Evidence must be provided that all requirements on each level exist as more detailed requirements down to the implementation and that all requirements are tested.
- "Derived" Requirements: If new requirements originate from the architecture or design, e.g. from the definition of interfaces between different subsystems, "derived" requirements are generated. In other words, "derived" requirements are those that cannot be traced to a higher level. Such requirements must undergo a separate analysis: it must be established that they do not jeopardize the function or the safety of the superordinate element.
- No Unintended Functionality: Another important aspect of the handling of (especially "derived") requirements and traceability is the prevention of unintended functionality inserted into the implementation, e.g. by the programmer or by unneeded "derived" requirements. These usually come from requirements that leave room for interpretation, i.e. are not precise enough, or from good intentions like defensive programming. Both can lead to unintended (mal-)functions.
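The traceability coverage analysis described above can be sketched as a toy check over a requirements database: every software requirement either traces to a system requirement or is flagged as "derived", and every software requirement must be covered by at least one test. All IDs and the data layout are invented:

```python
# Toy traceability coverage check. "Derived" requirements (no upward trace)
# and untested requirements are reported. All IDs are invented.

system_reqs = {"SYS-1", "SYS-2"}
sw_reqs = {
    "SW-1": {"traces_to": "SYS-1"},
    "SW-2": {"traces_to": "SYS-2"},
    "SW-3": {"traces_to": None},   # derived requirement: needs its own analysis
}
tests = {
    "TC-1": "SW-1",   # test case -> software requirement it verifies
    "TC-2": "SW-2",
}

# Coverage analysis: upward traces and test coverage.
derived = {r for r, v in sw_reqs.items() if v["traces_to"] not in system_reqs}
untested = set(sw_reqs) - set(tests.values())
print("derived:", sorted(derived), "untested:", sorted(untested))
```

Real requirements management tools perform exactly this kind of bookkeeping across all levels; the point of the sketch is that coverage is a mechanical set comparison once the trace links exist.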
With regard to the V-model, an important misunderstanding must be dispelled here: The V-model should not primarily be seen as a Gantt chart, but as a data management model. It maps the "divide and conquer" principle and the relationships between the artifacts. In practice this means that one cannot get by without iterations between the levels. Of course, those should be minimized as much as possible for the sake of efficiency. This results in a natural sequence, because one cannot specify and design anything at the lower levels of detail until everything is stable and approved at the upper level. Just as one cannot finish testing at the upper levels of integration if the tests at the lower levels have not been completed.
Verification is often equated with testing. For safety-critical systems this is not true: tests are just a small part of verification. Most of verification consists of reviews.
- Reviews: Before their release, all artifacts must be verified by a review, often even by a reviewer with precisely defined independence from the project team or even from the organization. For some artifacts several reviews take place, e.g. if a quality assurance review or a review against the standard is required.
- Checklists: Usually a checklist exists for each artifact. Without evidence of the performed reviews, the reviews are considered not done, so the filled-in checklists must be filed as review results.
- Tests: There are test specifications, test instructions, possibly test code, and here too evidence of all test results, i.e. all results must be documented. The tests must be requirements-based; among other things, there may be no test without a corresponding requirement.
- Code Coverage Analysis: For the software, the tests must be shown to cover all of the code, including that all branches are taken. Note that it says coverage analysis: coverage is not a test in itself, but rather an analysis showing that the tests satisfy certain minimal quality criteria. Coverage can be demonstrated using tools for dynamic code analysis.
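The analysis character of coverage becomes clear when reduced to its core: compare the branches the tests actually exercised against all branches in the code. Branch identifiers and the numbers below are invented; real tools (e.g. instrumenting compilers) collect the executed set automatically:

```python
# Coverage as an analysis, not a test: the executed branches are compared
# against the full set of branches in the code. Branch IDs are invented.

all_branches = {"f:if-true", "f:if-false", "g:loop-taken", "g:loop-skipped"}
executed = {"f:if-true", "f:if-false", "g:loop-taken"}

missed = all_branches - executed
branch_coverage = len(executed & all_branches) / len(all_branches)
print(f"{branch_coverage:.0%} branch coverage, missed: {sorted(missed)}")
```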
As a consequence of the required code coverage and of requirements-based testing, it is not allowed (explicitly so in avionics with DO-178C) to write tests merely to achieve code coverage when no requirement exists for them. So let's just generate a requirement? ...which, as a "derived" requirement, then needs a safety analysis. There must be no unintended functionality. This is why it is worthwhile to implement only what is really required.
To ensure homogeneous quality throughout the project, standards and rules are required for many artifacts. These can be developed internally, but using well-known standards, e.g. MISRA for C/C++ code, makes it easier to deal with external auditors.
- Requirement Standards: These describe how requirements must be formulated, down to the formatting.
- Design Standards: Clear guidelines for the design; they must cover all demands of the applicable standards, like no hidden data flow, hierarchical design...
- Coding Standards: For the software, only a safe, deterministic and well-readable subset of the programming language shall be used. Compliance with coding standards like MISRA can for the most part be checked automatically using tools for static code analysis.
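How such automated rule checking works can be shown with a deliberately naive example: scan source lines for constructs the house rules forbid. The rule set here is invented for illustration and is not actual MISRA text; real static analyzers parse the code properly instead of matching strings:

```python
# Toy illustration of automated coding-standard checking: flag lines of
# C source that use forbidden constructs. The rule set is an invented
# example; real MISRA checkers work on the parsed syntax tree.

BANNED = ("goto", "malloc(")  # illustrative house rules, not MISRA wording

def rule_violations(c_source):
    """Return (line number, banned token) pairs found in the source text."""
    hits = []
    for lineno, line in enumerate(c_source.splitlines(), start=1):
        for token in BANNED:
            if token in line:
                hits.append((lineno, token))
    return hits

sample = "int f(void) {\n    goto done;\ndone:\n    return 0;\n}\n"
print(rule_violations(sample))
```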
For the electronics, only high-quality components should be selected. When selecting them, long-term availability should be considered, so that the safety evidence does not have to be provided again and again after component changes. In addition, it is key to have good data for the calculation of the failure rates.
Apart from the AEC-Q qualified parts for automotive, there exist almost no "high reliability" parts anymore. The "standards" with numbers for failure rates (Siemens SN 29500, MIL-HDBK-217F...) have also fallen victim to the ravages of time and to technological advances. Still, these standards are used for quantitative analyses, as in most cases it is only about comparing different technical solutions against a target value for the overall system, not about a realistic statement of the probability of failure.
No modern electronics or software development happens without software tools. Software? Is the software of all the tools in the project free of errors? What happens if an error in a tool leads to an error in an artifact?
- Tool Classification: In a functional safety project this means that all tools have to be classified. It must be shown whether, and if so which, errors a tool can introduce into an artifact.
- Tool Qualification: Depending on the result of the above analysis, the tools must be qualified, i.e. it must be demonstrated that the tool, as it is used, does not generate such errors, or that the errors can be caught.
Psychology, Too: Giving and Accepting Feedback
Functional safety is pure logic, after all, a clear-cut thing. Or so you think at the beginning... But this is quite wrong. Psychological aspects play a significant role: for achieving the goals, for efficiency and, above all, for one's own satisfaction.
No engineer gets around feedback, at the latest during the review of his results. Accepting positive feedback is usually not a problem, but when something is wrong, emotions sometimes run high. Here one's own attitude towards errors is the issue: Can I accept my own errors and learn from them? Am I ready to look closely at others' errors and point them out? Am I ready to carry out such conflicts in a constructive manner? Only in a "conventional" project is "come on, it works" a reason not to correct bad code, and maybe not even there.
And because the goal should be to pass the reviews without findings, the lone warrior approach no longer works. If I do not coordinate my solution with others, if I do not work it out together with them and find a consensus, then I will go through so many rounds of reviews that I get dizzy.
In the end, I am only satisfied when I see each error, each criticism, not as an attack on me as a human being, but as an invitation to get even better, to develop myself.