Development for Functional Safety: How It Differs from "Conventional" Development

The customer thinks your product is fantastic, she would just like to get it with a functional safety level: DAL, SIL, ASIL, PL... just a few letters more in the requirements. What does this mean for your development? What do you have to do to fulfill the customer's wish? If you read the relevant standards and still nothing is clear, you are not alone...

Functional safety is not just another feature for the datasheet, like another sneaker color. Functional safety is more like a mountain marathon, one that makes demands on the whole development team.

In this detailed article I walk through the most important differences between such a functionally safe development and a purely functional one.

You can find a summary as a short list here.

 

What is This About?

This article covers only the areas where we at Solcept are experienced, i.e. the development of software and electronics. We ignore the rest of the life cycle from production to disposal, as well as the details of system development at e.g. vehicle or aircraft level.

What does "Conventional" Mean here?

As a starting point for this overview I assume a "conventional" embedded development in the industrial sector. Note that the quality levels without functional safety (e.g. QM (Quality Managed) in automotive or DAL-E (Design Assurance Level E) in aviation) already need a massively higher effort than the development we take as a base here. This is because these developments should already be carried out according to level 3 of Automotive SPICE or CMMI (Capability Maturity Model Integration). This means that quite a lot of requirements, unit tests, traceability etc. are already present.

Standards

Historically, the standards for safety critical systems in the embedded area started in 1982 in the United States with DO-178, targeted at airliners. This standard, today in version DO-178C, is still the mother of all such standards, and its concepts were adopted by other industries. IEC 61508 originated in 1998, primarily for the process industry, but also as an umbrella standard from which most of the other industry standards have been derived.

I don't want to delve into the different standards here; in many cases one cannot get by with just one, there is a whole collection of documents which have to be considered. However, the basic concepts are the same, so what has to be done for functional safety can be defined independently of the industry.

The Name of the Game

One last point in this introduction, probably the most important for everybody wanting to develop for functional safety: you may think what you like about the standards and the models required by functional safety, but the moment you say yes to a project with functional safety, you have to play the game of functional safety. That means executing the required work to the letter and generating the documents. This applies to the engineers who do the work, but especially to the leadership, which has to provide time and resources. Otherwise... you have to play another game, without functional safety.

What does This Mean?

A few basic terms are important, so here are short definitions:

  • Functional Safety: Safety which depends on the correct function of the system, thus not explosion protection, voltage protection, fire protection etc.
  • Systematic Errors: Errors which are built into the system during development (these errors can occur in the whole system)
  • Random Errors: Errors which occur randomly, i.e. failures (these errors can only occur in hardware, because software does not age)
  • Artifact: All work products (documents, code, data...)

Why are They Doing This?

There are a few basic principles of functional safety which help to understand why one should develop exactly this way and not in any other way.

  • Quality: The most important point is the focus on processes, on a structured course of action. The axiom here is "use quality to manage risk". One tries to mitigate the risks by keeping the development quality high, even though they can never be completely eliminated. This also means: the higher the risk (expressed as "safety levels"), the higher the quality hurdles.
  • Divide and conquer: The overall system (vehicle, aircraft, equipment...) is divided into smaller and smaller subsystems, so that a human, namely the responsible engineer, can grasp the complexity at his level.
  • Planning: If a project is planned cleanly, fewer errors occur, because no emergency exercises over the weekend are needed. Such quick fixes by the heroic developer are prevented by the change management and documentation requirements in all standards.
  • Evidence: What is not documented has not happened. This is on one hand a basic requirement of a quality management system (as the quality guru W. E. Deming said: "In God we trust, all others bring data"), on the other hand a requirement of liability law. Committing an error cannot be held against anybody; having left out a step, not having done a review or not having performed a test can.
  • 4 (to 6) Eyes Principle: Everything that is somehow important must be checked at least once through a review, sometimes even through several.
  • Overall View: In the end, functional safety has to be right for the complete safety critical system under consideration (vehicle, equipment, aircraft...). This means there is no functional safety that can be viewed for a subsystem or a piece of built-in equipment alone. So there is no complete "certification" of e.g. sensors, actuators or software. Most aspects can be "pre-certified", but it still has to be ensured that the overall safety is not compromised by e.g. interface problems or errors in usage.
  • Traceability: It shall be avoided that errors arise from assumptions which the engineer makes for his artifact. This means that requirements must be traceable from the highest (equipment) level down to the implementation, leaving no room for assumptions.

What does the Engineer Do?

Safety Analyses

The first analysis mainly concerns the developer of the overall system.

  • Hazard/Risk Analysis: First the safety level of the safety critical system, subsystem or function must be determined. It is derived from the possible damage (severity) and the probability of occurrence of such damage, in most cases based on some kind of flow diagram. Especially in aviation there are predefined safety levels for different systems. Here is a short glossary of the most common abbreviations for safety levels:

Level | Range | Highest Risk Level | Industry
DAL: Design Assurance Level | E..A | A | Aviation: ARP4761, ARP4754A, DO-178C, DO-254...
SIL: Safety Integrity Level | 1..4 | 4 | Industry, IEC 61508 and railway, EN 50128/9
ASIL: Automotive Safety Integrity Level | A..D | D | Automotive, ISO 26262
PL: Performance Level | a..e | e | Machinery, EN ISO 13849
Class: Software Safety Class | A..C | C | Medical, IEC 62304
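As an illustration of how such a classification works, here is a minimal sketch of the ASIL determination of ISO 26262, which combines severity (S1..S3), exposure (E1..E4) and controllability (C1..C3). The compact sum rule below is a well-known way to reproduce the standard's classification table; it is an illustrative sketch, not a replacement for the standard.

```python
def asil(severity: int, exposure: int, controllability: int) -> str:
    """Illustrative sketch of the ISO 26262 risk classification.

    Combines severity (S1..S3), exposure (E1..E4) and controllability
    (C1..C3); the sum rule reproduces the standard's table compactly:
    S3/E4/C3 -> D, and each step down in any class lowers the level.
    """
    assert 1 <= severity <= 3 and 1 <= exposure <= 4 and 1 <= controllability <= 3
    total = severity + exposure + controllability
    return "QM" if total < 7 else "ABCD"[total - 7]

print(asil(3, 4, 3))  # worst case: high severity, high exposure, hard to control -> D
print(asil(1, 1, 1))  # harmless case -> QM (no ASIL, "normal" quality management)
```

The real standard also defines which situations count as which S/E/C class; that judgment is where most of the engineering work lies.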

The other analyses then take place on each hierarchical level (system, subsystem, component, function), for software as well as for hardware, sometimes with different characteristics. So they affect every developer. There are different variants; the most important are:

  • Fault Tree Analysis (FTA): The FTA proceeds deductively, i.e. from the failure to the cause. The question is: What are the faults in my system that can lead to a certain failure? E.g. which components must fail so that a safety-relevant function is compromised?
    This makes the method suitable for design, especially for top-down system design. The FTA exists in two variants, a purely qualitative one and a quantitative one, for which probabilities of occurrence are assigned to the fault events.
  • Failure Modes and Effects Analysis (FMEA): In contrast, the FMEA proceeds inductively, from the cause to the failure. For each subsystem/component the question asked here is: What kind of safety-relevant failures can arise from a fault? E.g. if this component changes its value over time (i.e. it ages), how does this affect the function? If a state machine swallows a bit, how does this affect the function? The FMEA also exists in two variants, a purely qualitative and a quantitative one. For the latter, the analysis is based on fault probabilities for the different fault mechanisms (short circuit, open, drift, stuck-at...).
    In the industrial and automotive areas, usually an FMEDA (Failure Modes, Effects and Diagnostic Analysis) is performed for the electronics, in which a reduction of the failure rates is credited for diagnostic mechanisms (e.g. read-back of output signals).
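The quantitative core of such an FMEDA can be sketched in a few lines: each failure mode carries a raw failure rate, and a diagnostic mechanism reduces the rate of dangerous undetected failures by its diagnostic coverage. All component names and numbers below are invented for illustration, and the single-point fault metric is shown in simplified form:

```python
# Minimal FMEDA sketch. Failure rates are in FIT (failures per 1e9 hours);
# diagnostic coverage (DC) is the fraction of that failure mode caught by
# a diagnostic mechanism. All names and numbers are invented.
failure_modes = [
    # (name, dangerous?, lambda_fit, diagnostic_coverage)
    ("output driver short", True,  50.0, 0.99),  # caught by output read-back
    ("ADC drift",           True,  20.0, 0.90),  # caught by plausibility check
    ("LED fails off",       False, 30.0, 0.00),  # safe failure, no diagnosis
]

def fmeda(modes):
    lam_total = sum(lam for _, _, lam, _ in modes)
    # Diagnosis turns part of the dangerous rate into "dangerous detected";
    # what remains is the dangerous undetected rate.
    lam_du = sum(lam * (1.0 - dc) for _, dangerous, lam, dc in modes if dangerous)
    spfm = 1.0 - lam_du / lam_total  # simplified single-point fault metric
    return lam_du, spfm

lam_du, spfm = fmeda(failure_modes)
print(f"dangerous undetected: {lam_du:.1f} FIT, SPFM: {spfm:.1%}")
```

A real FMEDA additionally distinguishes residual, latent and safe faults per the applicable standard; the point here is only the mechanism of crediting diagnostic coverage.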

Safety Mechanisms

Based on the safety analyses, safety measures have to be implemented to detect and prevent the following types of faults:

  • random hardware failures
  • systematic software failures
  • systematic hardware failures

These measures may comprise: plausibility checks, redundancy (i.e. several systems which check each other), diverse redundancy (redundancy based on components that are built and developed completely differently), program flow monitoring, error correction for memories and many more.
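As a taste of what such a mechanism looks like, here is a minimal sketch of a plausibility check over two redundant sensor channels: if either channel leaves the physically possible range, or the channels disagree beyond a tolerance, the output falls back to the safe state. The limits and the safe state are invented for illustration and would be application specific:

```python
SAFE_STATE = None  # application specific, e.g. "demand zero output" (invented)

def checked_value(ch_a: float, ch_b: float,
                  lo: float = 0.0, hi: float = 100.0,
                  tolerance: float = 2.0):
    """Plausibility check over two redundant channels (invented limits).

    Returns the averaged reading, or SAFE_STATE on any implausibility.
    """
    for v in (ch_a, ch_b):
        if not (lo <= v <= hi):       # range check: is the value physically possible?
            return SAFE_STATE
    if abs(ch_a - ch_b) > tolerance:  # cross check: do the channels agree?
        return SAFE_STATE
    return (ch_a + ch_b) / 2.0

print(checked_value(40.0, 41.0))  # channels agree -> averaged value
print(checked_value(40.0, 90.0))  # channels disagree -> safe state
```

In a real system the reaction to an implausibility (degraded mode, shutdown, warning) is itself a safety requirement derived from the analyses above.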

Requirements

Errors in the requirements are the most prevalent cause of failures. This is why so much importance is attached to requirements in functional safety. Several aspects have to be considered:

  • V-Model: The requirements must be managed according to the V-model in all industries; this means:

    • There are successively more detailed requirements on each level (e.g. system, software, software unit). The extent of the requirements for each element (system, software, unit) should be such that a human can still grasp them; the details are moved to the next lower level.
    • Basically all requirements are being tested on each level.

  • Requirements Traceability: Requirements and tests must be traceable, amongst others to make sure the overall product remains maintainable:

    • Vertically: it must be clear which requirements on one hierarchical level are covering the more abstract requirements on the next higher level.
    • Horizontally: it must be clear which requirements are tested by which tests.
    • Bi-directional: it must be possible, starting from one level, to follow the relationships to all other levels.

  • Traceability Coverage Analysis: Evidence must be provided that all requirements on each level exist as more detailed requirements down to the implementation, and that all requirements are tested.
  • "Derived" Requirements: If new requirements originate from the architecture or design, e.g. from the definition of interfaces between different subsystems, "derived" requirements are generated. "Derived" are thus those requirements that cannot be traced to higher levels. Such requirements must undergo a separate analysis which establishes that they jeopardize neither the function of the superordinate element nor safety.
  • No Unintended Functionality: Another important aspect of the handling of (especially "derived") requirements and traceability is the prevention of unintended functionality inserted into the implementation by e.g. the programmer or by unneeded "derived" requirements. This usually comes from interpretable, i.e. not accurate enough, requirements or from good intentions like defensive programming. Both can lead to unintended (mal-)functions.
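Mechanically, a traceability coverage analysis boils down to set operations over the trace links. A toy sketch (all requirement and test IDs invented; real projects use requirements management tools for this):

```python
# Toy traceability model: each software requirement points to the system
# requirement it refines, or to None if it is "derived"; each test lists
# the requirements it verifies. All IDs are invented for illustration.
traces = {            # software level -> system level
    "SW-1": "SYS-1",
    "SW-2": "SYS-2",
    "SW-3": None,     # derived, e.g. from an interface definition
}
tests = {
    "TC-1": {"SW-1"},
    "TC-2": {"SW-2", "SW-3"},
}
system_reqs = {"SYS-1", "SYS-2", "SYS-3"}

# "Derived" requirements need a separate safety analysis.
derived = {req for req, parent in traces.items() if parent is None}
# System requirements not refined on the software level: a coverage gap.
uncovered_system = system_reqs - {p for p in traces.values() if p}
# Requirements without any test: another coverage gap.
untested = set(traces) - set().union(*tests.values())

print("derived (need safety analysis):", derived)
print("system reqs without children:", uncovered_system)
print("untested requirements:", untested)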

One important misunderstanding about the V-model must be dispelled here: the V-model should not primarily be seen as a Gantt chart, but as a data management model. It maps the "divide and conquer" and the relationships between the artifacts. In practice this means that one cannot get by without iterations between the levels; of course, those should be minimized as much as possible for the sake of efficiency. A sequence naturally follows, because on the lower levels of detailing no specification and design can be completed while the artifacts on the higher ones are not yet stable and released. Just as it is impossible on the higher integration levels to finalize testing as long as not all tests on the lower levels have been passed completely.

Verification

Verification is often equated with testing. For safety critical systems this is not true: tests are just a small part of verification. Most of verification consists of reviews.

  • Reviews: Before their release, all artifacts must be verified by a review, often even by a reviewer with exactly defined independence from the project team or even from the organization. For some of the artifacts several reviews take place, if a quality assurance review or a review with respect to the standard is requested.
  • Checklists: Usually, for each artifact a checklist exists. Without evidence of the performed reviews the reviews are considered not done, so the filled-in checklists must be filed as review results.
  • Tests: There are test specifications, test instructions, maybe test code and also here again evidence of all test results, i.e. all results must be documented. The tests must be requirements based, amongst others there may be no test without an according requirement.
  • Code Coverage Analysis: For the software, it must be made sure that all code is covered by the tests, including that all branches are taken. Note that it says coverage analysis: coverage is not a test in itself, but rather an analysis to show that the tests satisfy some minimal quality criteria. Coverage can be demonstrated using tools for dynamic code analysis.

As a consequence of the required code coverage and the requirements-based testing, it is not allowed (explicitly so in avionics with DO-178C) to write tests just for code coverage, i.e. tests for which no requirements exist. So let's just generate a requirement? ...which, as a "derived" requirement, then needs a safety analysis, because there must be no unintended functionality. This is why it is worthwhile to implement only what is really required.
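The idea of branch coverage can be illustrated with a toy instrumentation: every branch outcome is recorded while the requirements-based tests run, and afterwards the analysis checks that nothing is missing. The function under test and the requirement IDs are invented for illustration; real projects use coverage tools instead of hand instrumentation:

```python
taken = set()  # branch outcomes observed while the tests run

def limit(value: float, maximum: float) -> float:
    """Toy function under test (invented): clamp a value to a maximum."""
    if value > maximum:
        taken.add(("limit-if", True))   # instrumented: branch taken
        return maximum
    taken.add(("limit-if", False))      # instrumented: branch not taken
    return value

# Requirements-based tests, one per (invented) requirement:
assert limit(5.0, 10.0) == 5.0    # REQ-1: values below the maximum pass through
assert limit(15.0, 10.0) == 10.0  # REQ-2: values above the maximum are limited

# Coverage analysis: were both outcomes of every branch exercised?
all_branches = {("limit-if", True), ("limit-if", False)}
missing = all_branches - taken
print("uncovered branches:", missing)
```

If `missing` were non-empty, the analysis would point either at a missing requirements-based test or, per the paragraph above, at code that should not be there in the first place.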

Standards

To ensure a homogeneous quality over the whole project, standards are required for many artifacts. These can be developed internally, but it makes it easier to deal with the external auditors if known standards are used, e.g. MISRA for C/C++ code.

  • Requirement Standards: Those describe how requirements must be formulated, down to formatting.
  • Design Standards: Clear guidelines for the design, they must cover all requirements of the standards, like no hidden data flow, hierarchical design...
  • Coding Standards: For the software, only a safe, deterministic and well readable subset of the programming language shall be used. Compliance with coding standards like MISRA can for the most part be checked automatically using tools for static code analysis.

Components

For electronics only high quality components should be selected. When selecting those, the long-term availability should be considered, so the safety evidence upon component changes does not have to be provided again and again. In addition it is key to have good data for the calculation of the failure rates.

Apart from the AEC-Q qualifications for automotive, almost no "high reliability" parts exist anymore. Also, the "standards" with numbers for failure rates (Siemens SN 29500, MIL-HDBK-217F...) have fallen victim to the ravages of time, or rather to technological advances. Still, these standards are used for quantitative analyses, as in most cases it is only about comparing different technical solutions for the fulfillment of a target value for the overall system, not about a realistic statement on the probability of failure.

Tools

No modern electronics or software development happens without software tools. Software? Is the software of all the tools in the project free of errors? What happens if an error in a tool leads to an error in an artifact?

  • Tool Classification: In a project for functional safety this means that all tools have to be classified: it must be shown whether, and if so which, errors a tool can introduce into an artifact.
  • Tool Qualification: According to the result of this analysis, the tools must be qualified, i.e. it must be demonstrated that the tool, as it is used, does not generate these errors, or that the errors can be caught.

Psychology, Too: Giving and Accepting Feedback

Functional safety is pure logic after all, a clear-cut thing. This is what you think at the beginning... But it is quite wrong. Psychological aspects play a not unimportant role, both for the achievement of the goals and for efficiency, and above all for one's own satisfaction.

No engineer gets around feedback, at the latest during the review of his results. Accepting positive feedback is usually not a problem, but when something is wrong, emotions sometimes run high. The issue here is one's own attitude towards errors: Can I accept my own errors and learn from them? Am I ready to look closely at others' errors and point them out? Am I ready to carry out such conflicts in a constructive manner? Because only in a "conventional" project is "come on, it works" a reason not to correct bad code, and maybe not even there.

And because the goal should be to pass the reviews without findings, the lone warrior approach does not work anymore. If I do not align my solution with others, if I do not work it out together with them and find a consensus, then I go through so many rounds of reviews that I get dizzy.

In the end I am only satisfied when I do not consider each error, each critique as an attack on me as a human, but as an invitation to get even better, to develop myself.

What does the Project Team Do?

The whole project team and the project manager also have their obligations and must take care of the following aspects.

Plans

In order to leave nothing to chance, projects for the development of safety critical systems must be conducted in a planned manner. So processes, roles, responsibilities etc. have to be summarized as plans:

  • Development Interface Agreement: If several partners are developing together, all responsibilities are distributed in detail, so that none of them can be forgotten.
  • Safety Plan or Plan on Software/ Hardware Aspects of Certification: This is the overall plan for the safety project, it contains all processes, all responsibilities, all dates and all procedures.
  • Verification Plan: It depicts how the artifacts are being verified.
  • Integration Plan: A plan how the subsystems, systems etc. are being brought together on all levels.
  • Configuration Management Plan: It shows in detail how the relationships between the artifacts and their versions and revision states are managed over the lifecycle of the product.
  • Quality Assurance Plan: Clarifies who assures the quality of the artifacts generated in the project when and how.
  • and others: Lead-Free Development Plan...

Configuration Management

Configuration management assures that at any point in time all artifacts match perfectly.

  • Releases: In different phases of the project, releases are generated. Those not only encompass software and production data, but always all artifacts of the project.
  • Audits: To make sure that all artifacts in a release really match, an audit of each release must be performed.
  • Archiving: Depending on the industry, it can be required that even in 20 or more years it must be possible to regenerate e.g. exactly the same binary, based on the processes used in the development and with the same tools. How did you develop 20 years ago?

Change Management

After the first formal release, change management enters into force. The composition of a Change Control Board is defined. For each change, this board has to answer the following questions while following an exactly defined workflow:

  • Is the change really needed?
  • How does the change affect safety?
  • How is it ensured that the change is verified, and how is it verified?
  • When will the change be implemented?

Of course the whole process is always documented with detailed traceability.

Audits

All important artifacts, and depending on the safety level this can be quite a few, must be audited, i.e. a further person must perform a review. These audits shall make sure that no shortcuts are taken. This is why strong independence from the project is required for the auditors; often third parties are called in: notified bodies, the customer himself, authorities (EASA, FAA...).

Statements

In the end, the most important document for the customer or the authorities is produced: the final statement that the product is safe.

  • Safety Case or Software/ Hardware Accomplishment Summary: These documents must show what from the plans has been executed, and how, and they explain why the product is safe.

Psychology, Too: Communication

Engineers in particular often underestimate the psychological part of communication. Functional safety is teamwork, thus communication between humans, and this communication cannot be reduced to pure information transfer.

Especially when developing somewhat more complicated safety critical systems, most tasks cannot be described in such a way that the engineer can afterwards withdraw into his cubbyhole for a few weeks and emerge with a solution that is good enough. It needs communication between all participants, so that the interfaces are correct and the consensus is reached which is then documented in the many artifacts.

A good, and thereby safe, design also cannot be reached via metrics, but only as a trade-off, and this consensus is not to be had without communicating with each other.

The basis for all communication that shall be well received is relationships. Especially when the communication at times is a little more heavyweight: "this is completely wrong". Good relationships do not mean that everybody has to spend every weekend with everybody else, but they have to be good enough that empathic communication is possible.

In the end it is important that even with the meticulous way of work of functional safety, a mood is reached in the team which makes work pleasant. 

What does the Company Do?

The overall responsibility for functional safety cannot simply be delegated to the project team. Especially not when the whole product life cycle from production to disposal is included. The organization itself has to accomplish considerable goals.

Processes/ Models

It is assumed for functional safety that the organization has processes, lives them and also improves on them. For development, Automotive SPICE or CMMI (Capability Maturity Model Integration) are common as process models. And those models go much further than ISO 9001: there are more goals to be reached and more practices specified. For avionics you need a DOA (Design Organization Approval), which can also be transferred from the customer if he takes the responsibility for the final quality assurance.

The question I ask myself is whether those processes and models are really lived in the sense they are intended, even when working with large organizations with a high, certified maturity...

Level 3

What does level 3 mean? In all process models, level 3 means that processes are defined and lived for the whole organization, i.e. for all projects. These processes are adapted to the project at hand by a process called "tailoring". The processes encompass ways of working, tools, templates, checklists...

All those processes must be continuously improved, a learning organization is mandated.

Safety Culture

Last but not least, one of the most important factors: safety culture. The organization must make sure that safety is prioritized over commercial aspects. This means that it is no longer possible to throw a product onto the market with a heroic weekend mission. All plans, reviews and audits must be observed.

In addition, a proactive attitude towards errors is stipulated: errors are to be learned from, at project and company level. Clear plans without ad-hoc resource allocation and traceable responsibilities are required.

Keep It Simple

As we have seen up to now, the effort for each requirement, each component, each line of code is huge. Conversely this means that above each project for functional safety should be written: Keep it Simple! Every requirement, every "nice-to-have" which can be omitted saves a lot of effort. The motto is: simplify whatever is possible, even if this often displeases product management.

What do we do now, if we want to develop safety critical systems?

So how should you proceed in order to develop in a way that the product can be called functionally safe?

Should we Develop Anew?

Basically, existing code, existing schematics and possibly documentation cannot be reused "as is" for functional safety. All activities have to be performed as for a new development, and this has to be proven with artifacts; the existing artifacts can only serve as a "source of inspiration". Through the simplifications pursued and the correction of the errors which the strict processes will inevitably uncover, the product will in any case emerge changed from the process.

In rare cases, if the simplification is justified by e.g. a new platform, a complete re-development can be sensible. Viewed from the safety angle, by the way, a complete re-development has the disadvantage that new errors are built in which had already been eradicated in a long-standing product.

But the Product is Running Since Years?

And then the same question arises: Can we not just leave the product as it is? There have been no failures until now. Theoretically this is possible, but the hurdles for the confirmation of operating hours and the complete traceability of errors over years are in most cases so high that this variant is almost never applicable.

How can one Establish the Development Capabilities oneself?

We followed the approach of first reaching level 3, in our case with CMMI-DEV (Capability Maturity Model Integration for Development). To this end we performed audits with external specialists, first with a focus on efficiency. Then we had a gap analysis performed for the applicable safety standards and safety levels, and corrected the way we work, i.e. our processes, in order to close the gaps to the safety standards.

The effort for the establishment and the maintenance of such a process landscape is considerable. For Solcept from 2011 to 2018 (8..16 engineers) the effort was between 2 and 4% of the yearly working hours and about 30'000 CHF per external audit or per gap analysis.

There are other ways also:

On one hand, one can buy complete process landscapes. However, the question with those is what happens to the current ways of working, i.e. whether the processes fit your organization and are viable.

One can also establish processes directly from the safety standards. We had doubts whether we would then lose the focus on efficiency that CMMI brings.

The third method would be to just give the standard to the project team and let it work out the processes. This method collides with the level 3 requirement of organization-wide processes and with the stipulated safety culture.

Or use the Capabilities of Solcept!

We develop for functional safety across industries; if you wish, we transfer the project, including the complete project processes, back to you.

If you don't want to struggle with the processes yourself, contact me:

Andreas Stucki
