Development for Functional Safety: the Difference to "Conventional" Development

The customer thinks your product is fantastic; she would just like to get it with a functional safety level: DAL, SIL, ASIL, PL... just a few more letters in the requirements. What does this mean for your development? What do you have to do now if you want to fulfill the customer's wish? If you have read the relevant standards and still nothing is clear, you are not the only one...

Functional safety is not just another feature for the datasheet, like another sneaker color. Functional safety is more like a mountain marathon which makes demands on the whole development.

This contribution summarizes the most important differences between a functionally safe development and a purely functional one. It proceeds in the following steps:

What does the Project Team Do? What does the Company Do?

A summary in the form of a short list can be found here.

 

What is This About?

This is only about areas in which we at Solcept are experienced, so this contribution covers the development of software and electronics. We ignore the remaining life cycle from production to disposal, as well as the details of system development at e.g. vehicle or aircraft level.

What does "Conventional" Mean here?

As the starting point for this overview I assume a "conventional" embedded development in the industrial sector. Note that the quality levels without functional safety, e.g. QM (Quality Managed) in automotive or DAL-E (Design Assurance Level E) in aviation, already need massively more effort than the development we take as a base here. This is because these developments should already be carried out according to level 3 of Automotive SPICE or CMMI (Capability Maturity Model Integration). This means that quite a lot of requirements, unit tests, traceability etc. are already present.

Standards

Historically, the standards for embedded functional safety started in 1982 in the United States with DO-178, targeted at airliners. This standard, today in version DO-178C, is still the mother of all standards, and its concepts were taken over by other industries. In Germany IEC 61508 originated in 1998, primarily for the chemical industry, but also as an umbrella standard from which most of the other industry standards have been derived.

I don't want to delve into the topic of the different standards here. In many cases one cannot manage with just one; there is a whole collection of documents which have to be considered. However, the basic concepts are the same, so what has to be done for functional safety can be defined independently of the industry.

The Name of the Game

Finally in this introduction, probably the most important point for everybody who would like to develop for functional safety: whatever one thinks about the standards and the models they require, the moment one says yes to a project with functional safety, one has to play the game of functional safety. This means executing the required work to the letter and generating the documents. This applies to the engineers who are doing the work, but especially to the leadership, which has to provide time and resources. Otherwise... one has to play another game, without functional safety.

What does This Mean?

A few basic terms are important, so here are short definitions:

  • Functional Safety: Safety which depends on the correct function of the system, thus not explosion protection, voltage protection, fire protection etc.
  • Systematic Errors: Errors which are built into the system during development (these errors can occur in the whole system)
  • Random Errors: Errors which occur randomly, failures (these errors can only occur in hardware because the software does not suffer aging processes)
  • Artifact: All work products (documents, code, data...)

Why are They Doing This?

There are a few basic principles of functional safety which help to understand why one should develop exactly this way and not in any other way.

  • Quality: The most important point is the focus on processes, on a structured course of action. The axiom here is "use quality to manage risk". One tries to mitigate the risks by keeping the development quality high, even though the risks can never be completely eliminated. This also means that the higher the risk (expressed as "safety levels"), the higher the quality hurdles.
  • Divide and conquer: The overall system (vehicle, aircraft, equipment...) is divided into smaller and smaller subsystems, so that a human, namely the responsible engineer, can keep an overview of the complexity at his or her level.
  • Planning: If a project is planned cleanly, fewer errors occur because no emergency exercises over the weekend are needed. Such quick fixes by the heroic developer are suppressed by the change management and documentation requirements in all standards.
  • Evidence: What is not documented has not happened. This is on the one hand a basic requirement of a quality management system (as the quality pope W.E. Deming said: "In God we trust, all others bring data"), on the other hand a requirement of liability law. Committing an error cannot be used against anybody, but having left out a step, not having done a review or not having performed a test can.
  • 4 (to 6) Eyes Principle: Everything that is somehow important must be checked at least once through a review, sometimes even through several.
  • Overall View: In the end, functional safety has to be right for the complete system under consideration (vehicle, equipment, aircraft...). This means that functional safety cannot be assessed for a subsystem or a built-in piece of equipment alone. So there is no complete "certification" of e.g. sensors, actuators or software. Most aspects can be "pre-certified", but it still has to be made sure that the overall safety is not compromised by e.g. interface problems or errors in usage.
  • Traceability: Errors must not arise from assumptions which the engineer makes about his artifact. This means that requirements must be traceable from the highest (equipment) level down to the implementation, leaving no room for assumptions.

What does the Engineer Do?

Safety Analyses

An analysis which especially occupies the developer of the overall system.

  • Hazard/Risk Analysis: First the safety level of the system, subsystem or function must be determined. This is derived from the possible damage and the probability with which such damage occurs, in most cases based on some kind of flow diagram. Especially in aviation there are predefined safety levels for different systems. Here is a short glossary of the most common abbreviations:

Level                                   | Range | Highest Risk Level | Industry
----------------------------------------|-------|--------------------|---------------------------------------------------
DAL: Design Assurance Level             | E..A  | A                  | Aviation: ARP 4761, ARP 4754A, DO-178C, DO-254...
SIL: Safety Integrity Level             | 1..4  | 4                  | Industry, IEC 61508 and railway, EN 50128/9
ASIL: Automotive Safety Integrity Level | A..D  | D                  | Automotive, ISO 26262
PL: Performance Level                   | a..e  | e                  | Machinery, EN ISO 13849
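
To make the "flow diagram" idea more concrete, here is a minimal sketch for the automotive case. It assumes the ISO 26262 classification into severity (S1..S3), exposure (E1..E4) and controllability (C1..C3) and uses the commonly cited shortcut that the sum of the three class numbers reproduces the risk table. The function name is hypothetical, the classes S0, E0 and C0 are left out for brevity, and the standard itself remains the authoritative source.

```python
def determine_asil(severity: int, exposure: int, controllability: int) -> str:
    """Look up the ASIL for one hazardous event (illustrative sketch only).

    severity:        1..3 (S1..S3)
    exposure:        1..4 (E1..E4)
    controllability: 1..3 (C1..C3)
    """
    if not (1 <= severity <= 3 and 1 <= exposure <= 4 and 1 <= controllability <= 3):
        raise ValueError("class outside the ranges S1..S3, E1..E4, C1..C3")

    # Shortcut for the risk table: only the highest sums need an ASIL,
    # everything below is handled by normal quality management (QM).
    total = severity + exposure + controllability
    return {10: "ASIL D", 9: "ASIL C", 8: "ASIL B", 7: "ASIL A"}.get(total, "QM")


# Worst case in all three dimensions leads to the highest level.
print(determine_asil(3, 4, 3))
# A rarely occurring, easily controllable, light-injury event stays at QM.
print(determine_asil(1, 1, 1))
```

The classification of S, E and C for each hazardous event is the actual engineering work; the lookup itself is trivial.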

The other analyses then take place on each level (system, subsystem, component, function), for software as well as for hardware, sometimes with different characteristics, so they affect each developer. There are different variants; the most important are:

  • Fault Tree Analysis (FTA): The FTA proceeds deductively, i.e. from the failure to the cause: which faults in my system can lead to a certain failure? E.g. which components must fail so that the safety-relevant function is compromised?
    This makes the method suitable for design, especially for top-down system design. The FTA exists in two variants, a purely qualitative one and a quantitative one, in which probabilities are assigned to the fault events.
  • Failure Modes and Effects Analysis (FMEA): In contrast, the FMEA proceeds inductively, from the cause to the failure. For each subsystem/component the question asked here is: what kind of safety-relevant failures can arise from a fault? E.g. if this component changes its value over time (i.e. it ages), how does this affect the function? If a state machine swallows a bit, how does this affect the function? The FMEA also exists in two variants, a purely qualitative one and a quantitative one. For the latter the analysis is based on fault probabilities and fault mechanisms for the components.
    In the industrial and automotive areas usually an FMEDA (Failure Modes, Effects and Diagnostic Analysis) is performed, in which a reduction of the failure rates is calculated for diagnostic mechanisms (e.g. read-back of output signals).
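
The quantitative variants can be sketched in a few lines. The example below assumes statistically independent fault events and uses invented probabilities and failure rates; the function names are hypothetical and the numbers are not taken from any real component.

```python
from math import prod


def and_gate(probabilities):
    """FTA AND gate: the output event occurs only if all inputs occur."""
    return prod(probabilities)


def or_gate(probabilities):
    """FTA OR gate: the output event occurs if at least one input occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


def residual_failure_rate(failure_rate_fit, diagnostic_coverage):
    """FMEDA-style reduction: only the undetected share of failures remains."""
    return failure_rate_fit * (1.0 - diagnostic_coverage)


# Hypothetical fault tree: the safety function is lost if both redundant
# sensors fail, or if the common power supply fails.
sensor_a = 1e-4
sensor_b = 1e-4
power_supply = 1e-5

both_sensors = and_gate([sensor_a, sensor_b])
top_event = or_gate([both_sensors, power_supply])
print(f"probability of top event: {top_event:.3e}")

# Hypothetical output stage with 100 FIT, monitored by a read-back
# that is assumed to detect 90 % of its dangerous failures.
print(f"residual failure rate: {residual_failure_rate(100.0, 0.90):.1f} FIT")
```

In this made-up tree the single power supply dominates the result, which is the kind of insight the FTA is meant to deliver during design.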

Safety Mechanisms

Based on the safety analyses, safety measures have to be implemented to catch the discovered faults:

  • against random hardware failures
  • against systematic software failures
  • against systematic hardware failures

These measures can comprise: plausibility checks, redundancy (i.e. several systems which check each other), diverse redundancy (redundancy based on components that are designed and built in completely different ways), program flow monitoring, error correction for memories and many more.
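
As an illustration of two of these measures, the following sketch combines a plausibility check with a simple 2-out-of-3 vote over redundant sensor readings. Ranges, tolerance and names are invented for the example; a real safety mechanism would be derived from the safety analyses of the concrete system.

```python
from statistics import median


def plausible(value, low, high):
    """Plausibility check: is the reading inside the physically possible range?"""
    return low <= value <= high


def vote_2oo3(readings, low, high, tolerance):
    """Return a voted value, or None if no trustworthy value can be determined.

    At least two plausible readings must agree within `tolerance`;
    otherwise the caller has to enter its safe state.
    """
    valid = [r for r in readings if plausible(r, low, high)]
    if len(valid) < 2:
        return None

    center = median(valid)
    agreeing = [r for r in valid if abs(r - center) <= tolerance]
    if len(agreeing) < 2:
        return None
    return median(agreeing)


# One channel is stuck at a plausible but wrong value: it is outvoted.
print(vote_2oo3([20.0, 20.2, 95.0], low=0.0, high=100.0, tolerance=0.5))
# All three channels disagree: no safe value, the result is None.
print(vote_2oo3([20.0, 50.0, 95.0], low=0.0, high=100.0, tolerance=0.5))
```

Returning None instead of a "best guess" is a deliberate choice in this sketch: the decision what the safe state is belongs to the system design, not to the voter.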

Requirements

Errors in the requirements are the most prevalent cause of failure. This is why a lot of importance is attached to requirements in functional safety. Several aspects have to be considered:

  • V-Model: The requirements must be managed according to the V-model in all industries. This means:

    • There are successively more detailed requirements on each level (e.g. system, software, software unit). The extent of the requirements for each element (system, software, unit) should be such that a human can still understand them; the details are moved to the next lower level.
    • Basically all requirements are tested on each level.

  • Requirements Traceability: Requirements and tests must be traceable, among other things to make sure the overall product remains maintainable:

    • Vertically: it must be clear which requirements on one level are covering the more abstract requirements on the next higher level.
    • Horizontally: it must be clear which tests are testing which requirements.
    • Bi-directional: it must be possible, starting from one level to follow the relationships to all other levels.

  • Traceability Coverage Analysis: Evidence must be provided that all requirements on each level exist as more detailed requirements down to the implementation and that all requirements are tested.
  • "Derived" Requirements: If architecture or design generate new requirements, e.g. from interfaces between different subsystems, these are called "derived" requirements. "Derived" thus means requirements that cannot be traced to higher levels. Such requirements must undergo a separate analysis, which must establish that they jeopardize neither the function of the superordinate element nor safety.
  • No Unintended Functionality: Another important aspect of the handling of requirements (especially "derived" ones) and of traceability is to prevent unintended functionality from being inserted into the implementation, e.g. by the programmer. This usually comes from ambiguous, i.e. not sufficiently precise, requirements or from good intentions like defensive programming. Both can lead to unintended malfunctions.
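
The traceability rules above can be expressed as a small coverage analysis. The sketch below is only one possible data model, with hypothetical requirement and test identifiers; in practice this job is done by a requirements management tool.

```python
from dataclasses import dataclass, field


@dataclass
class Requirement:
    req_id: str
    level: int                                    # 0 = highest (system) level
    parents: list = field(default_factory=list)   # ids on the next higher level


def analyse_traceability(requirements, tests, lowest_level):
    """Return the gaps in vertical and horizontal traceability.

    tests: mapping of test id -> list of requirement ids the test verifies.
    """
    ids = {r.req_id for r in requirements}
    refined = {parent for r in requirements for parent in r.parents}
    tested = {rid for covered in tests.values() for rid in covered}

    return {
        # Vertical: requirement above the lowest level without more detailed children.
        "not_refined": sorted(
            r.req_id for r in requirements
            if r.level < lowest_level and r.req_id not in refined
        ),
        # Horizontal: requirement without any test.
        "not_tested": sorted(ids - tested),
        # "Derived": no parent on a higher level, needs a separate safety analysis.
        "derived": sorted(
            r.req_id for r in requirements if r.level > 0 and not r.parents
        ),
        # Tests must be requirements based.
        "tests_without_requirement": sorted(
            t for t, covered in tests.items() if not any(c in ids for c in covered)
        ),
    }


example_requirements = [
    Requirement("SYS-1", 0),
    Requirement("SYS-2", 0),
    Requirement("SW-1", 1, ["SYS-1"]),
    Requirement("SW-2", 1),            # no parent: a "derived" requirement
]
example_tests = {"T-1": ["SYS-1"], "T-2": ["SW-1"], "T-3": []}

report = analyse_traceability(example_requirements, example_tests, lowest_level=1)
for finding, items in report.items():
    print(finding, items)
```

Every non-empty list in the report is a finding that has to be closed or justified before the release.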

An important point about the V-model: it should not primarily be seen as a Gantt chart, but above all as a data management model. It maps the "divide and conquer" principle and the relationships between the artifacts. In practice this means that one cannot get by without iterations between the levels. Of course, those should be minimized as much as possible for the sake of efficiency. A sequence follows naturally, because on the lower levels of detail no specification or design can be completed as long as the artifacts on the higher levels are not stable and released. Likewise, testing on the higher integration levels cannot be finalized as long as not all tests on the lower levels have run completely.

Verification

Verification is often equated with testing. Here this is not true: tests are just a small part of verification. Most of verification consists of reviews.

  • Reviews: Before their release, all artifacts must be verified by a review, often even by a reviewer with precisely defined independence. Some artifacts undergo several reviews, e.g. if a quality assurance review or a review with respect to the standard is required.
  • Checklists: Usually there is a checklist for each artifact. Without evidence, reviews are considered not done, so the filled-in checklists must be filed as review results.
  • Tests: There are test specifications, test instructions, maybe test code and, here again, evidence of all test results. The tests must be requirements-based; among other things, there must be no test without a corresponding requirement.
  • Code Coverage Analysis: For the software, it must be shown that all code is covered by the tests, including that all branches were taken. Note that this is a coverage analysis: coverage is not a test in itself, but rather an analysis of whether the tests satisfy some minimal quality criteria. Coverage can be demonstrated using tools for dynamic code analysis.

Incidentally, it follows from the required code coverage and the requirements-based testing that it is not allowed (explicitly so in avionics with DO-178C) to write tests purely for code coverage for which no requirements exist. So let's just generate a requirement? That would be a "derived" requirement, which then needs a safety analysis. There must be no unintended functionality. This is why it is worthwhile to implement only what is really required.
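
The difference between a test and the coverage analysis can be shown with a toy example. Here the branch outcomes are recorded by hand and the requirement identifier is invented; a real project would use a coverage tool instead of manual instrumentation.

```python
# Branch outcomes recorded by hand; this stands in for a coverage tool.
branch_hits = set()

ALL_BRANCHES = {"over_limit:true", "over_limit:false"}


def limit_speed(speed, maximum):
    """Clamp speed to maximum (hypothetical requirement REQ-SPEED-1)."""
    if speed > maximum:
        branch_hits.add("over_limit:true")
        return maximum
    branch_hits.add("over_limit:false")
    return speed


def missing_branches():
    """Coverage analysis: which branch outcomes did the tests never exercise?"""
    return ALL_BRANCHES - branch_hits


# First requirements-based test for REQ-SPEED-1: only the clamping case.
assert limit_speed(120, 100) == 100
print("missing after first test:", sorted(missing_branches()))

# The analysis shows a gap, so a second case is derived from the same
# requirement (speed below the limit), not from the code.
assert limit_speed(50, 100) == 50
print("missing after second test:", sorted(missing_branches()))
```

The test passes in both situations; only the coverage analysis reveals that the first test set was incomplete. If a missing branch cannot be reached by any requirements-based test, the question is why the code exists at all.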

Standards

To ensure a homogeneous quality over the overall project, for many artifacts standards are called for. Those can be developed internally, but it makes it easier to deal with the external auditors if known standards are used, e.g. MISRA for C/ C++ code.

  • Requirement Standards: Those describe how requirements must be formulated, down to formatting.
  • Design Standards: Clear guidelines for the design, they must cover all requirements of the standards, like no hidden data flow, hierarchical design...
  • Coding Standards: For the software, only a safe, deterministic and well readable subset of the programming language shall be used. Compliance with coding standards like MISRA can for the most part be checked automatically using tools for static code analysis.

Components

For electronics only high quality components should be selected. Those should be available as long as possible, so the safety evidence upon component changes does not have to be provided again and again. In addition it would be favorable to have good data for the calculation of the failure rates.

Unfortunately, outside the AEC-Q certificates for automotive there are almost no "high reliability" parts anymore. Also, the "standards" with numbers for failure rates have fallen victim to the ravages of time, or rather to technological advances. And because to my knowledge there is no organization that collects new statistical data on failure rates and failure modes of modern components, it can be very troublesome to calculate a realistic failure rate for a circuit.

Tools

There is no modern electronics or software development without software tools. Software? Is the software of all tools in the project free of errors? What happens if an error in a tool leads to an error in an artifact?

  • Tool Classification: In a project for functional safety this means that all tools have to be classified. It must be shown whether, and if so which, errors a tool can generate in an artifact.
  • Tool Qualification: According to the result of the above analysis the tools must be qualified, i.e. it must be demonstrated that the tool as it is used does not generate these errors or that the errors can be caught.
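
As one example of such a classification scheme, ISO 26262 combines the tool impact (TI) and the tool error detection (TD) into a tool confidence level (TCL). The sketch below assumes that scheme; the function name is hypothetical and other standards use different criteria.

```python
def tool_confidence_level(tool_impact: int, tool_error_detection: int) -> str:
    """Classify a tool in the style of ISO 26262 (illustrative sketch).

    tool_impact:          1 = a tool malfunction cannot introduce an error or
                              fail to detect one in a safety-related artifact,
                          2 = it can
    tool_error_detection: 1 = high, 2 = medium, 3 = low confidence that such an
                              error is prevented or detected in the process
    """
    if tool_impact not in (1, 2) or tool_error_detection not in (1, 2, 3):
        raise ValueError("expected TI in 1..2 and TD in 1..3")

    if tool_impact == 1:
        return "TCL1"                     # no qualification needed
    return {1: "TCL1", 2: "TCL2", 3: "TCL3"}[tool_error_detection]


# A text editor whose output is fully reviewed and compiled afterwards.
print(tool_confidence_level(tool_impact=1, tool_error_detection=1))
# A code generator whose output is not checked by any later step.
print(tool_confidence_level(tool_impact=2, tool_error_detection=3))
```

The classification depends on how the tool is used in the project, so the same compiler can end up at different levels in two projects with different verification steps.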

What does the Project Team Do?

The whole project team and the project manager also have their obligations and must take care of the following aspects.

Plans

In order to leave nothing to chance, projects for functional safety must be conducted in a planned manner. So processes, roles, responsibilities etc. have to be summarized as plans:

  • Development Interface Agreement: If several partners are developing together, all responsibilities are distributed in detail, so that none of them can be forgotten.
  • Safety Plan or Plan for Software/Hardware Aspects of Certification: This is the overall plan for the safety project; it contains all processes, all responsibilities, all dates and all procedures.
  • Verification Plan: It describes how the artifacts are verified.
  • Integration Plan: A plan of how the subsystems, systems etc. are brought together on all levels.
  • Configuration Management Plan: It shows in detail how the relationships between the artifacts and their versions and update states are managed over the life cycle of the product.
  • Quality Assurance Plan: Clarifies who assures the quality of the artifacts generated in the project, when and how.
  • and others: Lead-Free Development Plan...

Configuration Management

Configuration management assures that at any point in time all artifacts match perfectly.

  • Releases: In different phases of the project, releases are generated. Those not only encompass software and production data, but always all artifacts of the project.
  • Audits: To make sure that all artifacts in a release really match, an audit of each release must be performed.
  • Archiving: Depending on the industry, it can be required that even after 20 or more years it must be possible to regenerate exactly the same binary, based on the processes used in the development and with the same tools. How did you develop 20 years ago?
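
One building block for release audits and for the "same binary" check is a manifest with a cryptographic hash of every artifact in the release. The following sketch uses only the Python standard library; directory layout and function names are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so that large binaries do not fill the memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(release_dir: Path) -> dict:
    """Map every file of the release (relative path) to its SHA-256 hash."""
    return {
        p.relative_to(release_dir).as_posix(): sha256_of(p)
        for p in sorted(release_dir.rglob("*"))
        if p.is_file()
    }


def audit_release(release_dir: Path, manifest: dict) -> list:
    """Compare the current state with the archived manifest, return findings."""
    current = build_manifest(release_dir)
    findings = []
    for name in sorted(set(manifest) | set(current)):
        if name not in current:
            findings.append(f"missing: {name}")
        elif name not in manifest:
            findings.append(f"unexpected: {name}")
        elif manifest[name] != current[name]:
            findings.append(f"changed: {name}")
    return findings
```

A rebuilt binary can be checked the same way: if its hash differs from the archived one, either the sources, the tools or the build process are no longer the ones of the release.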

Change Management

After the first formal release, change management enters into force. The composition of a Change Control Board is defined. For each change, this board has to answer the following questions while following a precisely defined workflow:

  • Is the change really needed?
  • How does the change affect safety?
  • How is the change verified, and how is it ensured that it is verified?
  • When will the change be implemented?

The whole process, of course, always comes with detailed traceability.

Audits

All important artifacts (depending on the safety level this can be several) must be audited, i.e. a further person must perform a review. These audits shall make sure that no shortcuts are taken. This is why strong independence from the project is required of the auditors; often third parties are called in: notified bodies, the customer himself, authorities (EASA, FAA...).

Statements

At the end comes the most important document for the customer or the authorities: the final statement that the product is safe.

  • Safety Case or Software/Hardware Accomplishment Summary: These documents must show which parts of the plans have been executed and how, and explain why the product is safe.

What does the Company Do?

It is not so simple that the overall responsibility for functional safety can be delegated to the project team, especially not when the overall product life cycle from production to disposal is included. The organization has to accomplish considerable goals.

Processes/ Models

For functional safety it is assumed that the organization has processes, lives them and also improves them. For development, Automotive SPICE or CMMI (Capability Maturity Model Integration) are common process models. These models go much further than ISO 9001: there are more goals to be reached and more practices specified. For avionics you need a DOA (Design Organization Approval), which can also be transferred from the customer if he takes the responsibility for the final quality assurance.

The question I ask myself is whether those processes and models are really lived in the sense in which they are intended, also when working with large organizations with a high, certified maturity...

Level 3

What does level 3 mean? In all process models, level 3 means that processes are defined and lived for the overall organization, i.e. for all projects. These processes are adapted to the project at hand by a process called "tailoring". The processes encompass ways of working, tools, templates, checklists...

All those processes must be continuously improved; a learning organization is mandated.

Safety Culture

Last but not least, one of the most important factors: safety culture. The organization must make sure that safety is prioritized over commercial aspects. This means that it is no longer possible to throw a product onto the market in a heroic weekend mission. All plans, reviews and audits must be observed.

In addition, a proactive attitude towards errors is stipulated, and errors are to be used to learn from them, at project and company level. Clear plans without ad-hoc resource allocation are specified, as is traceable responsibility.

Keep It Simple

As we have seen up to now, the effort for each requirement, each component, each line of code is huge. Conversely, this means that every project for functional safety should carry the motto: Keep it Simple! Every requirement, every "nice-to-have" which can be omitted can save a lot of effort. So simplify whatever can be simplified, even if this displeases product management in many cases.

How do We get There?

And now, how to proceed to be able to develop in a way that the product can be called functionally safe?

Should we Develop Anew?

Basically, existing code, existing schematics and potentially existing documentation cannot be reused as they are for functional safety. All the activities have to be performed as for a new development, and this has to be proven with artifacts. The existing artifacts can only serve as a "source of inspiration". Because of the simplifications pursued and the correction of errors which the strict processes will inevitably uncover, the product will emerge changed from the process anyway.

In rare cases, e.g. if the simplification is justified by a new platform, a complete re-development can be sensible. From the safety point of view, by the way, a complete re-development has the disadvantage that new errors are built in, whereas the errors of a long-standing product have already been eradicated.

But the Product has been Running for Years?

And then the same question arises: can we not just leave the product as it is, since there have been no failures until now? Theoretically this is possible, but the hurdles for confirming the operating hours and for complete traceability of errors over years are in most cases so high that this variant is almost never applicable.

How can one Establish the Development Capabilities oneself?

We followed the approach of first reaching Level 3, in our case with CMMI-DEV (Capability Maturity Model Integration for Development). To this end we performed audits with external specialists, at first with a focus on efficiency. Then we had a gap analysis performed for the applicable safety standards and safety levels and corrected the way we work, i.e. our processes, in order to close the gaps to the safety standards.

The effort for the establishment and the maintenance of such a process landscape is considerable. For Solcept from 2011 to 2018 (8..16 engineers) the effort was between 2 and 4% of the yearly working hours and about 30'000 CHF per external audit or per gap analysis.

There are also other ways:

On the one hand, one can buy complete process landscapes. However, the question with those is what happens to the current ways of working, i.e. whether the processes fit your organization and are viable.

One can also establish processes directly from the safety standards. We had doubts whether we would then lose the focus on efficiency which CMMI comprises.

The third method would be to just give the standard to the project team and let it work out the processes. This method collides with the requirements for processes on level 3 and with the stipulated safety culture.

Or use the Capabilities of Solcept!

We develop for functional safety across industries; if you wish, we transfer the project including the complete project processes back to you.

If you don't want to struggle with the processes yourself, contact me:

Andreas Stucki
