Good Programming Practices for Safety

Source: NASA-GB-8719.13, NASA Software Safety Guidebook




Practices from "Solving the Software Safety Paradox"

  1. CPU self test. If the CPU becomes partially crippled, it is important for the software to know this. Cosmic radiation, EMI, electrical discharge, shock, or other effects could have damaged the CPU. A CPU self-test, usually run at boot time, can verify correct operation of the processor. If the test fails, then the CPU is faulty, and the software can go to a safe state.
  2. Guarding against illegal jumps. Filling ROM or RAM with a known pattern, particularly a halt or illegal instruction, can prevent the program from operating after it jumps accidentally to unknown memory. On processors that provide traps for illegal instructions (or a similar exception mechanism), the trap vector could point to a process to put the system into a safe state.
  3. ROM tests. Prior to executing the software stored in ROM (EEPROM, Flash disk), it is important to verify its integrity. This is usually done at power-up, after the CPU self test, and before the software is loaded. However, if the system has the ability to alter its own programming (EEPROMS or flash memory), then the tests should be run periodically.
  4. Watchdog Timers. Usually implemented in hardware, a watchdog timer resets (reboots) the CPU if it is not “tickled” within a set period of time. Usually, in a process implemented as an infinite loop, the watchdog is written to once per loop. In multitasking operating systems, using a watchdog is more difficult. Do NOT use an interrupt to tickle the watchdog. This defeats the purpose of having one, since the interrupt could still be working while all the real processes are blocked!
  5. Guard against Variable Corruption. Storing multiple copies of critical variables, especially on different storage media or in physically separate memory, is a simple method for verifying the variables. When the variable is used, the copies are compared: if they disagree, two-out-of-three voting selects the value, and a default value is used if no two copies agree. Also, critical variables can be grouped, and a CRC used to verify they are not corrupted.
  6. Stack Checks. Checking the stack guards against stack overflow or corruption. By initializing the stack to a known pattern, a stack monitor function can be used to watch the amount of available stack space. When the stack margin shrinks to some predetermined limit, an error processing routine can be called that fixes the problem or puts the system into a safe state.
  7. Program Calculation Checks. Simple checks can be used to give confidence in the results from calculations.


Bad practices from "30 Pitfalls for Real-Time Software Developers"

  1. Delays implemented as empty loops. If delays are implemented as empty loops rather than timed with a timer, the timing will be wrong on a faster or slower machine, and timing problems will appear when the code is recompiled with a new compiler or with optimization enabled.
  2. Interactive and incomplete test programs. Tests should be planned and documented; this prevents tests from being omitted. After a change, functional tests should be run to demonstrate that the modified code does not affect other parts of the system.
  3. Reusing code not designed for reuse. If the code was not designed for reuse, there may be dependencies between components.
  4. One big loop. A single large loop forces every part of the software to run at the same rate, which is generally undesirable. (Still, people do it that way because it makes scheduling easy.)
  5. No analysis of hardware peculiarities before starting software design. Different processors have operations that can take significant time; for example, accessing a region of memory may take longer than expected. Understanding the hardware before designing the software reduces problems during the integration phase.
  6. Fine-grain optimizing during first implementation. Some programmers anticipate anomalies (some real, some superstitious). An example of a superstitious anomaly is the belief that multiplication takes far more time than addition.
  7. Too many inter-component dependencies. To maximize software reusability, components should not depend on each other in complex ways.
  8. Only a single design diagram. Most software systems are designed such that the entire system is defined by a single diagram (or, even worse, no diagram at all). When designing software, having a complete, documented design is very important. -> This seems to mean that design diagrams from multiple viewpoints are needed.
  9. Error detection and handling are an afterthought and implemented through trial and error. Design the error detection and handling from the start. Tune the effort at the code level, and do not put everything in at once. Look at the places where the data needs to be correct, and the areas where the software or hardware is vulnerable to bad inputs or outputs.
  10. No memory analysis. Check how much memory your system uses. Estimate it from your design, so you can control whether the system exceeds its limits. When deciding between two implementations of the same concept, memory usage can be a good basis for the choice.
  11. Documentation was written after implementation. Write what you need, and use what you write. Unless it is contractually required, do not document in unnecessary detail. A short document that developers will actually read and use is better.
  12. Indiscriminate use of interrupts. Use of interrupts can cause priority inversion in real-time systems if not implemented carefully. This can create timing problems and cause essential deadlines to be missed.
  13. No measurements of execution time. Many programmers who design real-time systems have no idea of the execution time of any part of their code.


Methods for reducing possible failures, from Table III of "Software Risk Management for Medical Devices"

  1. Check variables for reasonableness before use. An out-of-range value indicates a problem: memory corruption, an incorrect calculation, a hardware fault, or some other error.
  2. Use execution logging, with independent checking, to find software runaway, illegal functions, or out-of-sequence execution. If the software must follow a known path through the components, checking the log can detect a problem shortly after it occurs.
  3. Come-from checks. Verify the control flow by checking who is calling whom, so that unintended calls can be detected.
  4. Test for memory leakage. Instrument the code, then run load and stress tests, watching how memory usage changes to see how much memory is being leaked.
  5. Use read-backs to check values. When a value is written to memory, the display, hardware, or another function, read it back and verify that the correct value was written.


Other points to consider for safety

  1. Use a simulator or ICE (In-circuit Emulator) system for debugging in embedded systems. These tools allow the programmer/tester to find some subtle problems more easily. Combined with some of the techniques described above, they can find memory access problems and trace back to the statement that generated the error.
  2. Reduce complexity. Measure the cyclomatic complexity of each unit (function) and reduce it where it is high.
  3. Design for weak coupling between components (modules, classes, etc.). The more independent the components are, the fewer undesired side effects there will be later in the process. “Fixes” when an error is found in testing may create problems because of misunderstood dependencies between components.
  4. Consider the stability of the requirements. If the requirements are likely to change, design as much flexibility as possible into the system.
  5. Consider compiler optimization carefully. Debuggers may not work well with optimized code. It is hard to trace from the source code to the optimized object code. Optimization may change the way the programmer expected the code to operate (removing “unused” features that are actually used!).
  6. Be careful if using multi-threaded programs. Developing multi-threaded programs is notoriously difficult. Subtle program errors can result from unforeseen interactions among multiple threads. In addition, these errors can be very hard to reproduce since they often depend on the non-deterministic behavior of the scheduler and the environment.
  7. A dependency graph is a valuable software engineering aid. Given such a diagram, it is easy to identify what parts of the software can be reused, create a strategy for incremental testing of components, and develop a method to limit error propagation through the entire system.
  8. Follow the two person rule. At least two people should be thoroughly familiar with the design, code, testing and operation of each software component of the system. If one person leaves the project, someone else understands what is going on.
  9. Prohibit program patches. During development, patching a program is a bad idea. Make the changes in the code and recompile instead. During operations, patching may be a necessity, but the pitfalls should still be carefully considered.
  10. Keep Interface Control Documents up to date. Out-of-date information usually leads to one programmer creating a component or unit that will not interface correctly with another unit. The problem isn’t found until late in the testing phase, when it is expensive to fix. Besides keeping the documentation up to date, use an agreed-upon method to inform everyone of the change.
  11. Create a list of possible hardware failures that may impact the software, if they are not spelled out in the software requirements document. Have the hardware and systems engineers review the list. The software must respond properly to these failures. The list will be invaluable when testing the error handling capabilities of the software. Having a list also makes explicit what the software can and cannot handle, and unvoiced assumptions will usually be discovered as the list is reviewed.


The following practices are recommended by SSP 50038, Computer-Based Control System Safety Requirements for the International Space Station Program:

  • Provide separate authorization and separate control functions to initiate a critical or hazardous function. This includes separate “arm” and “fire” commands for critical capabilities.
  • Do not use input/output ports for both critical and non-critical functions.
  • Provide sufficient difference in addresses between critical I/O ports and non-critical I/O ports, such that a single address bit failure does not allow access to critical functions or ports.
  • Make sure all interrupt priorities and responses are defined. All interrupts should be initialized to a return, if not used by the software.
  • Provide for an orderly shutdown (or other acceptable response) upon the detection of unsafe conditions. The system can revert to a known, predictable, and safe condition upon detection of an anomaly.
  • Provide for an orderly system shutdown as the result of a command shutdown, power interruptions, or other failures. Depending on the hazard, battery (or capacitor) backup may be required to implement the shutdown when there is a power failure.
  • Protect against out-of-sequence transmission of safety-critical function messages by detecting any deviation from the normal sequence of transmission. Revert to a known safe state when out-of-sequence messages are detected.
  • Initialize all unused memory locations to a pattern that, if executed as an instruction, will cause the system to revert to a known safe state.
  • Hazardous sequences should not be initiated by a single keyboard entry.
  • Prevent inadvertent entry into a critical routine. Detect such entry if it occurs, and revert to a known safe state.
  • Don’t use a stop or halt instruction. The CPU should be always executing, whether idling or actively processing.
  • When possible, put safety-critical operational software instructions in nonvolatile read-only memory.
  • Don’t use scratch files for storing or transferring safety-critical information between computers or tasks within a computer.
  • When safety interlocks are removed or bypassed for a test, the software should verify the reinstatement of the interlocks at the completion of the testing.
  • Critical data communicated from one CPU to another should be verified prior to operational use.
  • Set a dedicated status flag that is updated between each step of a hazardous operation. This provides positive feedback of the step within the operation, and confirmation that the previous steps have been correctly executed.
  • Verify critical commands prior to transmission, and upon reception. It never hurts to check twice!
  • Make sure all flags used are unique and single purpose.
  • Put the majority of safety-critical decisions and algorithms in a single (or few) software development component(s).
  • Decision logic using data from hardware or other software components should not be based on values of all ones or all zeros. Use specific binary patterns to reduce the likelihood of malfunctioning hardware/software satisfying the decision logic.
  • Safety-critical components should have only one entry and one exit point.
  • Perform reasonableness checks on all safety-critical inputs.
  • Perform a status check of critical system elements prior to executing a potentially hazardous sequence.
  • Always initialize the software into a known safe state. This implies making sure all variables are set to an initial value, and not the previous value prior to reset.


