Traditional approaches to dealing with software malfunctions have included such mechanisms as hardware watchdog timers, software exception handlers, and sanity checks (for example, assert() calls and checksums) built into the application code.
All of these approaches are relatively successful at detecting a software fault. But the net result of this detection, especially when faced with a multitude of faults in several potentially separate software components, is the rather drastic action of a system reset.
One of the principal reasons for this lack of graceful recovery is the monolithic architecture of a traditional realtime embedded system. At the heart of most of these systems lies a realtime executive — a single memory image consisting of the RTOS itself and often numerous tasks.
Since all tasks — including critical system-level services — share the very same address space, when the integrity of one task is called into question, the integrity of the entire system is at risk. If a single component such as a device driver fails, the RTOS itself could fail. In HA terms, each software component becomes a single point of failure (SPOF).
The only sure recovery mechanism in such an environment is to reset the system and start from scratch.
Such realtime systems offer only a very coarse granularity of fault recovery, which makes the HA procedure of planning for and dealing with failure seemingly straightforward (a system reset), yet often very costly (in terms of downtime, system restoration, etc.). For some embedded applications, a reset may involve a specialized, time-consuming procedure in order to restore the system to full operation in the field.
What is really needed here is a more modular approach. System architects often de-couple and modularize their systems from a design/implementation point of view. Ideally, these modules would be the focus not only of the design, but also of the fault-recovery process, so that if one module malfunctions, then only that module would require a reset — the integrity of the rest of the system would remain intact. In other words, that particular module wouldn't be a SPOF.
This modular approach would also help us address the fact that the mean time to repair (MTTR) for a system reboot is orders of magnitude larger than the MTTR for replacing a single running task.
This kind of fine-grained recovery of individual tasks is precisely what the QNX Neutrino microkernel offers. The architecture of the QNX Neutrino realtime operating system itself provides so many intrinsic HA features that many QNX users take them for granted and often design recoverability into their systems without giving it a second thought.
Let's look briefly at the key features of the QNX Neutrino RTOS and see how system designers can easily make use of these built-in HA-ready features to build effective HA systems.
Three key factors of the QNX Neutrino architecture contribute directly to intrinsic HA: its true microkernel design, its fully memory-protected POSIX process model, and its message-based interprocess communication.
Also, the kernel's fixed-priority preemptive scheduler ensures a predictable system — there are fewer HA software paths to analyze and deal with separately.
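For example, here is a minimal POSIX sketch (the priority value 20 and the empty worker() body are placeholders) of how a designer might give a thread a fixed priority under the SCHED_FIFO policy:

    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    /* Placeholder worker thread body for an application task. */
    static void *worker(void *arg)
    {
        /* ... perform the realtime work here ... */
        return NULL;
    }

    int main(void)
    {
        pthread_attr_t attr;
        struct sched_param param = { .sched_priority = 20 };  /* arbitrary fixed priority */
        pthread_t tid;
        int rc;

        pthread_attr_init(&attr);
        /* Use the attributes set below rather than inheriting the creator's scheduling. */
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        pthread_attr_setschedpolicy(&attr, SCHED_FIFO);   /* fixed-priority, preemptive */
        pthread_attr_setschedparam(&attr, &param);

        rc = pthread_create(&tid, &attr, worker, NULL);
        if (rc != 0) {
            fprintf(stderr, "pthread_create: %s\n", strerror(rc));
            return 1;
        }
        pthread_join(tid, NULL);
        return 0;
    }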
The process model also offers dynamic process creation and destruction, which is especially important for HA systems, because you can more readily perform fault detection, recovery, and live upgrades in the field.
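As a rough illustration (start_component() is a hypothetical helper, not part of any QNX API), a monitor could use the POSIX posix_spawn() call to start, or later restart, a component as a separate, memory-protected process:

    #include <spawn.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>

    extern char **environ;

    /* Start (or restart) a component as a separate process; the caller
     * supplies the path to the component's executable. */
    pid_t start_component(const char *path)
    {
        pid_t pid;
        char *const argv[] = { (char *)path, NULL };
        int rc;

        rc = posix_spawn(&pid, path, NULL, NULL, argv, environ);
        if (rc != 0) {
            fprintf(stderr, "posix_spawn(%s): %s\n", path, strerror(rc));
            return -1;
        }
        return pid;
    }

A supervisor loop that makes use of this helper is sketched a little further below.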
The POSIX API provides a standard programming environment and can help achieve system simplification, validation, and verification.
In addition, the process model lets you easily monitor external tasks, which not only aids in fault detection and diagnosis, but also in service distribution.
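Building on the hypothetical start_component() helper above, a simple supervisor might block in waitpid() to detect when a component terminates and then restart only that component, leaving the rest of the system untouched:

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    /* Provided elsewhere -- for example, the posix_spawn() sketch above. */
    extern pid_t start_component(const char *path);

    /* Supervise one component: detect its termination and restart it. */
    void supervise(const char *path)
    {
        pid_t pid = start_component(path);

        while (pid > 0) {
            int status;

            /* Block until the supervised process terminates. */
            if (waitpid(pid, &status, 0) == pid &&
                (WIFEXITED(status) || WIFSIGNALED(status))) {
                fprintf(stderr, "%s terminated; restarting\n", path);
                pid = start_component(path);   /* only this component is restarted */
            }
        }
    }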
Local and network-remote messaging is identical and practically transparent for the application. In a network-distributed HA system, the QNX message-based approach fosters replication, redundancy, and system simplification.
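The client-side sketch below assumes a hypothetical service name ("ha_service") and message layout; it uses the QNX Neutrino name_open() and MsgSend() calls, and the same code applies whether the server registered its name locally or globally on another node, since the connection ID hides the server's location:

    #include <stdio.h>
    #include <string.h>
    #include <sys/dispatch.h>    /* name_open(), name_close() */
    #include <sys/neutrino.h>    /* MsgSend() */

    /* The service name and message layout are placeholders for this sketch. */
    #define SERVICE_NAME "ha_service"

    typedef struct {
        int  code;
        char text[64];
    } svc_msg_t;

    int main(void)
    {
        svc_msg_t request = { .code = 1 };
        svc_msg_t reply;
        int coid;

        strcpy(request.text, "status?");

        /* Resolve the service by name; the server may be local or remote --
         * the client code is identical either way. */
        coid = name_open(SERVICE_NAME, 0);
        if (coid == -1) {
            perror("name_open");
            return 1;
        }

        /* Synchronous send: blocks until the server replies. */
        if (MsgSend(coid, &request, sizeof(request), &reply, sizeof(reply)) == -1) {
            perror("MsgSend");
            name_close(coid);
            return 1;
        }

        printf("reply: %d %s\n", reply.code, reply.text);
        name_close(coid);
        return 0;
    }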
These represent some of the more prominent HA-oriented features that become readily apparent when the QNX Neutrino RTOS forms the basis of an HA design.