The term High Availability (HA) is commonly used in telecommunications and other industries to describe a system's ability to remain up and running without interruption for extended periods of time. The celebrated “five nines” availability metric refers to the percentage of uptime a system can sustain in a year — 99.999% uptime amounts to about five minutes downtime per year.
Obviously, an effective HA solution involves various hardware and software components that conspire to form a stable, working system. Assuming reliable hardware components with sufficient redundancy, how can an OS best remain stable and responsive when a particular component or application program fails? And in cases where redundant hardware may not be an option (e.g. consumer appliances), how can the OS itself support HA?
If you had to design an HA-capable OS from the ground up, would you start with a single executable environment? In this simple, high-performance design, all OS components, device drivers, applications, the works, would all run without memory protection in kernel mode.
On second thought, maybe such an OS wouldn't be suited for HA, simply because if a single software component were to fail, the entire system would crash. And if you wanted to add a software component or otherwise modify the HA system, you'd have to take the system out of service to do so. In other words, the conventional realtime executive architecture wasn't built with HA in mind.
Suppose, then, that you base your HA-enabled OS on a separation of kernel space and user space, so that all applications would run in user mode and enjoy memory protection. You'd even be able to upgrade an application without incurring any downtime.
So far so good, but what would happen if a device driver, filesystem manager, or other essential OS component were to crash? Or what if you needed to add a new driver to a live system? You'd have to rebuild and restart the kernel. Based on such a monolithic kernel architecture, your HA system wouldn't be as available as it should be.
A true microkernel that provides full memory protection is inherently the most stable OS architecture. Very little code is running in kernel mode that could cause the kernel itself to fail. And individual processes, whether applications or OS services, can be started and stopped dynamically, without jeopardizing system uptime.
QNX Neutrino inherently provides several key features that are well-suited for HA systems:
While any claims regarding “five nines” availability on the part of an OS must be viewed only in the context of the entire hardware/software HA system, one can always ask whether an OS truly has the appropriate underlying architecture capable of supporting HA.
Apart from its inherently robust architecture, Neutrino also provides several components to help developers simplify the task of building and maintaining effective HA systems:
While many operating systems provide HA support in a hardware-specific way (e.g. via PCI Hot Plug), QNX Neutrino isn't tied to PCI. Your particular HA system may be built on a custom chassis, in which case an OS that offers a PCI-based HA “solution” may not address your needs at all.
QNX Software Systems is an actively contributing member of the Service Availability Forum (www.saforum.org), an industry body dedicated to developing open, industry-standard specifications for building HA systems.
The HA client-side library provides a drop-in enhancement solution for many standard C Library I/O operations. The HA library's cover functions allow for automatic and transparent recovery mechanisms for failed connections that can be recovered from in an HA scenario. Note that the HA library is both thread-safe and cancellation-safe.
The main principle of the client library is to provide drop-in replacements for all the message-delivery functions (i.e. MsgSend*). A client can select which particular connections it would like to make highly available, thereby allowing all other connections to operate as ordinary connections (i.e. in a non-HA environment).
Normally, when a server that the client is talking to fails, or if there's a transient network fault, the MsgSend* functions return an error indicating that the connection ID (or file descriptor) is stale or invalid (e.g. EBADF). But in an HA-aware scenario, these transient faults are recovered from almost immediately, thus making the services available again.
The following example demonstrates a simple recovery scenario, where a client opens a file across a network file system. If the NFS server were to die, the HA Manager would restart it and remount the filesystem. Normally, any clients that previously had files open across the old connection would now have a stale connection handle. But if the client uses the ha_attach functions, it can recover from the lost connection.
The ha_attach functions allow the client to provide a custom recovery function that's automatically invoked by the cover-function library. This recovery function could simply reopen the connection (thereby getting a connection to the new server), or it could perform a more complex recovery (e.g. adjusting the file position offsets and reconstructing its state with respect to the connection). This mechanism thus lets you develop arbitrarily complex recovery scenarios, while the cover-function library takes care of the details (detecting a failure, invoking recovery functions, and retransmitting state information).
```c
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <ha/cover.h>

#define TESTFILE "/net/machine99/home/test/testfile"

typedef struct handle {
    int nr;
    int curr_offset;
} Handle;

int recover_conn(int oldfd, void *hdl)
{
    int newfd;
    Handle *thdl;

    thdl = (Handle *)hdl;
    newfd = ha_reopen(oldfd, TESTFILE, O_RDONLY);
    if (newfd >= 0) {
        // adjust file offset to previously known point
        lseek(newfd, thdl->curr_offset, SEEK_SET);
        // increment our count of successful recoveries
        (thdl->nr)++;
    }
    return(newfd);
}

int main(int argc, char *argv[])
{
    int status;
    int fd;
    int fd2;
    Handle hdl;
    char buf[80];

    hdl.nr = 0;
    hdl.curr_offset = 0;

    // open a connection
    // recovery will be using "recover_conn", and "hdl" will
    // be passed to it as a parameter
    fd = ha_open(TESTFILE, O_RDONLY, recover_conn, (void *)&hdl, 0);
    if (fd < 0) {
        printf("could not open file\n");
        exit(-1);
    }

    status = read(fd,buf,15);
    if (status < 0) {
        printf("error: %s\n",strerror(errno));
        exit(-1);
    }
    else {
        hdl.curr_offset += status;
    }

    fd2 = ha_dup(fd);

    // fs-nfs3 fails, and is restarted, the network mounts
    // are re-instated at this point.
    // Our previous "fd" to the file is stale
    sleep(18);

    // reading from the stale fd
    // will fail, and will recover via recover_conn
    status = read(fd,buf,15);
    if (status < 0) {
        printf("error: %s\n",strerror(errno));
        exit(-1);
    }
    else {
        hdl.curr_offset += status;
    }

    printf("total recoveries, %d\n",hdl.nr);
    ha_close(fd);
    ha_close(fd2);
    exit(0);
}
```
Since the cover-function library takes over the lowest MsgSend*() calls, most standard library functions (read(), write(), printf(), scanf(), etc.) are also automatically HA-aware. The library also provides an ha_dup() function, which is semantically equivalent to the standard dup() function in the context of HA-aware connections. You can replace recovery functions during the lifetime of a connection, which greatly simplifies the task of developing highly customized recovery mechanisms.
The High Availability Manager (HAM) provides a mechanism for monitoring processes and services on your system. The goal is to provide a resilient manager (or “smart watchdog”) that can perform multistage recovery whenever system services or processes fail, no longer respond, or are detected to be in a state where they cease to provide acceptable levels of service. The HA framework, including the HAM, uses a simple publish/subscribe mechanism to communicate interesting system events between interested components in the system. By automatically integrating itself into the native networking mechanism (Qnet), this framework transparently extends a local monitoring mechanism to a network-distributed one.
The HAM acts as a conduit through which the rest of the system can both obtain and deliver information regarding the state of the system as a whole. Again, the system could be simply a single node or a collection of nodes connected via Qnet. The HAM can monitor specific processes and can control the behavior of the system when specific components fail and need to be recovered. The HAM also allows external detectors to detect and report interesting events to the system, and can associate actions with the occurrence of those events.
In many HA systems, each single point of failure (SPOF) must be identified and dealt with carefully. Since the HAM maintains information about the health of the system and also provides the basic recovery framework, the HAM itself must never become a SPOF.
As a self-monitoring manager, the HAM is resilient to internal failures. If, for whatever reason, the HAM itself is stopped abnormally, it can immediately and completely reconstruct its own state. A mirror process called the Guardian perpetually stands ready and waiting to take over the HAM's role. Since all state information is maintained in shared memory, the Guardian can assume the exact same state that the original HAM was in before the failure.
But what happens if the Guardian terminates abnormally? The Guardian (now the new HAM) creates a new Guardian for itself before taking the place of the original HAM. Practically speaking, therefore, one can't exist without the other.
Since the HAM/Guardian pair monitor each other, the failure of either one can be completely recovered from. The only way to stop the HAM is to explicitly instruct it to terminate the Guardian and then to terminate itself.
HAM consists of three main components: entities, conditions, and actions.
Entities are the fundamental units of observation/monitoring in the system. Essentially, an entity is a process (pid). As processes, all entities are uniquely identifiable by their pids. Associated with each entity is a symbolic name that can be used to refer to that specific entity. Again, the names associated with entities are unique across the system. Managers are currently associated with a node, so uniqueness rules apply to a node. As we'll see later, this uniqueness requirement is very similar to the naming scheme used in a hierarchical filesystem.
There are three fundamental entity types:
Conditions are associated with entities; a condition represents the entity's state.
Condition | Description |
---|---|
CONDDEATH | The entity has died. |
CONDABNORMALDEATH | The entity has died an abnormal death. This condition is triggered whenever an entity dies by a mechanism that results in the generation of a core dump file. |
CONDDETACH | The entity that was being monitored is detaching. This ends the HAM's monitoring of that entity. |
CONDATTACH | An entity for whom a place holder was previously created (i.e. some process has subscribed to events relating to this entity) has joined the system. This is also the start of the HAM's monitoring of the entity. |
CONDBEATMISSEDHIGH | The entity missed sending a “heartbeat” message specified for a condition of “high” severity. |
CONDBEATMISSEDLOW | The entity missed sending a “heartbeat” message specified for a condition of “low” severity. |
CONDRESTART | The entity was restarted. This condition is true after the entity is successfully restarted. |
CONDRAISE | An externally detected condition is reported to the HAM. Subscribers can associate actions with these externally detected conditions. |
CONDSTATE | An entity reports a state transition to the HAM. Subscribers can associate actions with specific state transitions. |
CONDANY | This condition type matches any condition type. It can be used to associate the same actions with one of many conditions. |
For the conditions listed above (except CONDSTATE, CONDRAISE, and CONDANY), the HAM is the publisher — it automatically detects and/or triggers the conditions. For the CONDSTATE and CONDRAISE conditions, external detectors publish the conditions to the HAM.
For all conditions, subscribers can associate lists of actions that will be performed in sequence when the condition is triggered. Both the CONDSTATE and CONDRAISE conditions provide filtering capabilities, so subscribers can selectively associate actions with individual conditions based on the information published.
Any condition can be associated as a wild card with any entity, so a process can associate actions with any condition in a specific entity, or even in any entity. Note that conditions are also associated with symbolic names, which also need to be unique within an entity.
Actions are associated with conditions. Actions are executed when the appropriate conditions are true with respect to a specific entity. The HAM API includes several functions for different kinds of actions:
Action | Description |
---|---|
ham_action_restart() | This action restarts the entity. |
ham_action_execute() | Executes an arbitrary command (e.g. to start a process). |
ham_action_notify_pulse() | Notifies some process that this condition has occurred. This notification is sent using a specific pulse with a value specified by the process that wished to receive this notify message. |
ham_action_notify_signal() | Notifies some process that this condition has occurred. This notification is sent using a specific realtime signal with a value specified by the process that wished to receive this notify message. |
ham_action_notify_pulse_node() | This is the same as ham_action_notify_pulse() above, except that the node name specified for the recipient of the pulse can be the fully qualified node name. |
ham_action_notify_signal_node() | This is the same as ham_action_notify_signal() above, except that the node name specified for the recipient of the signal can be the fully qualified node name. |
ham_action_waitfor() | Lets you insert delays between consecutive actions in a sequence. You can also wait for certain names to appear in the namespace. |
ham_action_heartbeat_healthy() | Resets the heartbeat mechanism for an entity that had previously missed sending heartbeats and had triggered a missed heartbeat condition, but has now recovered. |
ham_action_log() | Reports this condition to a logging mechanism. |
Actions are also associated with symbolic names, which are unique within a specific condition.
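What happens if an action itself fails? You can specify an alternate list of actions to be performed to recover from that failure. These alternate actions are associated with the primary actions through the ham_action_fail* functions: ham_action_fail_execute(), ham_action_fail_log(), ham_action_fail_notify_pulse(), ham_action_fail_notify_pulse_node(), ham_action_fail_notify_signal(), ham_action_fail_notify_signal_node(), and ham_action_fail_waitfor().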
Entities or other components in the system can inform the HAM about conditions (events) that they deem interesting, and the HAM in turn can deliver these conditions (events) to other components in the system that have expressed interest in (subscribed to) them.
This publishing feature allows arbitrary components that are capable of detecting error conditions (or potentially erroneous conditions) to report these to the HAM, which in turn can notify other components to start corrective and/or preventive action.
There are currently two different ways of publishing information to the HAM, both designed to be general enough to permit clients to build more complex information exchange mechanisms: publishing state transitions and publishing other conditions.
An entity can report its state transitions to the HAM, which maintains every entity's current state (as reported by the entity). The HAM doesn't interpret the meaning of the state value itself, nor does it try to validate the state transitions, but it can generate events based on transitions from one state to another.
Components can publish transitions that they want the external world to know about. These states needn't necessarily represent a specific state the application uses internally for decision making.
To notify the HAM of a state transition, components can use the ham_entity_condition_state() function. Since the HAM is interested only in the next state in the transition, this is the only information that's transmitted to the HAM. The HAM then triggers a condition state-change event internally, which other components can subscribe to using the ham_condition_state() API call (see below).
In addition to the above, components on the system can also publish autonomously detected conditions by using the ham_entity_condition_raise() API call. The component raising the condition can also specify a type, class, and severity of its choice, to allow subscribers further granularity in filtering out specific conditions to subscribe to. As a result of this call, the HAM triggers a condition-raise event internally, which other components can subscribe to using the ham_condition_raise() API call (see below).
To express their interest in events published by other components, subscribers can use the ham_condition_state() and ham_condition_raise() API calls. These are similar to the ham_condition() API call (e.g. they return a handle to a condition), but they allow the subscriber to customize which of several possible published conditions they're interested in.
When an entity publishes a state transition, a state transition condition is raised for that entity, based on the two states involved in the transition (the from state and the to state). Subscribers indicate which states they're interested in by specifying values for the fromstate and tostate parameters in the API call. For more information, see the API reference documentation for the ham_condition_state() call in the High Availability Framework Developer's Guide.
To express interest in conditions raised by entities, subscribers can use the API call ham_condition_raise(), indicating as parameters to the call what sort of conditions they're interested in. For more information, refer to the API documentation for the ham_condition_raise() call in the High Availability Framework Developer's Guide.
Effectively, HAM's internal state is like a hierarchical filesystem, where entities are like directories, conditions associated with those entities are like subdirectories, and actions inside those conditions are like leaf nodes of this tree structure.
HAM also presents this state as a read-only filesystem under /proc/ham. As a result, arbitrary processes can also view the current state (e.g. you can do ls /proc/ham).
The /proc/ham filesystem presents a lot of information about the current state of the system's entities. It also provides useful statistics on heartbeats, restarts, and deaths, giving you a snapshot in time of the system's various entities, conditions, and actions.
HAM can perform a multistage recovery, executing several actions in a certain order. This technique is useful whenever strict dependencies exist between various actions in a sequence. In most cases, recovery requires more than a single restart mechanism in order to properly restore the system's state to what it was before a failure.
For example, suppose you've started fs-nfs3 (the NFS filesystem) and then mounted a few directories from multiple sources. You can instruct HAM to restart fs-nfs3 upon failure, and also to remount the appropriate directories as required after restarting the NFS process.
As another example, suppose io-pkt* (the network I/O manager) were to die. We can tell HAM to restart it and also to load the appropriate network drivers (and maybe a few more services that essentially depend on network services in order to function).
The basic mechanism to talk to HAM is to use its API. This API is implemented as a library that you can link against. The library is thread-safe as well as cancellation-safe.
To control exactly what/how you're monitoring, the HAM API provides a collection of functions, including:
Function | Description |
---|---|
ham_action_control() | Perform control operations on an action object. |
ham_action_execute() | Add an execute action to a condition. |
ham_action_fail_execute() | Add to an action an execute action that will be executed if the corresponding action fails. |
ham_action_fail_log() | Insert a log message into the activity log. |
ham_action_fail_notify_pulse() | Add to an action a notify pulse action that will be executed if the corresponding action fails. |
ham_action_fail_notify_pulse_node() | Add to an action a node-specific notify pulse action that will be executed if the corresponding action fails. |
ham_action_fail_notify_signal() | Add to an action a notify signal action that will be executed if the corresponding action fails. |
ham_action_fail_notify_signal_node() | Add to an action a node-specific notify signal action that will be executed if the corresponding action fails. |
ham_action_fail_waitfor() | Add to an action a waitfor action that will be executed if the corresponding action fails. |
ham_action_handle() | Get a handle to an action in a condition in an entity. |
ham_action_handle_node() | Get a handle to an action in a condition in an entity, using a nodename. |
ham_action_handle_free() | Free a previously obtained handle to an action in a condition in an entity. |
ham_action_heartbeat_healthy() | Reset a heartbeat's state to healthy. |
ham_action_log() | Insert a log message into the activity log. |
ham_action_notify_pulse() | Add a notify-pulse action to a condition. |
ham_action_notify_pulse_node() | Add a notify-pulse action to a condition, using a nodename. |
ham_action_notify_signal() | Add a notify-signal action to a condition. |
ham_action_notify_signal_node() | Add a notify-signal action to a condition, using a nodename. |
ham_action_remove() | Remove an action from a condition. |
ham_action_restart() | Add a restart action to a condition. |
ham_action_waitfor() | Add a waitfor action to a condition. |
ham_attach() | Attach an entity. |
ham_attach_node() | Attach an entity, using a nodename. |
ham_attach_self() | Attach an application as a self-attached entity. |
ham_condition() | Set up a condition to be triggered when a certain event occurs. |
ham_condition_control() | Perform control operations on a condition object. |
ham_condition_handle() | Get a handle to a condition in an entity. |
ham_condition_handle_node() | Get a handle to a condition in an entity, using a nodename. |
ham_condition_handle_free() | Free a previously obtained handle to a condition in an entity. |
ham_condition_raise() | Attach a condition associated with a condition raise condition that's triggered by an entity raising a condition. |
ham_condition_remove() | Remove a condition from an entity. |
ham_condition_state() | Attach a condition associated with a state transition condition that's triggered by an entity reporting a state change. |
ham_connect() | Connect to a HAM. |
ham_connect_nd() | Connect to a remote HAM. |
ham_connect_node() | Connect to a remote HAM, using a nodename. |
ham_detach() | Detach an entity from a HAM. |
ham_detach_name() | Detach an entity from a HAM, using an entity name. |
ham_detach_name_node() | Detach an entity from a HAM, using an entity name and a nodename. |
ham_detach_self() | Detach a self-attached entity from a HAM. |
ham_disconnect() | Disconnect from a HAM. |
ham_disconnect_nd() | Disconnect from a remote HAM. |
ham_disconnect_node() | Disconnect from a remote HAM, using a nodename. |
ham_entity() | Create entity placeholder objects in a HAM. |
ham_entity_condition_raise() | Raise a condition. |
ham_entity_condition_state() | Notify the HAM of a state transition. |
ham_entity_control() | Perform control operations on an entity object in a HAM. |
ham_entity_handle() | Get a handle to an entity. |
ham_entity_handle_node() | Get a handle to an entity, using a nodename. |
ham_entity_handle_free() | Free a previously obtained handle to an entity. |
ham_entity_node() | Create entity placeholder objects in a HAM, using a nodename. |
ham_heartbeat() | Send a heartbeat to a HAM. |
ham_stop() | Stop a HAM. |
ham_stop_nd() | Stop a remote HAM. |
ham_stop_node() | Stop a remote HAM, using a nodename. |
ham_verbose() | Modify the verbosity of a HAM. |